Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH reuse parent histogram in HGBT #26189

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

lorentzenchr
Copy link
Member

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

This PR reuses the parent's histogram for the feature that was split on. This saves a little time.

Any other comments?

The implementation is no beauty. If we want to include this improvement, suggestions or alternative PRs are welcome.

@lorentzenchr
Copy link
Member Author

lorentzenchr commented Apr 15, 2023

Simple benchmark

On the higgs dataset, this PR gives ~4% speed-up, which is about the expected amount (1 / 28 features = 3.5%).

Running

python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100

on main gives

Fit 100 trees in 68.848 s, (3100 total leaves)
Time spent computing histograms: 39.301s
Time spent finding best splits:  0.259s
Time spent applying splits:      8.559s
Time spent predicting:           1.956s
fitted in 69.034s
predicted in 8.757s, ROC AUC: 0.8228, ACC: 0.7415

On this PR:

Fit 100 trees in 65.878 s, (3100 total leaves)
Time spent computing histograms: 38.698s
Time spent finding best splits:  0.236s
Time spent applying splits:      8.407s
Time spent predicting:           1.964s
fitted in 66.047s
predicted in 8.215s, ROC AUC: 0.8228, ACC: 0.7415

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! I left a first pass review.

sklearn/ensemble/_hist_gradient_boosting/histogram.pyx Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/histogram.pyx Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/histogram.pyx Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/histogram.pyx Outdated Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/grower.py Outdated Show resolved Hide resolved
@github-actions
Copy link

github-actions bot commented Sep 6, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5f13884. Link to the linter CI: here

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I do not think this change is too complex, it still adds more complexity to the codebase. Overall, do you think the complexity is worth the 1-1/n_features performance bump?

Comment on lines 584 to +591
if n_samples_left < n_samples_right:
smallest_child = left_child_node
largest_child = right_child_node
is_left_child = True
else:
smallest_child = right_child_node
largest_child = left_child_node
is_left_child = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

is_left_child = n_samples_left < n_samples_right
if is_left_child:
    smallest_child = left_child_node
    largest_child = right_child_node
else:
    smallest_child = right_child_node
    largest_child = left_child_node

unsigned int split_bin_end
unsigned char is_categorical
BITSET_INNER_DTYPE_C [:] left_cat_bitset
bint has_parent_hist = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bint has_parent_hist = False
bint has_parent_hist = parent_split_info is not None

Comment on lines +170 to +171
if parent_split_info is not None:
has_parent_hist = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if parent_split_info is not None:
has_parent_hist = True
if has_parent_hist:

@@ -141,8 +156,30 @@ cdef class HistogramBuilder:
dtype=HISTOGRAM_DTYPE
)
bint has_interaction_cst = allowed_features is not None
# Feature index of the feature that the parent node was split on.
int split_feature_idx
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int split_feature_idx
int parent_split_feature_idx

Comment on lines +162 to +164
unsigned int split_bin_start
# End (+1) of the bin indices belonging to the feature that was split on.
unsigned int split_bin_end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are also for the parent.

Suggested change
unsigned int split_bin_start
# End (+1) of the bin indices belonging to the feature that was split on.
unsigned int split_bin_end
unsigned int parent_split_bin_start
# End (+1) of the bin indices belonging to the feature that was split on.
unsigned int parent_split_bin_end

@@ -422,6 +422,13 @@ Changelog
In effect, less memory has to be allocated and deallocated.
:pr:`27865` by :user:`Christian Lorentzen <lorentzenchr>`.

- |Efficiency| :class:`ensemble.HistGradientBoostingClassifier` and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be moved to 1.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants