Skip to content

Conversation

@lorentzenchr
Copy link
Member

Reference Issues/PRs

Suggested in #28064 (comment).

What does this implement/fix? Explain your changes.

HGBT's _BinMapper searches for bin thresholds. This calls np.unique and then percentile which both internally use np.sort.
This PR first applies np.sort such that the subsequent internal calls to np.sort are applied on an already sorted array.

Any other comments?

The binning part of HGBT is very small compared to the actual fitting of trees (5s out of 60s for Higgs with 100 trees).

@github-actions
Copy link

github-actions bot commented Jan 11, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5aebdcd. Link to the linter CI: here

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Especially since:

In [1]: import numpy as np

In [2]: a = np.random.rand(10_000_000)

In [3]: %timeit np.unique(a)
847 ms ± 7.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: b = np.sort(a)

In [5]: %timeit np.unique(b)
177 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@adrinjalali adrinjalali added the Quick Review For PRs that are quick to review label Jan 23, 2024
@adrinjalali adrinjalali added this to the 1.5 milestone Jan 23, 2024
@thomasjpfan thomasjpfan merged commit 6baa7a0 into scikit-learn:main Feb 2, 2024
@lorentzenchr lorentzenchr deleted the hgbt_binning_sort_first branch February 2, 2024 16:49
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:ensemble Performance Quick Review For PRs that are quick to review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants