FIX Fixes performance regression in trees #23404
Conversation
I also tried the low cardinality benchmark (results omitted).

So this is almost a 2x performance regression on low cardinality data. I am not sure the small perf gain we get in the high cardinality case is worth it.

EDIT: the 2x slowdown factor stays when I switch `n_samples` from 50k to 200k in the low cardinality benchmark.
I have tried to re-implement the old way to do the partitioning in quicksort as part of thomasjpfan#109. However this degrades the performance even more. The only thing that I have not tried is to reimplement the manual tail-call elimination optimisation of the second recursive call.
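For reference, the manual tail-call elimination mentioned above is the classic trick of recursing only into the smaller partition and looping on the larger one, which bounds the recursion depth at O(log n). A minimal C sketch of the idea (values-only, ignoring the paired indices array the real code permutes in lockstep, and using a hypothetical Lomuto partition):

```c
#include <stddef.h>

/* Hypothetical sketch of manual tail-call elimination in quicksort:
 * recurse into the smaller partition, loop on the larger one, so the
 * stack depth stays O(log n) even on adversarial inputs. */
static void quicksort_tce(double *values, size_t size) {
    while (size > 1) {
        /* Lomuto partition around the last element (illustrative only). */
        double pivot = values[size - 1];
        size_t i = 0;
        for (size_t j = 0; j + 1 < size; j++) {
            if (values[j] < pivot) {
                double t = values[i]; values[i] = values[j]; values[j] = t;
                i++;
            }
        }
        double t = values[i]; values[i] = values[size - 1]; values[size - 1] = t;

        /* Recurse into the smaller side; continue the loop on the larger. */
        if (i < size - i - 1) {
            quicksort_tce(values, i);     /* left side is smaller */
            values += i + 1;
            size -= i + 1;
        } else {
            quicksort_tce(values + i + 1, size - i - 1);
            size = i;
        }
    }
}
```

The scikit-learn `_simultaneous_sort` also reorders an `indices` array alongside `values`; that is omitted here for brevity.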
```rst
:mod:`sklearn.tree`
...................

- |Fix| Fixes performance regression :class:`tree.DecisionTreeClassifier`,
```
Suggested change:

```diff
-- |Fix| Fixes performance regression :class:`tree.DecisionTreeClassifier`,
+- |Fix| Fixes performance regression with low cardinality features for
+  :class:`tree.DecisionTreeClassifier`,
```
I explored the hypothesis of using tail call elimination in thomasjpfan#110 and this is not the cause of the slowdown either. I also tried to use
I think we should try some profiling to try to identify the discrepancy.
We could also try to reimplement this on top of a fork, prior to the switch to typed memory views. Not sure if it's related or not.
In the meantime, here are some nitpicks.
```cython
if use_introsort == 1:
    _simultaneous_sort(values, indices, size, 2 * <int>log2(size), 1)
else:
    _simultaneous_sort(values, indices, size, -1, 0)
```
The return type is `int` but we never return anything (0 is implicit in this case, I guess). Let's switch to `void`?
```cython
                           size - pivot_idx - 1)
    _simultaneous_sort(values + pivot_idx + 1,
                       indices + pivot_idx + 1,
                       size - pivot_idx - 1, max_depth - 1, use_introsort)
    return 0
```
Same comment for the private helper function: the return type should be void.
```cython
cdef inline void heapsort(
    floating* values,
    ITYPE_t* samples,
```
This should be renamed to `indices` to use a consistent notation with `_simultaneous_sort`.
I used the current sorting with pointers (instead of the typed memoryviews) and I get the following bench:

- with 1.0.2: (results omitted)
- with the current PR: (results omitted)
- using pointers instead of typed memoryviews: (results omitted)

So it is not coming from the memoryviews.
I have no idea what can explain the remaining perf difference. I tried to do some profiling with `py-spy record --format=speedscope -o out.speedscope --native -- python bench_script.py` and indeed the difference seems to happen below the sorting calls in `BestSplitter`, but since all the code is inlined I cannot see below that level. Maybe Linux `perf` could help. Alternatively, one could instrument the code with counters to record the number of times the outer quicksort and inner heapsort functions are called in both branches.

However I don't have the time to do it now, so I am fine with merging this PR because it's already fixing most of the regression.
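The counter instrumentation suggested above could be as simple as a pair of global counters bumped on entry to each sort routine and printed once after the benchmark, so the call counts can be compared across branches. A hypothetical sketch (the `*_entry` names and the sort bodies are placeholders, not scikit-learn code):

```c
#include <stdio.h>

/* Hypothetical instrumentation: count how often each sort routine is
 * entered during a benchmark run, then dump the totals once at the end. */
static unsigned long n_quicksort_calls = 0;
static unsigned long n_heapsort_calls = 0;

static void quicksort_entry(void) {
    n_quicksort_calls++;
    /* ... actual quicksort body would go here ... */
}

static void heapsort_entry(void) {
    n_heapsort_calls++;
    /* ... actual heapsort body would go here ... */
}

static void report_counters(void) {
    printf("quicksort calls: %lu\n", n_quicksort_calls);
    printf("heapsort calls:  %lu\n", n_heapsort_calls);
}
```

Running the same benchmark on both branches and diffing the reported counts would show whether one branch is doing structurally more sorting work, without needing `perf` or frame-pointer-level profiling.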
```cython
cdef int simultaneous_sort(
```

```cython
cdef inline void sift_down(
    floating* values,
    ITYPE_t* samples,
```
`samples` => `indices`: to be renamed here as well.
FWIW reverting #22868 goes back to 1.0.2 performance for me, so it does indicate that #22868 is the only thing causing the performance regression (i.e. that there are no other changes at play):

```shell
# reverts https://github.com/scikit-learn/scikit-learn/pull/22868
git revert 4cf932d98
make in
# run your benchmarks here
```
@lesteve can you please do a side-PR that does this on top of the current
I opened #23410. And indeed there were some conflicts to fix; I was probably navigating the history too much, so I reverted from somewhere else than main and the revert did not have any conflicts when I posted my previous message ... I get the same performance as 1.0.2 in #23410.
Closing in favor of #23410, which still requires a custom backport.
Reference Issues/PRs
Fixes #23397
What does this implement/fix? Explain your changes.
This PR adds the heapsort part of introsort back into `simultaneous_sort` as a flag.

Using the benchmark for low cardinality, I get `3.24 s` on `main`, `0.11 s` with this PR, and `0.07 s` on `1.0.X`.

This PR makes the performance about the same compared to `main`, but still much faster compared to `1.0.1`.
Original with high cardinality benchmark:

- This PR: (results omitted)
- `main`: (results omitted)
- `1.0.1`: (results omitted)