[MRG] ENH: Improves speed of one hot encoding #15762
Conversation
Looks good to me!
Nice :)
```python
indptr[0] = 0
np.sum(X_mask, axis=1, out=indptr[1:])
np.cumsum(indptr[1:], out=indptr[1:])
data = np.ones(indptr[-1])
```
The last line is the only significant change, right? Or do the other changes also matter?
They all matter. If I revert the last line, I get:
| n_samples | n_features | wall_time | cpu_time |
|-----------|------------|-----------|----------|
| 100000 | 50 | 0.575167 | 1.999209 |
| 100000 | 100 | 1.244284 | 2.790776 |
| 200000 | 50 | 1.424922 | 2.935314 |
| 200000 | 100 | 3.083203 | 4.711901 |
| 500000 | 50 | 3.762984 | 5.473710 |
| 500000 | 100 | 7.914155 | 10.137310 |
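For context, the snippet under review fills the CSR `indptr` array in place and then allocates exactly `indptr[-1]` ones for `data`. Below is a minimal, self-contained sketch of that construction outside scikit-learn; the names `codes` and `X_mask`, and the single-feature setup, are illustrative assumptions rather than the encoder's actual internals.

```python
import numpy as np
import scipy.sparse as sp

n_samples, n_categories = 4, 3
codes = np.array([0, 2, 1, 2])                 # integer category code per sample (illustrative)
X_mask = np.ones((n_samples, 1), dtype=bool)   # True where an entry produces a non-zero

# One CSR row pointer per sample plus one, filled without temporaries.
indptr = np.empty(n_samples + 1, dtype=np.int64)
indptr[0] = 0
np.sum(X_mask, axis=1, out=indptr[1:])   # non-zeros per row, written in place
np.cumsum(indptr[1:], out=indptr[1:])    # running total -> CSR row pointers

# Every stored value in a one-hot matrix is 1.0, so allocate exactly
# indptr[-1] ones instead of building and masking a larger temporary array.
data = np.ones(indptr[-1])
indices = codes[X_mask.ravel()]          # column index of each non-zero

out = sp.csr_matrix((data, indices, indptr), shape=(n_samples, n_categories))
print(out.toarray())
```

The design point is that `out=` keeps the per-row counts and the cumulative sum in the preallocated `indptr` buffer, and sizing `data` from `indptr[-1]` allocates only as many ones as there are stored entries.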
Thanks @thomasjpfan
This PR improves the time it takes to run one hot encoding by about 24%.
Benchmark script, timings for this PR, and timings for master are included in collapsed sections.
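The collapsed benchmark script itself is not reproduced here; the following is only a rough sketch of how such a timing comparison might be run on this branch versus master. The category cardinality (20 per feature) and `handle_unknown="ignore"` are assumptions for illustration, not details taken from the PR.

```python
import time
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)

for n_samples in (100_000, 200_000, 500_000):
    for n_features in (50, 100):
        # 20 categories per feature is an arbitrary choice for this sketch
        X = rng.randint(0, 20, size=(n_samples, n_features))
        enc = OneHotEncoder(handle_unknown="ignore")
        tic = time.perf_counter()
        enc.fit_transform(X)
        toc = time.perf_counter()
        print(f"{n_samples:>7} samples x {n_features:>3} features: {toc - tic:.3f}s")
```

Running the same loop on master and on this branch gives comparable wall-clock numbers for the sample/feature grid reported in the table above.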
CC @jorisvandenbossche