
[WIP] cd_fast speedup #15931

Draft
wants to merge 2 commits into main
Conversation

Tjalling-a-a

Rewriting the implementation to remove redundant calculations.
Old implementation:

  1. R = R + w_ii * X[:,ii]
  2. tmp = X[:,ii] dot R
  3. f(tmp,...)
  4. R = R - w[ii] * X[:,ii]

New implementation:
Substituting R from step 1 into step 2 gives
tmp = X[:,ii] dot (R + w_ii * X[:,ii])
    = X[:,ii] dot R + w_ii * (X[:,ii] dot X[:,ii])
so step 2 becomes
  2. tmp = X[:,ii] dot R + w_ii * norm_cols_X[ii]
Then, to update R, step 4 becomes
  4. R = R + (w_ii - w[ii]) * X[:,ii]

This removes step 1 and rewrites steps 2 and 4, improving the speed of the inner loop.
The same rewrite is probably also extendable to the other three functions (sparse, gram, and multi-task).

Sadly, my Python skills are not good enough to build and test this; I did, however, test the same rewrite in C++, so it should work.
Help with building, testing, and extending it is very much appreciated.
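
For concreteness, here is a minimal NumPy sketch of one coordinate update in both orders. This is not the actual Cython code in cd_fast.pyx; the soft-thresholding in step 3 and the penalty parameters alpha and beta are written out only to make the two variants runnable and comparable.

import numpy as np

def soft_threshold(x, alpha):
    return np.sign(x) * max(abs(x) - alpha, 0.0)

def update_old(R, X, w, ii, alpha, beta, norm_cols_X):
    w_ii = w[ii]                                   # old value of w[ii]
    if w_ii != 0.0:
        R += w_ii * X[:, ii]                       # step 1: add back old contribution
    tmp = X[:, ii] @ R                             # step 2
    w[ii] = soft_threshold(tmp, alpha) / (norm_cols_X[ii] + beta)  # step 3
    if w[ii] != 0.0:
        R -= w[ii] * X[:, ii]                      # step 4: subtract new contribution

def update_new(R, X, w, ii, alpha, beta, norm_cols_X):
    w_ii = w[ii]                                   # old value of w[ii]
    tmp = X[:, ii] @ R + w_ii * norm_cols_X[ii]    # step 2 rewritten; step 1 removed
    w[ii] = soft_threshold(tmp, alpha) / (norm_cols_X[ii] + beta)  # step 3
    if w[ii] != w_ii:
        R += (w_ii - w[ii]) * X[:, ii]             # step 4 rewritten: single update of R

Both variants should produce identical results up to floating-point rounding; the new order just avoids the extra pass over R in step 1.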

Tasklist:

  • Build and validate results.
  • Extend to sparse, gram, and multi_task

Fixes a bug in the residual update.
@Tjalling-a-a Tjalling-a-a changed the title Cd fast speedup [WIP] Cd fast speedup Dec 19, 2019
@Tjalling-a-a Tjalling-a-a changed the title [WIP] Cd fast speedup [WIP] cd fast speedup Dec 19, 2019
@Tjalling-a-a Tjalling-a-a changed the title [WIP] cd fast speedup [WIP] cd_fast speedup Dec 19, 2019
@jnothman
Member

The tests at least seem to pass with your changes.

Could you please benchmark the speed difference? There might be something in the benchmarks directory to help.

Base automatically changed from master to main January 22, 2021 10:51
@thomasjpfan thomasjpfan added the Needs Benchmarks and cython labels Apr 8, 2021
@Micky774
Contributor

Benchmarked using this. Results by shape of input data. Mean and standard deviation calculated over 10 trials.

X.shape=(500000, 200)
On main: 13.22+-0.05
On branch: 13.22+-0.21

X.shape=(5000, 20000)
On main: 18.81+-0.34
On branch: 18.32+-0.15

X.shape=(500, 200000)
On main: 20.61+-0.64
On branch: 19.15+-0.26

Overall, I don't notice any significant speedup with this change. Please let me know if the benchmark file seems off, but if these numbers are accurate (which I believe they are), then it's probably not worth incorporating these changes.
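
For reference, the linked benchmark itself is not reproduced here; a comparable timing script might look roughly like the sketch below. The data shape, alpha, max_iter, and trial count are placeholders, not the actual settings from the gist.

import time
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative benchmark only: time repeated Lasso fits on random dense data.
rng = np.random.RandomState(0)
X = rng.standard_normal((5000, 2000))
y = rng.standard_normal(5000)

times = []
for _ in range(10):
    start = time.perf_counter()
    Lasso(alpha=0.1, max_iter=100).fit(X, y)
    times.append(time.perf_counter() - start)

print(f"{np.mean(times):.2f} +- {np.std(times):.2f} s")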

@Tjalling-a-a
Author

Tjalling-a-a commented Mar 19, 2022

Hi Micky774, thanks for reminding me that this pull request still exists and for running a benchmark.

I do believe that in the 2nd case a speedup of roughly 2.5% has been achieved, and in the 3rd case a speedup of roughly 7%. This could be quite significant for some users.

I would also be interested to see tests with longer run times, since the shorter the run time, the more time is consumed by other functions.

Perhaps these provide different results?
https://github.com/scikit-learn/scikit-learn/blob/main/benchmarks/bench_plot_lasso_path.py
https://github.com/scikit-learn/scikit-learn/blob/main/benchmarks/bench_lasso.py

It has been a while, but I do recall that I tested it on a mostly sparse dataset; perhaps that has an influence as well? Your test uses a dense dataset, if I'm not mistaken?

@Micky774
Contributor

Okay, so I re-ran the tests on a more consistent machine, this time with 30 trials, and performed Welch's t-test on the timings for the code on main and for this feature branch using scipy.stats:

from scipy.stats import ttest_ind
ttest_ind(main_times, branch_times, equal_var=False)

In this case main_times and branch_times were generated the same way as in my previous gist, but now repeated over 30 trials, and I increased the iteration limit of the Lasso estimator to 500 to get a longer computation per trial.
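
(For illustration only: the snippet below simulates two sets of 30 timings with arbitrary means and spreads, just to show how the comparison is made; the real main_times and branch_times came from actual benchmark runs on the two builds.)

import numpy as np
from scipy.stats import ttest_ind

# Placeholder data: normally distributed "timings" standing in for the
# measured main_times and branch_times (30 trials each).
rng = np.random.default_rng(0)
main_times = rng.normal(loc=50.0, scale=0.5, size=30)
branch_times = rng.normal(loc=49.5, scale=0.5, size=30)

# Welch's t-test: does not assume equal variances between the two samples.
result = ttest_ind(main_times, branch_times, equal_var=False)
print(result.statistic, result.pvalue)  # small p-value => significant difference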

For X.shape=(500, 200000)
There is indeed a statistically significant (p < 1e-5) difference between the two, with branch_times being faster; however, it amounts to only about a 1.7% average speedup on dense data. Specifically, the times I observed were:

  • main: 50.66+-0.38
  • branch: 49.80+-0.82

For X.shape=(5000, 20000)
There was not a statistically significant change.

Overall, I would have expected a more significant speedup from the proposed optimization, but this is what I've observed so far.

I tried using the two benchmark files you suggested, but truthfully the difference got lost in the noise since they don't repeat the trials sufficiently, and indeed the margin was small.

I'd still be interested in seeing if there's a change in performance on sparse data, and I may try to evaluate that later, but for now the results are less dramatic than expected.

@Tjalling-a-a
Author

Tjalling-a-a commented Mar 21, 2022

Hm, I did this optimization as part of a GPU implementation.
This rewrite reduced 4 GPU kernels to 3, which makes a huge difference for GPU speed.

It might make sense that the optimization has much less influence on a (single-threaded?) CPU implementation.
I did think it would give some speedup, though, since it basically trades a store for an addition while also removing one "blocking" operation (step 2 cannot start before step 1). But perhaps the compiler is better than I thought...

PS: I don't know how Cython compilation works; perhaps some optimization features are turned off for you? That would of course change both timings.

@Micky774
Contributor

@Tjalling-a-a Do you think you could run the sample benchmark I provided on your own machine as well, for another data point? If the numbers differ drastically, it would justify a deeper investigation.

@Tjalling-a-a
Author

I can, yes, if you could either explain how, or point to a manual on how to rebuild the Cython file :)

@Micky774
Contributor

I can, yes, if you could either explain how, or point to a manual on how to rebuild the Cython file :)

If you already have scikit-learn built locally, then running python setup.py develop should suffice. That will automatically detect the altered source (.pyx) file and re-compile that Cython component. If you want/need to rebuild scikit-learn, then see the advanced installation instructions for more details. Let me know if you have any other questions or run into any issues with it -- thanks!

@Tjalling-a-a
Author

Thanks, I will try and take a look at it this weekend!
