
[WIP] cd_fast speedup #15931

Draft
wants to merge 2 commits into main
Conversation

Tjalling-a-a

Rewriting the implementation to remove redundant calculations.
Old implementation:

  1. R = R + w_ii * X[:,ii]
  2. tmp = X[:,ii] dot R
  3. f(tmp,...)
  4. R = R - w[ii] * X[:,ii]

New implementation:
Substituting R from step 1 into step 2 gives
tmp = X[:,ii] dot (R + w_ii * X[:,ii])
    = X[:,ii] dot R + w_ii * (X[:,ii] dot X[:,ii])
so step 2 becomes
  2. tmp = X[:,ii] dot R + w_ii * norm_cols_X[ii]
Then, to update R, step 4 becomes
  4. R = R + (w_ii - w[ii]) * X[:,ii]

This removes step 1 and rewrites steps 2 and 4, improving the speed of the inner loop.
The same rewrite is probably also extendable to the other three functions (sparse, gram, and multi-task).

Sadly, my Python skills are not good enough to build and test this; I did, however, test the same rewrite in C++, so it should work.
Help with building, testing, and extending it is very much appreciated.
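
For concreteness, here is a minimal NumPy sketch of one coordinate update in both orders. This is not the actual Cython code in cd_fast.pyx; the soft-thresholding in step 3 and the penalty parameters alpha and beta are written out only to make the two variants runnable and comparable.

import numpy as np

def soft_threshold(x, alpha):
    return np.sign(x) * max(abs(x) - alpha, 0.0)

def update_old(R, X, w, ii, alpha, beta, norm_cols_X):
    w_ii = w[ii]                                   # old value of w[ii]
    if w_ii != 0.0:
        R += w_ii * X[:, ii]                       # step 1: add back old contribution
    tmp = X[:, ii] @ R                             # step 2
    w[ii] = soft_threshold(tmp, alpha) / (norm_cols_X[ii] + beta)  # step 3
    if w[ii] != 0.0:
        R -= w[ii] * X[:, ii]                      # step 4: subtract new contribution

def update_new(R, X, w, ii, alpha, beta, norm_cols_X):
    w_ii = w[ii]                                   # old value of w[ii]
    tmp = X[:, ii] @ R + w_ii * norm_cols_X[ii]    # step 2 rewritten; step 1 removed
    w[ii] = soft_threshold(tmp, alpha) / (norm_cols_X[ii] + beta)  # step 3
    if w[ii] != w_ii:
        R += (w_ii - w[ii]) * X[:, ii]             # step 4 rewritten: single update of R

Both variants should produce identical results up to floating-point rounding; the new order just avoids the extra pass over R in step 1.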

Tasklist:

  • Build and validate results.
  • Extend to sparse, gram, and multi_task

Fixes a bug in the residual update.
@Tjalling-a-a Tjalling-a-a changed the title Cd fast speedup [WIP] Cd fast speedup Dec 19, 2019
@Tjalling-a-a Tjalling-a-a changed the title [WIP] Cd fast speedup [WIP] cd fast speedup Dec 19, 2019
@Tjalling-a-a Tjalling-a-a changed the title [WIP] cd fast speedup [WIP] cd_fast speedup Dec 19, 2019
@jnothman
Member

The tests at least seem to pass with your changes.

Could you please benchmark the speed difference? There might be something in the benchmarks directory to help.

Base automatically changed from master to main January 22, 2021 10:51
@thomasjpfan thomasjpfan added the Needs Benchmarks and cython labels Apr 8, 2021
@Micky774
Contributor

Benchmarked using this. Results by shape of input data. Mean and standard deviation calculated over 10 trials.

X.shape=(500000, 200)
On main: 13.22+-0.05
On branch: 13.22+-0.21

X.shape=(5000, 20000)
On main: 18.81+-0.34
On branch: 18.32+-0.15

X.shape=(500, 200000)
On main: 20.61+-0.64
On branch: 19.15+-0.26

Overall, I don't notice any significant speedup with this change. Please let me know if the benchmark file seems off, but if these numbers are accurate (which I believe they are), then it's probably not worth incorporating these changes.
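
For reference, the linked benchmark itself is not reproduced here; a comparable timing script might look roughly like the sketch below. The data shape, alpha, max_iter, and trial count are placeholders, not the actual settings from the gist.

import time
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative benchmark only: time repeated Lasso fits on random dense data.
rng = np.random.RandomState(0)
X = rng.standard_normal((5000, 2000))
y = rng.standard_normal(5000)

times = []
for _ in range(10):
    start = time.perf_counter()
    Lasso(alpha=0.1, max_iter=100).fit(X, y)
    times.append(time.perf_counter() - start)

print(f"{np.mean(times):.2f} +- {np.std(times):.2f} s")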

@Tjalling-a-a
Author

Tjalling-a-a commented Mar 19, 2022

Hi Micky774, thanks for reminding me that this pull request still exists and for running a benchmark.

I do believe that in the 2nd case a speedup of roughly 2.5% has been achieved, and in the 3rd case a speedup of roughly 7%. This could be quite significant for some users.

I would also be interested to see tests with longer run times, since the shorter the run time, the more time is consumed by other functions.

Perhaps these provide different results?
https://github.com/scikit-learn/scikit-learn/blob/main/benchmarks/bench_plot_lasso_path.py
https://github.com/scikit-learn/scikit-learn/blob/main/benchmarks/bench_lasso.py

It has been a while, but I do recall that I tested it on a mostly sparse dataset; perhaps that has an influence as well? Your test uses a dense dataset, if I'm not mistaken?

@Micky774
Contributor

Okay, so I re-ran the tests on a more consistent machine, this time with 30 trials, and performed Welch's t-test on the timings for the code on main and for this feature branch using scipy.stats:

from scipy.stats import ttest_ind
ttest_ind(main_times, branch_times, equal_var=False)

In this case main_times and branch_times were generated the same way as in my previous gist, but now repeated over 30 trials, and I increased the iteration limit of the Lasso estimator to 500 to get a longer computation per trial.
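
(For illustration only: the snippet below simulates two sets of 30 timings with arbitrary means and spreads, just to show how the comparison is made; the real main_times and branch_times came from actual benchmark runs on the two builds.)

import numpy as np
from scipy.stats import ttest_ind

# Placeholder data: normally distributed "timings" standing in for the
# measured main_times and branch_times (30 trials each).
rng = np.random.default_rng(0)
main_times = rng.normal(loc=50.0, scale=0.5, size=30)
branch_times = rng.normal(loc=49.5, scale=0.5, size=30)

# Welch's t-test: does not assume equal variances between the two samples.
result = ttest_ind(main_times, branch_times, equal_var=False)
print(result.statistic, result.pvalue)  # small p-value => significant difference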

For X.shape=(500, 200000)
There is indeed a statistically significant (p < 1e-5) difference between the two, with branch_times being faster; however, it amounts to only about a 1.7% average speedup on dense data. Specifically, the times I observed were:

  • main: 50.66+-0.38
  • branch: 49.80+-0.82

For X.shape=(5000, 20000)
There was not a statistically significant change.

Overall, I would have expected a more significant speedup from the proposed optimization, but this is what I've observed so far.

I tried using the two benchmark files you suggested, but truthfully the difference got lost in the noise since they don't repeat the trials sufficiently, and indeed the margin was small.

I'd still be interested in seeing if there's a change in performance on sparse data, and I may try to evaluate that later, but for now the results are less dramatic than expected.

@Tjalling-a-a
Author

Tjalling-a-a commented Mar 21, 2022

Hm, I did this optimization as part of a GPU implementation.
This rewrite reduced 4 GPU kernels to 3, which makes a huge difference for GPU speed.

It might make sense that the optimization has much less influence on a (single-threaded?) CPU implementation.
I did think it would give some speedup, though, since it basically trades a store for an addition while also removing one "blocking" operation (step 2 cannot start before step 1). But perhaps the compiler is better than I thought...

PS: I don't know how Cython compilation works; perhaps some optimization features are turned off for you? That would of course change both timings.

@Micky774
Contributor

@Tjalling-a-a Do you think you could run the sample benchmark I provided on your own machine as well, for another data point? If the numbers differ drastically, it would justify a deeper investigation.

@Tjalling-a-a
Author

I can, yes, if you could either explain how, or point to a manual on how to rebuild the Cython file :)

@Micky774
Contributor

I can, yes, if you could either explain how, or point to a manual on how to rebuild the Cython file :)

If you already have scikit-learn built locally, then running python setup.py develop should suffice. That will automatically detect the altered source (.pyx) file and re-compile that Cython component. If you want/need to rebuild scikit-learn, then see the advanced installation instructions for more details. Let me know if you have any other questions or run into any issues with it -- thanks!

@Tjalling-a-a
Author

Thanks, I will try and take a look at it this weekend!
