[WIP] cd_fast speedup #15931
base: main
Conversation
Rewriting the implementation to remove redundant calculations.

Old implementation:

1. R += w_ii * X[:,ii]
2. tmp = X[:,ii] dot R
3. f(tmp, ...)
4. R -= w[ii] * X[:,ii]

New implementation: substitute R from step 1 into step 2:

tmp = X[:,ii] dot (R + w_ii * X[:,ii])
    = X[:,ii] dot R + w_ii * (X[:,ii] dot X[:,ii])

so step 2 becomes

2. tmp = X[:,ii] dot R + w_ii * norm_cols_X[ii]

and the residual update in step 4 becomes

4. R += (w_ii - w[ii]) * X[:,ii]

This removes step 1 and rewrites steps 2 and 4, improving loop speed. The method is probably also extendable to the other three functions. Sadly my Python skills are not good enough to build and test this; I did, however, test this in C++, so everything should work. Help with building, testing, and extending is very much appreciated.
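To make the proposal concrete, here is a minimal NumPy sketch of one coordinate update in both forms. The function names are invented for illustration, and `soft_threshold` is a hypothetical stand-in for the real update rule `f(tmp, ...)`; the actual implementation is Cython in scikit-learn's cd_fast.

```python
import numpy as np

def soft_threshold(tmp, alpha=0.1, scale=1.0):
    """Placeholder for the real update rule f(tmp, ...) -- an assumption,
    not scikit-learn's exact formula."""
    return np.sign(tmp) * max(abs(tmp) - alpha, 0.0) / scale

def cd_step_old(X, R, w, ii):
    """One coordinate update following the old scheme."""
    w_ii = w[ii]
    R += w_ii * X[:, ii]           # 1. add coordinate ii's contribution back
    tmp = X[:, ii] @ R             # 2. correlate column ii with the residual
    w[ii] = soft_threshold(tmp)    # 3. f(tmp, ...)
    R -= w[ii] * X[:, ii]          # 4. remove the updated contribution

def cd_step_new(X, R, w, norm_cols_X, ii):
    """Same update with steps 1 and 2 fused via precomputed column norms."""
    w_ii = w[ii]
    # 2'. X[:,ii] @ (R + w_ii * X[:,ii]) expanded, so R is read untouched
    tmp = X[:, ii] @ R + w_ii * norm_cols_X[ii]
    w[ii] = soft_threshold(tmp)
    # 4'. single fused residual update replacing steps 1 and 4
    R += (w_ii - w[ii]) * X[:, ii]
```

With `norm_cols_X[ii] == X[:, ii] @ X[:, ii]` precomputed, both versions leave `R` and `w` in the same state, but the new one writes to `R` once per iteration instead of twice.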
Fixes bug in the residual update
The tests at least seem to pass with your changes. Could you please benchmark the speed difference? There might be something in the benchmarks directory to help.
Benchmarked using this. Results are by shape of input data; mean and standard deviation calculated over 10 trials.
Overall I don't notice any significant speedup with this change. Please let me know if the benchmark file seems off, but if these numbers are accurate (which I believe they are) then it's probably not worth incorporating these changes.
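A rough sketch of the kind of harness described (dense random data over several shapes, 10 timed fits each, reporting mean and standard deviation) might look like the following; the shapes, `alpha`, and other parameters are assumptions rather than the values actually used:

```python
import time
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
shapes = [(1_000, 100), (5_000, 500), (10_000, 1_000)]  # assumed shapes

for n_samples, n_features in shapes:
    X = rng.randn(n_samples, n_features)
    y = rng.randn(n_samples)
    times = []
    for _ in range(10):  # 10 trials, as in the comment
        model = Lasso(alpha=0.1, max_iter=1_000)
        t0 = time.perf_counter()
        model.fit(X, y)
        times.append(time.perf_counter() - t0)
    print(f"{(n_samples, n_features)}: "
          f"mean={np.mean(times):.4f}s, std={np.std(times):.4f}s")
```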
Hi Micky774, thanks for reminding me this pull request still exists and for running a benchmark. I do believe that in the 2nd case a speedup of roughly 2.5% was achieved, and in the 3rd case a speedup of roughly 7%. This could be quite significant for some users. I would also be interested to see tests with longer run times, since the shorter the run time, the more time is consumed by other functions. Perhaps those would give different results? It has been a while, but I recall that I tested this on a mostly sparse dataset; perhaps that has an influence as well. Your test uses a dense dataset, if I'm not mistaken?
Okay, so I re-ran the tests on a more consistent machine, this time with 30 trials, and performed Welch's t-test between the code on main and this feature branch using scipy.stats:

    from scipy.stats import ttest_ind
    ttest_ind(main_times, branch_times, equal_var=False)

Overall, I would have imagined that the proposed optimization gave a more significant speedup, but this is what I've observed so far. I tried the two benchmarking files you suggested, but truthfully the difference got lost in the noise since they don't repeat the trials sufficiently, and indeed the error margin was small. I'd still be interested in seeing whether there's a change in performance on sparse data, and I may try to evaluate that later, but for now the results are less dramatic than expected.
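For reference, a self-contained version of that comparison might look like this; the timing arrays are synthetic placeholders standing in for the 30 measured trials per branch:

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the 30 measured fit times (seconds) per branch.
rng = np.random.RandomState(42)
main_times = rng.normal(loc=1.00, scale=0.02, size=30)
branch_times = rng.normal(loc=0.97, scale=0.02, size=30)

# Welch's t-test: does not assume equal variance between the two samples.
stat, pvalue = ttest_ind(main_times, branch_times, equal_var=False)
print(f"t = {stat:.3f}, p = {pvalue:.4f}")
# A small p-value suggests the timing difference is unlikely to be noise.
```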
Hm, I did this optimization as part of a GPU implementation. It might make sense that the optimization has much less influence on a (single-threaded?) CPU implementation. P.S.: I don't know how Cython compilation works; perhaps there are some optimization features turned off for you? That would of course change both timings.
@Tjalling-a-a Do you think you could use the sample benchmark I provided and run it on your own machine as well, for another data point? If the numbers differ drastically, it would justify a deeper investigation.
I can, yes, if you could either explain or point me to a manual on how to rebuild the Cython file :)
If you already have scikit-learn built locally, then running
Thanks, I will try and take a look at it this weekend! |