Gap encoder speedups #680
Nice implem, thanks!
In addition to the comments below, it also needs a proper test :)
Thanks for the review! What kind of test do you have in mind? A test to check that |
Maybe test that passing |
Very nice. I just gave it a quick try: On main:
On this branch:
6 seconds has a very, very different feel than 22 seconds. And if I print X and just look at it with my eyes, the results seem exactly the same (so the difference is small). Very nice!!
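To go beyond eyeballing, the two outputs could be compared numerically. This is a minimal sketch with simulated arrays standing in for the transform results of each branch (in a real check you would load the saved outputs instead):

```python
import numpy as np

# Simulated outputs standing in for GapEncoder.transform results
# from the two branches (illustrative placeholders only).
rng = np.random.default_rng(0)
X_main = rng.random((100, 30))
X_branch = X_main + rng.normal(scale=1e-9, size=X_main.shape)

# Tolerate tiny numerical drift from reordered floating-point operations.
print(np.allclose(X_main, X_branch, atol=1e-6))  # → True
```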
LGTM. Good for merge as far as I am concerned. Thanks!!
I added a test to check that it doesn't break with max_no_improvement=None, but it seems hard to check the number of iterations (we don't have a |
A few comments and this is good to go!
HW = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)
# in sklearn, this was `return WH.tocsr()`, but it breaks the code in our case
# I'm not sure why
return HW
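For context, here is a hedged, self-contained sketch of the sparse trick this snippet relies on: computing the product H @ W only at the non-zero positions of X, then packing the values into a COO matrix. Shapes and names are illustrative, not skrub's actual internals.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, m, k = 6, 8, 3
H = rng.random((n, k))       # illustrative factor matrices
W = rng.random((k, m))
X = sp.random(n, m, density=0.2, format="csr", random_state=0)

# Only compute the dot product at the (row, col) positions where X is non-zero.
ii, jj = X.nonzero()
dot_vals = np.einsum("ik,ki->i", H[ii], W[:, jj])
HW = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)

# Sanity check against the full dense product at those positions.
full = H @ W
print(np.allclose(HW.toarray()[ii, jj], full[ii, jj]))  # → True
```

When X is very sparse, this avoids materializing the full n x m dense product, which is where the speedup comes from.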
This will need further investigation in another PR
Also, why is the parquet file necessary?
It contains the results of the hyperparameter benchmark.
Yes, but why would we need this file? It's usually not recommended to commit data to a repository; the point of having benchmark files is also to be able to reproduce this data.
Yeah, I'm not sure. We've been doing this for the previous benchmarks: they're pretty small files, and it makes it straightforward for other people to analyse the results. But I see your point. I'm kinda pro pushing the results, but I don't have a strong opinion.
Ok, let's keep this file and move fast. We'll discuss these benchmark files later :)
Hey @LeoGrin, could you add a test with |
Hey @LeoGrin thanks, I think this is ready to be merged after you merge with main.
There's an error that seems unrelated to your PR. I fixed it; could you rebase on main to see if it's still there?
Thanks @LeoGrin!! This will be such an improvement.
Where do you see this @GaelVaroquaux?
I go to the commit history (from the main GitHub page), click on the green tick for a given commit, then follow the CircleCI link.
Speed up the GapEncoder.

- Implement better early stopping for the GapEncoder, based on the score. Benchmarked in Benchmark gap encoder early stopping #681. Based on Benchmark Early Stopping for the GapEncoder #663 and Benchmark GapEncoder divergence #593. Remove the stopping criterion based on `W_change`, as it was never reached, and thus remove the parameter `tol`. Also remove the `min_iter` parameter.
- Speed up each loop by taking advantage of the sparsity of the count matrix. More precisely, compute H @ W only where X is non-zero (this is adapted from sklearn's MiniBatchNMF). I sped up the W update, which was the bottleneck. The H update was already better optimized for sparsity, but maybe it can also be sped up using `_special_sparse_dot` (I won't have time to do it in the near future).
- Benchmark the hyperparameters `batch_size`, `max_iter_e_step` and `max_no_improvement` after these changes. They now default to 1024, 1 and 5. Note: the benchmark should be rerun when we deal with id columns (Handle id columns differently #585), because the hyperparameters were chosen mostly to speed up the GapEncoder on these columns, as they are currently the bottleneck. `max_iter_e_step` should perhaps also be increased if we manage to speed up the H update.

Quick benchmark done on my laptop, comparing the speed of this implementation to the current main branch on the columns of `traffic_violations` (restricted to 100K rows):
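The score-based early stopping with `max_no_improvement` described above can be sketched as a standalone loop. This is a hypothetical illustration, not the actual GapEncoder internals: the fit loop and the per-batch scores are placeholders.

```python
def fit_with_early_stopping(scores, max_no_improvement=5):
    """Return the number of iterations run before stopping.

    `scores` stands in for the per-batch score of the model; training
    stops once the best score has not improved for `max_no_improvement`
    consecutive iterations (None disables early stopping entirely).
    """
    best = float("-inf")
    no_improvement = 0
    n_iter = 0
    for n_iter, score in enumerate(scores, start=1):
        if score > best:
            best = score
            no_improvement = 0
        else:
            no_improvement += 1
        if max_no_improvement is not None and no_improvement >= max_no_improvement:
            break
    return n_iter

# The score plateaus after iteration 3; with max_no_improvement=2
# we stop two iterations later, at iteration 5.
print(fit_with_early_stopping([0.1, 0.5, 0.9, 0.9, 0.9, 0.9], max_no_improvement=2))  # → 5
```

With `max_no_improvement=None` the loop always runs through all the scores, which is the case the added test covers.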