
Gap encoder speedups #680

Merged (24 commits, Aug 31, 2023)
Conversation

@LeoGrin (Contributor) commented on Jul 28, 2023

Speed up the GapEncoder.

  • Implement better early stopping for the GapEncoder, based on the score. Benchmarked in Benchmark gap encoder early stopping #681; based on Benchmark Early Stopping for the GapEncoder #663 and Benchmark GapEncoder divergence #593.

    • Use an exponentially weighted average of the batch score instead of computing the full score, following sklearn's MiniBatchNMF code.
    • Remove the early-stopping logic based on W_change, as it was never triggered, and with it the tol parameter.
    • Remove the min_iter parameter.
  • Speed up each loop by exploiting the sparsity of the count matrix: compute H @ W only where X is non-zero (adapted from sklearn's MiniBatchNMF). I sped up the W update, which was the bottleneck. The H update was already better optimized for sparsity, but it could perhaps also be sped up using _special_sparse_dot (I won't have time to do it in the near future).

  • Benchmark the hyperparameters batch_size, max_iter_e_step and max_no_improvement after these changes. They now default to 1024, 1 and 5. Note: the benchmark should be rerun when we deal with id columns (Handle id columns differently #585), because the hyperparameters were chosen mostly to speed up the GapEncoder on these columns, as they are currently the bottleneck. max_iter_e_step should perhaps also be increased if we manage to speed up the H update.
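The early-stopping idea described above can be sketched as follows. This is a hypothetical, minimal monitor mimicking the MiniBatchNMF-style logic: the class name, the `alpha` forgetting factor, and the toy score sequence are illustrative, not skrub's actual internals.

```python
# Minimal sketch of EWA-based early stopping (illustrative, not skrub's code).
class EarlyStopping:
    def __init__(self, batch_size, n_samples, max_no_improvement=5):
        # Forgetting factor: weight given to the newest batch score.
        self.alpha = batch_size / max(n_samples, batch_size)
        self.max_no_improvement = max_no_improvement
        self.ewa_score = None
        self.best_score = None
        self.no_improvement = 0

    def update(self, batch_score):
        """Fold one batch score into the EWA; return True when training should stop."""
        if self.ewa_score is None:
            self.ewa_score = batch_score
        else:
            self.ewa_score = (
                self.alpha * batch_score + (1 - self.alpha) * self.ewa_score
            )
        if self.best_score is None or self.ewa_score < self.best_score:
            self.best_score = self.ewa_score
            self.no_improvement = 0
        else:
            self.no_improvement += 1
        # With max_no_improvement=None, early stopping is disabled entirely.
        return (
            self.max_no_improvement is not None
            and self.no_improvement >= self.max_no_improvement
        )


# Toy run: the score improves, then plateaus above the running average,
# so the EWA stops improving and the monitor eventually fires.
monitor = EarlyStopping(batch_size=512, n_samples=1024, max_no_improvement=5)
scores = [10.0, 6.0, 5.0] + [7.0] * 7
stopped_at = next(i for i, s in enumerate(scores) if monitor.update(s))
```

This avoids ever computing the full objective: only per-batch scores are needed, which is where the speedup over the tol/W_change scheme comes from.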

Quick benchmark done on my laptop comparing the speed of this implementation to the current main branch, on the columns of traffic_violations (restricted to 100K rows):

[image: benchmark results per column, this branch vs main]

@LeoGrin changed the title from "Gap encoder early stopping" to "Gap encoder speedups" on Jul 28, 2023
@LilianBoulard (Member) left a comment:


Nice implem, thanks!
On top of the below comments, it also needs a proper test :)

@LeoGrin (Contributor, Author) commented on Jul 31, 2023

> Nice implem, thanks! On top of the below comments, it also needs a proper test :)

Thanks for the review! What kind of test do you have in mind? A test checking that update_multiplicative_w computes the same thing before and after the change?

@LilianBoulard (Member)

Maybe test that passing None runs the maximum number of iterations, and that with an early-stopping value it doesn't (if possible).

@GaelVaroquaux (Member) commented on Aug 2, 2023

Very nice. I just gave it a quick try:

On main:

In [1]: from skrub import datasets, TableVectorizer
   ...: 

In [2]: data = datasets.fetch_employee_salaries()

In [3]: tab_vec = TableVectorizer()

In [4]: %time X = tab_vec.fit_transform(data.X)
CPU times: user 36.5 s, sys: 2min 9s, total: 2min 46s
Wall time: 22.3 s

On this branch:

CPU times: user 11 s, sys: 30.9 s, total: 41.9 s
Wall time: 6.22 s

6 seconds has a very very different feel than 22 seconds.

And if I print X and just look at it with my eyes, the results seem exactly the same (so the difference is small). Very nice!!

@GaelVaroquaux (Member) left a comment:


LGTM. Good for merge as far as I am concerned. Thanks!!

@LeoGrin (Contributor, Author) commented on Aug 4, 2023

> Maybe test that passing None runs the maximum number of iterations, and that with an early-stopping value it doesn't (if possible).

I added a test to check that it doesn't break with max_no_improvement=None, but it seems hard to check the number of iterations (we don't have an n_iter_ attribute). Do you have an idea of how it could be done simply? Otherwise, are you fine with merging without this test?

@Vincent-Maladiere (Member) left a comment:


A few comments and this is good to go!

Comment on lines +1030 to +1052
HW = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)
# in sklearn, it was return WH.tocsr(), but it breaks the code in our case
# I'm not sure why
return HW
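For context, the hunk above only materializes H @ W at the non-zero entries of the sparse count matrix X, instead of computing the dense product. A minimal, self-contained sketch of that idea, loosely adapted from sklearn's `_special_sparse_dot` (the helper name, shapes, and random data are illustrative, not the PR's exact code):

```python
import numpy as np
import scipy.sparse as sp


def special_sparse_dot(H, W, X):
    """Compute H @ W, but only at the non-zero entries of sparse X.

    H is (n_samples, n_components), W is (n_components, n_features),
    X is sparse (n_samples, n_features). Returns a sparse matrix that
    agrees with H @ W wherever X is non-zero and is zero elsewhere,
    avoiding the dense n_samples x n_features product.
    """
    ii, jj = X.nonzero()
    # dot_vals[k] = H[ii[k], :] @ W[:, jj[k]] for each stored entry.
    dot_vals = np.einsum("ij,ij->i", H[ii, :], W.T[jj, :])
    HW = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)
    return HW


rng = np.random.default_rng(0)
H = rng.random((4, 2))
W = rng.random((2, 5))
X = sp.random(4, 5, density=0.3, random_state=0, format="csr")

HW = special_sparse_dot(H, W, X)
dense = H @ W
# HW matches the dense product at X's non-zero positions and is 0 elsewhere.
```

This is why the W update dominates the cost savings: when X has few non-zeros per row, the number of dot products drops from n_samples * n_features to nnz(X).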
This will need further investigation in another PR

@Vincent-Maladiere (Member)

Also, why is the parquet file necessary?

@LeoGrin (Contributor, Author) commented on Aug 7, 2023

> Also, why is the parquet file necessary?

It contains the results of the hyperparameter benchmark.

@Vincent-Maladiere (Member)

Yes, but still, why would we need this file? It's usually not recommended to commit data to a repository; the point of having benchmark files is also to be able to reproduce this data.

@LeoGrin (Contributor, Author) commented on Aug 7, 2023 via email

@Vincent-Maladiere (Member)

Ok, let's keep this file and move fast. We'll discuss these benchmark files later :)

@Vincent-Maladiere (Member)

Hey @LeoGrin, could you add a test with verbose=True? That would help with the coverage! We'll merge then.

@jovan-stojanovic (Member) left a comment:


Hey @LeoGrin, thanks, I think this is ready to be merged after you merge with main.

@Vincent-Maladiere (Member)

There's an error that seems unrelated to your PR. I fixed it; could you rebase on main to see if it's still there?

Co-authored-by: Lilian <lilian@boulard.fr>
@jovan-stojanovic (Member)

Thanks @LeoGrin!! This will be such an improvement.

@jovan-stojanovic merged commit 3b37a93 into skrub-data:main on Aug 31, 2023
24 checks passed
@GaelVaroquaux (Member)

The doc building is indeed markedly faster:
[images: CircleCI doc-build timings before and after the merge]

Congratulations!

@Vincent-Maladiere (Member)

Where do you see this, @GaelVaroquaux?

@GaelVaroquaux (Member)

I go to the commit history (from the main GitHub page), click the green tick for a given commit, then follow the CircleCI link.
