
Gap encoder speedups #680

Merged (24 commits, Aug 31, 2023)
Conversation

@LeoGrin (Contributor) commented on Jul 28, 2023

Speed up the GapEncoder.

  • Implement better early stopping for the GapEncoder, based on the score. Benchmarked in Benchmark gap encoder early stopping #681; based on Benchmark Early Stopping for the GapEncoder #663 and Benchmark GapEncoder divergence #593.

    • Use an exponentially weighted average of the batch score instead of computing the full score, following sklearn's MiniBatchNMF code.
    • Remove the early-stopping logic based on W_change, as it was never triggered, and with it the tol parameter.
    • Remove the min_iter parameter.
  • Speed up each loop by exploiting the sparsity of the count matrix: compute H @ W only where X is non-zero (adapted from sklearn's MiniBatchNMF). I sped up the W update, which was the bottleneck. The H update was already better optimized for sparsity, but it could perhaps also be sped up using _special_sparse_dot (I won't have time to do it in the near future).

  • Benchmark the hyperparameters batch_size, max_iter_e_step and max_no_improvement after these changes. They now default to 1024, 1 and 5. Note: the benchmark should be rerun when we deal with id columns (Handle id columns differently #585), because the hyperparameters were chosen mostly to speed up the GapEncoder on these columns, as they are currently the bottleneck. max_iter_e_step should perhaps also be increased if we manage to speed up the H update.
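The early-stopping idea described above can be sketched as follows. This is a hypothetical, minimal monitor mimicking the MiniBatchNMF-style logic: the class name, the `alpha` forgetting factor, and the toy score sequence are illustrative, not skrub's actual internals.

```python
# Minimal sketch of EWA-based early stopping (illustrative, not skrub's code).
class EarlyStopping:
    def __init__(self, batch_size, n_samples, max_no_improvement=5):
        # Forgetting factor: weight given to the newest batch score.
        self.alpha = batch_size / max(n_samples, batch_size)
        self.max_no_improvement = max_no_improvement
        self.ewa_score = None
        self.best_score = None
        self.no_improvement = 0

    def update(self, batch_score):
        """Fold one batch score into the EWA; return True when training should stop."""
        if self.ewa_score is None:
            self.ewa_score = batch_score
        else:
            self.ewa_score = (
                self.alpha * batch_score + (1 - self.alpha) * self.ewa_score
            )
        if self.best_score is None or self.ewa_score < self.best_score:
            self.best_score = self.ewa_score
            self.no_improvement = 0
        else:
            self.no_improvement += 1
        # With max_no_improvement=None, early stopping is disabled entirely.
        return (
            self.max_no_improvement is not None
            and self.no_improvement >= self.max_no_improvement
        )


# Toy run: the score improves, then plateaus above the running average,
# so the EWA stops improving and the monitor eventually fires.
monitor = EarlyStopping(batch_size=512, n_samples=1024, max_no_improvement=5)
scores = [10.0, 6.0, 5.0] + [7.0] * 7
stopped_at = next(i for i, s in enumerate(scores) if monitor.update(s))
```

This avoids ever computing the full objective: only per-batch scores are needed, which is where the speedup over the tol/W_change scheme comes from.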

Quick benchmark done on my laptop comparing the speed of this implementation to the current main branch, on the columns of traffic_violations (restricted to 100K rows):

[image: benchmark results per column, this branch vs main]

@LeoGrin changed the title from "Gap encoder early stopping" to "Gap encoder speedups" on Jul 28, 2023
@LilianBoulard (Member) left a comment:


Nice implem, thanks!
On top of the below comments, it also needs a proper test :)

@LeoGrin (Contributor, Author) commented on Jul 31, 2023

> Nice implem, thanks! On top of the below comments, it also needs a proper test :)

Thanks for the review! What kind of test do you have in mind? A test checking that update_multiplicative_w computes the same thing before and after the change?

@LilianBoulard (Member)

Maybe test that passing None runs the maximum number of iterations, and that with an early-stopping value it doesn't (if possible).

@GaelVaroquaux (Member) commented on Aug 2, 2023

Very nice. I just gave it a quick try:

On main:

In [1]: from skrub import datasets, TableVectorizer
   ...: 

In [2]: data = datasets.fetch_employee_salaries()

In [3]: tab_vec = TableVectorizer()

In [4]: %time X = tab_vec.fit_transform(data.X)
CPU times: user 36.5 s, sys: 2min 9s, total: 2min 46s
Wall time: 22.3 s

On this branch:

CPU times: user 11 s, sys: 30.9 s, total: 41.9 s
Wall time: 6.22 s

6 seconds has a very very different feel than 22 seconds.

And if I print X and just look at it with my eyes, the results seem exactly the same (so the difference is small). Very nice!!

@GaelVaroquaux (Member) left a comment:


LGTM. Good for merge as far as I am concerned. Thanks!!

@LeoGrin (Contributor, Author) commented on Aug 4, 2023

> Maybe test that passing None runs the maximum number of iterations, and that with an early-stopping value it doesn't (if possible).

I added a test to check that it doesn't break with max_no_improvement=None, but it seems hard to check the number of iterations (we don't have an n_iter_ attribute). Do you have an idea of how it could be done simply? Otherwise, are you fine with merging without this test?

@Vincent-Maladiere (Member) left a comment:


A few comments and this is good to go!

Comment on lines +1030 to +1052
HW = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)
# in sklearn, it was return WH.tocsr(), but it breaks the code in our case
# I'm not sure why
return HW
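For context, the hunk above only materializes H @ W at the non-zero entries of the sparse count matrix X, instead of computing the dense product. A minimal, self-contained sketch of that idea, loosely adapted from sklearn's `_special_sparse_dot` (the helper name, shapes, and random data are illustrative, not the PR's exact code):

```python
import numpy as np
import scipy.sparse as sp


def special_sparse_dot(H, W, X):
    """Compute H @ W, but only at the non-zero entries of sparse X.

    H is (n_samples, n_components), W is (n_components, n_features),
    X is sparse (n_samples, n_features). Returns a sparse matrix that
    agrees with H @ W wherever X is non-zero and is zero elsewhere,
    avoiding the dense n_samples x n_features product.
    """
    ii, jj = X.nonzero()
    # dot_vals[k] = H[ii[k], :] @ W[:, jj[k]] for each stored entry.
    dot_vals = np.einsum("ij,ij->i", H[ii, :], W.T[jj, :])
    HW = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)
    return HW


rng = np.random.default_rng(0)
H = rng.random((4, 2))
W = rng.random((2, 5))
X = sp.random(4, 5, density=0.3, random_state=0, format="csr")

HW = special_sparse_dot(H, W, X)
dense = H @ W
# HW matches the dense product at X's non-zero positions and is 0 elsewhere.
```

This is why the W update dominates the cost savings: when X has few non-zeros per row, the number of dot products drops from n_samples * n_features to nnz(X).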
This will need further investigation in another PR

@Vincent-Maladiere (Member)

Also, why is the parquet file necessary?

@LeoGrin (Contributor, Author) commented on Aug 7, 2023

> Also, why is the parquet file necessary?

It contains the results of the hyperparameter benchmark.

@Vincent-Maladiere (Member)

Yes, but still, why would we need this file? It's usually not recommended to commit data to a repository; the point of having benchmark files is also to be able to reproduce this data.

@LeoGrin (Contributor, Author) commented on Aug 7, 2023 via email

@Vincent-Maladiere (Member)

Ok, let's keep this file and move fast. We'll discuss these benchmark files later :)

@Vincent-Maladiere (Member)

Hey @LeoGrin, could you add a test with verbose=True? That would help with the coverage! We'll merge then.

@jovan-stojanovic (Member) left a comment:


Hey @LeoGrin, thanks, I think this is ready to be merged after you merge with main.

@Vincent-Maladiere (Member)

There's an error that seems unrelated to your PR. I fixed it; could you rebase on main to see if it's still there?

Co-authored-by: Lilian <lilian@boulard.fr>
@jovan-stojanovic (Member)

Thanks @LeoGrin!! This will be such an improvement.

@jovan-stojanovic merged commit 3b37a93 into skrub-data:main on Aug 31, 2023
24 checks passed
@GaelVaroquaux (Member)

The doc building is indeed markedly faster:
[images: CircleCI doc-build timings before and after the merge]

Congratulations!

@Vincent-Maladiere (Member)

Where do you see this, @GaelVaroquaux?

@GaelVaroquaux (Member)

I go to the commit history (from the main GitHub page), click the green tick for a given commit, then follow the CircleCI link.
