
ENH MinHash parallel #267


Merged: 47 commits into skrub-data:main on Dec 16, 2022

Conversation

LeoGrin
Contributor

@LeoGrin LeoGrin commented Jun 29, 2022

Compute the MinHash transform in parallel, as suggested by @alexis-cvetkov.

We no longer use the self.hash_dict attribute, so the fit method does nothing now.
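
Roughly, the change looks like this sketch (a toy, self-contained version: compute_hash and the character n-gram hashing here are simplified stand-ins for the encoder's real hashing logic):

import zlib
import numpy as np
from joblib import Parallel, delayed


def compute_hash(string, n_components=4, n=3):
    # Toy min-hash over character n-grams: one column per hash seed,
    # keeping the minimum hash value over all n-grams of the string.
    grams = {string[i:i + n] for i in range(max(len(string) - n + 1, 1))}
    return np.array(
        [min(zlib.crc32(f"{seed} {g}".encode()) for g in grams)
         for seed in range(n_components)],
        dtype=np.int64,
    )


def transform_parallel(X, n_jobs=2):
    # Hash each unique input string in parallel, then map back to the rows.
    unique_x, indices_x = np.unique(X, return_inverse=True)
    unique_x_trans = Parallel(n_jobs=n_jobs)(
        delayed(compute_hash)(x) for x in unique_x
    )
    return np.stack(unique_x_trans)[indices_x]


transform_parallel(np.array(["paris", "london", "paris"]))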

@LeoGrin
Contributor Author

LeoGrin commented Jun 29, 2022

I don't understand why the coverage has changed. It seems that the function called by joblib.Parallel (compute_hash) is not counted in the coverage, but I may be reading codecov wrong.

@GaelVaroquaux
Member

I worry that self.hash_dict was really useful to speed things up: it avoided recomputing the hashes of repeated entries.

@LeoGrin
Contributor Author

LeoGrin commented Jun 29, 2022 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 29, 2022 via email

@LeoGrin
Contributor Author

LeoGrin commented Jun 29, 2022

@GaelVaroquaux But are people often using the same encoder to transform several Xs?

@LeoGrin
Contributor Author

LeoGrin commented Jul 6, 2022

@GaelVaroquaux I just want to make sure I understood what you were saying before putting the hash_dict back in the code.

@LeoGrin
Contributor Author

LeoGrin commented Jul 12, 2022

Following a discussion with @GaelVaroquaux: using the same encoder to transform several Xs may happen in online learning settings, for instance with a big X.
I've put the hash_dict back in.
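
As a rough sketch of what the cache buys us (hash_dict keyed by the raw string; compute_hash as in the toy sketch above), repeated entries are hashed only once, including across successive transform calls:

import numpy as np


class CachedMinHash:
    def __init__(self):
        self.hash_dict = {}  # string -> precomputed hash vector

    def transform(self, X):
        out = []
        for x in X:
            if x not in self.hash_dict:
                # Only hash strings that have never been seen before.
                self.hash_dict[x] = compute_hash(x)
            out.append(self.hash_dict[x])
        return np.stack(out)


enc = CachedMinHash()
enc.transform(np.array(["paris", "paris", "london"]))  # fills the cache
enc.transform(np.array(["paris"]))                     # reuses cached hashes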

@LilianBoulard LilianBoulard added this to the 0.4.0 release milestone Sep 9, 2022
Member

@jovan-stojanovic jovan-stojanovic left a comment


To improve coverage, maybe you can add a test for the min_hash asserting that the parameters need to have the right values and, if not, that the appropriate error is raised (for instance when the hashing parameter is neither "fast" nor "murmur").
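
Something along these lines, for instance (a sketch only: the import path, the exception type and where it is raised are assumptions to check against the encoder's validation code):

import pytest
from dirty_cat import MinHashEncoder  # import path at the time of this PR


def test_min_hash_wrong_hashing():
    # An unsupported `hashing` value should raise an informative error.
    encoder = MinHashEncoder(hashing="aaa")
    with pytest.raises(ValueError):
        encoder.fit_transform([["paris"], ["london"]])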

Otherwise looks good!

Thanks!


# Compute the hashes for unique values
unique_x, indices_x = np.unique(X, return_inverse=True)
unique_x_trans = Parallel(n_jobs=self.n_jobs)(delayed(compute_hash)(x) for x in unique_x)
Member


I think that we should apply things on batches rather than on single rows. I fear that single rows will lead to overhead.

Maybe we should try to use
https://scikit-learn.org/stable/modules/generated/sklearn.utils.gen_even_slices.html
as in
https://github.com/scikit-learn/scikit-learn/blob/703bee65e2122ad273cf0b42460c5c28ca638af8/sklearn/decomposition/_lda.py#L460

We should benchmark this to see which is the fastest / best option. Talk to @jjerphan, he has good experience in benchmarking parallel computing.
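
As a hedged sketch of what the batched version could look like (gen_even_slices and effective_n_jobs are existing scikit-learn / joblib utilities; compute_hash_batched is a made-up helper that reuses the per-string compute_hash from the toy sketch above):

import numpy as np
from joblib import Parallel, delayed, effective_n_jobs
from sklearn.utils import gen_even_slices


def compute_hash_batched(batch):
    # Hash every string of one slice of the unique values.
    return np.stack([compute_hash(x) for x in batch])


def transform_batched(X, n_jobs=2):
    unique_x, indices_x = np.unique(X, return_inverse=True)
    n_jobs_eff = effective_n_jobs(n_jobs)
    # One even slice per joblib task instead of one task per row,
    # to reduce dispatching overhead.
    unique_x_trans = Parallel(n_jobs=n_jobs_eff)(
        delayed(compute_hash_batched)(unique_x[batch_slice])
        for batch_slice in gen_even_slices(len(unique_x), n_jobs_eff)
    )
    return np.concatenate(unique_x_trans)[indices_x]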

Contributor Author


Benchmarking really quickly on my Mac M1 with 8 cores, the rows version is about twice as fast as the batched version, but we should do a more serious benchmark.

@jovan-stojanovic jovan-stojanovic changed the title Min hash parallel ENH MinHash parallel Nov 16, 2022
@jovan-stojanovic
Member

The changelog test apparently works fine :) You should add the PR number to CHANGES.rst

@LeoGrin
Contributor Author

LeoGrin commented Dec 14, 2022

  • I've run the benchmark (thanks Lilian) on Margaret, and it confirms that the batched version seems to always be better. I've also added a batch_per_job parameter to the benchmark, but it had no influence on the results.

  • I've put the MinHashEncoder code with all options in the bench_minhash function, for reproducibility.

  • In the original MinHashEncoder file, I've only kept the batched version and removed the parameter to choose between them.

Ready for review!

Member

@LilianBoulard LilianBoulard left a comment


Much better!

@LilianBoulard
Member

Just waiting for Jovan's review, and we'll be ready to merge :)

Member

@jovan-stojanovic jovan-stojanovic left a comment


Thanks! Just two minor comments/questions.

Comment on lines +2 to +6
# The MinHashEncoder version used for the benchmark
# On the main branch, we only kept the best version of the MinHashEncoder
# which is the batched version
# with batch_per_job=1
# (the batch_per_job parameter has no effect on the results)
Member


From what I see, batches are useful in the end.
The detailed benchmarking results also depend on the number of batches per job; is it because the gain is not that big that we don't leave it as a choice?

Contributor Author


Yes, the impact of batch_per_job is very small, and the impact of batching is large but seems quite constant. In both cases, I don't think it is worth having additional parameters, both for ease of use and ease of maintenance.

@jovan-stojanovic
Member

Merging! Thanks @LeoGrin

@jovan-stojanovic jovan-stojanovic merged commit cd826ab into skrub-data:main Dec 16, 2022