ENH MinHash parallel #267
Conversation
I don't understand why the coverage has changed. It seems that the function called by joblib.Parallel (…
I worry that self.hash_dict was really useful to speed things up by avoiding recomputation of repeated entries.
We use np.unique, so for one transform we don't recompute repeated entries.
Using self.hash_dict would indeed speed things up if we transform several
inputs with common entries, using the same encoder. Do you think that's
likely to happen / worth the additional memory usage ?
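To illustrate the point above: np.unique with return_inverse=True lets a single transform hash each distinct value only once and broadcast the results back to every row. A minimal sketch; the modular hash below is just a stand-in for the real MinHash computation:

```python
import numpy as np

X = np.array(["london", "paris", "london", "london", "paris"])

# Hash only the distinct values, then map the results back to all rows.
unique_x, indices_x = np.unique(X, return_inverse=True)
hashes = np.array([hash(v) % 2**32 for v in unique_x])  # stand-in for MinHash
X_trans = hashes[indices_x]

print(len(unique_x))   # 2 distinct values hashed instead of 5 rows
print(X_trans.shape)   # (5,)
```

This is why repeated entries within one call are already cheap; self.hash_dict would only help when the same distinct values recur across separate transform calls.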
On Wed, Jun 29, 2022, 20:59 Gael Varoquaux wrote:
I worry that the self.hash_dict was really useful to speed things up by
avoiding recomputation of repeated entries
Do you think that's likely to happen / worth the additional memory usage ?
Yes: it's very frequent. Typically, the entries are repeated many times.
@GaelVaroquaux But are people often using the same encoder to transform several Xs?
@GaelVaroquaux just want to make sure I understood what you were saying before putting the hash_dict back in the code.
Following discussion with @GaelVaroquaux : using the same encoder to transform several Xs may happen in online learning settings, for instance with a big X. |
To improve coverage, maybe you can add a test for the MinHashEncoder asserting that the parameters need to have the right value, and that if not, the appropriate error is raised (for instance when the hashing parameter is neither fast nor murmur).
Otherwise looks good!
Thanks!
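A sketch of the suggested test. MinHashEncoderStub is a hypothetical stand-in that reproduces only the parameter validation being discussed here, not dirty_cat's actual implementation:

```python
class MinHashEncoderStub:
    """Hypothetical stand-in reproducing only the hashing-parameter check."""

    def __init__(self, hashing="fast"):
        self.hashing = hashing

    def fit(self, X, y=None):
        # Validate the parameter at fit time; raise a clear error otherwise.
        if self.hashing not in ("fast", "murmur"):
            raise ValueError(
                f"Got hashing={self.hashing!r}, expected 'fast' or 'murmur'."
            )
        return self

# The test asserts that an invalid value raises ValueError.
raised = False
try:
    MinHashEncoderStub(hashing="bogus").fit([["a"], ["b"]])
except ValueError:
    raised = True
print("ValueError raised:", raised)  # ValueError raised: True
```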
dirty_cat/minhash_encoder.py (Outdated)

# Compute the hashes for unique values
unique_x, indices_x = np.unique(X, return_inverse=True)
unique_x_trans = Parallel(n_jobs=self.n_jobs)(delayed(compute_hash)(x) for x in unique_x)
I think that we should apply things on batches rather than on single rows. I fear that single rows will lead to overhead.
Maybe we should try to use
https://scikit-learn.org/stable/modules/generated/sklearn.utils.gen_even_slices.html
as in
https://github.com/scikit-learn/scikit-learn/blob/703bee65e2122ad273cf0b42460c5c28ca638af8/sklearn/decomposition/_lda.py#L460
We should benchmark this to see which is the fastest / best option. Talk to @jjerphan, he has good experience in benchmarking parallel computing.
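The batched alternative could be sketched as follows, following the gen_even_slices pattern from the linked scikit-learn code. compute_hash_batch is a hypothetical helper, and the modular hash stands in for the real per-entry MinHash computation:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.utils import gen_even_slices


def compute_hash_batch(batch):
    # Hypothetical helper: hash a whole batch of unique values at once,
    # so each joblib task amortizes its dispatch overhead.
    return np.array([hash(v) % 2**32 for v in batch])  # stand-in for MinHash


X = np.array(["cat", "dog", "cat", "bird", "dog", "fish"])
unique_x, indices_x = np.unique(X, return_inverse=True)

n_jobs = 2
# One even slice of the unique values per job, instead of one task per row.
slices = list(gen_even_slices(len(unique_x), n_jobs))
results = Parallel(n_jobs=n_jobs)(
    delayed(compute_hash_batch)(unique_x[s]) for s in slices
)
unique_x_trans = np.concatenate(results)
X_trans = unique_x_trans[indices_x]  # broadcast back to the original rows
print(X_trans.shape)  # (6,)
```

The trade-off is task granularity: one task per row maximizes load balance but pays joblib dispatch overhead per entry, while one slice per job pays it only n_jobs times.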
Benchmarking really quickly on my Mac M1 with 8 cores, the rows version is about twice as fast as the batched version, but we should do a more serious benchmark.
The changelog test apparently works fine :) You should add the PR number to CHANGES.rst
…or reproducibility
Ready for review!
Much better!
Just waiting for Jovan's review, and we'll be ready to merge :)
Thanks! Just two minor comments/questions.
...marks/results/minhash_batch_comparison-20221119-0181acf6fe4933f17ea34ccbc85dca3975c8e152.csv (Outdated)
# The MinHashEncoder version used for the benchmark
# On the main branch, we only kept the best version of the MinHashEncoder,
# which is the batched version with batch_per_job=1
# (the batch_per_job parameter has no effect on the results)
From what I see, batches are useful in the end.
The detailed benchmarking results also depend on the number of batches per job; is it because the gain is not that big that we don't leave it as a choice?
Yes, the impact of batch_per_job is very small, and the impact of batched is large but seems quite constant. In both cases, I don't think it is worth having additional parameters, both for ease of use and ease of maintenance.
Merging! Thanks @LeoGrin
Compute the MinHash transform in parallel, as suggested by @alexis-cvetkov.
We no longer use the self.hash_dict attribute, so the fit method is now a no-op.