Parallel HashingVectorizer #20
In the end, this PR contains only the parallel version of HashingVectorizer. CountVectorizer could be parallelized in a follow-up PR; the situation there is more complicated because it is not stateless and the vocabulary needs to be shared between threads. For HashingVectorizer, scaling is reasonably good up to 8-16 CPU cores; beyond that we seem to reach the strong-scaling limit, at least for this dataset. The maximum speed-up obtained is 5x relative to the scalar (single-threaded) version. For
tested on EC2 c4.8xlarge with 36 CPU cores, and loading files from a tmpfs to avoid disk IO limitations. The only limitation is that currently |
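To illustrate why the stateless case is the easy one to parallelize, here is a minimal sketch, not the code from this PR. It uses `std::thread::scope` from the standard library instead of Rayon so that it has no dependencies, and the names `hash_counts` and `parallel_hash_counts`, the whitespace tokenizer and the choice of `DefaultHasher` are all assumptions made for the example. Each document is hashed independently, so chunks of documents can be handed to separate threads without sharing any vocabulary.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

/// Hypothetical helper: hash every whitespace-separated token into one of
/// `n_features` buckets and count occurrences. No vocabulary is required,
/// which is what makes this step stateless.
fn hash_counts(doc: &str, n_features: usize) -> Vec<u32> {
    assert!(n_features > 0, "n_features must be positive");
    let mut counts = vec![0u32; n_features];
    for token in doc.split_whitespace() {
        let mut hasher = DefaultHasher::new();
        token.hash(&mut hasher);
        counts[(hasher.finish() % n_features as u64) as usize] += 1;
    }
    counts
}

/// Hypothetical driver: split the documents into at most `n_threads` chunks,
/// vectorize each chunk on its own scoped thread, then concatenate the
/// per-chunk results in the original document order.
fn parallel_hash_counts(docs: &[&str], n_features: usize, n_threads: usize) -> Vec<Vec<u32>> {
    let n_threads = n_threads.max(1);
    // Ceiling division, with a minimum of 1 so `chunks` never gets a zero size.
    let chunk_size = ((docs.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        // Spawn all workers first, then join them, so they run concurrently.
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|doc| hash_counts(doc, n_features))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Calling `parallel_hash_counts(&docs, 16, 4)` should return the same rows as mapping `hash_counts` over `docs` serially. With Rayon, the chunking and joining would typically be replaced by a parallel iterator over the documents. A CountVectorizer equivalent cannot be written this way directly, because the workers would also have to agree on a shared vocabulary.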
This is a first implementation of the parallel token counting using Rayon.
So far there are two issues,
Benchmarks with 2 CPU cores,
master
this PR
These were run on a 2-core CPU, so RAYON_NUM_THREADS=4 corresponds to hyperthreading. The parallel scaling therefore doesn't look so bad; it's more that the single-threaded implementation in this case is much slower than on master.