Parallel HashingVectorizer #20

Merged: 13 commits, May 9, 2019

Conversation

@rth (Owner) commented Mar 3, 2019

This is a first implementation of the parallel token counting using Rayon.

So far there are two issues.

Benchmarks with 2 CPU cores:

master

# vectorizing 19924 documents:
      HashingVectorizer (text-vectorize): 1.29s [70.3 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.69s [33.8 MB/s], shape=(19924, 208706), nnz=3962338

this PR

On a 2-core CPU, RAYON_NUM_THREADS=4 corresponds to using hyperthreads:

$ RAYON_NUM_THREADS=1 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 2.11s [43.2 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 3.22s [28.3 MB/s], shape=(19924, 208706), nnz=3962338
$ RAYON_NUM_THREADS=2 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 1.42s [63.9 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.59s [35.2 MB/s], shape=(19924, 208706), nnz=3962338
$ RAYON_NUM_THREADS=4 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 1.33s [68.2 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.48s [36.6 MB/s], shape=(19924, 208706), nnz=3962338

So the parallel scaling doesn't look bad; rather, the single-threaded path in this PR is much slower than on master.
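The chunk-and-merge pattern behind the parallel counting can be sketched as follows (an illustrative Python sketch, not the actual vtext/Rayon code; `count_tokens` and `parallel_count` are hypothetical names). The merge step at the end is the kind of extra work that can make the single-threaded parallel path slower than a purely sequential implementation:

```python
# Hypothetical sketch of chunk-and-merge parallel token counting.
# Not the vtext implementation; names are illustrative only.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_tokens(docs):
    """Count whitespace-separated tokens in one chunk of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts


def parallel_count(docs, n_jobs=2):
    """Split the corpus into n_jobs chunks, count each chunk in a
    worker, then merge the per-chunk counters."""
    chunk_size = (len(docs) + n_jobs - 1) // n_jobs
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        for partial in pool.map(count_tokens, chunks):
            merged.update(partial)  # merge cost absent in the sequential path
    return merged
```

The merged result is identical to a sequential count; only the per-chunk splitting and merging overhead differs.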

@rth rth mentioned this pull request Mar 3, 2019
@rth rth changed the title WIP Parallel token counting Parallel HashingVectorizer May 8, 2019
@rth (Owner, Author) commented May 8, 2019

In the end, this PR contains only the parallel version of the HashingVectorizer. The CountVectorizer could be parallelized in a follow-up PR; the situation there is more complicated, since it is not stateless and the vocabulary needs to be shared between threads.
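The reason the HashingVectorizer parallelizes so easily is the hashing trick: each token maps to a column index by hashing alone, with no fitted vocabulary. A minimal illustration (not the vtext implementation; md5 here stands in for whatever hash function is actually used):

```python
# Illustrative sketch of the hashing trick: column indices come from
# hashing alone, so no state is shared between threads.
# md5 is a stand-in hash, not the one vtext uses.
import hashlib


def hashed_indices(doc, n_features=2**20):
    """Map each token of a document to a column index, statelessly."""
    indices = []
    for token in doc.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "little")
        indices.append(h % n_features)
    return indices
```

Because every thread computes the same index for the same token, per-chunk sparse-matrix rows can simply be stacked. A CountVectorizer, by contrast, must agree on one token-to-column mapping, so its vocabulary would have to be shared or merged across threads.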

For the HashingVectorizer, the scaling is reasonably good up to 8 to 16 CPU cores; beyond that we seem to reach the strong scaling limit, at least for this dataset. The maximum speed-up obtained is 5x over the single-threaded version. For n_jobs=1 we fall back to the non-parallelized version:

# vectorizing 19924 documents:
     HashingVectorizer (scikit-learn): 4.75s [19.2 MB/s], shape=(19924, 1048576), nnz=4177915
     HashingVectorizer (vtext, n_jobs=1): 1.16s [78.2 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=2): 0.66s [137.4 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=4): 0.40s [227.0 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=8): 0.27s [340.0 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=16): 0.23s [400.7 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=32): 0.23s [394.6 MB/s], shape=(19924, 1048576), nnz=3961670

Tested on an EC2 c4.8xlarge instance with 36 CPU cores, loading files from tmpfs to avoid disk I/O limitations.

The only limitation is that n_jobs>1 currently uses all available CPU cores rather than the requested number. It can be adjusted at startup with the RAYON_NUM_THREADS environment variable. Fixing this properly would require using a local Rayon thread pool instead of the global one; I have not yet found how to do that with the current pipeline definition.
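On the Rust side, a local pool would presumably be built with Rayon's `ThreadPoolBuilder` (an explicit `num_threads`, then running the work inside `pool.install(...)`). The desired behavior can be sketched language-neutrally in Python as a pool that is created per call with an explicit worker count, instead of depending on a process-global setting (illustrative only, not the vtext API):

```python
# Illustrative Python analogue (not the vtext API) of a local,
# per-call worker pool with an explicit thread count, as opposed to
# a process-global setting like RAYON_NUM_THREADS.
from concurrent.futures import ThreadPoolExecutor


def tokenize_all(docs, n_jobs):
    """Tokenize documents with a pool that lives only for this call,
    so each call can use a different n_jobs."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map preserves the input order of the documents.
        return list(pool.map(str.split, docs))
```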

@rth rth merged commit 84db353 into master May 9, 2019
@rth rth deleted the parallel-pipe branch May 9, 2019 07:19
@rth rth mentioned this pull request May 20, 2019