Parallel HashingVectorizer #20

Merged: 13 commits, May 9, 2019

Conversation

@rth (Owner) commented Mar 3, 2019

This is a first implementation of the parallel token counting using Rayon.

So far there are two issues.

Benchmarks with 2 CPU cores:

master

# vectorizing 19924 documents:
      HashingVectorizer (text-vectorize): 1.29s [70.3 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.69s [33.8 MB/s], shape=(19924, 208706), nnz=3962338

this PR

On a 2-core CPU, RAYON_NUM_THREADS=4 corresponds to using hyperthreads:

$ RAYON_NUM_THREADS=1 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 2.11s [43.2 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 3.22s [28.3 MB/s], shape=(19924, 208706), nnz=3962338
$ RAYON_NUM_THREADS=2 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 1.42s [63.9 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.59s [35.2 MB/s], shape=(19924, 208706), nnz=3962338
$ RAYON_NUM_THREADS=4 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 1.33s [68.2 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.48s [36.6 MB/s], shape=(19924, 208706), nnz=3962338

So the parallel scaling doesn't look bad; rather, the single-threaded path in this PR is much slower than on master.
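The chunk-and-merge pattern behind the parallel counting can be sketched as follows (an illustrative Python sketch, not the actual vtext/Rayon code; `count_tokens` and `parallel_count` are hypothetical names). The merge step at the end is the kind of extra work that can make the single-threaded parallel path slower than a purely sequential implementation:

```python
# Hypothetical sketch of chunk-and-merge parallel token counting.
# Not the vtext implementation; names are illustrative only.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_tokens(docs):
    """Count whitespace-separated tokens in one chunk of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts


def parallel_count(docs, n_jobs=2):
    """Split the corpus into n_jobs chunks, count each chunk in a
    worker, then merge the per-chunk counters."""
    chunk_size = (len(docs) + n_jobs - 1) // n_jobs
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        for partial in pool.map(count_tokens, chunks):
            merged.update(partial)  # merge cost absent in the sequential path
    return merged
```

The merged result is identical to a sequential count; only the per-chunk splitting and merging overhead differs.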

@rth rth mentioned this pull request Mar 3, 2019
@rth rth changed the title WIP Parallel token counting Parallel HashingVectorizer May 8, 2019
@rth (Owner, Author) commented May 8, 2019

In the end, this PR contains only the parallel version of the HashingVectorizer. The CountVectorizer could be parallelized in a follow-up PR; the situation there is more complicated, since it is not stateless and the vocabulary needs to be shared between threads.
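The reason the HashingVectorizer parallelizes so easily is the hashing trick: each token maps to a column index by hashing alone, with no fitted vocabulary. A minimal illustration (not the vtext implementation; md5 here stands in for whatever hash function is actually used):

```python
# Illustrative sketch of the hashing trick: column indices come from
# hashing alone, so no state is shared between threads.
# md5 is a stand-in hash, not the one vtext uses.
import hashlib


def hashed_indices(doc, n_features=2**20):
    """Map each token of a document to a column index, statelessly."""
    indices = []
    for token in doc.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "little")
        indices.append(h % n_features)
    return indices
```

Because every thread computes the same index for the same token, per-chunk sparse-matrix rows can simply be stacked. A CountVectorizer, by contrast, must agree on one token-to-column mapping, so its vocabulary would have to be shared or merged across threads.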

For the HashingVectorizer, the scaling is reasonably good up to 8 to 16 CPU cores; beyond that we seem to reach the strong scaling limit, at least for this dataset. The maximum speed-up obtained is 5x over the single-threaded version. For n_jobs=1 we fall back to the non-parallelized version:

# vectorizing 19924 documents:
     HashingVectorizer (scikit-learn): 4.75s [19.2 MB/s], shape=(19924, 1048576), nnz=4177915
     HashingVectorizer (vtext, n_jobs=1): 1.16s [78.2 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=2): 0.66s [137.4 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=4): 0.40s [227.0 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=8): 0.27s [340.0 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=16): 0.23s [400.7 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=32): 0.23s [394.6 MB/s], shape=(19924, 1048576), nnz=3961670

Tested on an EC2 c4.8xlarge instance with 36 CPU cores, loading files from tmpfs to avoid disk I/O limitations.

The only limitation is that n_jobs>1 currently uses all available CPU cores rather than the requested number. It can be adjusted at startup with the RAYON_NUM_THREADS environment variable. Fixing this properly would require using a local Rayon thread pool instead of the global one; I have not yet found how to do that with the current pipeline definition.
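On the Rust side, a local pool would presumably be built with Rayon's `ThreadPoolBuilder` (an explicit `num_threads`, then running the work inside `pool.install(...)`). The desired behavior can be sketched language-neutrally in Python as a pool that is created per call with an explicit worker count, instead of depending on a process-global setting (illustrative only, not the vtext API):

```python
# Illustrative Python analogue (not the vtext API) of a local,
# per-call worker pool with an explicit thread count, as opposed to
# a process-global setting like RAYON_NUM_THREADS.
from concurrent.futures import ThreadPoolExecutor


def tokenize_all(docs, n_jobs):
    """Tokenize documents with a pool that lives only for this call,
    so each call can use a different n_jobs."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map preserves the input order of the documents.
        return list(pool.map(str.split, docs))
```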

@rth rth merged commit 84db353 into master May 9, 2019
@rth rth deleted the parallel-pipe branch May 9, 2019 07:19
@rth rth mentioned this pull request May 20, 2019