ENH CountVectorizer: sort features after pruning by frequency #15834

smola · 2019-12-08T16:06:39Z

In CountVectorizer, sort features after pruning by frequency rather
than before, so that only the features that survive pruning are sorted.
When using min_df or max_df with large vocabularies, this can
be a significant speedup.
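
To illustrate why the order matters, here is a toy sketch in plain Python. It is not scikit-learn's code: the function names and the `doc_freq` mapping are made up for illustration. Both orderings produce the same feature list, but the second one sorts only the terms that pass the `min_df` threshold.

```python
# Toy illustration only (hypothetical names, not scikit-learn internals).
# doc_freq maps each term to the number of documents it appears in.

def sort_then_prune(doc_freq, min_df):
    # Sorts the full vocabulary, then drops rare terms.
    ordered = sorted(doc_freq)
    return [term for term in ordered if doc_freq[term] >= min_df]


def prune_then_sort(doc_freq, min_df):
    # Drops rare terms first, so the sort only touches the survivors.
    kept = [term for term, df in doc_freq.items() if df >= min_df]
    return sorted(kept)


doc_freq = {"the": 5, "zebra": 1, "apple": 3, "qu": 1, "cat": 2}

# Both orderings yield the same result.
assert sort_then_prune(doc_freq, 2) == prune_then_sort(doc_freq, 2)
print(prune_then_sort(doc_freq, 2))  # ['apple', 'cat', 'the']
```

With character n-grams, most of the vocabulary tends to be rare, so pruning first can shrink the input to the sort considerably.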

Here's a simple benchmark:

from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(subset="all")

from time import time
from sklearn.feature_extraction.text import CountVectorizer

n_doc = 2000

for min_df in (1, 2, 3, 5, 8):
    for min_ngram_range in (1,):
        for max_ngram_range in (min_ngram_range + 5,):
            t0 = time()
            cv = CountVectorizer(
                input="content",
                analyzer="char",
                lowercase=False,
                ngram_range=(min_ngram_range, max_ngram_range),
                min_df=min_df,
            )
            cv.fit(dataset.data[:n_doc])
            print(
                "time=%f min_df=%d ngram_range=(%d, %d)"
                % (time() - t0, min_df, min_ngram_range, max_ngram_range)
            )

and example results before the change:

time=20.674258 min_df=1 ngram_range=(1, 6)
time=21.181467 min_df=2 ngram_range=(1, 6)
time=21.105731 min_df=3 ngram_range=(1, 6)
time=21.049018 min_df=5 ngram_range=(1, 6)
time=21.095739 min_df=8 ngram_range=(1, 6)

and after the change:

time=20.434718 min_df=1 ngram_range=(1, 6)
time=17.153750 min_df=2 ngram_range=(1, 6)
time=16.405343 min_df=3 ngram_range=(1, 6)
time=16.040197 min_df=5 ngram_range=(1, 6)
time=15.901990 min_df=8 ngram_range=(1, 6)
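
For a quick read of these numbers, the relative reduction per min_df can be computed as below. This is only arithmetic over the timings reported above (single runs, so treat the percentages as rough); the helper name `reduction_pct` is made up for this sketch.

```python
# Timings copied from the results above, keyed by min_df (seconds).
before = {1: 20.674258, 2: 21.181467, 3: 21.105731, 5: 21.049018, 8: 21.095739}
after = {1: 20.434718, 2: 17.153750, 3: 16.405343, 5: 16.040197, 8: 15.901990}


def reduction_pct(min_df):
    # Percentage of fit time saved after the change for a given min_df.
    return 100.0 * (before[min_df] - after[min_df]) / before[min_df]


for min_df in before:
    print("min_df=%d: %.1f%% less time" % (min_df, reduction_pct(min_df)))
```

With min_df=1 nothing is pruned, so the two runs are nearly identical; the saving grows as min_df increases and more features are dropped before sorting.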

jnothman

Great idea! Not sure how we missed this when we cleaned this code up a couple of years ago...

Please add an Efficiency entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:.

smola · 2019-12-09T14:54:51Z

@jnothman Thank you for the review. I have rebased and added the changelog entry.

jnothman · 2019-12-09T19:00:11Z

No need to rebase. Please just append commits.

Awaiting another review.

rth

Thanks @smola, LGTM.

jnothman approved these changes Dec 8, 2019

ENH CountVectorizer: sort features after pruning by frequency

e1c402f

smola force-pushed the countvectorizer-sort branch from 24d8ce7 to e1c402f Compare December 9, 2019 14:53

jnothman added the Waiting for Reviewer label Dec 9, 2019

rth approved these changes Dec 10, 2019

rth merged commit 7ee0ae8 into scikit-learn:master Dec 10, 2019

smola deleted the countvectorizer-sort branch December 10, 2019 14:02

panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020

ENH CountVectorizer: sort features after pruning by frequency (scikit…

e6cba52

…-learn#15834)

thomasjpfan mentioned this pull request Jul 17, 2020

Diff in CountVectorizer between versions 0.22 and 0.23 #17939

Closed
