Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH CountVectorizer: sort features after pruning by frequency #15834

Merged
merged 1 commit into from Dec 10, 2019

Conversation

smola
Copy link
Contributor

@smola smola commented Dec 8, 2019

In CountVectorizer, sort features after pruning by frequency.
When using min_df or max_df with large vocabularies, this can
be a significant speed up.

Here's a simple benchmark:

from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(subset="all")

from time import time
from sklearn.feature_extraction.text import CountVectorizer

n_doc = 2000

for min_df in (1, 2, 3, 5, 8):
    for min_ngram_range in (1,):
        for max_ngram_range in (min_ngram_range + 5,):
            t0 = time()
            cv = CountVectorizer(
                input="content",
                analyzer="char",
                lowercase=False,
                ngram_range=(min_ngram_range, max_ngram_range),
                min_df=min_df,
            )
            cv.fit(dataset.data[:n_doc])
            print(
                "time=%f min_df=%d ngram_range=(%d, %d)"
                % (time() - t0, min_df, min_ngram_range, max_ngram_range)

and example results before the change:

time=20.674258 min_df=1 ngram_range=(1, 6)
time=21.181467 min_df=2 ngram_range=(1, 6)
time=21.105731 min_df=3 ngram_range=(1, 6)
time=21.049018 min_df=5 ngram_range=(1, 6)
time=21.095739 min_df=8 ngram_range=(1, 6)

and after the change:

time=20.434718 min_df=1 ngram_range=(1, 6)
time=17.153750 min_df=2 ngram_range=(1, 6)
time=16.405343 min_df=3 ngram_range=(1, 6)
time=16.040197 min_df=5 ngram_range=(1, 6)
time=15.901990 min_df=8 ngram_range=(1, 6)

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea! Not sure how we missed this when we cleaned this code up a couple of years ago...

Please add an Efficiency entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

In CountVectorizer, sort features after pruning by frequency.
When using min_df or max_df with large vocabularies, this can
be a significant speed up.
@smola
Copy link
Contributor Author

smola commented Dec 9, 2019

@jnothman Thank you for the review. I have rebased and added the changelog entry.

@jnothman
Copy link
Member

jnothman commented Dec 9, 2019

No need to rebase. Please just append commits.

Awaiting another review.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @smola , LGTM.

@rth rth merged commit 7ee0ae8 into scikit-learn:master Dec 10, 2019
@smola smola deleted the countvectorizer-sort branch December 10, 2019 14:02
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants