Refactored Count Vectorizer to be more memory efficient on N-grams #7107

qingyili · 2016-07-28T19:53:58Z

Reference Issue

What does this implement/fix? Explain your changes.

Any other comments?

When max_features is specified, CountVectorizer first counts word occurrences and keeps the most frequent words, up to max_features. It then builds a 2-gram only when its 1-gram is among the kept features, a 3-gram only when its 2-gram is among them, and so on. This is faster and more memory efficient on n-grams when the data sets are large.
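The level-by-level pruning described above can be sketched roughly as follows. This is not the code from this PR, only an illustration under stated assumptions: `pruned_ngram_counts` is a hypothetical helper, tokenization is plain whitespace splitting, "its 1-gram" is read as the (n-1)-gram prefix of the candidate n-gram, and a final cut to `max_features` is applied across all levels.

```python
from collections import Counter


def pruned_ngram_counts(docs, max_n, max_features):
    """Count n-grams level by level, extending only frequent (n-1)-grams.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    tokenized = [doc.lower().split() for doc in docs]
    all_counts = Counter()
    kept = None  # surviving (n-1)-grams; None means no filter yet (unigram level)
    for n in range(1, max_n + 1):
        level = Counter()
        for tokens in tokenized:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Assumption: an n-gram is built only if its (n-1)-gram
                # prefix was among the top max_features of the previous level.
                if kept is None or gram[:-1] in kept:
                    level[gram] += 1
        # Keep only the most frequent n-grams of this level.
        kept = {gram for gram, _ in level.most_common(max_features)}
        all_counts.update({gram: level[gram] for gram in kept})
    # Assumption: the final vocabulary is the overall top max_features.
    return dict(all_counts.most_common(max_features))


counts = pruned_ngram_counts(["a b a b a", "c a"], max_n=2, max_features=2)
print(counts)
```

In this toy input the bigram `("c", "a")` is never counted, because `("c",)` does not survive the unigram cut. The memory saving comes from never materializing such candidates, at the cost of possibly missing a frequent n-gram whose prefix was pruned.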

amueller · 2016-07-28T20:01:47Z

sklearn/feature_extraction/text.py

@@ -674,7 +725,7 @@ def __init__(self, input='content', encoding='utf-8',
        self.max_features = max_features
        if max_features is not None:
            if (not isinstance(max_features, numbers.Integral) or
-                    max_features <= 0):


amueller · 2016-07-28T20:04:42Z

can you please post speed benchmarks for varying ngram_ranges and dataset sizes?

nelson-liu · 2016-07-28T20:38:38Z

sklearn/feature_extraction/text.py

@@ -1222,7 +1292,6 @@ def __init__(self, input='content', encoding='utf-8',
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
-


please try not to make any non-functional modifications to the code; they make it harder to review, and in this case the newline should be there.

nelson-liu · 2016-07-28T20:40:47Z

travis doesn't seem happy, but i'm very interested to see benchmarks for this as well.

adrinjalali · 2024-04-19T10:48:38Z

Closing, never got response from the OP.

qingyili added 2 commits July 28, 2016 12:39

refactored countvectorizer to be more memory efficient on ngrams

b77a3f1

removed print statemetents

bd74d92

amueller reviewed Jul 28, 2016

nelson-liu reviewed Jul 28, 2016

rth mentioned this pull request Jan 30, 2017

[MRG+1] Ngram Performance #7567

Merged

amueller added the Stalled and help wanted labels Sep 27, 2018

github-actions bot added the module:feature_extraction label Mar 2, 2020

Base automatically changed from master to main January 22, 2021 10:49

adrinjalali closed this Apr 19, 2024
