Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactored Count Vectorizer to be more memory efficient on N-grams #7107

Closed
wants to merge 2 commits into from

Conversation

qingyili
Copy link

Reference Issue

What does this implement/fix? Explain your changes.

Any other comments?

When max_features are specified, CountVectorizer will first count the word occurrence. It will take the the top occurring words up to max_features. It will only make a 2-gram when its 1 gram count is apart of max_features, make a 3-gram when its 2-gram is apart of the max features and so on. This faster and more memory efficient on n-grams when the data sets are large.

@@ -674,7 +725,7 @@ def __init__(self, input='content', encoding='utf-8',
self.max_features = max_features
if max_features is not None:
if (not isinstance(max_features, numbers.Integral) or
max_features <= 0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

@amueller
Copy link
Member

amueller commented Jul 28, 2016

can you please post speed benchmarks for varying ngram_ranges and dataset sizes?

@@ -1222,7 +1292,6 @@ def __init__(self, input='content', encoding='utf-8',
max_features=None, vocabulary=None, binary=False,
dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
sublinear_tf=False):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please try not to make any non-function modifications to the code; it makes it harder to review / in this case the newline should be there.

@nelson-liu
Copy link
Contributor

travis doesn't seem happy, but i'm very interested to see benchmarks for this as well.

@rth rth mentioned this pull request Jan 30, 2017
Base automatically changed from master to main January 22, 2021 10:49
@adrinjalali
Copy link
Member

Closing, never got response from the OP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants