ENH CountVectorizer: sort features after pruning by frequency
In CountVectorizer, sort features after pruning by document frequency
instead of before. With min_df or max_df and a large vocabulary, the
sort then only covers the features that survive pruning, which can be
a significant speedup.
smola committed Dec 8, 2019
1 parent e94b67a commit 24d8ce7
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions sklearn/feature_extraction/text.py
@@ -1223,8 +1223,6 @@ def fit_transform(self, raw_documents, y=None):
             X.data.fill(1)
 
         if not self.fixed_vocabulary_:
-            X = self._sort_features(X, vocabulary)
-
             n_doc = X.shape[0]
             max_doc_count = (max_df
                              if isinstance(max_df, numbers.Integral)
@@ -1240,6 +1238,8 @@ def fit_transform(self, raw_documents, y=None):
                                                        min_doc_count,
                                                        max_features)
 
+            X = self._sort_features(X, vocabulary)
+
             self.vocabulary_ = vocabulary
 
         return X
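The effect of the reordering can be illustrated with a small standalone sketch. It does not use scikit-learn: the helper names (`document_frequencies`, `prune`, `sort_then_prune`, `prune_then_sort`) are hypothetical, and plain lists of terms stand in for the sparse matrix columns and vocabulary mapping that `_sort_features` actually reorders. Under those assumptions, it shows that both orderings give the same final vocabulary, while the second one sorts fewer terms.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return df


def prune(terms, df, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    return [t for t in terms if min_df <= df[t] <= max_df]


def sort_then_prune(terms, df, min_df, max_df):
    # Order before the change: sort the full vocabulary, then discard terms.
    return prune(sorted(terms), df, min_df, max_df)


def prune_then_sort(terms, df, min_df, max_df):
    # Order after the change: discard first, so the sort only sees survivors.
    return sorted(prune(terms, df, min_df, max_df))


docs = ["the cat sat", "the dog sat", "the bird flew", "a cat flew"]
df = document_frequencies(docs)
terms = list(df)  # unsorted vocabulary (illustrative stand-in)

old = sort_then_prune(terms, df, min_df=2, max_df=3)
new = prune_then_sort(terms, df, min_df=2, max_df=3)

print(old == new)              # both orderings agree on the result
print(new)                     # surviving terms, sorted
print(len(terms), len(new))    # terms sorted before vs. after the change
```

In this toy corpus the sort shrinks from 7 terms to 4. With a real corpus and an aggressive min_df, the pruned vocabulary is often a small fraction of the original, which is where the saving described in the commit message would come from.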
