ENH CountVectorizer: sort features after pruning by frequency

In CountVectorizer.fit_transform, sort features after pruning by document frequency instead of before. Only the features that survive pruning are then sorted, which can be a significant speedup when min_df or max_df is used with a large vocabulary.
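
A rough sketch of why the order matters, using plain dictionaries. The helpers below (`sort_terms`, `prune_terms`, `sort_then_prune`, `prune_then_sort`) are hypothetical names made up for illustration; they are not scikit-learn's private `_sort_features` / `_limit_features` methods, and they ignore `max_features` and the reindexing of the sparse matrix. The point is only that both orders should yield the same final vocabulary, while the second sorts fewer terms.

```python
# Illustrative sketch only: hypothetical helpers, not scikit-learn internals.

def sort_terms(doc_freq):
    """Map each term to its column index in alphabetical order."""
    return {term: idx for idx, term in enumerate(sorted(doc_freq))}


def prune_terms(doc_freq, min_df, max_df):
    """Keep only terms whose document frequency lies in [min_df, max_df]."""
    return {t: df for t, df in doc_freq.items() if min_df <= df <= max_df}


def sort_then_prune(doc_freq, min_df, max_df):
    """Old order: sort the full vocabulary, then drop terms."""
    ordered = sort_terms(doc_freq)  # sorts every term, including doomed ones
    kept = prune_terms(doc_freq, min_df, max_df)
    # Re-index the survivors while preserving alphabetical order.
    survivors = [t for t in ordered if t in kept]
    return {t: i for i, t in enumerate(survivors)}, len(doc_freq)


def prune_then_sort(doc_freq, min_df, max_df):
    """New order: drop terms first, then sort only what is left."""
    kept = prune_terms(doc_freq, min_df, max_df)
    return sort_terms(kept), len(kept)


# Toy document frequencies (made-up numbers).
doc_freq = {"the": 100, "zebra": 1, "model": 40, "apple": 35, "qwzx": 1}

old_vocab, old_sorted = sort_then_prune(doc_freq, min_df=2, max_df=90)
new_vocab, new_sorted = prune_then_sort(doc_freq, min_df=2, max_df=90)

print(old_vocab == new_vocab)   # same final vocabulary either way
print(old_sorted, new_sorted)   # number of terms passed to the sort
```

In this toy case the old order sorts all five terms and the new order sorts only the two that survive pruning. With a real corpus, where min_df often removes a large share of rare terms, the gap would be correspondingly larger.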
scikit-learn · Dec 8, 2019 · 24d8ce7
1 parent e94b67a
Showing 1 changed file with 2 additions and 2 deletions.
diff --git a/sklearn/feature_extraction/text.py b/sklearn/feature_extraction/text.py
@@ -1223,8 +1223,6 @@ def fit_transform(self, raw_documents, y=None):
             X.data.fill(1)
 
         if not self.fixed_vocabulary_:
-            X = self._sort_features(X, vocabulary)
-
             n_doc = X.shape[0]
             max_doc_count = (max_df
                              if isinstance(max_df, numbers.Integral)
@@ -1240,6 +1238,8 @@ def fit_transform(self, raw_documents, y=None):
                                                        min_doc_count,
                                                        max_features)
 
+            X = self._sort_features(X, vocabulary)
+
             self.vocabulary_ = vocabulary
 
         return X