Text vectorizers clean-up / TfidfVectorizer deprecation #14951

rth · 2019-09-11T09:18:32Z

As mentioned by @jnothman in #14748 (comment),

I'm also happy to slowly deprecate TfidfVectorizer because it provides no
benefit over a pipeline and creates much confusion. I've seen users do
weird stuff like compare TfidfVectorizer to CountVectorizer but use
different params.

+1 on my side to deprecate it.
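
For context, the "no benefit over a pipeline" point can be checked directly. The following is only a small sketch with a made-up toy corpus (`docs` is an assumption, not from the issue), comparing the default TfidfVectorizer against CountVectorizer followed by TfidfTransformer:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy corpus, purely illustrative.
docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

# Two-step pipeline: raw counts first, then tf-idf weighting.
pipe = make_pipeline(CountVectorizer(), TfidfTransformer())
X_pipe = pipe.fit_transform(docs)

# Single-step equivalent with default parameters.
X_tfidf = TfidfVectorizer().fit_transform(docs)

# With matching (default) parameters the two outputs should agree.
same = np.allclose(X_pipe.toarray(), X_tfidf.toarray())
print(same)
```

The comparison only holds when both sides use the same parameters, which is exactly the pitfall described in the quote above.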

I would also deprecate the norm parameter (for removal) in HashingVectorizer, as its combination with TfidfTransformer using default parameters currently produces nonsense results due to norm='l2'. That would also resolve #6972.
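
Until such a deprecation happens, one possible workaround is to turn the normalization off explicitly. This is a sketch, assuming the problem is that rows are l2-normalized before the idf weighting is applied; the corpus, `n_features` value and `alternate_sign=False` are illustrative choices, not something proposed in this issue:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy corpus, purely illustrative.
docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

# norm=None keeps raw hashed counts; alternate_sign=False keeps them non-negative.
hashing = HashingVectorizer(norm=None, alternate_sign=False, n_features=2**10)
counts = hashing.transform(docs)

# TfidfTransformer then applies idf weighting and its own (default l2) normalization.
pipe = make_pipeline(hashing, TfidfTransformer())
X = pipe.fit_transform(docs)

# Row norms of the final output, for inspection.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()
print(row_norms)
```

Normalization then happens once, at the end of the pipeline, rather than both before and after the idf step.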

Generally, I have also seen experienced Python users do very weird things with text vectorizers (particularly as soon as customization is involved). Taking some time to think about how we would ideally like the API to look for 1.0, and whether we can get partially there without major disruption, would, I think, be useful.

In particular, I find that the way we currently suggest customizing the behavior, by subclassing CountVectorizer to update the analyzer, is really awkward. I am pondering some way to separate the pipeline that processes a single document and returns n-grams from the CountVectorizer class. Basically, this would make passing analyzers easier while re-using part of the existing processing.
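
As a point of comparison, here is roughly what can already be done without subclassing: take the per-document callable from `build_analyzer()` and wrap it, then pass the wrapper as `analyzer`. This is just a sketch; the digit-filtering rule and the `analyzer` helper name are hypothetical, chosen only to have something to customize:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Reuse the existing preprocessing + tokenization + n-gram logic.
base_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()


def analyzer(doc):
    # Hypothetical customization: drop any n-gram that contains a digit.
    return [tok for tok in base_analyzer(doc) if not any(ch.isdigit() for ch in tok)]


# A callable analyzer replaces the whole per-document processing step.
vect = CountVectorizer(analyzer=analyzer)
X = vect.fit_transform(["call 911 now", "call me now"])
print(sorted(vect.vocabulary_))
```

It works, but the two-vectorizer dance illustrates why a cleaner separation between the document-level pipeline and the counting step would help.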

Anyway, we can start with easy deprecations.

rth · 2019-09-12T13:00:19Z

deprecate TfidfVectorizer because it provides no
benefit over a pipeline and creates much confusion.

One limiting factor is that pipelines don't support get_feature_names() for now, so to get the previous behavior one would need to do something like pipe.named_steps['vect'].get_feature_names(). Hopefully that will be addressed in the near future.
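
Spelled out, the workaround looks something like the sketch below. The step names `'vect'` and `'tfidf'` and the corpus are assumptions for illustration. Note that later scikit-learn releases use `get_feature_names_out()` instead of `get_feature_names()`, so the snippet falls back between the two:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy corpus, purely illustrative.
docs = ["the cat sat on the mat", "the dog sat on the log"]

pipe = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer())])
pipe.fit(docs)

# Reach into the pipeline to get the fitted vectorizer step.
vect = pipe.named_steps["vect"]

# Use whichever accessor this scikit-learn version provides.
get_names = getattr(vect, "get_feature_names_out", None) or vect.get_feature_names
names = list(get_names())
print(names)
```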

This was referenced Sep 11, 2019

HashingVectorizer uses l2 norm, Countvectorizer doesn't. Thats confusing #6972

Open

Added option to use standard idf term for TfidfTransformer and TfidfVectorizer #14748

Closed

rth mentioned this issue Sep 12, 2019

Deprecate TfidfVectorizer #14966

Open

1 task

thomasjpfan added the API label Oct 26, 2019

rth mentioned this issue Jan 24, 2020

TfidfVectorizer handles multiple text columns #16148

Open

rth mentioned this issue Sep 30, 2021

add tfidf token2idf_ property #21189

Closed

cmarmo added the Needs Decision label Feb 14, 2022

cmarmo added the module:feature_extraction label Mar 23, 2022

thomasjpfan mentioned this issue Apr 14, 2022

Decouple CountVectorizer => TextTokenizer + ItemCountVectorizer #23004

Open
