Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text vectorizers clean-up / TfidfVectorizer deprecation #14951

Open
rth opened this issue Sep 11, 2019 · 1 comment
Open

Text vectorizers clean-up / TfidfVectorizer deprecation #14951

rth opened this issue Sep 11, 2019 · 1 comment

Comments

@rth
Copy link
Member

rth commented Sep 11, 2019

As mentioned by @jnothman in #14748 (comment),

I'm also happy to slowly deprecate TfidfVectorizer because it provides no
benefit over a pipeline and creates much confusion. I've seen users do
weird stuff like compare TfidfVectorizer to CountVectorizer but use
different params.

+1 on my side to deprecate it.

I would also deprecate norm parameter (for removal) in HashingVectorizer as its combination with TfidfTransformer using default parameters currently produces nonsense results due to norm='l2'. That would also resolve #6972

Generally I have also seen experienced Python users do very weird things with text vectorizers (particularly as soon as customization is involved). Taking some time to think what how we would like its API to look ideally for 1.0 and if we can get partially there without major disruption would, I think, be useful.

In particular, I find the way we currently suggest customizing the behavior by subclassing CountVectorizer to update the analyzer is really awkward. I am pondering on some way to separate the pipeline that processes a single document and returns n-grams, from the CountVectorizer class. Basically making passing analyzers easier while re-using part of the existing processing.

Anyway, we can start with easy deprecations.

@rth
Copy link
Member Author

rth commented Sep 12, 2019

deprecate TfidfVectorizer because it provides no
benefit over a pipeline and creates much confusion.

One limiting factor is that pipelines don't support get_feature_names() for now, so to get the previous behavior one would need to do something like (pipe.named_steps['vect'].get_feature_names()) but hopefully that would be addressed in the near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants