Add optional normalization to fetch_20newsgroups_vectorized #14740
In addition, you need to add an entry in doc/whats_new/v0.22.rst, in the appropriate section. You'll need to mention that the parameter was added.
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Thanks for your comments, I added the changes.
Last thing. We need a test in
It would be good to add a test to sklearn/datasets/tests/test_20news.py that loading it with normalize=True is equivalent to applying normalize after loading with normalize=False (if the run time is not too long). Otherwise LGTM.
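The equivalence check requested above could be sketched roughly as follows. This is only an illustration on synthetic data, not the test that was merged: the `load` helper is a hypothetical stand-in for the dataset loader so that nothing has to be downloaded, and the matrix is random counts rather than the 20 newsgroups data. The only library call relied on is `sklearn.preprocessing.normalize` with its defaults (L2 norm, per row).

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

# Synthetic stand-in for raw term counts (assumption: NOT the real dataset).
rng = np.random.RandomState(0)
dense = rng.poisson(1.0, size=(100, 20)).astype(np.float64)
dense[:, 0] += 1.0  # make sure no row is all zeros, so every row norm is defined
X_raw = sparse.csr_matrix(dense)


def load(normalize_rows):
    """Hypothetical loader: returns raw counts, optionally L2-normalized per row."""
    return normalize(X_raw) if normalize_rows else X_raw.copy()


# Loading with normalization switched on ...
X_norm = load(True)
# ... should match normalizing by hand after loading without it.
X_manual = normalize(load(False))
np.testing.assert_allclose(X_norm.toarray(), X_manual.toarray())

# Each normalized row should also have unit L2 norm.
row_norms = np.linalg.norm(X_norm.toarray(), axis=1)
np.testing.assert_allclose(row_norms, 1.0)
```

Restricting the comparison to a small slice of rows, as was later done with the first 100 items, keeps the run time short.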
Thanks for your replies, I made a test using the first 100 items that checks for equivalence and the norm.
I think I addressed all points. Sorry for letting this hang for a bit, I assumed you were busy.
LGTM, thanks @stephantul!
I can confirm that test_20news_normalization runs fine locally, not sure what is going on with codecov CI.
Merging in the latest master often fixes codecov
Hey, I merged master into my branch, but codecov still fails 😞. Let me know if I need to do other things to make this work.
Thanks
Reference Issues/PRs
Fixes #14738
What does this implement/fix? Explain your changes.
fetch_20newsgroups_vectorized normalizes the data returned by CountVectorizer without indicating this to the user. I added an argument to the function that allows users to switch off normalization (normalization is still performed by default), as well as a comment in the docstring indicating that normalization is performed.
Any other comments?
Because the argument to the function is named normalize (as suggested by @rth), I imported the function sklearn.preprocessing.normalize as normalize_func to clearly distinguish them. I'm not sure if this is correct.
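To illustrate the naming question, here is a minimal pure-Python sketch of why an alias is needed. Everything in it is hypothetical: `fetch_vectorized` is not the real loader, and the local `normalize` function is only a stand-in for `sklearn.preprocessing.normalize`. In the actual change the alias would come from an import of the form `from sklearn.preprocessing import normalize as normalize_func`.

```python
import math


def normalize(rows):
    """Hypothetical stand-in: scale each row to unit L2 norm."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        # Leave all-zero rows unchanged to avoid dividing by zero.
        out.append([v / norm for v in row] if norm else list(row))
    return out


# Second name for the function, so a parameter called `normalize` cannot hide it.
normalize_func = normalize


def fetch_vectorized(rows, normalize=True):
    """Hypothetical loader with a boolean `normalize` parameter."""
    # Inside this body, `normalize` refers to the boolean argument,
    # so the function has to be reached through its alias.
    if normalize:
        return normalize_func(rows)
    return [list(row) for row in rows]


normalized = fetch_vectorized([[3.0, 4.0]])
raw = fetch_vectorized([[3.0, 4.0]], normalize=False)
```

Without the alias, calling `normalize(rows)` inside the function would try to call the boolean argument and raise a `TypeError`.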