Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional normalization to fetch_20newsgroups_vectorized #14740

Merged
merged 13 commits into from
Oct 28, 2019
Merged

Add optional normalization to fetch_20newsgroups_vectorized #14740

merged 13 commits into from
Oct 28, 2019

Conversation

stephantul
Copy link
Contributor

Reference Issues/PRs

Fixes #14738

What does this implement/fix? Explain your changes.

fetch_20newsgroups_vectorized normalizes the data returned by CountVectorizer without indicating this to the user. I added an argument to the function which allows users to switch off normalization (normalization is still performed by default), as well as a comment in the docstring indicating that normalization is performed.

Any other comments?

Because the argument to the function is named normalize (as suggested by @rth) , I imported the function sklearn.preprocessing.normalize as normalize_func to clearly distinguish them. I'm not sure if this is correct.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, you need to add an entry in doc/whats_new/v0.22.rst, in the according section. You'll need to mention that the parameter was added.

@stephantul
Copy link
Contributor Author

Thanks for your comments, I added the changes.

@glemaitre
Copy link
Member

Last thing. We need a test in datasets/tests/test_20news.py which should check that normalize is behaving properly. So we could check that if normalize=True, we get the unit norm (we only need to check it for a couple of samples), and otherwise this is not.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a test to sklearn/datasets/tests/test_20news.py than loading it with normalize=True is equivalent to applying normalize after loading with normalize=False (if the run time is not too long). Otherwise LGTM.

sklearn/datasets/twenty_newsgroups.py Outdated Show resolved Hide resolved
@stephantul
Copy link
Contributor Author

Thanks for your replies, I made a test using the first 100 items that checks for equivalence and the norm.

@stephantul
Copy link
Contributor Author

I think I addressed all points. Sorry for letting this hang for a bit, I assumed you were busy.

@thomasjpfan thomasjpfan added this to the 0.22 milestone Oct 26, 2019
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @stephantul !

I can confirm that test_20news_normalization runs fine locally, not sure what is going on with codecov CI.

@jnothman
Copy link
Member

Merging in the latest master often fixes codecov

@stephantul
Copy link
Contributor Author

Hey, I merged master into my branch, but codecov still fails 😞. Let me know if I need to do other things to make this work.

@jnothman jnothman merged commit a3a1b8f into scikit-learn:master Oct 28, 2019
@jnothman
Copy link
Member

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fetch_20_newsgroups_vectorized normalizes data, but does not mention this in docstring
5 participants