Add optional normalization to fetch_20newsgroups_vectorized #14740

stephantul · 2019-08-23T11:12:55Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

fetch_20newsgroups_vectorized normalizes the data returned by CountVectorizer without indicating this to the user. I added an argument to the function which allows users to switch off normalization (normalization is still performed by default), as well as a comment in the docstring indicating that normalization is performed.
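To make the effect concrete, here is a minimal sketch of what row-wise normalization does to raw term counts. The toy corpus and variable names are assumptions for illustration only. The default L2 norm of sklearn.preprocessing.normalize is used, and nothing here is taken from the loader's actual code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Toy corpus standing in for newsgroup posts (an assumption for illustration).
docs = ["spam spam spam eggs eggs eggs eggs", "eggs", "spam eggs"]

counts = CountVectorizer().fit_transform(docs)  # sparse matrix of raw term counts
scaled = normalize(counts)                      # default norm="l2", applied per row

# The learned vocabulary is sorted, so column 0 is "eggs" and column 1 is "spam".
raw_first = counts[0].toarray().ravel()     # counts for the first document
scaled_first = scaled[0].toarray().ravel()  # same row rescaled to unit length

# Euclidean norm of every normalized row.
row_norms = np.sqrt(np.asarray(scaled.multiply(scaled).sum(axis=1)).ravel())
```

After normalization the rows no longer hold integer counts, which is why a user who expects CountVectorizer output may be surprised if the step is not documented.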

Any other comments?

Because the function's argument is named normalize (as suggested by @rth), I imported sklearn.preprocessing.normalize as normalize_func to keep the two clearly distinct. I'm not sure whether this is the right approach.
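The naming clash described here can be sketched as follows. The helper vectorize_corpus is hypothetical and is not the scikit-learn implementation. It only shows why an import alias is useful when a boolean parameter shares its name with the preprocessing function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize as normalize_func


def vectorize_corpus(docs, normalize=True):
    """Hypothetical helper: count terms and optionally L2-normalize each row."""
    X = CountVectorizer().fit_transform(docs).astype(float)
    if normalize:
        # Inside this function the name `normalize` is the boolean flag,
        # so the preprocessing function is reached through its alias.
        X = normalize_func(X)
    return X
```

Without the alias, the parameter would shadow the imported function inside the body, and calling normalize(X) would try to call a bool.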

sklearn/datasets/twenty_newsgroups.py

glemaitre

In addition, you need to add an entry to doc/whats_new/v0.22.rst, in the appropriate section, mentioning that the parameter was added.

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

stephantul · 2019-08-23T11:45:37Z

Thanks for your comments, I added the changes.

sklearn/datasets/twenty_newsgroups.py

glemaitre · 2019-08-23T12:17:40Z

One last thing: we need a test in datasets/tests/test_20news.py that checks that normalize behaves properly. We could check that with normalize=True the samples have unit norm (checking a couple of samples is enough), and that with normalize=False they do not.

rth

It would be good to add a test to sklearn/datasets/tests/test_20news.py checking that loading with normalize=True is equivalent to applying normalize after loading with normalize=False (if the run time is not too long). Otherwise LGTM.

sklearn/datasets/twenty_newsgroups.py

stephantul · 2019-08-23T12:41:56Z

Thanks for your replies, I made a test using the first 100 items that checks for equivalence and the norm.
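A rough sketch of the two checks discussed above (equivalence and unit norm) is shown below. It is not the test that was committed: load_vectorized is a hypothetical stand-in that uses a toy corpus instead of downloading the dataset, and numpy.testing.assert_allclose on dense arrays is used here instead of the assert_allclose_dense_sparse helper mentioned in the commit list.

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize as normalize_func

# Toy corpus (an assumption); the real test works on the first 100 dataset items.
DOCS = ["spam spam eggs", "eggs bacon", "bacon bacon bacon spam", "eggs"]


def load_vectorized(normalize=True):
    """Hypothetical stand-in for the dataset loader, with the same flag."""
    X = CountVectorizer().fit_transform(DOCS).astype(np.float64)
    return normalize_func(X) if normalize else X


def test_normalization(n_samples=3):
    X_norm = load_vectorized(normalize=True)[:n_samples]
    X_raw = load_vectorized(normalize=False)[:n_samples]

    # Loading with normalize=True should match normalizing the raw counts afterwards.
    assert_allclose(X_norm.toarray(), normalize_func(X_raw).toarray())

    # Each normalized sample should have unit Euclidean norm.
    assert_allclose(np.linalg.norm(X_norm.toarray(), axis=1), 1.0)
```

Limiting the check to a handful of rows keeps the run time short, which was the concern raised in the review.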

sklearn/datasets/tests/test_20news.py

stephantul · 2019-10-24T07:44:16Z

I think I addressed all points. Sorry for letting this hang for a bit, I assumed you were busy.

sklearn/datasets/twenty_newsgroups.py

rth

LGTM, thanks @stephantul!

I can confirm that test_20news_normalization runs fine locally, not sure what is going on with codecov CI.

jnothman · 2019-10-27T21:36:03Z

Merging in the latest master often fixes codecov

stephantul · 2019-10-28T07:31:47Z

Hey, I merged master into my branch, but codecov still fails 😞. Let me know if I need to do other things to make this work.

jnothman · 2019-10-28T07:53:06Z

Thanks

stephantul added 2 commits August 23, 2019 13:03

Add optional normalization argument to fetch_20newsgroups_vectorized

963ccd4

Add extra comments

b1f449e

glemaitre reviewed Aug 23, 2019

sklearn/datasets/twenty_newsgroups.py Outdated

glemaitre reviewed Aug 23, 2019

sklearn/datasets/twenty_newsgroups.py Outdated

glemaitre requested changes Aug 23, 2019

Stephan Tulkens and others added 4 commits August 23, 2019 13:41

Update sklearn/datasets/twenty_newsgroups.py

b4aa3d5

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Update v0.22.rst

f8b54b9

Update twenty_newsgroups.py

0f3b062

Merge branch 'master' of https://github.com/stephantul/scikit-learn

ba2692b

glemaitre reviewed Aug 23, 2019

sklearn/datasets/twenty_newsgroups.py Outdated

Change boolean to bool

cae23b8

rth reviewed Aug 23, 2019

sklearn/datasets/twenty_newsgroups.py Outdated

stephantul added 2 commits August 23, 2019 14:40

fix renaming of preprocessing

eecddfa

add test for normalization

31960f8

glemaitre reviewed Aug 23, 2019

sklearn/datasets/tests/test_20news.py Outdated

update test to take advantage of assert_allclose_dense_sparse

3444406

stephantul requested a review from glemaitre October 24, 2019 07:43

jnothman reviewed Oct 24, 2019

sklearn/datasets/twenty_newsgroups.py Outdated

Make docstring clearer

6a927b4

thomasjpfan added this to the 0.22 milestone Oct 26, 2019

rth approved these changes Oct 27, 2019

stephantul added 2 commits October 28, 2019 08:20

Merge branch 'master' of git://github.com/scikit-learn/scikit-learn i…

3df3639

…nto scikit-learn-master

Merge branch 'scikit-learn-master'

65fde16

jnothman approved these changes Oct 28, 2019

jnothman merged commit a3a1b8f into scikit-learn:master Oct 28, 2019
