Add optional normalization to fetch_20newsgroups_vectorized #14740

Merged
merged 13 commits into from Oct 28, 2019

Contributor

stephantul commented Aug 23, 2019

Reference Issues/PRs

Fixes #14738

What does this implement/fix? Explain your changes.

fetch_20newsgroups_vectorized normalizes the data returned by CountVectorizer without indicating this to the user. I added an argument to the function which allows users to switch off normalization (normalization is still performed by default), as well as a comment in the docstring indicating that normalization is performed.

Any other comments?

Because the argument to the function is named normalize (as suggested by @rth), I imported the function sklearn.preprocessing.normalize as normalize_func to clearly distinguish the two. I'm not sure if this is correct.
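
To illustrate the naming clash described above, here is a minimal sketch. It does not reproduce the code in this PR: `load_counts` and its toy matrix are hypothetical stand-ins for a loader that returns term counts. Only `sklearn.preprocessing.normalize`, the `normalize` argument name, and the `normalize_func` alias come from the discussion.

```python
import numpy as np
from scipy import sparse
# Alias the function so it is not shadowed by the `normalize` parameter below.
from sklearn.preprocessing import normalize as normalize_func


def load_counts(normalize=True):
    """Hypothetical loader: returns a toy term-count matrix (3 docs x 4 terms)."""
    X = sparse.csr_matrix(
        np.array([[3, 0, 4, 0], [0, 2, 0, 0], [1, 1, 1, 1]], dtype=np.float64)
    )
    if normalize:
        # Inside this function `normalize` is the boolean argument, so the
        # scikit-learn function has to be reached through its alias.
        X = normalize_func(X, copy=False)  # L2 row normalization by default
    return X


X_norm = load_counts()                 # default: rows scaled to unit L2 norm
X_raw = load_counts(normalize=False)   # raw counts, no scaling
```

In this toy example the first row (3, 0, 4, 0) has L2 norm 5, so it becomes (0.6, 0, 0.8, 0) after normalization. With the argument added by this PR, the equivalent call on the real dataset would presumably be fetch_20newsgroups_vectorized(normalize=False).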

Contributor

glemaitre left a comment

In addition, you need to add an entry in doc/whats_new/v0.22.rst, in the appropriate section, mentioning that the parameter was added.

stephantul and others added 4 commits Aug 23, 2019
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Contributor Author

stephantul commented Aug 23, 2019

Thanks for your comments; I have made the requested changes.

Contributor

glemaitre commented Aug 23, 2019

One last thing: we need a test in datasets/tests/test_20news.py that checks that normalize behaves properly. With normalize=True, the samples should have unit norm (checking a couple of samples is enough); with normalize=False, they should not.
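
The unit-norm check requested here could look roughly like the sketch below. It is not the test that was merged: the toy matrix stands in for the real dataset (which would need a download), and `row_l2_norms` is a hypothetical helper.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize


def row_l2_norms(X):
    """Return the L2 norm of each row of a sparse matrix as a 1-D array."""
    return np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())


# Toy stand-in for raw term counts; the row norms are 5, 2 and 3.
X_raw = sparse.csr_matrix(
    np.array([[3.0, 0.0, 4.0], [0.0, 2.0, 0.0], [1.0, 2.0, 2.0]])
)
X_norm = normalize(X_raw)  # what a loader with normalize=True would return

# Normalized rows have unit norm; the raw counts do not.
assert np.allclose(row_l2_norms(X_norm), 1.0)
assert not np.allclose(row_l2_norms(X_raw), 1.0)
```

Restricting the check to a few rows, as suggested, keeps the run time low on the full dataset.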

Member

rth left a comment

It would be good to add a test to sklearn/datasets/tests/test_20news.py checking that loading the data with normalize=True is equivalent to applying normalize after loading with normalize=False (if the run time is not too long). Otherwise LGTM.
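
A sketch of the equivalence check suggested here, under stated assumptions: `load_toy` and its three-document corpus are hypothetical stand-ins for the real loader, and the flag is called `normalize_rows` only so that it does not shadow the imported function.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Made-up corpus used in place of the 20 newsgroups data.
docs = ["the cat sat", "the dog sat on the mat", "cats and dogs"]


def load_toy(normalize_rows=True):
    """Hypothetical loader: vectorize the toy corpus, optionally normalizing."""
    X = CountVectorizer().fit_transform(docs).astype(np.float64)
    return normalize(X) if normalize_rows else X


X_norm = load_toy(normalize_rows=True)
X_raw = load_toy(normalize_rows=False)

# Loading with normalization should match normalizing the raw counts afterwards.
assert np.allclose(X_norm.toarray(), normalize(X_raw).toarray())
```

On the real dataset the comparison could be limited to a slice of the rows if densifying the whole matrix is too slow.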

sklearn/datasets/twenty_newsgroups.py (review thread: outdated, resolved)

Contributor Author

stephantul commented Aug 23, 2019

Thanks for your replies. I added a test that uses the first 100 items and checks both the equivalence and the unit norm.

stephantul requested a review from glemaitre Oct 24, 2019

Contributor Author

stephantul commented Oct 24, 2019

I think I addressed all points. Sorry for letting this hang for a bit; I assumed you were busy.

thomasjpfan added this to the 0.22 milestone Oct 26, 2019
rth approved these changes Oct 27, 2019
Member

rth left a comment

LGTM, thanks @stephantul!

I can confirm that test_20news_normalization runs fine locally; I'm not sure what is going on with the codecov CI.

Member

jnothman commented Oct 27, 2019

Merging in the latest master often fixes codecov

Contributor Author

stephantul commented Oct 28, 2019

Hey, I merged master into my branch, but codecov still fails 😞. Let me know if I need to do other things to make this work.

jnothman merged commit a3a1b8f into scikit-learn:master Oct 28, 2019

16 of 19 checks passed

Failed:
codecov/patch: 44.44% of diff hit (target 97.2%)
scikit-learn.scikit-learn: Build #20191028.6 failed
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas): failed

Passed:
LGTM analysis: C/C++: No code changes detected
LGTM analysis: JavaScript: No code changes detected
LGTM analysis: Python: No new or fixed alerts
ci/circleci: deploy: Your tests passed on CircleCI!
ci/circleci: doc: Your tests passed on CircleCI!
ci/circleci: doc artifact: Link to 0/doc/_changed.html
ci/circleci: doc-min-dependencies: Your tests passed on CircleCI!
ci/circleci: lint: Your tests passed on CircleCI!
codecov/project: 97.2% (-0.01%) compared to e243f32
scikit-learn.scikit-learn (Linux py35_conda_openblas): succeeded
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas): succeeded
scikit-learn.scikit-learn (Linux pylatest_conda_mkl): succeeded
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit): succeeded
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit): succeeded
scikit-learn.scikit-learn (Windows py37_conda_mkl): succeeded
scikit-learn.scikit-learn (macOS pylatest_conda_mkl): succeeded

Member

jnothman commented Oct 28, 2019

Thanks
