Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] text vectorizers should raise warnings when user params will be unused #14602

Merged
merged 42 commits into from Sep 6, 2019

Conversation

@getgaurav2
Copy link
Contributor

getgaurav2 commented Aug 8, 2019

Reference Issues/PRs

#14580

What does this implement/fix? Explain your changes.

Add Warning and test cases for unused parameters for text vectorizers.

  1. Add all warnings in a new method - _warn_for_unused_params() and call this method in the fit() method for all vectorizers .
  2. Add Unit test case for all warning checks .

Any other comments?

@getgaurav2

This comment has been minimized.

Copy link
Contributor Author

getgaurav2 commented Aug 8, 2019

@jnothman , @rth - can you pls. give any feedback on the test case ?
I can add the other text vectorizers in the same function and iterate over the objects of each vectorizers and check different warnings msg for different parameter combinations
OR
I can create one unit test function per vectorizer per warning .
Screen Shot 2019-08-08 at 4 38 28 AM

sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
getgaurav2 and others added 3 commits Aug 8, 2019
Co-Authored-By: Roman Yurchak <rth.yurchak@gmail.com>
Gaurav Chawla
Gaurav Chawla
Copy link
Member

jnothman left a comment

Looking good

@getgaurav2

This comment has been minimized.

Copy link
Contributor Author

getgaurav2 commented Aug 9, 2019

Screen Shot 2019-08-08 at 7 54 15 PM

Copy link
Contributor Author

getgaurav2 left a comment

Adding only CountVectorizer for now . Will add others as I add warnings for them in the text.py file.

@getgaurav2

This comment has been minimized.

Copy link
Contributor Author

getgaurav2 commented Aug 9, 2019

@jnothman , @rth - can you please check why some of the checks are failing for this commit.
Here is the log :
scikit-learn.scikit-learn (Linux py35_conda_openblas)
PackagesNotFoundError: The following packages are not available from current channels:

  • numpy=1.11.0
  • scipy=0.17.0
  • matplotlib=1.5.1
  • pillow=4.0.0

Linux pylatest_conda_mkl_pandas
~/work/tmp_folder ~/work/1/s
build_tools/azure/test_pytest_soft_dependency.sh: line 21: coverage: command not found

@thomasjpfan

This comment has been minimized.

Copy link
Member

thomasjpfan commented Aug 10, 2019

It is a unrelated issue.

It should be fixed when #14619 get merged.

@rth
rth approved these changes Aug 10, 2019
Copy link
Member

rth left a comment

LGTM, apart for the two comments below.

sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
Copy link
Member

jnothman left a comment

Thanks

sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
2. Adding condition for self.analyzer != 'word' or callable(self.analyzer) in build_analyzer
@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Aug 18, 2019

Tests are failing . Let us know if you need help

Gaurav Chawla added 3 commits Aug 18, 2019
Gaurav Chawla
Gaurav Chawla
@getgaurav2

This comment has been minimized.

Copy link
Contributor Author

getgaurav2 commented Aug 18, 2019

Tests are failing . Let us know if you need help

Thanks. A different unit test test_callable_analyzer_change_behavior seems to be failing ...I'll check the reason and get back to you .

Gaurav Chawla added 2 commits Aug 24, 2019
@getgaurav2

This comment has been minimized.

Copy link
Contributor Author

getgaurav2 commented Aug 24, 2019

Tests are failing . Let us know if you need help

Thanks. A different unit test test_callable_analyzer_change_behavior seems to be failing ...I'll check the reason and get back to you .

@jnothman - Could you help me understand the reason for adding assert len(records) == 1 in test_callable_analyzer_change_behavior . This condition is failing for HashingVectorizer ( screen shot below). It passes for CountVectorizer and HashingVectorizer . I have commented this condition in the test case for now but will add it back once I understand it better.
Screen Shot 2019-08-24 at 10 56 27 AM

The difference b/w the fit() method for HashingVectorizer () vs. the others is calling build_analyzer() . https://github.com/scikit-learn/scikit-learn/pull/14602#issuecomment-521613247

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Aug 24, 2019

It's saying that HashingVectorizer is now issuing the ChangedBehaviorWarning twice. It should only issue it once. By putting build_analyzer in its fit, it might now be repeating that warning.

@getgaurav2

This comment has been minimized.

Copy link
Contributor Author

getgaurav2 commented Aug 25, 2019

It's saying that HashingVectorizer is now issuing the ChangedBehaviorWarning twice. It should only issue it once. By putting build_analyzer in its fit, it might now be repeating that warning.

Yes , build_analyzer () is now being called inside fit() as well as transform() methods which is likely causing the double warnings . Looking into it now .
@jnothman - Do you think it may be easier to go back to the original approach and keep the analyzer != ‘word’ warnings outside of build_ analyzer ?
#14602 (comment)

@getgaurav2

This comment has been minimized.

Copy link
Contributor Author

getgaurav2 commented Aug 28, 2019

@jnothman - just wanted to make sure that you are getting a notification for the comment above.

Copy link
Member

jnothman left a comment

Otherwise LGTM. (Though I've not checked that the set of parameter pairs is complete.)

Please add an Enhancement entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Sep 4, 2019

Assuming you also think this is ready for merge, please change WIP -> MRG

…unused_params

And remove blank lines
@getgaurav2 getgaurav2 changed the title [WIP] text vectorizers should raise warnings when user params will be unused [MRG] text vectorizers should raise warnings when user params will be unused Sep 4, 2019
@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Sep 5, 2019

Please add a note in what's new as requested above. Thanks!

@getgaurav2

This comment has been minimized.

Copy link
Contributor Author

getgaurav2 commented Sep 5, 2019

Please add a note in what's new as requested above. Thanks!

Yup ..Was just working on it. Added now .

Gaurav Chawla
@@ -146,6 +146,14 @@ Changelog
:mod:`sklearn.feature_extraction`
.................................

- |Enhancement| :func:`feature_extraction.text._warn_for_unused_params` will

This comment has been minimized.

Copy link
@jnothman

jnothman Sep 5, 2019

Member

Please describe changes only in terms of public API: "a warning is now raised if a parameter choice means that another parameter will be unused" or something more clear than that.

This comment has been minimized.

Copy link
@getgaurav2

getgaurav2 Sep 5, 2019

Author Contributor

Got it ..Can you please check if the latest commit makes more sense .

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
Gaurav Chawla
@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Sep 6, 2019

Makes more sense. We might yet find clearer words before release, but as far as I'm concerned we can merge. Thanks, @getgaurav2!

@jnothman jnothman merged commit 96ef6b8 into scikit-learn:master Sep 6, 2019
17 checks passed
17 checks passed
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 96.87%)
Details
codecov/project 96.92% (+0.04%) compared to ea169b5
Details
scikit-learn.scikit-learn Build #20190905.36 succeeded
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_conda_mkl_pandas) Linux pylatest_conda_mkl_pandas succeeded
Details
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.