Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] text vectorizers should raise warnings when user params will be unused #14602

Merged
merged 42 commits into from Sep 6, 2019
Merged

Conversation

getgaurav2
Copy link
Contributor

@getgaurav2 getgaurav2 commented Aug 8, 2019

Reference Issues/PRs

#14580

What does this implement/fix? Explain your changes.

Add Warning and test cases for unused parameters for text vectorizers.

  1. Add all warnings in a new method - _warn_for_unused_params() and call this method in the fit() method for all vectorizers .
  2. Add Unit test case for all warning checks .

Any other comments?

@getgaurav2
Copy link
Contributor Author

@getgaurav2 getgaurav2 commented Aug 8, 2019

@jnothman , @rth - can you pls. give any feedback on the test case ?
I can add the other text vectorizers in the same function and iterate over the objects of each vectorizers and check different warnings msg for different parameter combinations
OR
I can create one unit test function per vectorizer per warning .
Screen Shot 2019-08-08 at 4 38 28 AM

sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
Copy link
Member

@jnothman jnothman left a comment

Looking good

@getgaurav2
Copy link
Contributor Author

@getgaurav2 getgaurav2 commented Aug 9, 2019

Screen Shot 2019-08-08 at 7 54 15 PM

Copy link
Contributor Author

@getgaurav2 getgaurav2 left a comment

Adding only CountVectorizer for now . Will add others as I add warnings for them in the text.py file.

@getgaurav2
Copy link
Contributor Author

@getgaurav2 getgaurav2 commented Aug 9, 2019

@jnothman , @rth - can you please check why some of the checks are failing for this commit.
Here is the log :
scikit-learn.scikit-learn (Linux py35_conda_openblas)
PackagesNotFoundError: The following packages are not available from current channels:

  • numpy=1.11.0
  • scipy=0.17.0
  • matplotlib=1.5.1
  • pillow=4.0.0

Linux pylatest_conda_mkl_pandas
~/work/tmp_folder ~/work/1/s
build_tools/azure/test_pytest_soft_dependency.sh: line 21: coverage: command not found

@thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Aug 10, 2019

It is a unrelated issue.

It should be fixed when #14619 get merged.

rth
rth approved these changes Aug 10, 2019
Copy link
Member

@rth rth left a comment

LGTM, apart for the two comments below.

sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
@jnothman
Copy link
Member

@jnothman jnothman commented Aug 15, 2019

Copy link
Member

@jnothman jnothman left a comment

Thanks

sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
2. Adding condition for self.analyzer != 'word' or callable(self.analyzer) in build_analyzer
@jnothman
Copy link
Member

@jnothman jnothman commented Aug 18, 2019

Tests are failing . Let us know if you need help

@getgaurav2
Copy link
Contributor Author

@getgaurav2 getgaurav2 commented Aug 18, 2019

Tests are failing . Let us know if you need help

Thanks. A different unit test test_callable_analyzer_change_behavior seems to be failing ...I'll check the reason and get back to you .

@getgaurav2
Copy link
Contributor Author

@getgaurav2 getgaurav2 commented Aug 24, 2019

Tests are failing . Let us know if you need help

Thanks. A different unit test test_callable_analyzer_change_behavior seems to be failing ...I'll check the reason and get back to you .

@jnothman - Could you help me understand the reason for adding assert len(records) == 1 in test_callable_analyzer_change_behavior . This condition is failing for HashingVectorizer ( screen shot below). It passes for CountVectorizer and HashingVectorizer . I have commented this condition in the test case for now but will add it back once I understand it better.
Screen Shot 2019-08-24 at 10 56 27 AM

The difference b/w the fit() method for HashingVectorizer () vs. the others is calling build_analyzer() . https://github.com/scikit-learn/scikit-learn/pull/14602#issuecomment-521613247

@jnothman
Copy link
Member

@jnothman jnothman commented Aug 24, 2019

It's saying that HashingVectorizer is now issuing the ChangedBehaviorWarning twice. It should only issue it once. By putting build_analyzer in its fit, it might now be repeating that warning.

@getgaurav2
Copy link
Contributor Author

@getgaurav2 getgaurav2 commented Aug 25, 2019

It's saying that HashingVectorizer is now issuing the ChangedBehaviorWarning twice. It should only issue it once. By putting build_analyzer in its fit, it might now be repeating that warning.

Yes , build_analyzer () is now being called inside fit() as well as transform() methods which is likely causing the double warnings . Looking into it now .
@jnothman - Do you think it may be easier to go back to the original approach and keep the analyzer != ‘word’ warnings outside of build_ analyzer ?
#14602 (comment)

@getgaurav2
Copy link
Contributor Author

@getgaurav2 getgaurav2 commented Aug 28, 2019

@jnothman - just wanted to make sure that you are getting a notification for the comment above.

Copy link
Member

@jnothman jnothman left a comment

Otherwise LGTM. (Though I've not checked that the set of parameter pairs is complete.)

Please add an Enhancement entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

sklearn/feature_extraction/tests/test_text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
sklearn/feature_extraction/text.py Outdated Show resolved Hide resolved
@jnothman
Copy link
Member

@jnothman jnothman commented Sep 4, 2019

Assuming you also think this is ready for merge, please change WIP -> MRG

@getgaurav2 getgaurav2 changed the title [WIP] text vectorizers should raise warnings when user params will be unused [MRG] text vectorizers should raise warnings when user params will be unused Sep 4, 2019
@jnothman
Copy link
Member

@jnothman jnothman commented Sep 5, 2019

Please add a note in what's new as requested above. Thanks!

@getgaurav2
Copy link
Contributor Author

@getgaurav2 getgaurav2 commented Sep 5, 2019

Please add a note in what's new as requested above. Thanks!

Yup ..Was just working on it. Added now .

@@ -146,6 +146,14 @@ Changelog
:mod:`sklearn.feature_extraction`
.................................

- |Enhancement| :func:`feature_extraction.text._warn_for_unused_params` will
Copy link
Member

@jnothman jnothman Sep 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe changes only in terms of public API: "a warning is now raised if a parameter choice means that another parameter will be unused" or something more clear than that.

Copy link
Contributor Author

@getgaurav2 getgaurav2 Sep 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it ..Can you please check if the latest commit makes more sense .

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
@jnothman
Copy link
Member

@jnothman jnothman commented Sep 6, 2019

Makes more sense. We might yet find clearer words before release, but as far as I'm concerned we can merge. Thanks, @getgaurav2!

@jnothman jnothman merged commit 96ef6b8 into scikit-learn:master Sep 6, 2019
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants