
[MRG] Changed VarianceThreshold behaviour when threshold is zero. See #13691 (#13704)

Merged: 16 commits into scikit-learn:master on May 28, 2019

Conversation

@rlms (Contributor) commented Apr 23, 2019

Reference Issues/PRs

Fixes #13691

What does this implement/fix? Explain your changes.

Currently, VarianceThreshold can fail to remove features that contain only a single repeated value when the threshold is 0, due to floating-point errors introduced by the division in np.var. As proposed in #13691, this changes the implementation of VarianceThreshold.fit to use np.ptp (the difference between max and min) rather than np.var when the threshold is 0, and the analogous utils.sparsefuncs.min_max_axis for sparse inputs. This avoids division and makes VarianceThreshold behave as expected (verified against the code in the issue).
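
As a minimal sketch of the failure mode (hypothetical data; the exact values that trigger the rounding error can vary by platform):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# A constant feature built from a value (0.1) that is not exactly
# representable in binary floating point, next to a feature that
# genuinely varies.
X = np.hstack([np.full((10, 1), 0.1), np.arange(10.0).reshape(-1, 1)])

print(np.var(X[:, 0]))  # may print a tiny positive number, not exactly 0.0
print(np.ptp(X[:, 0]))  # prints exactly 0.0, since max == min

# With threshold=0, a spuriously nonzero variance can let the constant
# column survive selection.
print(VarianceThreshold(threshold=0).fit_transform(X).shape)
```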

Any other comments?

@rlms (Contributor, Author) commented Apr 23, 2019

Tests passed on my machine. The root cause of the CI error is TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe' at sklearn\utils\sparsefuncs.py:344 when executing test_estimators[VarianceThreshold-check_estimator_sparse_data]. Not sure why that's happening.
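
For reference, the sparse code path mentioned in the description goes through utils.sparsefuncs.min_max_axis. A minimal sketch of how the peak-to-peak range can be obtained for sparse input (hypothetical data):

```python
import numpy as np
from scipy import sparse
from sklearn.utils.sparsefuncs import min_max_axis

# Hypothetical sparse input: one constant column, one varying column.
X = sparse.csr_matrix(np.array([[0.1, 1.0],
                                [0.1, 2.0],
                                [0.1, 3.0]]))

# min_max_axis returns per-feature minima and maxima; their difference
# is the peak-to-peak range, computed without any division.
mins, maxs = min_max_axis(X, axis=0)
ptp = maxs - mins  # exactly [0., 2.]
```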

@jnothman (Member) left a comment

This isn't quite right because it changes the variances. Maybe we should use np.minimum(np.var(X, 1), np.ptp(X, 1)), which more explicitly handles just the rounding issues.

This needs a test, perhaps based on the code in the original issue.

@rlms (Contributor, Author) commented Apr 24, 2019

@jnothman Sure. Do you mean setting the variances to np.minimum(np.var(X, 1), np.ptp(X, 1)), or keeping self.variances_ as the variances but using the minimum for the comparison against threshold? Also, I think the axis should be 0 rather than 1. I'll add a test. There's also the issue of the failing check, which I'm struggling to debug because I can't reproduce it locally.

@jnothman (Member) commented Apr 24, 2019

I mean setting variances_ that way. Unless I'm much mistaken, ptp is a lower bound on the variance and so will help resolve numerical precision errors.
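
A minimal sketch of the suggested computation (a hypothetical helper, not the merged implementation; note axis=0 so that one value is produced per feature, as discussed above):

```python
import numpy as np

def clamped_variances(X):
    # Hypothetical helper sketching the agreed approach: np.ptp is exactly
    # 0 for a constant column because it is max minus min with no division,
    # so the elementwise minimum clamps a spuriously positive variance of
    # a constant column back to exactly 0.
    return np.minimum(np.var(X, axis=0), np.ptp(X, axis=0))
```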

@rlms (Contributor, Author) commented Apr 24, 2019

Cool. And should the test be a new function in estimator_checks.py that's yielded by _yield_all_checks if the estimator is a VarianceThreshold?

@jnothman (Member) commented Apr 24, 2019

@rlms (Contributor, Author) commented Apr 24, 2019

So a new file for testing this functionality? Most of the existing tests for estimators seem to live in test_common.py as calls to functions in estimator_checks.

@jnothman (Member) commented Apr 24, 2019

@jnothman (Member) left a comment

Thanks!

Review comment on sklearn/feature_selection/tests/test_variance_threshold.py (outdated, resolved)
Review comment on sklearn/feature_selection/variance_threshold.py (outdated, resolved)
@NicolasHug (Member) left a comment

Nitpick, but LGTM if all goes green.

Review comment on sklearn/feature_selection/tests/test_variance_threshold.py (outdated, resolved)

@rlms (Contributor, Author) commented May 24, 2019

@jnothman Checks pass; can this now be merged?

@jnothman (Member) left a comment

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

Review comment on doc/whats_new/v0.22.rst (outdated, resolved)
Review comment on doc/whats_new/v0.22.rst (outdated, resolved)
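
For illustration, an entry in the requested style might look like the following (a sketch only; the wording of the actual merged entry may differ):

```rst
- |Fix| Fixed :class:`feature_selection.VarianceThreshold` so that features
  containing only one repeated value are removed when ``threshold=0``.
  :pr:`13704` by :user:`rlms`.
```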
@jnothman (Member) left a comment

Okay, thanks

@jnothman merged commit be03467 into scikit-learn:master on May 28, 2019
10 of 13 checks passed
@jnothman (Member) commented May 28, 2019

Thanks @rlms!

@rlms deleted the variance-threshold-zero-variance branch on May 28, 2019