[MRG] Changed VarianceThreshold behaviour when threshold is zero. See #13691 #13704
Conversation
Tests passed on my machine. The root cause of the error is:
This isn't quite right because it changes the variances. Maybe we should use np.minimum(np.var(X, 1), np.ptp(X, 1)), which more explicitly just handles rounding issues.
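A small sketch of that suggestion (the axis-0 orientation and the example values are mine, chosen so the repeated-value column triggers the rounding issue):

```python
import numpy as np

# The first feature holds one repeated value: its true variance is exactly 0,
# but np.var accumulates floating point error through the mean and division.
X = np.array([[0.1, 1.0],
              [0.1, 2.0],
              [0.1, 3.0]])

variances = np.var(X, axis=0)  # first entry is a tiny positive number, not 0
ptp = np.ptp(X, axis=0)        # first entry is exactly 0.0

# Capping the variances with the peak-to-peak range restores the exact zero
# without perturbing the genuinely varying feature.
capped = np.minimum(variances, ptp)
print(variances[0] > 0)  # True: the rounding artifact
print(capped[0] == 0)    # True: the cap removes it
```

A threshold-of-zero comparison against `capped` then behaves as intended, since the varying column keeps its original variance (ptp is larger there).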
This needs a test, perhaps based on the original issue code
@jnothman
I mean setting variances_ that way. Unless I'm much mistaken, ptp is an upper bound on the variance and so will help resolve numerical precision errors.
Cool. And the test should be a new function in estimator_checks.py that's yielded by _yield_all_checks if the estimator is a VarianceThreshold?
No, it should be in sklearn/feature_selection/tests.
So a new file for testing this functionality? Most of the existing tests for estimators seem to be in test_common.py, as calls to functions in estimator_checks.
Co-Authored-By: rlms <macsweenroddy@gmail.com>
…/rlms/scikit-learn into variance-threshold-zero-variance
@jnothman Checks pass, can this now be merged?
Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
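A sketch of what such an entry might look like (the wording and section placement are assumptions modelled on other entries in that file; the :pr: and :user: values follow this PR):

```rst
:mod:`sklearn.feature_selection`
................................

- |Fix| :class:`feature_selection.VarianceThreshold` with ``threshold=0``
  now correctly removes features with a single repeated value, using the
  peak-to-peak range rather than the variance to avoid numerical precision
  errors. :pr:`13704` by :user:`rlms`.
```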
Thanks @rlms!
Reference Issues/PRs
Fixes #13691
What does this implement/fix? Explain your changes.
Currently, VarianceThreshold can fail to remove features that contain only a single repeated value when the threshold is 0, due to floating point errors introduced by the division in np.var. As proposed in #13691, this changes the implementation of VarianceThreshold.fit to use np.ptp (the difference between the maximum and the minimum) rather than np.var when the threshold is 0 (and the analogous utils.sparsefuncs.min_max_axis for sparse inputs). This avoids the division and causes VarianceThreshold to behave as expected (verified by testing on the code in the issue).

Any other comments?
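The fixed behaviour can be exercised end to end with a sketch like the following (the repeated-0.1 column is my stand-in for the reproduction in #13691, and it assumes a scikit-learn release containing this fix):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import VarianceThreshold

# One feature holding a single repeated value next to a varying one.
# The repeated 0.1 column has a true variance of exactly 0, but np.var
# reports a tiny positive number because of rounding in the mean/division.
X = np.array([[0.1, 1.0],
              [0.1, 2.0],
              [0.1, 3.0]])

selector = VarianceThreshold(threshold=0)

# Dense input: the constant column is dropped, leaving one feature.
print(selector.fit_transform(X).shape)                     # (3, 1)

# Sparse input exercises the min/max path instead of np.ptp.
print(selector.fit_transform(sparse.csr_matrix(X)).shape)  # (3, 1)
```

Before the fix, the dense case could return both columns, because the spurious nonzero variance of the constant column exceeded the threshold of 0.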