
[MRG] Changed VarianceThreshold behaviour when threshold is zero. See #13691 #13704

Merged: 16 commits into scikit-learn:master on May 28, 2019

Conversation

rlms (Contributor) commented Apr 23, 2019

Reference Issues/PRs

Fixes #13691

What does this implement/fix? Explain your changes.

Currently, VarianceThreshold can fail to remove features that contain only a single repeated value when the threshold is 0, due to floating-point errors introduced by the division in np.var. As proposed in #13691, this changes the implementation of VarianceThreshold.fit to use np.ptp (the difference between the maximum and minimum) rather than np.var when the threshold is 0 (and the analogous utils.sparsefuncs.min_max_axis for sparse inputs). This avoids the division and makes VarianceThreshold behave as expected (verified against the code in the issue).
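The floating-point issue can be reproduced with NumPy alone. A minimal sketch (the data is illustrative, not the exact array from #13691), showing how np.var can report a nonzero variance for a constant feature while np.ptp stays exactly zero:

```python
import numpy as np

# A feature holding one repeated value has zero variance by definition,
# but np.var can return a tiny positive number because the mean it
# subtracts is itself a rounded floating-point quantity.
X = np.full((10, 1), 0.12345678901234567)

var = np.var(X, axis=0)  # may be a tiny positive value rather than 0.0
ptp = np.ptp(X, axis=0)  # max - min: exactly 0.0, no division involved

# Clamping the variance with the peak-to-peak range removes the noise,
# so a threshold-0 comparison drops the feature as expected.
variances = np.minimum(var, ptp)
print(variances)  # [0.]
```

Since max and min of a constant column are identical, ptp is exactly zero, and the minimum with a non-negative variance is exactly zero as well.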

Any other comments?

rlms (Contributor, author) commented Apr 23, 2019

Tests passed on my machine. The root cause of the CI error is TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe' at sklearn\utils\sparsefuncs.py:344 when executing test_estimators[VarianceThreshold-check_estimator_sparse_data]. I'm not sure why that's happening.

jnothman (Member) left a comment:

This isn't quite right because it changes the variances. Maybe we should use np.minimum(np.var(X, 1), np.ptp(X, 1)) which more explicitly just handles rounding issues.

This needs a test, perhaps based on the original issue code
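A test along those lines might look like the following sketch (assuming a scikit-learn version that includes this fix; the data is illustrative, not the exact array from the issue):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# One constant column (which np.var may score as slightly > 0 due to
# rounding) next to a genuinely varying column.
X = np.hstack([
    np.full((10, 1), 0.12345678901234567),      # constant feature
    np.arange(10, dtype=float).reshape(-1, 1),  # varying feature
])

selector = VarianceThreshold(threshold=0)
X_t = selector.fit_transform(X)

# With the fix, only the varying column survives.
print(X_t.shape)  # (10, 1)
```

With threshold=0 the selector keeps only features whose variance is strictly greater than zero, so the constant column must be dropped regardless of rounding noise.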

rlms (Contributor, author) commented Apr 24, 2019

@jnothman
Sure. Do you mean setting the variances to np.minimum(np.var(X, 1), np.ptp(X, 1)) or keeping self.variances_ as the variances but using the minimum for comparison against threshold? Also I think the axis should be 0 rather than 1. I'll add a test. There's also the issue of the check failing, which I'm struggling to debug because I can't reproduce it locally.

jnothman (Member) commented Apr 24, 2019

I mean setting variances_ that way. Unless I'm much mistaken, ptp is a lower bound on the variance and so will help resolve numerical precision errors.

rlms (Contributor, author) commented Apr 24, 2019

Cool. And the test should be a new function in estimator_checks.py that's yielded by _yield_all_checks if the estimator is a VarianceThreshold?

jnothman (Member) commented Apr 24, 2019 [comment minimized]

rlms (Contributor, author) commented Apr 24, 2019

So a new file for testing this functionality? Most of the existing tests for estimators seem to be in test_common.py, as calls to functions in estimator_checks.

jnothman (Member) left a comment:

Thanks!

Review thread on sklearn/feature_selection/variance_threshold.py (outdated, resolved)
rlmacsween added 2 commits Apr 24, 2019
NicolasHug and others added 4 commits Apr 24, 2019
Co-Authored-By: rlms <macsweenroddy@gmail.com>
Co-Authored-By: rlms <macsweenroddy@gmail.com>
…/rlms/scikit-learn into variance-threshold-zero-variance
NicolasHug (Contributor) left a comment:

Nitpick but LGTM if all goes green

rlms (Contributor, author) commented May 24, 2019

@jnothman Checks pass, can this now be merged?

jnothman (Member) left a comment:

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:.
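For reference, a changelog entry of that shape might look roughly like the following sketch (the badge, wording, and section placement should follow the existing entries in doc/whats_new/v0.22.rst; the attribution here assumes only the username visible in this thread):

```rst
- |Fix| Fixed a bug in :class:`feature_selection.VarianceThreshold` with
  `threshold=0`, which could fail to remove constant features due to
  numerical instability when computing variances with `np.var`.
  :pr:`13704` by :user:`rlms`.
```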

rlmacsween and others added 3 commits May 26, 2019
Review threads on doc/whats_new/v0.22.rst (outdated, resolved)
jnothman (Member) left a comment:

Okay, thanks

@jnothman jnothman merged commit be03467 into scikit-learn:master May 28, 2019
10 of 13 checks passed
- LGTM analysis: Python (running analyses for revisions)
- ci/circleci: doc (CircleCI is running your tests)
- ci/circleci: doc-min-dependencies (CircleCI is running your tests)
- LGTM analysis: C/C++ (no code changes detected)
- LGTM analysis: JavaScript (no code changes detected)
- ci/circleci: lint (passed)
- scikit-learn.scikit-learn Build #20190528.23 (succeeded)
- scikit-learn.scikit-learn Linux py35_conda_openblas (succeeded)
- scikit-learn.scikit-learn Linux py35_np_atlas (succeeded)
- scikit-learn.scikit-learn Linux pylatest_conda (succeeded)
- scikit-learn.scikit-learn Windows py35_32 (succeeded)
- scikit-learn.scikit-learn Windows py37_64 (succeeded)
- scikit-learn.scikit-learn macOS pylatest_conda (succeeded)
jnothman (Member) commented May 28, 2019

Thanks @rlms!

rlms deleted the rlms:variance-threshold-zero-variance branch May 28, 2019