[MRG] Allow nan/inf in feature selection #11635

Merged

Changes from all commits (48 commits)

03eb769
removed check for nan/inf in SelectorMixin.transform, RFE.fit, and RFE…
Jul 19, 2018
03042a2
removed extra blank line
Jul 19, 2018
d6ffba7
added tests for changes to RFE/RFECV and SelectorMixin
Jul 19, 2018
da4c5f4
pep8 changes and pytest.mark.parametrize for nan_inf tests
Jan 8, 2019
d3601f4
added force_all_finite=False for univariate feature selection fit
Jan 8, 2019
27e1399
added tests for nan and inf being allowed in univariate select
Jan 8, 2019
2b06f2a
added test for nan and inf being allowed in SelectorMixin transform
Jan 8, 2019
fb7d7d6
updated checks to not run check_estimators_nan_inf on feature selectors
Jan 8, 2019
4d4662c
pep8 corrections
Jan 8, 2019
eadb804
created dummy_score function for use by all tests needing a fake scorer
Jan 8, 2019
da866eb
fixed travis test error with RFECV not having cv specified
Jan 8, 2019
9f2ff90
Merge branch 'master' into allow-nan-inf-in-feature-selection
Jan 8, 2019
f7c2fb8
allowed no NaN in pickle check for estimators in ALLOW_NAN
Jan 9, 2019
288cecc
added support for nan/inf in VarianceThreshold.fit, added tests for it
Jan 11, 2019
67fa9e7
added whats new documentation for feature selection allowing NaN/Inf
Jan 17, 2019
9be3870
added Notes for modified feature selectors about allowing NaN/Inf
Jan 17, 2019
c9e8fcc
fixed pep8 error
Jan 17, 2019
34e2710
shortened line length for whats new documentation
Jan 17, 2019
816b0a3
changed VarianceThreshold to only allow nan, not inf. Updated tests f…
Jan 23, 2019
c4619eb
fixed phrasing in whats_new for #11635
Jan 23, 2019
195ed13
added test for all-NaN feature in VarianceThreshold
Jan 24, 2019
87c771a
Merge branch 'master' into allow-nan-inf-in-feature-selection
adpeters Jan 30, 2019
a79155d
Merge branch 'master' into allow-nan-inf-in-feature-selection
adpeters Feb 5, 2019
620726a
Merge branch 'master' into allow-nan-inf-in-feature-selection
adpeters Feb 22, 2019
fe60c94
Merge branch 'master' into HEAD
jnothman Apr 24, 2019
cca7acc
Fix use of check_array
jnothman Apr 24, 2019
017a0db
Use check_X_y
jnothman Apr 24, 2019
0efc7f0
updated SelectFromModel allow_nan tag to inherit from its estimator
May 23, 2019
35878e8
Merge branch 'master' into allow-nan-inf-in-feature-selection
May 23, 2019
4119fcd
moved documentation for #11635 to v0.22
May 23, 2019
9f20985
fixed formatting
May 23, 2019
fc1a975
removed test_transform_accepts_nan_inf from test_from_model.py since …
May 23, 2019
f710408
updated test_transform_accepts_nan_inf to handle NaN tag
May 23, 2019
2b55934
fixed formatting
May 23, 2019
1adc65c
fixed formatting
May 23, 2019
055d4ea
set allow_nan tag to false for univariate feature selectors since the…
adpeters Oct 29, 2019
61d8277
merge with master
adpeters Oct 29, 2019
4499960
updated VarianceThreshold.fit() to handle nan when threshold is 0
adpeters Oct 30, 2019
92bc14f
removed test that no longer applies now that allow_nan tag is False f…
adpeters Oct 30, 2019
e1add7f
added pytest import since it was accidentally removed
adpeters Oct 30, 2019
4921954
removed duplicate numpy import
adpeters Oct 30, 2019
6783182
updated assert_equals to assert
adpeters Oct 30, 2019
8a9164f
reverted changes to univariate feature selectors to break those out t…
adpeters Nov 1, 2019
8167671
updated documentation to reflect removal of univariate changes
adpeters Nov 1, 2019
6176538
fixed rfe handling of tags so they are properly inherited from the es…
adpeters Nov 1, 2019
e43d8f8
added get_tags to mockclassifier for rfe tests
adpeters Nov 1, 2019
6950892
added test for nan/inf in fit of SelectFromModel
adpeters Nov 4, 2019
20f2426
added noqa tag to import to avoid flake errors
adpeters Nov 4, 2019
10 changes: 9 additions & 1 deletion doc/whats_new/v0.22.rst
@@ -69,7 +69,6 @@ Changelog
:pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
where 123456 is the *pull request* number, not the issue number.


:mod:`sklearn.base`
...................

@@ -311,6 +310,15 @@ Changelog
:mod:`sklearn.feature_selection`
................................

- |Enhancement| Updated the following :mod:`feature_selection` estimators to allow
NaN/Inf values in ``transform`` and ``fit``:
:class:`feature_selection.RFE`, :class:`feature_selection.RFECV`,
:class:`feature_selection.SelectFromModel`,
and :class:`feature_selection.VarianceThreshold`. Note that if the underlying
estimator of the feature selector does not allow NaN/Inf, it will still raise
an error, but the feature selectors themselves no longer enforce this
restriction unnecessarily. :issue:`11635` by :user:`Alec Peters <adpeters>`.

- |Fix| Fixed a bug where :class:`feature_selection.VarianceThreshold` with
`threshold=0` did not remove constant features due to numerical instability,
by using range rather than variance in this case.
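
As a concrete illustration of the enhancement above, here is a minimal sketch modeled on this PR's own tests. `NaNTagRandomForest` is a tag-overriding subclass used purely for illustration (it also appears below in `test_from_model.py`); it is not public scikit-learn API:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel


class NaNTagRandomForest(RandomForestClassifier):
    # advertise NaN support so the selector skips its finiteness check
    def _more_tags(self):
        return {'allow_nan': True}


X, y = load_iris(return_X_y=True)
model = SelectFromModel(estimator=NaNTagRandomForest(n_estimators=10,
                                                     random_state=0))
model.fit(X, y)  # fit on finite data

X_nan = X.copy()
X_nan[0, 0], X_nan[1, 0] = np.nan, np.inf
model.transform(X_nan)  # previously raised ValueError; now passes through
```

The selector skips its own validation because its `allow_nan` tag is inherited from the wrapped estimator (see the `_more_tags` changes below).
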
4 changes: 3 additions & 1 deletion sklearn/feature_selection/_base.py
@@ -71,7 +71,9 @@ def transform(self, X):
X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
"""
-        X = check_array(X, dtype=None, accept_sparse='csr')
+        tags = self._get_tags()
+        X = check_array(X, dtype=None, accept_sparse='csr',
+                        force_all_finite=not tags.get('allow_nan', True))
Member: this logic now won't work for univariate :(

Contributor Author: Ah yeah, I removed the test for the univariate case, but I wasn't sure what to do there. I guess the only way you could really use it would be to subclass and override the tags.

mask = self.get_support()
if not mask.any():
warn("No features were selected: either the data is"
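
For context, the PR toggles this validation through `check_array`'s `force_all_finite` parameter. A small sketch of its three modes, mirroring scikit-learn's documented behavior:

```python
import numpy as np
from sklearn.utils import check_array

X_nan = np.array([[1.0, np.nan], [0.0, 2.0]])
X_inf = np.array([[1.0, np.inf], [0.0, 2.0]])

check_array(X_nan, force_all_finite=False)        # NaN accepted
check_array(X_inf, force_all_finite=False)        # Inf accepted too
check_array(X_nan, force_all_finite='allow-nan')  # NaN accepted, Inf is not
# check_array(X_nan)                                # default True: raises ValueError
# check_array(X_inf, force_all_finite='allow-nan')  # raises ValueError
```

`SelectorMixin.transform` and RFE use `False` (NaN and Inf both pass), while `VarianceThreshold` below uses `'allow-nan'`, presumably because a variance computed from Inf is not meaningful (see commit 816b0a3).
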
8 changes: 8 additions & 0 deletions sklearn/feature_selection/_from_model.py
@@ -131,6 +131,10 @@ class SelectFromModel(MetaEstimatorMixin, SelectorMixin, BaseEstimator):
threshold_ : float
The threshold value used for feature selection.

Notes
-----
Allows NaN/Inf in the input if the underlying estimator does as well.

Examples
--------
>>> from sklearn.feature_selection import SelectFromModel
@@ -249,3 +253,7 @@ def partial_fit(self, X, y=None, **fit_params):
self.estimator_ = clone(self.estimator)
self.estimator_.partial_fit(X, y, **fit_params)
return self

def _more_tags(self):
estimator_tags = self.estimator._get_tags()
return {'allow_nan': estimator_tags.get('allow_nan', True)}
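
A short sketch of what the `_more_tags` delegation above buys, mirroring `test_allow_nan_tag_comes_from_estimator` below (`_get_tags` is private estimator API, used here only for illustration):

```python
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectFromModel


class NaNOK(BaseEstimator):
    def _more_tags(self):
        return {'allow_nan': True}


class NaNNotOK(BaseEstimator):
    def _more_tags(self):
        return {'allow_nan': False}


# the selector reports whatever its wrapped estimator reports
assert SelectFromModel(NaNOK())._get_tags()['allow_nan'] is True
assert SelectFromModel(NaNNotOK())._get_tags()['allow_nan'] is False
```
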
17 changes: 14 additions & 3 deletions sklearn/feature_selection/_rfe.py
@@ -103,6 +103,10 @@ class RFE(SelectorMixin, MetaEstimatorMixin, BaseEstimator):
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])

Notes
-----
Allows NaN/Inf in the input if the underlying estimator does as well.

See also
--------
RFECV : Recursive feature elimination with built-in cross-validated
@@ -150,7 +154,9 @@ def _fit(self, X, y, step_score=None):
# and is used when implementing RFECV
# self.scores_ will not be calculated when calling _fit through fit

X, y = check_X_y(X, y, "csc", ensure_min_features=2)
tags = self._get_tags()
X, y = check_X_y(X, y, "csc", ensure_min_features=2,
force_all_finite=not tags.get('allow_nan', True))
# Initialization
n_features = X.shape[1]
if self.n_features_to_select is None:
@@ -326,7 +332,9 @@ def predict_log_proba(self, X):
return self.estimator_.predict_log_proba(self.transform(X))

def _more_tags(self):
-        return {'poor_score': True}
+        estimator_tags = self.estimator._get_tags()
+        return {'poor_score': True,
+                'allow_nan': estimator_tags.get('allow_nan', True)}


class RFECV(RFE):
@@ -421,6 +429,8 @@ class RFECV(RFE):
``ceil((n_features - min_features_to_select) / step) + 1``,
where step is the number of features removed at each iteration.

Allows NaN/Inf in the input if the underlying estimator does as well.

Examples
--------
The following example shows how to retrieve the a-priori not known 5
@@ -479,7 +489,8 @@ def fit(self, X, y, groups=None):
train/test set. Only used in conjunction with a "Group" :term:`cv`
instance (e.g., :class:`~sklearn.model_selection.GroupKFold`).
"""
X, y = check_X_y(X, y, "csr", ensure_min_features=2)
X, y = check_X_y(X, y, "csr", ensure_min_features=2,
force_all_finite=False)

# Initialization
cv = check_cv(self.cv, y, is_classifier(self.estimator))
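
A runnable sketch of what the RFE/RFECV changes enable. `NaNOKClassifier` is a toy estimator invented for this example; a real estimator would also have to tolerate non-finite values in its own `fit`:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE


class NaNOKClassifier(BaseEstimator):
    # toy estimator: ignores non-finite values and reports dummy importances
    def fit(self, X, y):
        self.coef_ = np.ones(X.shape[1])
        return self

    def _more_tags(self):
        return {'allow_nan': True}


X, y = load_iris(return_X_y=True)
X[0, 0], X[0, 1] = np.nan, np.inf

rfe = RFE(estimator=NaNOKClassifier(), n_features_to_select=2)
rfe.fit(X, y)     # RFE no longer validates finiteness for this estimator
rfe.transform(X)  # likewise passes the non-finite values through
```

Note that `RFECV.fit` above passes `force_all_finite=False` unconditionally, while `RFE._fit` consults the estimator's `allow_nan` tag.
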
18 changes: 14 additions & 4 deletions sklearn/feature_selection/_variance_threshold.py
@@ -29,6 +29,10 @@ class VarianceThreshold(SelectorMixin, BaseEstimator):
variances_ : array, shape (n_features,)
Variances of individual features.

Notes
-----
Allows NaN in the input.

Examples
--------
The following dataset has integer features, two of which are the same
@@ -61,24 +65,27 @@ def fit(self, X, y=None):
-------
self
"""
-        X = check_array(X, ('csr', 'csc'), dtype=np.float64)
+        X = check_array(X, ('csr', 'csc'), dtype=np.float64,
+                        force_all_finite='allow-nan')

if hasattr(X, "toarray"): # sparse matrix
_, self.variances_ = mean_variance_axis(X, axis=0)
if self.threshold == 0:
mins, maxes = min_max_axis(X, axis=0)
peak_to_peaks = maxes - mins
else:
-            self.variances_ = np.var(X, axis=0)
+            self.variances_ = np.nanvar(X, axis=0)
if self.threshold == 0:
peak_to_peaks = np.ptp(X, axis=0)

if self.threshold == 0:
# Use peak-to-peak to avoid numeric precision issues
# for constant features
-            self.variances_ = np.minimum(self.variances_, peak_to_peaks)
+            compare_arr = np.array([self.variances_, peak_to_peaks])
+            self.variances_ = np.nanmin(compare_arr, axis=0)

-        if np.all(self.variances_ <= self.threshold):
+        if np.all(~np.isfinite(self.variances_) |
+                  (self.variances_ <= self.threshold)):
msg = "No feature in X meets the variance threshold {0:.5f}"
if X.shape[0] == 1:
msg += " (X contains only one sample)"
@@ -90,3 +97,6 @@ def _get_support_mask(self):
check_is_fitted(self)

return self.variances_ > self.threshold

def _more_tags(self):
return {'allow_nan': True}
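
Two quick checks of the NumPy behavior the updated `fit` relies on (plain NumPy; the constant is borrowed from `test_zero_variance_floating_point_error` below):

```python
import numpy as np

col = np.array([1.0, np.nan, 3.0])
np.var(col)     # nan: a single NaN poisons the plain variance
np.nanvar(col)  # 1.0: the variance of [1.0, 3.0], NaN ignored
np.nanvar([np.nan, np.nan])  # nan (all-NaN slice, with a RuntimeWarning)
# NaN > threshold is False, so _get_support_mask rejects an all-NaN feature.

# Why the threshold == 0 branch also looks at peak-to-peak:
x = np.full(10, -0.13725701)
x.var()    # can be a tiny nonzero value due to floating-point error
np.ptp(x)  # exactly 0.0 for a constant feature
```
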
57 changes: 56 additions & 1 deletion sklearn/feature_selection/tests/test_from_model.py
@@ -10,10 +10,28 @@
from sklearn.linear_model import LogisticRegression, SGDClassifier, Lasso
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
-from sklearn.ensemble import RandomForestClassifier
+from sklearn.experimental import enable_hist_gradient_boosting  # noqa
+from sklearn.ensemble import (RandomForestClassifier,
+                              HistGradientBoostingClassifier)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.base import BaseEstimator


class NaNTag(BaseEstimator):
def _more_tags(self):
return {'allow_nan': True}


class NoNaNTag(BaseEstimator):
def _more_tags(self):
return {'allow_nan': False}


class NaNTagRandomForest(RandomForestClassifier):
def _more_tags(self):
return {'allow_nan': True}


iris = datasets.load_iris()
data, y = iris.data, iris.target
rng = np.random.RandomState(0)
@@ -320,3 +338,40 @@ def test_threshold_without_refitting():
# Set a higher threshold to filter out more features.
model.threshold = "1.0 * mean"
assert X_transform.shape[1] > model.transform(data).shape[1]


def test_fit_accepts_nan_inf():
# Test that fit doesn't check for np.inf and np.nan values.
clf = HistGradientBoostingClassifier(random_state=0)

model = SelectFromModel(estimator=clf)

nan_data = data.copy()
nan_data[0] = np.NaN
nan_data[1] = np.Inf

model.fit(nan_data, y)


def test_transform_accepts_nan_inf():
Member: can you also check for fit, maybe use HistGradientBoostingClassifier?

# Test that transform doesn't check for np.inf and np.nan values.
clf = NaNTagRandomForest(n_estimators=100, random_state=0)
nan_data = data.copy()

model = SelectFromModel(estimator=clf)
model.fit(nan_data, y)

nan_data[0] = np.NaN
nan_data[1] = np.Inf

model.transform(nan_data)


def test_allow_nan_tag_comes_from_estimator():
allow_nan_est = NaNTag()
model = SelectFromModel(estimator=allow_nan_est)
assert model._get_tags()['allow_nan'] is True

no_nan_est = NoNaNTag()
model = SelectFromModel(estimator=no_nan_est)
assert model._get_tags()['allow_nan'] is False
26 changes: 26 additions & 0 deletions sklearn/feature_selection/tests/test_rfe.py
@@ -2,6 +2,7 @@
Testing Recursive feature elimination
"""

import pytest
import numpy as np
from numpy.testing import assert_array_almost_equal, assert_array_equal
from scipy import sparse
@@ -54,6 +55,9 @@ def get_params(self, deep=True):
def set_params(self, **params):
return self

def _get_tags(self):
return {}


def test_rfe_features_importance():
generator = check_random_state(0)
@@ -369,3 +373,25 @@ def test_rfe_cv_groups():
)
est_groups.fit(X, y, groups=groups)
assert est_groups.n_features_ > 0

glemaitre marked this conversation as resolved.

@pytest.mark.parametrize("cv", [
None,
5
])
def test_rfe_allow_nan_inf_in_x(cv):
iris = load_iris()
X = iris.data
y = iris.target

# add nan and inf value to X
X[0][0] = np.NaN
X[0][1] = np.Inf

clf = MockClassifier()
if cv is not None:
rfe = RFECV(estimator=clf, cv=cv)
else:
rfe = RFE(estimator=clf)
rfe.fit(X, y)
rfe.transform(X)
12 changes: 12 additions & 0 deletions sklearn/feature_selection/tests/test_variance_threshold.py
@@ -46,3 +46,15 @@ def test_zero_variance_floating_point_error():
msg = "No feature in X meets the variance threshold 0.00000"
with pytest.raises(ValueError, match=msg):
VarianceThreshold().fit(X)


def test_variance_nan():
arr = np.array(data, dtype=np.float64)
# add single NaN and feature should still be included
arr[0, 0] = np.NaN
# make all values in feature NaN and feature should be rejected
arr[:, 1] = np.NaN

for X in [arr, csr_matrix(arr), csc_matrix(arr), bsr_matrix(arr)]:
sel = VarianceThreshold().fit(X)
assert_array_equal([0, 3, 4], sel.get_support(indices=True))
Member: Should ideally also check that an all-NaN feature is treated as 0 variance...

Contributor Author: numpy.nanvar returns NaN for an all-NaN feature; is that okay behavior, or do we want to change that to 0 variance?

Member: I don't mind which, as long as the feature is rejected. It is probably simpler to set it to 0 at fit time.

Contributor Author: Okay, I left it as NaN since that doesn't require any changes to fit and would also let you differentiate between an all-NaN feature and one that truly has 0 variance, if desired. But I updated the test to ensure that the all-NaN feature is rejected.
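
A minimal end-to-end check of the resolution above (assuming a scikit-learn build with this PR merged): a feature with one NaN keeps its NaN-ignoring variance, while an all-NaN feature gets a NaN variance and is rejected:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[2.0, np.nan, 0.0],
              [1.0, np.nan, 4.0],
              [np.nan, np.nan, 1.0]])

sel = VarianceThreshold().fit(X)
print(sel.variances_)                 # approx [0.25, nan, 2.89]
print(sel.get_support(indices=True))  # [0 2]: the all-NaN column is dropped
```
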