
[MRG + 2] fix selectFdr bug #7490

Merged
merged 17 commits into from Oct 20, 2016
5 changes: 5 additions & 0 deletions doc/whats_new.rst
@@ -44,6 +44,11 @@ Enhancements
Bug fixes
.........

- Fix a bug where :class:`sklearn.feature_selection.SelectFdr` did not
  exactly implement the Benjamini-Hochberg procedure
Member:

That's better, but indicating that it formerly may have selected fewer features than it should (and now does) would be more helpful for former users of this estimator.

Author:

updated, thanks.

(`#7490 <https://github.com/scikit-learn/scikit-learn/pull/7490>`_) by
`Peng Meng`_.
Member:

You need to add yourself at the bottom of the page if you want your name to be a link.

Author:

Done, thanks very much.


- :class:`sklearn.manifold.LocallyLinearEmbedding` now correctly handles
integer inputs
(`#6282 <https://github.com/scikit-learn/scikit-learn/pull/6282>`_) by
36 changes: 35 additions & 1 deletion sklearn/feature_selection/tests/test_feature_select.py
@@ -371,6 +371,40 @@ def test_select_heuristics_regression():
assert_less(np.sum(support[5:] == 1), 3)


def test_boundary_case_ch2():
# Test boundary case, and always aiming to select 1 feature
X = np.array([[10, 20], [20, 20], [20, 30]])
y = np.array([[1], [0], [0]])
scores, pvalues = chi2(X, y)
assert_array_almost_equal(scores, np.array([4., 0.71428571]))
assert_array_almost_equal(pvalues, np.array([0.04550026, 0.39802472]))

filter_fdr = SelectFdr(chi2, alpha=0.1)
filter_fdr.fit(X, y)
support_fdr = filter_fdr.get_support()
assert_array_equal(support_fdr, np.array([True, False]))

filter_kbest = SelectKBest(chi2, k=1)
Member:

@amueller did you mean this sort of thing, testing the select 1 feature boundary case for each strategy?

filter_kbest.fit(X, y)
support_kbest = filter_kbest.get_support()
assert_array_equal(support_kbest, np.array([True, False]))

filter_percentile = SelectPercentile(chi2, percentile=50)
filter_percentile.fit(X, y)
support_percentile = filter_percentile.get_support()
assert_array_equal(support_percentile, np.array([True, False]))

filter_fpr = SelectFpr(chi2, alpha=0.1)
filter_fpr.fit(X, y)
support_fpr = filter_fpr.get_support()
assert_array_equal(support_fpr, np.array([True, False]))

filter_fwe = SelectFwe(chi2, alpha=0.1)
filter_fwe.fit(X, y)
support_fwe = filter_fwe.get_support()
assert_array_equal(support_fwe, np.array([True, False]))
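
For readers following along, the boundary case above can be run standalone. A minimal sketch, assuming a scikit-learn build that includes this fix:

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, chi2

# The 3x2 boundary case from the test: exactly one feature should survive.
X = np.array([[10, 20], [20, 20], [20, 30]])
y = np.array([1, 0, 0])

scores, pvalues = chi2(X, y)
# p-values ~ [0.0455, 0.3980]; with alpha=0.1 the Benjamini-Hochberg
# threshold for the smallest p-value is 0.1 * 1/2 = 0.05, so the first
# feature is selected (the pre-fix code compared it against 0 and kept none).
support = SelectFdr(chi2, alpha=0.1).fit(X, y).get_support()
print(support)  # → [ True False]
```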


def test_select_fdr_regression():
# Test that fdr heuristic actually has low FDR.
def single_fdr(alpha, n_informative, random_state):
@@ -404,7 +438,7 @@ def single_fdr(alpha, n_informative, random_state):
# FDR = E(FP / (TP + FP)) <= alpha
false_discovery_rate = np.mean([single_fdr(alpha, n_informative,
random_state) for
-                                     random_state in range(30)])
+                                     random_state in range(100)])
Member:

What is this change for?

Author:

FDR = E(FP / (TP + FP)) <= alpha

The empirical FDR is estimated by averaging FP / (TP + FP) over N repetitions. The larger N is, the more accurate the estimate, so here the larger the better.
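
A hypothetical illustration of this point in pure NumPy (the distribution and its parameters are invented for the sketch): the empirical FDR is a Monte Carlo mean, so its standard error shrinks like 1/sqrt(N) as the number of repetitions grows.

```python
import numpy as np

# Hypothetical per-run FDR observations: a noisy ratio whose true mean
# is what the test compares against alpha.
rng = np.random.RandomState(0)
per_run_fdr = rng.beta(a=2, b=38, size=100000)  # mean around 0.05

# Standard error of the mean over N repetitions is std / sqrt(N),
# so averaging over 100 runs is noticeably tighter than over 30.
for n in (30, 100):
    se = per_run_fdr.std() / np.sqrt(n)
    print(n, se)
```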

Member:

Did it not pass without that change?

Author:

Yes, the test does not pass with range(30).

assert_greater_equal(alpha, false_discovery_rate)

# Make sure that the empirical false discovery rate increases
4 changes: 2 additions & 2 deletions sklearn/feature_selection/univariate_selection.py
@@ -596,8 +596,8 @@ def _get_support_mask(self):

n_features = len(self.pvalues_)
sv = np.sort(self.pvalues_)
-        selected = sv[sv <= float(self.alpha) / n_features
-                      * np.arange(n_features)]
+        selected = sv[sv <= float(self.alpha) / n_features *
+                      np.arange(1, n_features + 1)]
if selected.size == 0:
return np.zeros_like(self.pvalues_, dtype=bool)
return self.pvalues_ <= selected.max()
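
This one-line change is the whole fix: Benjamini-Hochberg compares the i-th smallest p-value to alpha * i / m with ranks starting at 1, not 0. A minimal standalone sketch of the corrected logic, mirroring `_get_support_mask` above:

```python
import numpy as np

def benjamini_hochberg_mask(pvalues, alpha):
    """Boolean mask of features kept by the Benjamini-Hochberg procedure."""
    pvalues = np.asarray(pvalues, dtype=float)
    m = len(pvalues)
    sv = np.sort(pvalues)
    # Ranks run 1..m; the old np.arange(m) made the first threshold 0,
    # so a borderline smallest p-value could never be kept.
    selected = sv[sv <= float(alpha) / m * np.arange(1, m + 1)]
    if selected.size == 0:
        return np.zeros_like(pvalues, dtype=bool)
    return pvalues <= selected.max()

# The boundary case from the new test: p-values [0.0455, 0.3980], alpha=0.1.
print(benjamini_hochberg_mask([0.04550026, 0.39802472], alpha=0.1))
# → [ True False]  (the old arange(m) version returned [False False])
```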