[BUG] Fix SelectFDR thresholding bug (#2771) #2932

Closed · wants to merge 1 commit · 6 participants

@ajtulloch

From #2771, we were not correctly scaling the alpha parameter
(http://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
with the number of features (i.e. the number of hypotheses tested). Thus, the
alpha parameter was not invariant with respect to the number of features.

The correction is as suggested in the original issue, and a test has
been added that verifies that for various numbers of features, an
appropriate false discovery rate is generated when using the selector.

This test passes with the new FDR logic, and fails with the old FDR logic.
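
For context, here is a minimal standalone sketch of the corrected Benjamini-Hochberg cutoff. This is illustrative only, not the PR's diff verbatim: it uses the textbook 1-based ranks (the diff keeps np.arange starting at 0), and the helper name bh_threshold is invented for this example.

import numpy as np

def bh_threshold(pvalues, alpha):
    # Benjamini-Hochberg: with m hypotheses, select the largest sorted
    # p-value p_(k) satisfying p_(k) <= alpha * k / m for k = 1..m.
    # Without the 1/m factor the cutoff grows with the number of
    # features, which is exactly the bug this PR fixes.
    m = len(pvalues)
    sv = np.sort(pvalues)
    below = sv[sv <= alpha * np.arange(1, m + 1) / m]
    return below.max() if below.size else 0.0

With this scaling, adding more uninformative features no longer loosens the selection threshold, which is what makes alpha comparable across feature-set sizes.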

@ajtulloch ajtulloch [Feature Selection] Fix SelectFDR thresholding bug (#2771)
321fbcc
@ajtulloch

ping?

@GaelVaroquaux
scikit-learn member
@agramfort
scikit-learn member

@bthirion can you have a look?

@bthirion bthirion commented on the diff Mar 8, 2014
sklearn/feature_selection/univariate_selection.py
@@ -491,7 +491,8 @@ def __init__(self, score_func=f_classif, alpha=5e-2):
     def _get_support_mask(self):
         alpha = self.alpha
         sv = np.sort(self.pvalues_)
-        threshold = sv[sv < alpha * np.arange(len(self.pvalues_))].max()
+        m = float(len(self.pvalues_))
+        threshold = sv[sv < alpha / m * np.arange(len(self.pvalues_))].max()
@bthirion
scikit-learn member
bthirion added a line comment Mar 8, 2014

OK for the correction.
Note that this line could be more concisely rewritten as
threshold = sv[sv < alpha / m * np.arange(m)].max()

@VirgileFritsch
scikit-learn member
VirgileFritsch added a line comment Mar 8, 2014
@bthirion bthirion commented on the diff Mar 8, 2014
sklearn/feature_selection/tests/test_feature_select.py
- """
- X, y = make_regression(n_samples=200, n_features=20,
- n_informative=5, shuffle=False, random_state=0)
-
- univariate_filter = SelectFdr(f_regression, alpha=0.01)
- X_r = univariate_filter.fit(X, y).transform(X)
- X_r2 = GenericUnivariateSelect(
- f_regression, mode='fdr', param=0.01).fit(X, y).transform(X)
- assert_array_equal(X_r, X_r2)
- support = univariate_filter.get_support()
- gtruth = np.zeros(20)
- gtruth[:5] = 1
- assert_array_equal(support, gtruth)
+ with the fdr heuristic.
+
+ Tests that the scaling factors
@bthirion
scikit-learn member
bthirion added a line comment Mar 8, 2014

Unfinished sentence...

@bthirion bthirion commented on the diff Mar 8, 2014
sklearn/feature_selection/tests/test_feature_select.py
     assert_array_equal(X_r, X_r2)
     support = univariate_filter.get_support()
     gtruth = np.zeros(20)
     gtruth[:5] = 1
     assert_array_equal(support, gtruth)
+    num_false_positives = np.sum(support[5:] == 1)
+    num_true_positives = np.sum(support[:5] == 1)
+
+    false_discovery_rate = float(num_false_positives) / \
+        (num_true_positives + num_false_positives)
+    # We have that
+    # FDR = E(num_false_positives / (true_positives + false_positives))
+    # <= alpha
@bthirion
scikit-learn member
bthirion added a line comment Mar 8, 2014

Note that the expected value is actually alpha, not <= alpha.
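
(For reference, the standard Benjamini-Hochberg guarantee for independent p-values, a textbook result rather than something stated in this thread, is

\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right] = \frac{m_0}{m}\,\alpha \le \alpha

where V is the number of false positives, R the total number of selections, and m_0 the number of true nulls; the expectation equals alpha only when all m hypotheses are null.)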

@bthirion bthirion commented on the diff Mar 8, 2014
sklearn/feature_selection/tests/test_feature_select.py
     assert_array_equal(X_r, X_r2)
     support = univariate_filter.get_support()
     gtruth = np.zeros(20)
     gtruth[:5] = 1
     assert_array_equal(support, gtruth)
+    num_false_positives = np.sum(support[5:] == 1)
+    num_true_positives = np.sum(support[:5] == 1)
+
+    false_discovery_rate = float(num_false_positives) / \
+        (num_true_positives + num_false_positives)
+    # We have that
+    # FDR = E(num_false_positives / (true_positives + false_positives))
+    # <= alpha
+    assert(false_discovery_rate < alpha)
@bthirion
scikit-learn member
bthirion added a line comment Mar 8, 2014

This may be true in that case, but in general, you don't expect the actual FDR to be smaller than alpha.
This test is a bit strange, because the test in l.282 actually guarantees the absence of false positives.

A refactoring of this test would thus be welcome.
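
One possible refactoring along these lines, sketched below under stated assumptions (this is hypothetical code, not part of this PR; the helper name empirical_fdr is invented, and the final tolerance check is left open): average the empirical FDR over many random draws, so the assertion targets the expectation that BH actually controls rather than a single realization.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFdr, f_regression

def empirical_fdr(alpha=0.1, n_informative=5, n_features=20, n_runs=100):
    # Estimate E[V / max(R, 1)] by averaging over many random datasets;
    # BH controls this expectation at level alpha.
    rates = []
    for seed in range(n_runs):
        X, y = make_regression(n_samples=200, n_features=n_features,
                               n_informative=n_informative, shuffle=False,
                               random_state=seed)
        support = SelectFdr(f_regression, alpha=alpha).fit(X, y).get_support()
        n_selected = support.sum()
        n_false = support[n_informative:].sum()
        rates.append(float(n_false) / max(n_selected, 1))
    return np.mean(rates)

# e.g. assert empirical_fdr(alpha=0.1) < 0.1 plus a small sampling tolerance

This addresses the point above: one then checks that the average does not exceed the nominal level by more than sampling noise, instead of asserting on a single dataset where the observed rate may be trivially zero.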

@arjoly
scikit-learn member
arjoly added a line comment Jul 18, 2014

+1

@bthirion bthirion commented on the diff Mar 8, 2014
sklearn/feature_selection/tests/test_feature_select.py
+
+    univariate_filter = SelectFdr(f_regression, alpha=alpha)
+    X_r = univariate_filter.fit(X, y).transform(X)
+    X_r2 = GenericUnivariateSelect(
+        f_regression, mode='fdr', param=alpha).fit(X, y).transform(X)
+    assert_array_equal(X_r, X_r2)
+    support = univariate_filter.get_support()
+    num_false_positives = np.sum(support[n_informative:] == 1)
+    num_true_positives = np.sum(support[:n_informative] == 1)
+
+    false_discovery_rate = float(num_false_positives) / \
+        (num_true_positives + num_false_positives)
+    # We have that
+    # FDR = E(num_false_positives / (true_positives + false_positives))
+    # <= alpha
+    assert(false_discovery_rate < alpha)
@bthirion
scikit-learn member
bthirion added a line comment Mar 8, 2014

Same comment here.

@arjoly
scikit-learn member
arjoly added a line comment Jul 18, 2014

+1

@bthirion
scikit-learn member

OK with this PR as is.
This could be an opportunity to improve the tests.

@ogrisel ogrisel added this to the 0.15.1 milestone Jul 29, 2014
@amueller amueller added the Bug label Jan 9, 2015
@amueller
scikit-learn member

Fixed in #4146.

@amueller amueller closed this Feb 24, 2015