
[MRG + 2] fix selectFdr bug #7490

Merged
merged 17 commits into from Oct 20, 2016

Conversation

@mpjlu (Author) commented Sep 26, 2016

Reference Issue

Fixes #7474

What does this implement/fix? Explain your changes.

Description

SelectFdr in sklearn/feature_selection/univariate_selection.py:

def _get_support_mask(self):
    check_is_fitted(self, 'scores_')

    n_features = len(self.pvalues_)
    sv = np.sort(self.pvalues_)
    selected = sv[sv <= float(self.alpha) / n_features * np.arange(n_features)]
    if selected.size == 0:
        return np.zeros_like(self.pvalues_, dtype=bool)
    return self.pvalues_ <= selected.max()

Should Be:

    selected = sv[sv <= float(self.alpha) / n_features
                  * (np.arange(n_features) + 1)]

Because np.arange starts from 0, but the Benjamini-Hochberg ranks should start from 1.
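The corrected cut-off can be sketched with plain NumPy (an illustrative fragment mirroring the fixed `_get_support_mask`, not the full estimator; the p-values are made up):

```python
import numpy as np

alpha = 0.05
pvalues = np.array([0.001, 0.009, 0.04, 0.3])
n = len(pvalues)
sv = np.sort(pvalues)
# Ranks must run 1..n: the k-th smallest p-value is compared with alpha * k / n.
thresholds = float(alpha) / n * np.arange(1, n + 1)
selected = sv[sv <= thresholds]
support = (pvalues <= selected.max()) if selected.size else np.zeros(n, dtype=bool)
print(support)  # [ True  True False False]
```

With the buggy 0-based ranks, the smallest p-value would be compared against a threshold of 0 and could only ever pass if it were exactly 0.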

Any other comments?



@jnothman (Member) commented Sep 26, 2016

I've inserted "Fixes #7474" in the issue description above.

@jnothman (Member) commented Sep 26, 2016

Tests aren't passing.

@mpjlu (Author) commented Sep 26, 2016

Can anyone comment on why this build failed? I am not familiar with this.

@@ -596,7 +596,7 @@ def _get_support_mask(self):
     n_features = len(self.pvalues_)
     sv = np.sort(self.pvalues_)
     selected = sv[sv <= float(self.alpha) / n_features
-                  * np.arange(n_features)]
+                  * (np.arange(n_features) + 1)]

@yl565 (Contributor) commented Sep 26, 2016

Looks like a pep8 problem. Line break should be after *.

selected = sv[sv <= float(self.alpha) / n_features *
              (np.arange(n_features) + 1)]

@nelson-liu (Contributor) commented Sep 26, 2016

fwiw, i think pep8 has changed their stance on this, see [1,2]. It does seem to be what's causing the travis failure though (and we should keep consistency with the rest of the codebase).

[1] https://www.python.org/dev/peps/pep-0008/#should-a-line-break-before-or-after-a-binary-operator
[2] http://bugs.python.org/issue26763

@jnothman (Member) commented Sep 26, 2016

I'd be happy to disable W503. I tend to agree with the more recent convention in terms of readability.

Strange that https://www.python.org/dev/peps/pep-0008 doesn't mention when the document was updated.

@nelson-liu (Contributor) commented Sep 27, 2016

@jnothman, Guido van Rossum commented in the bug tracker on 4/15 that he updated the PEP documentation. Interestingly, I thought W503 was ignored by default in flake8.

@jnothman (Member) commented Sep 27, 2016

conda flake8 might be outdated.

@@ -404,7 +404,7 @@ def single_fdr(alpha, n_informative, random_state):
     # FDR = E(FP / (TP + FP)) <= alpha
     false_discovery_rate = np.mean([single_fdr(alpha, n_informative,
                                                random_state) for
-                                    random_state in range(30)])
+                                    random_state in range(100)])

@jnothman (Member) commented Sep 27, 2016

What is this change for?

@mpjlu (Author) commented Sep 27, 2016

FDR = E(FP / (TP + FP)) <= alpha

The FDR is estimated by averaging FP / (TP + FP) over many runs (N). The larger N is, the more accurate the estimate; here, larger N is better.
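The averaging argument can be sketched with a small Monte Carlo (illustrative only, not the actual scikit-learn test; `bh_select` and the p-value distributions below are assumptions for the sketch):

```python
import numpy as np

rng = np.random.RandomState(0)
alpha, n_null, n_info = 0.1, 15, 5

def bh_select(pvalues, alpha):
    # Benjamini-Hochberg with 1-based ranks, as in the fixed SelectFdr.
    n = len(pvalues)
    sv = np.sort(pvalues)
    selected = sv[sv <= alpha / n * np.arange(1, n + 1)]
    if selected.size == 0:
        return np.zeros_like(pvalues, dtype=bool)
    return pvalues <= selected.max()

def fdp_once(rng):
    # Null p-values uniform; informative p-values concentrated near zero.
    p = np.concatenate([rng.uniform(size=n_null),
                        rng.uniform(size=n_info) ** 10])
    support = bh_select(p, alpha)
    false_pos = support[:n_null].sum()
    return false_pos / max(support.sum(), 1)

# Averaging FP / (TP + FP) over more runs reduces the variance of the estimate.
fdr = np.mean([fdp_once(rng) for _ in range(100)])
print(fdr)  # empirical FDR, at most about alpha
```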

@amueller (Member) commented Oct 17, 2016

Did it not pass without that change?

@mpjlu (Author) commented Oct 18, 2016

Yes, with range(30) the test does not pass.

-        selected = sv[sv <= float(self.alpha) / n_features
-                      * np.arange(n_features)]
+        selected = sv[sv <= float(self.alpha) / n_features *
+                      (np.arange(n_features) + 1)]

@jnothman (Member) commented Sep 28, 2016

should be arange(1, n_features+1)

@mpjlu (Author) commented Oct 17, 2016

done

@jnothman (Member) commented Sep 29, 2016

In any case, I don't see how this test ensures that your change works. Could you please add a test that clearly would fail at master?

@NelleV (Member) commented Sep 29, 2016

Such a test could be that, provided with only one p-value, the function returns the appropriate answer (i.e., in this case, if the p-value is small enough but not 0, it should be selected; in master it would only be selected if the p-value were equal to 0, but now it is selected whenever the p-value is smaller than alpha).
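This boundary case can be sketched as follows (`bh_support` is a hypothetical helper replicating the selection logic, not the sklearn API; `one_based=False` reproduces the master behaviour):

```python
import numpy as np

def bh_support(pvalues, alpha, one_based=True):
    n = len(pvalues)
    # Buggy master compares the k-th smallest p-value against alpha * (k-1) / n,
    # so with a single p-value the threshold is 0.
    ranks = np.arange(1, n + 1) if one_based else np.arange(n)
    sv = np.sort(pvalues)
    selected = sv[sv <= float(alpha) / n * ranks]
    if selected.size == 0:
        return np.zeros_like(pvalues, dtype=bool)
    return pvalues <= selected.max()

p = np.array([0.01])
print(bh_support(p, alpha=0.05, one_based=False))  # master: [False]
print(bh_support(p, alpha=0.05, one_based=True))   # fixed:  [ True]
```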

@mpjlu (Author) commented Sep 29, 2016

Thanks @NelleV, I agree with you. This test is very simple and direct; is it necessary?

@jnothman (Member) commented Oct 5, 2016

Yes, it is necessary.

@mpjlu (Author) commented Oct 10, 2016

Hi @jnothman, sorry for the late reply due to a national holiday.
The clearest case to show our change works is the one in @NelleV's previous comment, but it is not easy to construct such a case with the current approach of using make_regression to generate random sample data.
It is easy to construct the sample data manually, but the code style would then differ from the other tests, which all generate sample data with a generation method.

I propose the following method to show our change works.
The test asserts num_false_positives and num_true_positives, like this:

def test_select_fdr_regression2():
    n_informative = 5
    X, y = make_regression(n_samples=150, n_features=20,
                           n_informative=n_informative, shuffle=False,
                           random_state=0, noise=10)
    with warnings.catch_warnings(record=True):
        # Warnings can be raised when no features are selected
        # (low alpha or very noisy data)
        univariate_filter = SelectFdr(f_regression, alpha=0.01)
        univariate_filter.fit(X, y)
    support = univariate_filter.get_support()
    num_false_positives = np.sum(support[n_informative:] == 1)
    num_true_positives = np.sum(support[:n_informative] == 1)
    assert_equal(num_false_positives, 5)
    assert_equal(num_true_positives, 7)

What do you think about it?
Actually, there are many cases where this change works and master fails; we just need to write one here.

@amueller (Member) commented Oct 10, 2016

@mpjlu please format your code using backticks (I did it for you this time).

@amueller (Member) commented Oct 10, 2016

I don't understand the test. The FDR in this case is 5/12 which is about 0.5, while the test claims to control it at .01? Or am I misunderstanding something here?

@mpjlu (Author) commented Oct 11, 2016

Hi @amueller, the code in the last comment is just a code template, not the real data. Sorry for the misunderstanding.

@amueller (Member) commented Oct 13, 2016

ah...

@mpjlu (Author) commented Oct 14, 2016

Hi @jnothman, do you think I should add a test case (by manually constructing the sample data) that would clearly fail on master? Is there any other work I need to do for this PR?

@jnothman (Member) commented Oct 14, 2016

You should construct a test case that ensures that the implementation does what it should. Ideally that would make the off-by-one distinction violated presently.

@mpjlu mpjlu force-pushed the mpjlu:master branch from 076bd01 to 6cb1cc0 Oct 17, 2016
@mpjlu (Author) commented Oct 17, 2016

Hi @jnothman, I added a test case for this PR. For the test's sample data, the master code can never select exactly one feature; it selects either two features or none. With this PR, 0, 1, or 2 features are selected depending on alpha.

# by PR #7490, the result is array([True, False])
X = np.array([[10, 20], [20, 20], [20, 30]])
y = np.array([[1], [0], [0]])
univariate_filter = SelectFdr(chi2, alpha=0.1)
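For reference, the p-values for this data can be re-derived by hand (a hypothetical re-derivation assuming sklearn's `chi2` contingency construction, not the actual sklearn code; with two classes there is one degree of freedom, so the chi-squared survival function reduces to `erfc(sqrt(x / 2))`):

```python
import math

X = [[10, 20], [20, 20], [20, 30]]
y = [1, 0, 0]
alpha = 0.1

# Observed per-class feature totals, expected = class frequency * feature total.
classes = sorted(set(y))
observed = [[sum(X[i][j] for i in range(len(y)) if y[i] == c)
             for j in range(len(X[0]))]
            for c in classes]
feature_totals = [sum(col) for col in zip(*observed)]
class_prob = [y.count(c) / len(y) for c in classes]

pvalues = []
for j, total in enumerate(feature_totals):
    stat = sum((observed[c][j] - class_prob[c] * total) ** 2
               / (class_prob[c] * total) for c in range(len(classes)))
    # chi2 survival function with 1 dof (two classes): erfc(sqrt(x / 2)).
    pvalues.append(math.erfc(math.sqrt(stat / 2.0)))

# Benjamini-Hochberg with 1-based ranks, as in the fixed SelectFdr.
n = len(pvalues)
sv = sorted(pvalues)
selected = [p for i, p in enumerate(sv) if p <= alpha / n * (i + 1)]
threshold = max(selected) if selected else -1.0
support = [p <= threshold for p in pvalues]
print(support)  # [True, False], matching the comment above
```

The first p-value (about 0.0455) passes the rank-1 threshold alpha * 1 / 2 = 0.05 only with 1-based ranks; with the buggy 0-based ranks its threshold would be 0, so nothing would be selected.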

@amueller (Member) commented Oct 17, 2016

can you please assert that the p-values are the ones you specify above?

@mpjlu (Author) commented Oct 18, 2016

Yes, the p-values in comments are real p-values.

@mpjlu (Author) commented Oct 18, 2016

Do you mean I should assert the p-values in the test code?

@amueller (Member) commented Oct 18, 2016

yes please.

@maniteja123 (Contributor) commented Oct 19, 2016

Hi, since it is a bug fix, I suppose you can add an entry in what's new under the bug fixes section. Hope it helps.

@mpjlu (Author) commented Oct 19, 2016

thanks @maniteja123

@@ -44,6 +44,10 @@ Enhancements
 Bug fixes
 .........

+- :class:`sklearn.feature_selection.SelectFdr` now correctly works

@jnothman (Member) commented Oct 19, 2016

A more descriptive message to tell people what was broken and is now fixed would be helpful. Often we use a message like "Fixed a bug where :class:feature_selection.SelectFdr did not ..."

@mpjlu (Author) commented Oct 19, 2016

How about "Fixed a bug where :class:feature_selection.SelectFdr did not accurately implement the Benjamini-Hochberg procedure"?

@@ -44,6 +44,11 @@ Enhancements
 Bug fixes
 .........

+- Fix a bug where :class:`sklearn.feature_selection.SelectFdr` did not
+  exactly implement Benjamini-Hochberg procedure

@jnothman (Member) commented Oct 19, 2016

That's better, but indicating that it formerly may have selected fewer features than it should (and now does) would be more helpful for former users of this estimator.

@mpjlu (Author) commented Oct 19, 2016

updated, thanks.

@mpjlu (Author) commented Oct 19, 2016

It seems there is a build environment problem. Any comments on this? Thanks.

@maniteja123 (Contributor) commented Oct 19, 2016

Hi, yeah, you are right. It looks like it was an AppVeyor issue. I suppose it is fine since all the other builds succeed.

@mpjlu (Author) commented Oct 19, 2016

thanks @maniteja123

@amueller amueller added this to the 0.18.1 milestone Oct 19, 2016
@amueller (Member) left a comment

Looks good to me apart from adding the link.

exactly implement Benjamini-Hochberg procedure. It formerly may have
selected fewer features than it should.
(`#7490 <https://github.com/scikit-learn/scikit-learn/pull/7490>`_) by
`Peng Meng`_.

@amueller (Member) commented Oct 19, 2016

You need to add yourself at the bottom of the page if you want your name to be a link.

@mpjlu (Author) commented Oct 20, 2016

Done, thanks very much.

@amueller amueller changed the title [MGR + 1] fix selectFdr bug [MGR + 2] fix selectFdr bug Oct 19, 2016
@jnothman jnothman changed the title [MGR + 2] fix selectFdr bug [MRG + 2] fix selectFdr bug Oct 20, 2016
@jnothman jnothman merged commit 2caa144 into scikit-learn:master Oct 20, 2016
0 of 3 checks passed:
ci/circleci: CircleCI is running your tests
continuous-integration/appveyor/pr: Waiting for AppVeyor build to complete
continuous-integration/travis-ci/pr: The Travis CI build is in progress
jnothman added a commit to jnothman/scikit-learn that referenced this pull request Oct 20, 2016
amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 25, 2016
# Conflicts:
#	doc/whats_new.rst
amueller added a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
# Conflicts:
#	doc/whats_new.rst
afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017
Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* tag '0.18.1': (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* releases: (144 commits) [same commit list as above]

 Conflicts (removed):
	sklearn/externals/joblib/__init__.py
	sklearn/externals/joblib/_parallel_backends.py
	sklearn/externals/joblib/testing.py
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* dfsg: (144 commits) [same commit list as above]
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
7 participants