
[MRG+1] Novelty detection for LOF #10700

Merged (24 commits) on Jul 19, 2018

6 participants
albertcthomas (Contributor) commented Feb 25, 2018

Fixes #10191

What does this implement/fix? Explain your changes.

This PR will allow users to use LOF for novelty detection as suggested in this comment of PR #9015. An argument novelty=True or False is added in __init__:

  • if novelty=False (default) then lof.predict(X) raises ValueError('predict is not available when novelty=False, use fit_predict if you want to predict on training data. Use novelty=True if you want to use LOF for novelty detection and predict on new unseen data.')
    Similar error for lof.decision_function(X) and lof.score_samples(X)
  • if novelty=True, lof.fit_predict(X) raises ValueError('fit_predict is not available when novelty=True. Use novelty=False if you want to predict on the training set.'). But lof.predict(X), lof.decision_function and lof.score_samples(X) are OK.

This way users can use LOF for novelty detection, but by default (novelty=False) they will get an error with an informative message telling them to set novelty=True if they really want to predict on new data.
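The behavior described above can be sketched with a minimal stand-in class (hypothetical, pure Python; the real estimator computes the actual LOF scores, this only illustrates the proposed novelty switch):

```python
class LOFSketch:
    """Hypothetical stand-in illustrating the proposed novelty gating."""

    def __init__(self, novelty=False):
        self.novelty = novelty

    def fit_predict(self, X):
        # Only available in outlier-detection mode (novelty=False).
        if self.novelty:
            raise ValueError(
                'fit_predict is not available when novelty=True. Use '
                'novelty=False if you want to predict on the training set.')
        return [1] * len(X)  # placeholder for the real inlier/outlier labels

    def predict(self, X):
        # Only available in novelty-detection mode (novelty=True).
        if not self.novelty:
            raise ValueError(
                'predict is not available when novelty=False, use '
                'fit_predict if you want to predict on training data.')
        return [1] * len(X)  # placeholder for the real inlier/outlier labels
```

With the default novelty=False, predict raises immediately with the informative message; with novelty=True, fit_predict raises instead, so each mode exposes exactly one way of labelling data.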

Any other comments?

This PR also refactors the outlier detection documentation; in particular, plot_outlier_detection.py is removed in favor of plot_anomaly_comparison.py.

jnothman (Member) left a comment

So we have two big questions:

  • does LOF make sense as an inductive model?
  • if so, is it necessary to make two different modes?

It doesn't look like the fitting is any different in the inductive case from fit_predict. Is the main concern that prediction on the training data is not identical to the results of fit_predict? We have such cases with {fit_,}transform in the works, FWIW.

albertcthomas (Contributor) commented Feb 26, 2018

The main concern is indeed that fit_predict would not be identical to fit.predict.
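Why the two would differ for LOF, in a toy 1-D nearest-neighbor sketch (pure Python, illustrative only, not the scikit-learn code): a training point scored as if it were unseen data finds itself in the training set at distance 0, so its neighborhood, and hence its local-density score, changes.

```python
def nearest_neighbor_dist(x, train, exclude_self=False):
    """Distance from x to its nearest neighbor among 1-D training points."""
    dists = sorted(abs(x - t) for t in train)
    # fit_predict-style scoring: a training sample is not its own neighbor.
    # predict-style scoring: the sample is treated as unseen data and finds
    # itself in the training set at distance 0.
    return dists[1] if exclude_self else dists[0]

train = [0.0, 1.0, 5.0]
fit_predict_view = nearest_neighbor_dist(5.0, train, exclude_self=True)   # 4.0
predict_view = nearest_neighbor_dist(5.0, train, exclude_self=False)      # 0.0
```

The isolated point 5.0 looks far from its neighbors in the first view and right on top of one in the second, which is why fit(X).predict(X) cannot be expected to reproduce fit_predict(X).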

jnothman (Member) commented Feb 26, 2018

```python
'considering the negative_outlier_factor_ attribute. Use '
'novelty=True if you want to use LOF for novelty detection and '
'compute score_samples for new unseen data.')
raise NotImplementedError(msg)
```

agramfort (Member) commented Feb 27, 2018

I would structure the code differently.

```python
if not self.novelty:
    raise

...

return ...
```

It feels better to first capture the errors and to end the function with a return.
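The suggested restructuring, on a hypothetical method (names and placeholder scores are illustrative only): hoist the raise into a guard clause so the body ends with its return.

```python
class Example:
    """Hypothetical estimator used only to illustrate the two shapes."""

    def __init__(self, novelty=False):
        self.novelty = novelty

    def score_samples_nested(self, X):
        # Shape being reviewed: the happy path sits inside a branch and
        # the function ends with a raise.
        if self.novelty:
            return [0.0 for _ in X]  # placeholder scores
        raise ValueError('score_samples is not available when novelty=False.')

    def score_samples(self, X):
        # Suggested shape: capture the error first as a guard clause...
        if not self.novelty:
            raise ValueError(
                'score_samples is not available when novelty=False.')
        # ...then end the function with a return.
        return [0.0 for _ in X]  # placeholder scores
```

Both behave identically; the second simply reads top to bottom as "reject invalid state, then do the work".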
agramfort (Member) commented Feb 27, 2018

I find this solution very elegant and rigorous. Using the id will not be a robust solution.
And we still don't break the contract that fit.predict == fit_predict unless the user explicitly asks for it.

```python
if self.novelty:
    msg = ('fit_predict is not available when novelty=True. Use '
           'novelty=False if you want to predict on the training set.')
    raise NotImplementedError(msg)
```

jnothman (Member) commented Mar 4, 2018

I don't think this is the right error type. It makes it seem like one day we will implement this case.

jnothman (Member) commented Mar 4, 2018

A TypeError would be more normal for a bad call, while a ValueError would be possible too. Neither is quite satisfying.

agramfort (Member) commented Mar 4, 2018

I would use ValueError

jnothman (Member) commented Mar 4, 2018

Test failures. I'm happy with this solution. I'm still not sure we can maintain similar assurances for all Transformers, though.

```python
'if you want to predict on training data. Use novelty=True if '
'you want to use LOF for novelty detection and predict on new '
'unseen data.')
raise NotImplementedError(msg)
```

agramfort (Member) commented Mar 4, 2018

ValueError

As I said elsewhere, I think it more natural to finish a function with a return than with a raise, though.

albertcthomas (Contributor) commented Mar 5, 2018

Thanks for your comments and reviews @jnothman and @agramfort. I think this is a good solution, as some people are not ready to break fit.predict == fit_predict (for what I think are good reasons). I'm OK with ValueError; in a way, the error is related to the value of the novelty parameter in this case. I will finish this PR, I just wanted to be sure about this before completely implementing this feature.

jnothman (Member) commented Mar 5, 2018 (comment minimized)

agramfort (Member) commented Mar 5, 2018 (comment minimized)

albertcthomas (Contributor) commented Mar 5, 2018

Unless this kind of solution can be more generally applied (i.e. not just for LOF or outlier detection estimators), novelty seems good to me as well.

albertcthomas force-pushed the novelty_for_lof branch from bec8fa2 to 666c18a on Apr 2, 2018

albertcthomas (Contributor) commented Apr 8, 2018

BTW, something's strange in the CI tests: the example examples/covariance/plot_outlier_detection.py should fail, as it uses the private method _decision_function, which I removed in the previous commits.

Running the example locally returns

```
AttributeError: 'LocalOutlierFactor' object has no attribute '_decision_function'
```
agramfort (Member) commented Apr 8, 2018 (comment minimized)

albertcthomas (Contributor) commented Apr 8, 2018

I also suggest removing this example (examples/covariance/plot_outlier_detection.py) since we now have examples/plot_anomaly_comparison.py. WDYT?

albertcthomas force-pushed the novelty_for_lof branch from 407f8e4 to 25e9362 on Apr 8, 2018

albertcthomas (Contributor) commented Apr 8, 2018

For plot_lof_outlier_detection.py, I refit the model with novelty=True in order to show the level sets of the decision_function, but that might confuse users. We should maybe remove this and only illustrate the use of the negative_outlier_factor_ attribute.

agramfort (Member) commented Apr 8, 2018 (comment minimized)

albertcthomas (Contributor) commented Apr 8, 2018

> ok but I would suggest to reuse the blob of text from the doc at the top. At least some parts. Check that you don't break any link in the doc.

@agramfort you are saying this about removing plot_outlier_detection.py?

sklearn-lgtm commented Apr 8, 2018

This pull request introduces 1 alert when merging 25e9362 into bfad4da - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

agramfort (Member) commented Apr 8, 2018 (comment minimized)

albertcthomas (Contributor) commented Apr 30, 2018

I'm almost done here, just need to fix a couple of typos in the doc, but CircleCI returns the following error. I don't know if it's related to this PR.

```
Unexpected failing examples:
/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py", line 75, in <module>
    data = fetch_mldata('mauna-loa-atmospheric-co2').data
  File "/home/circleci/project/sklearn/datasets/mldata.py", line 154, in fetch_mldata
    mldata_url = urlopen(urlname)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
```
agramfort (Member) commented Apr 30, 2018 (comment minimized)

albertcthomas (Contributor) commented Apr 30, 2018

Yes that's what I already tried. I will try again this evening. Thanks!

jnothman (Member) commented Apr 30, 2018 (comment minimized)

albertcthomas force-pushed the novelty_for_lof branch 2 times, most recently from fdeb437 to 5dd2e8c on May 2, 2018

albertcthomas (Contributor) commented Jul 18, 2018

Actually, @glemaitre what's the difference between
@pytest.mark.filterwarnings and @ignore_warnings(category=DeprecationWarning)?

albertcthomas force-pushed the novelty_for_lof branch from 5fc7ff8 to 645cb56 on Jul 18, 2018

albertcthomas (Contributor) commented Jul 19, 2018

CIs are green but I don’t know why we don’t have the codecov checks.

agramfort (Member) commented Jul 19, 2018

Good to go from my end.

Anyone for MRG+2?

GaelVaroquaux (Member) commented Jul 19, 2018

I would have liked codecov to run. I don't understand why it's not running. I'd like to check the coverage of this PR.

albertcthomas force-pushed the novelty_for_lof branch from 645cb56 to 9e59491 on Jul 19, 2018

albertcthomas (Contributor) commented Jul 19, 2018

I restarted CIs to see if we can get code coverage.

albertcthomas (Contributor) commented Jul 19, 2018

@GaelVaroquaux codecov is green

glemaitre (Contributor) commented Jul 19, 2018

> Actually, @glemaitre what's the difference between @pytest.mark.filterwarnings and @ignore_warnings(category=DeprecationWarning)?

We can do more fine-grained ignoring.
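The practical difference: @ignore_warnings(category=DeprecationWarning) silences a whole category, while @pytest.mark.filterwarnings takes a warnings-module filter string such as 'ignore:novelty parameter:DeprecationWarning', which can additionally match on the message text. The distinction can be sketched with the plain warnings module rather than pytest itself (illustrative messages only):

```python
import warnings

def deprecated_calls():
    # Two illustrative deprecation warnings with different messages.
    warnings.warn('novelty parameter is deprecated', DeprecationWarning)
    warnings.warn('contamination default changed', DeprecationWarning)

# Coarse: ignore the whole category, as ignore_warnings(category=...) does.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    warnings.filterwarnings('ignore', category=DeprecationWarning)
    deprecated_calls()
coarse = len(caught)  # 0: both warnings are silenced

# Fine-grained: ignore only messages matching a regex, as a filter string
# like 'ignore:novelty parameter:DeprecationWarning' would in the marker.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    warnings.filterwarnings('ignore', message='novelty parameter',
                            category=DeprecationWarning)
    deprecated_calls()
fine = len(caught)  # 1: only the non-matching warning is recorded
```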

glemaitre (Contributor) commented Jul 19, 2018

It looks good to me. I would however change the deprecation warning to a future warning for the outlier methods.

glemaitre merged commit 4d0a262 into scikit-learn:master on Jul 19, 2018

7 checks passed

  • ci/circleci: deploy (Your tests passed on CircleCI!)
  • ci/circleci: python2 (Your tests passed on CircleCI!)
  • ci/circleci: python3 (Your tests passed on CircleCI!)
  • codecov/patch (100% of diff hit, target 95.29%)
  • codecov/project (95.3%, +<.01% compared to 5140762)
  • continuous-integration/appveyor/pr (AppVeyor build succeeded)
  • continuous-integration/travis-ci/pr (The Travis CI build passed)

scikit-learn 0.20 automation moved this from "PRs tagged" to "Done" on Jul 19, 2018

albertcthomas (Contributor) commented Jul 19, 2018

Thanks @jnothman @agramfort @GaelVaroquaux and @glemaitre for the help and the reviews!
