
[MRG+1] Novelty detection for LOF #10700

Merged (24 commits) Jul 19, 2018

Conversation

@albertcthomas (Collaborator) commented Feb 25, 2018

Fixes #10191

What does this implement/fix? Explain your changes.

This PR will allow users to use LOF for novelty detection as suggested in this comment of PR #9015. An argument novelty=True or False is added in __init__:

  • if novelty=False (default), lof.predict(X) raises ValueError('predict is not available when novelty=False, use fit_predict if you want to predict on training data. Use novelty=True if you want to use LOF for novelty detection and predict on new unseen data.'). A similar error is raised for lof.decision_function(X) and lof.score_samples(X).
  • if novelty=True, lof.fit_predict(X) raises ValueError('fit_predict is not available when novelty=True. Use novelty=False if you want to predict on the training set.'), but lof.predict(X), lof.decision_function(X) and lof.score_samples(X) are OK.

This way users can use LOF for novelty detection, but by default (novelty=False) they will get an error with an informative message telling them to use novelty=True if they really want to predict on new data.
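To make the two modes concrete, here is a minimal sketch of the intended usage, assuming the API as merged (LocalOutlierFactor lives in sklearn.neighbors from scikit-learn 0.20 on; the data is illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X_train = np.array([[-1.1], [0.2], [0.3], [0.4], [101.1]])
X_new = np.array([[0.0], [100.0]])

# Default outlier-detection mode: only fit_predict is meant to be used.
lof = LocalOutlierFactor(n_neighbors=2, novelty=False)
labels = lof.fit_predict(X_train)  # 1 for inliers, -1 for outliers

# Novelty-detection mode: fit on training data, predict on unseen data.
lof = LocalOutlierFactor(n_neighbors=2, novelty=True).fit(X_train)
print(lof.predict(X_new))            # labels in {1, -1}
print(lof.decision_function(X_new))  # positive values mean inlier
```

In the first mode, calling predict, decision_function or score_samples raises the informative error described above.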

Any other comments?

This PR also refactors the documentation on outlier detection: in particular, plot_outlier_detection.py is removed in favor of plot_anomaly_comparison.py.

@jnothman (Member) left a comment

So we have two big questions:

  • does LOF make sense as an inductive model?
  • if so, is it necessary to make two different modes?

It doesn't look like the fitting is any different in the inductive case from fit_predict. Is the main concern that prediction on the training data is not identical to the results of fit_predict? We have such cases with {fit_,}transform in the works, FWIW.
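The concern can be made concrete with a small sketch (using the merged API; the data is illustrative): on the same data, fit_predict and fit(...).predict need not coincide, because the training-time LOF scores exclude each sample from its own neighborhood, whereas predict treats the passed data as unseen.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(50, 2), [[4.0, 4.0]]]  # cluster plus one outlier

# Transductive use: labels for the training samples themselves.
labels_fit_predict = LocalOutlierFactor(n_neighbors=10).fit_predict(X)

# Inductive use: fit, then score X as if it were new data.
clf = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(X)
labels_predict = clf.predict(X)

# The two labelings need not agree, which is why the two modes are
# guarded behind the novelty flag rather than mixed silently.
```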

@albertcthomas (Collaborator, Author) commented Feb 26, 2018

The main concern is indeed that fit_predict(X) would not be identical to fit(X).predict(X).


       'considering the negative_outlier_factor_ attribute. Use '
       'novelty=True if you want to use LOF for novelty detection and '
       'compute score_samples for new unseen data.')
raise NotImplementedError(msg)
@agramfort (Member) commented Feb 27, 2018

I would structure the code differently.

if not self.novelty:
    raise ...

...

return ...

It feels better to first capture errors and to end the function with a return.
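The suggested structure could be sketched like this (a toy stand-in, not the actual scikit-learn implementation):

```python
class ToyLOF:
    """Toy class illustrating the guard-clause-then-return structure."""

    def __init__(self, novelty=False):
        self.novelty = novelty

    def score_samples(self, X):
        # Capture the error case first, failing fast...
        if not self.novelty:
            raise ValueError(
                'score_samples is not available when novelty=False. '
                'Use novelty=True if you want to use LOF for novelty '
                'detection and compute score_samples on new unseen data.')
        # ...then do the actual work (placeholder computation here)...
        scores = [float(len(x)) for x in X]
        # ...and end the function with a return.
        return scores
```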

@agramfort (Member) commented Feb 27, 2018

I find this solution very elegant and rigorous. Using the id will not be a robust solution.
And we still don't break the contract that fit(X).predict(X) == fit_predict(X) unless the user explicitly asks for it.


if self.novelty:
    msg = ('fit_predict is not available when novelty=True. Use '
           'novelty=False if you want to predict on the training set.')
    raise NotImplementedError(msg)
@jnothman (Member) commented Mar 4, 2018

I don't think this is the right error type. It makes it seem like one day we will implement this case.

@jnothman (Member) commented Mar 4, 2018

A TypeError would be more normal for a bad call, while a ValueError would be possible too. Neither is quite satisfying.

@agramfort (Member) commented Mar 4, 2018

I would use ValueError.

@jnothman (Member) commented Mar 4, 2018

Test failures. I'm happy with this solution. I'm still not sure we can maintain similar assurances for all Transformers, though.


       'if you want to predict on training data. Use novelty=True if '
       'you want to use LOF for novelty detection and predict on new '
       'unseen data.')
raise NotImplementedError(msg)
@agramfort (Member) commented Mar 4, 2018

ValueError

As I said elsewhere, I think it is more natural to finish a function with a return than a raise, though.

@albertcthomas (Collaborator, Author) commented Mar 5, 2018

Thanks for your comments and reviews @jnothman and @agramfort. I think this is a good solution, as some people are not ready to break fit(X).predict(X) == fit_predict(X) (for what I think are good reasons). I'm OK with ValueError; in a way the error is related to the value of the novelty parameter in this case. I will finish this PR, I just wanted to be sure about this before completely implementing this feature.


@albertcthomas (Collaborator, Author) commented Mar 5, 2018

Unless this kind of solution can be more generally applied (i.e. not just for LOF or outlier detection estimators), novelty seems good to me as well.

@albertcthomas (Collaborator, Author) commented Apr 8, 2018

BTW, something's strange in the CI tests: the example examples/covariance/plot_outlier_detection.py should fail, as it uses the private method _decision_function, which I removed in the previous commits.

Running the example locally returns:

AttributeError: 'LocalOutlierFactor' object has no attribute '_decision_function'

@albertcthomas (Collaborator, Author) commented Apr 8, 2018

I also suggest removing this example (examples/covariance/plot_outlier_detection.py), since we now have examples/plot_anomaly_comparison.py. WDYT?

@albertcthomas (Collaborator, Author) commented Apr 8, 2018

For plot_lof_outlier_detection.py, I refitted the model with novelty=True in order to show the level sets of decision_function, but that might confuse users. We should maybe remove this and only illustrate the use of the negative_outlier_factor_ attribute.
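The refit being described could look roughly like this (a sketch assuming the merged API; the sample data and grid bounds are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)

# Refit with novelty=True so that decision_function is available,
# then evaluate it on a grid to draw the level sets.
clf = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)  # ready for plt.contourf(xx, yy, Z)
```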

@albertcthomas (Collaborator, Author) commented Apr 8, 2018

> ok but I would suggest to reuse the blob of text from the doc at the top.
> At least some parts.
> Check that you don't break any link in the doc.

@agramfort you are saying this about removing plot_outlier_detection.py?

@sklearn-lgtm commented Apr 8, 2018

This pull request introduces 1 alert when merging 25e9362 into bfad4da - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

@albertcthomas (Collaborator, Author) commented Apr 30, 2018

I'm almost done here, I just need to fix a couple of typos in the doc, but CircleCI returns the following error. I don't know if it's related to this PR.

Unexpected failing examples:
/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py", line 75, in <module>
    data = fetch_mldata('mauna-loa-atmospheric-co2').data
  File "/home/circleci/project/sklearn/datasets/mldata.py", line 154, in fetch_mldata
    mldata_url = urlopen(urlname)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error

@albertcthomas (Collaborator, Author) commented Apr 30, 2018

Yes that's what I already tried. I will try again this evening. Thanks!


@albertcthomas albertcthomas force-pushed the novelty_for_lof branch 2 times, most recently from fdeb437 to 5dd2e8c May 2, 2018
@albertcthomas (Collaborator, Author) commented Jul 18, 2018

Actually, @glemaitre what's the difference between
@pytest.mark.filterwarnings and @ignore_warnings(category=DeprecationWarning)?

@albertcthomas (Collaborator, Author) commented Jul 19, 2018

CIs are green but I don’t know why we don’t have the codecov checks.

@agramfort (Member) commented Jul 19, 2018

Good to go from my end.

Anyone for MRG+2?

@GaelVaroquaux (Member) commented Jul 19, 2018

I would have liked codecov to run. I don't understand why it's not running. I'd like to check the coverage of this PR.

@albertcthomas (Collaborator, Author) commented Jul 19, 2018

I restarted CIs to see if we can get code coverage.

@albertcthomas (Collaborator, Author) commented Jul 19, 2018

@GaelVaroquaux codecov is green.

@glemaitre (Contributor) commented Jul 19, 2018

> Actually, @glemaitre what's the difference between @pytest.mark.filterwarnings and @ignore_warnings(category=DeprecationWarning)?
We can make the ignoring more fine-grained.
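For context: pytest's filterwarnings mark accepts the same filter specifications as the stdlib warnings machinery, which is what allows ignoring a single message rather than a whole category the way ignore_warnings(category=DeprecationWarning) does. A sketch of the underlying mechanism (the warning message here is made up for illustration):

```python
import warnings

# With @pytest.mark.filterwarnings("ignore:novelty:DeprecationWarning"),
# only DeprecationWarnings whose message starts with "novelty" would be
# silenced; other warnings still surface. The stdlib equivalent:

def deprecated_call():
    warnings.warn('novelty parameter is deprecated', DeprecationWarning)
    return 42

with warnings.catch_warnings():
    warnings.filterwarnings('ignore', message='novelty',
                            category=DeprecationWarning)
    result = deprecated_call()  # warning is silenced

print(result)
```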

@glemaitre (Contributor) commented Jul 19, 2018

It looks good to me. I would however change the deprecation warning to a future warning for the outlier methods.

@glemaitre glemaitre merged commit 4d0a262 into scikit-learn:master Jul 19, 2018
7 checks passed
scikit-learn 0.20 automation moved this from PRs tagged to Done Jul 19, 2018
@albertcthomas (Collaborator, Author) commented Jul 19, 2018

Thanks @jnothman @agramfort @GaelVaroquaux and @glemaitre for the help and the reviews!
