[MRG+2] LOF algorithm (Anomaly Detection) #5279

Merged

merged 18 commits into scikit-learn:master on Oct 25, 2016

Conversation
@ngoix
Contributor

ngoix commented Sep 16, 2015

@agramfort

@agramfort
Member

agramfort commented Sep 16, 2015

Thanks for the early PR.

Let me know when you need a review, i.e. once you've addressed the standard things (tests, an example, some basic docs).

@jmschrei
Member

jmschrei commented Sep 21, 2015

I'd also be interested in reviewing this when you've moved past the WIP stage.

@ngoix
Contributor

ngoix commented Oct 9, 2015

I think it is ready for a first review, @agramfort @jmschrei!

@jmschrei
Member

jmschrei commented Oct 9, 2015

I would like to see some more extensive unit tests, particularly in cases where the algorithm should fail (wrong dimensions or other incorrect types of data passed in). I'll be able to look more at the performance of the code once you merge the mixin with the other class, and change the API to always take in an X matrix.

@jmschrei
Member

jmschrei commented Oct 9, 2015

I'd also like to see an example of it performing against one or more current algorithms, so that it is clear it is a valuable contribution.

@ngoix
Contributor

ngoix commented Oct 12, 2015

If you have a dataset X and want to remove outliers from it, you don't want to do

fit(X)
predict(X)

because then each sample is counted in its own neighbourhood: in predict(X), X is treated as 'new observations'.

What the user wants is:

for each x in X,
    fit(X - {x})
    predict(x)

which is what

fit(X)
predict()

allows.

It is like looking for the k nearest neighbors of points in a dataset X: you can do

neigh = NearestNeighbors()
neigh.fit(X)
neigh.kneighbors()

which is different from

neigh = NearestNeighbors()
neigh.fit(X)
neigh.kneighbors(X)

I can give predict() the signature

def predict(self, X):

and allow X=None as an argument... Is that allowed?
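A minimal illustration of the difference (toy data, not from the PR), using scikit-learn's NearestNeighbors:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [1.0], [10.0]])
neigh = NearestNeighbors(n_neighbors=1).fit(X)

# With no argument, each training sample is excluded from its own
# neighborhood: the nearest neighbor of [0.] is [1.], and so on.
dist, ind = neigh.kneighbors()
print(ind.ravel())   # [1 0 1]

# With X passed explicitly, the samples are treated as new observations,
# so each point is its own nearest neighbor at distance 0.
dist, ind = neigh.kneighbors(X)
print(ind.ravel())   # [0 1 2]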

@agramfort
Member

agramfort commented Oct 12, 2015

If you have a dataset X and want to remove outliers from it, you don't want to do

fit(X)
predict(X)

Implementing a fit_predict(X) method is the way to go.
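As a rough sketch of that pattern (not the PR's actual code; the class name and the placeholder scoring below are made up), fit_predict fits on X and then labels the training samples directly:

import numpy as np

class OutlierFitPredictSketch:
    def __init__(self, contamination=0.1):
        self.contamination = contamination

    def fit(self, X):
        # A real estimator would compute outlier scores of the training
        # samples here (e.g. LOF scores); higher means more normal.
        self.scores_ = self._score_training_samples(np.asarray(X))
        return self

    def fit_predict(self, X):
        # Fit on X and return +1 (inlier) / -1 (outlier) for the samples of X,
        # thresholded so that roughly `contamination` of them are flagged.
        self.fit(X)
        threshold = np.percentile(self.scores_, 100.0 * self.contamination)
        return np.where(self.scores_ >= threshold, 1, -1)

    def _score_training_samples(self, X):
        # Placeholder scoring, purely illustrative: distance to the data mean.
        return -np.linalg.norm(X - X.mean(axis=0), axis=1)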

@ngoix
Contributor

ngoix commented Oct 12, 2015

OK, thanks!

@ngoix
Contributor

ngoix commented Oct 12, 2015

I merged the mixin with the LOF class, changed the API, and added a comparison example.
@agramfort @jmschrei, what do you think?

@ngoix
Contributor

ngoix commented Oct 18, 2016

Thanks @agramfort, I've made the changes.

@agramfort
Member

agramfort commented Oct 18, 2016

@ngoix please update the what's new entry and let's merge this!

ngoix added some commits Sep 14, 2015

LOF algorithm
add tests and example

fix DepreciationWarning by reshape(1,-1) one-sample data

LOF with inheritance

lof and lof2 return same score

fix bugs

fix bugs

optimized and cosmit

rm lof2

cosmit

rm MixinLOF + fit_predict

fix travis - optimize pairwise_distance like in KNeighborsMixin.kneighbors

add comparison example + doc

LOF -> LocalOutlierFactor
cosmit

change LOF API:
-fit(X).predict() and fit(X).decision_function() do prediction on X without
 considering samples as their own neighbors (ie without considering X as a
 new dataset as does fit(X).predict(X))
-rm fit_predict() method
-add a contamination parameter st predict returns a binary value like other
 anomaly detection algos

cosmit

doc + debug example

correction doc

pass on doc + examples

pep8 + fix warnings

first attempt at fixing API issues

minor changes

takes into account tguillemot advice

-remove pairwise_distance calculation as to heavy in memory
-add benchmarks

cosmit

minor changes + deals with duplicates

fix depreciation warnings

@ngoix
Contributor

ngoix commented Oct 19, 2016

done!

@agramfort
Member

agramfort commented Oct 20, 2016

@amueller want to take a final look?

For me it's good enough to merge.

@amueller

I think caching the LRD on the training set would be good (and would actually make the code easier to follow). I think predict and decision_function should either both be private or neither; I lean towards both, since making a method public later is easier than hiding one.
The rest is mostly minor, though how to tune n_neighbors seems pretty important.
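For reference, a rough sketch (not the PR's implementation) of how the training-set local reachability density (LRD) can be computed once and cached, with the LOF score derived from it:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def lof_scores(X, n_neighbors=20):
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    # Neighbors of each training sample, the sample itself excluded.
    dist, ind = nn.kneighbors()
    # k-distance of each sample = distance to its k-th nearest neighbor.
    k_dist = dist[:, -1]
    # Reachability distance to each neighbor: max(k-distance(neighbor), d).
    reach_dist = np.maximum(k_dist[ind], dist)
    # Cached local reachability density of the training set.
    lrd = 1.0 / np.mean(reach_dist, axis=1)
    # LOF score: mean LRD of the neighbors divided by the sample's own LRD.
    return np.mean(lrd[ind], axis=1) / lrd, lrd

scores, lrd_train = lof_scores(np.random.RandomState(0).randn(100, 2))
print(scores[:5])   # values close to 1 indicate inliers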

@@ -18,6 +18,9 @@
hence more adapted to large-dimensional settings, even if it performs
quite well in the examples below.
- using the Local Outlier Factor to measure the local deviation of a given

@amueller
Member

amueller Oct 20, 2016

It's kinda odd that this example lives in this folder... but whatever.

@ngoix
Contributor

ngoix Oct 22, 2016

Yes, very weird! It is the folder of the first outlier detection algorithm in scikit-learn.

The question is not, how isolated the sample is, but how isolated it is
with respect to the surrounding neighborhood.
This strategy is illustrated below.

@amueller
Member

amueller Oct 20, 2016

I don't feel that the example illustrates the point that was just made about the different densities. I'm fine leaving it as-is, but I don't get a good idea of the global vs. local distinction. It would be nice to also illustrate a failure mode, maybe?

@ngoix
Contributor

ngoix Oct 22, 2016

No global vs. local anymore!
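To illustrate the point about different densities (a standalone sketch using the estimator as it ends up after this PR, not the PR's own example): a point sitting just outside a very tight cluster is a local outlier even though it lies much closer to the bulk of the data than the members of a looser cluster.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
dense = 0.05 * rng.randn(50, 2)              # tight cluster around the origin
sparse = 2.0 * rng.randn(50, 2) + [10, 10]   # loose cluster far away
local_outlier = np.array([[0.5, 0.5]])       # close to the dense cluster, but off it
X = np.vstack([dense, sparse, local_outlier])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
print(labels[-1])                        # -1: flagged relative to its dense neighborhood
print(lof.negative_outlier_factor_[-1])  # far more negative than for the sparse cluster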

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    clf.fit(X)
    scores_pred = clf.decision_function(X)
    if clf_name == "Local Outlier Factor":

@amueller
Member

amueller Oct 24, 2016

Wait, I don't understand this. Please elaborate.


@amueller merged commit 788a458 into scikit-learn:master on Oct 25, 2016

2 of 3 checks passed:

- continuous-integration/appveyor/pr: waiting for the AppVeyor build to complete
- ci/circleci: tests passed on CircleCI
- continuous-integration/travis-ci/pr: the Travis CI build passed
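As merged, the estimator is used roughly like this (a short usage sketch based on the API changes listed in the commits; the data here is arbitrary):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).randn(200, 2)

# contamination sets the proportion of samples that get flagged as outliers,
# so the output is a binary label like the other anomaly detection estimators.
clf = LocalOutlierFactor(n_neighbors=35, contamination=0.1)
y_pred = clf.fit_predict(X)              # +1 for inliers, -1 for outliers

print(np.mean(y_pred == -1))             # about 0.1
print(clf.n_neighbors_)                  # effective number of neighbors used
print(clf.negative_outlier_factor_[:5])  # the lower, the more abnormal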
@amueller
Member

amueller commented Oct 25, 2016

thanks :)

@raghavrv
Member

raghavrv commented Oct 25, 2016

Hurray 🍻 Thanks @ngoix !!

@tguillemot
Contributor

tguillemot commented Oct 25, 2016

Woohoo 🍻 !!

@albertcthomas
Contributor

albertcthomas commented Oct 25, 2016

Thanks @ngoix !!

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 25, 2016

@agramfort
Member

agramfort commented Oct 25, 2016

Congrats!

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

[MRG+2] LOF algorithm (Anomaly Detection) (#5279)
* LOF algorithm

add tests and example

fix DepreciationWarning by reshape(1,-1) one-sample data

LOF with inheritance

lof and lof2 return same score

fix bugs

fix bugs

optimized and cosmit

rm lof2

cosmit

rm MixinLOF + fit_predict

fix travis - optimize pairwise_distance like in KNeighborsMixin.kneighbors

add comparison example + doc

LOF -> LocalOutlierFactor
cosmit

change LOF API:
-fit(X).predict() and fit(X).decision_function() do prediction on X without
 considering samples as their own neighbors (ie without considering X as a
 new dataset as does fit(X).predict(X))
-rm fit_predict() method
-add a contamination parameter st predict returns a binary value like other
 anomaly detection algos

cosmit

doc + debug example

correction doc

pass on doc + examples

pep8 + fix warnings

first attempt at fixing API issues

minor changes

takes into account tguillemot advice

-remove pairwise_distance calculation as to heavy in memory
-add benchmarks

cosmit

minor changes + deals with duplicates

fix depreciation warnings

* factorize the two for loops

* take into account @albertthomas88 review and cosmit

* fix doc

* alex review + rebase

* make predict private add outlier_factor_ attribute and update tests

* make fit_predict take y argument

* fix benchmarks file

* update examples

* make decision_function public (rm X=None default)

* fix travis

* take into account tguillemot review + remove useless k_distance function

* fix broken links :meth:`kneighbors`

* cosmit

* whatsnew

* amueller review + remove _local_outlier_factor method

* add n_neighbors_ parameter the effective nb neighbors we use

* make decision_function private and negative_outlier_factor attribute

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

[MRG+2] LOF algorithm (Anomaly Detection) (#5279)

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+2] LOF algorithm (Anomaly Detection) (#5279)

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+2] LOF algorithm (Anomaly Detection) (#5279)

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+2] LOF algorithm (Anomaly Detection) (#5279)