[MRG+2] LOF algorithm (Anomaly Detection) #5279

Merged

merged 18 commits on Oct 25, 2016

Conversation

ngoix (Contributor) commented Sep 16, 2015

@agramfort

agramfort (Member) commented Sep 16, 2015

thanks for the early PR

let me know when you need a review, i.e. when you've addressed the standard things (tests, example, some basic doc)

jmschrei (Member) commented Sep 21, 2015

I'd also be interested in reviewing this when you've moved past the WIP stage.

ngoix (Contributor) commented Oct 9, 2015

I think it is ready for a first review @agramfort @jmschrei!

sklearn/neighbors/lof.py
__all__ = ["LOF"]
class LOFMixin(object):

agramfort (Member) Oct 9, 2015

why do you need a mixin here?

I would put all methods from Mixin into LOF class and make them private.

jmschrei (Member) Oct 9, 2015

I agree. Mixins imply that multiple estimators will be using them.

ngoix (Contributor) Oct 12, 2015

Ok

sklearn/neighbors/lof.py
"""
def __init__(self, n_neighbors=5, algorithm='auto', leaf_size=30,
metric='minkowski', p=2, metric_params=None,
n_jobs=1, **kwargs):

agramfort (Member) Oct 9, 2015

**kwargs in init are not ok. why do you need this?

ngoix (Contributor) Oct 12, 2015

Ok I will remove it.

sklearn/neighbors/lof.py
metric_params=metric_params, n_jobs=n_jobs, **kwargs)
def predict(self, X=None, n_neighbors=None):
"""Predict LOF score of X.

agramfort (Member) Oct 9, 2015

empty line missing

sklearn/neighbors/lof.py
leaf_size=leaf_size, metric=metric, p=p,
metric_params=metric_params, n_jobs=n_jobs, **kwargs)
def predict(self, X=None, n_neighbors=None):

agramfort (Member) Oct 9, 2015

predict signature should be

def predict(self, X):

that's it

jmschrei (Member) Oct 9, 2015

@agramfort mentions this, but X must be provided in predict methods.

ngoix (Contributor) Oct 12, 2015

See the discussion below: must predict not be able to handle X=None?

sklearn/neighbors/lof.py
Parameters
----------

agramfort (Member) Oct 9, 2015

remove line

sklearn/neighbors/lof.py
X : array-like, last dimension same as that of fit data, optional
(default=None)
The querry sample or samples to compute the LOF wrt to the training

agramfort (Member) Oct 9, 2015

bad indent

sklearn/neighbors/lof.py
Returns
-------
lof_scores : array of shape (n_samples,)
The LOF score of each input samples. The lower, the more normal.

agramfort (Member) Oct 9, 2015

indent

sklearn/neighbors/lof.py
def k_distance(self, X=None):
"""
Compute the k_distance and the neighborhood of querry samples X wrt

jmschrei (Member) Oct 9, 2015

This isn't a terribly informative docstring. It doesn't define what k_distance is, or what self._fit_X is. Maybe change it to describe things in terms of the algorithm, irrespective of the underlying implementation. Also, query not querry.

sklearn/neighbors/lof.py
Parameters
----------

jmschrei (Member) Oct 9, 2015

remove line

sklearn/neighbors/lof.py
Parameters
----------

jmschrei (Member) Oct 9, 2015

remove line

sklearn/neighbors/tests/test_lof.py
# Test LOF
clf = neighbors.LOF()
clf.fit(X)
pred = clf.predict()

jmschrei (Member) Oct 9, 2015

Predict must take in values to predict anomalies in.

sklearn/neighbors/tests/test_lof.py
clf = neighbors.LOF()
clf.fit(X)
pred = clf.predict()
assert_array_equal(clf._fit_X, X)

jmschrei (Member) Oct 9, 2015

The ranking of samples as to their anomaly status must be easily received by the user; it is usually returned by the predict method. Having to go in and get an attribute is not okay.

ngoix (Contributor) Oct 12, 2015

Sorry, I don't see your point; the ranking is received directly in pred...
clf._fit_X is just the training samples (defined in the base estimator NeighborsBase); the user doesn't see or need it...

sklearn/neighbors/lof.py
"""
distances, neighbors_indices = self.kneighbors(
X=X, n_neighbors=self.n_neighbors)
neighbors_indices = neighbors_indices

jmschrei (Member) Oct 9, 2015

What is the point of this line?

ngoix (Contributor) Oct 12, 2015

Right thanks!

sklearn/neighbors/lof.py
The LRD of p.
"""
p_0 = self._fit_X if p is None else p

jmschrei (Member) Oct 9, 2015

You should have to explicitly pass in a dataset to this function.

ngoix (Contributor) Oct 12, 2015

discussion below

sklearn/neighbors/lof.py
neighbors_indices = self.neighbors_indices_fit_X_ if p is None else self.k_distance(p)[1]
n_jobs = _get_n_jobs(self.n_jobs)

jmschrei (Member) Oct 9, 2015

When you merge this into the LOF class, just call _get_n_jobs once in the __init__ function.

ngoix (Contributor) Oct 12, 2015

Ok

sklearn/neighbors/lof.py
dist = pairwise_distances(p_0, self._fit_X,
self.effective_metric_,
n_jobs=n_jobs,
**self.effective_metric_params_)

jmschrei (Member) Oct 9, 2015

What is **self.effective_metric_params_?

sklearn/neighbors/lof.py
lrd = p_lrd if p is None else self.local_reachability_density(p=None)
for j in range(p_0.shape[0]):
cpt = -1

jmschrei (Member) Oct 9, 2015

I don't understand what's going on with this cpt variable

ngoix (Contributor) Oct 12, 2015

I'll rename it neighbors_number.

sklearn/neighbors/lof.py
Parameters
----------
p : array-like of shape (n_samples, n_features)

jmschrei (Member) Oct 9, 2015

p should be named X

sklearn/neighbors/lof.py
p_lrd = self.local_reachability_density(p)
lrd_ratios_array = np.zeros((p_0.shape[0], self.n_neighbors))
# Avoid re-computing p_lrd if p is None:

jmschrei (Member) Oct 9, 2015

p should never be None

sklearn/neighbors/lof.py
class LOF(NeighborsBase, KNeighborsMixin, LOFMixin, UnsupervisedMixin):
"""Unsupervised Outlier Detection.

jmschrei (Member) Oct 9, 2015

I like the documentation.

jmschrei (Member) commented Oct 9, 2015

I would like to see some more extensive unit tests, particularly in cases where the algorithm should fail (wrong dimensions or other incorrect types of data passed in). I'll be able to look more at the performance of the code once you merge the mixin with the other class, and change the API to always take in an X matrix.

jmschrei (Member) commented Oct 9, 2015

I'd also like to see an example of it performing against a/many current algorithm(s), so that it is clear it is a valuable contribution.

ngoix (Contributor) commented Oct 12, 2015

If you have a dataset X and want to remove outliers from it, you don't want to do

fit(X)
predict(X)

because then each sample is considered in its own neighbourhood: in predict(X), X is considered as 'new observations'.

What the user wants is:

for each x in X,
fit(X-{x})
predict(x)

which is allowed by

fit(X)
predict()

It is like looking for k-nearest-neighbors of points in a dataset X: you can do:

neigh = NearestNeighbors()
neigh.fit(X)
neigh.kneighbors()

which is different from

neigh = NearestNeighbors()
neigh.fit(X)
neigh.kneighbors(X)

I can make predict() have as signature

def predict(self, X):

and allow it to take X=None as an argument... Is that allowed?
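
For reference, a runnable sketch of the NearestNeighbors behaviour being described here (the toy data and printed values are illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [1.0], [10.0]])
neigh = NearestNeighbors(n_neighbors=1).fit(X)

# kneighbors() with no argument: each training point is matched to its
# nearest *other* training point (no point is its own neighbor).
print(neigh.kneighbors(return_distance=False))     # [[1] [0] [1]]

# kneighbors(X): X is treated as new query points, so every point
# finds itself at distance 0.
print(neigh.kneighbors(X, return_distance=False))  # [[0] [1] [2]]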

agramfort (Member) commented Oct 12, 2015

If you have a dataset X and want to remove outliers from it, you don't want to do

fit(X)
predict(X)

implementing a fit_predict(X) method is the way to go.
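
For illustration, a minimal sketch of the resulting usage, assuming the LocalOutlierFactor name and the public fit_predict API this PR eventually settled on:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = np.r_[rng.randn(20, 2), np.array([[6.0, 6.0]])]  # 20 inliers + 1 clear outlier

clf = LocalOutlierFactor(n_neighbors=5)
# fit_predict scores each training sample against the *other* samples
# (no sample is its own neighbor) and returns +1/-1 labels.
y = clf.fit_predict(X)  # expected: the last label is -1 (outlier)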

ngoix (Contributor) commented Oct 12, 2015

Ok thanks!

ngoix (Contributor) commented Oct 12, 2015

I merged the mixin into the LOF class, changed the API and added a comparison example.
@agramfort @jmschrei what do you think?

examples/covariance/plot_outlier_detection.py
clf.fit(X)
y_pred = clf.decision_function(X).ravel()
if clf_name=="Local Outlier Factor":

agramfort (Member) Oct 12, 2015

pep8
examples/covariance/plot_outlier_detection.py
y_pred = -clf.fit_predict(X).ravel()
else:
clf.fit(X)
y_pred = clf.decision_function(X).ravel()

agramfort (Member) Oct 12, 2015

this should be consistent between all estimators. LOF should not be a special case.


ngoix (Contributor) Oct 13, 2015

I agree, but then how can we differentiate a new sample that has already been sampled in training from a training sample we want to score?

examples/neighbors/plot_lof.py
LOF example
==========================================
An example using LOF for anomaly detection.

agramfort (Member) Oct 12, 2015

You should be more explicit.

LOF is jargon. I vote to rename the object LocalOutlierFactor to be explicit.
Also never use the acronym without writing the full words.

you could add a ref to the doc of this example too.

sklearn/neighbors/lof.py
class LOF(NeighborsBase, KNeighborsMixin, UnsupervisedMixin):
"""Unsupervised Outlier Detection.

agramfort (Member) Oct 12, 2015

Local Outlier Factor (LOF)

LOF is an unsupervised outlier detection method. It returns ...

sklearn/neighbors/lof.py
leaf_size=leaf_size, metric=metric, p=p,
metric_params=metric_params, n_jobs=n_jobs)
self.get_n_jobs_ = _get_n_jobs(self.n_jobs)

agramfort (Member) Oct 12, 2015

this should be in the fit method. init just sets attributes from params.

jmschrei (Member) Oct 12, 2015

n_jobs is one of the params, though. I think self.n_jobs = _get_n_jobs(n_jobs) would be best. You're just correcting the number of jobs being passed in. Though this might be a correction for _init_params in NeighborsBase?

ngoix (Contributor) Oct 13, 2015

The base class does _get_n_jobs(self.n_jobs); isn't that a problem?

agramfort (Member) Oct 13, 2015

don't do it twice then

ngoix (Contributor) Oct 13, 2015

Yes but the result is not kept in memory!

tguillemot (Contributor) Feb 24, 2016

You can remove the line self.get_n_jobs_ = _get_n_jobs(self.n_jobs), but you then have to add n_jobs = _get_n_jobs(self.n_jobs) in the method _local_reachability_density (see other comment)

ngoix (Contributor) Mar 1, 2016

ok thanks

sklearn/neighbors/lof.py
Parameters
----------
X : array-like, last dimension same as that of fit data, optional

agramfort (Member) Oct 12, 2015

it's usually implicit in all docstrings.

X : array-like, shape (n_samples, n_features)

it's not optional.

sklearn/neighbors/lof.py
reach_dist_array = np.zeros((p_0.shape[0], self.n_neighbors))
for j in range(p_0.shape[0]):

agramfort (Member) Oct 12, 2015

no way to vectorize this? a loop over samples is costly.

ngoix (Contributor) Oct 13, 2015

I tried but didn't succeed...

sklearn/neighbors/lof.py
Attributes
----------

agramfort (Member) Oct 12, 2015

???
sklearn/neighbors/lof.py
The LOF of p. The lower, the more normal.
"""
p_0 = self._fit_X if p is None else p

agramfort (Member) Oct 12, 2015

I agree with @jmschrei that keeping X as an attribute is a bit brutal. Do you really need it just to skip the first sample? Can't you just check if min dist is 0?


ngoix (Contributor) Oct 13, 2015

We only keep X as an attribute when p is None, i.e. when the query samples are equal to the training ones (in this case, indeed, the base class method KNeighborsMixin.kneighbors checks if min dist is 0 and removes the first sample). I believe this case is optimal...

If the query samples are different from the training ones, p is not None and p contains the query samples. As noted in the comment on the line below, this case is not really optimal for now, since only the neighbors of neighbors of p need to be considered (LOF is local).

ngoix (Contributor) Jan 16, 2016

Just to be sure we are on the same page: self._fit_X is an attribute of the base class; it is kept in memory anyway.

sklearn/neighbors/lof.py
# May be optimized (if p is not None) by only computing it for
# X = neighbors or neighbors of p
self.neighbors_indices_p_ = self.neighbors_indices_fit_X_ if p is None else self._k_distance(p)[1]

agramfort (Member) Oct 12, 2015

line too long ...

ngoix (Contributor) Oct 13, 2015

Yes I don't know how to write it on two lines...

agramfort (Member) Oct 13, 2015

do it like this

self.neighbors_indices_p_ = \
    self.neighbors_indices_fit_X_ if p is None else self._k_distance(p)[1]

ngoix (Contributor) Oct 13, 2015

ok thanks!

sklearn/neighbors/__init__.py
@@ -26,4 +27,5 @@
     'kneighbors_graph',
     'radius_neighbors_graph',
     'KernelDensity',
-    'LSHForest']
+    'LSHForest',
+    'LOF']

jmschrei (Member) Oct 12, 2015

Expand LOF out here too

ngoix (Contributor) commented Oct 18, 2016

Thanks @agramfort, I've made the changes.

agramfort (Member) commented Oct 18, 2016

@ngoix please update what's new and let's merge this!

ngoix added some commits Sep 14, 2015

LOF algorithm
add tests and example

fix DeprecationWarning by reshape(1,-1) on one-sample data

LOF with inheritance

lof and lof2 return same score

fix bugs

fix bugs

optimized and cosmit

rm lof2

cosmit

rm MixinLOF + fit_predict

fix travis - optimize pairwise_distance like in KNeighborsMixin.kneighbors

add comparison example + doc

LOF -> LocalOutlierFactor
cosmit

change LOF API:
-fit(X).predict() and fit(X).decision_function() do prediction on X without
 considering samples as their own neighbors (i.e. without considering X as a
 new dataset, as fit(X).predict(X) does)
-rm fit_predict() method
-add a contamination parameter so that predict returns a binary value, like
 other anomaly detection algos

cosmit

doc + debug example

correction doc

pass on doc + examples

pep8 + fix warnings

first attempt at fixing API issues

minor changes

take into account tguillemot's advice

-remove pairwise_distance calculation as too heavy in memory
-add benchmarks

cosmit

minor changes + deals with duplicates

fix deprecation warnings
ngoix (Contributor) commented Oct 19, 2016

done!

agramfort (Member) commented Oct 20, 2016

@amueller want to take a final look?

for me it's good enough to merge

amueller (Member)

I think caching the LRD on the training set would be good (and would actually make the code easier to follow). I think either predict and decision_function should both be private or neither; I kinda tend towards both private, as making things public later is easier than hiding them.

The rest is mostly minor, though how to tune n_neighbors seems pretty important.

sklearn/neighbors/lof.py
Returns
-------
lof_scores : array, shape (n_samples,)
The Local Outlier Factor of each input samples. The lower,

amueller (Member) Oct 20, 2016

This seems to contradict the title of the docstring.

ngoix (Contributor) Oct 22, 2016

Yes it is -lof_scores

sklearn/neighbors/lof.py
return is_inlier
def decision_function(self, X):
"""Opposite of the Local Outlier Factor of X (as bigger is better).

amueller (Member) Oct 20, 2016

I think the docstring should be more explicit. Is low outlier or high outlier?
Actually, to be consistent with the other estimators, I think negative needs to be outlier.

ngoix (Contributor) Oct 22, 2016

I don't think so; for all the decision functions, bigger is better (large values correspond to inliers). For prediction, negative values (-1) correspond to outliers, though. (It's true that this is a bit odd.)
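
For reference, a runnable illustration of that shared convention using IsolationForest (exact scores vary by scikit-learn version, but the signs of the labels do not):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.r_[rng.randn(100, 2), np.array([[8.0, 8.0]])]
clf = IsolationForest(random_state=0).fit(X)

# decision_function: bigger is better, i.e. inliers score higher ...
print(clf.decision_function([[0.0, 0.0], [8.0, 8.0]]))
# ... while predict uses +1 for inliers and -1 for outliers.
print(clf.predict([[0.0, 0.0], [8.0, 8.0]]))  # expected: [ 1 -1]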

@@ -18,6 +18,9 @@
hence more adapted to large-dimensional settings, even if it performs
quite well in the examples below.
- using the Local Outlier Factor to measure the local deviation of a given

amueller (Member) Oct 20, 2016

It's kinda odd that this example lives in this folder... but whatever..

ngoix (Contributor) Oct 22, 2016

Yes, very weird! It is the folder of the first outlier detection algorithm in scikit-learn.

sklearn/neighbors/lof.py
# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):

amueller (Member) Oct 20, 2016

this == raises a deprecation warning

/home/andy/checkout/scikit-learn/sklearn/neighbors/lof.py:279: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.
np.all(self.neighbors_indices_fit_X == neighbors_indices_X)):

This means they have different size, I think. So I guess you should check the shape first?

Also happens for the line above.
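
A minimal demonstration of the guard being suggested (array names are hypothetical):

import numpy as np

a = np.zeros((5, 3))
b = np.zeros((4, 3))

# Elementwise == on differently-shaped arrays is deprecated and will
# eventually raise. np.array_equal checks the shapes first and simply
# returns False on a mismatch, with no warning.
same = np.array_equal(a, b)  # False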

The question is not, how isolated the sample is, but how isolated it is
with respect to the surrounding neighborhood.
This strategy is illustrated below.

amueller (Member) Oct 20, 2016

I don't feel that the example illustrates the point that was just made about the different densities. I'm fine to leave it as-is but I don't get a good idea of the global vs local. It would be nice to also illustrate a failure mode maybe?

ngoix (Contributor) Oct 22, 2016

No global vs local anymore!

sklearn/neighbors/lof.py
# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
lrd = self._local_reachability_density(

amueller (Member) Oct 20, 2016

It seems that lrd is "small" compared to _distances_fit_X_ and _neighbors_indices_fit_X_. Why not compute it in fit and store it once and for all? You are currently recomputing it on every call to _local_outlier_factor.
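
A standalone sketch of the quantity being discussed (a simplified local reachability density computation; the helper name and data are illustrative, not the PR's exact code):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_reachability_density(dist, ind, k_dist):
    # reach_dist(p, o) = max(k_distance(o), d(p, o))
    # lrd(p) = 1 / mean over o in N_k(p) of reach_dist(p, o)
    reach_dist = np.maximum(dist, k_dist[ind])
    return 1.0 / np.mean(reach_dist, axis=1)

X = np.random.RandomState(0).randn(30, 2)
nn = NearestNeighbors(n_neighbors=5).fit(X)
dist, ind = nn.kneighbors()  # training neighborhoods, self excluded
k_dist = dist[:, -1]         # k-distance of each training sample
lrd = local_reachability_density(dist, ind, k_dist)
# Computing lrd once here (i.e. at fit time) lets _local_outlier_factor
# reuse it instead of recomputing it on every call.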

sklearn/neighbors/lof.py
Parameters
----------
distances_X : array, shape (n_query, self.n_neighbors)
Distances to the neighbors (in the training samples self._fit_X) of

amueller (Member) Oct 20, 2016

I would put backticks around _fit_X to be safe ;)

ngoix (Contributor) Oct 24, 2016

Do you mean replacing self._fit_X by self._fit_X or self._fit_X or just by _fit_X? I don't understand the purpose...

amueller (Member) Oct 24, 2016

I meant putting backticks around self._fit_X. a) for nicer highlighting b) I'm not sure sphinx will render the current version correctly because of the underscore. But I might be paranoid.

sklearn/neighbors/tests/test_lof.py
score = clf.fit(X).outlier_factor_
assert_array_equal(clf._fit_X, X)
# Assert scores are good:

amueller (Member) Oct 20, 2016

Assert the smallest outlier score is greater than the largest inlier score

ngoix (Contributor) Oct 23, 2016

ok

sklearn/neighbors/tests/test_lof.py
clf = neighbors.LocalOutlierFactor().fit(X_train)
# predict scores (the lower, the more normal)
y_pred = - clf.decision_function(X_test)

amueller (Member) Oct 20, 2016

I would find it more natural to give the outliers the negative label. If you want to leave it like this, remove the space after -

ngoix (Contributor) Oct 23, 2016

I agree, but this is to be consistent with OneClassSVM, EllipticEnvelope and IsolationForest.

sklearn/neighbors/lof.py
distance between them. This works for Scipy's metrics, but is less
efficient than passing the metric name as a string.
Distance matrices are not supported.

amueller (Member) Oct 20, 2016

I don't understand this comment.

sklearn/neighbors/tests/test_lof.py
clf = neighbors.LocalOutlierFactor().fit(X_train)
# predict scores (the lower, the more normal)
y_pred = -clf.decision_function(X_test)

amueller (Member) Oct 24, 2016

I meant changing y_test to be [0] * 20 + [-1] * 20 and then remove the -

sklearn/neighbors/lof.py
return self
def _predict(self, X=None):

amueller (Member) Oct 24, 2016

I would really like to be consistent. I don't think there's a good argument to have one but not the other. Not sure if the example is a strong enough point to make them both public.

for i, (clf_name, clf) in enumerate(classifiers.items()):
# fit the data and tag outliers
clf.fit(X)
scores_pred = clf.decision_function(X)
if clf_name == "Local Outlier Factor":

amueller (Member) Oct 24, 2016

Wait, I don't understand this. Please elaborate.

sklearn/neighbors/lof.py
Attributes
----------
outlier_factor_ : numpy array, shape (n_samples,)
The LOF of X. The lower, the more normal.

amueller (Member) Oct 24, 2016

I don't know which comment of yours refers to which comment of mine.

For the first comment: yes, I'd either do negative_outlier_factor or inlier_score or something generic?

For the second comment: the explanation of outlier_factor_ as an attribute says "The LOF of X"; what is X? It's the training set this LocalOutlierFactor estimator was trained on, right?
