[MRG+2] LOF algorithm (Anomaly Detection) #5279
Conversation
Thanks for the early PR! Let me know when you need a review, i.e. when you've addressed the standard things (tests, example, some basic doc).
I'd also be interested in reviewing this when you've moved past the WIP stage.
I think it is ready for a first review @agramfort @jmschrei !
__all__ = ["LOF"]


class LOFMixin(object):
agramfort
Oct 9, 2015
Member
why do you need a mixin here?
I would put all methods from Mixin into LOF class and make them private.
jmschrei
Oct 9, 2015
Member
I agree. Mixins imply that multiple estimators will be using them.
ngoix
Oct 12, 2015
Author
Contributor
Ok
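A minimal sketch of the requested refactor, assuming the base classes this PR already inherits from (`sklearn.neighbors.base` was the module path at the time): the mixin's methods move onto LOF itself as private helpers.

```python
# Sketch only: fold the LOFMixin methods into the LOF class as private
# methods. `_k_distance` is the hypothetical private counterpart of the
# mixin's k_distance helper.
from sklearn.neighbors.base import (NeighborsBase, KNeighborsMixin,
                                    UnsupervisedMixin)


class LOF(NeighborsBase, KNeighborsMixin, UnsupervisedMixin):

    def _k_distance(self, X=None):
        # formerly LOFMixin.k_distance; the leading underscore keeps it
        # out of the public API
        return self.kneighbors(X, n_neighbors=self.n_neighbors)
```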
""" | ||
def __init__(self, n_neighbors=5, algorithm='auto', leaf_size=30, | ||
metric='minkowski', p=2, metric_params=None, | ||
n_jobs=1, **kwargs): |
agramfort
Oct 9, 2015
Member
**kwargs in init are not ok. why do you need this?
ngoix
Oct 12, 2015
Author
Contributor
Ok, I will remove it.
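For reference, a hedged sketch of the constructor without `**kwargs`, following the scikit-learn convention that `__init__` only stores its parameters verbatim (the actual PR delegates them to the `NeighborsBase` initialization instead):

```python
class LOF(NeighborsBase, KNeighborsMixin, UnsupervisedMixin):

    def __init__(self, n_neighbors=5, algorithm='auto', leaf_size=30,
                 metric='minkowski', p=2, metric_params=None, n_jobs=1):
        # every argument is spelled out so get_params()/set_params()
        # and estimator cloning work; no catch-all **kwargs
        self.n_neighbors = n_neighbors
        self.algorithm = algorithm
        self.leaf_size = leaf_size
        self.metric = metric
        self.p = p
        self.metric_params = metric_params
        self.n_jobs = n_jobs
```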
                             metric_params=metric_params, n_jobs=n_jobs, **kwargs)

    def predict(self, X=None, n_neighbors=None):
        """Predict LOF score of X.
agramfort
Oct 9, 2015
Member
empty line missing
                             leaf_size=leaf_size, metric=metric, p=p,
                             metric_params=metric_params, n_jobs=n_jobs, **kwargs)

    def predict(self, X=None, n_neighbors=None):
agramfort
Oct 9, 2015
Member
predict signature should be
def predict(self, X):
that's it
ngoix
Oct 12, 2015
Author
Contributor
Discussion below? Is predict not allowed to handle X=None?
X : array-like, last dimension same as that of fit data, optional
(default=None)
The querry sample or samples to compute the LOF wrt to the training
agramfort
Oct 9, 2015
Member
bad indent
Returns
-------
lof_scores : array of shape (n_samples,)
The LOF score of each input samples. The lower, the more normal.
agramfort
Oct 9, 2015
Member
indent

    def k_distance(self, X=None):
        """
        Compute the k_distance and the neighborhood of querry samples X wrt
jmschrei
Oct 9, 2015
Member
This isn't a terribly informative docstring. It doesn't define what k_distance is, or what self._fit_X is. Maybe change it to describe things in terms of the algorithm, irrespective of the underlying implementation. Also, query, not querry.
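One way the docstring could read, sticking to the algorithm's own terms (a sketch; the wording is mine, the definitions follow Breunig et al., 2000):

```python
def k_distance(self, X=None):
    """Compute the k-distance and k-distance neighborhood of query samples.

    Following Breunig et al. (2000), the k-distance of a sample p is the
    distance from p to its k-th nearest neighbor among the training
    samples, and the k-distance neighborhood of p is the set of training
    samples whose distance to p is at most the k-distance.

    Parameters
    ----------
    X : array-like, shape (n_query, n_features), optional (default=None)
        Query samples. If None, the training samples are queried and
        each sample is not considered its own neighbor.

    Returns
    -------
    distances : array, shape (n_query, n_neighbors)
        Distances to the k nearest training samples.
    indices : array, shape (n_query, n_neighbors)
        Indices of the k nearest training samples.
    """
    return self.kneighbors(X, n_neighbors=self.n_neighbors)
```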
# Test LOF
clf = neighbors.LOF()
clf.fit(X)
pred = clf.predict()
jmschrei
Oct 9, 2015
Member
Predict must take in values to predict anomalies in.
clf = neighbors.LOF()
clf.fit(X)
pred = clf.predict()
assert_array_equal(clf._fit_X, X)
jmschrei
Oct 9, 2015
Member
The ranking of samples as to their anomaly status must be easily received by the user; usually returned by the predict method. Having to go in and get an attribute is not okay.
ngoix
Oct 12, 2015
Author
Contributor
Sorry, I don't see your point: the ranking is received directly in pred. clf._fit_X is just the training samples (defined in the base estimator NeighborsBase); the user doesn't see or need it...
""" | ||
distances, neighbors_indices = self.kneighbors( | ||
X=X, n_neighbors=self.n_neighbors) | ||
neighbors_indices = neighbors_indices |
jmschrei
Oct 9, 2015
Member
What is the point of this line?
ngoix
Oct 12, 2015
Author
Contributor
Right, thanks!
The LRD of p.
"""

p_0 = self._fit_X if p is None else p
jmschrei
Oct 9, 2015
Member
You should have to explicitly pass in a dataset to this function.
ngoix
Oct 12, 2015
Author
Contributor
discussion below

neighbors_indices = self.neighbors_indices_fit_X_ if p is None else self.k_distance(p)[1]

n_jobs = _get_n_jobs(self.n_jobs)
jmschrei
Oct 9, 2015
Member
When you merge this into the LOF class, just call _get_n_jobs once in the __init__ function.
ngoix
Oct 12, 2015
Author
Contributor
Ok
dist = pairwise_distances(p_0, self._fit_X,
                          self.effective_metric_,
                          n_jobs=n_jobs,
                          **self.effective_metric_params_)
jmschrei
Oct 9, 2015
Member
What is **self.effective_metric_params_?
lrd = p_lrd if p is None else self.local_reachability_density(p=None)

for j in range(p_0.shape[0]):
    cpt = -1
jmschrei
Oct 9, 2015
Member
I don't understand what's going on with this cpt variable
ngoix
Oct 12, 2015
Author
Contributor
I renamed it neighbors_number.
Parameters
----------
p : array-like of shape (n_samples, n_features)
jmschrei
Oct 9, 2015
Member
p should be named X
p_lrd = self.local_reachability_density(p)
lrd_ratios_array = np.zeros((p_0.shape[0], self.n_neighbors))

# Avoid re-computing p_lrd if p is None:
jmschrei
Oct 9, 2015
Member
p should never be None


class LOF(NeighborsBase, KNeighborsMixin, LOFMixin, UnsupervisedMixin):
    """Unsupervised Outlier Detection.
jmschrei
Oct 9, 2015
Member
I like the documentation.
I would like to see some more extensive unit tests, particularly in cases where the algorithm should fail (wrong dimensions or other incorrect types of data passed in). I'll be able to look more at the performance of the code once you merge the mixin with the other class, and change the API to always take in an X matrix.
I'd also like to see an example of it performing against a/many current algorithm(s), so that it is clear it is a valuable contribution.
If you have a dataset X and want to remove outliers from it, you don't want to do clf.fit(X).predict(X), because then each sample is considered in its own neighbourhood: in predict(X), X is treated as 'new observations'. What the user wants is clf.fit(X).predict(), which is what the X=None default allows. It is like looking for the k-nearest neighbors of points in a dataset X: you can do kneighbors(), which is different from kneighbors(X). I can make predict private, which allows taking X=None as argument... Is that allowed?
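The distinction ngoix describes is already visible in scikit-learn's NearestNeighbors, where kneighbors() with no argument excludes each training point from its own neighborhood:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [5.0]])
nbrs = NearestNeighbors(n_neighbors=1).fit(X)

# kneighbors(X): every training point is its own nearest neighbor
print(nbrs.kneighbors(X)[0].ravel())  # [0. 0. 0.]

# kneighbors(): each point is excluded from its own neighborhood
print(nbrs.kneighbors()[0].ravel())   # [0.1 0.1 4.9]
```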
Implementing a fit_predict(X) method is the way to go.
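A minimal sketch of that route, assuming the private `_predict(X=None)` helper this PR later introduces (scoring the training samples without counting each as its own neighbor):

```python
def fit_predict(self, X, y=None):
    """Fit the model to X and return its labels (1 inlier, -1 outlier)."""
    # _predict() with no argument takes the X=None code path, so the
    # training samples are not treated as fresh observations
    return self.fit(X)._predict()
```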
Ok, thanks!
I merged the mixin with the LOF class, changed the API and added a comparison example.
clf.fit(X)
y_pred = clf.decision_function(X).ravel()

if clf_name=="Local Outlier Factor":
agramfort
Oct 12, 2015
Member
pep8
done!
@amueller want to take a final look? for me it's good enough to merge
I think caching the LRD on the training set would be good (and actually make the code easier to follow). I think either
Returns
-------
lof_scores : array, shape (n_samples,)
    The Local Outlier Factor of each input samples. The lower,
amueller
Oct 20, 2016
Member
This seems to contradict the title of the docstring.
ngoix
Oct 22, 2016
Author
Contributor
Yes, it is -lof_scores.
return is_inlier

def decision_function(self, X):
    """Opposite of the Local Outlier Factor of X (as bigger is better).
amueller
Oct 20, 2016
Member
I think the docstring should be more explicit. Is low outlier or high outlier?
Actually, to be consistent with the other estimators, I think negative needs to be outlier.
ngoix
Oct 22, 2016
Author
Contributor
I don't think so, for all the decision functions, bigger is better (large values correspond to inliers). For prediction, negative values (-1) correspond to outliers though. (It's true that this is a bit odd)
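To make the convention concrete (a sketch, not the PR's code): decision_function is "bigger is better", and predict thresholds it so that outliers get the label -1:

```python
import numpy as np

def labels_from_scores(scores, threshold=0.0):
    # scores come from a decision_function where large values mean inlier;
    # samples below the threshold are labeled -1 (outlier), others +1
    is_inlier = np.ones(len(scores), dtype=int)
    is_inlier[scores < threshold] = -1
    return is_inlier

print(labels_from_scores(np.array([1.2, -0.3, 0.7])))  # [ 1 -1  1]
```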
@@ -18,6 +18,9 @@
hence more adapted to large-dimensional settings, even if it performs
quite well in the examples below.
- using the Local Outlier Factor to measure the local deviation of a given
amueller
Oct 20, 2016
Member
It's kinda odd that this example lives in this folder... but whatever..
ngoix
Oct 22, 2016
Author
Contributor
Yes, very weird! It is the folder of the first outlier detection algorithm in scikit-learn.

# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
        np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
amueller
Oct 20, 2016
Member
this == raises a deprecation warning:
/home/andy/checkout/scikit-learn/sklearn/neighbors/lof.py:279: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.
  np.all(self.neighbors_indices_fit_X == neighbors_indices_X)):
This means they have different sizes, I think. So I guess you should check the shape first? This also happens for the line above.
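A sketch of the guard, using the attribute names from this PR (the usage is shown as comments, since the surrounding method is not reproduced here):

```python
import numpy as np

def same_shape_and_equal(a, b):
    """True iff a and b have the same shape and identical entries.

    Checking the shape first avoids NumPy's deprecated elementwise
    comparison of differently sized arrays.
    """
    a, b = np.asarray(a), np.asarray(b)
    return a.shape == b.shape and np.all(a == b)

# usage inside the cache check (attribute names from the PR):
#   if not (same_shape_and_equal(distances_X, self._distances_fit_X_) and
#           same_shape_and_equal(neighbors_indices_X,
#                                self._neighbors_indices_fit_X_)):
#       ... recompute the local reachability density ...
```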
The question is not how isolated the sample is, but how isolated it is
with respect to the surrounding neighborhood.

This strategy is illustrated below.
amueller
Oct 20, 2016
Member
I don't feel that the example illustrates the point that was just made about the different densities. I'm fine to leave it as-is but I don't get a good idea of the global vs local. It would be nice to also illustrate a failure mode maybe?
ngoix
Oct 22, 2016
Author
Contributor
No global vs local anymore!
# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
        np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
    lrd = self._local_reachability_density(
amueller
Oct 20, 2016
Member
It seems that lrd is "small" compared to _distances_fit_X_ and _neighbors_indices_fit_X_. Why not compute it in fit and store it once and for all? You are currently recomputing it on every call to _local_outlier_factor.
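A sketch of that caching, with names mirroring the PR (the `_lrd` attribute is my placeholder; the merged code stores the training-set LRD once at fit time):

```python
def fit(self, X, y=None):
    # fit the underlying neighbors index as before
    super(LocalOutlierFactor, self).fit(X)

    # neighborhoods of the training samples, excluding each sample itself
    self._distances_fit_X_, self._neighbors_indices_fit_X_ = \
        self.kneighbors(n_neighbors=self.n_neighbors)

    # cache the training-set local reachability density once, so scoring
    # calls no longer recompute it
    self._lrd = self._local_reachability_density(
        self._distances_fit_X_, self._neighbors_indices_fit_X_)
    return self
```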
Parameters
----------
distances_X : array, shape (n_query, self.n_neighbors)
    Distances to the neighbors (in the training samples self._fit_X) of
amueller
Oct 20, 2016
Member
I would put backticks around _fit_X to be safe ;)
ngoix
Oct 24, 2016
Author
Contributor
Do you mean replacing self._fit_X by `self._fit_X`, or just by `_fit_X`? I don't understand the purpose...
amueller
Oct 24, 2016
Member
I meant putting backticks around `self._fit_X`: a) for nicer highlighting; b) I'm not sure sphinx will render the current version correctly because of the underscore. But I might be paranoid.
score = clf.fit(X).outlier_factor_
assert_array_equal(clf._fit_X, X)

# Assert scores are good:
amueller
Oct 20, 2016
Member
Assert the smallest outlier score is greater than the largest inlier score.
ngoix
Oct 23, 2016
Author
Contributor
ok
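A sketch of the resulting test, using the merged estimator name and a toy sample whose last two points are obvious outliers (negative_outlier_factor_ is "the higher, the more normal"):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# two tight clusters of inliers, then two far-away outliers
X = np.array([[-2, -1], [-1, -1], [-1, -2],
              [1, 1], [1, 2], [2, 1],
              [5, 3], [-4, 2]])

clf = LocalOutlierFactor(n_neighbors=5)
score = clf.fit(X).negative_outlier_factor_

# the worst inlier must still score better than the best outlier
assert score[:-2].min() > score[-2:].max()
```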
clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = - clf.decision_function(X_test)
amueller
Oct 20, 2016
Member
I would find it more natural to give the outliers the negative label. If you want to leave it like this, remove space after -
ngoix
Oct 23, 2016
Author
Contributor
I agree, but this is to be consistent with OneClassSVM, EllipticEnvelope and IsolationForest.
distance between them. This works for Scipy's metrics, but is less
efficient than passing the metric name as a string.
Distance matrices are not supported.
amueller
Oct 20, 2016
Member
I don't understand this comment.
clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = -clf.decision_function(X_test)
amueller
Oct 24, 2016
Member
I meant changing y_test to be [0] * 20 + [-1] * 20 and then removing the -.
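In other words (a sketch continuing the snippet above, with clf and X_test as defined there, and assuming the scores feed a ranking metric such as roc_auc_score): with outliers carrying the smaller label, the raw decision_function can be scored directly, since its larger values already mean inlier.

```python
from sklearn.metrics import roc_auc_score

# 20 inliers first, then 20 outliers; -1 marks the outliers
y_test = [0] * 20 + [-1] * 20

# no leading minus: bigger decision_function values mean "more normal",
# which now matches the bigger (inlier) label
auc = roc_auc_score(y_test, clf.decision_function(X_test))
```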

return self

def _predict(self, X=None):
amueller
Oct 24, 2016
Member
I would really like to be consistent. I don't think there's a good argument to have one but not the other. Not sure if the example is a strong enough point to make them both public.
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    clf.fit(X)
    scores_pred = clf.decision_function(X)
    if clf_name == "Local Outlier Factor":
amueller
Oct 24, 2016
Member
Wait, I don't understand this. Please elaborate.
Attributes
----------
outlier_factor_ : numpy array, shape (n_samples,)
    The LOF of X. The lower, the more normal.
amueller
Oct 24, 2016
Member
I don't know which comment of yours refers to which comment of mine.
For the first comment: yes, I'd either do negative_outlier_factor or inlier_score or something generic?
For the second comment: the explanation of outlier_factor_ as an attribute says "The LOF of X". What is X? It's the training set this LocalOutlierFactor estimator was trained on, right?
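For the record, the API as merged (this usage follows the LocalOutlierFactor docstring example):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[-1.1], [0.2], [101.1], [0.3]])
clf = LocalOutlierFactor(n_neighbors=2)

print(clf.fit_predict(X))            # [ 1  1 -1  1]: -1 flags the outlier
print(clf.negative_outlier_factor_)  # values close to -1 mean "normal"
```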
thanks :)
Hurray!
Yippee!
Thanks @ngoix !!
Merged #5279.
Whoot!!
Thanks to everybody involved.
Congrats!
* LOF algorithm: add tests and example; fix DeprecationWarning by reshape(1, -1) on one-sample data; LOF with inheritance; lof and lof2 return same score; fix bugs; optimize and cosmit; rm lof2; rm MixinLOF + fit_predict; fix travis; optimize pairwise_distance like in KNeighborsMixin.kneighbors; add comparison example + doc; LOF -> LocalOutlierFactor; change LOF API: fit(X).predict() and fit(X).decision_function() do prediction on X without considering samples as their own neighbors (i.e. without considering X as a new dataset, as fit(X).predict(X) does); rm fit_predict() method; add a contamination parameter so that predict returns a binary value like other anomaly detection algos; doc + debug example correction; pass on doc + examples; pep8 + fix warnings; first attempt at fixing API issues; take into account tguillemot's advice: remove pairwise_distance calculation as too heavy in memory; add benchmarks; deal with duplicates; fix deprecation warnings
* factorize the two for loops
* take into account @albertthomas88 review and cosmit
* fix doc
* alex review + rebase
* make predict private, add outlier_factor_ attribute and update tests
* make fit_predict take y argument
* fix benchmarks file
* update examples
* make decision_function public (rm X=None default)
* fix travis
* take into account tguillemot review + remove useless k_distance function
* fix broken links :meth:`kneighbors`
* cosmit
* whatsnew
* amueller review + remove _local_outlier_factor method
* add n_neighbors_ parameter, the effective number of neighbors we use
* make decision_function private and negative_outlier_factor attribute
Local Outlier Factor implementation.
Motivated by previous discussions: http://sourceforge.net/p/scikit-learn/mailman/message/32485020 and #4163.
Benchmarks of LOF: https://github.com/ngoix/scikit-learn/blob/AD_benchmarks/benchmarks/bench_lof_vs_iforest.py