Lack of consistency for decision_function methods in outlier detection #8693

Closed · albertcthomas opened this issue Apr 3, 2017 · 11 comments

albertcthomas commented Apr 3, 2017

Description

I think we could improve the consistency of the decision_function methods of the outlier detection algorithms implemented in scikit-learn.

  • decision_function for OCSVM is such that if the value is positive the sample is an inlier, and if negative it is an outlier. It takes into account the parameter nu, which can be seen as a contamination parameter. The decision_function of IsolationForest does not take the contamination parameter into account; it just returns the score of the samples. For LOF, it is private (_decision_function) and does not take the contamination parameter into account. For EllipticEnvelope, decision_function takes the contamination parameter into account, and the documentation says it is meant to "ensure a compatibility with other outlier detection tools such as the One-Class SVM".

decision_function should maybe stick with the OCSVM convention, and we could add a score_samples method, as for kernel density estimation, which would return the scores of the algorithms as defined in their original papers. This would be useful when performing benchmarks with ROC curves, for instance. When I did a benchmark with sklearn anomaly detection algorithms I defined a subclass for each algorithm, each with a score method.

If you think this should be addressed I can submit a PR.

See also #8677.

jnothman (Member) commented Apr 3, 2017

albertcthomas (Contributor) commented Apr 4, 2017
This is what I suggest, but feedback is welcome:

  • We can provide a score_samples method that implements the score given in the original papers of LOF, Isolation Forest and Elliptic Envelope (with the convention "bigger is better").

  • OneClassSVM should definitely have a decision_function method because that is the term used for SVMs in general (and in the original paper of the One-Class SVM). Furthermore, this keeps OneClassSVM consistent with the SVM API of sklearn.

  • If we decide to provide a decision_function method for the other outlier detection algorithms, I think it should be consistent with the one of OneClassSVM and take the contamination parameter into account (as is done for EllipticEnvelope). The drawback is that it is then a bit redundant with score_samples.


ngoix (Contributor) commented Apr 19, 2017
+1 for more consistency.

Also, it would be great to remove OutlierDetectionMixin, which is not used by any outlier detection algorithm except EllipticEnvelope.

IMHO it would be clearer to have a standardized decision_function method for all the algorithms, with an optional contamination parameter. (Except for OCSVM and EllipticEnvelope, decision functions structurally cannot depend on the dataset contamination, which is then just used to define a threshold for prediction.)


rcamino commented May 15, 2017

Hi all,

I've been using Isolation Forests and I have some questions regarding this issue.

I read that you have the contamination parameter in Isolation Forest for consistency.

Correct me if I'm wrong, but I think that with the approach from the original paper you can discover the proportion of anomalies in the dataset, whereas with the scikit-learn implementation you have to define it.

Isn't that like removing a good property of the algorithm for exploratory analysis?

Thanks.


albertcthomas (Contributor) commented May 16, 2017

@rcamino I don't think the original paper gives a method to find the proportion of anomalies. Are you referring to this paragraph of the original paper?

(a) if instances return s very close to 1, then they are
definitely anomalies,
(b) if instances have s much smaller than 0.5, then they
are quite safe to be regarded as normal instances, and
(c) if all the instances return s ≈ 0.5, then the entire
sample does not really have any distinct anomaly.

The question then is how you define 'very close to 1' and 'much smaller than 0.5'. I think this would need more work than what the original paper says. BTW, you can still access the values of s via `decision_function`.


rcamino commented May 16, 2017

@albertcthomas Yes, I'm referring to that paragraph. It is true that 0.5 is not exactly defined as the threshold for anomalies; the expressions "very close to 1", "much smaller than 0.5" and "all the instances return s ≈ 0.5" need some interpretation and analysis on your dataset.

I want to analyse this with the decision_function, but I don't know exactly the range of the output scores.
Is it [-1, 1]?
Should I translate the 0.5 from the paper to 0 on this scale?

Sorry for asking here, I don't know what the right place is (I had no luck on Stack Exchange).

Thanks again.


GaelVaroquaux (Member) commented May 16, 2017

ngoix (Contributor) commented May 16, 2017
@rcamino the contamination parameter is only used for producing the binary output (the predict method) from the decision function. The link between our decision_function f and the scoring function s from the original paper is f = 0.5 - s (to respect the 'bigger is better' constraint).

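For reference, the paper's score s and the f = 0.5 - s mapping can be written out directly. This is a sketch of the formulas from the Isolation Forest paper (function names are mine), not scikit-learn's internal code:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def average_path_length(n):
    """c(n) from the Isolation Forest paper: the average path length of
    an unsuccessful BST search over n samples, used to normalize depths."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def paper_score(mean_depth, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)): close to 1 means anomaly,
    well below 0.5 means a normal instance."""
    return 2.0 ** (-mean_depth / average_path_length(n))

def sklearn_decision(mean_depth, n):
    """f = 0.5 - s, so that bigger is better (less abnormal)."""
    return 0.5 - paper_score(mean_depth, n)
```

So the paper's threshold s ≈ 0.5 (an average depth equal to c(n)) maps to f ≈ 0, and "very close to 1" maps to clearly negative f.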

rcamino commented May 16, 2017

@ngoix Great, thank you! Do you think that small formula should be added to the decision_function documentation?


albertcthomas (Contributor) commented May 16, 2017
The score_samples method I suggested was inspired by the KernelDensity implementation, as it also provides a decision function for anomaly detection in low dimensions.


ngoix (Contributor) commented Jun 6, 2017

To summarize:

  1. Minor changes:
  • For all decision functions except the EllipticEnvelope one, bigger is better (the smaller, the more abnormal).
    -> Change the EllipticEnvelope decision function to its opposite when raw_values is True (or remove the raw_values param?).
  • Clean outlier_detection.py (remove OutlierDetectionMixin) and robust_covariance.py.
  2. API changes:
    Remark: it is true that the EllipticEnvelope decision function uses contamination, but just to shift the output such that 0 becomes the threshold corresponding to this contamination parameter.
  • First solution: make all decision functions use contamination just as EllipticEnvelope does, i.e. just for shifting (then a positive value = inlier and a negative one = outlier).

    • Drawback 1: in some algorithms there are two interesting values: one using the structure of the decision function, one corresponding to the contamination param.
    • Idea: why not use the first one when the contamination param is None? It would be natural: when we have no idea of the proportion of outliers, we want the algorithm to find it. In that case, we no longer define a default contamination of 10%.
    • Drawback 2: OCSVM would be the only one to lack a contamination param, as it is somehow included in nu.
  • Second solution (@albertcthomas):
    First solution + add a score_samples method which would return the scores of the algorithms as defined in their original papers.

    • Drawbacks: same as first solution, plus:
      score_samples would always be a shifted version of decision_function, wouldn't it?
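One way to see that last point concretely: if decision_function is just score_samples minus a contamination-based offset, the two rank samples identically, so ranking-based benchmarks (ROC curves, AUC) are unchanged by the shift; only the sign threshold, and hence predict, depends on it. A small numpy check with illustrative stand-ins, not library code:

```python
import numpy as np

# raw_scores plays the role of a score_samples output,
# decision the role of the shifted decision_function.
rng = np.random.RandomState(42)
raw_scores = rng.randn(200)
offset = np.percentile(raw_scores, 10.0)  # contamination = 10% shift
decision = raw_scores - offset

# Identical orderings: any ROC-style benchmark gives the same result
# whether it consumes raw_scores or decision.
assert np.array_equal(np.argsort(raw_scores), np.argsort(decision))
```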