Unclear selection of LOF threshold #23837

Open
gutjuri opened this issue Jul 5, 2022 · 4 comments
Labels
module:neighbors Needs Investigation Issue requires investigation

Comments

@gutjuri

gutjuri commented Jul 5, 2022

In

    - if 'auto', the threshold is determined as in the
      original paper,

the documentation states that if the parameter 'contamination' is set to 'auto', the threshold for LOF will be calculated as outlined in the original paper.

But here

    The offset is set to -1.5 (inliers score around -1), except when a
    contamination parameter different than "auto" is provided. In that

and here

    self.offset_ = -1.5

we can see that, with 'auto', the threshold is simply hard-coded to 1.5 (stored as an offset of -1.5 on the negated scores).
The original paper (https://doi.org/10.1145/335191.335388) never mentions the "magic number" 1.5, which makes me wonder where it comes from.

I'd be glad if you could clarify this. If someone is able to provide a reference for the value of 1.5 I'll gladly update the documentation.

EDIT: The original paper does mention the number 1.5, but only in an example (Sec. 7.3). I don't think the authors meant to suggest that there is anything special about 1.5; they may even have settled on this number because of space constraints in the paper.
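
For readers following along, the rule under discussion can be sketched in a few lines of plain Python. This is an illustrative sketch, not scikit-learn's code: `lof_offset` and `flag_outliers` are hypothetical helpers, the scores are made up, and the percentile rule for a numeric contamination is an assumption about how "the expected number of outliers" is obtained. Only the constant -1.5 for 'auto' comes from the snippets quoted above.

```python
def lof_offset(negative_scores, contamination="auto"):
    """Hypothetical helper: pick the decision offset for negated LOF scores."""
    if contamination == "auto":
        # Fixed constant quoted in the issue; it does not depend on the data.
        return -1.5
    # Assumption: a numeric contamination selects the matching percentile
    # of the training scores (linear interpolation between order statistics).
    scores = sorted(negative_scores)
    rank = contamination * (len(scores) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(scores) - 1)
    frac = rank - lower
    return scores[lower] + frac * (scores[upper] - scores[lower])


def flag_outliers(negative_scores, offset):
    """Hypothetical helper: a point is flagged when its score is below the offset."""
    return [score < offset for score in negative_scores]


# Made-up negated LOF scores: inliers sit near -1, outliers are far below.
neg_lof = [-1.0, -1.02, -0.98, -1.1, -1.4, -2.4, -6.0]

auto_offset = lof_offset(neg_lof)                   # constant threshold
auto_flags = flag_outliers(neg_lof, auto_offset)

half_offset = lof_offset(neg_lof, contamination=0.5)  # data-dependent threshold
half_flags = flag_outliers(neg_lof, half_offset)
```

Under these assumptions the 'auto' branch flags only the points with LOF above 1.5 regardless of the data set, while a numeric contamination moves the threshold with the score distribution. That difference is what makes the origin of the constant worth documenting.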

github-actions bot added the "Needs Triage" label Jul 5, 2022
@thomasjpfan
Member

Thank you for opening this issue.

@albertcthomas This seems to be added here: #9015. What do you think about this issue?

thomasjpfan added the "module:neighbors" and "Needs Investigation" labels and removed the "Needs Triage" label Jul 5, 2022
@glevv
Contributor

glevv commented Oct 27, 2022

It's mentioned in 7.3 and the reason is described in 7.1.

The arguments are a bit weak, IMO.

@gutjuri
Author

gutjuri commented Oct 27, 2022

> It's mentioned in 7.3

See my original post. In the paper, it is only used in one example.

> and the reason is described in 7.1.

Could you show me where exactly the reason for choosing the number 1.5 is described in Section 7.1? I can't find it.

@glevv
Contributor

glevv commented Oct 27, 2022

> We see that the objects in the uniform clusters all have their LOF equal to 1. Most objects in the Gaussian clusters also have 1 as their LOF values. Slightly outside the Gaussian clusters, there are several weak outliers, i.e., those with relatively low, but larger than 1, LOF values. The remaining seven objects all have significantly larger LOF values.

They are stating that the LOF threshold should be significantly higher than 1. In 7.2 they look at outliers with scores of 2, 2.4, 2.5, 2.8, and 6, and in 7.3 they use a threshold of 1.5.

This threshold could be set to legacy computation, I guess, if 1.5 is too confusing.


3 participants