Unclear selection of LOF threshold #23837

gutjuri · 2022-07-05T11:26:51Z

In

Lines 103 to 104 in baf0ea2

    
    - if 'auto', the threshold is determined as in the
      original paper,

the documentation states that if the parameter 'contamination' is set to 'auto', the threshold for LOF will be calculated as outlined in the original paper.

But here

scikit-learn/sklearn/neighbors/_lof.py

Lines 147 to 148 in baf0ea2

    
    The offset is set to -1.5 (inliers score around -1), except when a
    contamination parameter different than "auto" is provided. In that

and here

scikit-learn/sklearn/neighbors/_lof.py

Line 310 in baf0ea2

self.offset_ = -1.5

we can see that, for 'auto', the offset is always hard-coded to -1.5, i.e. an LOF threshold of 1.5.
The original paper (https://doi.org/10.1145/335191.335388) never mentions the "magic number" 1.5, which makes me wonder where it comes from.
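
For readers following along, here is a minimal sketch of the decision rule the quoted lines imply. It assumes scores are negated LOF values (so inliers sit around -1) and that a sample is flagged when its score is strictly below the offset. `AUTO_OFFSET` and `is_outlier` are hypothetical names for illustration, not scikit-learn API.

```python
# Hypothetical constant: the value hard-coded at line 310 of _lof.py (baf0ea2).
AUTO_OFFSET = -1.5


def is_outlier(lof, offset=AUTO_OFFSET):
    """Flag a sample whose negated LOF falls strictly below the offset.

    Assumption: strict comparison; equivalent to LOF > 1.5 for the default.
    """
    negative_outlier_factor = -lof
    return negative_outlier_factor < offset


# Illustrative LOF values, not taken from the paper.
lof_values = [1.0, 1.2, 1.5, 1.6, 2.4]
flags = [is_outlier(v) for v in lof_values]
print(flags)  # [False, False, False, True, True]
```

Under these assumptions, an offset of -1.5 is the same as saying "LOF above 1.5 is an outlier", which is the number whose origin is in question.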

I'd be glad if you could clarify this. If someone is able to provide a reference for the value of 1.5 I'll gladly update the documentation.

EDIT: The original paper mentions the number 1.5, but only in an example (Sec. 7.3). I don't think they meant to suggest that there is anything special about the number 1.5. They may even have settled on this number because of space constraints in the paper.

thomasjpfan · 2022-07-05T13:57:40Z

Thank you for opening this issue.

@albertcthomas This seems to have been added in #9015. What do you think about this issue?

glevv · 2022-10-27T06:37:05Z

It's mentioned in 7.3 and the reason is described in 7.1.

Arguments are a bit weak, IMO

gutjuri · 2022-10-27T07:05:16Z

> It's mentioned in 7.3

See my original post. In the paper, it is only used in one example.

> and the reason is described in 7.1.

Could you show me where exactly the reason for choosing the number 1.5 is described in Section 7.1? I can't find it.

glevv · 2022-10-27T09:07:42Z

> We see that the objects in the uniform clusters all have their LOF equal to 1. Most objects in the Gaussian clusters also have 1 as their LOF values. Slightly outside the Gaussian clusters, there are several weak outliers, i.e., those with relatively low, but larger than 1, LOF values. The remaining seven objects all have significantly larger LOF values.

They are stating that the LOF threshold should be significantly higher than 1. In 7.2 they look at outliers with scores of 2, 2.4, 2.5, 2.8 and 6, and in 7.3 they use a threshold of 1.5.
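
A rough sketch of that argument, using the Section 7.2 scores quoted above plus a few made-up inlier and weak-outlier scores. The made-up values and the name `count_flagged` are hypothetical and only there for illustration.

```python
# LOF scores quoted above from Section 7.2 of the paper.
strong_outliers = [2.0, 2.4, 2.5, 2.8, 6.0]
# Hypothetical inlier and weak-outlier scores, invented for illustration only.
others = [1.0, 1.0, 1.1, 1.3]


def count_flagged(scores, threshold):
    """Count scores strictly above the given LOF threshold."""
    return sum(1 for s in scores if s > threshold)


all_scores = strong_outliers + others
counts = {t: count_flagged(all_scores, t) for t in (1.05, 1.5, 1.9)}
print(counts)  # {1.05: 7, 1.5: 5, 1.9: 5}
```

With these numbers, any threshold comfortably between the weak outliers and 2 flags the same five points, so 1.5 behaves like one reasonable choice in that range rather than a uniquely justified value.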

This threshold could be set to legacy computation, I guess, if 1.5 is too confusing.

github-actions bot added the Needs Triage Issue requires triage label Jul 5, 2022

thomasjpfan added module:neighbors Needs Investigation Issue requires investigation and removed Needs Triage Issue requires triage labels Jul 5, 2022
