[MRG] Fix LocalOutlierFactor's output for data with duplicated samples#28773
[MRG] Fix LocalOutlierFactor's output for data with duplicated samples#28773OmarManzoor merged 18 commits intoscikit-learn:mainfrom
Conversation
…cated samples Previously, when the dataset had values repeat more times than the algorithm's number of neighbors, it miscalculates the outliers. Because the distance between the duplicated samples is 0, the local reachability density is equal to 1e10. This leads to values that are close to the duplicated values having a really low negative outlier factor (under -1e7), labeling them as outliers. This fix checks if the minimum negative outlier factor is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning. Notes: Added a handle_duplicates variable, which allows developers to manually handle the duplicate values, if desired. Also added a memory_limit variable to avoid creating memory errors for really large datasets, which can also be changed manually by developers.
|
I think I don't like the recursive automatic change to neighbors. Maybe we should instead just warn the user when we detect the problem with very negative outlier factor values and let the user re-fit the model with a larger value of |
Removed automatic change to neighbors number and changed the warning Also changed the associated test, to catch the warning.
|
@ogrisel Changed the code. Now it only raises a warning, as suggested. |
Changed comment according to review Co-authored-by: Tim Head <betatim@gmail.com>
|
@betatim Is there anything more I can do? |
OmarManzoor
left a comment
There was a problem hiding this comment.
Thanks for the PR @HenriqueProj. I think this may require a changelog entry as a Fix.
Changed test description according to review Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
|
@OmarManzoor Made the changes, is everything ok? Thanks for the feedback! |
OmarManzoor
left a comment
There was a problem hiding this comment.
Minor suggestions and also the conflict needs to be resolved.
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
OmarManzoor
left a comment
There was a problem hiding this comment.
LGTM. Thanks @HenriqueProj
|
Thank you for all the feedback! |
Reference Issues/PRs
Fixes #27839
What does this implement/fix? Explain your changes.
Previously, when the dataset had values repeat more times than the algorithm's number of neighbors, it miscalculates the outliers.
Because the distance between the duplicated samples is 0, the local reachability density is equal to 1e10. This leads to values that are close to the duplicated values having a really low
negative_outlier_factor_(under -1e7), labeling them as outliers.This fix checks if the minimum
negative_outlier_factor_is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning.Notes: Added a
handle_duplicatesvariable, which allows developers to manually handle the duplicate values, if desired.Also added a
memory_limitvariable to avoid creating memory errors for really large datasets, which can also be changed manually by developers.Any other comments?