[MRG] Error for cosine affinity when zero vectors present #7943
What does this implement/fix? Explain your changes.
Previously, the code would hang or leak memory when agglomerating observations or features using cosine distance with zero vectors present. Now an error is raised in this setting.
Cosine distance is generally undefined when one of the vectors is a zero vector. In fact, different answers are produced depending on the code used: scipy's pdist vs. sklearn's cosine_distances vs. sklearn's paired_cosine_distances.
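To illustrate why the distance is undefined, here is a minimal NumPy sketch, not the code from this PR. It computes the textbook cosine distance and shows that a zero vector leads to 0/0. The helper name `check_no_zero_rows` and its error message are assumptions made for illustration; they only show the kind of guard that could raise an error before clustering starts.

```python
import numpy as np


def cosine_distance(u, v):
    """Textbook cosine distance: 1 - (u . v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # If either vector is all zeros, both the dot product and the norm
    # product are 0, so the ratio is 0/0 and NumPy yields nan.
    # The floating-point warning is silenced here on purpose.
    with np.errstate(invalid="ignore", divide="ignore"):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def check_no_zero_rows(X):
    """Hypothetical guard: raise ValueError if any row of X is all zeros."""
    X = np.asarray(X, dtype=float)
    has_nonzero = np.any(X != 0, axis=1)  # True for rows with a nonzero entry
    if not np.all(has_nonzero):
        raise ValueError(
            "Cosine affinity cannot be used when X contains zero vectors."
        )
    return X
```

Calling `cosine_distance([0, 0], [1, 2])` returns `nan`, while `check_no_zero_rows([[1, 2], [0, 0]])` raises a `ValueError` up front. Libraries may instead substitute a fixed value for the undefined case, which is one way different implementations can end up disagreeing.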
Any other comments?
I think that this is not the right fix.
There are two problems here: one is that cosine is not well defined when vectors are null, and the other is that hierarchical clustering is buggy.
We might consider doing bugware and implementing such a check to save users' time (I stress the "might", as it would really be bugware).
If we do this, we should:
Thanks for the feedback. I apologize if I am making things harder than they need to be. I still have a few questions:
Cosine is not well defined --
Hierarchical Clustering is buggy --
I made the additions (1) and (2) in your list, @GaelVaroquaux.
Regarding (3), as I mentioned in my previous post, I don't believe there is a remaining bug in SciPy's hierarchical clustering.
What I propose is that we put this merge on hold while we create an issue asking: what should Agglomerative Clustering do when using cosine distance with zero vectors? Once this gets resolved, we can either go forward with this fix or proceed with whatever gets decided.
If this sounds reasonable, I can submit this new Issue.
@@            Coverage Diff             @@
##           master    #7943      +/-   ##
==========================================
+ Coverage   96.19%   96.26%   +0.06%
==========================================
  Files         348      401      +53
  Lines       64645    72877    +8232
  Branches        0     7895    +7895
==========================================
+ Hits        62187    70154    +7967
- Misses       2458     2699     +241
- Partials        0       24      +24