Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Error for cosine affinity when zero vectors present #7943

Merged
merged 5 commits into from Jun 21, 2019

Conversation

@mthorrell
Copy link
Contributor

mthorrell commented Nov 26, 2016

Reference Issue

Fixes #7689

What does this implement/fix? Explain your changes.

Previously, code would hang/create memory leak when trying to agglomerate observations/features when using cosine distance and zero vectors. Now an error is produced in this setting.

Cosine distance is generally undefined when taking distances to zero vectors--and in fact different answers are produced depending on the code used scipy's pdist vs. sklearn's cosine_distances vs. sklearn's paired_cosine_distances.

Any other comments?

@GaelVaroquaux

This comment has been minimized.

Copy link
Member

GaelVaroquaux commented Nov 26, 2016

I think that this is not the right fix.

There are two problems here: one that cosine is not well defined when vectors are null, and the other that hierarchical clustering is buggy.

We might consider to do bugware and implement such a check, to save the users' time (I stress the "might", as I would really be bugware).

If we do this, we should:

  1. Use "np.any" instead of the second np.sum and comparison
  2. Add a test
  3. Create an issue in scipy's tracker as, if I don't get it wrong, the hierarchical-clustering bug is there.
@mthorrell

This comment has been minimized.

Copy link
Contributor Author

mthorrell commented Nov 26, 2016

Thanks for the feedback. I apologize if I am making things harder than they need to be. I still have a few questions:

Cosine is not well defined --

  • Should we make a separate issue for this at least for the agglomeration setting? So that we can get consensus on what cosine distance to zero should be in this setting?
  • I was planning on making even a different issue that points out that paired_cosine_distances and cosine_distances in metrics.pairwise handle zeros differently to hopefully resolve a part of this issue more generally.

Hierarchical Clustering is buggy --

  • An update to scipy as described here scipy#6774 will essentially produce the same output as this pull request: it will create an Error when a distance is nan or Inf. As far as I can tell, this is where they will leave their implementation of hierarchical clustering.
  • With this in mind, would you still call scipy's clustering buggy?
  • Also with this in mind (and assuming this is the desired behavior), is kind of preempting the scipy fix--as I have done with this PR--appropriate?
@mthorrell

This comment has been minimized.

Copy link
Contributor Author

mthorrell commented Nov 26, 2016

I made the additions (1) and (2) in your list, @GaelVaroquaux.

Regarding (3), as I mentioned in my previous post, I don't believe there to be a remaining bug in Scipy's hierarchical-clustering.

What I propose is: we put this merge on hold while we create an issue regarding: What should Agglomerative Clustering do when using Cosine Distance with zero vectors? Once this gets resolved, we can either go forward with this fix... or we can proceed with whatever gets decided.

If this sounds reasonable, I can submit this new Issue.

# 'cosine' affinity is used
X = np.array([[0, 1],
[0, 0]])
assert_raises(ValueError, linkage_tree, X, affinity='cosine')

This comment has been minimized.

Copy link
@jnothman

jnothman Nov 29, 2016

Member

please use assert_raises_message

@@ -379,6 +379,10 @@ def linkage_tree(X, connectivity=None, n_components=None,
'Unknown linkage option, linkage should be one '
'of %s, but %s was given' % (linkage_choices.keys(), linkage))

if np.sum(1 - np.any(X, axis=1)) > 0 and affinity == 'cosine':

This comment has been minimized.

Copy link
@jnothman

jnothman Nov 29, 2016

Member

Please put the faster condition first.

@amueller

This comment has been minimized.

Copy link
Member

amueller commented Dec 1, 2016

@GaelVaroquaux why do you consider this bugware (also I'm not sure what bugware is).
Erroring on invalid input seems ok for me?

If we agree this solution is ok, then LGTM from a code perspective.

@amueller

This comment has been minimized.

Copy link
Member

amueller commented May 23, 2018

ping @GaelVaroquaux again?

@rth

This comment has been minimized.

Copy link
Member

rth commented Jun 20, 2019

re "ping @GaelVaroquaux again?"

If no objections +1 to merge.

@@ -379,6 +379,10 @@ def linkage_tree(X, connectivity=None, n_components=None,
'Unknown linkage option, linkage should be one '
'of %s, but %s was given' % (linkage_choices.keys(), linkage))

if affinity == 'cosine' and np.sum(1 - np.any(X, axis=1)) > 0:

This comment has been minimized.

Copy link
@GaelVaroquaux

GaelVaroquaux Jun 20, 2019

Member

I hate to be the person always complaining, but "np.sum(1 - np.any(X, axis=1)) > 0" is a strange test. What exactly is it testing? That X is zero everywhere?

This comment has been minimized.

Copy link
@mthorrell

mthorrell Jun 21, 2019

Author Contributor

It makes sure there are no rows with all zeros.

[[0,0],
[1,1]] 

for example will produce this error, but the following example won't.

[[1,0],
[1,1]] 

This comment has been minimized.

Copy link
@jnothman

jnothman Jun 21, 2019

Member

Use ~ instead of 1- and use np.any instead of sum > 0.

This comment has been minimized.

Copy link
@jnothman

jnothman Jun 21, 2019

Member

Then it would be readable as "any rows without any values"

This comment has been minimized.

Copy link
@GaelVaroquaux

GaelVaroquaux Jun 21, 2019

Member

I would have written that "np.any(np.all(X == 0, axis=1))". But what is written here is fine.

Thanks!

mthorrell
@GaelVaroquaux GaelVaroquaux self-requested a review Jun 21, 2019
Copy link
Member

GaelVaroquaux left a comment

LGTM.

@GaelVaroquaux

This comment has been minimized.

Copy link
Member

GaelVaroquaux commented Jun 21, 2019

+1 for merge once CI has ran.

@codecov

This comment has been minimized.

Copy link

codecov bot commented Jun 21, 2019

Codecov Report

Merging #7943 into master will increase coverage by 0.06%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7943      +/-   ##
==========================================
+ Coverage   96.19%   96.26%   +0.06%     
==========================================
  Files         348      401      +53     
  Lines       64645    72877    +8232     
  Branches        0     7895    +7895     
==========================================
+ Hits        62187    70154    +7967     
- Misses       2458     2699     +241     
- Partials        0       24      +24
Impacted Files Coverage Δ
sklearn/cluster/hierarchical.py 99.69% <ø> (+0.05%) ⬆️
sklearn/cluster/tests/test_hierarchical.py 100% <100%> (ø) ⬆️
sklearn/__init__.py 67.64% <0%> (-30.09%) ⬇️
sklearn/ensemble/gradient_boosting.py 68.12% <0%> (-27.97%) ⬇️
sklearn/tests/test_common.py 85.83% <0%> (-10.64%) ⬇️
sklearn/_build_utils/__init__.py 52% <0%> (-9.23%) ⬇️
sklearn/datasets/covtype.py 46% <0%> (-6.18%) ⬇️
sklearn/feature_extraction/hashing.py 95.34% <0%> (-4.66%) ⬇️
sklearn/datasets/olivetti_faces.py 29.26% <0%> (-4.07%) ⬇️
sklearn/decomposition/sparse_pca.py 96.1% <0%> (-3.9%) ⬇️
... and 400 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e29334...67f3fdf. Read the comment docs.

@GaelVaroquaux GaelVaroquaux merged commit 3d99769 into scikit-learn:master Jun 21, 2019
12 checks passed
12 checks passed
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
codecov/patch 100% of diff hit (target 96.19%)
Details
codecov/project 96.26% (+0.06%) compared to 3e29334
Details
scikit-learn.scikit-learn Build #20190621.11 succeeded
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_np_atlas) Linux py35_np_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_conda) Linux pylatest_conda succeeded
Details
scikit-learn.scikit-learn (Windows py35_32) Windows py35_32 succeeded
Details
scikit-learn.scikit-learn (Windows py37_64) Windows py37_64 succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda) macOS pylatest_conda succeeded
Details
koenvandevelde added a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
…rn#7943)

* cosine affinity cannot be used when X contains zero vectors

* fixed issue with tabs spaces

* changed to np.any and created a test for this new ValueError

* use assert_raise_message and flipped order of if conditions

* fixed 0 row calculation
@jnothman jnothman mentioned this pull request Oct 28, 2019
0 of 8 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.