
[MRG + 1] fix bug with negative values in cosine_distances #7732

Merged
merged 4 commits into scikit-learn:master from asanakoy:fix-negative-cosine-dist on Oct 26, 2016

Conversation

4 participants
@asanakoy
Contributor

asanakoy commented Oct 24, 2016

Reference Issue

Fixes #5772

What does this implement/fix? Explain your changes.

Fix a bug where cosine_distances returned small negative values.

Essentially:

  • clip cosine distances to [0, 2].
  • set distances between vectors and themselves to 0.

Any other comments?

Implemented it analogously to euclidean_distances.

fix bug with negative values in cosine_distances
clip distances to [0, 2]
set distances between vectors and themselves to 0
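
The two steps above can be sketched end to end as follows. This is a minimal standalone sketch, not scikit-learn's actual implementation; the function name and the normalization code are assumptions for illustration (it also assumes no zero rows in X):

```python
import numpy as np

def cosine_distances_sketch(X, Y=None):
    """Sketch of the fix: cosine distance is 1 - cosine similarity, so it
    lies in [0, 2] mathematically, but floating point rounding can push
    computed values slightly outside that range."""
    if Y is None:
        Y = X
    # normalize rows (assumes no zero-norm rows)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = 1.0 - Xn @ Yn.T          # raw cosine distances
    np.clip(S, 0, 2, out=S)      # clip rounding noise back into [0, 2]
    if X is Y:
        # distances between vectors and themselves must be exactly 0
        S[np.diag_indices_from(S)] = 0.0
    return S
```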
@giorgiop


Contributor

giorgiop commented Oct 24, 2016

I am in line with this solution (I did the same in a bigger, never-merged PR #5333).
Could you please add a test that would break without your changes?

@asanakoy


Contributor

asanakoy commented Oct 24, 2016

Sure. Should I add it to sklearn/metrics/tests/test_pairwise.py as a separate function?

@giorgiop


Contributor

giorgiop commented Oct 24, 2016

I would put some in both pairwise and t-SNE. You should be able to reuse some of the tests I wrote back then. But please read them critically :)

@asanakoy


Contributor

asanakoy commented Oct 24, 2016

The test will fail without the fix. It generates distances less than 0 and greater than 2.

D = cosine_distances(XA)
assert_array_almost_equal(D, [[0., 0.], [0., 0.]])
# check that all elements are in [0, 2]
assert_true(np.all(D >= 0.))


@amueller

amueller Oct 24, 2016

Member

This seems pretty vacuous after the test that all entries are zero? Maybe test these with a larger random matrix?


@asanakoy

asanakoy Oct 24, 2016

Contributor

These are specific random vectors that I found to produce negative values on the diagonal if they are not clipped to [0, 2].
I can add another test with

X = np.abs(rng.rand(1000, 5000))
D = cosine_distances(X)
# exactly 0 on the diagonal only if previously negative values are clipped
assert_array_almost_equal(D.flat[::D.shape[0] + 1], np.zeros(D.shape[0]))
assert_true(np.all(D >= 0.))
assert_true(np.all(D <= 2.))


@amueller

amueller Oct 24, 2016

Member

I would like that. I'm not saying your test doesn't make sense, but after the first assert_array_almost_equal it is certain that the remaining checks pass, right? I guess the >= 0 check verifies that it's actually >= 0 and not only "almost"; but if it's "almost zero" it is certainly smaller than two.


@amueller

amueller Oct 24, 2016

Member

On the other hand checking both boundaries for both cases is fine. But having a large random matrix in addition would also be nice.


@asanakoy

asanakoy Oct 24, 2016

Contributor

I guess the >=0 checks if it's actually >= and not only "almost"

Absolutely. If the value is -1e-16, it will pass assert_array_almost_equal, but the second check on >= 0 will make it fail.

OK, I will add another test with a random 1000 × 5000 matrix.
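
The point about tolerance versus the strict bound can be seen in a tiny standalone example (a sketch using numpy's testing helper, as in the test suite):

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

# a tiny negative value, as produced by floating point rounding
d = np.array([-1e-16])

# the "almost equal" check treats -1e-16 as zero and passes
assert_array_almost_equal(d, [0.])

# the strict lower-bound check still catches the negative value
assert not np.all(d >= 0.)
```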


@amueller

amueller Oct 24, 2016

Member

thanks :)

@amueller amueller changed the title from fix bug with negative values in cosine_distances to [MRG] fix bug with negative values in cosine_distances Oct 24, 2016

assert_array_equal(D.flat[::D.shape[0] + 1], [0., 0.])
XB = np.vstack([x, -x])
D2 = cosine_distances(XB)


@asanakoy

asanakoy Oct 24, 2016

Contributor

For these specific vectors we will have values < 0 on the diagonal and > 2.0 off the diagonal if they are not clipped to [0, 2].
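
The behaviour described here is easy to reproduce without scikit-learn. The sketch below computes raw (unclipped) cosine distances for a vector and its negation; the normalization code is my own, not the library's:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(1, 5)
XB = np.vstack([x, -x])          # opposite vectors: true cosine distance is 2

Xn = XB / np.linalg.norm(XB, axis=1, keepdims=True)
D2 = 1.0 - Xn @ Xn.T             # raw cosine distances, no clipping

# off-diagonal entries equal 2 only up to rounding; without clipping they
# may land slightly above 2 (and the diagonal slightly below 0)
```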

@amueller amueller changed the title from [MRG] fix bug with negative values in cosine_distances to [MRG + 1] fix bug with negative values in cosine_distances Oct 24, 2016

@amueller


Member

amueller commented Oct 24, 2016

LGTM, though you have a PEP8 violation.

@asanakoy


Contributor

asanakoy commented Oct 24, 2016

Sorry. Repushed ;)

@amueller amueller added this to the 0.18.1 milestone Oct 24, 2016

if X is Y or Y is None:
# Ensure that distances between vectors and themselves are set to 0.0.
# This may not be the case due to floating point rounding errors.
S.flat[::S.shape[0] + 1] = 0.0


@NelleV

NelleV Oct 25, 2016

Member

If I am not mistaken, this case implies a square matrix. Why not use np.diag_indices? It is much clearer what the code does when it addresses the diagonal of the matrix directly instead of using flat slicing.
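
For reference, the flat-slicing trick and np.diag_indices_from write to the same entries of a square matrix; a quick sketch:

```python
import numpy as np

S = np.arange(9.0).reshape(3, 3)
T = S.copy()

# flat-slicing trick: every (n + 1)-th element of the flat view
# is a diagonal entry of an n x n matrix
S.flat[::S.shape[0] + 1] = 0.0

# the more explicit equivalent suggested in the review
T[np.diag_indices_from(T)] = 0.0
```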

@NelleV


Member

NelleV commented Oct 26, 2016

Thanks for putting up with my annoying nitpick :)

@NelleV NelleV merged commit 0ea8e8b into scikit-learn:master Oct 26, 2016

3 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@asanakoy


Contributor

asanakoy commented Oct 26, 2016

No problem! Thank you!
My first contribution to sklearn :) I hope to be helpful further!

@amueller


Member

amueller commented Oct 26, 2016

Great, congratulations @asanakoy :)

@asanakoy asanakoy deleted the asanakoy:fix-negative-cosine-dist branch Oct 26, 2016

@asanakoy asanakoy restored the asanakoy:fix-negative-cosine-dist branch Oct 26, 2016

@lanzagar lanzagar referenced this pull request Oct 28, 2016

Merged

[ENH] Manifold Learning #1624

amueller added a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016

[MRG + 1] fix bug with negative values in cosine_distances (scikit-learn#7732)

* fix bug with negative values in cosine_distances
clip distances to [0, 2]
set distances between vectors and themselves to 0

* add test

* add test on big random matrix

* use np.diag_indices_from instead of slicing

@VesnaT VesnaT referenced this pull request Nov 16, 2016

Merged

OWManifoldLearning: Add cosine distance to t-SNE #1756


sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017


afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017


Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017


paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017


maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
