
[MRG + 1] support X_norm_squared in euclidean_distances #2459

Merged
merged 1 commit into scikit-learn:master from dougalsutherland:euclidean_distances on Sep 1, 2015

6 participants

@dougalsutherland

There's a comment in euclidean_distances saying

```python
# should not need X_norm_squared because if you could precompute that as
# well as Y, then you should just pre-compute the output and not even
# call this function.
```

That's not necessarily true, though. I ran into a situation today where I have a whole bunch of sets and need to do something based on the distances between each pair of sets. It's helpful to cache the squared norms for each set; if I did that but could only pass Y_norm_squared for each pair, the norms for X would still be recomputed on every call. (Of course, I can do it without the helper function, which is what I'm doing now, but it's nicer to use helpers.)

Another situation is when you already happen to have the squared norms for a set X and then want euclidean_distances(X). I guessed that euclidean_distances(X, Y_norm_squared=X_norm_sq) might work, but looking at the code, that doesn't actually use X_norm_sq. With this change, euclidean_distances handles that case too.
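The reason the norms are reusable at all is the expansion ‖x − y‖² = ‖x‖² − 2 x·y + ‖y‖²: the squared norms of X and of Y enter independently, so each can be cached separately. A minimal sketch of the usage this PR enables (using the new X_norm_squared keyword):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
sets = [rng.random_sample((30, 8)) for _ in range(4)]
# Cache each set's squared row norms once...
norms = [(S ** 2).sum(axis=1) for S in sets]

# ...and reuse them for every pairwise call between sets,
# instead of recomputing them inside each call.
for i in range(len(sets)):
    for j in range(i + 1, len(sets)):
        D = euclidean_distances(sets[i], sets[j],
                                X_norm_squared=norms[i],
                                Y_norm_squared=norms[j])
```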

This also adds a very simple test that passing X_norm_squared and/or Y_norm_squared gives the same result as passing neither; previously no test exercised Y_norm_squared at all.

As an aside: I have no idea why XX is computed with X * X but YY with Y ** 2 (which necessitates the annoying copy code in the sparse case); the two seem like exactly the same situation, apart from the very minor difference in shape. I left it as-is, though.

@dougalsutherland

I don't know why the tests passed locally and failed on the server. Will look into that.

@larsmans
scikit-learn member
larsmans commented Oct 2, 2013

This is not mergeable anymore after my optimizations in 81950ba. The X * X vs. Y ** 2 difference is gone, as the squaring is now offloaded to a helper that calls einsum when possible.
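In essence, the helper (row_norms in sklearn.utils.extmath, which also covers the sparse case) contracts each row with itself; a minimal sketch of the dense einsum trick:

```python
import numpy as np

def squared_row_norms(X):
    # Contract each row with itself: sum_j X[i, j] * X[i, j].
    # Unlike (X ** 2).sum(axis=1), this avoids materializing the
    # intermediate X ** 2 array.
    return np.einsum('ij,ij->i', X, X)

X = np.random.RandomState(0).random_sample((5, 3))
assert np.allclose(squared_row_norms(X), (X ** 2).sum(axis=1))
```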

@dougalsutherland

Okay, looks good. I still think passing in the squared norms of X makes sense; I'll update to take your optimizations into account and track down the test problem soon.

@coveralls

Coverage Status

Coverage remained the same when pulling 8d2978c on dougalsutherland:euclidean_distances into e54c54a on scikit-learn:master.

@dougalsutherland

Updated this to use the new version, with basically the same changeset as before. I put the X_norm_squared argument last in the signature to keep backwards compatibility with positional arguments, even though it should probably go earlier logically. Is that the right way to handle it?

@amueller
scikit-learn member

Not sure why this got stale, looks like a good contrib. Can you rebase?

@dougalsutherland

@amueller Okay, rebased and switched to use check_array.

I still kept X_norm_squared as the last argument in case people are calling it with positional arguments, despite the order looking ugly.
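Concretely, the signature ends up as below, so pre-existing positional calls keep binding the same way (a sketch of the ordering only, not the full function):

```python
def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
                        X_norm_squared=None):
    ...

# A pre-existing positional call such as
#     euclidean_distances(X, Y, YY, True)
# still binds YY to Y_norm_squared and True to squared.
```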

@amueller
scikit-learn member

I'm ok with that ordering. LGTM.

@amueller amueller changed the title from support X_norm_squared in euclidean_distances to [MRG + 1] support X_norm_squared in euclidean_distances Aug 27, 2015
@jnothman jnothman and 1 other commented on an outdated diff Aug 27, 2015
sklearn/metrics/tests/test_pairwise.py

```diff
@@ -334,6 +334,21 @@ def test_euclidean_distances():
     D = euclidean_distances(X, Y)
     assert_array_almost_equal(D, [[1., 2.]])
+    rng = np.random.RandomState(0)
+    X = rng.random_sample((10, 4))
+    Y = rng.random_sample((20, 4))
+    X_norm_sq = (X ** 2).sum(axis=1)
+    Y_norm_sq = (Y ** 2).sum(axis=1)
+
+    D1 = euclidean_distances(X, Y)
+    D2 = euclidean_distances(X, Y, X_norm_squared=X_norm_sq)
```
@jnothman
scikit-learn member
jnothman added a note Aug 27, 2015

These tests don't actually check that the args are being utilised. Do we care? Should we add tests with incorrect norms in order to ensure that there aren't bugs in the use of the norms? Or is code coverage sufficient?

@amueller
scikit-learn member
amueller added a note Aug 27, 2015

Good point. It's too late in NYC. What we want to test is that it gets faster, right? But that's not nice to test.
Maybe adding a test that the squares are actually used would be OK. We didn't have that for Y_norm_squared either, though, right?
Maybe just add a test that passes both as zero and checks that it computes -2 times the dot product?

@jnothman
scikit-learn member

Apart from that nitpick, this LGTM.

@dougalsutherland added commit 8d8b434: support X_norm_squared in euclidean_distances
@dougalsutherland

Added a test that the answer comes out wrong when wrong squared norms are passed in.

I tried @amueller's suggestion of passing zeros, but the function applies maximum(distances, 0), so instead of -2 * X.dot(Y.T) it returned all zeros. So I added a blunt test instead: it halves the squared norms and makes sure that at least one distance changes by more than 0.01.
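Roughly, the two checks look like this (a sketch along the lines of the committed test, not a verbatim copy):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
X = rng.random_sample((10, 4))
Y = rng.random_sample((20, 4))
X_norm_sq = (X ** 2).sum(axis=1)

# Zero norms: the clamped expansion maximum(0 - 2 * X.dot(Y.T) + 0, 0)
# is identically zero for nonnegative data, so that check is vacuous.
D_zero = euclidean_distances(X, Y,
                             X_norm_squared=np.zeros(len(X)),
                             Y_norm_squared=np.zeros(len(Y)))
assert (D_zero == 0).all()

# Feeding deliberately wrong (halved) norms instead, and asserting
# that at least one distance moves noticeably, does exercise the path.
D_good = euclidean_distances(X, Y, X_norm_squared=X_norm_sq)
D_bad = euclidean_distances(X, Y, X_norm_squared=X_norm_sq / 2)
assert np.abs(D_good - D_bad).max() > 0.01
```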

@amueller
scikit-learn member

@jnothman merge?

@GaelVaroquaux
scikit-learn member

LGTM. Merging. Thanks!

@GaelVaroquaux GaelVaroquaux merged commit 4d39cf8 into scikit-learn:master Sep 1, 2015

2 checks passed

continuous-integration/appveyor: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed