Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pairwise_distances(X) should always have 0 diagonal #12628

Open
jnothman opened this issue Nov 20, 2018 · 6 comments · May be fixed by #12635
Open

pairwise_distances(X) should always have 0 diagonal #12628

jnothman opened this issue Nov 20, 2018 · 6 comments · May be fixed by #12635

Comments

@jnothman
Copy link
Member

jnothman commented Nov 20, 2018

In the case of euclidean distances, we explicitly set the diagonal of unary pairwise distances to 0:

distances.flat[::distances.shape[0] + 1] = 0.0
as well as in pairwise_distances_chunked:
D_chunk.flat[sl.start::_num_samples(X) + 1] = 0
. We also do so for cosine distances:
S[np.diag_indices_from(S)] = 0.0
.

We should zero the diagonal of the output for all metrics through pairwise_distances and pairwise_distances_chunked (where Y is None or Y is X) to reduce the effect of imprecision during distance calculation.

That is, for all metric we eventually want:

assert not np.any(pairwise_distances(X, metric=metric)[np.diag_indices(X.shape[0])])

and a similar assertion for pairwise_distances_chunked

I say eventually because I propose that:

  • if current output satisfies np.allclose(pairwise_distances(X, metric=metric)[np.diag_indices(X.shape[0])], 0), we just set it to 0
  • if current output does not satisfy that condition, we only raise a FutureWarning saying that "The specified metric is not a valid dissimilarity metric; it does not return metric(x, x) == 0 for some values. In version 0.23, the diagonal of pairwise_distances(X) will be set to 0." We do this because at the moment we do not strictly require that the metric be a valid dissimilarity measure.
  • we do the same for pairwise_distances_chunked
@eamanu
Copy link
Contributor

eamanu commented Nov 21, 2018

And should be delete the distances.flat[::distances.shape[0] + 1] = 0.0 for euclidean and cosine distances right?

I want to work on this

@jeremiedbb
Copy link
Member

I don't think so because you can use euclidean_distances instead of pairwise_distances and you still want 0s on the diagonal in that case.

Btw, np.fill_diagonal(array, 0) does the job and is more readable :)

@eamanu
Copy link
Contributor

eamanu commented Nov 21, 2018

So, the change must be added on the pairwise_distances() and whole metrics distances.

@jeremiedbb
Copy link
Member

Yes. However we rely on scipy's one for many of them, which are already consistent, so there shouldn't be too many to update

@jeremiedbb
Copy link
Member

We should zero the diagonal of the output for all metrics through pairwise_distances

@jnothman Actually there are tests in sklearn to test that metrics with metric(x,x) != 0 are allowed (which are not metrics then btw). For example test_pairwise_callable_nonstrict_metric().

I wonder why these kind of "metrics" are allowed. But if we want to keep that, we have to either add a parameter allow_non_metric to pairwise_distances, or make a new function pairwise_whatever_not_a_metric and force diag 0s in pairwise_distances.

@jnothman
Copy link
Member Author

jnothman commented Nov 21, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants