
[WIP] Detect precision issues in euclidean distance calculations #12142

Closed
wants to merge 3 commits

Conversation

Member

@rth rth commented Sep 24, 2018

Following the discussion in #9354, this is an attempt to warn the user when numerical precision issues may occur in euclidean distances computed with the quadratic expansion. This affects both 32-bit and 64-bit floats.

The overall idea is that we need to detect when the distance between two vectors is much less than the vector norms (by a factor of about 1e7 in 64-bit).

Here we take a simpler approach and instead only consider the global distribution of the compared vectors. If all considered vectors lie within a very thin shell, such that the shell thickness divided by the mean vector norm is below a given threshold, a warning is raised (cf. image below).

[Image: euclidean_distances_accuracy_illustration, showing clusters A, B and C of compared vectors relative to the origin]

  • If all vectors are within cluster A, a warning is raised, as it should be. For instance, an example of this would be [[100000001, 100000000]] and [[100000000, 100000001]], as reported by @kno10 in Numerical precision of euclidean_distances with float32 #9354 (comment).
  • If points belong to both clusters A and B (e.g. obtained by adding a minus sign to one of the above vectors), a warning will be raised when it shouldn't be (false positive).
  • If points belong to the very tight clusters A and C, no warning will be raised even though one should be.

Here the cost of checking for precision issues is only O(n_samples_a + n_samples_b), which is negligible compared to the O(n_samples_a * n_samples_b * n_features) cost of computing the pairwise distances.
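
To make the check concrete, here is a minimal standalone sketch of the heuristic (the helper name and the 1e-7 threshold are hypothetical illustrations, not the actual diff in this PR):

import warnings

import numpy as np

def warn_if_thin_shell(X, Y, rel_thickness_threshold=1e-7):
    # Only the norms of the compared vectors are needed, so the check
    # costs O(n_samples_a + n_samples_b).
    norms = np.concatenate([np.linalg.norm(X, axis=1),
                            np.linalg.norm(Y, axis=1)])
    mean_norm = norms.mean()
    if mean_norm == 0:
        return
    # If every vector lies in a thin spherical shell around the origin, the
    # expansion ||x||^2 - 2 x.y + ||y||^2 can cancel catastrophically.
    rel_thickness = (norms.max() - norms.min()) / mean_norm
    if rel_thickness < rel_thickness_threshold:
        warnings.warn(
            "euclidean_distances computed with the quadratic expansion may "
            "be inaccurate: all compared vectors lie within a shell of "
            "relative thickness %.2e." % rel_thickness
        )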

This is more of a proof of concept; more work would be needed to make it actually useful.

TODO

  • Check how this problem is addressed in other libraries. It seems to be a common issue, e.g. from the PyTorch discussion forum:

    This seems to be a well documented issue with this approach and existing libraries will use thresholding to default to the direct computation when needed.

  • Contact the scipy community about it.
  • Generalize the above approach. I think it might be possible to get an exhaustive list of problematic points fairly efficiently (possibly with one pass through the sorted norms ‖A‖₂, ‖B‖₂, or using e.g. BallTree.query_radius), then recompute those points accurately. The main idea is that comparing only the vector norms may indicate, at low computational cost, where precision could be problematic, even though this yields false positives (see the sketch below).
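
A minimal sketch of what that generalization could look like, assuming a deliberately naive per-pair fallback (hypothetical code, not part of this PR):

import numpy as np

def euclidean_distances_with_fallback(X, Y, rtol=1e-7):
    XX = (X ** 2).sum(axis=1)
    YY = (Y ** 2).sum(axis=1)
    # Fast quadratic-expansion path, clipping small negative values
    # before the square root (the np.maximum(0, ...) hack).
    D = np.sqrt(np.maximum(XX[:, None] - 2 * X @ Y.T + YY[None, :], 0))
    # Cheap necessary condition for trouble: if ||x - y|| is tiny relative to
    # ||x|| and ||y||, the two norms must themselves agree to within ~rtol.
    # This over-flags pairs (false positives) but should not miss real ones.
    nX, nY = np.sqrt(XX), np.sqrt(YY)
    suspect = (np.abs(nX[:, None] - nY[None, :])
               <= rtol * np.maximum(nX[:, None], nY[None, :]))
    # Recompute the flagged pairs with the direct formula sqrt(sum((x - y)**2)).
    for i, j in zip(*np.nonzero(suspect)):
        D[i, j] = np.sqrt(((X[i] - Y[j]) ** 2).sum())
    return D

The Python loop is only a placeholder; the point is that the norm comparison itself costs at most O(n_samples_a * n_samples_b), and only the flagged pairs pay the O(n_features) recomputation.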

kno10 and others added 2 commits September 23, 2018 11:13
Surprisingly bad precision, isn't it?

Note that the traditional computation sqrt(sum((x-y)**2)) gets the results exact.
warning_message += (
    "Consider standardizing features by removing the mean, "
    "or setting globally "
    "euclidean_distances_algorithm='exact' for slower but "
Member

should we not just back off to using the exact algorithm in this case?

Member Author

Yes, that would probably be the best outcome. But for this to actually be useful, I would like to explore more fine-grained detection of problematic points, instead of considering just the min/max of the distribution.

Contributor

Standardizing, centering, etc. will not help in all cases. Consider the 1-d data set -10001, -10000, 10000, 10001 with FP32; essentially the symmetric version of the example I provided before. The mean is zero, and scaling will not resolve the relative closeness of the pairs of values.
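
For illustration only, a minimal float32 reproduction of this point (not part of the original comment):

import numpy as np

# kno10's 1-d example: the data is already zero-mean, so centering changes nothing.
X = np.array([[-10001.0], [-10000.0], [10000.0], [10001.0]], dtype=np.float32)
sq_norms = (X ** 2).sum(axis=1)
D = np.sqrt(np.maximum(sq_norms[:, None] - 2 * X @ X.T + sq_norms[None, :], 0))
print(D)
# In float32 the expansion returns 0.0 for the (-10001, -10000) and
# (10000, 10001) pairs, while the direct formula
# np.sqrt(((X[3] - X[2]) ** 2).sum()) gives the correct distance 1.0.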

Member Author

@rth rth Oct 6, 2018

I agree, sample statistics are not the solution here. This was merely an attempt at prototyping. Theoretically though, since we already have the norms, we could use a comparison between the norms as a proxy to determine potentially problematic points, at a cost of O(n_samples_a * n_samples_b) compared to the overall O(n_samples_a * n_samples_b * n_features) cost of the full calculation, or less if we use some tree structure. Then we could recompute those points exactly, though that would have its own limitations...

@jnothman
Member

jnothman commented Sep 27, 2018 via email

@kno10
Contributor

kno10 commented Oct 6, 2018

Some additional good values for testing implementations include (note that these aim at a different thing: causing the distances to become negative due to rounding; they are derived from test values for verifying variance implementations, and the values are 1 ULP apart, while the numeric accuracy of the dot-based equation used here is supposedly sqrt(1 ULP)):

import numpy

fp32 = numpy.array([[2.8], [2.800001]], numpy.float32)
fp64 = numpy.array([[1.2], [1.2 - 2e-16]], numpy.float64)

These will cause havoc to a naive implementation:

print(numpy.sqrt((fp32 ** 2).sum(axis=1) - 2 * numpy.dot(fp32, fp32.transpose()) + (fp32 ** 2).sum(axis=1).reshape((-1, 1))))
print(numpy.sqrt((fp64 ** 2).sum(axis=1) - 2 * numpy.dot(fp64, fp64.transpose()) + (fp64 ** 2).sum(axis=1).reshape((-1, 1))))

Notice the NaNs:

cc.py:54: RuntimeWarning: invalid value encountered in sqrt
  print(numpy.sqrt((fp32 ** 2).sum(axis=1) - 2 * numpy.dot(fp32, fp32.transpose()) + (fp32 ** 2).sum(axis=1).reshape((-1, 1))))
[[ 0. nan]
 [nan  0.]]
cc.py:55: RuntimeWarning: invalid value encountered in sqrt
  print(numpy.sqrt((fp64 ** 2).sum(axis=1) - 2 * numpy.dot(fp64, fp64.transpose()) + (fp64 ** 2).sum(axis=1).reshape((-1, 1))))
[[ 0. nan]
 [nan  0.]]

The current released code should pass this test because of the np.maximum(0, ...) hack; but when working on improved versions, make sure not to break this.

@jnothman
Member

Happy to close?

@rth
Member Author

rth commented Feb 28, 2019

Agreed, closing following discussion in #9354 (comment)

@rth rth closed this Feb 28, 2019