Numerical precision of euclidean_distances with float32 #9354

Open
mikeroberts3000 opened this Issue Jul 13, 2017 · 50 comments

mikeroberts3000 commented Jul 13, 2017

Description

I noticed that the sklearn.metrics.pairwise.pairwise_distances function agrees with np.linalg.norm when using np.float64 arrays, but disagrees when using np.float32 arrays. See the code snippet below.

Steps/Code to Reproduce

import numpy as np
import scipy
import sklearn.metrics.pairwise

# create 64-bit vectors a and b that are very similar to each other
a_64 = np.array([61.221637725830078125, 71.60662841796875,    -65.7512664794921875],  dtype=np.float64)
b_64 = np.array([61.221637725830078125, 71.60894012451171875, -65.72847747802734375], dtype=np.float64)

# create 32-bit versions of a and b
a_32 = a_64.astype(np.float32)
b_32 = b_64.astype(np.float32)

# compute the distance from a to b using numpy, for both 64-bit and 32-bit
dist_64_np = np.array([np.linalg.norm(a_64 - b_64)], dtype=np.float64)
dist_32_np = np.array([np.linalg.norm(a_32 - b_32)], dtype=np.float32)

# compute the distance from a to b using sklearn, for both 64-bit and 32-bit
dist_64_sklearn = sklearn.metrics.pairwise.pairwise_distances([a_64], [b_64])
dist_32_sklearn = sklearn.metrics.pairwise.pairwise_distances([a_32], [b_32])

# note that the 64-bit sklearn results agree exactly with numpy, but the 32-bit results disagree
np.set_printoptions(precision=200)

print(dist_64_np)
print(dist_32_np)
print(dist_64_sklearn)
print(dist_32_sklearn)

Expected Results

I expect that the results from sklearn.metrics.pairwise.pairwise_distances would agree with np.linalg.norm for both 64-bit and 32-bit. In other words, I expect the following output:

[ 0.0229059506440019884643266578905240749008953571319580078125]
[ 0.02290595136582851409912109375]
[[ 0.0229059506440019884643266578905240749008953571319580078125]]
[[ 0.02290595136582851409912109375]]

Actual Results

The code snippet above produces the following output for me:

[ 0.0229059506440019884643266578905240749008953571319580078125]
[ 0.02290595136582851409912109375]
[[ 0.0229059506440019884643266578905240749008953571319580078125]]
[[ 0.03125]]

Versions

Darwin-16.6.0-x86_64-i386-64bit
('Python', '2.7.11 | 64-bit | (default, Jun 11 2016, 03:41:56) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]')
('NumPy', '1.11.3')
('SciPy', '0.19.0')
('Scikit-Learn', '0.18.1')

nvauquie commented Jul 18, 2017

Same results with Python 3.5:

Darwin-15.6.0-x86_64-i386-64bit
Python 3.5.1 (v3.5.1:37a07cee5969, Dec  5 2015, 21:12:44) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
NumPy 1.11.0
SciPy 0.18.1
Scikit-Learn 0.17.1

It happens only with the Euclidean distance and can be reproduced directly using sklearn.metrics.pairwise.euclidean_distances:

import numpy as np
import scipy
import sklearn.metrics.pairwise

# create 64-bit vectors a and b that are very similar to each other
a_64 = np.array([61.221637725830078125, 71.60662841796875,    -65.7512664794921875],  dtype=np.float64)
b_64 = np.array([61.221637725830078125, 71.60894012451171875, -65.72847747802734375], dtype=np.float64)

# create 32-bit versions of a and b
a_32 = a_64.astype(np.float32)
b_32 = b_64.astype(np.float32)

# compute the distance from a to b using sklearn, for both 64-bit and 32-bit
dist_64_sklearn = sklearn.metrics.pairwise.euclidean_distances([a_64], [b_64])
dist_32_sklearn = sklearn.metrics.pairwise.euclidean_distances([a_32], [b_32])

np.set_printoptions(precision=200)

print(dist_64_sklearn)
print(dist_32_sklearn)

I couldn't track the error down any further.
I hope this helps.

amueller added this to the 0.20 milestone Jul 18, 2017

jnothman commented Jul 18, 2017

ragnerok commented Sep 21, 2017

I'd like to work on this if possible

lesteve commented Sep 21, 2017

Go for it!

ragnerok commented Sep 24, 2017

So I think the problem lies in the fact that we are using sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)) to compute the Euclidean distance.
Because if I try -2 * np.dot(X, Y.T) + (X * X).sum(axis=1) + (Y * Y).sum(axis=1), I get the answer 0 for np.float32, while I get the correct answer for np.float64.

ragnerok commented Sep 28, 2017

@jnothman What do you think I should do, then? As mentioned in my comment above, the problem is probably the computation of the Euclidean distance using sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)).

jnothman commented Oct 3, 2017

So you're saying that dot is returning a less precise result than product-then-sum?

ragnerok commented Oct 3, 2017

No, what I'm trying to say is that dot is returning a more precise result than product-then-sum:
-2 * np.dot(X, Y.T) + (X * X).sum(axis=1) + (Y * Y).sum(axis=1) gives the output [[0.]],
while np.sqrt(((X-Y) * (X-Y)).sum(axis=1)) gives the output [ 0.02290595].

lesteve commented Oct 3, 2017

It is not clear what you are doing, partly because you are not posting a fully stand-alone snippet.

Quickly looking at your last post, the two things you are trying to compare, [[0.]] and [0.022...], do not have the same dimensions (maybe a copy-and-paste problem, but again hard to know because we don't have a full snippet).

ragnerok commented Oct 3, 2017

OK, sorry, my bad. Here is a stand-alone snippet:

import numpy as np
import scipy
from sklearn.metrics.pairwise import check_pairwise_arrays, row_norms
from sklearn.utils.extmath import safe_sparse_dot

# create 64-bit vectors a and b that are very similar to each other
a_64 = np.array([61.221637725830078125, 71.60662841796875,    -65.7512664794921875],  dtype=np.float64)
b_64 = np.array([61.221637725830078125, 71.60894012451171875, -65.72847747802734375], dtype=np.float64)

# create 32-bit versions of a and b (as 2D row vectors)
X = a_64.astype(np.float32).reshape(1, -1)
Y = b_64.astype(np.float32).reshape(1, -1)

X, Y = check_pairwise_arrays(X, Y)
XX = row_norms(X, squared=True)[:, np.newaxis]
YY = row_norms(Y, squared=True)[np.newaxis, :]

#Euclidean distance computed using product-then-sum
distances = safe_sparse_dot(X, Y.T, dense_output=True)
distances *= -2
distances += XX
distances += YY
print(np.sqrt(distances))

#Euclidean distance computed using (X-Y)^2
print(np.sqrt(row_norms(X-Y, squared=True)[:, np.newaxis]))

OUTPUT

[[ 0.03125]]
[[ 0.02290595136582851409912109375]]

The first method is how it is computed by the euclidean_distances function.
Also, to clarify, what I meant above is that product-then-sum has lower precision even when we use numpy functions to do it.

jnothman commented Oct 3, 2017

jnothman commented Oct 3, 2017

ragnerok commented Oct 5, 2017

So for this example product-then-sum works perfectly fine for np.float64, so a possible solution could be to convert the input to float64, compute the result, and return the result converted back to float32. I guess this would be more efficient, but I'm not sure whether it would work fine for some other examples.
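
A minimal sketch of this upcasting idea (illustrative only; the helper name euclidean_distances_upcast is made up and this is not scikit-learn's actual fix):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# upcast float32 input to float64, use the fast expanded formula,
# then cast the result back to float32
def euclidean_distances_upcast(X, Y):
    X64 = np.asarray(X, dtype=np.float64)
    Y64 = np.asarray(Y, dtype=np.float64)
    return euclidean_distances(X64, Y64).astype(np.float32)

a = np.array([61.221637725830078125, 71.60662841796875, -65.7512664794921875], dtype=np.float32)
b = np.array([61.221637725830078125, 71.60894012451171875, -65.72847747802734375], dtype=np.float32)
print(euclidean_distances_upcast([a], [b]))  # ~0.0229..., instead of the 0.03125 reported above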

jnothman commented Oct 7, 2017

ragnerok commented Oct 9, 2017

Oh yeah, you are right, sorry about that. But I think using float64 and then doing product-then-sum would be more computationally efficient, if not memory-wise.

ragnerok commented Oct 9, 2017

And the reason for using product-then-sum was computational efficiency, not memory efficiency.

jnothman commented Oct 9, 2017

ragnerok commented Oct 19, 2017

OK, so I created a Python script to compare the time taken by subtraction-then-squaring versus conversion to float64 followed by product-then-sum, and it turns out that if we choose X and Y as very big vectors, the two results are very different. Also, @jnothman, you were right: subtraction-then-squaring is faster.
Here's the script that I wrote; if there's any problem please let me know.

import numpy as np
import scipy
from sklearn.metrics.pairwise import check_pairwise_arrays, row_norms
from sklearn.utils.extmath import safe_sparse_dot
from timeit import default_timer as timer

for i in range(9):
    X = np.random.rand(1, 3 * (10**i)).astype(np.float32)
    Y = np.random.rand(1, 3 * (10**i)).astype(np.float32)

    X, Y = check_pairwise_arrays(X, Y)
    XX = row_norms(X, squared=True)[:, np.newaxis]
    YY = row_norms(Y, squared=True)[np.newaxis, :]

    # Euclidean distance computed using product-then-sum (float32)
    distances = safe_sparse_dot(X, Y.T, dense_output=True)
    distances *= -2
    distances += XX
    distances += YY

    ans1 = np.sqrt(distances)

    # Euclidean distance computed using subtraction-then-squaring (float32)
    start = timer()
    ans2 = np.sqrt(row_norms(X - Y, squared=True)[:, np.newaxis])
    end = timer()
    if ans1 != ans2:
        print(end - start)

        # product-then-sum after upcasting to float64
        start = timer()
        X = X.astype(np.float64)
        Y = Y.astype(np.float64)
        X, Y = check_pairwise_arrays(X, Y)
        XX = row_norms(X, squared=True)[:, np.newaxis]
        YY = row_norms(Y, squared=True)[np.newaxis, :]
        distances = safe_sparse_dot(X, Y.T, dense_output=True)
        distances *= -2
        distances += XX
        distances += YY
        distances = np.sqrt(distances)
        end = timer()
        print(end - start)
        print('')
        if abs(ans2 - distances) > 1e-3:
            # np.set_printoptions(precision=200)
            print(ans2)
            print(distances)

            print(X, Y)
            break

jnothman commented Oct 19, 2017

jnothman commented Oct 21, 2017

anyway, would you like to submit a PR, @ragnerok?

ragnerok commented Oct 21, 2017

Yeah sure, what do you want me to do?

jnothman commented Oct 22, 2017

ragnerok commented Nov 3, 2017

I wanted to ask whether it is possible to find the distance between each pair of rows with vectorisation. I cannot think of how to do it vectorised.

jnothman commented Nov 4, 2017

You mean the difference (not distance) between pairs of rows? Sure, you can do that if you're working with numpy arrays. If you have arrays with shapes (n_samples1, n_features) and (n_samples2, n_features), you just need to reshape them to (n_samples1, 1, n_features) and (1, n_samples2, n_features) and do the subtraction:

>>> X = np.random.randint(10, size=(10, 5))
>>> Y = np.random.randint(10, size=(11, 5))
>>> X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
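
A minimal sketch of how the broadcasted difference can then be turned into exact pairwise Euclidean distances, i.e. the subtract-then-square approach (illustrative only, not what euclidean_distances currently does):

import numpy as np

X = np.random.randint(10, size=(10, 5)).astype(np.float32)
Y = np.random.randint(10, size=(11, 5)).astype(np.float32)

# broadcasted pairwise differences, shape (n_samples1, n_samples2, n_features)
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
# exact pairwise distances, shape (n_samples1, n_samples2)
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(dist.shape)  # (10, 11)
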
ragnerok commented Nov 4, 2017

Yeah thanks that really helped 😄

amueller commented Sep 14, 2018

Remove the blocker?

kno10 commented Sep 21, 2018

sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))

is numerically unstable if dot(x,x) and dot(y,y) are of similar magnitude to dot(x,y), because of what is known as catastrophic cancellation.

This does not only affect FP32 precision; it is of course more prominent there and will fail much earlier.

Here is a simple test case that shows how bad this is even with double precision:

import numpy
from sklearn.metrics.pairwise import euclidean_distances

a = numpy.array([[100000001, 100000000]])
b = numpy.array([[100000000, 100000001]])

print("sklearn:", euclidean_distances(a, b)[0, 0])
print("correct:", numpy.sqrt(numpy.sum((a - b)**2)))

a = numpy.array([[10001, 10000]], numpy.float32)
b = numpy.array([[10000, 10001]], numpy.float32)

print("sklearn:", euclidean_distances(a, b)[0, 0])
print("correct:", numpy.sqrt(numpy.sum((a - b)**2)))

sklearn computes a distance of 0 here both times, rather than sqrt(2).

A discussion of the numerical issues for variance and covariance - and this trivially carries over to this approach of accelerating euclidean distance - can be found here:

Erich Schubert, and Michael Gertz.
Numerically Stable Parallel Computation of (Co-)Variance.
In: Proceedings of the 30th International Conference on Scientific and Statistical Database Management (SSDBM), Bolzano-Bozen, Italy. 2018, 10:1–10:12
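
A minimal pure-numpy sketch of where that 0 comes from (illustrative only): all three terms of the expanded formula are around 2e16, where the spacing between adjacent float64 values is already a few units, so their difference, whose true value is 2, rounds away completely.

import numpy as np

a = np.array([100000001.0, 100000000.0])
b = np.array([100000000.0, 100000001.0])

expanded = np.dot(a, a) - 2 * np.dot(a, b) + np.dot(b, b)  # catastrophic cancellation
direct = np.sum((a - b) ** 2)                              # exact here
print(expanded, direct)  # typically 0.0 and 2.0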

kno10 added a commit to kno10/scikit-learn that referenced this issue Sep 21, 2018

Add a test for numeric precision #9354
Surprisingly bad precision, isn't it?

Note that the traditional computation sqrt(sum((x-y)**2)) gets the results exact.

kno10 commented Sep 21, 2018

Actually the y coordinate can be removed from that test case; the correct distance then trivially becomes 1. I made a pull request that triggers this numeric problem:

    XA = np.array([[10000]], np.float32)
    XB = np.array([[10001]], np.float32)
    assert_equal(euclidean_distances(XA, XB)[0,0], 1)

I don't think my paper mentioned above provides a solution for this problem - just compute the Euclidean distance as sqrt(sum((x-y)**2)); it is single-pass and has reasonable precision. The loss comes from using the squares already, i.e., dot(x,x) itself already loses the precision.

@amueller as the problem may be more severe than expected, I suggest re-adding the blocker label...

jeremiedbb commented Sep 21, 2018

Thanks for this very simple example.

The reason it is implemented this way is that it's way faster. See below:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import euclidean_distances

x = np.random.random_sample((1000, 1000))

%timeit euclidean_distances(x, x)
20.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit cdist(x, x)
695 ms ± 4.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Although the number of operations is of the same order in both methods (1.5x more in the second one), the speedup comes from the possibility of using well-optimized BLAS libraries for matrix-matrix multiplication.

This would be a huge slowdown for several estimators in scikit-learn.

kno10 commented Sep 21, 2018

Yes, but having just 3-4 digits of precision with FP32, and 7-8 digits with FP64, does cause substantial imprecision, doesn't it? In particular since such errors tend to amplify...

jeremiedbb commented Sep 21, 2018

Well, I'm not saying that it's fine right now. :)
I'm saying that we need to find a solution in between.
There is a PR (#11271) which proposes to cast to float64 to do the computations. It does not fix the problem for float64 but gives better precision for float32.

Do you have an example where an estimator that uses euclidean_distances gives wrong results due to the loss of precision?

jnothman added the Blocker label Sep 22, 2018

jnothman commented Sep 22, 2018

I certainly still think this is a big deal and should be a blocker for 0.21. It was an issue introduced for 32 bit in 0.19, and it's not a nice state of affairs to leave. I wish we had resolved it earlier in 0.20, and I would be okay, or even keen, to see #11271 merged in the interim. The only issues in that PR that I know of surround optimisation of memory efficiency, which is a deep rabbit hole.

We've had this "fast" version for a long time, but always in float64. I know, @kno10, that it's got issues with precision. Do you have a good and fast heuristic for us to work out when that might be a problem and use a slower-but-surer solution?

rth commented Sep 22, 2018

Yes, but having just 3-4 digits of precision with FP32, and 7-8 digits with FP64, does cause substantial imprecision, doesn't it?

Thanks for illustrating this issue with a very simple example!

I don't think the issue is as widespread as you suggest, however -- it mostly affects samples whose mutual distance is small with respect to their norms.

The figure below illustrates this, for 2e6 random sample pairs, where each 1D sample is in the interval [-100, 100]. The relative error between the scikit-learn and scipy implementations is plotted as a function of the distance between samples, normalized by their L2 norms, i.e.,

d_norm(A, B) = d(A, B) / sqrt(‖A‖₂*‖B‖₂)

(not sure it's the right parametrization, but just to get results somewhat invariant to the data scale):

[figure: euclidean_distance_precision_1d]

For instance,

  1. if one takes [10000] and [10001], the L2-normalized distance is 1e-4 and the relative error on the distance calculation will be 1e-8 in 64 bit, and >1 in 32 bit (or 1e-8 and >1 in absolute value, respectively). In 32 bit this case is indeed quite terrible.
  2. on the other hand for [1] and [10001], the relative error will be ~1e-7 in 32 bit, or the maximum possible precision.

The question is how often case 1 will happen in practice in ML applications.

Interestingly, if we go to 2D, again with a uniform random distribution, it will be difficult to find points that are very close:

[figure: euclidean_distance_precision_2d]

Of course, in reality our data will not be uniformly sampled, but for any distribution, because of the curse of dimensionality, the distance between any two points will slowly converge to very similar values (different from 0) as the dimensionality increases. While it's a general ML issue, here it may mitigate this accuracy problem somewhat, even for relatively low dimensionality. Below are the results for n_features=5:

[figure: euclidean_distance_precision_5d]

So for centered data, at least in 64 bit, it may not be so much of an issue in practice (assuming there are more than 2 features). The 50x computational speed-up (as illustrated above) may be worth it (in 64 bit). Of course one can always add 1e6 to some data normalized in [-1, 1] and say that the results are not accurate, but I would argue that the same applies to a number of numerical algorithms, and working with data expressed in the 6th significant digit is just looking for trouble.

(The code for the above figures can be found here).
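
A rough sketch of this kind of comparison (illustrative only, with far fewer pairs; this is not the actual code behind the figures):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
n_pairs = 2000
A = rng.uniform(-100, 100, size=(n_pairs, 1))
B = rng.uniform(-100, 100, size=(n_pairs, 1))

d_sklearn = np.diag(euclidean_distances(A, B))  # expanded formula
d_scipy = np.diag(cdist(A, B))                  # exact reference

norms = np.sqrt(np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
d_normalized = d_scipy / norms
relative_error = np.abs(d_sklearn - d_scipy) / d_scipy

# pairs that are close relative to their norms show the largest errors
print(relative_error[np.argsort(d_normalized)[:5]])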

kno10 commented Sep 22, 2018

Any fast approach using the dot(x,x) + dot(y,y) - 2*dot(x,y) version will likely have the same issue, as far as I can tell, but you'd better ask a real expert on numerics about this. I believe you'll need to double the precision of the dot products to get to approximately the precision of the input data (and I'd assume that if a user provides float32 data, they'll want float32 precision; with float64, they'll want float64 precision). You may be able to do this with some tricks (think of Kahan summation), but it will very likely cost you much more than you gained in the first place.

I can't tell how much overhead you get from converting float32 to float64 on the fly for using this approach. At least for float32, to my understanding, doing all the computations and storing the dot products as float64 should be fine.

IMHO, the performance gains (which are not exponential, just a constant factor) are not worth the loss in precision (which can bite you unexpectedly), and the proper way is to not use this problematic trick. It may, however, well be possible to further optimize code doing the "traditional" computation, for example to use AVX, because sum((x-y)**2) is anything but difficult to implement in AVX.
At the minimum, I would suggest renaming the method to approximate_euclidean_distances, because of the sometimes low precision (which gets worse the closer two values are, which may be fine initially but begins to matter when converging to some optimum), so that users are aware of this issue.
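
For reference, a minimal sketch of the Kahan (compensated) summation trick mentioned above (illustrative only; a Python loop like this would of course be far too slow for scikit-learn itself):

import numpy as np

def kahan_squared_distance(x, y):
    # compensated summation of sum((x - y)**2) in float32
    total = np.float32(0.0)
    comp = np.float32(0.0)
    for xi, yi in zip(x, y):
        term = np.float32((xi - yi) * (xi - yi)) - comp
        new_total = total + term
        # (new_total - total) is what was actually accumulated; subtracting
        # term recovers the low-order bits that were rounded away
        comp = (new_total - total) - term
        total = new_total
    return np.sqrt(total)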

kno10 commented Sep 22, 2018

@rth thanks for the illustrations. But what if you are trying to optimize, e.g., x towards some optimum? Most likely the optimum will not be at zero (if it were always your data center, life would be great), and eventually the deltas you are computing for gradients etc. may have some very small differences.
Similarly, in clustering, the clusters will not all have their centers close to zero, but in particular with many clusters, x ≈ center to within a few digits is quite possible.

rth commented Sep 22, 2018

Overall, however, I agree this issue needs fixing. In any case we need to document the precision issues of the current implementation as soon as possible.

In general, though, I don't think this discussion should happen in scikit-learn. Euclidean distance is used in various fields of scientific computing, and IMO the scipy mailing list or issue tracker is a better place to discuss it: that community also has more experience with such numerical precision issues. In fact, what we have here is a fast but somewhat approximate algorithm. We may have to implement some fixes/workarounds in the short term, but in the long term it would be good to know that this will be contributed there.

For 32 bit, #11271 may indeed be a solution; I'm just not so keen on multiple levels of chunking all through the library, as that increases code complexity, and I want to make sure there is no better way around it.
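
A very rough sketch of what chunked upcasting could look like (illustrative only; this is not the actual #11271 implementation and the helper name is made up):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def euclidean_distances_chunked_upcast(X, Y, chunk_size=256):
    # upcast to float64 chunk by chunk along Y, so the float64 intermediates
    # stay bounded, and store the result back as float32
    X64 = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y)
    out = np.empty((X64.shape[0], Y.shape[0]), dtype=np.float32)
    for start in range(0, Y.shape[0], chunk_size):
        stop = min(start + chunk_size, Y.shape[0])
        chunk = np.asarray(Y[start:stop], dtype=np.float64)
        out[:, start:stop] = euclidean_distances(X64, chunk)
    return out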

rth commented Sep 22, 2018

Thanks for your response @kno10! (My comment above doesn't take it into account yet.) I'll respond a bit later.

rth commented Sep 22, 2018

Yes, convergence to some point outside of the origin may be an issue.

IMHO, the performance gains (which are not exponential, just a constant factor) are not worth the loss in precision (which can bite you unexpectedly) and the proper way is to not use this problematic trick.

Well, a >10x slowdown of this calculation in 64 bit would have a very real effect on users.

It may, however, well be possible to further optimize the code doing the "traditional" computation, for example to use AVX, because sum((x-y)**2) is anything but difficult to implement in AVX.

Tried a quick naive implementation with numba (which should use SSE),

import math

import numba
import numpy as np


@numba.jit(nopython=True, fastmath=True)
def pairwise_distance_naive(A, B):
    n_samples_a, n_features_a = A.shape
    n_samples_b, n_features_b = B.shape
    assert n_features_a == n_features_b
    distance = np.empty((n_samples_a, n_samples_b), dtype=A.dtype)
    for i in range(n_samples_a):
        for j in range(n_samples_b):
            psum = 0.0
            for k in range(n_features_a):
                psum += (A[i, k] - B[j, k])**2
            distance[i, j] = math.sqrt(psum)
    return distance

I'm getting a similar speed to scipy's cdist so far (but I'm not a numba expert), and I'm also not sure about the effect of fastmath.

using the dot(x,x)+dot(y,y)-2*dot(x,y) version

Just for future reference, what we are currently doing is roughly the following (because there is a dimension that doesn't show in the above notation),

import numpy as np


def quadratic_pairwise_distance(A, B):
    A2 = np.einsum('ij,ij->i', A, A)
    B2 = np.einsum('ij,ij->i', B, B)
    return np.sqrt(A2[:, None] + B2[None, :] - 2*np.dot(A, B.T))

where both einsum and dot now use BLAS. I wonder whether, aside from using BLAS, this actually does the same number of mathematical operations as the first version above.

jeremiedbb commented Sep 22, 2018

I wonder whether, aside from using BLAS, this actually does the same number of mathematical operations as the first version above.

No. The ((x - y)**2).sum() version performs
n_samples_x * n_samples_y * n_features * (1 subtraction + 1 addition + 1 multiplication),
whereas the x.x + y.y - 2*x.y version performs
n_samples_x * n_samples_y * n_features * (1 addition + 1 multiplication).
There is a 2/3 ratio between the numbers of operations of the two versions.

rth added a commit to rth/scikit-learn that referenced this issue Sep 23, 2018

Add a test for numeric precision #9354
Surprisingly bad precision, isn't it?

Note that the traditional computation sqrt(sum((x-y)**2)) gets the results exact.

rth changed the title from sklearn.metrics.pairwise.pairwise_distances agrees exactly with np.linalg.norm when using np.float64 arrays, but disagrees when using np.float32 arrays to Numerical precision of euclidean_distances with float32 Sep 24, 2018

rth commented Sep 24, 2018

Following the above discussion,

  • Made a PR to optionally allow computing euclidean distances exactly #12136
  • Some WIP to see if we can detect and mitigate the problematic points in #12142

For 32 bit, we still need to merge #11271 in some form though IMO, the above PRs are somewhat orthogonal to it.

kno10 commented Sep 24, 2018

FYI: when fixing some issues in OPTICS and refreshing the tests to use reference results from ELKI, the tests fail with metric="euclidean" but succeed with metric="minkowski". The numerical differences are large enough to cause a different processing order (just decreasing the threshold is not enough).

https://github.com/kno10/scikit-learn/blob/ab544709a392e4dc7b22e9fd60c4755baf3d3053/sklearn/cluster/tests/test_optics.py#L588
