BUG: spatial: improve dtype stability for all distance metrics #14909

AtsushiSakai · 2021-10-23T14:08:51Z

Reference issue

What does this implement/fix?

As reported in #9961, the current distance metric functions can return different results for the same values depending on the input dtype (integer vs. float).
So I moved the type-conversion code from sqeuclidean into _validate_vector so that it applies to every distance metric.
I also added a test for it.
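
To make the reported instability concrete, here is a minimal sketch using only NumPy. The helper `naive_sqeuclidean` is a hypothetical stand-in written for this illustration, not SciPy's implementation; it only shows how unsigned integer arithmetic can wrap around where float arithmetic does not.

```python
import numpy as np


def naive_sqeuclidean(u, v):
    # Hypothetical helper for illustration only: computes the squared
    # Euclidean distance in whatever dtype the inputs already have.
    diff = u - v
    return np.dot(diff, diff)


u_int = np.array([1, 200], dtype=np.uint8)
v_int = np.array([3, 100], dtype=np.uint8)

# uint8 subtraction wraps around (1 - 3 becomes 254), so the integer
# result disagrees with the float result.
int_result = naive_sqeuclidean(u_int, v_int)
float_result = naive_sqeuclidean(u_int.astype(np.float64),
                                 v_int.astype(np.float64))

print(int_result, float_result)
```

Casting to float64 before the subtraction is the kind of conversion the description proposes to centralize.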

tylerjereddy · 2021-10-24T02:43:05Z

scipy/spatial/distance.py

+        # or the dtype is not string.
+        if not (hasattr(u, "dtype") and
+                (np.issubdtype(u.dtype, np.inexact)
+                 or (u.dtype.kind == 'S'))):


Do we want to use, e.g., an np.character subdtype check here, in case another flexible dtype such as Unicode is used?
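
A small sketch of the difference between the two checks, using only NumPy. The array names are made up for the illustration; it compares the `kind == 'S'` test from the diff with an `np.issubdtype(..., np.character)` test on byte-string and Unicode arrays.

```python
import numpy as np

bytes_arr = np.array([b"a", b"b"])  # byte strings, dtype kind "S"
text_arr = np.array(["a", "b"])     # Unicode strings, dtype kind "U"

for arr in (bytes_arr, text_arr):
    kind_is_s = arr.dtype.kind == "S"
    is_character = np.issubdtype(arr.dtype, np.character)
    print(arr.dtype, kind_is_s, is_character)
```

The `kind == 'S'` test only recognizes byte strings, while the `np.character` test covers both string dtypes.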

tylerjereddy · 2021-10-24T03:04:34Z

scipy/spatial/distance.py

+        # for stability only when the dtype is not a subtype of inexact
+        # or the dtype is not string.
+        if not (hasattr(u, "dtype") and
+                (np.issubdtype(u.dtype, np.inexact)


So we always convert array-like inputs (even a list of "characters") as long as they are not a NumPy array proper?

For proper NumPy arrays, we do coerce to a floating point type when it makes sense, and that includes additional checking for things like characters that can't easily be converted to floats (which we don't check for with other array-like).

I don't know if it matters too much--perhaps attempting an asarray() temporary conversion for the array-like values just for this check might be overkill anyway.

Thank you. Right, distance.hamming might receive a list of strings. I think only a NumPy array should be converted to a floating-point array. I fixed it and added some tests for it. PTAL.
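
The concern about string input can be sketched as follows. `mismatch_fraction` is a hypothetical helper that mirrors the idea behind a Hamming-style comparison; it is not SciPy's code. The point is that forcing a float dtype onto such input fails, while an elementwise comparison works without any conversion.

```python
import numpy as np

u = ["a", "b", "c"]
v = ["a", "x", "c"]

# Forcing a float dtype onto non-numeric strings raises ValueError.
try:
    np.asarray(u, dtype=np.float64)
    converted = True
except ValueError:
    converted = False


def mismatch_fraction(a, b):
    # Hypothetical helper: fraction of positions where the entries differ.
    a = np.asarray(a)
    b = np.asarray(b)
    return np.mean(a != b)


print(converted, mismatch_fraction(u, v))
```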

tylerjereddy · 2021-10-24T16:28:25Z

scipy/spatial/tests/test_distance.py

+        elif metric == seuclidean:
+            assert_almost_equal(metric(ai, bi, vo), metric(af, bf, vo))
+        else:
+            assert_almost_equal(metric(ai, bi), metric(af, bf))


Here and above, I suppose we may wish to use the recommended assert_allclose when adding new code, per: https://numpy.org/doc/stable/reference/generated/numpy.testing.assert_almost_equal.html
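
As a rough illustration of why the two assertions behave differently, the sketch below compares one pair of large values. The values and the `rtol` are chosen for the example only and are not taken from the PR's tests.

```python
import numpy as np
from numpy.testing import assert_allclose, assert_almost_equal

desired = 1.0e10
actual = desired + 1.0  # relative error of 1e-10

# assert_allclose with a relative tolerance accepts this difference.
assert_allclose(actual, desired, rtol=1e-7)

# assert_almost_equal compares the absolute difference against
# 1.5 * 10**(-decimal), so the same pair fails at the default decimal=7.
try:
    assert_almost_equal(actual, desired)
    almost_equal_passed = True
except AssertionError:
    almost_equal_passed = False

print(almost_equal_passed)
```

For distances that can be large, a relative tolerance is usually the easier one to reason about.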

tylerjereddy · 2021-10-24T17:39:24Z

scipy/spatial/distance.py

@@ -298,6 +298,15 @@ def _validate_seuclidean_kwargs(X, m, n, **kwargs):


 def _validate_vector(u, dtype=None):
+
+    if dtype is None and type(u) is np.ndarray:


Doesn't this change the behavior for sqeuclidean() from what it was before? I'm not sure if we need to worry about it too much, but I'll just outline this below with an example.

This would mean two changes, I think:

- we would no longer coerce subtypes of ndarray to floats because of the new is np.ndarray check
- we would no longer coerce array-like to float64 arrays, which it looks like we did before because of the short-circuit on hasattr(u, "dtype")

    import numpy as np
    from scipy.spatial.distance import sqeuclidean

    np.random.seed(31415)
    a = np.random.randint(100, size=100, dtype='uint8')
    b = np.random.randint(100, size=100, dtype='uint8')

    class C(np.ndarray):
        pass

    c_arr_a = a.view(C)
    c_arr_b = b.view(C)

    print(sqeuclidean(c_arr_a, c_arr_b))
    print(sqeuclidean(c_arr_a.astype(float), c_arr_b.astype(float)))
    print(sqeuclidean(c_arr_a.tolist(), c_arr_b.tolist()))
    print(type(sqeuclidean(c_arr_a.tolist(), c_arr_b.tolist())))

    # issue9961 feature branch:
    # 248
    # 166136.0
    # 166136
    # <class 'numpy.int64'>

    # master:
    # 166136.0
    # 166136.0
    # 166136.0
    # <class 'numpy.float64'>

Thank you again. I missed the a.view(C) case and the type mismatch case. I extended the tests to cover them.
The only solution I could come up with is to convert the input u to an ndarray beforehand. Do you have a better idea?
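
A minimal sketch of the "convert to ndarray beforehand" idea, using only NumPy. It contrasts an exact `type(u) is np.ndarray` check with normalizing through `np.asarray`; the variable names are made up for the illustration.

```python
import numpy as np


class C(np.ndarray):
    # Trivial ndarray subclass, as in the review example above.
    pass


base = np.arange(3, dtype=np.uint8)
sub = base.view(C)
as_list = base.tolist()

# An exact type check only matches plain ndarrays.
exact_matches = [type(x) is np.ndarray for x in (base, sub, as_list)]

# np.asarray returns a base-class ndarray for all three inputs.
normalized = [np.asarray(x) for x in (base, sub, as_list)]
normalized_types = [type(x) is np.ndarray for x in normalized]

print(exact_matches, normalized_types)
```

After normalization, a single dtype check can cover subclasses and array-likes as well as plain arrays.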

peterbell10 · 2021-10-26T15:43:07Z

scipy/spatial/distance.py

+        # or the dtype is not string.
+        u = np.asarray(u)
+        if not (np.issubdtype(u.dtype, np.inexact)
+                or (np.issubdtype(u.dtype, np.character))):


I don't think this is a good idea because it affects metrics expecting boolean input, where floating-point input isn't necessary or helpful. Instead, how about creating a new function, _validate_real_vector, and calling it only in the functions that require real inputs?
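
One possible shape for such a helper is sketched below. `validate_real_vector_sketch` is a hypothetical name and body written for this illustration; it is not the code from this PR and not SciPy's API. It assumes numeric input and promotes anything that is not already floating or complex to float64, while boolean metrics would keep using a validator that leaves the dtype alone.

```python
import numpy as np


def validate_real_vector_sketch(u):
    # Hypothetical helper, not the code from this PR: coerce anything that
    # is not already a floating or complex array to float64.
    u = np.asarray(u)
    if not np.issubdtype(u.dtype, np.inexact):
        u = u.astype(np.float64)
    if u.ndim != 1:
        raise ValueError("Input vector should be 1-D.")
    return u


ints = np.array([1, 200], dtype=np.uint8)
floats32 = np.array([1.0, 2.0], dtype=np.float32)
bools = np.array([True, False])

print(validate_real_vector_sketch(ints).dtype)      # integers are promoted
print(validate_real_vector_sketch(floats32).dtype)  # floats keep their dtype
print(np.asarray(bools).dtype)                      # boolean path left alone
```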

AtsushiSakai · 2021-10-29T11:05:29Z

jensenshannon does not currently use _validate_vector. I'm not sure whether this is intended.

scipy/scipy/spatial/distance.py

Lines 1369 to 1370 in eca0cd3

    
    p = np.asarray(p)
    q = np.asarray(q)

peterbell10 · 2021-10-29T11:07:25Z

It can't use _validate_vector because it supports an axis argument and higher-dimensional input.
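
To illustrate what an axis argument with higher-dimensional input means here, the sketch below computes a Jensen-Shannon-style distance row by row with plain NumPy. `js_distance_sketch` is a hypothetical function, not SciPy's implementation, and it assumes strictly positive entries.

```python
import numpy as np


def js_distance_sketch(p, q, axis=0):
    # Sketch only, not SciPy's implementation; assumes strictly positive
    # entries so that no log(0) terms appear.
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / p.sum(axis=axis, keepdims=True)
    q = q / q.sum(axis=axis, keepdims=True)
    m = (p + q) / 2.0
    kl_pm = np.sum(p * np.log(p / m), axis=axis)
    kl_qm = np.sum(q * np.log(q / m), axis=axis)
    # Clip tiny negative round-off before taking the square root.
    return np.sqrt(np.maximum((kl_pm + kl_qm) / 2.0, 0.0))


p = np.array([[0.1, 0.9], [0.5, 0.5]])
q = np.array([[0.1, 0.9], [0.9, 0.1]])

# One distance per row: the input is 2-D, which a 1-D-only validator
# would reject.
d = js_distance_sketch(p, q, axis=1)
print(d.shape, d)
```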

AtsushiSakai · 2021-11-06T06:37:02Z

I rebased to fix CI failures.

AtsushiSakai requested review from peterbell10 and tylerjereddy as code owners October 23, 2021 14:08

github-actions bot added the scipy.spatial label Oct 23, 2021

AtsushiSakai added defect and scipy.spatial and removed scipy.spatial labels Oct 23, 2021

tylerjereddy reviewed Oct 24, 2021

AtsushiSakai requested a review from tylerjereddy October 24, 2021 06:13

tylerjereddy reviewed Oct 24, 2021

peterbell10 reviewed Oct 26, 2021

AtsushiSakai force-pushed the issue9961 branch from eca0cd3 to 7656282 Compare November 6, 2021 04:59

AtsushiSakai requested review from tylerjereddy and peterbell10 November 6, 2021 06:35

AtsushiSakai added 9 commits December 28, 2021 12:28

BUG: spatial: fix dtype handling for all distance metrics and add a test

81f939a

BUG: spatial: use np.issubdtype(np.caracter) for string check.

ce6a52b

BUG: spatial: fix float casting and add tests

693e778

BUG: spatial: remove unnecessary condition check

1173b43

BUG: spatial: fix lint error

6f195ff

BUG: spatial: using assert_allclose instead of assert_almost_equal

f066e35

BUG: spatial: use ndarray convert and extend tests.

d0263f8

BUG: spatial: add _validate_real_vector and apply it for real vector input distance functions.

aba5522

BUG: spatial: fix conflict

cc21277

AtsushiSakai force-pushed the issue9961 branch from d98490f to cc21277 Compare December 28, 2021 03:51

AtsushiSakai closed this May 5, 2022
