ENH: spatial.distance.jaccard: raise `ValueError` for non-boolean arrays #14357

yacth · 2021-07-06T03:01:02Z

Reference issue

What does this implement/fix?

According to the description of the Jaccard distance, it should take only 1-D booleans arrays, plus according to #14304 (comment), we should add a ValueError, for the case we don't have booleans arrays.

rkern · 2021-07-06T03:38:39Z

scipy/spatial/distance.py

@@ -861,6 +861,9 @@ def jaccard(u, v, w=None):
    0.66666666666666663

    """
+    if np.isin(u, [0, 1]).all() == False or np.isin(v, [0, 1]).all() == False:
+        raise ValueError('u and v should be boolean 1-D arrays')


Any input validation should probably come after the _validate_vector() calls to minimize data copying.

Note that the code that follows this uses u != 0 and v != 0 to coerce the values to bool Trues and Falses, so it was written to expect numerical values other than strict 0s and 1s. I would probably fix this bug by explicitly casting these arrays to bool arrays (and removing the superfluous != 0s) rather than doing relatively expensive checking for exact 0s and 1s. I think that comports with the intended functionality and documentation best.

rkern · 2021-07-06T03:58:15Z

The test failures are real. Note that some of the tests for the weight handling do modify the data in ways that make them not 0s and 1s. While I think that we should continue to have this behavior, if we do decide to restrict the inputs to just 0s and 1s (and Falses and Trues), then you would also have to modify the tests to do something else in the case of the boolean-restricted input domain.

yacth · 2021-07-06T12:21:30Z

The test failures are real. Note that some of the tests for the weight handling do modify the data in ways that make them not 0s and 1s. While I think that we should continue to have this behavior, if we do decide to restrict the inputs to just 0s and 1s (and Falses and Trues), then you would also have to modify the tests to do something else in the case of the boolean-restricted input domain.

Of course, I just did a draft to start the discussion. If it is validated that the behavior should be restricted to {0, 1}/{False, True}. I will change some test cases because they will be then invalidated.

EDIT :
Actually why do we want that the Jaccard distance to be only for boolean arrays as specified in the parameter definition in the function :

    Parameters
    ----------
    u : (N,) array_like, bool
        Input array.
    v : (N,) array_like, bool
        Input array.
    w : (N,) array_like, optional
        The weights for each value in `u` and `v`. Default is None,
        which gives each value a weight of 1.0

and then give a working example with non boolean value :

    >>> distance.jaccard([1, 0, 0], [1, 2, 0])
    0.5

peterbell10 · 2021-07-06T18:58:54Z

IMO restricting to boolean dtype makes far more sense to me than allowing any dtypes as long as their values are 0 or 1.

Actually why do we want that the Jaccard distance to be only for boolean arrays as specified in the parameter definition in the function [...] and then give a working example with non boolean value.

It shows that the != 0 behavior was by design and advertised in the docs. So, it would definitely require a deprecation cycle to change this behavior.

rkern · 2021-07-06T19:24:18Z

Just change the != 0s to np.nonzero() and you'll solve #14304 without having to change the tests or documentation.

yacth · 2021-07-07T02:51:42Z

Just change the != 0s to np.nonzero() and you'll solve #14304 without having to change the tests or documentation.

!=0 gives us a boolean array that checks if the value of the input arrays are non zero, but np.nonzero() gives us the indices of the values which are non zero. Thus it will change the whole architecture of the code.
As it is right now it works perfecly fine.
Do you think it will work better with np.nonzero()

IMO restricting to boolean dtype makes far more sense to me than allowing any dtypes as long as their values are 0 or 1.

Actually why do we want that the Jaccard distance to be only for boolean arrays as specified in the parameter definition in the function [...] and then give a working example with non boolean value.

It shows that the != 0 behavior was by design and advertised in the docs. So, it would definitely require a deprecation cycle to change this behavior.

Sorry I might have not understood, I believe that the input arrays u and v for which we want to calculate the Jaccard distance should be boolean 1-D arrays ? (For the implemented code, even if the mathematical definition allows us to use anything if I understood it correctly)
Thus I wanted to check using the following code because it recognizes values in {0, 1} and {False, True} for booleans:

>>> u = [1, 0, 0]
>>> np.isin(u, [0, 1]).all()
True
>>> v = [True, False, False]
>>> np.isin(v, [0, 1]).all()
True
>>>

Using the following code doesn't recognize values in {0, 1} as booleans even when they should be considered.

>>> u = [1, 0, 0]
>>> [type(u_i) == bool for u_i in u]
[False, False, False]

rkern · 2021-07-07T02:58:23Z

Oh, sorry, I thought nonzero was doing something else. Cast with .astype(bool) instead.

rkern · 2021-07-07T03:00:55Z

Allow values that aren't exactly 0 or 1. We document and test for such cases.

yacth · 2021-07-07T22:37:56Z

Oh, sorry, I thought nonzero was doing something else. Cast with .astype(bool) instead.

For which purpose do you want to cast it ?
Because it will transform any of the input array to boolean arrays i.e. 0 -> False Any number -> True, plus it doesn't recognizes lists and we have lists as input:

>>> import numpy as np
>>> u = [1, 0, 1]
>>> u.astype(bool)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'astype'
>>> np.array(u).astype(bool)
array([ True, False,  True])
>>> v = [0, 1, 2]
>>> np.array(v).astype(bool)
array([False,  True,  True])

If I am not mistaken the first idea I had checks in a good manner if any list is a list of booleans or not (except if we have a list of 1 and 0 which are integer to fool the function).

rkern · 2021-07-07T22:53:18Z

Because it will transform any of the input array to boolean arrays i.e. 0 -> False Any number -> True

Yes, this is the desired, documented, and tested behavior.

plus it doesn't recognizes lists and we have lists as input:

You would the casting after the _validate_vector() calls, which coerce the inputs to arrays.

… !=0

yacth · 2021-07-08T19:17:35Z

Because it will transform any of the input array to boolean arrays i.e. 0 -> False Any number -> True

Yes, this is the desired, documented, and tested behavior.

plus it doesn't recognizes lists and we have lists as input:

You would the casting after the _validate_vector() calls, which coerce the inputs to arrays.

Oh I see, thank you for your guidance.

But I am wondering if we have to return a specific ValueError ?

rkern · 2021-07-08T19:32:48Z

No, no particular ValueError. The errors from casting should be sufficient.

scipy/spatial/distance.py

melissawm · 2023-07-24T13:44:23Z

Hi all - is this ready to go?

lucascolley

Placing a blocking review here as this still exhibits incorrect behaviour.

@yacth, if you'd like to return to this, please could you add a unit test based on @rkern's example in #12174 (comment) ? In particular,

jaccard([1, 0, 0], [1, 2, 0]) should return the same value as

jaccard([True, False, False], [True, True, False]), while

jaccard([1, 1, 0], [1, 2, 0]) should return the same value as

jaccard([True, True, False], [True, True, False]).

The fix in the function should be rather simple - move the casting above, to

u = _validate_vector(u).astype(bool, copy=False)
v = _validate_vector(v).astype(bool, copy=False)

and make the changed line nonzero = np.bitwise_or(u, v), if I'm not mistaken. This means that the u != v will give False in the case of 1 and 2, since they are both truthy.

Robert's comment

I think it would be a good idea to define (and test) the behavior of all of these functions that they will convert the input arrays to booleans by the truthiness of the inputs.

still stands, but that can be left to a different issue and future PRs.

P.S. apologies for the long review time!

lucascolley · 2024-04-05T23:41:14Z

The alternative is to simply require boolean input, as mentioned by Peter above. I think that would be an unnecessary deprecation though - while the poorly defined behaviour on arrays of numbers is quite nasty here, it wouldn't surprise me if there is lots of harmless use of arrays of 0s and 1s out in the wild.

yacth added 2 commits July 5, 2021 22:55

Making the Jaccard distance accept only booleans arrays

56c4f35

Fixing PEP8 issues

f769e5b

yacth requested review from peterbell10 and tylerjereddy as code owners July 6, 2021 03:01

yacth changed the title ~~Jaccard distance~~ Jaccard distance : ValueError for non-booleans arrays Jul 6, 2021

yacth changed the title ~~Jaccard distance : ValueError for non-booleans arrays~~ ENH : Jaccard distance : ValueError for non-booleans arrays Jul 6, 2021

rkern reviewed Jul 6, 2021

View reviewed changes

tylerjereddy added the scipy.spatial label Jul 6, 2021

yacth added 2 commits July 7, 2021 23:55

Changing the transformation by casting .astype(bool) instead of using…

c119cf0

… !=0

Removing a development line

01bb181

peterbell10 reviewed Jul 10, 2021

View reviewed changes

scipy/spatial/distance.py Outdated Show resolved Hide resolved

Preventing .astype() to copy the vectors

62acf67

yacth closed this Jul 11, 2021

yacth deleted the Jaccard-distance branch July 11, 2021 18:39

yacth restored the Jaccard-distance branch July 11, 2021 18:44

yacth reopened this Jul 11, 2021

rkern mentioned this pull request Jul 23, 2021

Wrong documentation for Jaccard #12174

Open

aricooperdavis mentioned this pull request Jul 24, 2021

DOC: Correct spatial.distance.jaccard() documentation, see #12174 #14464

Closed

lucascolley added the enhancement A new feature or improvement label Dec 23, 2023

lucascolley added this to the 1.14.0 milestone Apr 5, 2024

lucascolley requested changes Apr 5, 2024

View reviewed changes

lucascolley removed this from the 1.14.0 milestone Apr 6, 2024

lucascolley changed the title ~~ENH : Jaccard distance : ValueError for non-booleans arrays~~ ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays May 18, 2024

lucascolley changed the title ~~ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays~~ ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays May 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: spatial.distance.jaccard: raise `ValueError` for non-boolean arrays #14357

ENH: spatial.distance.jaccard: raise `ValueError` for non-boolean arrays #14357

yacth commented Jul 6, 2021 •

edited by tupui

Loading

rkern Jul 6, 2021

rkern commented Jul 6, 2021

yacth commented Jul 6, 2021 •

edited

Loading

peterbell10 commented Jul 6, 2021

rkern commented Jul 6, 2021

yacth commented Jul 7, 2021

rkern commented Jul 7, 2021

rkern commented Jul 7, 2021

yacth commented Jul 7, 2021

rkern commented Jul 7, 2021

yacth commented Jul 8, 2021

rkern commented Jul 8, 2021

melissawm commented Jul 24, 2023

lucascolley left a comment •

edited

Loading

lucascolley commented Apr 5, 2024

ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays #14357

Are you sure you want to change the base?

ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays #14357

Conversation

yacth commented Jul 6, 2021 • edited by tupui Loading

Reference issue

What does this implement/fix?

rkern Jul 6, 2021

Choose a reason for hiding this comment

rkern commented Jul 6, 2021

yacth commented Jul 6, 2021 • edited Loading

peterbell10 commented Jul 6, 2021

rkern commented Jul 6, 2021

yacth commented Jul 7, 2021

rkern commented Jul 7, 2021

rkern commented Jul 7, 2021

yacth commented Jul 7, 2021

rkern commented Jul 7, 2021

yacth commented Jul 8, 2021

rkern commented Jul 8, 2021

melissawm commented Jul 24, 2023

lucascolley left a comment • edited Loading

Choose a reason for hiding this comment

lucascolley commented Apr 5, 2024

ENH: spatial.distance.jaccard: raise `ValueError` for non-boolean arrays #14357

ENH: spatial.distance.jaccard: raise `ValueError` for non-boolean arrays #14357

yacth commented Jul 6, 2021 •

edited by tupui

Loading

yacth commented Jul 6, 2021 •

edited

Loading

lucascolley left a comment •

edited

Loading