Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays #14357

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

yacth
Copy link

@yacth yacth commented Jul 6, 2021

Reference issue

Closes #14304.

What does this implement/fix?

According to the description of the Jaccard distance, it should take only 1-D booleans arrays, plus according to #14304 (comment), we should add a ValueError, for the case we don't have booleans arrays.

@yacth yacth changed the title Jaccard distance Jaccard distance : ValueError for non-booleans arrays Jul 6, 2021
@yacth yacth changed the title Jaccard distance : ValueError for non-booleans arrays ENH : Jaccard distance : ValueError for non-booleans arrays Jul 6, 2021
@@ -861,6 +861,9 @@ def jaccard(u, v, w=None):
0.66666666666666663

"""
if np.isin(u, [0, 1]).all() == False or np.isin(v, [0, 1]).all() == False:
raise ValueError('u and v should be boolean 1-D arrays')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any input validation should probably come after the _validate_vector() calls to minimize data copying.

Note that the code that follows this uses u != 0 and v != 0 to coerce the values to bool Trues and Falses, so it was written to expect numerical values other than strict 0s and 1s. I would probably fix this bug by explicitly casting these arrays to bool arrays (and removing the superfluous != 0s) rather than doing relatively expensive checking for exact 0s and 1s. I think that comports with the intended functionality and documentation best.

@rkern
Copy link
Member

rkern commented Jul 6, 2021

The test failures are real. Note that some of the tests for the weight handling do modify the data in ways that make them not 0s and 1s. While I think that we should continue to have this behavior, if we do decide to restrict the inputs to just 0s and 1s (and Falses and Trues), then you would also have to modify the tests to do something else in the case of the boolean-restricted input domain.

@yacth
Copy link
Author

yacth commented Jul 6, 2021

The test failures are real. Note that some of the tests for the weight handling do modify the data in ways that make them not 0s and 1s. While I think that we should continue to have this behavior, if we do decide to restrict the inputs to just 0s and 1s (and Falses and Trues), then you would also have to modify the tests to do something else in the case of the boolean-restricted input domain.

Of course, I just did a draft to start the discussion. If it is validated that the behavior should be restricted to {0, 1}/{False, True}. I will change some test cases because they will be then invalidated.

EDIT :
Actually why do we want that the Jaccard distance to be only for boolean arrays as specified in the parameter definition in the function :

    Parameters
    ----------
    u : (N,) array_like, bool
        Input array.
    v : (N,) array_like, bool
        Input array.
    w : (N,) array_like, optional
        The weights for each value in `u` and `v`. Default is None,
        which gives each value a weight of 1.0

and then give a working example with non boolean value :

    >>> distance.jaccard([1, 0, 0], [1, 2, 0])
    0.5

@peterbell10
Copy link
Member

IMO restricting to boolean dtype makes far more sense to me than allowing any dtypes as long as their values are 0 or 1.

Actually why do we want that the Jaccard distance to be only for boolean arrays as specified in the parameter definition in the function [...] and then give a working example with non boolean value.

It shows that the != 0 behavior was by design and advertised in the docs. So, it would definitely require a deprecation cycle to change this behavior.

@rkern
Copy link
Member

rkern commented Jul 6, 2021

Just change the != 0s to np.nonzero() and you'll solve #14304 without having to change the tests or documentation.

@yacth
Copy link
Author

yacth commented Jul 7, 2021

Just change the != 0s to np.nonzero() and you'll solve #14304 without having to change the tests or documentation.

!=0 gives us a boolean array that checks if the value of the input arrays are non zero, but np.nonzero() gives us the indices of the values which are non zero. Thus it will change the whole architecture of the code.
As it is right now it works perfecly fine.
Do you think it will work better with np.nonzero()

IMO restricting to boolean dtype makes far more sense to me than allowing any dtypes as long as their values are 0 or 1.

Actually why do we want that the Jaccard distance to be only for boolean arrays as specified in the parameter definition in the function [...] and then give a working example with non boolean value.

It shows that the != 0 behavior was by design and advertised in the docs. So, it would definitely require a deprecation cycle to change this behavior.

Sorry I might have not understood, I believe that the input arrays u and v for which we want to calculate the Jaccard distance should be boolean 1-D arrays ? (For the implemented code, even if the mathematical definition allows us to use anything if I understood it correctly)
Thus I wanted to check using the following code because it recognizes values in {0, 1} and {False, True} for booleans:

>>> u = [1, 0, 0]
>>> np.isin(u, [0, 1]).all()
True
>>> v = [True, False, False]
>>> np.isin(v, [0, 1]).all()
True
>>>

Using the following code doesn't recognize values in {0, 1} as booleans even when they should be considered.

>>> u = [1, 0, 0]
>>> [type(u_i) == bool for u_i in u]
[False, False, False]

@rkern
Copy link
Member

rkern commented Jul 7, 2021

Oh, sorry, I thought nonzero was doing something else. Cast with .astype(bool) instead.

@rkern
Copy link
Member

rkern commented Jul 7, 2021

Allow values that aren't exactly 0 or 1. We document and test for such cases.

@yacth
Copy link
Author

yacth commented Jul 7, 2021

Oh, sorry, I thought nonzero was doing something else. Cast with .astype(bool) instead.

For which purpose do you want to cast it ?
Because it will transform any of the input array to boolean arrays i.e. 0 -> False Any number -> True, plus it doesn't recognizes lists and we have lists as input:

>>> import numpy as np
>>> u = [1, 0, 1]
>>> u.astype(bool)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'astype'
>>> np.array(u).astype(bool)
array([ True, False,  True])
>>> v = [0, 1, 2]
>>> np.array(v).astype(bool)
array([False,  True,  True])

If I am not mistaken the first idea I had checks in a good manner if any list is a list of booleans or not (except if we have a list of 1 and 0 which are integer to fool the function).

@rkern
Copy link
Member

rkern commented Jul 7, 2021

Because it will transform any of the input array to boolean arrays i.e. 0 -> False Any number -> True

Yes, this is the desired, documented, and tested behavior.

plus it doesn't recognizes lists and we have lists as input:

You would the casting after the _validate_vector() calls, which coerce the inputs to arrays.

@yacth
Copy link
Author

yacth commented Jul 8, 2021

Because it will transform any of the input array to boolean arrays i.e. 0 -> False Any number -> True

Yes, this is the desired, documented, and tested behavior.

plus it doesn't recognizes lists and we have lists as input:

You would the casting after the _validate_vector() calls, which coerce the inputs to arrays.

Oh I see, thank you for your guidance.

But I am wondering if we have to return a specific ValueError ?

@rkern
Copy link
Member

rkern commented Jul 8, 2021

No, no particular ValueError. The errors from casting should be sufficient.

@yacth yacth closed this Jul 11, 2021
@yacth yacth deleted the Jaccard-distance branch July 11, 2021 18:39
@yacth yacth restored the Jaccard-distance branch July 11, 2021 18:44
@yacth yacth reopened this Jul 11, 2021
@melissawm
Copy link
Member

Hi all - is this ready to go?

@lucascolley lucascolley added the enhancement A new feature or improvement label Dec 23, 2023
@lucascolley lucascolley added this to the 1.14.0 milestone Apr 5, 2024
Copy link
Member

@lucascolley lucascolley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placing a blocking review here as this still exhibits incorrect behaviour.

@yacth, if you'd like to return to this, please could you add a unit test based on @rkern's example in #12174 (comment) ? In particular,

jaccard([1, 0, 0], [1, 2, 0]) should return the same value as

jaccard([True, False, False], [True, True, False]), while

jaccard([1, 1, 0], [1, 2, 0]) should return the same value as

jaccard([True, True, False], [True, True, False]).


The fix in the function should be rather simple - move the casting above, to

u = _validate_vector(u).astype(bool, copy=False)
v = _validate_vector(v).astype(bool, copy=False)

and make the changed line nonzero = np.bitwise_or(u, v), if I'm not mistaken. This means that the u != v will give False in the case of 1 and 2, since they are both truthy.


Robert's comment

I think it would be a good idea to define (and test) the behavior of all of these functions that they will convert the input arrays to booleans by the truthiness of the inputs.

still stands, but that can be left to a different issue and future PRs.

P.S. apologies for the long review time!

@lucascolley
Copy link
Member

The alternative is to simply require boolean input, as mentioned by Peter above. I think that would be an unnecessary deprecation though - while the poorly defined behaviour on arrays of numbers is quite nasty here, it wouldn't surprise me if there is lots of harmless use of arrays of 0s and 1s out in the wild.

@lucascolley lucascolley removed this from the 1.14.0 milestone Apr 6, 2024
@lucascolley lucascolley changed the title ENH : Jaccard distance : ValueError for non-booleans arrays ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays May 18, 2024
@lucascolley lucascolley changed the title ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays May 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.spatial
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Jaccard distance greater than 1 if elements are strings
6 participants