ENH: spatial.distance.jaccard: raise ValueError for non-boolean arrays #14357
scipy/spatial/distance.py
@@ -861,6 +861,9 @@ def jaccard(u, v, w=None):
    0.66666666666666663

    """
    if np.isin(u, [0, 1]).all() == False or np.isin(v, [0, 1]).all() == False:
        raise ValueError('u and v should be boolean 1-D arrays')
Any input validation should probably come after the `_validate_vector()` calls to minimize data copying.
Note that the code that follows this uses `u != 0` and `v != 0` to coerce the values to bool `True`s and `False`s, so it was written to expect numerical values other than strict 0s and 1s. I would probably fix this bug by explicitly casting these arrays to bool arrays (and removing the superfluous `!= 0`s) rather than doing relatively expensive checking for exact 0s and 1s. I think that comports with the intended functionality and documentation best.
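To make the difference between the two approaches concrete, here is a minimal sketch. The helper names `is_strictly_binary` and `to_bool` are hypothetical and exist only for this illustration; they are not part of SciPy.

```python
import numpy as np

def is_strictly_binary(a):
    """Value check from the draft: True only if every entry is exactly 0 or 1."""
    return bool(np.isin(a, [0, 1]).all())

def to_bool(a):
    """Truthiness cast: every nonzero entry becomes True."""
    return np.asarray(a).astype(bool)

v = [1, 2, 0]
strict = is_strictly_binary(v)  # the 2 fails the exact 0/1 check
cast = to_bool(v)               # the 2 is truthy, so it maps to True
# The cast agrees with the existing `!= 0` coercion.
same_as_nonzero = np.array_equal(cast, np.asarray(v) != 0)
```

The cast accepts the same inputs as the existing `!= 0` coercion, whereas the value check would start rejecting them.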
The test failures are real. Note that some of the tests for the weight handling do modify the data in ways that make them not 0s and 1s. While I think that we should continue to have this behavior, if we do decide to restrict the inputs to just 0s and 1s (and |
Of course, I just did a draft to start the discussion. If it is validated that the behavior should be restricted to {0, 1}/{False, True}, I will change some test cases because they will then be invalidated.
EDIT:
Parameters
----------
u : (N,) array_like, bool
    Input array.
v : (N,) array_like, bool
    Input array.
w : (N,) array_like, optional
    The weights for each value in `u` and `v`. Default is None,
    which gives each value a weight of 1.0
and then give a working example with a non-boolean value:
>>> distance.jaccard([1, 0, 0], [1, 2, 0])
0.5
IMO restricting to boolean dtype makes far more sense to me than allowing any dtypes as long as their values are 0 or 1.
It shows that the |
Just change the |
Sorry, I might not have understood. I believe that the input arrays
>>> u = [1, 0, 0]
>>> np.isin(u, [0, 1]).all()
True
>>> v = [True, False, False]
>>> np.isin(v, [0, 1]).all()
True
Using the following code doesn't recognize values in {0, 1} as booleans even when they should be considered as such.
>>> u = [1, 0, 0]
>>> [type(u_i) == bool for u_i in u]
[False, False, False]
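For comparison, a dtype-based check (the option of restricting to boolean dtype raised earlier in the thread) can be sketched as follows. The helper name `has_bool_dtype` is hypothetical, and this is only an illustration of the idea, not code from the PR.

```python
import numpy as np

def has_bool_dtype(a):
    """Return True when the input converts to a NumPy array of boolean dtype."""
    return np.asarray(a).dtype == np.bool_

ints = has_bool_dtype([1, 0, 0])                    # integer dtype
bools = has_bool_dtype([True, False, False])        # boolean dtype
values_ok = bool(np.isin([1, 0, 0], [0, 1]).all())  # value check on the same integers
```

A single dtype comparison avoids both the per-element Python loop and the element-wise `np.isin` scan, but it rejects integer arrays whose values are all 0 or 1.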
Oh, sorry, I thought |
Allow values that aren't exactly |
For which purpose do you want to cast it?
>>> import numpy as np
>>> u = [1, 0, 1]
>>> u.astype(bool)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'astype'
>>> np.array(u).astype(bool)
array([ True, False,  True])
>>> v = [0, 1, 2]
>>> np.array(v).astype(bool)
array([False,  True,  True])
If I am not mistaken the first idea I had checks in a good manner if any list is a list of booleans or not (except if we have a list of |
Yes, this is the desired, documented, and tested behavior.
You would do the casting after the |
Oh I see, thank you for your guidance. But I am wondering if we have to return a specific |
No, no particular |
Hi all - is this ready to go? |
Placing a blocking review here as this still exhibits incorrect behaviour.
@yacth, if you'd like to return to this, please could you add a unit test based on @rkern's example in #12174 (comment)? In particular, `jaccard([1, 0, 0], [1, 2, 0])` should return the same value as `jaccard([True, False, False], [True, True, False])`, while `jaccard([1, 1, 0], [1, 2, 0])` should return the same value as `jaccard([True, True, False], [True, True, False])`.
The fix in the function should be rather simple - move the casting above, to
    u = _validate_vector(u).astype(bool, copy=False)
    v = _validate_vector(v).astype(bool, copy=False)
and make the changed line `nonzero = np.bitwise_or(u, v)`, if I'm not mistaken. This means that the `u != v` will give `False` in the case of `1` and `2`, since they are both truthy.
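The suggested fix and the requested unit test can be sketched in a standalone form. This is only an illustration under stated assumptions: `jaccard_sketch` is a hypothetical name, `np.asarray` stands in for SciPy's private `_validate_vector()`, and the weight handling is a simplified guess at the surrounding code rather than the library's implementation.

```python
import numpy as np

def jaccard_sketch(u, v, w=None):
    """Jaccard dissimilarity after casting both inputs to bool by truthiness."""
    # Assumption: np.asarray stands in for SciPy's private _validate_vector().
    u = np.asarray(u).astype(bool, copy=False)
    v = np.asarray(v).astype(bool, copy=False)
    # Positions where at least one input is truthy.
    nonzero = np.bitwise_or(u, v)
    # Positions where the inputs disagree, restricted to those positions.
    unequal_nonzero = np.bitwise_and(u != v, nonzero)
    if w is not None:
        # Assumption: simplified stand-in for the library's weight validation.
        w = np.asarray(w, dtype=float)
        nonzero = w * nonzero
        unequal_nonzero = w * unequal_nonzero
    a = float(unequal_nonzero.sum())
    b = float(nonzero.sum())
    return a / b if b != 0 else 0.0

# The equalities requested in the review.
first = jaccard_sketch([1, 0, 0], [1, 2, 0])
first_bool = jaccard_sketch([True, False, False], [True, True, False])
second = jaccard_sketch([1, 1, 0], [1, 2, 0])
second_bool = jaccard_sketch([True, True, False], [True, True, False])
```

Because both inputs are cast before the comparison, `1` and `2` are treated as equal, so the second pair has no disagreeing positions.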
Robert's comment, "I think it would be a good idea to define (and test) the behavior of all of these functions that they will convert the input arrays to booleans by the truthiness of the inputs," still stands, but that can be left to a different issue and future PRs.
P.S. apologies for the long review time!
The alternative is to simply require boolean input, as mentioned by Peter above. I think that would be an unnecessary deprecation though - while the poorly defined behaviour on arrays of numbers is quite nasty here, it wouldn't surprise me if there is lots of harmless use of arrays of |
Reference issue
Closes #14304.
What does this implement/fix?
According to the documentation of the Jaccard distance, it should take only 1-D boolean arrays. In addition, per #14304 (comment), we should raise a `ValueError` when the inputs are not boolean arrays.