-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: stats: add nan_policy' argument for
stats.rankdata`
#13215
Conversation
@@ -8122,6 +8122,12 @@ def rankdata(a, method='average', *, axis=None): | |||
axis : {None, int}, optional | |||
Axis along which to perform the ranking. If ``None``, the data array | |||
is first flattened. | |||
nan_policy : {'propagate', 'raise', 'omit'}, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest trying to keep this as consistent as possible with other nan_policy
documentation.
In gh-13166, @WarrenWeckesser wrote:
Defines how to handle the occurrence of nans in `compare`.
'propagate' returns nan, 'raise' raises an exception, 'omit'
performs the calculations ignoring nan values. Default is
'propagate'. Note that when the value is 'omit', nans in `scores`
also propagate to the output, but they do not affect the z-scores
computed for the non-nan values.
So maybe start with that, and change it as little as possible to make appropriate to this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will fix it. Do you think 'propagate' should returns nan? This might break backward compatibility. The doc recommends that "this means just execute the function without checking for nan", this is current behavior . Do you think which is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're probably right. Maybe you should make sure "propagate" does not change the behavior when you add this nan_policy, and that's why it should be the default. In this case, it will not return NaNs, because it didn't in the past.
On the other hand, maybe you could consider what the code does now to be "omit", and that should be the default. I'll have to take a closer look.
scipy/stats/stats.py
Outdated
the function without checking for nan, 'raise' throws an error, | ||
'omit' performs the calculations ignoring nan values. | ||
Default is 'propagate'. Note that when the value is 'omit', | ||
the input after removing nans needs not to be ragged shape. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can add that restriction if it wasn't imposed before.
import numpy as np
from scipy.stats import rankdata
np.random.seed(0)
x = np.random.rand(2,3)
x[1, 0] = np.nan
y = rankdata(x, axis=1)
works, apparently giving NaNs the highest possible rank, and existing code might rely on its behavior.
So I think adding this restriction would not be backwards compatible.
I see your approach is to remove the NaNs before doing the ranking, and that's why you need this restriction. It would be nice if we could do that because it is faster to work on the smaller array. But consider a different approach that doesn't require the restriction: perform rankdata
with the NaNs, then put NaNs in the output where there were NaNs in the input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc says "nan_policy='omit': Ignore occurrences of nan in the input". I understand the proper behavior is
x = np.array([[1, np.nan], [3, 4]])
print(rankdata(x, axis=1, nan_policy="omit"))
# array([[1], [2, 1]]) <- ragged array
Do you think is it OK the output can include nan when "omit" is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take as look at what gh-13166 does, and what zscore
currently does. The docstring says that it includes nans in the output, but that they do not affect the calculation.
> import numpy as np
> from scipy.stats import zscore
> x = np.random.rand(10)
> x[1] = np.nan
> zscore(x, nan_policy='omit')
array([-0.09782085, nan, 0.95341851, -1.31112352, -1.26847725,
-1.44578773, 0.70703474, 0.56271032, 0.80612244, 1.09392333])
So yes, I think it's OK for the array to be rectangular, and there would be NaNs in the rankings where there were NaNs in the data. The way rankdata
already works makes this easy, because it already assigns NaNs the highest ranks. Even after you replace these with NaNs, all the valid rankings have numbers 1 - (N-m) (where N is the length along the axis
, and m
is the number of NaNs along the axis).
I addressed your comments. PTAL. |
@AtsushiSakai would you be interested in reviving this? I don't think the decorator will work here because |
OK. I will try it again. |
I would benefit a lot from adding this feature to handle nan - thank you!! |
Reference issue
fix #11790
What does this implement/fix?
I added
nan_policy' argument for
stats.rankdata` based on A Design Specification for nan_policy and its test.