Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats: add nan_policy' argument for stats.rankdata` #13215

Closed
wants to merge 3 commits into from

Conversation

AtsushiSakai
Copy link
Member

Reference issue

fix #11790

What does this implement/fix?

I added nan_policy' argument for stats.rankdata` based on A Design Specification for nan_policy and its test.

@AtsushiSakai AtsushiSakai added enhancement A new feature or improvement scipy.stats defect A clear bug or issue that prevents SciPy from being installed or used as expected labels Dec 8, 2020
@@ -8122,6 +8122,12 @@ def rankdata(a, method='average', *, axis=None):
axis : {None, int}, optional
Axis along which to perform the ranking. If ``None``, the data array
is first flattened.
nan_policy : {'propagate', 'raise', 'omit'}, optional
Copy link
Contributor

@mdhaber mdhaber Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest trying to keep this as consistent as possible with other nan_policy documentation.

In gh-13166, @WarrenWeckesser wrote:

        Defines how to handle the occurrence of nans in `compare`.
        'propagate' returns nan, 'raise' raises an exception, 'omit'
        performs the calculations ignoring nan values. Default is
        'propagate'. Note that when the value is 'omit', nans in `scores`
        also propagate to the output, but they do not affect the z-scores
        computed for the non-nan values.

So maybe start with that, and change it as little as possible to make appropriate to this function.

Copy link
Member Author

@AtsushiSakai AtsushiSakai Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I will fix it. Do you think 'propagate' should returns nan? This might break backward compatibility. The doc recommends that "this means just execute the function without checking for nan", this is current behavior . Do you think which is better?

Copy link
Contributor

@mdhaber mdhaber Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're probably right. Maybe you should make sure "propagate" does not change the behavior when you add this nan_policy, and that's why it should be the default. In this case, it will not return NaNs, because it didn't in the past.

On the other hand, maybe you could consider what the code does now to be "omit", and that should be the default. I'll have to take a closer look.

the function without checking for nan, 'raise' throws an error,
'omit' performs the calculations ignoring nan values.
Default is 'propagate'. Note that when the value is 'omit',
the input after removing nans needs not to be ragged shape.
Copy link
Contributor

@mdhaber mdhaber Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can add that restriction if it wasn't imposed before.

import numpy as np
from scipy.stats import rankdata
np.random.seed(0)
x = np.random.rand(2,3)
x[1, 0] = np.nan
y = rankdata(x, axis=1)

works, apparently giving NaNs the highest possible rank, and existing code might rely on its behavior.

So I think adding this restriction would not be backwards compatible.

I see your approach is to remove the NaNs before doing the ranking, and that's why you need this restriction. It would be nice if we could do that because it is faster to work on the smaller array. But consider a different approach that doesn't require the restriction: perform rankdata with the NaNs, then put NaNs in the output where there were NaNs in the input.

Copy link
Member Author

@AtsushiSakai AtsushiSakai Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc says "nan_policy='omit': Ignore occurrences of nan in the input". I understand the proper behavior is

x = np.array([[1, np.nan], [3, 4]])
print(rankdata(x, axis=1, nan_policy="omit"))
# array([[1], [2, 1]]) <- ragged array

Do you think is it OK the output can include nan when "omit" is used?

Copy link
Contributor

@mdhaber mdhaber Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take as look at what gh-13166 does, and what zscore currently does. The docstring says that it includes nans in the output, but that they do not affect the calculation.

> import numpy as np
> from scipy.stats import zscore
> x = np.random.rand(10)
> x[1] = np.nan
> zscore(x, nan_policy='omit')
array([-0.09782085,         nan,  0.95341851, -1.31112352, -1.26847725,
       -1.44578773,  0.70703474,  0.56271032,  0.80612244,  1.09392333])

So yes, I think it's OK for the array to be rectangular, and there would be NaNs in the rankings where there were NaNs in the data. The way rankdata already works makes this easy, because it already assigns NaNs the highest ranks. Even after you replace these with NaNs, all the valid rankings have numbers 1 - (N-m) (where N is the length along the axis, and m is the number of NaNs along the axis).

@AtsushiSakai
Copy link
Member Author

I addressed your comments. PTAL.

@mdhaber
Copy link
Contributor

mdhaber commented Mar 12, 2022

@AtsushiSakai would you be interested in reviving this? I don't think the decorator will work here because rankdata is not a reduction operation.

@AtsushiSakai
Copy link
Member Author

OK. I will try it again.

@chenrocky
Copy link

I would benefit a lot from adding this feature to handle nan - thank you!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NaN handling of stats.rankdata
3 participants