ENH: stats: add `nan_policy' argument for` stats.rankdata` #13215

AtsushiSakai · 2020-12-08T13:49:32Z

Reference issue

What does this implement/fix?

I added nan_policy' argument for stats.rankdata` based on A Design Specification for nan_policy and its test.

mdhaber · 2020-12-11T01:40:40Z

scipy/stats/stats.py

@@ -8122,6 +8122,12 @@ def rankdata(a, method='average', *, axis=None):
    axis : {None, int}, optional
        Axis along which to perform the ranking. If ``None``, the data array
        is first flattened.
+    nan_policy : {'propagate', 'raise', 'omit'}, optional


I'd suggest trying to keep this as consistent as possible with other nan_policy documentation.

In gh-13166, @WarrenWeckesser wrote:

Defines how to handle the occurrence of nans in `compare`. 'propagate' returns nan, 'raise' raises an exception, 'omit' performs the calculations ignoring nan values. Default is 'propagate'. Note that when the value is 'omit', nans in `scores` also propagate to the output, but they do not affect the z-scores computed for the non-nan values.

So maybe start with that, and change it as little as possible to make appropriate to this function.

OK. I will fix it. Do you think 'propagate' should returns nan? This might break backward compatibility. The doc recommends that "this means just execute the function without checking for nan", this is current behavior . Do you think which is better?

I think you're probably right. Maybe you should make sure "propagate" does not change the behavior when you add this nan_policy, and that's why it should be the default. In this case, it will not return NaNs, because it didn't in the past.

On the other hand, maybe you could consider what the code does now to be "omit", and that should be the default. I'll have to take a closer look.

mdhaber · 2020-12-11T01:42:02Z

scipy/stats/stats.py

+        the function without checking for nan, 'raise' throws an error,
+        'omit' performs the calculations ignoring nan values.
+        Default is 'propagate'.  Note that when the value is 'omit',
+        the input after removing nans needs not to be ragged shape.


I don't think we can add that restriction if it wasn't imposed before.

import numpy as np from scipy.stats import rankdata np.random.seed(0) x = np.random.rand(2,3) x[1, 0] = np.nan y = rankdata(x, axis=1)

works, apparently giving NaNs the highest possible rank, and existing code might rely on its behavior.

So I think adding this restriction would not be backwards compatible.

I see your approach is to remove the NaNs before doing the ranking, and that's why you need this restriction. It would be nice if we could do that because it is faster to work on the smaller array. But consider a different approach that doesn't require the restriction: perform rankdata with the NaNs, then put NaNs in the output where there were NaNs in the input.

The doc says "nan_policy='omit': Ignore occurrences of nan in the input". I understand the proper behavior is

x = np.array([[1, np.nan], [3, 4]]) print(rankdata(x, axis=1, nan_policy="omit")) # array([[1], [2, 1]]) <- ragged array

Do you think is it OK the output can include nan when "omit" is used?

Take as look at what gh-13166 does, and what zscore currently does. The docstring says that it includes nans in the output, but that they do not affect the calculation.

> import numpy as np > from scipy.stats import zscore > x = np.random.rand(10) > x[1] = np.nan > zscore(x, nan_policy='omit') array([-0.09782085, nan, 0.95341851, -1.31112352, -1.26847725, -1.44578773, 0.70703474, 0.56271032, 0.80612244, 1.09392333])

So yes, I think it's OK for the array to be rectangular, and there would be NaNs in the rankings where there were NaNs in the data. The way rankdata already works makes this easy, because it already assigns NaNs the highest ranks. Even after you replace these with NaNs, all the valid rankings have numbers 1 - (N-m) (where N is the length along the axis, and m is the number of NaNs along the axis).

AtsushiSakai · 2020-12-12T06:06:40Z

I addressed your comments. PTAL.

mdhaber · 2022-03-12T05:44:26Z

@AtsushiSakai would you be interested in reviving this? I don't think the decorator will work here because rankdata is not a reduction operation.

AtsushiSakai · 2022-03-12T23:23:16Z

OK. I will try it again.

chenrocky · 2022-04-01T19:50:32Z

I would benefit a lot from adding this feature to handle nan - thank you!!

ENH: stats: add nan_policy' argument for stats.rankdata`

55c18be

AtsushiSakai added enhancement A new feature or improvement scipy.stats defect A clear bug or issue that prevents SciPy from being installed or used as expected labels Dec 8, 2020

AtsushiSakai mentioned this pull request Dec 10, 2020

BUG: fix NaN handing of stats.rankdata, stats.spearmanr, and stats.wilcoxon #11791

Closed

mdhaber reviewed Dec 11, 2020

View reviewed changes

mdhaber mentioned this pull request Dec 11, 2020

ENH: stats: add nan_policy' argument for stats.wilcoxon` #13223

Closed

ENH: stats: fix nan_policy handling and its test for stats.rankdata

b811cb8

AtsushiSakai requested a review from mdhaber December 12, 2020 06:07

ENH: stats: fix CI for stats.rankdata

deab18d

mdhaber mentioned this pull request Aug 19, 2021

ENH: stats: add axis and nan_policy parameters to functions with decorator #13312

Merged

mdhaber mentioned this pull request Aug 27, 2021

ENH: stats: consistent nan_policy, axis, masked array, and dtype support #14651

Open

13 tasks

AtsushiSakai closed this Nov 6, 2021

AtsushiSakai mentioned this pull request May 8, 2022

ENH: stats: Add nan_policy optional argument for stats.rankdata #16140

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: stats: add `nan_policy' argument for` stats.rankdata` #13215

ENH: stats: add `nan_policy' argument for` stats.rankdata` #13215

AtsushiSakai commented Dec 8, 2020

mdhaber Dec 11, 2020 •

edited

Loading

AtsushiSakai Dec 11, 2020 •

edited

Loading

mdhaber Dec 11, 2020 •

edited

Loading

mdhaber Dec 11, 2020 •

edited

Loading

AtsushiSakai Dec 11, 2020 •

edited

Loading

mdhaber Dec 11, 2020 •

edited

Loading

AtsushiSakai commented Dec 12, 2020

mdhaber commented Mar 12, 2022

AtsushiSakai commented Mar 12, 2022

chenrocky commented Apr 1, 2022

ENH: stats: add nan_policy' argument for stats.rankdata` #13215

ENH: stats: add nan_policy' argument for stats.rankdata` #13215

Conversation

AtsushiSakai commented Dec 8, 2020

Reference issue

What does this implement/fix?

mdhaber Dec 11, 2020 • edited Loading

Choose a reason for hiding this comment

AtsushiSakai Dec 11, 2020 • edited Loading

Choose a reason for hiding this comment

mdhaber Dec 11, 2020 • edited Loading

Choose a reason for hiding this comment

mdhaber Dec 11, 2020 • edited Loading

Choose a reason for hiding this comment

AtsushiSakai Dec 11, 2020 • edited Loading

Choose a reason for hiding this comment

mdhaber Dec 11, 2020 • edited Loading

Choose a reason for hiding this comment

AtsushiSakai commented Dec 12, 2020

mdhaber commented Mar 12, 2022

AtsushiSakai commented Mar 12, 2022

chenrocky commented Apr 1, 2022

ENH: stats: add `nan_policy' argument for` stats.rankdata` #13215

ENH: stats: add `nan_policy' argument for` stats.rankdata` #13215

mdhaber Dec 11, 2020 •

edited

Loading

AtsushiSakai Dec 11, 2020 •

edited

Loading

mdhaber Dec 11, 2020 •

edited

Loading

mdhaber Dec 11, 2020 •

edited

Loading

AtsushiSakai Dec 11, 2020 •

edited

Loading

mdhaber Dec 11, 2020 •

edited

Loading