Update docs for the weighted τ #13224

vigna · 2020-12-10T13:15:01Z

This short pull request updates the documentation of stats.weightedtau() to clarify the notion of "rank" used.

Update

mdhaber · 2020-12-10T17:43:40Z

scipy/stats/stats.py

-    Note that if you are computing the weighted :math:`\tau` on arrays of
-    ranks, rather than of scores (i.e., a larger value implies a lower
-    rank) you must negate the ranks, so that elements of higher rank are
-    associated with a larger value.


Just to check my understanding - this comment was intended for users who were not passing in a rank argument directly, right? If the users were passing in their own rank, weightedtau would produce the same result whether the x and y were "scores" or "ranks", right?
Does this test that correctly?

    import numpy as np
    from scipy.stats import weightedtau, kendalltau, rankdata

    np.random.seed(0)

    # "scores"
    x = np.random.rand(10)
    y = np.random.rand(10)

    # "ranks", SciPy convention
    x2 = rankdata(x)
    y2 = rankdata(y)

    # "ranks", opposite convention
    x3 = 11 - x2
    y3 = 11 - y2

    rank = np.arange(10)
    print(weightedtau(x, y, rank))
    print(weightedtau(x2, y2, rank))
    print(weightedtau(x3, y3, rank))

The outputs are identical.

So the sentiment of this comment is really "especially if you're not passing in your own ranks, pay attention to how this function calculates weights to make sure it makes sense for your data", right?

Another way to put it is - neither kendalltau nor weightedtau inherently care if the data are specified as "scores" or "ranks" (or which convention for ranks is used); but it may affect the results of weightedtau because of the way it assigns weights by default.

Yes. If you pass a rank array as an external source of rank, you must follow the conventions of weightedtau(). Alternatively, you can compute the weighted τ between two rank arrays. In that second case, you can pass the rank arrays as they are if they follow the "descending" convention, but you must negate them if they follow the "ascending" convention. It is a subtle point that the need for negation is only due to the fact that I was assuming ascending ranks. But under the SciPy convention you can pass arrays of ranks and they will work as scores. That's why I removed that part. But I added some clarification on the fact that an external rank source follows a different convention. BTW, I don't expect anybody to ever supply such a source in real-world applications.

If you change the sign of two scores vectors, τ will not change, because all out-of-order pairs have the same cost. This is not true for a weighted τ if you do not provide an external source of rank, because exchanges between more important elements cost more, and lacking a source of rank, importance will be induced by sorting the scores. If you pass an external rank array, you can change the sign of the score vectors and the weighted τ will not change, because only the relative order of the vectors will be relevant—the importance of the elements will be given by the rank array.
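A minimal sketch of the three cases described above, reusing the setup from the earlier snippet. The variable names and the choice of `np.arange(10)` as the external rank array are illustrative assumptions, not part of the PR. Without an external rank, negating both vectors will typically change the weighted τ for generic data, so the sketch prints those two values rather than asserting anything about them.

```python
import numpy as np
from scipy.stats import kendalltau, weightedtau

rng = np.random.default_rng(0)
x = rng.random(10)  # "scores"
y = rng.random(10)

# Plain Kendall tau: negating both vectors leaves the set of
# discordant pairs unchanged, so the statistic is the same.
tau = kendalltau(x, y)[0]
tau_neg = kendalltau(-x, -y)[0]

# Weighted tau with the default rank: importance is induced by
# sorting the scores, so negation changes which elements count most.
wtau = weightedtau(x, y)[0]
wtau_neg = weightedtau(-x, -y)[0]

# Weighted tau with an external rank array (illustrative choice):
# importance comes from `rank`, so negation has no effect.
rank = np.arange(10)
wtau_rank = weightedtau(x, y, rank=rank)[0]
wtau_rank_neg = weightedtau(-x, -y, rank=rank)[0]

print("kendalltau:           ", tau, tau_neg)
print("weightedtau (default):", wtau, wtau_neg)
print("weightedtau (rank):   ", wtau_rank, wtau_rank_neg)
```

The first and third pairs should agree up to floating-point rounding; the second pair is the case the documentation change is concerned with.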

mdhaber

I think that this is an improvement and that it closes gh-12778.

@stde It would be great to have your feedback, since you reported the issue.

stde · 2020-12-11T15:17:23Z

@mdhaber:
Thanks for all the effort regarding the issue that was brought up. I think removing the note about using ranks instead of scores (4638-4641) will clear up the confusion about usage; however, this may create a new issue:

The documentation of weightedtau will then read like it will always expect scores. This seems inconsistent with the documentation/usage of kendalltau which will expect ranks.
The correct solution in my opinion would be to make everything consistent: either use scores everywhere (weightedtau and kendalltau) or ranks.

vigna · 2020-12-11T15:49:23Z

Good point. Kendall's original paper considers rankings, but since the correlation is between ranks, there is no difference between passing scores or the associated ranks to the method. Indeed, the SciPy code does not assume anywhere that the arguments are rankings. I'm sure no user of stats.kendalltau() ever refrained from using vectors of reals; that's how τ is defined in every book today. But changing the wording might make this clear.
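The equivalence between scores and their ranks can be checked with a short sketch. It assumes tie-free data and uses rankdata() with its default settings, where a larger score gets a larger rank; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import kendalltau, weightedtau, rankdata

rng = np.random.default_rng(1)
x = rng.random(10)  # tie-free "scores"
y = rng.random(10)

# rankdata() is monotone increasing in the scores, so the relative
# order of the elements is the same in both representations.
rx = rankdata(x)
ry = rankdata(y)

tau_scores = kendalltau(x, y)[0]
tau_ranks = kendalltau(rx, ry)[0]

# The same holds for weightedtau() with its default rank, because
# the induced importance depends only on the ordering.
wtau_scores = weightedtau(x, y)[0]
wtau_ranks = weightedtau(rx, ry)[0]

print(tau_scores, tau_ranks)
print(wtau_scores, wtau_ranks)
```

Ranks following the opposite convention (rank 1 for the largest score) would need to be negated before being passed to weightedtau() with the default rank, as discussed earlier in the thread.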

mdhaber · 2021-01-20T06:16:30Z

But changing the wording might make this clear.

@vigna Did you want to change that, then?

vigna · 2021-01-20T06:23:51Z

Yes, I think it would be clearer and more up to date with typical usage, but it's not my code.

mdhaber · 2021-01-20T06:54:08Z

I see, so you think this should be merged as-is and kendalltau's documentation should be updated separately?

vigna · 2021-01-21T08:54:19Z

That's a judgment call, and not mine—as I said, I'm not the author of kendalltau(). scipy's documentation refers to "ranking" without defining what a ranking is, and reports a formula quoting "concordant" and "discordant" pairs without defining what they are. So I'm pretty sure users of kendalltau() know what they're doing from other sources (the formula is also written in a uselessly complicated way—look, e.g., at a simpler, alternative form).

It also depends on how close you want to stay to the quoted paper by Kendall, which indeed defines the measure for rankings. Wikipedia's page, for example, follows the modern usage: it's a measure of correlation on arbitrary data that depends only on the order of the values, rather than on the values themselves.

If you ask me: yes, the documentation of kendalltau() does not follow the current definition and applications of Kendall's τ. I see this as a very minor problem because, hopefully, if you're doing statistics you're not learning measures of ordinal association from scipy's docs :).

mdhaber · 2021-01-22T08:20:32Z

Ok. I agree that kendalltau documentation could be improved. It's odd that it says "ranks" when the example clearly shows scores. But this PR is an improvement to the weightedtau documentation, so merging. Thanks @vigna, @stde!

vigna added 2 commits December 10, 2020 14:07

Merge pull request #1 from scipy/master

fb4b3cf

Update

Updated docs

d50a6c2

vigna mentioned this pull request Dec 10, 2020

Confusing documentation of scipy.stats.weightedtau #12778

Closed

Updated docs

a0a2ce1

mdhaber reviewed Dec 10, 2020


mdhaber approved these changes Dec 10, 2020


rgommers added the Documentation and scipy.stats labels Dec 12, 2020

mdhaber merged commit bab1925 into scipy:master Jan 22, 2021

tylerjereddy added this to the 1.7.0 milestone Jan 23, 2021
