Update docs for the weighted τ #13224
Note that if you are computing the weighted :math:`\tau` on arrays of
ranks, rather than of scores (i.e., a larger value implies a lower
rank) you must negate the ranks, so that elements of higher rank are
associated with a larger value.
Just to check my understanding - this comment was intended for users who were not passing in a rank argument directly, right? If the users were passing in their own rank, weightedtau would produce the same result whether the x and y were "scores" or "ranks", right?
Does this test that correctly?
import numpy as np
from scipy.stats import weightedtau, kendalltau, rankdata
np.random.seed(0)
# "scores"
x = np.random.rand(10)
y = np.random.rand(10)
# "ranks", SciPy convention
x2 = rankdata(x)
y2 = rankdata(y)
# "ranks", opposite convention
x3 = 11 - x2
y3 = 11 - y2
rank = np.arange(10)
print(weightedtau(x, y, rank))
print(weightedtau(x2, y2, rank))
print(weightedtau(x3, y3, rank))
The outputs are identical.
So the sentiment of this comment is really "especially if you're not passing in your own ranks, pay attention to how this function calculates weights to make sure it makes sense for your data", right?
Another way to put it is - neither kendalltau nor weightedtau inherently cares if the data are specified as "scores" or "ranks" (or which convention for ranks is used); but it may affect the results of weightedtau because of the way it assigns weights by default.
Yes. If you pass a rank array as an external source of rank, you must follow the conventions of weightedtau(). Or, you can compute the weighted τ between two rank arrays. In the second case, you can pass the rank arrays as they are if they follow the "descending" convention, but you must negate them if they follow the "ascending" convention. It is a subtle point that the need for negation is only due to the fact that I'm assuming ascending ranks. But under the SciPy convention you can pass arrays of ranks and they will work as scores. That's why I removed that part. But I added some clarification on the fact that an external rank source follows a different convention. BTW, I don't expect anybody to ever supply such a source in real-world applications.
If you change the sign of two score vectors, τ will not change, because all out-of-order pairs have the same cost. This is not true for a weighted τ if you do not provide an external source of rank, because exchanges between more important elements cost more, and lacking a source of rank, importance will be induced by sorting the scores. If you pass an external rank array, you can change the sign of the score vectors and the weighted τ will not change, because only the relative order of the vectors will be relevant; the importance of the elements will be given by the rank array.
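To make the sign-change argument concrete, here is a minimal sketch that could be run to check it. The random data and the `external_rank` array are assumptions chosen only for illustration; the sketch relies on the documented default hyperbolic weigher, where rank 0 is the most important element.

```python
import numpy as np
from scipy.stats import kendalltau, weightedtau

rng = np.random.default_rng(0)
n = 10
x = rng.random(n)  # "scores" (no ties expected for continuous random data)
y = rng.random(n)

# Hypothetical external source of rank, for illustration only:
# element 0 is the most important, element n - 1 the least.
external_rank = np.arange(n)

# Plain tau: negating both vectors keeps every pair concordant or
# discordant exactly as before, and all pairs cost the same.
tau_plain = kendalltau(x, y)[0]
tau_plain_neg = kendalltau(-x, -y)[0]

# Weighted tau with the default rank=True: importance is induced by
# sorting the scores, so negation changes which elements are important.
wtau_default = weightedtau(x, y)[0]
wtau_default_neg = weightedtau(-x, -y)[0]

# Weighted tau with an external rank array: importance comes from the
# array, so only the relative order of x and y matters.
wtau_ext = weightedtau(x, y, rank=external_rank)[0]
wtau_ext_neg = weightedtau(-x, -y, rank=external_rank)[0]

print("kendalltau:          ", tau_plain, tau_plain_neg)
print("weightedtau default: ", wtau_default, wtau_default_neg)
print("weightedtau external:", wtau_ext, wtau_ext_neg)
```

The first and third lines should each print a matching pair, while the two values on the second line will generally differ.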
@mdhaber: The documentation of weightedtau will then read like it will always expect scores. This seems inconsistent with the documentation/usage of kendalltau which will expect ranks.
Good point. Kendall's original paper considers rankings, but since the correlation is between ranks, there is no difference between passing to the method scores or the associated ranks. Indeed, the SciPy code does not assume anywhere that the arguments are rankings. I'm sure no user of stats.kendalltau() ever refrained from using vectors of reals; that's how τ is defined in every book today. But changing the wording might make this clear.
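As a quick check of the claim that scores and their associated ranks are interchangeable for the unweighted τ, a sketch along these lines could be used. The random data are an assumption for illustration, and the comparison presumes no ties.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata

rng = np.random.default_rng(0)
x = rng.random(10)  # arbitrary real-valued scores
y = rng.random(10)

# tau computed directly on the scores
tau_scores = kendalltau(x, y)[0]

# tau computed on the associated ranks (SciPy convention: the
# smallest value gets rank 1); the pairwise order is unchanged
tau_ranks = kendalltau(rankdata(x), rankdata(y))[0]

print(tau_scores, tau_ranks)
```

Both values should agree, because rankdata() is a monotone transformation of each vector and τ depends only on the order of the values.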
@vigna Did you want to change that, then?
Yes, I think it would be clearer and more up to date with typical usage, but it's not my code.
I see, so you think this should be merged as-is and
That's a judgment call, and not mine; as I said, I'm not the author of kendalltau(). scipy's documentation refers to "ranking" without defining what a ranking is, and reports a formula quoting "concordant" and "discordant" pairs without defining what they are. So I'm pretty sure users of kendalltau() know what they're doing from other sources (the formula is also written in a uselessly complicated way; look, e.g., at a simpler, alternative form). It also depends on how close you want to be to the quoted paper by Kendall, which indeed defines the measure for rankings. Wikipedia's page, for example, follows the modern usage: it's a measure of correlation on arbitrary data that depends only on the order of the values, rather than on the values themselves. If you ask me: yes, the documentation of kendalltau() does not follow the current definition and applications of Kendall's τ. I see this as a very minor problem because, hopefully, if you're doing statistics you're not learning measures of ordinal association from scipy's docs :).
This short pull request updates the documentation of stats.weightedtau() to clarify the notion of "rank" used.