Update docs for the weighted τ #13224

Merged
merged 3 commits into from Jan 22, 2021

Conversation

vigna
Contributor

vigna commented Dec 10, 2020

This short pull request updates the documentation of stats.weightedtau() to clarify the notion of "rank" used.

Comment on lines -4638 to -4641
Note that if you are computing the weighted :math:`\tau` on arrays of
ranks, rather than of scores (i.e., a larger value implies a lower
rank) you must negate the ranks, so that elements of higher rank are
associated with a larger value.
Contributor

Just to check my understanding - this comment was intended for users who were not passing in a rank argument directly, right? If the users were passing in their own rank, weightedtau would produce the same result whether the x and y were "scores" or "ranks", right?
Does this test that correctly?

import numpy as np
from scipy.stats import weightedtau, kendalltau, rankdata

np.random.seed(0)

# "scores"
x = np.random.rand(10)
y = np.random.rand(10)

# "ranks", SciPy convention
x2 = rankdata(x)
y2 = rankdata(y)

# "ranks", opposite convention
x3 = 11 - x2
y3 = 11 - y2

rank = np.arange(10)

print(weightedtau(x, y, rank))
print(weightedtau(x2, y2, rank))
print(weightedtau(x3, y3, rank))

The outputs are identical.

So the sentiment of this comment is really "especially if you're not passing in your own ranks, pay attention to how this function calculates weights to make sure it makes sense for your data", right?

Contributor

mdhaber commented Dec 10, 2020

Another way to put it is - neither kendalltau nor weightedtau inherently care if the data are specified as "scores" or "ranks" (or which convention for ranks is used); but it may affect the results of weightedtau because of the way it assigns weights by default.
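
To make that concrete, here is a minimal sketch with a three-element toy data set (the values are made up for illustration). It compares scores, ranks as returned by rankdata (larger score, larger value), and ranks under the opposite convention (1 = best), all without passing a rank argument.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata, weightedtau

# Three illustrative scores without ties (values chosen arbitrarily).
x = np.array([0.1, 0.5, 0.9])
y = np.array([0.2, 0.8, 0.4])

# Ranks in which a larger score gets a larger value (what rankdata returns).
x_up, y_up = rankdata(x), rankdata(y)
# Ranks in which a larger score gets a smaller value (1 = best).
x_down, y_down = len(x) + 1 - x_up, len(y) + 1 - y_up

inputs = ((x, y), (x_up, y_up), (x_down, y_down))

# Index [0] is the correlation; the second field is the p-value.
tau = [kendalltau(a, b)[0] for a, b in inputs]
wtau = [weightedtau(a, b)[0] for a, b in inputs]  # default rank=True

print("kendalltau: ", tau)   # all three agree
print("weightedtau:", wtau)  # the third differs from the first two
```

With the default rank=True, weightedtau derives the importance of each element from the values themselves, so reversing the rank convention changes which elements carry the most weight.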

Contributor Author

Yes. If you pass a rank array as an external source of rank, you must follow the conventions of weightedtau(). Or, you can compute the weighted τ between two rank arrays. In the second case, you can pass the rank arrays as they are if they follow the "descending" convention, but you must negate them if they follow the "ascending" convention. It is a subtle point that the need for negation is only due to the fact that I'm assuming ascending ranks. But under the SciPy convention you can pass arrays of ranks and they will work as scores. That's why I removed that part. But I added some clarification on the fact that an external rank source follows a different convention. BTW, I don't expect anybody to ever supply such a source in real-world applications.

If you change the sign of two score vectors, τ will not change, because all out-of-order pairs have the same cost. This is not true for a weighted τ if you do not provide an external source of rank, because exchanges between more important elements cost more, and lacking a source of rank, importance will be induced by sorting the scores. If you pass an external rank array, you can change the sign of the score vectors and the weighted τ will not change, because only the relative order of the vectors will be relevant; the importance of the elements will be given by the rank array.
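
The three cases in the paragraph above can be checked with a small sketch. The data and the external rank array are made up for illustration; the rank array is assumed to be 0-based, with 0 marking the most important element.

```python
import numpy as np
from scipy.stats import kendalltau, weightedtau

# Illustrative scores without ties (values chosen arbitrarily).
x = np.array([0.9, 0.1, 0.5])
y = np.array([0.4, 0.2, 0.8])

# Hypothetical external source of rank: element 0 is the most important.
rank = np.arange(len(x))

# Unweighted tau: flipping the sign of both vectors has no effect.
tau, tau_neg = kendalltau(x, y)[0], kendalltau(-x, -y)[0]

# Weighted tau, importance induced by the scores (default rank=True):
# flipping the sign changes which elements count as important.
wt, wt_neg = weightedtau(x, y)[0], weightedtau(-x, -y)[0]

# Weighted tau with an external rank array: importance is fixed by `rank`,
# so only the relative order of x and y matters.
wt_ext, wt_ext_neg = weightedtau(x, y, rank)[0], weightedtau(-x, -y, rank)[0]

print(tau, tau_neg)
print(wt, wt_neg)
print(wt_ext, wt_ext_neg)
```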

Contributor

mdhaber left a comment

I think that this is an improvement and that it closes gh-12778.

@stde It would be great to have your feedback, since you reported the issue.

@stde

stde commented Dec 11, 2020

@mdhaber:
thanks for all the effort regarding the issue that was brought up. I think removing the note about using ranks instead of scores (4638-4641) will clear up the confusion about usage; however, this may create a new issue:

The documentation of weightedtau will then read like it will always expect scores. This seems inconsistent with the documentation/usage of kendalltau which will expect ranks.
The correct solution in my opinion would be to make everything consistent: either use scores everywhere (weightedtau and kendalltau) or ranks.

@vigna
Contributor Author

vigna commented Dec 11, 2020

Good point. Kendall's original paper considers rankings, but since the correlation is between ranks, there is no difference between passing to the method scores or the associated ranks. Indeed, the SciPy code does not assume anywhere that the arguments are rankings. I'm sure no user of stats.kendalltau() ever refrained from using vectors of reals; that's how τ is defined in every book today. But changing the wording might make this clear.
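
A quick way to see this is a sketch like the following, using random real-valued vectors (the seed and sizes are arbitrary). It compares kendalltau on the raw scores, on their ranks, and on strictly increasing transformations of the scores.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata

rng = np.random.default_rng(0)
x = rng.random(10)  # real-valued "scores" in [0, 1)
y = rng.random(10)

# Index [0] is the correlation; the second field is the p-value.
tau_scores = kendalltau(x, y)[0]
tau_ranks = kendalltau(rankdata(x), rankdata(y))[0]
# exp and cubing are strictly increasing on [0, 1), so the order is preserved.
tau_mono = kendalltau(np.exp(x), y ** 3)[0]

print(tau_scores, tau_ranks, tau_mono)  # the three values coincide
```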

@rgommers added the Documentation and scipy.stats labels Dec 12, 2020
@mdhaber
Contributor

mdhaber commented Jan 20, 2021

> But changing the wording might make this clear.

@vigna Did you want to change that, then?

@vigna
Contributor Author

vigna commented Jan 20, 2021

Yes, I think it would be clearer and more up to date with typical usage, but it's not my code.

@mdhaber
Contributor

mdhaber commented Jan 20, 2021

I see, so you think this should be merged as-is and kendalltau's documentation should be updated separately?

@vigna
Contributor Author

vigna commented Jan 21, 2021

That's a judgment call, and not mine—as I said, I'm not the author of kendalltau(). scipy's documentation refers to "ranking" without defining what a ranking is, and reports a formula quoting "concordant" and "discordant" pairs without defining what they are. So I'm pretty sure users of kendalltau() know what they're doing from other sources (the formula is also written in a uselessly complicated way—look, e.g., at a simpler, alternative form).

It also depends on how close you want to be to the quoted paper by Kendall, which indeed defines the measure for rankings. Wikipedia's page, for example, follows the modern usage: it's a measure of correlation on arbitrary data that depends only on the order of the values, rather than on the values themselves.

If you ask me: yes, the documentation of kendalltau() does not follow the current definition and applications of Kendall's τ. I see this as a very minor problem because, hopefully, if you're doing statistics you're not learning measures of ordinal association from scipy's docs :).

@mdhaber
Contributor

mdhaber commented Jan 22, 2021

Ok. I agree that kendalltau documentation could be improved. It's odd that it says "ranks" when the example clearly shows scores. But this PR is an improvement to the weightedtau documentation, so merging. Thanks @vigna, @stde!

@mdhaber merged commit bab1925 into scipy:master Jan 22, 2021
@tylerjereddy added this to the 1.7.0 milestone Jan 23, 2021