New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kendall tau distance #7089

Open
erelsgl opened this Issue Feb 24, 2017 · 6 comments

Comments

Projects
None yet
6 participants
@erelsgl

erelsgl commented Feb 24, 2017

I do not understand the output of scipy.stats.kendalltau.

>>> from scipy import stats
>>> stats.kendalltau([1,3,2], [3,1,2]).correlation
-1.0000000000000002

Ignoring the rounding error, the correlation is -1, which means that the ranks are totally reversed. But, [1,3,2] and [3,1,2] are only one swap apart. So, shouldn't the correlation be 0.3333333, like here?:

 >>> stats.kendalltau([1,3,2], [1,2,3]).correlation
 0.33333333333333337
@josef-pkt

This comment has been minimized.

Member

josef-pkt commented Feb 24, 2017

Even if it is only one swap apart in the first example, the ranking is exactly reversed

>>> x = np.column_stack([[1,3,2], [3,1,2]])
>>> x[np.argsort(x[:,0])]
array([[1, 3],
       [2, 2],
       [3, 1]])

e.g. using pearson correlation

>>> np.corrcoef(x[np.argsort(x[:,0])].T)
array([[ 1., -1.],
       [-1.,  1.]])
@josef-pkt

This comment has been minimized.

Member

josef-pkt commented Feb 24, 2017

maybe that doesn't answer the issue, I'm thinking of spearman and don't have much intuition for kendall

(edit: I guess there should be a theorem that if pearson correlation = 1 or -1, then it's a line and any other correlation should be the same 1 or -1)

@mirca

This comment has been minimized.

Contributor

mirca commented Feb 25, 2017

Hi @erelsgl, the results looks correct. In your first example you have x = [1, 3, 2] and y = [3, 1, 2] which gives #{discordant pairs} = 3, #{concordant pairs} = 0, resulting in tau = (0 - 3) / 3 = -1. In the second example, x = [1, 3, 2] and y = [1, 2, 3], so #{discordant pairs} = 1 and #{concordant pairs} = 2 , hence tau = (2 - 1) / 3 = 1 / 3.
Take a look at https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient

@erelsgl

This comment has been minimized.

erelsgl commented Feb 25, 2017

Hmmm... the Wikipedia page on https://en.wikipedia.org/wiki/Kendall_tau_distance says that "Kendall tau distance is also called bubble-sort distance since it is equivalent to the number of swaps that the bubble sort algorithm would make to place one list in the same order as the other list". The number of swaps in my example is obviously small (1), while the number of discordant pairs is large (3). They are not the same. Is this a mistake in the Wikipedia page?

@evgenyzhurko

This comment has been minimized.

Contributor

evgenyzhurko commented Feb 27, 2017

Kendall rank correlation coefficient and Kendall tau distance are the different measurement.

@erelsgl say true information about Kendall tau distance. kendall_tau_distance = m / (n * (n - 1) / 2), where m - pairs whose values are in opposite order and n - size of list. As a result, Kendall tau distance therefore lies in the interval [0,1], because m in never less than 0.

@mirca also say true information, but about Kendall rank correlation coefficient. kendall_tau_correlation = (p - q) / (n * (n - 1) / 2), where p - number of concordant pairs, q - number of discordant pairs, n - size of list. kendall_tau_correlation lies in the interval [-1;1].

scipy.stats.kendalltau is a function to find Kendall rank correlation coefficient but not for Kendall tau distance. I didn't find the implementation of Kendall tau distance in SciPy.

If the implementation is not exist in SciPy actually, I can implement it.

@masdeseiscaracteres

This comment has been minimized.

masdeseiscaracteres commented Apr 20, 2018

Hi all! Interesting discussion. For those interested (and answering @erelsgl question) I wrote a post trying to shed some light on what is going on with the Kendall Tau distance and the Wikipedia page. Basically the confusion roots in two distinct usages of the word "ranking". Sometimes it refers to "ranking lists", and, some others, to "ranked lists of items". We get weird results when the type of "ranking" we have at hand does not match the one used in the definition we are applying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment