ENH: correct p-values for stats.kendalltau and stats.mstats.kendalltau #8614
Conversation
Do you know a reference that is available online where the formulas you added are shown? I could not find the book by Kendall to read online.
@chrisb83 Sorry, I'm not aware of an online source. If necessary I could summarize the method here, but it's really just a counting scheme to determine how many permutations of a set of distinct numbers lead to a given number of concordances/discordances. The resulting number is then divided by the total number of permutations.
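The counting scheme described here can be sketched in a few lines: build up, one element at a time, the number of permutations with each possible number of discordant pairs (inversions), then divide the tail mass by n!. The function names below are illustrative, not scipy's internals.

```python
from math import factorial

def inversion_counts(n):
    # counts[k] = number of permutations of n distinct items with
    # exactly k discordant pairs (inversions)
    counts = [1]                      # n = 1: one permutation, no inversions
    for j in range(2, n + 1):
        prev, counts = counts, [0] * (len(counts) + j - 1)
        for k in range(len(counts)):
            # inserting the j-th item can add 0 .. j-1 inversions
            for add in range(j):
                if 0 <= k - add < len(prev):
                    counts[k] += prev[k - add]
    return counts

def exact_kendall_pvalue(n, dis):
    # two-sided p-value: fold to the smaller tail, double, cap at 1
    counts = inversion_counts(n)
    d = min(dis, len(counts) - 1 - dis)
    return min(2.0 * sum(counts[:d + 1]) / factorial(n), 1.0)
```

For example, for n = 3 the counts are [1, 2, 2, 1] (summing to 3! = 6), so a perfectly concordant sample has a two-sided p-value of 2/6.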
Is there anything else that still needs to be done here? Please let me know, thanks.
@Konrad0 could you please rebase this? Also, API-wise I'd prefer

The changes look good to me overall. Unfortunate that there's no online reference. @josef-pkt do you happen to be familiar with this exact method?
test coverage against R looks good, so we don't need to proof-read the algorithm (I don't have a reference either).
Using "auto" as default is ok, but it is not backwards compatible in that it switches the algorithm for small sample sizes.
code duplication: outsourcing the exact p-value computation would avoid increasing the code duplication between stats and mstats even further.
Otherwise looks good to me.
The "exact" p-value computation might be a candidate for numba or cython.
scipy/stats/mstats_basic.py
Outdated
xties = count_tied_groups(x)
yties = count_tied_groups(y)
#xties = count_tied_groups(x)
#yties = count_tied_groups(y)
delete instead of commenting out
# Tests some computations of Kendall's tau
x = ma.fix_invalid([5.05, 6.75, 3.21, 2.66, np.nan])
y = ma.fix_invalid([1.65, 26.5, -5.93, 7.96, np.nan])
z = ma.fix_invalid([1.65, 2.64, 2.64, 6.95, np.nan])
assert_almost_equal(np.asarray(mstats.kendalltau(x, y)),
                    [+0.3333333, 0.4969059])
                    [+0.3333333, 0.75])
IIUC, then we should keep the old numbers for "asymptotic" and add a separate test for "auto". Same for test_stats.
@@ -536,38 +547,79 @@ def kendalltau(x, y, use_ties=True, use_missing=False):
            for i in range(len(ry)-1)], dtype=float)
D = np.sum([((ry[i+1:] < ry[i])*(rx[i+1:] > rx[i])).filled(0).sum()
            for i in range(len(ry)-1)], dtype=float)
xties = count_tied_groups(x)
In terms of performance: taking this outside of the if use_ties block will increase the redundant work when use_ties is False. The stats version does not have the use_ties option and always computes them.
One possibility would be to compute count_tied_groups if use_ties or method="exact", but then the program flow would need to be changed.
I.e., the question is whether there should be a fast path for floating point numbers, where we know that we don't have ties and the sample is large enough for "asymptotic", and how much difference in performance that would actually make.
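The fast path being discussed would hinge on a cheap tie check; a minimal sketch (the helper name and the dispatch comment are assumptions, not the actual scipy flow):

```python
import numpy as np

def has_ties(a):
    # if no value repeats, the tie-correction bookkeeping
    # (count_tied_groups) can be skipped entirely
    a = np.asarray(a)
    return bool(np.unique(a).size < a.size)

# hypothetical dispatch:
# if not has_ties(x) and not has_ties(y) and the sample is large,
# go straight to the asymptotic p-value without tie corrections
```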
@josef-pkt Good catch, I agree that this should be improved. This goes hand in hand with the outsourcing you mentioned above since restructuring the flow will be much easier once the functionality is outsourced. I'll get to work on it as soon as possible, but for now I have implemented the more urgent issues.
Force-pushed from f6e72e4 to 6869e60
@josef-pkt thanks for the review! @Konrad0 I did the rebase and pushed it to your branch.
@rgommers, @josef-pkt: Sorry for the delay, thanks to you both for the rebase and review! I'll implement your suggestions in the next few days.
@rgommers, @josef-pkt: Alright, I finally managed to implement your suggestions, see the latest commit. Let me know if there's anything else to be fixed, besides the performance improvement discussed above.
scipy/stats/mstats_basic.py
Outdated
        new[k] += new[k-1]
for k in range(j,c+1):
    new[k] += new[k-1] - old[k-j]
prob = 2.0*sum(new)/np.math.factorial(n)
else:
We will now also end up in this else block for method='somerandomtypo'. Making this an elif method == 'asymptotic' and raising an error when method is an incorrect string would be good.
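The suggested shape, sketched with a hypothetical dispatcher (the placeholder return values stand in for the real p-value computations):

```python
def compute_pvalue(method):
    # explicit elif per accepted value, and a ValueError for anything
    # else, so typos no longer fall silently into the asymptotic branch
    if method == 'exact':
        return 'exact branch'        # placeholder for the exact computation
    elif method == 'asymptotic':
        return 'asymptotic branch'   # placeholder for the normal approximation
    else:
        raise ValueError("Unknown method %r specified; use 'exact' "
                         "or 'asymptotic'." % method)
```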
Good point, fixed in the latest commit for both stats and mstats.
@josef-pkt if I understand correctly, with the
looks good to me overall now, just made one more minor comment.
@rgommers Thanks for the comment, I implemented the change for stats as well as mstats and added tests.
So are we good to merge @rgommers, @josef-pkt?
The exact method of computing the p-value is implemented (see scipy#8456). It is used automatically for small samples or when the user explicitly demands it. Otherwise the existing asymptotic method is used.
Force-pushed from f590f00 to e651993
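The "auto" selection described in the commit message could look roughly like this; the sample-size cutoff of 33 is an assumption for illustration, not necessarily scipy's value:

```python
def choose_method(n, has_ties, method='auto'):
    # resolve method='auto': exact only for small, tie-free samples,
    # asymptotic otherwise (cutoff of 33 is illustrative)
    if method == 'auto':
        return 'exact' if (n <= 33 and not has_ties) else 'asymptotic'
    return method
```

An explicit method='exact' or method='asymptotic' passes through untouched, so users who relied on the old behaviour can keep it.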
LGTM now, merged. Ignore the test re-runs, I just rebased for a conflict in THANKS.txt; tests of the last commit were all green 10 minutes ago.
Thanks @Konrad0 and @josef-pkt!
Thanks to you both for your support, I'll work on the suggested performance improvement as soon as time permits.
It looks like this caused errors in
Extended the exact method for calculating the p-value in stats.kendalltau() (originally added in scipy#8614) to sample sizes greater than 171. Some roundoff error is unavoidable. Tests are also added. (All also for the masked version.)
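One way to push the exact computation past sample sizes where n! overflows a double (n > 170), consistent with the roundoff caveat in this commit message, is to carry the recurrence on probabilities instead of raw counts; this is a sketch, not necessarily scipy's exact implementation:

```python
import numpy as np

def exact_pvalue_large_n(n, dis):
    # dividing by j at every insertion step keeps the array summing
    # to 1, so n! is never formed and large n no longer overflows
    # (at the cost of some floating-point roundoff)
    tot = n * (n - 1) // 2
    probs = np.zeros(tot + 1)
    probs[0] = 1.0
    for j in range(2, n + 1):
        c = np.cumsum(probs)
        probs = c.copy()
        probs[j:] -= c[:-j]
        probs /= j                   # normalize: total mass stays 1.0
    d = min(dis, tot - dis)
    return min(2.0 * probs[:d + 1].sum(), 1.0)
```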