
ENH: correct p-values for stats.kendalltau and stats.mstats.kendalltau #8614

Merged
rgommers merged 5 commits into scipy:master from stats_kendalltest_small_samples on Jun 23, 2018

Conversation

Konrad0
Contributor

@Konrad0 Konrad0 commented Mar 25, 2018

The exact method of computing the p-value is implemented (see #8456).
It is used automatically for small samples or when the user
explicitly demands it. Otherwise the existing asymptotic method
is used.

@chrisb83
Member

Do you know a reference that is available online where the formulas you added are shown? I could not find the book by Kendall to read online.

@Konrad0
Contributor Author

Konrad0 commented Mar 31, 2018

@chrisb83 Sorry, I'm not aware of an online source. If necessary I could summarize the method here, but it's really just a counting scheme to determine how many permutations of a set of distinct numbers lead to a given number of concordances/discordances. The resulting number is then divided by the total number of permutations n!.
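
As a rough illustration of the counting scheme described above, here is a hedged, self-contained sketch (the function name and interface are my own, not the PR's actual code): the null distribution of the number of discordant pairs over all n! permutations of distinct values can be built by repeated convolution, and the two-sided p-value is then the normalized tail mass.

```python
from math import factorial

def exact_kendall_pvalue(n, dis):
    """Two-sided exact p-value for observing `dis` or fewer discordant
    pairs among n distinct observations, under the null of independence."""
    # counts[k] = number of permutations of n distinct items with
    # exactly k discordant pairs (i.e. k inversions)
    counts = [1]
    for j in range(2, n + 1):
        # inserting the j-th item creates 0 .. j-1 new inversions,
        # so convolve the current distribution with j ones
        new = [0] * (len(counts) + j - 1)
        for k, c in enumerate(counts):
            for extra in range(j):
                new[k + extra] += c
        counts = new
    tail = sum(counts[:dis + 1])
    return min(1.0, 2.0 * tail / factorial(n))
```

For example, n = 4 with 2 discordances gives 2 * (1 + 3 + 5) / 24 = 0.75, which matches the exact value appearing in the updated mstats test later in this thread.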

@Konrad0
Contributor Author

Konrad0 commented Jun 5, 2018

Is there anything else that still needs to be done here? Please let me know, thanks.

@rgommers rgommers added the scipy.stats and enhancement labels on Jun 5, 2018
@rgommers
Member

rgommers commented Jun 5, 2018

@Konrad0 could you please rebase this? Also, API wise I'd prefer method='auto' over 'automatic', because this is used in a couple of other places already (signal.correlate/convolve, sparse.csgraph.shortest_path).

The changes look good to me overall. Unfortunate that there's no online reference. @josef-pkt do you happen to be familiar with this exact method?
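
For reference, a short usage sketch of the `method` keyword as it ended up in `scipy.stats.kendalltau` (accepted values 'auto', 'asymptotic', 'exact'; the sample data here is my own):

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 5, 4]   # small sample, no ties

# asymptotic normal approximation (the pre-existing behavior)
tau_a, p_a = stats.kendalltau(x, y, method='asymptotic')

# exact permutation-based p-value (added in this PR)
tau_e, p_e = stats.kendalltau(x, y, method='exact')

# 'auto' selects the exact computation for small, tie-free samples
tau, p = stats.kendalltau(x, y, method='auto')
```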

Member

@josef-pkt josef-pkt left a comment


Test coverage against R looks good, so we don't need to proof-read the algorithm. (I don't have a reference either.)

Using "auto" as default is OK, but it is not backwards compatible in that it switches the algorithm for small sample sizes.

Code duplication: factoring out the exact p-value computation would avoid adding even more code duplication between stats and mstats.

Otherwise looks good to me.

The "exact" p-value computation might be a candidate for numba or cython.

-    xties = count_tied_groups(x)
-    yties = count_tied_groups(y)
+    #xties = count_tied_groups(x)
+    #yties = count_tied_groups(y)
Member


delete instead of commenting out

# Tests some computations of Kendall's tau
x = ma.fix_invalid([5.05, 6.75, 3.21, 2.66,np.nan])
y = ma.fix_invalid([1.65, 26.5, -5.93, 7.96, np.nan])
z = ma.fix_invalid([1.65, 2.64, 2.64, 6.95, np.nan])
assert_almost_equal(np.asarray(mstats.kendalltau(x,y)),
-                        [+0.3333333,0.4969059])
+                        [+0.3333333,0.75])
Member


IIUC, then we should keep the old numbers for "asymptotic" and add a separate test for "auto"
same for test_stats

@@ -536,38 +547,79 @@ def kendalltau(x, y, use_ties=True, use_missing=False):
for i in range(len(ry)-1)], dtype=float)
D = np.sum([((ry[i+1:] < ry[i])*(rx[i+1:] > rx[i])).filled(0).sum()
for i in range(len(ry)-1)], dtype=float)
xties = count_tied_groups(x)
Member


In terms of performance: taking this outside of the `if use_ties` block will increase the redundant work when `use_ties` is False. The stats version does not have the `use_ties` option and always computes them.

One possibility would be to compute `count_tied_groups` if `use_ties or method == "exact"`, but then the program flow would need to be changed.

I.e. the question is whether there should be a fast path for floating point numbers, where we know that we don't have ties and the sample is large enough for "asymptotic", and how much difference in performance that would actually make.
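
The no-ties fast path being discussed could look something like this minimal sketch (`has_ties` is a hypothetical helper, not part of the PR):

```python
import numpy as np

def has_ties(a):
    """Return True if the 1-d array-like `a` contains repeated values."""
    a = np.asarray(a)
    return bool(np.unique(a).size != a.size)

# tie-group counting would then only run when ties can matter, e.g.:
#     if use_ties or method == "exact":
#         xties = count_tied_groups(x)
```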

Contributor Author


@josef-pkt Good catch, I agree that this should be improved. This goes hand in hand with the outsourcing you mentioned above since restructuring the flow will be much easier once the functionality is outsourced. I'll get to work on it as soon as possible, but for now I have implemented the more urgent issues.

@rgommers rgommers force-pushed the stats_kendalltest_small_samples branch from f6e72e4 to 6869e60 Compare June 10, 2018 05:08
@rgommers
Member

@josef-pkt thanks for the review! @Konrad0 I did the rebase and pushed it to your branch.

@Konrad0
Contributor Author

Konrad0 commented Jun 11, 2018

@rgommers, @josef-pkt: Sorry for the delay, thanks to you both for the rebase and review! I'll implement your suggestions in the next few days.

@Konrad0
Contributor Author

Konrad0 commented Jun 14, 2018

@rgommers, @josef-pkt: Alright, I finally managed to implement your suggestions, see the latest commit. Let me know if there's anything else to be fixed, besides the performance improvement discussed above.

new[k] += new[k-1]
for k in range(j,c+1):
new[k] += new[k-1] - old[k-j]
prob = 2.0*sum(new)/np.math.factorial(n)
else:
Member


We will now also end up in this else block for method='somerandomtypo'. Making this an elif method == 'asymptotic' and raising an error when method is an unrecognized string would be good.

Contributor Author


Good point, fixed in the latest commit for both stats and mstats.

@rgommers
Member

Using "auto" as default is ok, but is not backwards compatible in that it switches the algorithm for small values.

@josef-pkt if I understand correctly, with the exact method we'll get non-identical but "more correct/accurate" answers in this case. I wouldn't call that a backwards compatibility break.

@rgommers
Member

looks good to me overall now, just made one more minor comment.

@Konrad0
Contributor Author

Konrad0 commented Jun 15, 2018

@rgommers Thanks for the comment, I implemented the change for stats as well as mstats and added tests.

@Konrad0
Contributor Author

Konrad0 commented Jun 22, 2018

So are we good to merge @rgommers, @josef-pkt?

@rgommers rgommers force-pushed the stats_kendalltest_small_samples branch from f590f00 to e651993 Compare June 23, 2018 07:26
@rgommers rgommers merged commit 31597b5 into scipy:master Jun 23, 2018
@rgommers
Member

LGTM now, merged.

Ignore the test re-runs; I just rebased for a conflict in THANKS.txt. Tests of the last commit were all green 10 minutes ago.

@rgommers
Member

Thanks @Konrad0 and @josef-pkt!

@rgommers rgommers added this to the 1.2.0 milestone Jun 23, 2018
@Konrad0
Contributor Author

Konrad0 commented Jun 24, 2018

Thanks to you both for your support, I'll work on the suggested performance improvement as soon as time permits.

@larsoner
Member

It looks like this caused errors in master due to using np.sum on a generator. PR forthcoming to fix it.
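
The failing call itself is not shown here; as an illustration of the pitfall, np.sum does not reliably reduce a plain generator the way the builtin sum does, so the usual fix is to materialize the values first:

```python
import numpy as np

# the builtin sum accepts a generator...
total_builtin = sum(i * i for i in range(5))

# ...but for np.sum the values should be materialized first,
# e.g. with np.fromiter (or a list comprehension)
total_np = np.sum(np.fromiter((i * i for i in range(5)), dtype=np.intp))
```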

Konrad0 added a commit to Konrad0/scipy that referenced this pull request Feb 20, 2020
Extended the exact method for calculating the p-value in stats.kendalltau()
(originally added in scipy#8614) to sample sizes greater than 171.
Some roundoff error is unavoidable. Tests are also added.
(All also for the masked version.)
@Konrad0 Konrad0 deleted the stats_kendalltest_small_samples branch February 20, 2020 21:48