ENH: correct p-values for stats.kendalltau and stats.mstats.kendalltau #8614
Conversation
Do you know a reference that is available online where the formulas you added are shown? I could not find the book by Kendall to read online.
@chrisb83 Sorry, I'm not aware of an online source. If necessary I could summarize the method here, but it's really just a counting scheme to determine how many permutations of a set of distinct numbers lead to a given number of concordances/discordances. The resulting number is then divided by the total number of permutations.
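The counting scheme described here can be sketched in a few lines: build up, one element at a time, the number of permutations with each possible number of discordant pairs (inversions), then divide the tail mass by n!. The function names below are illustrative, not scipy's internals.

```python
from math import factorial

def inversion_counts(n):
    # counts[k] = number of permutations of n distinct items with
    # exactly k discordant pairs (inversions)
    counts = [1]                      # n = 1: one permutation, no inversions
    for j in range(2, n + 1):
        prev, counts = counts, [0] * (len(counts) + j - 1)
        for k in range(len(counts)):
            # inserting the j-th item can add 0 .. j-1 inversions
            for add in range(j):
                if 0 <= k - add < len(prev):
                    counts[k] += prev[k - add]
    return counts

def exact_kendall_pvalue(n, dis):
    # two-sided p-value: fold to the smaller tail, double, cap at 1
    counts = inversion_counts(n)
    d = min(dis, len(counts) - 1 - dis)
    return min(2.0 * sum(counts[:d + 1]) / factorial(n), 1.0)
```

For example, for n = 3 the counts are [1, 2, 2, 1] (summing to 3! = 6), so a perfectly concordant sample has a two-sided p-value of 2/6.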
Is there anything else that still needs to be done here? Please let me know, thanks.
@Konrad0 could you please rebase this? Also, API-wise I'd prefer

The changes look good to me overall. Unfortunate that there's no online reference. @josef-pkt do you happen to be familiar with this exact method?
test coverage against R looks good, so we don't need to proof-read the algorithm (I don't have a reference either).
Using "auto" as default is ok, but it is not backwards compatible in that it switches the algorithm for small sample sizes.
code duplication: outsourcing the exact p-value computation would avoid increasing the code duplication between stats and mstats even further.
Otherwise looks good to me.
The "exact" p-value computation might be a candidate for numba or cython.
scipy/stats/mstats_basic.py
Outdated
xties = count_tied_groups(x)
yties = count_tied_groups(y)
#xties = count_tied_groups(x)
#yties = count_tied_groups(y)
delete instead of commenting out
# Tests some computations of Kendall's tau
x = ma.fix_invalid([5.05, 6.75, 3.21, 2.66, np.nan])
y = ma.fix_invalid([1.65, 26.5, -5.93, 7.96, np.nan])
z = ma.fix_invalid([1.65, 2.64, 2.64, 6.95, np.nan])
assert_almost_equal(np.asarray(mstats.kendalltau(x, y)),
                    [+0.3333333, 0.4969059])
                    [+0.3333333, 0.75])
IIUC, then we should keep the old numbers for "asymptotic" and add a separate test for "auto". Same for test_stats.
@@ -536,38 +547,79 @@ def kendalltau(x, y, use_ties=True, use_missing=False):
            for i in range(len(ry)-1)], dtype=float)
D = np.sum([((ry[i+1:] < ry[i])*(rx[i+1:] > rx[i])).filled(0).sum()
            for i in range(len(ry)-1)], dtype=float)
xties = count_tied_groups(x)
In terms of performance: taking this outside of the if use_ties block will increase the redundant work when use_ties is False. The stats version does not have the use_ties option and always computes them.
One possibility would be to compute count_tied_groups if use_ties or method="exact", but then the program flow would need to be changed.
I.e., the question is whether there should be a fast path for floating point numbers, where we know that we don't have ties and the sample is large enough for "asymptotic", and how much difference in performance that would actually make.
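The fast path being discussed would hinge on a cheap tie check; a minimal sketch (the helper name and the dispatch comment are assumptions, not the actual scipy flow):

```python
import numpy as np

def has_ties(a):
    # if no value repeats, the tie-correction bookkeeping
    # (count_tied_groups) can be skipped entirely
    a = np.asarray(a)
    return bool(np.unique(a).size < a.size)

# hypothetical dispatch:
# if not has_ties(x) and not has_ties(y) and the sample is large,
# go straight to the asymptotic p-value without tie corrections
```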
@josef-pkt Good catch, I agree that this should be improved. This goes hand in hand with the outsourcing you mentioned above since restructuring the flow will be much easier once the functionality is outsourced. I'll get to work on it as soon as possible, but for now I have implemented the more urgent issues.
Force-pushed from f6e72e4 to 6869e60
@josef-pkt thanks for the review! @Konrad0 I did the rebase and pushed it to your branch.
@rgommers, @josef-pkt: Sorry for the delay, thanks to you both for the rebase and review! I'll implement your suggestions in the next few days.
@rgommers, @josef-pkt: Alright, I finally managed to implement your suggestions, see the latest commit. Let me know if there's anything else to be fixed, besides the performance improvement discussed above.
scipy/stats/mstats_basic.py
Outdated
        new[k] += new[k-1]
for k in range(j,c+1):
    new[k] += new[k-1] - old[k-j]
prob = 2.0*sum(new)/np.math.factorial(n)
else:
We will now also end up in this else block for method='somerandomtypo'. Making this an elif method == 'asymptotic' and raising an error when method is an incorrect string would be good.
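The suggested shape, sketched with a hypothetical dispatcher (the placeholder return values stand in for the real p-value computations):

```python
def compute_pvalue(method):
    # explicit elif per accepted value, and a ValueError for anything
    # else, so typos no longer fall silently into the asymptotic branch
    if method == 'exact':
        return 'exact branch'        # placeholder for the exact computation
    elif method == 'asymptotic':
        return 'asymptotic branch'   # placeholder for the normal approximation
    else:
        raise ValueError("Unknown method %r specified; use 'exact' "
                         "or 'asymptotic'." % method)
```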
Good point, fixed in the latest commit for both stats and mstats.
@josef-pkt if I understand correctly, with the
looks good to me overall now, just made one more minor comment.
@rgommers Thanks for the comment, I implemented the change for stats as well as mstats and added tests.
So are we good to merge @rgommers, @josef-pkt?
The exact method of computing the p-value is implemented (see scipy#8456). It is used automatically for small samples or when the user explicitly demands it. Otherwise the existing asymptotic method is used.
Force-pushed from f590f00 to e651993
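The "auto" selection described in the commit message could look roughly like this; the sample-size cutoff of 33 is an assumption for illustration, not necessarily scipy's value:

```python
def choose_method(n, has_ties, method='auto'):
    # resolve method='auto': exact only for small, tie-free samples,
    # asymptotic otherwise (cutoff of 33 is illustrative)
    if method == 'auto':
        return 'exact' if (n <= 33 and not has_ties) else 'asymptotic'
    return method
```

An explicit method='exact' or method='asymptotic' passes through untouched, so users who relied on the old behaviour can keep it.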
LGTM now, merged. Ignore the test re-runs, I just rebased for a conflict in THANKS.txt; tests of the last commit were all green 10 minutes ago.
Thanks @Konrad0 and @josef-pkt!
Thanks to you both for your support, I'll work on the suggested performance improvement as soon as time permits.
It looks like this caused errors in
Extended the exact method for calculating the p-value in stats.kendalltau() (originally added in scipy#8614) to sample sizes greater than 171. Some roundoff error is unavoidable. Tests are also added. (All also for the masked version.)
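One way to push the exact computation past sample sizes where n! overflows a double (n > 170), consistent with the roundoff caveat in this commit message, is to carry the recurrence on probabilities instead of raw counts; this is a sketch, not necessarily scipy's exact implementation:

```python
import numpy as np

def exact_pvalue_large_n(n, dis):
    # dividing by j at every insertion step keeps the array summing
    # to 1, so n! is never formed and large n no longer overflows
    # (at the cost of some floating-point roundoff)
    tot = n * (n - 1) // 2
    probs = np.zeros(tot + 1)
    probs[0] = 1.0
    for j in range(2, n + 1):
        c = np.cumsum(probs)
        probs = c.copy()
        probs[j:] -= c[:-j]
        probs /= j                   # normalize: total mass stays 1.0
    d = min(dis, tot - dis)
    return min(2.0 * probs[:d + 1].sum(), 1.0)
```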