Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address concerns in gh-12999 for ks_2samp in exact mode. #13168

Closed
wants to merge 2 commits into from

Conversation

pvanmulbregt
Copy link
Contributor

Return an estimate of the relative error of the probability computed by
stats._compute_prob_inside_method(). As the desired probability is the
complement of the one computed, subtractive cancellation can occur.
Check if it may have, and declare the attempt a failure. This prevents
a probability ~ 1e-16 being reported when the actual probability is many orders
of magnitude lower.

Reference issue

Closes gh-12999

What does this implement/fix?

Checks if the computation of a probability in stats._compute_prob_inside_method(), using exact algorithm, may have lost too much precision to be used as a complementary probability. If so, back off to the asymptotic computation.
In stats. _compute_prob_outside_square(), do a pre-check to determine if the computation will definitely overflow. If so, return immediately.

Return an estimate of the relative error of the probability computed by
stats._compute_prob_inside_method(). As the desired probability is the
complement of the one computed, subtractive cancellation can occur.
Check if it may have, and declare the attempt a failure.  This prevents
a probability ~ 1e-16 being reported when the actual probability is many orders
of magnitude lower.
@rgommers rgommers added maintenance Items related to regular maintenance tasks scipy.stats labels Nov 29, 2020
@pvanmulbregt pvanmulbregt added the needs-work Items that are pending response from the author label Dec 5, 2020
@pvanmulbregt
Copy link
Contributor Author

Actually this cure is starting to look worse than the problem it is addressing...

@mdhaber
Copy link
Contributor

mdhaber commented Dec 9, 2020

Actually this cure is starting to look worse than the problem it is addressing...

Shall I hold off on review, then?

@pvanmulbregt
Copy link
Contributor Author

yes, hold off on any review, the "fix" affects too many probability calculations.
A better approach may be to compute an outside probability, scaling with every iteration to avoid underflow/ overflow.

@t-vi
Copy link
Contributor

t-vi commented Feb 17, 2021

@pvanmulbregt I wrote up a more stable recursion for the inner method / 2-sided 2-sample in
https://arxiv.org/abs/2102.08037

The basic trick that you can do instead of doing 1-A(m,n)/Binom(m+n,n) at the end, you can do the entire recursion on values of this form 1-A(i,j)/Binom(i+j,j). It seems to be about 4x slower than the current implementation, which I think it is due to the extra compute, but then both the current implementation and the stable recursion can be moved to numba / other ways to compile to speed them up.

@rgommers
Copy link
Member

This was superceded by gh-14432, closing. Thanks @pvanmulbregt, all

@rgommers rgommers closed this Sep 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Items related to regular maintenance tasks needs-work Items that are pending response from the author scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in scipy.stats.ks_2samp for two-sided auto and exact modes on a few thousand samples (when n1!=n2)
4 participants