# ENH: Add k-sample Anderson-Darling test to stats module #3183

Closed
wants to merge 31 commits
+338 −2


### 6 participants

Contributor
 This PR adds the k-sample Anderson-Darling test for continuous distributions, as described by Scholz & Stephens (1987, Journal of the American Statistical Association, Vol. 82, pp. 918-924), to the stats module.
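The function described in this PR ultimately landed as `scipy.stats.anderson_ksamp`; a minimal usage sketch, assuming a scipy version that includes the merged function (whose final signature, after the review below, takes a sequence of samples rather than positional sample arguments):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(314159)

# Two samples drawn from normal distributions with different means.
x = rng.normal(size=50)
y = rng.normal(loc=0.5, size=30)

# Returns the normalized test statistic, critical values for a set of
# tabulated significance levels, and an interpolated significance level.
statistic, critical_values, p_value = stats.anderson_ksamp([x, y])
```

A large statistic relative to the tabulated critical values argues for rejecting the null hypothesis that all samples come from the same population.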
 joergdietrich `ENH: Add k-sample Anderson-Darling test to stats module` `95ad13d`
and 1 other commented on an outdated diff Jan 3, 2014
scipy/stats/morestats.py
```
+    Zstar = np.unique(Z)
+    L = Zstar.size
+    if not L > 1:
+        raise ValueError("anderson_ksamp needs more than one distinct "
+                         "observation")
+    n = np.array([sample.size for sample in samples])
+    if any(n == 0):
+        raise ValueError("anderson_ksamp encountered sample without "
+                         "observations")
+    A2kN = 0.
+    lj = np.array([(Z == zj).sum() for zj in Zstar[:-1]])
+    Bj = lj.cumsum()
+    for i in arange(0, k):
+        fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
+        Mij = fij.cumsum()
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
```
 josef-pkt (Member): Is this equation (6) in Scholz and Stephens, and not equation (3)? A discrete parent population? Not continuous as in the docstring, lines 1174-1175.

 joergdietrich (Contributor): Yes, it is. The docstring needs to be fixed.
commented on the diff Jan 3, 2014
scipy/stats/morestats.py
```
+        fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
+        Mij = fij.cumsum()
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
+        A2kN += inner.sum() / n[i]
+
+    h = (1. / arange(1, N)).sum()
+    H = (1. / n).sum()
+    g = 0
+    for l in arange(1, N-1):
+        inner = np.array([1. / ((N - l) * m) for m in arange(l+1, N)])
+        g += inner.sum()
+    a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
+    b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
+    c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
+    d = (2*h + 6)*k**2 - 4*h*k
+    sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
```
 josef-pkt (Member): Is this based on equation (4)?

 joergdietrich (Contributor): Yes.
Member
 Good. I never tried my hand at the discrete Anderson-Darling tests. I think we should add `_discrete` to the name of the function. I'm asking a question on the mailing list about the signature.
Member
 If we are planning on two different functions for discrete and continuous, then it might be better to factor the p-value calculation out into a (private) helper function.
Contributor
 I'll change the signature according to Ralf's suggestion. Regarding discrete and continuous distributions, I now have code to compute equations 3, 6, and 7 of Scholz and Stephens. So we could include all of them, or just 6 and 7, where the latter would also apply to continuous distributions.
Member
 One question that's not clear to me when I read the formulas: does equation (3) give the same answer as equation (6) even when the values are discrete and there are ties?

 The two main questions are: how many versions of the calculation do we need? And what's the best way to implement them?

 To the second: I'm pretty sure the continuous version can be made without loops by using np.searchsorted. For the discrete version it might be possible to use np.searchsorted or a cython function for rankdata that is already in scipy.stats. I think getting a very fast version is not a requirement for this PR, but it's also possible that it's not very difficult to get a version free of Python loops with the existing tools.

 What are the options for a user if we have equations 3, 6, and 7? If 3 and 6 can use the same code, then the only choice would be whether to use the midrank in case of ties. If only the continuous equation (3) can get a fast algorithm, then we should also allow users to choose that.

 (I'm a bit distracted because I also have pull requests in statsmodels where I need to catch up with some reading to understand the topics.)
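The loop-free counting suggested here can be sketched with `np.searchsorted` on sorted data (toy data, not from the paper): the difference of the 'right' and 'left' insertion points gives the multiplicity of each distinct value, and the 'right' insertion points into a sorted sample give the cumulative counts M_ij directly.

```python
import numpy as np

# Pooled, sorted observations with ties (toy example).
Z = np.array([1., 2., 2., 3., 5., 5., 5., 7.])
Zstar = np.unique(Z)

# Multiplicity l_j of each distinct value: count of elements equal to
# Zstar[j], obtained without any Python-level loop.
lj = Z.searchsorted(Zstar, side='right') - Z.searchsorted(Zstar, side='left')

# Cumulative count M_ij for one sample: number of observations in that
# sample which are <= each distinct pooled value.
s = np.sort(np.array([2., 5.]))
Mij = s.searchsorted(Zstar, side='right')
```

This is exactly the replacement for the `(Z == zj).sum()` list comprehensions in the first version of the diff.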
Member
 Related: from what I remember, equation (5), a weighted sum of chi-square distributions, shows up in some of the discrete goodness-of-fit tests. I finally wrote the code for getting the p-values from that, but I don't have any truncation rule for the infinite sum. I don't think it's really relevant in this case because, according to the references I looked at, the Anderson-Darling statistic converges fast and Stephens' approximation seems to work pretty well. However, I didn't look much at discrete cases.
 joergdietrich: `Speed up and implement both version for discrete distributions` (`3f421bc`)

 1. Change call signature to have array of arrays and optional keyword to specify which version of the k-sample AD test should be computed.
 2. Get rid of all inner loops and list comprehensions by using np.searchsorted.
 3. Both version given by Scholz & Stephens for discrete sample can be computed now.
Contributor
 I managed to rewrite everything inside the outer loop and the determination of the multiplicity using np.searchsorted. Thanks for this pointer, I wasn't aware of the power of this function. Unfortunately, it's only a speed-up of less than a percent per outer loop for a large range of sample sizes, so the list comprehension was doing pretty well already. This change mostly benefits large numbers of samples, which may not be very common. Eqs. 3, 6, 7 always disagree.
Contributor
 The test failure is because a few days ago TravisCI decided to start using less precision when it calculates small cosine integrals, for some reason.
Contributor
scipy/stats/morestats.py
```
+    ValueError
+        If less than 2 samples are provided, a sample is empty, or no
+        distinct observations are in the samples.
+
+    See Also
+    --------
+    ks_2samp : 2 sample Kolmogorov-Smirnov test
+    anderson : 1 sample Anderson-Darling test
+
+    Notes
+    -----
+    [1]_ Define three versions of the k-sample Anderson-Darling test:
+    one for continous distributions and two for discrete
+    distributions, in which ties between samples may occur. The latter
+    variant of the test is also applicable to continuous data. By
+    default, this routine computes the test for continuous and
```
 rgommers (Member): This statement looks incorrect (only one p-value is returned) and would also be strange. The default is continuous only, right?
and 1 other commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
```
+
+    Parameters
+    ----------
+    samples : array_like
+        array of sample data in arrays
+
+    discrete : bool, optional
+        type of Anderson-Darling test which is computed. Default is a test
+        applicable to discrete and continous distributions.
+
+    Returns
+    -------
+    Tk : float
+        Normalized k-sample Anderson-Darling test statistic, not adjusted for
+        ties
+    tm : array
```
 rgommers (Member): Could you use some more descriptive variable names? `p` is so widely used that it may be OK, but `Tk` and `tm` are not. Same comment for the internal variables; they're almost all one-letter, which isn't readable.

 rgommers (Member): Actually, why not name the return values consistent with `anderson`?

 joergdietrich (Contributor): I changed the return values to be as consistent with `anderson` as possible. I prefer to keep the internal variable names unchanged. The implementation of methods from the literature can only be understood together with the paper describing the method. In such cases I find it much more helpful to have the variable names in the code match the variable names in the paper, rather than trying to come up with descriptive names that anybody trying to match the code to the paper then has to translate back.

 rgommers (Member): OK, the paper is available online for free, so for internal variables that should be fine.
Member
commented Feb 1, 2014
 The function needs to be added in `stats/__init__.py` in order for it to show up in the documentation.
and 2 others commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
```
+        g += inner.sum()
+    a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
+    b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
+    c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
+    d = (2*h + 6)*k**2 - 4*h*k
+    sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
+    m = k - 1
+    Tk = (A2kN - m) / math.sqrt(sigmasq)
+
+    b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
+    b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
+    b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
+    tm = b0 + b1 / math.sqrt(m) + b2 / m
+    pf = np.polyfit(tm, log(np.array([0.25, 0.1, 0.05, 0.025, 0.01])), 2)
+    if Tk < tm.min() or Tk > tm.max():
+        warnings.warn("approximate p-value will be computed by extrapolation")
```
 rgommers (Member): Is this warning needed? It shows up in most of the test cases, so I'm guessing it's not that uncommon (didn't check). If so, adding a note in the docstring might make more sense. If the warning has to be kept, it shouldn't show up in the test output (it can be silenced within a `with warnings.catch_warnings()` block if needed).

 josef-pkt (Member): I'm not sure what the best pattern for cases like this is. I don't know how good the extrapolation is; it might have quite a large error in some ranges. I have something similar for tables of p-values without extrapolation. The options:

 - mention only in the docstring the range of extrapolation (it's just lower precision than interpolation)
 - keep the warning as here
 - truncate (without extrapolation some packages, and some of my functions, just return the boundary value 0.25 or 0.01; for a text return it would be `'<0.01'` or `'>0.25'`)

 For most use cases the exact p-value outside [0.01, 0.25] doesn't really matter, and just mentioning it in the docstring would be enough. But I guess there would be multiple-testing applications where smaller p-values are relevant, and users need to be aware that those are not very precise.

 joergdietrich (Contributor): I don't think the quality of the interpolation is known. Scholz & Stephens vary the polynomial order depending on the number of samples and provide no guidance for what a general procedure should use. The test cases are taken from Scholz and Stephens and happen to be cases where the null hypothesis can be rejected at better than the 1% level. Given the unknown level of accuracy, I'd prefer to keep the warning, unless there's a strong preference to move it to the docstring.

 rgommers (Member): OK, that's fine with me then.

 josef-pkt (Member): The warning is fine with me too. I don't have a strong opinion, given I don't know how good the extrapolation is.
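The interpolation step under discussion can be sketched as a standalone helper (a hypothetical function, not the PR's exact code): fit a quadratic in the standardized statistic to log(p) at the five tabulated significance levels, then evaluate it at the observed statistic. Values of `Tk` outside the tabulated range are extrapolated, which is what triggers the warning.

```python
import numpy as np

def interpolate_pvalue(Tk, m):
    # Critical-value coefficients from Scholz & Stephens (1987) for the
    # significance levels 25%, 10%, 5%, 2.5%, 1%; m = k - 1.
    b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
    b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
    b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
    tm = b0 + b1 / np.sqrt(m) + b2 / m

    # Quadratic fit of log(p) against the five critical values;
    # evaluating outside [tm.min(), tm.max()] is extrapolation.
    pf = np.polyfit(tm, np.log([0.25, 0.1, 0.05, 0.025, 0.01]), 2)
    return np.exp(np.polyval(pf, Tk))
```

The fitted curve is monotonically decreasing across the tabulated range, so larger statistics map to smaller approximate p-values.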
and 2 others commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
```
+        array of sample arrays
+    Z : array_like
+        sorted array of all observations
+    Zstar : array_like
+        sorted array of unique observations
+    k : int
+        number of samples
+    n : array_like
+        number of observations in each sample
+    N : int
+        total number of observations
+
+    Returns
+    -------
+    A2aKN : float
+        The A2aKN statistics of Scholz & Stephens
```
 rgommers (Member): The `&` should be escaped, or use a raw docstring. Results in Sphinx warnings (or errors, can't remember) otherwise.

 josef-pkt (Member): Spell it out: `and`, plus the year.

 argriffing (Contributor): This is another case where wider-than-function scope references in sphinx-formatted docstrings would be helpful. Then you could just add a ref link to https://github.com/joergdietrich/scipy/blob/k-sample-AD/scipy/stats/morestats.py#L1267
scipy/stats/morestats.py
```
+
+
+def anderson_ksamp(samples, discrete=False):
+    """The Anderson-Darling test for k-samples.
+
+    The k-sample Anderson-Darling test is a modification of the
+    one-sample Anderson-Darling test. It tests the null hypothesis
+    that k-samples are drawn from the same population without having
+    to specify the distribution function of that population. The
+    critical values depend on the number of samples.
+
+    Parameters
+    ----------
+    samples : array_like
+        array of sample data in arrays
+
```
 rgommers (Member): No blank line needed.
scipy/stats/morestats.py
```
+        distinct observations are in the samples.
+
+    See Also
+    --------
+    ks_2samp : 2 sample Kolmogorov-Smirnov test
+    anderson : 1 sample Anderson-Darling test
+
+    Notes
+    -----
+    [1]_ Define three versions of the k-sample Anderson-Darling test:
+    one for continous distributions and two for discrete
+    distributions, in which ties between samples may occur. The latter
+    variant of the test is also applicable to continuous data. By
+    default, this routine computes the test for continuous and
+    discrete data. If discrete is set to True, the test for discrete
+    data is computed. According to [1]_, the two test statistics
```
 rgommers (Member): Insert "discrete" in "two test".
scipy/stats/morestats.py
```
+           Tests, Journal of the American Statistical Association, Vol. 82,
+           pp. 918-924.
+
+    Examples:
+    ---------
+    >>> from scipy import stats
+    >>> np.random.seed(314159)
+
+    The null hypothesis that the two random samples come from the same
+    distribution can be rejected at the 5% level because the returned
+    test value is greater than the critical value for 5% (1.961) but
+    not at the 2.5% level. The interpolation gives an approximate
+    significance level of 3.1%:
+
+    >>> stats.anderson_ksamp(np.random.normal(size=50), \
+    np.random.normal(loc=0.5, size=30))
```
 rgommers (Member): Use `...` for the line continuation instead of `\`.
commented on the diff Feb 1, 2014
scipy/stats/morestats.py
```
+
+    The null hypothesis cannot be rejected for three samples from an
+    identical distribution. The approximate p-value (87%) has to be
+    computed by extrapolation and may not be very accurate:
+
+    >>> stats.anderson_ksamp(np.random.normal(size=50), \
+    np.random.normal(size=30), np.random.normal(size=20))
+    (-0.72478622084152444,
+     array([ 0.44925884, 1.3052767, 1.9434184, 2.57696569, 3.41634856]),
+     0.8732440333177699)
+
+    """
+
+    k = len(samples)
+    if (k < 2):
+        raise ValueError("anderson_ksamp needs at least two samples")
```
 rgommers (Member): Blank lines below this and the next couple of `if` statements.

 rgommers (Member): Only 2 samples? IIRC other tests need 5 to continue with a warning and 20 for no warning. 2 certainly isn't enough for useful results.

 rgommers (Member): Never mind, figured this one out. The wording is a bit confusing; I propose "two sets of samples". And then there should be a check for the number of values per set of samples.

 josef-pkt (Member): `samples` is used and defined in the first line of the docstring: 2-samp, k-samp, tests for k samples. I think it should be clear that we mean 2 samples and not 2 observations per sample.
scipy/stats/morestats.py
```
+    (-0.72478622084152444,
+     array([ 0.44925884, 1.3052767, 1.9434184, 2.57696569, 3.41634856]),
+     0.8732440333177699)
+
+    """
+
+    k = len(samples)
+    if (k < 2):
+        raise ValueError("anderson_ksamp needs at least two samples")
+    samples = list(map(np.asarray, samples))
+    Z = np.hstack(samples)
+    N = Z.size
+    Z.sort()
+    Zstar = np.unique(Z)
+    L = Zstar.size
+    if not L > 1:
```
 rgommers (Member): `L` is only used here and `not >` is `<`, so I'd rewrite the above two lines as `if Zstar.size < 2:`.
scipy/stats/morestats.py
```
+
+    >>> stats.anderson_ksamp(np.random.normal(size=50), \
+    np.random.normal(size=30), np.random.normal(size=20))
+    (-0.72478622084152444,
+     array([ 0.44925884, 1.3052767, 1.9434184, 2.57696569, 3.41634856]),
+     0.8732440333177699)
+
+    """
+
+    k = len(samples)
+    if (k < 2):
+        raise ValueError("anderson_ksamp needs at least two samples")
+    samples = list(map(np.asarray, samples))
+    Z = np.hstack(samples)
+    N = Z.size
+    Z.sort()
```
 rgommers (Member): I'd combine this with the line above: `Z = np.sort(np.hstack(samples))`.
scipy/stats/morestats.py
```
+    H = (1. / n).sum()
+    g = 0
+    for l in arange(1, N-1):
+        inner = np.array([1. / ((N - l) * m) for m in arange(l+1, N)])
+        g += inner.sum()
+    a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
+    b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
+    c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
+    d = (2*h + 6)*k**2 - 4*h*k
+    sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
+    m = k - 1
+    Tk = (A2kN - m) / math.sqrt(sigmasq)
+
+    b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
+    b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
+    b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
```
 rgommers (Member): This deserves a comment above the line `b0 =`, otherwise it's unclear what the magic numbers mean (and yes, I did figure it out from the docstring after some frowning).
scipy/stats/morestats.py
```
+        provided samples can be rejected
+
+    Raises
+    ------
+    ValueError
+        If less than 2 samples are provided, a sample is empty, or no
+        distinct observations are in the samples.
+
+    See Also
+    --------
+    ks_2samp : 2 sample Kolmogorov-Smirnov test
+    anderson : 1 sample Anderson-Darling test
+
+    Notes
+    -----
+    [1]_ Define three versions of the k-sample Anderson-Darling test:
```
 rgommers (Member): "Define" --> "defines".
scipy/stats/morestats.py
```
+    return A2kN
+
+
+def anderson_ksamp(samples, discrete=False):
+    """The Anderson-Darling test for k-samples.
+
+    The k-sample Anderson-Darling test is a modification of the
+    one-sample Anderson-Darling test. It tests the null hypothesis
+    that k-samples are drawn from the same population without having
+    to specify the distribution function of that population. The
+    critical values depend on the number of samples.
+
+    Parameters
+    ----------
+    samples : array_like
+        array of sample data in arrays
```
 rgommers (Member): Style nit: can you start each description with a capital letter and end it with `.`?
scipy/stats/morestats.py
```
+        A2kN += inner.sum() / n[i]
+    return A2kN
+
+
+def anderson_ksamp(samples, discrete=False):
+    """The Anderson-Darling test for k-samples.
+
+    The k-sample Anderson-Darling test is a modification of the
+    one-sample Anderson-Darling test. It tests the null hypothesis
+    that k-samples are drawn from the same population without having
+    to specify the distribution function of that population. The
+    critical values depend on the number of samples.
+
+    Parameters
+    ----------
+    samples : array_like
```
 rgommers (Member): This is actually `sequence of 1-D array_like`, right? From the description here you'd think a single 2-D array is needed.
Member
commented Feb 1, 2014
 Did you compare this against `anderson`? If you draw one set of samples from a distribution and specify the same distribution for `anderson`, and then test those against a single other set of samples, then after enough runs the test statistic and p-values should be the same within some tolerance (at least that's what I expect).
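One quick way to run the comparison suggested here (a sanity-check sketch under the stated assumption that a scipy with the merged `anderson_ksamp` is available, not a rigorous equivalence test): test a sample against a large reference draw from the hypothesized distribution and compare the conclusion with the one-sample `anderson` test.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.normal(size=100)       # sample under test
ref = rng.normal(size=1000)    # large draw from the hypothesized distribution

# One-sample Anderson-Darling test against the normal family.
a1, crit1, sig1 = stats.anderson(x, dist='norm')

# Two-sample test of x against the large reference draw; for a big
# enough reference this should usually lead to the same conclusion.
a2, crit2, p2 = stats.anderson_ksamp([x, ref])
```

Note that `anderson` estimates the distribution's parameters from the data while the k-sample test is fully nonparametric, so the statistics are only expected to agree in conclusion, not in value.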
Member
commented Feb 1, 2014
 I don't have the reference and Josef understands this much better than I do anyway, so I'll let him judge the correctness of the stats.
scipy/stats/morestats.py
```
+    k : int
+        number of samples
+    n : array_like
+        number of observations in each sample
+    N : int
+        total number of observations
+
+    Returns
+    -------
+    A2aKN : float
+        The A2aKN statistics of Scholz & Stephens
+    """
+
+    A2akN = 0.
+    lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
+    Bj = Z.searchsorted(Zstar) + lj / 2.
```
 josef-pkt (Member): Same as searchsorted 'left' (the default); save and reuse it.
scipy/stats/morestats.py
```
+    N : int
+        total number of observations
+
+    Returns
+    -------
+    A2aKN : float
+        The A2aKN statistics of Scholz & Stephens
+    """
+
+    A2akN = 0.
+    lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
+    Bj = Z.searchsorted(Zstar) + lj / 2.
+    for i in arange(0, k):
+        s = np.sort(samples[i])
+        Mij = s.searchsorted(Zstar, side='right').astype(np.float)
+        fij = s.searchsorted(Zstar, 'right') - s.searchsorted(Zstar, 'left')
```
 josef-pkt (Member): Reuse the searchsorted 'right' result.
scipy/stats/morestats.py
```
@@ -1131,6 +1131,217 @@ def rootfunc(ab,xj,N):
     return A2, critical, sig
 
+def _anderson_ksamp_both(samples, Z, Zstar, k, n, N):
```
 josef-pkt (Member): Isn't this discrete? It uses the midrank.
commented on the diff Feb 1, 2014
scipy/stats/morestats.py
```
+    total number of observations
+
+    Returns
+    -------
+    A2KN : float
+        The A2KN statistics of Scholz & Stephens
+    """
+
+    A2kN = 0.
+    lj = Z.searchsorted(Zstar[:-1], 'right') - Z.searchsorted(Zstar[:-1],
+                                                              'left')
+    Bj = lj.cumsum()
+    for i in arange(0, k):
+        s = np.sort(samples[i])
+        Mij = s.searchsorted(Zstar[:-1], side='right')
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
```
 josef-pkt (Member): For continuous: my impression is that we should replace lj by 1 and use the original sorted series, not the uniques, since with no ties Z == Zstar. And for discrete (without tie handling) we can take the 'right' count. This would give us some speed-up (minus `unique` and minus two `searchsorted` calls).
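The simplification can be sketched as a hypothetical standalone helper following equation (3) as used in the diff above: with no ties, every l_j is 1 and B_j is simply j, so `np.unique` and two of the `searchsorted` calls disappear.

```python
import numpy as np

def a2_kn_continuous(samples):
    # Equation (3) of Scholz & Stephens (1987), valid when all pooled
    # observations are distinct: then l_j == 1 and B_j == j, so the
    # np.unique / multiplicity bookkeeping can be dropped entirely.
    samples = [np.asarray(s, dtype=float) for s in samples]
    Z = np.sort(np.hstack(samples))
    N = Z.size
    Bj = np.arange(1, N)  # B_j == j for distinct observations

    A2kN = 0.0
    for s in samples:
        # M_ij: how many of this sample's values are <= Z[j-1].
        Mij = np.sort(s).searchsorted(Z[:-1], side='right')
        A2kN += ((N * Mij - Bj * s.size) ** 2 / (Bj * (N - Bj))).sum() / s.size
    return A2kN / N
```

For two tiny samples with all-distinct values, e.g. `[1, 3, 5]` and `[2, 4]`, the sum can be checked by hand against the formula in the diff.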
scipy/stats/morestats.py
```
+    A2akN = 0.
+    lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
+    Bj = Z.searchsorted(Zstar) + lj / 2.
+    for i in arange(0, k):
+        s = np.sort(samples[i])
+        Mij = s.searchsorted(Zstar, side='right').astype(np.float)
+        fij = s.searchsorted(Zstar, 'right') - s.searchsorted(Zstar, 'left')
+        Mij -= fij / 2.
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / \
+                (Bj * (N - Bj) - N * lj / 4.)
+        A2akN += inner.sum() / n[i]
+    A2akN *= (N - 1.) / N
+    return A2akN
+
+
+def _anderson_ksamp_discrete(samples, Z, Zstar, k, n, N):
```
 josef-pkt (Member): Should this be `both`? It uses 'right' and not the midrank.
scipy/stats/morestats.py
```
+    -----
+    [1]_ Define three versions of the k-sample Anderson-Darling test:
+    one for continous distributions and two for discrete
+    distributions, in which ties between samples may occur. The latter
+    variant of the test is also applicable to continuous data. By
+    default, this routine computes the test for continuous and
+    discrete data. If discrete is set to True, the test for discrete
+    data is computed. According to [1]_, the two test statistics
+    differ only slightly if a few collisions due to round-off errors
+    occur in the test not adjusted for ties between samples.
+
+    .. versionadded:: 0.14.0
+
+    References
+    ----------
+    .. [1] Scholz, F. W & Stephens, M. A. (1987), K-Sample Anderson-Darling
```
 josef-pkt (Member): Write `and` instead of `&`.
and 1 other commented on an outdated diff Feb 1, 2014
scipy/stats/tests/test_morestats.py
```
+class TestAndersonKSamp(TestCase):
+    def test_example1a(self):
+        # Example data from Scholz & Stephens (1987), originally
+        # published in Lehmann (1995, Nonparametrics, Statistical
+        # Methods Based on Ranks, p. 309)
+        # Pass a mixture of lists and arrays
+        t1 = [38.7, 41.5, 43.8, 44.5, 45.5, 46.0, 47.7, 58.0]
+        t2 = np.array([39.2, 39.3, 39.7, 41.4, 41.8, 42.9, 43.3, 45.8])
+        t3 = np.array([34.0, 35.0, 39.0, 40.0, 43.0, 43.0, 44.0, 45.0])
+        t4 = np.array([34.0, 34.8, 34.8, 35.4, 37.2, 37.8, 41.2, 42.8])
+        Tk, tm, p = assert_warns(UserWarning, stats.anderson_ksamp, (t1, t2,
+                                 t3, t4), discrete=True)
+        assert_almost_equal(Tk, 4.449, 3)
+        assert_array_almost_equal([0.4985, 1.3237, 1.9158, 2.4930, 3.2459],
+                                  tm, 4)
+        assert_almost_equal(p, 0.0021, 4)
```
 josef-pkt (Member): Are the Tk and p-values in the unit tests from Scholz and Stephens, or "regression test" numbers?

 joergdietrich (Contributor): The Tk values are from Scholz and Stephens. The p-values differ by one at the last digit because I used a second-order polynomial instead of a linear one. The choice was motivated by looking at the interpolation of the two test cases.
Member
 I think the algorithm looks good now. The use of searchsorted is pretty much the fastest we can get with numpy.

 I'm confused about `_discrete` versus `_both`: in my reading, `both` uses the midrank and `discrete` uses the 'right'/cdf definition. That's reversed from what I expected, but I didn't read Scholz and Stephens again.

 The continuous/both(?) case could be made faster by skipping the lj (number of elements of a unique) calculation: lj == 1 if all observations are unique.
added some commits Feb 2, 2014
 - joergdietrich: `add anderson_ksamp` (`23e1f77`)
 - joergdietrich: `Replace & with and in citation and add year` (`5612e42`)
 - joergdietrich: `Replace another & with and` (`ea78602`)
 - joergdietrich: `fix typo in docstring` (`f1f3770`)
 - joergdietrich: `add blank lines after if blocks` (`79b9da6`)
 - joergdietrich: `Use ... for line continuation in docstring` (`b7023b6`)
 - joergdietrich: `Add comment to explain interpolation values` (`45a7447`)
 - joergdietrich: `capital letter for each parameter description and period at end` (`445febf`)
 - joergdietrich: `combine a few simple statements` (`ab1216e`)
Contributor
 Scholz and Stephens write about the midrank version: "This formula applies for a continuous population also, then all l_j = 1." So I'll implement that speed-up, but the midrank version indeed seems to be the one for discrete and continuous cases. I'm not quite happy with the naming of the functions and the call signature. Maybe instead of `discrete=False` we should have `midrank=True` and rename the functions to `_midrank` and `_right`, with appropriate modifications to the docstring. Any thoughts?
Member
 I also think your proposal, using `midrank=True`, sounds better. It will also make it easier to explain the difference in the docstring. Thanks, most likely I will copy the function to statsmodels until our minimum supported scipy version includes it.
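With the renamed keyword, the two variants are selected like this (using two of the Scholz and Stephens data sets quoted in the tests above; the call may warn that the p-value is extrapolated):

```python
import numpy as np
from scipy import stats

# Two of the four data sets from Scholz & Stephens (1987).
t1 = [38.7, 41.5, 43.8, 44.5, 45.5, 46.0, 47.7, 58.0]
t2 = [39.2, 39.3, 39.7, 41.4, 41.8, 42.9, 43.3, 45.8]

# Default: the midrank (tie-adjusted) statistic, applicable to
# continuous and discrete populations.
res_mid = stats.anderson_ksamp([t1, t2])

# midrank=False: the right-continuous ECDF variant for discrete data.
res_right = stats.anderson_ksamp([t1, t2], midrank=False)
```

Both calls return the statistic, the critical values, and the approximate significance level; the two statistics generally differ slightly.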
added some commits Feb 2, 2014
 - joergdietrich: `Change call signature of anderson_ksamp` (`bb3ec01`): Replace discrete=False with midrank=True to better explain what the difference between tests is. Update the docstrings accordingly.
 - joergdietrich: `Add missing word to docstring` (`9b0cb39`)
 - joergdietrich: `avoid computation of lj for continuous distributions; fix docstring for parameters` (`908e281`)
 - joergdietrich: `actually re-use saved searchsorted array instead of just saving it and then recomputing it ...` (`6688108`)
scipy/stats/morestats.py
```
+    """The Anderson-Darling test for k-samples.
+
+    The k-sample Anderson-Darling test is a modification of the
+    one-sample Anderson-Darling test. It tests the null hypothesis
+    that k-samples are drawn from the same population without having
+    to specify the distribution function of that population. The
+    critical values depend on the number of samples.
+
+    Parameters
+    ----------
+    samples : sequence of 1-D array_like
+        Array of sample data in arrays.
+    midrank : bool, optional
+        Type of Anderson-Darling test which is computed. Default is
+        the midrank test applicable to continuous and discrete
+        populations.
```
 rgommers (Member): `Default (True)`. And `If False, the type is ....`. Just one sentence and a reference to a more elaborate explanation in the Notes section is OK, but there should be something here.
Member
commented Feb 3, 2014
 Updates so far look good; most of my comments are addressed. We do need to squash some commits and write the commit messages in a more standard form before merging.
 joergdietrich `More verbose explanation of midrank parameter` `24931cb`
Contributor
 I need some guidance for the "more standard form" of commit messages.
 Coverage remained the same when pulling 24931cb on joergdietrich:k-sample-AD into fd99d3f on scipy:master.
Member
commented Feb 7, 2014
 Example:

 ```
 More verbose explanation of midrank parameter
 ```

 should be something like:

 ```
 DOC: more verbose explanation of midrank parameter in stats.anderson_ksamp
 ```
added some commits Jan 3, 2014
 - joergdietrich: `ENH: Add k-sample Anderson-Darling test to stats module` (`8a92e25`)
 - joergdietrich: `API: Speed up and implement both version for discrete distributions for k-sample Anderson-Darling test` (`d9d46af`): 1. Change call signature to have array of arrays and optional keyword to specify which version of the k-sample AD test should be computed. 2. Get rid of all inner loops and list comprehensions by using np.searchsorted. 3. Both version given by Scholz & Stephens for discrete sample can be computed now.
 - joergdietrich: `STY: Replace "&" with "and" in citation and add year in stats.anderson_ksamp and stats.__anderson_ksamp_both` (`b2c5ef9`)
 - joergdietrich: `MAINT: fix typo in docstring stats.anderson_ksamp` (`371dcc7`)
 - joergdietrich: `STY: add blank lines after if blocks in k-sample Anderson Darling routines` (`b2682cd`)
 - joergdietrich: `STY: Use ... for line continuation in docstring in stats.anderson_ksamp` (`b138007`)
 - joergdietrich: `DOC: Add comment to explain interpolation values stats.anderson_ksamp` (`a12c26f`)
 - joergdietrich: `STY: capital letter for each parameter description and period at end in k-sample Anderson-Darling docstrings` (`f90390d`)
 - joergdietrich: `MAINT: combine a few simple statements in stats.anderson_ksamp` (`a2a077b`)
 - joergdietrich: `API: Change call signature of stats.anderson_ksamp` (`1626e0f`): Replace discrete=False with midrank=True to better explain what the difference between tests is. Update the docstrings accordingly.
 - joergdietrich: `MAINT: Add missing word to docstring of stats.anderson_ksamp` (`aad0972`)
 - joergdietrich: `MAINT: avoid computation of lj for continuous distributions in stats._anderson_ksamp_midrank; fix docstring for parameters` (`11b5a01`)
 - joergdietrich: `DOC: More verbose explanation of midrank parameter in stats.anderson_ksamp` (`7e8e03c`)
 - joergdietrich: `Merge branch 'k-sample-AD' of github.com:joergdietrich/scipy into k-sample-AD` (`6bce442`)
Contributor
 I'm not particularly comfortable with rebasing. I hope I got this right and the commit messages are okay now. If anything else needs fixing it'll probably have to wait ~2 weeks.
added the PR label Feb 19, 2014
 joergdietrich `DOC: Adapt calls in example to changed signature from 1626e0f` `6231a8d`
Contributor
 Anything missing to get this merged into 0.14.x?
added this to the 0.14.0 milestone Feb 24, 2014
Member
 Ralf, I think this is fine to merge. I didn't look at the details again, but, last time I went through this, I didn't see anything that would hold this up from the statistics side.
Member
 I don't think anything is missing. It looks like something went wrong with the rebase; I'll try to fix that tonight and merge this. The 0.14.0 milestone is already set for this PR.
added a commit that referenced this pull request Feb 24, 2014
 rgommers: `Merge branch 'pr/3183' into master. Review at #3183` (`a32a7ba`)
Member
 Merged in a32a7ba. Thanks @joergdietrich, @josef-pkt
closed this Feb 24, 2014