
ENH: Add k-sample Anderson-Darling test to stats module #3183

Closed
wants to merge 31 commits into from

Conversation

joergdietrich
Contributor

This PR adds the k-sample Anderson-Darling test for continuous distributions as described by Scholz & Stephens (1987, Journal of the American Statistical Association, Vol. 82, pp. 918-924) to the stats module.

for i in arange(0, k):
    fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
    Mij = fij.cumsum()
    inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
Member

Is this equation (6) in Scholz and Stephens, and not equation (3)?

Discrete parent population? Not continuous, as the docstring says in lines 1174-1175.

Contributor Author

Yes, it is. The docstring needs to be fixed.
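For reference, a self-contained sketch of the eq. (6) statistic that the excerpt above computes; the wrapper function and pooling setup here are assumptions, while the inner loop follows the diff:

import numpy as np

def a2_kn_eq6(samples):
    # k-sample Anderson-Darling statistic, eq. (6) of Scholz & Stephens
    # (1987), which allows ties in the pooled sample
    samples = [np.asarray(s) for s in samples]
    k = len(samples)
    n = np.array([s.size for s in samples])
    Z = np.sort(np.hstack(samples))   # pooled, sorted sample of size N
    N = Z.size
    Zstar = np.unique(Z)              # distinct pooled values
    # multiplicity of each distinct value except the largest
    lj = (Z.searchsorted(Zstar[:-1], side='right')
          - Z.searchsorted(Zstar[:-1], side='left'))
    Bj = lj.cumsum()                  # pooled counts <= Zstar[j]
    A2kN = 0.
    for i in range(k):
        fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
        Mij = fij.cumsum()            # counts in sample i <= Zstar[j]
        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
        A2kN += inner.sum() / n[i]
    return A2kN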

@josef-pkt
Member

Good, I never tried my hand at the discrete Anderson-Darling tests.
I think we should add _discrete to the name of the function.

I'm asking a question on the mailing list about the signature.

@josef-pkt
Member

If we are planning on two different functions for discrete and continuous, then it might be better to outsource the p-value calculation into a (private) helper function.

@joergdietrich
Contributor Author

I'll change the signature according to Ralf's suggestion. Regarding discrete and continuous distributions, I now have code to compute equations 3, 6, and 7 of Scholz and Stephens. So we could include all of them, or just 6 and 7, where the latter would also apply to continuous distributions.

@josef-pkt
Member

One question that's not clear to me when I read the formulas:
Does eq. (3) give the same answer as eq. (6) even when the values are discrete and there are ties?

The two main questions are: how many versions of the calculation do we need? and what's the best way to implement them?

To the second: I'm pretty sure the continuous version can be made without loops by using np.searchsorted (see the sketch after this comment). For the discrete version it might be possible to use np.searchsorted or the Cython rankdata function that is already in scipy.stats.

I think getting a very fast version is not a requirement for this PR, but it's also possible that a version with no Python loops is not very difficult to get with the existing tools.

What are the options for a user if we have eqs. (3), (6), and (7)?
If (3) and (6) can use the same code, then the only choice would be whether to use the mean rank in case of ties.
If only the continuous eq. (3) can get a fast algorithm, then we should also allow users to choose that.

(I'm a bit distracted because I also have pull requests in statsmodels where I need to catch up with some readings to understand the topics.)
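For instance, a minimal sketch (not the PR's code) of the loop-free multiplicity counting that np.searchsorted enables:

import numpy as np

z = np.sort(np.array([1, 1, 2, 3, 3, 3, 5]))  # pooled, sorted sample
zstar = np.unique(z)                          # distinct values

# multiplicity of each distinct value, with no Python loop
lj = z.searchsorted(zstar, side='right') - z.searchsorted(zstar, side='left')
# lj is array([2, 1, 3, 1])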

@josef-pkt
Member

Related: from what I remember, eq. (5), a weighted sum of chi-square distributions, shows up in some of the discrete GOF tests. I finally wrote the code for getting the p-values from that, but I don't have any truncation rule for the infinite sum.

I don't think it's really relevant in this case because, according to the references that I looked at, the Anderson-Darling statistic converges fast and Stephens' approximation seems to work pretty well. However, I didn't look much at discrete cases.
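(Once the weights are truncated to a finite set, a generic way to approximate such a p-value is plain Monte Carlo. A sketch under that assumption, not Josef's code; the function name is made up:)

import numpy as np

def wsum_chi2_sf(x, weights, df=1, nsim=200000, seed=0):
    # P(sum_i w_i * chi2_df > x), estimated by simulation; the caller is
    # assumed to have already truncated the infinite sum of weights
    rng = np.random.RandomState(seed)
    draws = rng.chisquare(df, size=(nsim, len(weights))).dot(np.asarray(weights))
    return (draws > x).mean()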

1. Change call signature to have array of arrays and optional keyword to
   specify which version of the k-sample AD test should be computed.

2. Get rid of all inner loops and list comprehensions by using
   np.searchsorted.

3. Both versions given by Scholz & Stephens for discrete samples can be
   computed now.
@joergdietrich
Contributor Author

I managed to rewrite everything inside the outer loop, and the determination of the multiplicity, using np.searchsorted. Thanks for this pointer, I wasn't aware of the power of this function. Unfortunately, the speed-up is less than one percent per outer loop for a large range of sample sizes, so the list comprehension was already doing pretty well. This change mostly benefits large numbers of samples, which may not be very common.

Eqs. (3), (6), and (7) always disagree.

@argriffing
Contributor

The test failure is because a few days ago TravisCI decided to start using less precision when it calculates small cosine integrals, for some reason.

@joergdietrich
Contributor Author

Any further comments on this?

one for continuous distributions and two for discrete
distributions, in which ties between samples may occur. The latter
variant of the test is also applicable to continuous data. By
default, this routine computes the test for continuous and
Member

This statement looks incorrect (only one p-value is returned) and would also be strange. The default is continuous only, right?

@rgommers
Member

rgommers commented Feb 1, 2014

The function needs to be added in stats/__init__.py in order for it to show up in the documentation.

tm = b0 + b1 / math.sqrt(m) + b2 / m
pf = np.polyfit(tm, log(np.array([0.25, 0.1, 0.05, 0.025, 0.01])), 2)
if Tk < tm.min() or Tk > tm.max():
    warnings.warn("approximate p-value will be computed by extrapolation")
Member

Is this warning needed? It shows up in most of the test cases, so I'm guessing it's not that uncommon (didn't check). If so, adding a note in the docstring might make more sense. If the warning has to be kept, it shouldn't show up in the test output (it can be silenced with a warnings.catch_warnings() block if needed).

Member

I'm not sure what the best pattern for cases like this is. I don't know how good the extrapolation is; it might have quite a large error in some ranges.
I have something similar for tables of p-values without extrapolation:

  • mention the range of extrapolation only in the docstring (it's just lower precision than interpolation)
  • keep the warning, as here
  • truncate (without extrapolation, some packages, and some of my functions, just return the boundary value 0.25 or 0.01; for a text return it would be '<0.01' or '>0.25'); see the sketch after this list

For most use cases the exact p-value outside [0.01, 0.25] doesn't really matter, and just mentioning it in the docstring would be enough. But I guess there are multiple-testing applications where smaller p-values are relevant, and users need to be aware that those are not very precise.
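The truncation option could look roughly like this (a sketch against the variable names of the excerpt above; the function name is made up):

import math
import numpy as np

def truncated_pvalue(Tk, tm, pf):
    # clamp the statistic to the tabulated range instead of warning,
    # so the returned p-value stays within [0.01, 0.25]
    return math.exp(np.polyval(pf, np.clip(Tk, tm.min(), tm.max())))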

Contributor Author

I don't think the quality of the interpolation is known. Scholz & Stephens vary the polynomial order depending on the number of samples and provide no guidance for what a general procedure should use. The test cases are taken from Scholz and Stephens and happen to be cases where the null hypothesis can be rejected at better than the 1% level. Given the unknown level of accuracy I'd prefer to keep the warning, unless there's a strong preference to move it to the docstring.

Member

OK, that's fine with me then.

Member

The warning is fine with me too.
I don't have a strong opinion, given that I don't know how good the extrapolation is.
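For context, a self-contained sketch of the p-value scheme in the excerpt above: a quadratic fit of the log significance levels against the critical values tm, evaluated at the observed statistic (the function name is an assumption):

import math
import numpy as np

def _approx_pvalue(Tk, tm):
    # fit log(p) as a quadratic in the critical values tm, then evaluate
    # at the observed statistic Tk; values of Tk outside
    # [tm.min(), tm.max()] are extrapolated, hence the warning above
    levels = np.array([0.25, 0.1, 0.05, 0.025, 0.01])
    pf = np.polyfit(tm, np.log(levels), 2)
    return math.exp(np.polyval(pf, Tk))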

@rgommers
Member

rgommers commented Feb 1, 2014

Did you compare this against anderson? If you draw one set of samples from a distribution and specify the same distribution for anderson, and then test those against a single other set of samples, then after enough runs the test statistic and p-values should be the same within some tolerance (at least that's what I expect).
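A sketch of that comparison setup, using the names this PR ends up adding (both calls unpack to statistic, critical values, and significance; the actual tolerance check is left out):

import numpy as np
from scipy import stats

rng = np.random.RandomState(12345)
x = rng.normal(size=500)        # sample drawn from the null distribution
y = rng.normal(size=100000)     # large second sample from the same distribution

# one-sample test of x against the normal distribution itself
A2_one, crit_one, sig_one = stats.anderson(x, dist='norm')

# k-sample test of x against the large reference sample
A2_k, crit_k, p_k = stats.anderson_ksamp([x, y])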

@rgommers
Member

rgommers commented Feb 1, 2014

I don't have the reference and Josef understands this much better than I do anyway, so I'll let him judge the correctness of the stats.


A2akN = 0.
lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
Bj = Z.searchsorted(Zstar) + lj / 2.
Member

Z.searchsorted(Zstar) is the same as searchsorted with side='left' (the default); save the result and reuse it.
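That is, something like (a sketch of the suggested reuse, with Z and Zstar as in the excerpt above):

left = Z.searchsorted(Zstar)                     # side='left' is the default
lj = Z.searchsorted(Zstar, side='right') - left
Bj = left + lj / 2.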

@coveralls

Coverage Status

Coverage remained the same when pulling 24931cb on joergdietrich:k-sample-AD into fd99d3f on scipy:master.

@rgommers
Member

rgommers commented Feb 7, 2014

Example:

More verbose explanation of midrank parameter

Should be something like:

DOC: more verbose explanation of midrank parameter in stats.anderson_ksamp

…or k-sample Anderson-Darling test

1. Change call signature to have array of arrays and optional keyword to
   specify which version of the k-sample AD test should be computed.

2. Get rid of all inner loops and list comprehensions by using
   np.searchsorted.

3. Both versions given by Scholz & Stephens for discrete samples can be
   computed now.
- Replace discrete=False with midrank=True to better explain what the
  difference between tests is. Update the docstrings accordingly.
…_anderson_ksamp_midrank; fix docstring for parameters
@joergdietrich
Contributor Author

I'm not particularly comfortable with rebasing. I hope I got this right and the commit messages are okay now. If anything else needs fixing it'll probably have to wait ~2 weeks.

@pv added the PR label Feb 19, 2014
@joergdietrich
Contributor Author

Anything missing to get this merged into 0.14.x?

@josef-pkt added this to the 0.14.0 milestone Feb 24, 2014
@josef-pkt
Member

Ralf, I think this is fine to merge. I didn't look at the details again, but, last time I went through this, I didn't see anything that would hold this up from the statistics side.

@rgommers
Member

I don't think anything is missing. Looks like something went wrong with the rebase, I'll try to fix that tonight and merge this. 0.14.0 milestone is already set for this PR.

rgommers added a commit that referenced this pull request Feb 24, 2014
@rgommers
Member

Merged in a32a7ba. Thanks @joergdietrich, @josef-pkt
