ENH: Add k-sample Anderson-Darling test to stats module #3183

Closed
wants to merge 31 commits


6 participants

@joergdietrich
Contributor

This PR adds the k-sample Anderson-Darling test for continuous distributions as described by Scholz & Stephens (1987, Journal of the American Statistical Association, Vol. 82, pp. 918-924) to the stats module.
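For orientation, a minimal usage sketch against the function as it eventually landed in scipy.stats (the result attributes `statistic` and `significance_level` reflect the merged API; the sample sizes and seed below are arbitrary):

```python
import warnings
import numpy as np
from scipy import stats

rng = np.random.RandomState(314159)
x = rng.normal(size=50)           # first sample
y = rng.normal(loc=0.5, size=30)  # second sample, shifted mean

with warnings.catch_warnings():
    # recent scipy warns when the p-value lies outside the tabulated range
    warnings.simplefilter("ignore", UserWarning)
    res = stats.anderson_ksamp([x, y])

print(res.statistic)           # normalized k-sample AD statistic
print(res.significance_level)  # approximate p-value
```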

@josef-pkt josef-pkt and 1 other commented on an outdated diff Jan 3, 2014
scipy/stats/morestats.py
+    Zstar = np.unique(Z)
+    L = Zstar.size
+    if not L > 1:
+        raise ValueError("anderson_ksamp needs more than one distinct "
+                         "observation")
+    n = np.array([sample.size for sample in samples])
+    if any(n == 0):
+        raise ValueError("anderson_ksamp encountered sample without "
+                         "observations")
+    A2kN = 0.
+    lj = np.array([(Z == zj).sum() for zj in Zstar[:-1]])
+    Bj = lj.cumsum()
+    for i in arange(0, k):
+        fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
+        Mij = fij.cumsum()
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
@josef-pkt
josef-pkt Jan 3, 2014 Member

Is this equation (6) in Scholz and Stephens, and not equation (3) ?

discrete parent population ? not continuous as in the docstring, lines 1174, 1175

@joergdietrich
joergdietrich Jan 3, 2014 Contributor

Yes, it is. The docstring needs to be fixed.

@josef-pkt josef-pkt commented on the diff Jan 3, 2014
scipy/stats/morestats.py
+        fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
+        Mij = fij.cumsum()
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
+        A2kN += inner.sum() / n[i]
+
+    h = (1. / arange(1, N)).sum()
+    H = (1. / n).sum()
+    g = 0
+    for l in arange(1, N-1):
+        inner = np.array([1. / ((N - l) * m) for m in arange(l+1, N)])
+        g += inner.sum()
+    a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
+    b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
+    c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
+    d = (2*h + 6)*k**2 - 4*h*k
+    sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
@josef-pkt
josef-pkt Jan 3, 2014 Member

this is based on equation 4?

@josef-pkt
Member

Good, I never tried my hand at the discrete Anderson-Darling tests.
I think we should add _discrete to the name of the function.

I'm asking a question on the mailing list about the signature.

@josef-pkt
Member

If we are planning on two different functions for discrete and continuous, then it might be better to outsource the pvalue calculation into a (private) helper function.

@joergdietrich
Contributor

I'll change the signature according to Ralf's suggestion. Regarding discrete and continuous distributions, I now have code to compute equations 3, 6, and 7 of Scholz and Stephens. So we could include all of them, or just 6 and 7, where the latter would also apply to continuous distributions.

@josef-pkt
Member

One question that's not clear to me when I read the formulas:
Does equation 3 give the same answer as equation 6 even when the values are discrete and there are ties?

The two main questions are: how many versions of the calculation do we need? and what's the best way to implement them?

To the second: I'm pretty sure the continuous version can be made without loops by using np.searchsorted. For the discrete version it might be possible to use np.searchsorted or a cython function for rankdata that is already in scipy.stats.

I think getting a very fast version is not a requirement for this PR, but it's also possible that it's not very difficult to get a version without Python loops using the existing tools.

What are the options for a user if we have 3, 6, and 7?
If 3 and 6 can use the same code, then it would only be whether to use mean rank in case of ties.
If only continuous equ 3 can get a fast algorithm, then we should also allow users to choose that.

(I'm a bit distracted because I also have pull requests in statsmodels where I need to catch up with some reading to understand the topics.)
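To make the np.searchsorted idea concrete, here is a small sketch (the sample values are made up for illustration) showing that the multiplicities and per-sample cumulative counts from the loop-based diff can be obtained without Python-level loops:

```python
import numpy as np

# Toy samples, chosen only to contain ties (not from the paper).
samples = [np.array([1., 2., 2., 4.]), np.array([2., 3., 5.])]
Z = np.sort(np.hstack(samples))   # pooled, sorted observations
Zstar = np.unique(Z)              # distinct values z_1 < ... < z_L

# Multiplicity l_j of each distinct value, without a loop:
lj = Z.searchsorted(Zstar, side='right') - Z.searchsorted(Zstar, side='left')
assert np.array_equal(lj, [(Z == zj).sum() for zj in Zstar])

# M_ij: number of observations in sample i that are <= z_j:
for s in samples:
    s_sorted = np.sort(s)
    Mij = s_sorted.searchsorted(Zstar, side='right')
    assert np.array_equal(Mij, [(s <= zj).sum() for zj in Zstar])
```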

@josef-pkt
Member

Related: From what I remember, equation 5, a weighted sum of chisquare distributions, shows up in some of the discrete gof tests. I finally wrote the code for getting the p-values from that, but I don't have any truncation rule for the infinite sum.

I don't think it's really relevant in this case because, according to the references that I looked at, the Anderson-Darling statistic converges fast and Stephens' approximation seems to work pretty well. However, I didn't look much at discrete cases.

@joergdietrich joergdietrich Speed up and implement both versions for discrete distributions
1. Change call signature to have array of arrays and optional keyword to
   specify which version of the k-sample AD test should be computed.

2. Get rid of all inner loops and list comprehensions by using
   np.searchsorted.

3. Both versions given by Scholz & Stephens for discrete samples can be
   computed now.
3f421bc
@joergdietrich
Contributor

I managed to rewrite everything inside the outer loop and the determination of the multiplicity using np.searchsorted. Thanks for this pointer; I wasn't aware of the power of this function. Unfortunately, it's only a speed-up of less than a percent per outer loop for a large range of sample sizes, so the list comprehension was doing pretty well already. This change mostly benefits large numbers of samples, which may not be very common.

Eqs. 3, 6, 7 always disagree.

@argriffing
Contributor

The test failure is because a few days ago TravisCI decided to start using less precision when it calculates small cosine integrals, for some reason.

@joergdietrich
Contributor

Any further comments on this?

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+ ValueError
+ If less than 2 samples are provided, a sample is empty, or no
+ distinct observations are in the samples.
+
+ See Also
+ --------
+ ks_2samp : 2 sample Kolmogorov-Smirnov test
+ anderson : 1 sample Anderson-Darling test
+
+ Notes
+ -----
+ [1]_ Define three versions of the k-sample Anderson-Darling test:
+ one for continous distributions and two for discrete
+ distributions, in which ties between samples may occur. The latter
+ variant of the test is also applicable to continuous data. By
+ default, this routine computes the test for continuous and
@rgommers
rgommers Feb 1, 2014 Member

This statement looks incorrect (only one p-value is returned) and would also be strange. The default is continuous only, right?

@rgommers rgommers and 1 other commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+
+ Parameters
+ ----------
+ samples : array_like
+ array of sample data in arrays
+
+ discrete : bool, optional
+ type of Anderson-Darling test which is computed. Default is a test
+ applicable to discrete and continous distributions.
+
+ Returns
+ -------
+ Tk : float
+ Normalized k-sample Anderson-Darling test statistic, not adjusted for
+ ties
+ tm : array
@rgommers
rgommers Feb 1, 2014 Member

Could you use some more descriptive variable names? p is so widely used that it may be OK, but Tk and tm are not. Same comment for the internal variables, they're almost all one-letter which isn't readable.

@rgommers
rgommers Feb 1, 2014 Member

Actually, why not name the return values consistent with anderson?

@joergdietrich
joergdietrich Feb 3, 2014 Contributor

I changed the return values to be as consistent with anderson as possible. I prefer to keep the internal variable names unchanged. The implementation of methods from the literature can only be understood together with the paper describing the method. In such cases I find it much more helpful to have the variable names in the code match the variable names in the paper, rather than coming up with descriptive names that anybody trying to match the code to the paper then has to translate back.

@rgommers
rgommers Feb 3, 2014 Member

OK, the paper is available online for free, so for internal variables that should be fine.

@rgommers
Member
rgommers commented Feb 1, 2014

The function needs to be added in stats/__init__.py in order for it to show up in the documentation.

@rgommers rgommers and 2 others commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+        g += inner.sum()
+    a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
+    b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
+    c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
+    d = (2*h + 6)*k**2 - 4*h*k
+    sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
+    m = k - 1
+    Tk = (A2kN - m) / math.sqrt(sigmasq)
+
+    b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
+    b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
+    b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
+    tm = b0 + b1 / math.sqrt(m) + b2 / m
+    pf = np.polyfit(tm, log(np.array([0.25, 0.1, 0.05, 0.025, 0.01])), 2)
+    if Tk < tm.min() or Tk > tm.max():
+        warnings.warn("approximate p-value will be computed by extrapolation")
@rgommers
rgommers Feb 1, 2014 Member

Is this warning needed? It shows up in most of the test cases, so I'm guessing it's not that uncommon (didn't check). If so, adding a note in the docstring might make more sense. If the warning has to be kept, it shouldn't show up in the test output (can be silenced within a with warnings.catch_warnings() block if needed).
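The silencing pattern mentioned here can be sketched as follows; `run_test` is a hypothetical stand-in for a call that triggers the warning:

```python
import warnings

def run_test():
    # stand-in for a call whose statistic falls outside the
    # tabulated critical values, triggering the warning
    warnings.warn("approximate p-value will be computed by extrapolation")
    return 0.87

# keep the expected warning out of the test output
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    p = run_test()
```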

@josef-pkt
josef-pkt Feb 1, 2014 Member

I'm not sure what the best pattern for cases like this is. I don't know how good the extrapolation is, it might have quite a large error in some ranges.
I have something similar for tables of p-values without extrapolation:

  • mention only in docstring about the range of extrapolation (It's just lower precision than interpolation.)
  • keep warning as here
  • truncate (without extrapolation some packages, and some of my functions, just return the boundary value 0.25 or 0.01, for text return it would be '<0.01' or '>0.25')

For most use cases the exact p-value outside [0.01, 0.25] doesn't really matter and just mentioning in docstring would be enough. But I guess there would be multiple testing applications, where smaller p-values are relevant and users need to be aware that those are not very precise.

@joergdietrich
joergdietrich Feb 2, 2014 Contributor

I don't think the quality of the interpolation is known. Scholz & Stephens vary the polynomial order depending on the number of samples and provide no guidance for what a general procedure should use. The test cases are taken from Scholz and Stephens and happen to be cases where the null hypothesis can be rejected at better than the 1% level. Given the unknown level of accuracy I'd prefer to keep the warning, unless there's a strong preference to move it to the docstring.

@rgommers
rgommers Feb 2, 2014 Member

OK, that's fine with me then.

@josef-pkt
josef-pkt Feb 2, 2014 Member

warning is fine with me too
I don't have a strong opinion given I don't know how good the extrapolation is.
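For concreteness, the interpolation under discussion can be reproduced from the coefficients in the quoted diff. This sketch assumes the p-value is recovered with exp after the quadratic fit in log space, and uses the four-sample example (m = k - 1 = 3, Tk = 4.449) from the unit tests:

```python
import math
import numpy as np

m = 3        # k - 1 for the four-sample example
Tk = 4.449   # statistic from the quoted unit test

# Critical-value coefficients for significance levels
# 25%, 10%, 5%, 2.5%, 1% (Scholz & Stephens 1987, as in the diff):
b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
tm = b0 + b1 / math.sqrt(m) + b2 / m   # critical values for this m

# Quadratic fit of log(p) against the critical values:
sig = np.array([0.25, 0.1, 0.05, 0.025, 0.01])
pf = np.polyfit(tm, np.log(sig), 2)

# Tk > tm.max(), so this p-value is extrapolated:
p = math.exp(np.polyval(pf, Tk))
print(round(p, 4))   # 0.0021, matching the quoted test value
```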

@rgommers rgommers and 2 others commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+ array of sample arrays
+ Z : array_like
+ sorted array of all observations
+ Zstar : array_like
+ sorted array of unique observations
+ k : int
+ number of samples
+ n : array_like
+ number of observations in each sample
+ N : int
+ total number of observations
+
+ Returns
+ -------
+ A2aKN : float
+ The A2aKN statistics of Scholz & Stephens
@rgommers
rgommers Feb 1, 2014 Member

The & should be escaped, or use a raw docstring. Results in Sphinx warnings (or errors, can't remember) otherwise.

@josef-pkt
josef-pkt Feb 1, 2014 Member

spell it out and add the year

@argriffing
argriffing Feb 1, 2014 Contributor

This is another case where wider-than-function scope references in sphinx-formatted docstrings would be helpful. Then you could just add a ref link to https://github.com/joergdietrich/scipy/blob/k-sample-AD/scipy/stats/morestats.py#L1267

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+
+
+def anderson_ksamp(samples, discrete=False):
+ """The Anderson-Darling test for k-samples.
+
+ The k-sample Anderson-Darling test is a modification of the
+ one-sample Anderson-Darling test. It tests the null hypothesis
+ that k-samples are drawn from the same population without having
+ to specify the distribution function of that population. The
+ critical values depend on the number of samples.
+
+ Parameters
+ ----------
+ samples : array_like
+ array of sample data in arrays
+
@rgommers
rgommers Feb 1, 2014 Member

No blank line needed.

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+ distinct observations are in the samples.
+
+ See Also
+ --------
+ ks_2samp : 2 sample Kolmogorov-Smirnov test
+ anderson : 1 sample Anderson-Darling test
+
+ Notes
+ -----
+ [1]_ Define three versions of the k-sample Anderson-Darling test:
+ one for continous distributions and two for discrete
+ distributions, in which ties between samples may occur. The latter
+ variant of the test is also applicable to continuous data. By
+ default, this routine computes the test for continuous and
+ discrete data. If discrete is set to True, the test for discrete
+ data is computed. According to [1]_, the two test statistics
@rgommers
rgommers Feb 1, 2014 Member

insert "discrete" in "two test"

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+ Tests, Journal of the American Statistical Association, Vol. 82,
+ pp. 918-924.
+
+ Examples:
+ ---------
+ >>> from scipy import stats
+ >>> np.random.seed(314159)
+
+ The null hypothesis that the two random samples come from the same
+ distribution can be rejected at the 5% level because the returned
+ test value is greater than the critical value for 5% (1.961) but
+ not at the 2.5% level. The interpolation gives an approximate
+ significance level of 3.1%:
+
+ >>> stats.anderson_ksamp(np.random.normal(size=50), \
+ np.random.normal(loc=0.5, size=30))
@rgommers
rgommers Feb 1, 2014 Member

Use ... for the line continuation instead of \.

@rgommers rgommers commented on the diff Feb 1, 2014
scipy/stats/morestats.py
+
+ The null hypothesis cannot be rejected for three samples from an
+ identical distribution. The approximate p-value (87%) has to be
+ computed by extrapolation and may not be very accurate:
+
+ >>> stats.anderson_ksamp(np.random.normal(size=50), \
+ np.random.normal(size=30), np.random.normal(size=20))
+ (-0.72478622084152444,
+ array([ 0.44925884, 1.3052767, 1.9434184, 2.57696569, 3.41634856]),
+ 0.8732440333177699)
+
+ """
+
+    k = len(samples)
+    if (k < 2):
+        raise ValueError("anderson_ksamp needs at least two samples")
@rgommers
rgommers Feb 1, 2014 Member

blank lines below this and the next couple of if statements

@rgommers
rgommers Feb 1, 2014 Member

only 2 samples? IIRC other tests need 5 to continue with a warning and 20 for no warning. 2 certainly isn't enough for useful results.

@rgommers
rgommers Feb 1, 2014 Member

never mind, figured this one out. The wording is a bit confusing, I propose "two sets of samples". And then there should be the check for number of values per set of samples.

@josef-pkt
josef-pkt Feb 1, 2014 Member

samples is used and defined in the first line of the docstring.
2-samp, k-samp, tests for k samples.
I think it should be clear that we mean 2 samples and not 2 observations per sample.

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+ (-0.72478622084152444,
+ array([ 0.44925884, 1.3052767, 1.9434184, 2.57696569, 3.41634856]),
+ 0.8732440333177699)
+
+ """
+
+    k = len(samples)
+    if (k < 2):
+        raise ValueError("anderson_ksamp needs at least two samples")
+    samples = list(map(np.asarray, samples))
+    Z = np.hstack(samples)
+    N = Z.size
+    Z.sort()
+    Zstar = np.unique(Z)
+    L = Zstar.size
+    if not L > 1:
@rgommers
rgommers Feb 1, 2014 Member

L is only used here, and not > is the same as <, so I'd rewrite the above two lines as if Zstar.size < 2:

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+
+ >>> stats.anderson_ksamp(np.random.normal(size=50), \
+ np.random.normal(size=30), np.random.normal(size=20))
+ (-0.72478622084152444,
+ array([ 0.44925884, 1.3052767, 1.9434184, 2.57696569, 3.41634856]),
+ 0.8732440333177699)
+
+ """
+
+    k = len(samples)
+    if (k < 2):
+        raise ValueError("anderson_ksamp needs at least two samples")
+    samples = list(map(np.asarray, samples))
+    Z = np.hstack(samples)
+    N = Z.size
+    Z.sort()
@rgommers
rgommers Feb 1, 2014 Member

I'd combine this with the line above: Z = np.sort(np.hstack(samples)).

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+    H = (1. / n).sum()
+    g = 0
+    for l in arange(1, N-1):
+        inner = np.array([1. / ((N - l) * m) for m in arange(l+1, N)])
+        g += inner.sum()
+    a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
+    b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
+    c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
+    d = (2*h + 6)*k**2 - 4*h*k
+    sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
+    m = k - 1
+    Tk = (A2kN - m) / math.sqrt(sigmasq)
+
+    b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
+    b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
+    b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
@rgommers
rgommers Feb 1, 2014 Member

This deserves a comment above the line b0 =, otherwise it's unclear what the magic numbers mean (and yes, I did figure it out from the docstring after some frowning).

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+ provided samples can be rejected
+
+ Raises
+ ------
+ ValueError
+ If less than 2 samples are provided, a sample is empty, or no
+ distinct observations are in the samples.
+
+ See Also
+ --------
+ ks_2samp : 2 sample Kolmogorov-Smirnov test
+ anderson : 1 sample Anderson-Darling test
+
+ Notes
+ -----
+ [1]_ Define three versions of the k-sample Anderson-Darling test:
@rgommers
rgommers Feb 1, 2014 Member

Define --> defines

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+    return A2kN
+
+
+def anderson_ksamp(samples, discrete=False):
+ """The Anderson-Darling test for k-samples.
+
+ The k-sample Anderson-Darling test is a modification of the
+ one-sample Anderson-Darling test. It tests the null hypothesis
+ that k-samples are drawn from the same population without having
+ to specify the distribution function of that population. The
+ critical values depend on the number of samples.
+
+ Parameters
+ ----------
+ samples : array_like
+ array of sample data in arrays
@rgommers
rgommers Feb 1, 2014 Member

style nit: can you start each description with a capital letter and end it with .?

@rgommers rgommers commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+        A2kN += inner.sum() / n[i]
+    return A2kN
+
+
+def anderson_ksamp(samples, discrete=False):
+ """The Anderson-Darling test for k-samples.
+
+ The k-sample Anderson-Darling test is a modification of the
+ one-sample Anderson-Darling test. It tests the null hypothesis
+ that k-samples are drawn from the same population without having
+ to specify the distribution function of that population. The
+ critical values depend on the number of samples.
+
+ Parameters
+ ----------
+ samples : array_like
@rgommers
rgommers Feb 1, 2014 Member

This is actually a sequence of 1-D array_like, right? From the description here you'd think a single 2-D array is needed.

@rgommers
Member
rgommers commented Feb 1, 2014

Did you compare this against anderson? If you draw one set of samples from a distribution and specify the same distribution for anderson, and then test those against a single other set of samples, then after enough runs the test statistic and p-values should be the same within some tolerance (at least that's what I expect).

@rgommers
Member
rgommers commented Feb 1, 2014

I don't have the reference and Josef understands this much better than I do anyway, so I'll let him judge the correctness of the stats.

@josef-pkt josef-pkt commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+ k : int
+ number of samples
+ n : array_like
+ number of observations in each sample
+ N : int
+ total number of observations
+
+ Returns
+ -------
+ A2aKN : float
+ The A2aKN statistics of Scholz & Stephens
+ """
+
+    A2akN = 0.
+    lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
+    Bj = Z.searchsorted(Zstar) + lj / 2.
@josef-pkt
josef-pkt Feb 1, 2014 Member

same as searchsorted 'left' (default), save and reuse

@josef-pkt josef-pkt commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+ N : int
+ total number of observations
+
+ Returns
+ -------
+ A2aKN : float
+ The A2aKN statistics of Scholz & Stephens
+ """
+
+    A2akN = 0.
+    lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
+    Bj = Z.searchsorted(Zstar) + lj / 2.
+    for i in arange(0, k):
+        s = np.sort(samples[i])
+        Mij = s.searchsorted(Zstar, side='right').astype(np.float)
+        fij = s.searchsorted(Zstar, 'right') - s.searchsorted(Zstar, 'left')
@josef-pkt
josef-pkt Feb 1, 2014 Member

reuse searchsorted 'right'

@josef-pkt josef-pkt commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
@@ -1131,6 +1131,217 @@ def rootfunc(ab,xj,N):
return A2, critical, sig
+def _anderson_ksamp_both(samples, Z, Zstar, k, n, N):
@josef-pkt
josef-pkt Feb 1, 2014 Member

isn't this discrete? it uses midrank

@josef-pkt josef-pkt commented on the diff Feb 1, 2014
scipy/stats/morestats.py
+ total number of observations
+
+ Returns
+ -------
+ A2KN : float
+ The A2KN statistics of Scholz & Stephens
+ """
+
+    A2kN = 0.
+    lj = Z.searchsorted(Zstar[:-1], 'right') - Z.searchsorted(Zstar[:-1],
+                                                              'left')
+    Bj = lj.cumsum()
+    for i in arange(0, k):
+        s = np.sort(samples[i])
+        Mij = s.searchsorted(Zstar[:-1], side='right')
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
@josef-pkt
josef-pkt Feb 1, 2014 Member

for continuous:
My impression is that we should replace lj by 1 and use the original sorted series, not the uniques; for no ties, Z == Zstar. And for discrete (without tie handling) we can take the 'right' count.

This would give us some speed-up (minus the unique and minus two searchsorted calls).
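This observation is easy to check: for draws from a continuous distribution, ties occur with probability zero, so every multiplicity lj is 1 and Zstar coincides with Z. A quick numpy sketch (seed arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
Z = np.sort(rng.normal(size=20))   # continuous draws: no ties in practice
Zstar = np.unique(Z)

lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
assert Zstar.size == Z.size        # uniques are the full sorted series
assert (lj == 1).all()             # so lj can simply be replaced by 1
```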

@josef-pkt josef-pkt commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+    A2akN = 0.
+    lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
+    Bj = Z.searchsorted(Zstar) + lj / 2.
+    for i in arange(0, k):
+        s = np.sort(samples[i])
+        Mij = s.searchsorted(Zstar, side='right').astype(np.float)
+        fij = s.searchsorted(Zstar, 'right') - s.searchsorted(Zstar, 'left')
+        Mij -= fij / 2.
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / \
+            (Bj * (N - Bj) - N * lj / 4.)
+        A2akN += inner.sum() / n[i]
+    A2akN *= (N - 1.) / N
+    return A2akN
+
+
+def _anderson_ksamp_discrete(samples, Z, Zstar, k, n, N):
@josef-pkt
josef-pkt Feb 1, 2014 Member

should this be _both? It uses 'right' and not the midrank

@josef-pkt josef-pkt commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
+ -----
+ [1]_ Define three versions of the k-sample Anderson-Darling test:
+ one for continous distributions and two for discrete
+ distributions, in which ties between samples may occur. The latter
+ variant of the test is also applicable to continuous data. By
+ default, this routine computes the test for continuous and
+ discrete data. If discrete is set to True, the test for discrete
+ data is computed. According to [1]_, the two test statistics
+ differ only slightly if a few collisions due to round-off errors
+ occur in the test not adjusted for ties between samples.
+
+ .. versionadded:: 0.14.0
+
+ References
+ ----------
+ .. [1] Scholz, F. W & Stephens, M. A. (1987), K-Sample Anderson-Darling
@josef-pkt
josef-pkt Feb 1, 2014 Member

write and instead of &

@josef-pkt josef-pkt and 1 other commented on an outdated diff Feb 1, 2014
scipy/stats/tests/test_morestats.py
+class TestAndersonKSamp(TestCase):
+ def test_example1a(self):
+ # Example data from Scholz & Stephens (1987), originally
+ # published in Lehmann (1995, Nonparametrics, Statistical
+ # Methods Based on Ranks, p. 309)
+ # Pass a mixture of lists and arrays
+ t1 = [38.7, 41.5, 43.8, 44.5, 45.5, 46.0, 47.7, 58.0]
+ t2 = np.array([39.2, 39.3, 39.7, 41.4, 41.8, 42.9, 43.3, 45.8])
+ t3 = np.array([34.0, 35.0, 39.0, 40.0, 43.0, 43.0, 44.0, 45.0])
+ t4 = np.array([34.0, 34.8, 34.8, 35.4, 37.2, 37.8, 41.2, 42.8])
+ Tk, tm, p = assert_warns(UserWarning, stats.anderson_ksamp, (t1, t2,
+ t3, t4), discrete=True)
+ assert_almost_equal(Tk, 4.449, 3)
+ assert_array_almost_equal([0.4985, 1.3237, 1.9158, 2.4930, 3.2459],
+ tm, 4)
+ assert_almost_equal(p, 0.0021, 4)
@josef-pkt
josef-pkt Feb 1, 2014 Member

Are the Tk and pvalues in the unittests from Scholz and Stephens or "regression test" numbers?

@joergdietrich
joergdietrich Feb 2, 2014 Contributor

The Tk values are from Scholz and Stephens. The p-values differ by one in the last digit because I used a second-order polynomial instead of a linear one. The choice was motivated by looking at the interpolation of the two test cases.
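The correspondence can be checked by transcribing the eq. (6) statistic and the variance formula from the diffs quoted above, and running them on the Lehmann data from the unit test; a sketch (not the PR's exact code):

```python
import math
import numpy as np

# Example data from Scholz & Stephens (1987) / Lehmann, as in the test.
t1 = [38.7, 41.5, 43.8, 44.5, 45.5, 46.0, 47.7, 58.0]
t2 = [39.2, 39.3, 39.7, 41.4, 41.8, 42.9, 43.3, 45.8]
t3 = [34.0, 35.0, 39.0, 40.0, 43.0, 43.0, 44.0, 45.0]
t4 = [34.0, 34.8, 34.8, 35.4, 37.2, 37.8, 41.2, 42.8]
samples = [np.sort(np.asarray(t)) for t in (t1, t2, t3, t4)]

k = len(samples)
n = np.array([s.size for s in samples])
Z = np.sort(np.hstack(samples))
N = Z.size
Zstar = np.unique(Z)

# A2kN, eq. (6): discrete parent population, not adjusted for ties.
lj = Z.searchsorted(Zstar[:-1], 'right') - Z.searchsorted(Zstar[:-1], 'left')
Bj = lj.cumsum()
A2kN = 0.
for i, s in enumerate(samples):
    Mij = s.searchsorted(Zstar[:-1], side='right')
    inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
    A2kN += inner.sum() / n[i]

# Variance of A2kN under the null (eq. 4) and the normalized statistic.
h = (1. / np.arange(1, N)).sum()
H = (1. / n).sum()
g = 0.
for l in np.arange(1, N - 1):
    g += (1. / ((N - l) * np.arange(l + 1, N))).sum()
a = (4*g - 6)*(k - 1) + (10 - 6*g)*H
b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
d = (2*h + 6)*k**2 - 4*h*k
sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
Tk = (A2kN - (k - 1)) / math.sqrt(sigmasq)
print(round(Tk, 3))   # close to the 4.449 asserted in the test
```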

@josef-pkt
Member

I think the algorithm looks good now. The use of searchsorted is pretty much the fastest we can get with numpy.

I'm confused about _discrete versus _both: in my reading, _both uses the midrank and _discrete uses the 'right'/cdf definition. That's reversed from what I expected, but I didn't read Scholz and Stephens again.

The continuous/both(?) case could be made faster by skipping the lj (number of elements of a unique value) calculation (lj == 1 if all observations are unique).

@joergdietrich
Contributor

Scholz and Stephens write about the midrank version: "This formula applies for a continuous population also, then all l_j = 1." So I'll implement that speed-up, but the midrank version indeed seems to be the one for discrete and continuous cases.

I'm not quite happy with the naming of the functions and the call signature. Maybe instead of discrete=False we should have midrank=True and rename the functions to _midrank and _right, with appropriate modifications to the docstring. Any thoughts?

@josef-pkt
Member

I also think your proposal, using midrank=True, sounds better. It will also make it easier to explain the difference in the docstring.

Thanks; most likely I will copy the function to statsmodels until our minimum supported scipy version includes it.

joergdietrich added some commits Feb 2, 2014
@joergdietrich joergdietrich Change call signature of anderson_ksamp
- Replace discrete=False with midrank=True to better explain what the
  difference between tests is. Update the docstrings accordingly.
bb3ec01
@joergdietrich joergdietrich Add missing word to docstring 9b0cb39
@joergdietrich joergdietrich avoid computation of lj for continuous distributions; fix docstring f…
…or parameters
908e281
@joergdietrich joergdietrich actually re-use saved searchsorted array instead of just saving it an…
…d then recomputing it ...
6688108
@rgommers rgommers commented on an outdated diff Feb 3, 2014
scipy/stats/morestats.py
+ """The Anderson-Darling test for k-samples.
+
+ The k-sample Anderson-Darling test is a modification of the
+ one-sample Anderson-Darling test. It tests the null hypothesis
+ that k-samples are drawn from the same population without having
+ to specify the distribution function of that population. The
+ critical values depend on the number of samples.
+
+ Parameters
+ ----------
+ samples : sequence of 1-D array_like
+ Array of sample data in arrays.
+ midrank : bool, optional
+ Type of Anderson-Darling test which is computed. Default is
+ the midrank test applicable to continuous and discrete
+ populations.
@rgommers
rgommers Feb 3, 2014 Member

Default (True). And If False, the type is ..... Just one sentence and a reference to more elaborate explanation in the Notes section is OK, but there should be something here.

@rgommers
Member
rgommers commented Feb 3, 2014

Updates so far look good, most of my comments are addressed. Do need to squash some commits and write the commit messages in more standard form before merging.

@joergdietrich
Contributor

I need some guidance for the "more standard form" of commit messages.

@coveralls

Coverage Status

Coverage remained the same when pulling 24931cb on joergdietrich:k-sample-AD into fd99d3f on scipy:master.

@rgommers
Member
rgommers commented Feb 7, 2014

Example:

More verbose explanation of midrank parameter

Should be something like:

DOC: more verbose explanation of midrank parameter in stats.anderson_ksamp
joergdietrich added some commits Jan 3, 2014
@joergdietrich joergdietrich ENH: Add k-sample Anderson-Darling test to stats module 8a92e25
@joergdietrich joergdietrich API: Speed up and implement both versions for discrete distributions f…
…or k-sample Anderson-Darling test

1. Change call signature to have array of arrays and optional keyword to
   specify which version of the k-sample AD test should be computed.

2. Get rid of all inner loops and list comprehensions by using
   np.searchsorted.

3. Both versions given by Scholz & Stephens for discrete samples can be
   computed now.
d9d46af
@joergdietrich joergdietrich STY: Replace "&" with "and" in citation and add year in stats.anderso…
…n_ksamp and stats.__anderson_ksamp_both
b2c5ef9
@joergdietrich joergdietrich MAINT: fix typo in docstring stats.anderson_ksamp 371dcc7
@joergdietrich joergdietrich STY: add blank lines after if blocks in k-sample Anderson Darling rou…
…tines
b2682cd
@joergdietrich joergdietrich STY: Use ... for line continuation in docstring in stats.anderson_ksamp b138007
@joergdietrich joergdietrich DOC: Add comment to explain interpolation values stats.anderson_ksamp a12c26f
@joergdietrich joergdietrich STY: capital letter for each parameter description and period at end …
…in k-sample Anderson-Darling docstrings
f90390d
@joergdietrich joergdietrich MAINT: combine a few simple statements in stats.anderson_ksamp a2a077b
@joergdietrich joergdietrich API: Change call signature of stats.anderson_ksamp
- Replace discrete=False with midrank=True to better explain what the
  difference between tests is. Update the docstrings accordingly.
1626e0f
@joergdietrich joergdietrich MAINT: Add missing word to docstring of stats.anderson_ksamp aad0972
@joergdietrich joergdietrich MAINT: avoid computation of lj for continuous distributions in stats.…
…_anderson_ksamp_midrank; fix docstring for parameters
11b5a01
@joergdietrich joergdietrich DOC: More verbose explanation of midrank parameter in stats.anderson_…
…ksamp
7e8e03c
@joergdietrich joergdietrich Merge branch 'k-sample-AD' of github.com:joergdietrich/scipy into k-s…
…ample-AD
6bce442
@joergdietrich
Contributor

I'm not particularly comfortable with rebasing. I hope I got this right and the commit messages are okay now. If anything else needs fixing it'll probably have to wait ~2 weeks.

@pv pv added the PR label Feb 19, 2014
@joergdietrich
Contributor

Anything missing to get this merged into 0.14.x?

@josef-pkt josef-pkt added this to the 0.14.0 milestone Feb 24, 2014
@josef-pkt
Member

Ralf, I think this is fine to merge. I didn't look at the details again, but, last time I went through this, I didn't see anything that would hold this up from the statistics side.

@rgommers
Member

I don't think anything is missing. Looks like something went wrong with the rebase, I'll try to fix that tonight and merge this. 0.14.0 milestone is already set for this PR.

@rgommers rgommers added a commit that referenced this pull request Feb 24, 2014
@rgommers rgommers Merge branch 'pr/3183' into master.
Review at #3183
a32a7ba
@rgommers
Member

Merged in a32a7ba. Thanks @joergdietrich, @josef-pkt

@rgommers rgommers closed this Feb 24, 2014