# ENH: Add k-sample Anderson-Darling test to stats module #3183

Closed
wants to merge 31 commits
+338 −2


### 6 participants

Contributor
 This PR adds the k-sample Anderson-Darling test for continuous distributions, as described by Scholz & Stephens (1987, Journal of the American Statistical Association, Vol. 82, pp. 918-924), to the stats module.
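The function described in this PR ultimately landed as `scipy.stats.anderson_ksamp`; a minimal usage sketch, assuming a scipy version that includes the merged function (whose final signature, after the review below, takes a sequence of samples rather than positional sample arguments):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(314159)

# Two samples drawn from normal distributions with different means.
x = rng.normal(size=50)
y = rng.normal(loc=0.5, size=30)

# Returns the normalized test statistic, critical values for a set of
# tabulated significance levels, and an interpolated significance level.
statistic, critical_values, p_value = stats.anderson_ksamp([x, y])
```

A large statistic relative to the tabulated critical values argues for rejecting the null hypothesis that all samples come from the same population.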
 joergdietrich `ENH: Add k-sample Anderson-Darling test to stats module` `95ad13d`
and 1 other commented on an outdated diff Jan 3, 2014
scipy/stats/morestats.py
```
+    Zstar = np.unique(Z)
+    L = Zstar.size
+    if not L > 1:
+        raise ValueError("anderson_ksamp needs more than one distinct "
+                         "observation")
+    n = np.array([sample.size for sample in samples])
+    if any(n == 0):
+        raise ValueError("anderson_ksamp encountered sample without "
+                         "observations")
+    A2kN = 0.
+    lj = np.array([(Z == zj).sum() for zj in Zstar[:-1]])
+    Bj = lj.cumsum()
+    for i in arange(0, k):
+        fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
+        Mij = fij.cumsum()
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
```
 josef-pkt (Member): Is this equation (6) in Scholz and Stephens, and not equation (3)? A discrete parent population? Not continuous as in the docstring, lines 1174-1175.

 joergdietrich (Contributor): Yes, it is. The docstring needs to be fixed.
commented on the diff Jan 3, 2014
scipy/stats/morestats.py
```
+        fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
+        Mij = fij.cumsum()
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
+        A2kN += inner.sum() / n[i]
+
+    h = (1. / arange(1, N)).sum()
+    H = (1. / n).sum()
+    g = 0
+    for l in arange(1, N-1):
+        inner = np.array([1. / ((N - l) * m) for m in arange(l+1, N)])
+        g += inner.sum()
+    a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
+    b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
+    c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
+    d = (2*h + 6)*k**2 - 4*h*k
+    sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
```
 josef-pkt (Member): Is this based on equation (4)?

 joergdietrich (Contributor): Yes.
Member
 Good. I never tried my hand at the discrete Anderson-Darling tests. I think we should add `_discrete` to the name of the function. I'm asking a question on the mailing list about the signature.
Member
 If we are planning on two different functions for discrete and continuous, then it might be better to factor the p-value calculation out into a (private) helper function.
Contributor
 I'll change the signature according to Ralf's suggestion. Regarding discrete and continuous distributions, I now have code to compute equations 3, 6, and 7 of Scholz and Stephens. So we could include all of them, or just 6 and 7, where the latter would also apply to continuous distributions.
Member
 One question that's not clear to me when I read the formulas: does equation (3) give the same answer as equation (6) even when the values are discrete and there are ties?

 The two main questions are: how many versions of the calculation do we need? And what's the best way to implement them?

 To the second: I'm pretty sure the continuous version can be made without loops by using np.searchsorted. For the discrete version it might be possible to use np.searchsorted or a cython function for rankdata that is already in scipy.stats. I think getting a very fast version is not a requirement for this PR, but it's also possible that it's not very difficult to get a version free of Python loops with the existing tools.

 What are the options for a user if we have equations 3, 6, and 7? If 3 and 6 can use the same code, then the only choice would be whether to use the midrank in case of ties. If only the continuous equation (3) can get a fast algorithm, then we should also allow users to choose that.

 (I'm a bit distracted because I also have pull requests in statsmodels where I need to catch up with some reading to understand the topics.)
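The loop-free counting suggested here can be sketched with `np.searchsorted` on sorted data (toy data, not from the paper): the difference of the 'right' and 'left' insertion points gives the multiplicity of each distinct value, and the 'right' insertion points into a sorted sample give the cumulative counts M_ij directly.

```python
import numpy as np

# Pooled, sorted observations with ties (toy example).
Z = np.array([1., 2., 2., 3., 5., 5., 5., 7.])
Zstar = np.unique(Z)

# Multiplicity l_j of each distinct value: count of elements equal to
# Zstar[j], obtained without any Python-level loop.
lj = Z.searchsorted(Zstar, side='right') - Z.searchsorted(Zstar, side='left')

# Cumulative count M_ij for one sample: number of observations in that
# sample which are <= each distinct pooled value.
s = np.sort(np.array([2., 5.]))
Mij = s.searchsorted(Zstar, side='right')
```

This is exactly the replacement for the `(Z == zj).sum()` list comprehensions in the first version of the diff.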
Member
 Related: from what I remember, equation (5), a weighted sum of chi-square distributions, shows up in some of the discrete goodness-of-fit tests. I finally wrote the code for getting the p-values from that, but I don't have any truncation rule for the infinite sum. I don't think it's really relevant in this case because, according to the references I looked at, the Anderson-Darling statistic converges fast and Stephens' approximation seems to work pretty well. However, I didn't look much at discrete cases.
 joergdietrich: `Speed up and implement both version for discrete distributions` (`3f421bc`)

 1. Change call signature to have array of arrays and optional keyword to specify which version of the k-sample AD test should be computed.
 2. Get rid of all inner loops and list comprehensions by using np.searchsorted.
 3. Both version given by Scholz & Stephens for discrete sample can be computed now.
Contributor
 I managed to rewrite everything inside the outer loop and the determination of the multiplicity using np.searchsorted. Thanks for this pointer, I wasn't aware of the power of this function. Unfortunately, it's only a speed-up of less than a percent per outer loop for a large range of sample sizes, so the list comprehension was doing pretty well already. This change mostly benefits large numbers of samples, which may not be very common. Eqs. 3, 6, 7 always disagree.
Contributor
 The test failure is because a few days ago TravisCI decided to start using less precision when it calculates small cosine integrals, for some reason.
Contributor
scipy/stats/morestats.py
```
+    ValueError
+        If less than 2 samples are provided, a sample is empty, or no
+        distinct observations are in the samples.
+
+    See Also
+    --------
+    ks_2samp : 2 sample Kolmogorov-Smirnov test
+    anderson : 1 sample Anderson-Darling test
+
+    Notes
+    -----
+    [1]_ Define three versions of the k-sample Anderson-Darling test:
+    one for continous distributions and two for discrete
+    distributions, in which ties between samples may occur. The latter
+    variant of the test is also applicable to continuous data. By
+    default, this routine computes the test for continuous and
```
 rgommers (Member): This statement looks incorrect (only one p-value is returned) and would also be strange. The default is continuous only, right?
and 1 other commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
```
+
+    Parameters
+    ----------
+    samples : array_like
+        array of sample data in arrays
+
+    discrete : bool, optional
+        type of Anderson-Darling test which is computed. Default is a test
+        applicable to discrete and continous distributions.
+
+    Returns
+    -------
+    Tk : float
+        Normalized k-sample Anderson-Darling test statistic, not adjusted for
+        ties
+    tm : array
```
 rgommers (Member): Could you use some more descriptive variable names? `p` is so widely used that it may be OK, but `Tk` and `tm` are not. Same comment for the internal variables; they're almost all one-letter, which isn't readable.

 rgommers (Member): Actually, why not name the return values consistent with `anderson`?

 joergdietrich (Contributor): I changed the return values to be as consistent with `anderson` as possible. I prefer to keep the internal variable names unchanged. The implementation of methods from the literature can only be understood together with the paper describing the method. In such cases I find it much more helpful to have the variable names in the code match the variable names in the paper, rather than trying to come up with descriptive names that anybody trying to match the code to the paper then has to translate back.

 rgommers (Member): OK, the paper is available online for free, so for internal variables that should be fine.
Member
commented Feb 1, 2014
 The function needs to be added in `stats/__init__.py` in order for it to show up in the documentation.
and 2 others commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
```
+        g += inner.sum()
+    a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
+    b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
+    c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
+    d = (2*h + 6)*k**2 - 4*h*k
+    sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
+    m = k - 1
+    Tk = (A2kN - m) / math.sqrt(sigmasq)
+
+    b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
+    b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
+    b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
+    tm = b0 + b1 / math.sqrt(m) + b2 / m
+    pf = np.polyfit(tm, log(np.array([0.25, 0.1, 0.05, 0.025, 0.01])), 2)
+    if Tk < tm.min() or Tk > tm.max():
+        warnings.warn("approximate p-value will be computed by extrapolation")
```
 rgommers (Member): Is this warning needed? It shows up in most of the test cases, so I'm guessing it's not that uncommon (didn't check). If so, adding a note in the docstring might make more sense. If the warning has to be kept, it shouldn't show up in the test output (it can be silenced within a `with warnings.catch_warnings()` block if needed).

 josef-pkt (Member): I'm not sure what the best pattern for cases like this is. I don't know how good the extrapolation is; it might have quite a large error in some ranges. I have something similar for tables of p-values without extrapolation. The options:

 - mention only in the docstring the range of extrapolation (it's just lower precision than interpolation)
 - keep the warning as here
 - truncate (without extrapolation some packages, and some of my functions, just return the boundary value 0.25 or 0.01; for a text return it would be `'<0.01'` or `'>0.25'`)

 For most use cases the exact p-value outside [0.01, 0.25] doesn't really matter, and just mentioning it in the docstring would be enough. But I guess there would be multiple-testing applications where smaller p-values are relevant, and users need to be aware that those are not very precise.

 joergdietrich (Contributor): I don't think the quality of the interpolation is known. Scholz & Stephens vary the polynomial order depending on the number of samples and provide no guidance for what a general procedure should use. The test cases are taken from Scholz and Stephens and happen to be cases where the null hypothesis can be rejected at better than the 1% level. Given the unknown level of accuracy, I'd prefer to keep the warning, unless there's a strong preference to move it to the docstring.

 rgommers (Member): OK, that's fine with me then.

 josef-pkt (Member): The warning is fine with me too. I don't have a strong opinion, given I don't know how good the extrapolation is.
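The interpolation step under discussion can be sketched as a standalone helper (a hypothetical function, not the PR's exact code): fit a quadratic in the standardized statistic to log(p) at the five tabulated significance levels, then evaluate it at the observed statistic. Values of `Tk` outside the tabulated range are extrapolated, which is what triggers the warning.

```python
import numpy as np

def interpolate_pvalue(Tk, m):
    # Critical-value coefficients from Scholz & Stephens (1987) for the
    # significance levels 25%, 10%, 5%, 2.5%, 1%; m = k - 1.
    b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
    b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
    b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
    tm = b0 + b1 / np.sqrt(m) + b2 / m

    # Quadratic fit of log(p) against the five critical values;
    # evaluating outside [tm.min(), tm.max()] is extrapolation.
    pf = np.polyfit(tm, np.log([0.25, 0.1, 0.05, 0.025, 0.01]), 2)
    return np.exp(np.polyval(pf, Tk))
```

The fitted curve is monotonically decreasing across the tabulated range, so larger statistics map to smaller approximate p-values.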
and 2 others commented on an outdated diff Feb 1, 2014
scipy/stats/morestats.py
```
+        array of sample arrays
+    Z : array_like
+        sorted array of all observations
+    Zstar : array_like
+        sorted array of unique observations
+    k : int
+        number of samples
+    n : array_like
+        number of observations in each sample
+    N : int
+        total number of observations
+
+    Returns
+    -------
+    A2aKN : float
+        The A2aKN statistics of Scholz & Stephens
```
 rgommers (Member): The `&` should be escaped, or use a raw docstring. Results in Sphinx warnings (or errors, can't remember) otherwise.

 josef-pkt (Member): Spell it out: `and`, plus the year.

 argriffing (Contributor): This is another case where wider-than-function scope references in sphinx-formatted docstrings would be helpful. Then you could just add a ref link to https://github.com/joergdietrich/scipy/blob/k-sample-AD/scipy/stats/morestats.py#L1267
scipy/stats/morestats.py
```
+
+
+def anderson_ksamp(samples, discrete=False):
+    """The Anderson-Darling test for k-samples.
+
+    The k-sample Anderson-Darling test is a modification of the
+    one-sample Anderson-Darling test. It tests the null hypothesis
+    that k-samples are drawn from the same population without having
+    to specify the distribution function of that population. The
+    critical values depend on the number of samples.
+
+    Parameters
+    ----------
+    samples : array_like
+        array of sample data in arrays
+
```
 rgommers (Member): No blank line needed.
scipy/stats/morestats.py
```
+        distinct observations are in the samples.
+
+    See Also
+    --------
+    ks_2samp : 2 sample Kolmogorov-Smirnov test
+    anderson : 1 sample Anderson-Darling test
+
+    Notes
+    -----
+    [1]_ Define three versions of the k-sample Anderson-Darling test:
+    one for continous distributions and two for discrete
+    distributions, in which ties between samples may occur. The latter
+    variant of the test is also applicable to continuous data. By
+    default, this routine computes the test for continuous and
+    discrete data. If discrete is set to True, the test for discrete
+    data is computed. According to [1]_, the two test statistics
```
 rgommers (Member): Insert "discrete" in "two test".
scipy/stats/morestats.py
```
+           Tests, Journal of the American Statistical Association, Vol. 82,
+           pp. 918-924.
+
+    Examples:
+    ---------
+    >>> from scipy import stats
+    >>> np.random.seed(314159)
+
+    The null hypothesis that the two random samples come from the same
+    distribution can be rejected at the 5% level because the returned
+    test value is greater than the critical value for 5% (1.961) but
+    not at the 2.5% level. The interpolation gives an approximate
+    significance level of 3.1%:
+
+    >>> stats.anderson_ksamp(np.random.normal(size=50), \
+    np.random.normal(loc=0.5, size=30))
```
 rgommers (Member): Use `...` for the line continuation instead of `\`.
commented on the diff Feb 1, 2014
scipy/stats/morestats.py
```
+
+    The null hypothesis cannot be rejected for three samples from an
+    identical distribution. The approximate p-value (87%) has to be
+    computed by extrapolation and may not be very accurate:
+
+    >>> stats.anderson_ksamp(np.random.normal(size=50), \
+    np.random.normal(size=30), np.random.normal(size=20))
+    (-0.72478622084152444,
+     array([ 0.44925884, 1.3052767, 1.9434184, 2.57696569, 3.41634856]),
+     0.8732440333177699)
+
+    """
+
+    k = len(samples)
+    if (k < 2):
+        raise ValueError("anderson_ksamp needs at least two samples")
```
 rgommers (Member): Blank lines below this and the next couple of `if` statements.

 rgommers (Member): Only 2 samples? IIRC other tests need 5 to continue with a warning and 20 for no warning. 2 certainly isn't enough for useful results.

 rgommers (Member): Never mind, figured this one out. The wording is a bit confusing; I propose "two sets of samples". And then there should be a check for the number of values per set of samples.

 josef-pkt (Member): `samples` is used and defined in the first line of the docstring: 2-samp, k-samp, tests for k samples. I think it should be clear that we mean 2 samples and not 2 observations per sample.
scipy/stats/morestats.py
```
+    (-0.72478622084152444,
+     array([ 0.44925884, 1.3052767, 1.9434184, 2.57696569, 3.41634856]),
+     0.8732440333177699)
+
+    """
+
+    k = len(samples)
+    if (k < 2):
+        raise ValueError("anderson_ksamp needs at least two samples")
+    samples = list(map(np.asarray, samples))
+    Z = np.hstack(samples)
+    N = Z.size
+    Z.sort()
+    Zstar = np.unique(Z)
+    L = Zstar.size
+    if not L > 1:
```
 rgommers (Member): `L` is only used here and `not >` is `<`, so I'd rewrite the above two lines as `if Zstar.size < 2:`.
scipy/stats/morestats.py
```
+
+    >>> stats.anderson_ksamp(np.random.normal(size=50), \
+    np.random.normal(size=30), np.random.normal(size=20))
+    (-0.72478622084152444,
+     array([ 0.44925884, 1.3052767, 1.9434184, 2.57696569, 3.41634856]),
+     0.8732440333177699)
+
+    """
+
+    k = len(samples)
+    if (k < 2):
+        raise ValueError("anderson_ksamp needs at least two samples")
+    samples = list(map(np.asarray, samples))
+    Z = np.hstack(samples)
+    N = Z.size
+    Z.sort()
```
 rgommers (Member): I'd combine this with the line above: `Z = np.sort(np.hstack(samples))`.
scipy/stats/morestats.py
```
+    H = (1. / n).sum()
+    g = 0
+    for l in arange(1, N-1):
+        inner = np.array([1. / ((N - l) * m) for m in arange(l+1, N)])
+        g += inner.sum()
+    a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
+    b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
+    c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
+    d = (2*h + 6)*k**2 - 4*h*k
+    sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
+    m = k - 1
+    Tk = (A2kN - m) / math.sqrt(sigmasq)
+
+    b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
+    b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
+    b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
```
 rgommers (Member): This deserves a comment above the line `b0 =`, otherwise it's unclear what the magic numbers mean (and yes, I did figure it out from the docstring after some frowning).
scipy/stats/morestats.py
```
+        provided samples can be rejected
+
+    Raises
+    ------
+    ValueError
+        If less than 2 samples are provided, a sample is empty, or no
+        distinct observations are in the samples.
+
+    See Also
+    --------
+    ks_2samp : 2 sample Kolmogorov-Smirnov test
+    anderson : 1 sample Anderson-Darling test
+
+    Notes
+    -----
+    [1]_ Define three versions of the k-sample Anderson-Darling test:
```
 rgommers (Member): "Define" --> "defines".
scipy/stats/morestats.py
```
+    return A2kN
+
+
+def anderson_ksamp(samples, discrete=False):
+    """The Anderson-Darling test for k-samples.
+
+    The k-sample Anderson-Darling test is a modification of the
+    one-sample Anderson-Darling test. It tests the null hypothesis
+    that k-samples are drawn from the same population without having
+    to specify the distribution function of that population. The
+    critical values depend on the number of samples.
+
+    Parameters
+    ----------
+    samples : array_like
+        array of sample data in arrays
```
 rgommers (Member): Style nit: can you start each description with a capital letter and end it with `.`?
scipy/stats/morestats.py
```
+        A2kN += inner.sum() / n[i]
+    return A2kN
+
+
+def anderson_ksamp(samples, discrete=False):
+    """The Anderson-Darling test for k-samples.
+
+    The k-sample Anderson-Darling test is a modification of the
+    one-sample Anderson-Darling test. It tests the null hypothesis
+    that k-samples are drawn from the same population without having
+    to specify the distribution function of that population. The
+    critical values depend on the number of samples.
+
+    Parameters
+    ----------
+    samples : array_like
```
 rgommers (Member): This is actually `sequence of 1-D array_like`, right? From the description here you'd think a single 2-D array is needed.
Member
commented Feb 1, 2014
 Did you compare this against `anderson`? If you draw one set of samples from a distribution and specify the same distribution for `anderson`, and then test those against a single other set of samples, then after enough runs the test statistic and p-values should be the same within some tolerance (at least that's what I expect).
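One quick way to run the comparison suggested here (a sanity-check sketch under the stated assumption that a scipy with the merged `anderson_ksamp` is available, not a rigorous equivalence test): test a sample against a large reference draw from the hypothesized distribution and compare the conclusion with the one-sample `anderson` test.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.normal(size=100)       # sample under test
ref = rng.normal(size=1000)    # large draw from the hypothesized distribution

# One-sample Anderson-Darling test against the normal family.
a1, crit1, sig1 = stats.anderson(x, dist='norm')

# Two-sample test of x against the large reference draw; for a big
# enough reference this should usually lead to the same conclusion.
a2, crit2, p2 = stats.anderson_ksamp([x, ref])
```

Note that `anderson` estimates the distribution's parameters from the data while the k-sample test is fully nonparametric, so the statistics are only expected to agree in conclusion, not in value.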
Member
commented Feb 1, 2014
 I don't have the reference and Josef understands this much better than I do anyway, so I'll let him judge the correctness of the stats.
scipy/stats/morestats.py
```
+    k : int
+        number of samples
+    n : array_like
+        number of observations in each sample
+    N : int
+        total number of observations
+
+    Returns
+    -------
+    A2aKN : float
+        The A2aKN statistics of Scholz & Stephens
+    """
+
+    A2akN = 0.
+    lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
+    Bj = Z.searchsorted(Zstar) + lj / 2.
```
 josef-pkt (Member): Same as searchsorted 'left' (the default); save and reuse it.
scipy/stats/morestats.py
```
+    N : int
+        total number of observations
+
+    Returns
+    -------
+    A2aKN : float
+        The A2aKN statistics of Scholz & Stephens
+    """
+
+    A2akN = 0.
+    lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
+    Bj = Z.searchsorted(Zstar) + lj / 2.
+    for i in arange(0, k):
+        s = np.sort(samples[i])
+        Mij = s.searchsorted(Zstar, side='right').astype(np.float)
+        fij = s.searchsorted(Zstar, 'right') - s.searchsorted(Zstar, 'left')
```
 josef-pkt (Member): Reuse the searchsorted 'right' result.
scipy/stats/morestats.py
```
@@ -1131,6 +1131,217 @@ def rootfunc(ab,xj,N):
     return A2, critical, sig
 
+def _anderson_ksamp_both(samples, Z, Zstar, k, n, N):
```
 josef-pkt (Member): Isn't this discrete? It uses the midrank.
commented on the diff Feb 1, 2014
scipy/stats/morestats.py
```
+    total number of observations
+
+    Returns
+    -------
+    A2KN : float
+        The A2KN statistics of Scholz & Stephens
+    """
+
+    A2kN = 0.
+    lj = Z.searchsorted(Zstar[:-1], 'right') - Z.searchsorted(Zstar[:-1],
+                                                              'left')
+    Bj = lj.cumsum()
+    for i in arange(0, k):
+        s = np.sort(samples[i])
+        Mij = s.searchsorted(Zstar[:-1], side='right')
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
```
 josef-pkt (Member): For continuous: my impression is that we should replace lj by 1 and use the original sorted series, not the uniques, since with no ties Z == Zstar. And for discrete (without tie handling) we can take the 'right' count. This would give us some speed-up (minus `unique` and minus two `searchsorted` calls).
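The simplification can be sketched as a hypothetical standalone helper following equation (3) as used in the diff above: with no ties, every l_j is 1 and B_j is simply j, so `np.unique` and two of the `searchsorted` calls disappear.

```python
import numpy as np

def a2_kn_continuous(samples):
    # Equation (3) of Scholz & Stephens (1987), valid when all pooled
    # observations are distinct: then l_j == 1 and B_j == j, so the
    # np.unique / multiplicity bookkeeping can be dropped entirely.
    samples = [np.asarray(s, dtype=float) for s in samples]
    Z = np.sort(np.hstack(samples))
    N = Z.size
    Bj = np.arange(1, N)  # B_j == j for distinct observations

    A2kN = 0.0
    for s in samples:
        # M_ij: how many of this sample's values are <= Z[j-1].
        Mij = np.sort(s).searchsorted(Z[:-1], side='right')
        A2kN += ((N * Mij - Bj * s.size) ** 2 / (Bj * (N - Bj))).sum() / s.size
    return A2kN / N
```

For two tiny samples with all-distinct values, e.g. `[1, 3, 5]` and `[2, 4]`, the sum can be checked by hand against the formula in the diff.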
scipy/stats/morestats.py
```
+    A2akN = 0.
+    lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
+    Bj = Z.searchsorted(Zstar) + lj / 2.
+    for i in arange(0, k):
+        s = np.sort(samples[i])
+        Mij = s.searchsorted(Zstar, side='right').astype(np.float)
+        fij = s.searchsorted(Zstar, 'right') - s.searchsorted(Zstar, 'left')
+        Mij -= fij / 2.
+        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / \
+                (Bj * (N - Bj) - N * lj / 4.)
+        A2akN += inner.sum() / n[i]
+    A2akN *= (N - 1.) / N
+    return A2akN
+
+
+def _anderson_ksamp_discrete(samples, Z, Zstar, k, n, N):
```
 josef-pkt (Member): Should this be `both`? It uses 'right' and not the midrank.
scipy/stats/morestats.py
```
+    -----
+    [1]_ Define three versions of the k-sample Anderson-Darling test:
+    one for continous distributions and two for discrete
+    distributions, in which ties between samples may occur. The latter
+    variant of the test is also applicable to continuous data. By
+    default, this routine computes the test for continuous and
+    discrete data. If discrete is set to True, the test for discrete
+    data is computed. According to [1]_, the two test statistics
+    differ only slightly if a few collisions due to round-off errors
+    occur in the test not adjusted for ties between samples.
+
+    .. versionadded:: 0.14.0
+
+    References
+    ----------
+    .. [1] Scholz, F. W & Stephens, M. A. (1987), K-Sample Anderson-Darling
```
 josef-pkt (Member): Write `and` instead of `&`.
and 1 other commented on an outdated diff Feb 1, 2014
scipy/stats/tests/test_morestats.py
```
+class TestAndersonKSamp(TestCase):
+    def test_example1a(self):
+        # Example data from Scholz & Stephens (1987), originally
+        # published in Lehmann (1995, Nonparametrics, Statistical
+        # Methods Based on Ranks, p. 309)
+        # Pass a mixture of lists and arrays
+        t1 = [38.7, 41.5, 43.8, 44.5, 45.5, 46.0, 47.7, 58.0]
+        t2 = np.array([39.2, 39.3, 39.7, 41.4, 41.8, 42.9, 43.3, 45.8])
+        t3 = np.array([34.0, 35.0, 39.0, 40.0, 43.0, 43.0, 44.0, 45.0])
+        t4 = np.array([34.0, 34.8, 34.8, 35.4, 37.2, 37.8, 41.2, 42.8])
+        Tk, tm, p = assert_warns(UserWarning, stats.anderson_ksamp, (t1, t2,
+                                 t3, t4), discrete=True)
+        assert_almost_equal(Tk, 4.449, 3)
+        assert_array_almost_equal([0.4985, 1.3237, 1.9158, 2.4930, 3.2459],
+                                  tm, 4)
+        assert_almost_equal(p, 0.0021, 4)
```
 josef-pkt (Member): Are the Tk and p-values in the unit tests from Scholz and Stephens, or "regression test" numbers?

 joergdietrich (Contributor): The Tk values are from Scholz and Stephens. The p-values differ by one at the last digit because I used a second-order polynomial instead of a linear one. The choice was motivated by looking at the interpolation of the two test cases.
Member
 I think the algorithm looks good now. The use of searchsorted is pretty much the fastest we can get with numpy.

 I'm confused about `_discrete` versus `_both`: in my reading, `both` uses the midrank and `discrete` uses the 'right'/cdf definition. That's reversed from what I expected, but I didn't read Scholz and Stephens again.

 The continuous/both(?) case could be made faster by skipping the lj (number of elements of a unique) calculation: lj == 1 if all observations are unique.
added some commits Feb 2, 2014
 - joergdietrich: `add anderson_ksamp` (`23e1f77`)
 - joergdietrich: `Replace & with and in citation and add year` (`5612e42`)
 - joergdietrich: `Replace another & with and` (`ea78602`)
 - joergdietrich: `fix typo in docstring` (`f1f3770`)
 - joergdietrich: `add blank lines after if blocks` (`79b9da6`)
 - joergdietrich: `Use ... for line continuation in docstring` (`b7023b6`)
 - joergdietrich: `Add comment to explain interpolation values` (`45a7447`)
 - joergdietrich: `capital letter for each parameter description and period at end` (`445febf`)
 - joergdietrich: `combine a few simple statements` (`ab1216e`)
Contributor
 Scholz and Stephens write about the midrank version: "This formula applies for a continuous population also, then all l_j = 1." So I'll implement that speed-up, but the midrank version indeed seems to be the one for discrete and continuous cases. I'm not quite happy with the naming of the functions and the call signature. Maybe instead of `discrete=False` we should have `midrank=True` and rename the functions to `_midrank` and `_right`, with appropriate modifications to the docstring. Any thoughts?
Member
 I also think your proposal, using `midrank=True`, sounds better. It will also make it easier to explain the difference in the docstring. Thanks, most likely I will copy the function to statsmodels until our minimum supported scipy version includes it.
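With the renamed keyword, the two variants are selected like this (using two of the Scholz and Stephens data sets quoted in the tests above; the call may warn that the p-value is extrapolated):

```python
import numpy as np
from scipy import stats

# Two of the four data sets from Scholz & Stephens (1987).
t1 = [38.7, 41.5, 43.8, 44.5, 45.5, 46.0, 47.7, 58.0]
t2 = [39.2, 39.3, 39.7, 41.4, 41.8, 42.9, 43.3, 45.8]

# Default: the midrank (tie-adjusted) statistic, applicable to
# continuous and discrete populations.
res_mid = stats.anderson_ksamp([t1, t2])

# midrank=False: the right-continuous ECDF variant for discrete data.
res_right = stats.anderson_ksamp([t1, t2], midrank=False)
```

Both calls return the statistic, the critical values, and the approximate significance level; the two statistics generally differ slightly.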
added some commits Feb 2, 2014
 - joergdietrich: `Change call signature of anderson_ksamp` (`bb3ec01`): Replace discrete=False with midrank=True to better explain what the difference between tests is. Update the docstrings accordingly.
 - joergdietrich: `Add missing word to docstring` (`9b0cb39`)
 - joergdietrich: `avoid computation of lj for continuous distributions; fix docstring for parameters` (`908e281`)
 - joergdietrich: `actually re-use saved searchsorted array instead of just saving it and then recomputing it ...` (`6688108`)
scipy/stats/morestats.py
```
+    """The Anderson-Darling test for k-samples.
+
+    The k-sample Anderson-Darling test is a modification of the
+    one-sample Anderson-Darling test. It tests the null hypothesis
+    that k-samples are drawn from the same population without having
+    to specify the distribution function of that population. The
+    critical values depend on the number of samples.
+
+    Parameters
+    ----------
+    samples : sequence of 1-D array_like
+        Array of sample data in arrays.
+    midrank : bool, optional
+        Type of Anderson-Darling test which is computed. Default is
+        the midrank test applicable to continuous and discrete
+        populations.
```
 rgommers (Member): `Default (True)`. And `If False, the type is ....`. Just one sentence and a reference to a more elaborate explanation in the Notes section is OK, but there should be something here.
Member
commented Feb 3, 2014
 Updates so far look good; most of my comments are addressed. We do need to squash some commits and write the commit messages in a more standard form before merging.
 joergdietrich `More verbose explanation of midrank parameter` `24931cb`
Contributor
 I need some guidance for the "more standard form" of commit messages.
 Coverage remained the same when pulling 24931cb on joergdietrich:k-sample-AD into fd99d3f on scipy:master.
Member
commented Feb 7, 2014
 Example:

 ```
 More verbose explanation of midrank parameter
 ```

 should be something like:

 ```
 DOC: more verbose explanation of midrank parameter in stats.anderson_ksamp
 ```
added some commits Jan 3, 2014
 - joergdietrich: `ENH: Add k-sample Anderson-Darling test to stats module` (`8a92e25`)
 - joergdietrich: `API: Speed up and implement both version for discrete distributions for k-sample Anderson-Darling test` (`d9d46af`): 1. Change call signature to have array of arrays and optional keyword to specify which version of the k-sample AD test should be computed. 2. Get rid of all inner loops and list comprehensions by using np.searchsorted. 3. Both version given by Scholz & Stephens for discrete sample can be computed now.
 - joergdietrich: `STY: Replace "&" with "and" in citation and add year in stats.anderson_ksamp and stats.__anderson_ksamp_both` (`b2c5ef9`)
 - joergdietrich: `MAINT: fix typo in docstring stats.anderson_ksamp` (`371dcc7`)
 - joergdietrich: `STY: add blank lines after if blocks in k-sample Anderson Darling routines` (`b2682cd`)
 - joergdietrich: `STY: Use ... for line continuation in docstring in stats.anderson_ksamp` (`b138007`)
 - joergdietrich: `DOC: Add comment to explain interpolation values stats.anderson_ksamp` (`a12c26f`)
 - joergdietrich: `STY: capital letter for each parameter description and period at end in k-sample Anderson-Darling docstrings` (`f90390d`)
 - joergdietrich: `MAINT: combine a few simple statements in stats.anderson_ksamp` (`a2a077b`)
 - joergdietrich: `API: Change call signature of stats.anderson_ksamp` (`1626e0f`): Replace discrete=False with midrank=True to better explain what the difference between tests is. Update the docstrings accordingly.
 - joergdietrich: `MAINT: Add missing word to docstring of stats.anderson_ksamp` (`aad0972`)
 - joergdietrich: `MAINT: avoid computation of lj for continuous distributions in stats._anderson_ksamp_midrank; fix docstring for parameters` (`11b5a01`)
 - joergdietrich: `DOC: More verbose explanation of midrank parameter in stats.anderson_ksamp` (`7e8e03c`)
 - joergdietrich: `Merge branch 'k-sample-AD' of github.com:joergdietrich/scipy into k-sample-AD` (`6bce442`)
Contributor
 I'm not particularly comfortable with rebasing. I hope I got this right and the commit messages are okay now. If anything else needs fixing it'll probably have to wait ~2 weeks.
added the PR label Feb 19, 2014
 joergdietrich `DOC: Adapt calls in example to changed signature from 1626e0f` `6231a8d`
Contributor
 Anything missing to get this merged into 0.14.x?
added this to the 0.14.0 milestone Feb 24, 2014
Member
 Ralf, I think this is fine to merge. I didn't look at the details again, but, last time I went through this, I didn't see anything that would hold this up from the statistics side.
Member
 I don't think anything is missing. It looks like something went wrong with the rebase; I'll try to fix that tonight and merge this. The 0.14.0 milestone is already set for this PR.
added a commit that referenced this pull request Feb 24, 2014
 rgommers: `Merge branch 'pr/3183' into master. Review at #3183` (`a32a7ba`)
Member
 Merged in a32a7ba. Thanks @joergdietrich, @josef-pkt
closed this Feb 24, 2014