
ENH: Add k-sample Anderson-Darling test to stats module #3183

Closed
wants to merge 31 commits into from

Conversation

joergdietrich
Contributor

This PR adds the k-sample Anderson-Darling test for continuous distributions as described by Scholz & Stephens (1987, Journal of the American Statistical Association, Vol. 82, pp. 918-924) to the stats module.

for i in arange(0, k):
    fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
    Mij = fij.cumsum()
    inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
Member

Is this equation (6) in Scholz and Stephens, and not equation (3)?

Discrete parent population? Not continuous, as the docstring says in lines 1174-1175.

Contributor Author

Yes, it is. The docstring needs to be fixed.
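For reference, a self-contained sketch of the eq. (6) statistic that the excerpt above computes; the wrapper function and pooling setup here are assumptions, while the inner loop follows the diff:

import numpy as np

def a2_kn_eq6(samples):
    # k-sample Anderson-Darling statistic, eq. (6) of Scholz & Stephens
    # (1987), which allows ties in the pooled sample
    samples = [np.asarray(s) for s in samples]
    k = len(samples)
    n = np.array([s.size for s in samples])
    Z = np.sort(np.hstack(samples))   # pooled, sorted sample of size N
    N = Z.size
    Zstar = np.unique(Z)              # distinct pooled values
    # multiplicity of each distinct value except the largest
    lj = (Z.searchsorted(Zstar[:-1], side='right')
          - Z.searchsorted(Zstar[:-1], side='left'))
    Bj = lj.cumsum()                  # pooled counts <= Zstar[j]
    A2kN = 0.
    for i in range(k):
        fij = np.array([(samples[i] == zj).sum() for zj in Zstar[:-1]])
        Mij = fij.cumsum()            # counts in sample i <= Zstar[j]
        inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
        A2kN += inner.sum() / n[i]
    return A2kN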

@josef-pkt
Member

Good, I never tried my hand at the discrete Anderson-Darling tests.
I think we should add _discrete to the name of the function.

I'm asking a question on the mailing list about the signature.

@josef-pkt
Member

If we are planning on two different functions for discrete and continuous, then it might be better to outsource the p-value calculation into a (private) helper function.

@joergdietrich
Contributor Author

I'll change the signature according to Ralf's suggestion. Regarding discrete and continuous distributions, I now have code to compute equations 3, 6, and 7 of Scholz and Stephens. So we could include all of them, or just 6 and 7, where the latter would also apply to continuous distributions.

@josef-pkt
Member

One question that's not clear to me when I read the formulas:
Does eq. (3) give the same answer as eq. (6) even when the values are discrete and there are ties?

The two main questions are: how many versions of the calculation do we need? and what's the best way to implement them?

To the second: I'm pretty sure the continuous version can be made without loops by using np.searchsorted (see the sketch after this comment). For the discrete version it might be possible to use np.searchsorted or the Cython rankdata function that is already in scipy.stats.

I think getting a very fast version is not a requirement for this PR, but it's also possible that a version with no Python loops is not very difficult to get with the existing tools.

What are the options for a user if we have eqs. (3), (6), and (7)?
If (3) and (6) can use the same code, then the only choice would be whether to use the mean rank in case of ties.
If only the continuous eq. (3) can get a fast algorithm, then we should also allow users to choose that.

(I'm a bit distracted because I also have pull requests in statsmodels where I need to catch up with some readings to understand the topics.)
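For instance, a minimal sketch (not the PR's code) of the loop-free multiplicity counting that np.searchsorted enables:

import numpy as np

z = np.sort(np.array([1, 1, 2, 3, 3, 3, 5]))  # pooled, sorted sample
zstar = np.unique(z)                          # distinct values

# multiplicity of each distinct value, with no Python loop
lj = z.searchsorted(zstar, side='right') - z.searchsorted(zstar, side='left')
# lj is array([2, 1, 3, 1])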

@josef-pkt
Member

Related: from what I remember, eq. (5), a weighted sum of chi-square distributions, shows up in some of the discrete GOF tests. I finally wrote the code for getting the p-values from that, but I don't have any truncation rule for the infinite sum.

I don't think it's really relevant in this case because, according to the references that I looked at, the Anderson-Darling statistic converges fast and Stephens' approximation seems to work pretty well. However, I didn't look much at discrete cases.
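(Once the weights are truncated to a finite set, a generic way to approximate such a p-value is plain Monte Carlo. A sketch under that assumption, not Josef's code; the function name is made up:)

import numpy as np

def wsum_chi2_sf(x, weights, df=1, nsim=200000, seed=0):
    # P(sum_i w_i * chi2_df > x), estimated by simulation; the caller is
    # assumed to have already truncated the infinite sum of weights
    rng = np.random.RandomState(seed)
    draws = rng.chisquare(df, size=(nsim, len(weights))).dot(np.asarray(weights))
    return (draws > x).mean()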

1. Change call signature to have array of arrays and optional keyword to
   specify which version of the k-sample AD test should be computed.

2. Get rid of all inner loops and list comprehensions by using
   np.searchsorted.

3. Both versions given by Scholz & Stephens for discrete samples can be
   computed now.
@joergdietrich
Contributor Author

I managed to rewrite everything inside the outer loop, and the determination of the multiplicity, using np.searchsorted. Thanks for this pointer, I wasn't aware of the power of this function. Unfortunately, the speed-up is less than one percent per outer loop for a large range of sample sizes, so the list comprehension was already doing pretty well. This change mostly benefits large numbers of samples, which may not be very common.

Eqs. (3), (6), and (7) always disagree.

@argriffing
Contributor

The test failure is because a few days ago TravisCI decided to start using less precision when it calculates small cosine integrals, for some reason.

@joergdietrich
Contributor Author

Any further comments on this?

one for continuous distributions and two for discrete
distributions, in which ties between samples may occur. The latter
variant of the test is also applicable to continuous data. By
default, this routine computes the test for continuous and
Member

This statement looks incorrect (only one p-value is returned) and would also be strange. The default is continuous only, right?

@rgommers
Member

rgommers commented Feb 1, 2014

The function needs to be added in stats/__init__.py in order for it to show up in the documentation.

tm = b0 + b1 / math.sqrt(m) + b2 / m
pf = np.polyfit(tm, log(np.array([0.25, 0.1, 0.05, 0.025, 0.01])), 2)
if Tk < tm.min() or Tk > tm.max():
    warnings.warn("approximate p-value will be computed by extrapolation")
Member

Is this warning needed? It shows up in most of the test cases, so I'm guessing it's not that uncommon (didn't check). If so, adding a note in the docstring might make more sense. If the warning has to be kept, it shouldn't show up in the test output (it can be silenced with a warnings.catch_warnings() block if needed).

Member

I'm not sure what the best pattern for cases like this is. I don't know how good the extrapolation is; it might have quite a large error in some ranges.
I have something similar for tables of p-values without extrapolation:

  • mention the range of extrapolation only in the docstring (it's just lower precision than interpolation)
  • keep the warning, as here
  • truncate (without extrapolation, some packages, and some of my functions, just return the boundary value 0.25 or 0.01; for a text return it would be '<0.01' or '>0.25'); see the sketch after this list

For most use cases the exact p-value outside [0.01, 0.25] doesn't really matter, and just mentioning it in the docstring would be enough. But I guess there are multiple-testing applications where smaller p-values are relevant, and users need to be aware that those are not very precise.
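The truncation option could look roughly like this (a sketch against the variable names of the excerpt above; the function name is made up):

import math
import numpy as np

def truncated_pvalue(Tk, tm, pf):
    # clamp the statistic to the tabulated range instead of warning,
    # so the returned p-value stays within [0.01, 0.25]
    return math.exp(np.polyval(pf, np.clip(Tk, tm.min(), tm.max())))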

Contributor Author

I don't think the quality of the interpolation is known. Scholz & Stephens vary the polynomial order depending on the number of samples and provide no guidance for what a general procedure should use. The test cases are taken from Scholz and Stephens and happen to be cases where the null hypothesis can be rejected at better than the 1% level. Given the unknown level of accuracy I'd prefer to keep the warning, unless there's a strong preference to move it to the docstring.

Member

OK, that's fine with me then.

Member

The warning is fine with me too.
I don't have a strong opinion, given that I don't know how good the extrapolation is.
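For context, a self-contained sketch of the p-value scheme in the excerpt above: a quadratic fit of the log significance levels against the critical values tm, evaluated at the observed statistic (the function name is an assumption):

import math
import numpy as np

def _approx_pvalue(Tk, tm):
    # fit log(p) as a quadratic in the critical values tm, then evaluate
    # at the observed statistic Tk; values of Tk outside
    # [tm.min(), tm.max()] are extrapolated, hence the warning above
    levels = np.array([0.25, 0.1, 0.05, 0.025, 0.01])
    pf = np.polyfit(tm, np.log(levels), 2)
    return math.exp(np.polyval(pf, Tk))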

@rgommers
Member

rgommers commented Feb 1, 2014

Did you compare this against anderson? If you draw one set of samples from a distribution and specify the same distribution for anderson, and then test those against a single other set of samples, then after enough runs the test statistic and p-values should be the same within some tolerance (at least that's what I expect).
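A sketch of that comparison setup, using the names this PR ends up adding (both calls unpack to statistic, critical values, and significance; the actual tolerance check is left out):

import numpy as np
from scipy import stats

rng = np.random.RandomState(12345)
x = rng.normal(size=500)        # sample drawn from the null distribution
y = rng.normal(size=100000)     # large second sample from the same distribution

# one-sample test of x against the normal distribution itself
A2_one, crit_one, sig_one = stats.anderson(x, dist='norm')

# k-sample test of x against the large reference sample
A2_k, crit_k, p_k = stats.anderson_ksamp([x, y])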

@rgommers
Member

rgommers commented Feb 1, 2014

I don't have the reference and Josef understands this much better than I do anyway, so I'll let him judge the correctness of the stats.


A2akN = 0.
lj = Z.searchsorted(Zstar, 'right') - Z.searchsorted(Zstar, 'left')
Bj = Z.searchsorted(Zstar) + lj / 2.
Member

Z.searchsorted(Zstar) is the same as searchsorted with side='left' (the default); save the result and reuse it.
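That is, something like (a sketch of the suggested reuse, with Z and Zstar as in the excerpt above):

left = Z.searchsorted(Zstar)                     # side='left' is the default
lj = Z.searchsorted(Zstar, side='right') - left
Bj = left + lj / 2.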

@coveralls

Coverage Status

Coverage remained the same when pulling 24931cb on joergdietrich:k-sample-AD into fd99d3f on scipy:master.

@rgommers
Member

rgommers commented Feb 7, 2014

Example:

More verbose explanation of midrank parameter

Should be something like:

DOC: more verbose explanation of midrank parameter in stats.anderson_ksamp

…or k-sample Anderson-Darling test

1. Change call signature to have array of arrays and optional keyword to
   specify which version of the k-sample AD test should be computed.

2. Get rid of all inner loops and list comprehensions by using
   np.searchsorted.

3. Both versions given by Scholz & Stephens for discrete samples can be
   computed now.
- Replace discrete=False with midrank=True to better explain what the
  difference between tests is. Update the docstrings accordingly.
…_anderson_ksamp_midrank; fix docstring for parameters
@joergdietrich
Contributor Author

I'm not particularly comfortable with rebasing. I hope I got this right and the commit messages are okay now. If anything else needs fixing it'll probably have to wait ~2 weeks.

@pv added the PR label Feb 19, 2014
@joergdietrich
Contributor Author

Anything missing to get this merged into 0.14.x?

@josef-pkt added this to the 0.14.0 milestone Feb 24, 2014
@josef-pkt
Member

Ralf, I think this is fine to merge. I didn't look at the details again, but, last time I went through this, I didn't see anything that would hold this up from the statistics side.

@rgommers
Member

I don't think anything is missing. Looks like something went wrong with the rebase, I'll try to fix that tonight and merge this. 0.14.0 milestone is already set for this PR.

rgommers added a commit that referenced this pull request Feb 24, 2014
@rgommers
Member

Merged in a32a7ba. Thanks @joergdietrich, @josef-pkt
