# Edited scipy/stats/stats.py via GitHub #87

Merged
Merged 3 commits into scipy:master on Oct 7, 2011


### 5 participants

Contributor
commented Oct 1, 2011
 Changes:

 **f_oneway**
 - conversion of input to numpy array now being used (instead of discarded)
 - some small notation changes

 **kruskal**
 - conversion of input to numpy array now being done
 - improved code efficiency
 ckuster `Edited scipy/stats/stats.py via GitHub` `0f2e475`
commented on the diff Oct 1, 2011
scipy/stats/stats.py
```diff
@@ -2148,15 +2148,15 @@ def f_oneway(*args):
     .. [2] Heiman, G.W.  Research Methods in Statistics. 2002.
     """
-    na = len(args)    # ANOVA on 'na' groups, each in it's own array
-    tmp = map(np.array,args)
```
 ckuster (Contributor): This line discards its result.
commented on the diff Oct 1, 2011
scipy/stats/stats.py
```diff
         raise ValueError("Need at least two groups in stats.kruskal()")
     n = map(len,args)
-    all = []
-    for i in range(len(args)):
-        all.extend(args[i].tolist())
-    ranked = list(rankdata(all))
-    T = tiecorrect(ranked)
-    args = list(args)
-    for i in range(len(args)):
```
 ckuster (Contributor): The individual ranked lists don't need to be stored.
scipy/stats/stats.py
```diff
     if T == 0:
         raise ValueError('All numbers are identical in kruskal')
+
+    j = np.insert(np.cumsum(n),0,0)
```
 josef-pkt (Member): `n` is still a list here, so `j = np.cumsum([0]+n)` should work.
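A quick sketch of what both forms produce (the group sizes below are made up for illustration):

```python
import numpy as np

# Hypothetical group sizes, as kruskal would have them after map(len, args)
n = [3, 2, 4]  # still a plain Python list at this point in the function

# Both expressions build the group-boundary indices [0, 3, 5, 9],
# so group i occupies ranked[j[i]:j[i+1]]:
j1 = np.cumsum([0] + n)             # list concatenation, then cumsum
j2 = np.insert(np.cumsum(n), 0, 0)  # cumsum coerces the list to an array

print(j1)  # [0 3 5 9]
print(j2)  # [0 3 5 9]
```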
scipy/stats/stats.py
```diff
@@ -2148,15 +2148,15 @@ def f_oneway(*args):
     .. [2] Heiman, G.W.  Research Methods in Statistics. 2002.
     """
-    na = len(args)    # ANOVA on 'na' groups, each in it's own array
-    tmp = map(np.array,args)
+    args = map(np.asarray,args) # convert to an numpy array
```
 josef-pkt (Member): space after comma
Member
 Looks good to me. A few spaces after commas would make it more PEP 8 compliant. Did you check whether these changes are covered by the test suite, and did you run it? Thanks, kruskal looks much better after this rewrite.
Member
 For reference: this is ticket http://projects.scipy.org/scipy/ticket/1527
Contributor
commented Oct 1, 2011
 Josef,

 On Oct 1, 2011, at 6:42 PM, Josef Perktold wrote:

 > Looks good to me. A few spaces after commas to be more pep-8 compliant.

 I'll make those changes and re-request the pull.

 > Did you check whether these changes are covered by the test suite, and did you run it?

 I didn't. I'll look for the tests in the distribution and run them.

 > Thanks, kruskal looks much better after this rewrite.

 No problem. I'll take a look at more of the stats.py file as part of my learning process. Thank you for the suggestions. I have a much better understanding of the numpy arrays now, and appreciate the performance tradeoff involved.

 Chris

 Dr. Christopher Kuster
 Assistant Professor of Mathematics
 Carroll University, Waukesha, WI
 Email: ckuster@carrollu.edu
 Phone: (262) 524-7140
Member
 Can you edit the current pull request? It would be easier to keep the discussion together.

 As an aside: I used scipy.stats.distributions to learn numpy and scipy. I only managed to get partway through the stats.stats rewrite/cleanup.

 And some advertising, for when you run out of things to do in scipy.stats: http://statsmodels.sourceforge.net/
 ckuster `Edited scipy/stats/stats.py via GitHub` `a7e79ff`
Owner
commented on `a7e79ff` Oct 2, 2011
 Includes a few style changes (added some spaces after commas). Also forced n to be a numpy array in kruskal, and used a stored variable for the number of arguments instead of calling len again.
Owner
 Changes passed the test suite. f_oneway is tested by the suite, but kruskal is not. I will try to add a test for kruskal on Monday.
 About testing for kruskal and similar functions in scipy.stats: we had a long discussion on this, and the main comparison with matlab and R is at http://projects.scipy.org/scipy/attachment/ticket/901/comparison_mannwhithey_oth.txt
 ckuster `Added a few more spaces to make f_oneway and kruskal more pep-8 compliant. Re-ran test suite to make sure that didn't break anything.` `67ba130`
Contributor
commented Oct 2, 2011
 I assume that what I am doing now is editing the current pull request. I'll comment and close and hope that's the right thing to do. :-) I'll check out the statsmodels page.
closed this Oct 2, 2011
reopened this Oct 2, 2011
Member
commented Oct 2, 2011
 Closing is probably not what you want to do. You can edit your branch, push to github, and it will show up here.
Member
 Looks good. I can add some tests for kruskal if you'd like.
Member
 I went ahead and created some tests: WarrenWeckesser@9807ac1
Member
 I think this is ready to go. Unless I hear otherwise (or unless someone does it before me), I'll push this tomorrow.
Member
 Fine with me. As a reminder: this is ticket http://projects.scipy.org/scipy/ticket/1527
Contributor
commented Oct 5, 2011
 Ok by me.

 On 10/5/11 8:44 AM, "Warren Weckesser" wrote:

 > I think this is ready to go. Unless I hear otherwise (or unless someone does it before me), I'll push this tomorrow.
Contributor
commented Oct 6, 2011
 Um... Am I late on this? Sounds good. This week has been busier than expected.

 My concern in testing is that any time the tests are valid for the current implementation, you may as well have run an ANOVA. Should there be a way to request an emulated p-value a la SPSS or similar? I don't see how that fits with the current implementation. I know that method is slow, but should it be available in scipy?

 Chris

 On Oct 3, 2011, at 9:14 PM, Warren Weckesser wrote:

 > Looks good. I can add some tests for kruskal if you'd like.
Member
 What do you mean by an "emulated p-value a la SPSS or similar"? I have never heard "emulated" in the context of p-values.

 Even in the current implementation the test relies less on distributional assumptions than the standard ANOVA, since it uses ranks instead of the original observed values. Or not?

 There are some efforts to develop permutation/bootstrap based tests in other python packages. (edit: mainly permutation)
Member
 > My concern in testing is that any time the tests are valid for the current implementation, you may as well have run an ANOVA. Should there be a way to request an emulated p-value a la SPSS or similar? I don't see how that fits with the current implementation. I know that method is slow, but should it be available in scipy? Chris

 Sorry, I don't understand your concern. The tests that I wrote simply check that the implementation of the Kruskal-Wallis test in stats.kruskal is correct.
Contributor
commented Oct 6, 2011
 The p-values used in the scipy version of the non-parametric tests are calculated using functions that are reasonable approximations for large enough data sets. For very small data sets, the correct p-value is pretty easy to find. First, you calculate the statistic for the data you have, and then you look at every possible combination of ranks and count how many of those combinations have a statistic the same as or more extreme than the one for your data.

 Example: data: 1,3,4 / 2,5 (statistic = 10/3)

     comb: 1,2,3/4,5 (statistic = 2)
           1,2,4/3,5 (statistic = 7/3)
           1,2,5/3,4 (statistic = 8/3)
           1,3,4/2,5 (statistic = 8/3)
           1,3,5/2,4 (statistic = 3)
           1,4,5/2,3 (statistic = 10/3)
           2,3,4/1,5 (statistic = 3)
           2,3,5/1,4 (statistic = 10/3)
           2,4,5/1,3 (statistic = 11/3)
           3,4,5/1,2 (statistic = 4)

 4 of the 10 possible rank combinations have a statistic of 10/3 or larger, so the p-value for the data is 0.4.

 Looking at all combinations gets too expensive well before the functional approximation is valid. In these in-between cases you can approximate the p-value by repeatedly creating sets with the same number of data points as in your experiment but with randomly assigned ranks. For each of these random sets, you calculate the test statistic, and you compute the p-value as the fraction of your random sets with a statistic the same as or more extreme than the one for your data.

 The current implementation is fine for large non-normal data sets (which is one of the reasons to use non-parametric tests), but gives the wrong p-value for small data sets (the other reason). "Emulated p-value" is probably not the correct terminology, but this feature does exist in some statistics packages. I have used it in SPSS (or PASW or whatever they call it now). It is very slow, and will not return the same p-value every time (because of the randomness), but the p-value you obtain is more accurate than the result from the functional approximation.

 Sorry... that is more than I meant to type...
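For what it's worth, the brute-force enumeration described above can be sketched as follows. This sketch uses the standard two-sided Kruskal-Wallis H statistic (no tie correction) rather than the simpler one-group statistic in the example, so the resulting p-value differs from the 0.4 above; the enumeration logic is the same.

```python
# Sketch: exact permutation p-value for a two-group rank test, by
# enumerating every way to assign the ranks 1..N to the two groups.
from itertools import combinations

def kruskal_h(groups_of_ranks):
    """H = 12/(N(N+1)) * sum(R_i^2/n_i) - 3(N+1), R_i = rank sum of group i."""
    N = sum(len(g) for g in groups_of_ranks)
    ssbn = sum(sum(g) ** 2 / float(len(g)) for g in groups_of_ranks)
    return 12.0 / (N * (N + 1)) * ssbn - 3 * (N + 1)

def exact_pvalue(group1_ranks, n2):
    """Fraction of all rank assignments whose H is >= the observed H."""
    n1 = len(group1_ranks)
    all_ranks = frozenset(range(1, n1 + n2 + 1))
    h_obs = kruskal_h([group1_ranks, sorted(all_ranks - set(group1_ranks))])
    hits = total = 0
    for g1 in combinations(sorted(all_ranks), n1):
        h = kruskal_h([g1, sorted(all_ranks - set(g1))])
        total += 1
        if h >= h_obs - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return hits / float(total)

# The thread's split: ranks 1,3,4 in one group, 2,5 in the other.
# With the symmetric H statistic this enumeration gives 0.8, not 0.4,
# because H cannot distinguish which group has the larger ranks.
p = exact_pvalue((1, 3, 4), 2)
```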
Contributor
commented Oct 6, 2011
 I have no problem with your tests. I'm saying that the implementation uses an approximation of the p-value that is invalid for mid-sized data sets (it says this in the function comments), and I wonder if there should be new functionality in the stats package to compute reasonable p-values in these cases. I'm pretty sure it would have to be a separate function.
Member
 Christopher,

 Thanks, I like verbose answers. It's a good summary, and I copied it to ticket #1527 so it's easier to find.

 That's essentially what I meant by permutation and bootstrap tests. The only `exact` test currently in scipy.stats is fisher_exact, which is actually very fast because it uses an efficient algorithm. It would be useful to have more like these in scipy.stats, but for a generic approach that can work with different tests I personally prefer to work in statsmodels, because there is less pressure to get it right the first time.

 If you are working on any of these tests, then improvements or enhancements will be very welcome, so we can get stats in python closer to (what was formerly known as) SPSS or similar.
Contributor
commented Oct 6, 2011

Josef,

Here is some code to approximate the p-values "more correctly". It would
replace the last line in kruskal.

I am posting this as commentary instead of a code change because I DON'T
THINK IT SHOULD BE INCORPORATED INTO THE CURRENT IMPLEMENTATION :-). This
method is slow. In fact, SPSS allows you to choose a maximum amount of
time it should spend trying to do the calculation. Also, the current
implementation gives an over-estimate of the p-value in the cases where it
isn't valid. I think that isn't so bad since if you have little enough
data that the chi-squared distribution is a poor approximation, you should
be more conservative in your conclusions anyway. Of course... I'm a
Numerical Analyst/Mathematical Modeler not a Statistician, so my
philosophical views on this matter may not be reasonable.

Anyway... Code.

```python
# functional approximation (current implementation)
#    pval = chisqprob(h, df)

# approximation via simulation (alternative, very slow implementation)
tot_count = 0
pas_count = 0
# There are a few ways of deciding when to end the following loop:
#     1) convergence of p-value (this could take a long time)
#     2) maximum iteration count
#     3) maximum time spent
# I think SPSS (now PASW) has at least 1 and 3 from this list as options.
while tot_count < 100000:  # this is the stupid way
    np.random.shuffle(ranked)
    # Compute sum^2/n for each group
    ssbn = 0.0
    for i in range(na):
        ssbn += square_of_sums(ranked[j[i]:j[i+1]]) / float(n[i])
    totaln = np.sum(n)
    htest = 12.0 / (totaln * (totaln + 1)) * ssbn - 3 * (totaln + 1)
    htest = htest / float(T)
    tot_count += 1
    if htest >= h:
        pas_count += 1
pval = float(pas_count) / float(tot_count)
return h, pval
```

-Chris


Member
 Hiding a discussion like this in the comments to a pull request is not the best location to find it again in future (and it is difficult to copy code). The statsmodels or scipy-user mailing lists would be more appropriate.

 I ran a few examples with your code, with 3 groups and 10 observations each, either squared normal or uniform integer distributed with different means or scales. My (unsystematic) impression was that the difference between chisquare and random permutation p-values is in the 0.005 range. I saw similar results with another test before. For most purposes this difference will not matter.

 Adding permutation p-values to scipy.stats would be possible even if they are more expensive to calculate, but it would be good to identify cases where it really matters, and to have guidelines in the docstring, or even automatic switching. My impression is that some of these tests have good correction factors that make the asymptotic distribution already a good approximation in small samples (>5 or >10 per group?).

 Where I would really like to see permutation, bootstrap or empirical p-values is in tests that don't have a standard asymptotic distribution, e.g. (my favorites) goodness of fit tests with estimated parameters (anderson-darling, kolmogorov-smirnov, ...).

 The last are my personal preference and opinion, notwithstanding that I welcome and review all enhancements and pull requests for scipy.stats.
pushed a commit to WarrenWeckesser/scipy that referenced this pull request Oct 7, 2011
 Warren Weckesser `BUG: stats: Merge pull request #87 (handle array_like in kruskal, improve code in kruskal and f_oneway)` `ea3d301`
merged commit `67ba130` into scipy:master Oct 7, 2011
referenced this pull request Apr 25, 2013
Closed

#### stats.kruskal not working as advertised (Trac #1527) #2052

referenced this pull request Aug 9, 2015
Closed