Edited scipy/stats/stats.py via GitHub #87

Merged
merged 3 commits into scipy:master Oct 7, 2011


Contributor
ckuster commented Oct 1, 2011

Changes

f_oneway
conversion of the input to numpy arrays is now used (instead of discarded)
some small notation changes

kruskal
conversion of the input to numpy arrays is now done
improved code efficiency

@ckuster ckuster commented on the diff Oct 1, 2011
scipy/stats/stats.py
@@ -2148,15 +2148,15 @@ def f_oneway(*args):
.. [2] Heiman, G.W. Research Methods in Statistics. 2002.
"""
- na = len(args) # ANOVA on 'na' groups, each in it's own array
- tmp = map(np.array,args)
ckuster
ckuster Oct 1, 2011 Contributor

This line discards its result

@ckuster ckuster commented on the diff Oct 1, 2011
scipy/stats/stats.py
raise ValueError("Need at least two groups in stats.kruskal()")
n = map(len,args)
- all = []
- for i in range(len(args)):
- all.extend(args[i].tolist())
- ranked = list(rankdata(all))
- T = tiecorrect(ranked)
- args = list(args)
- for i in range(len(args)):
ckuster
ckuster Oct 1, 2011 Contributor

The individual ranked lists don't need to be stored

@josef-pkt josef-pkt commented on an outdated diff Oct 1, 2011
scipy/stats/stats.py
if T == 0:
raise ValueError('All numbers are identical in kruskal')
+
+ j = np.insert(np.cumsum(n),0,0)
josef-pkt
josef-pkt Oct 1, 2011 Member

n is still a list here, so

j = np.cumsum([0]+n)

should work
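
For illustration, a small self-contained sketch (toy data, not the PR's actual diff) of how those cumulative offsets are used to slice the pooled, ranked data back into groups:

import numpy as np
from scipy.stats import rankdata

args = [np.array([1.0, 3.0, 4.0]), np.array([2.0, 5.0])]   # two toy groups
n = list(map(len, args))                  # [3, 2]
ranked = rankdata(np.concatenate(args))   # ranks of the pooled data
j = np.cumsum([0] + n)                    # group boundaries: array([0, 3, 5])
groups = [ranked[j[i]:j[i+1]] for i in range(len(args))]
# groups -> [array([ 1.,  3.,  4.]), array([ 2.,  5.])]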

@josef-pkt josef-pkt commented on an outdated diff Oct 1, 2011
scipy/stats/stats.py
@@ -2148,15 +2148,15 @@ def f_oneway(*args):
.. [2] Heiman, G.W. Research Methods in Statistics. 2002.
"""
- na = len(args) # ANOVA on 'na' groups, each in it's own array
- tmp = map(np.array,args)
+ args = map(np.asarray,args) # convert to an numpy array
josef-pkt
josef-pkt Oct 1, 2011 Member

space after comma

Member

Looks good to me. A few spaces after commas to be more pep-8 compliant.

Did you check whether these changes are covered by the test suite, and did you run it?

Thanks, kruskal looks much better after this rewrite.

Member

As a reference: this is ticket http://projects.scipy.org/scipy/ticket/1527

Contributor
ckuster commented Oct 1, 2011

Josef,

On Oct 1, 2011, at 6:42 PM, Josef Perktold wrote:

Looks good to me. a few spaces after commas to be more pep-8 compliant

I'll make those changes and re-request the pull.

Did you check whether these changes are covered by the test suite, and did you run it?

I didn't. I'll look for the tests in the distribution and run them.

Thanks, kruskal looks much better after this rewrite.

No problem. I'll take a look at more of the stats.py file as part of my learning process.

Thank you for the suggestions. I have a much better understanding of numpy arrays now, and appreciate the performance tradeoffs involved.

Reply to this email directly or view it on GitHub:
#87 (comment)

Chris


Dr. Christopher Kuster
Assistant Professor of Mathematics
Carroll University
Waukesha, WI

Email: ckuster@carrollu.edu
Phone: (262) 524-7140

Member

Can you edit the current pull request? It would be easier to keep the discussion together.

As an aside: I used scipy.stats.distributions to learn numpy and scipy. I only managed to get partway through the stats.stats rewrite/cleanup.

And some advertising, for when you run out of things to do in scipy.stats: http://statsmodels.sourceforge.net/

Owner
ckuster commented on a7e79ff Oct 2, 2011

Includes a few style changes (added some spaces after commas)
Also forced n to be a numpy array in kruskal and used a stored variable for the number of arguments instead of calling len again.

Owner

The changes passed the test suite.
f_oneway is tested by the suite, but kruskal is not.

I will try to add a test for kruskal on Monday.

About testing for kruskal and similar functions in scipy.stats:

we had a long discussion on this, and the main comparison with matlab and R is at http://projects.scipy.org/scipy/attachment/ticket/901/comparison_mannwhithey_oth.txt

@ckuster ckuster Added a few more spaces to make f_oneway and kruskal more pep-8 compliant.

Re-ran test suite to make sure that didn't break anything.
67ba130
Contributor
ckuster commented Oct 2, 2011

I assume that what I am doing now is editing the current pull request. I'll comment and close and hope that's the right thing to do. :-)

I'll check out the statsmodels page.

@ckuster ckuster closed this Oct 2, 2011
@charris charris reopened this Oct 2, 2011
Member
charris commented Oct 2, 2011

Closing is probably not what you want to do. You can edit your branch, push to github, and it will show up here.

Member

Looks good. I can add some tests for kruskal if you'd like.

Member

I went ahead and created some tests: WarrenWeckesser@9807ac1
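
For reference, a minimal sketch of what such a test could look like (a hand-computed H for two small tie-free groups; illustrative only, not the contents of that commit):

import numpy as np
from scipy import stats

def test_kruskal_two_groups():
    # pooled ranks are 1,3,4 and 2,5 (no ties), so
    # H = 12/(5*6) * (8**2/3 + 7**2/2) - 3*6 = 1/3
    x = [1, 3, 4]
    y = [2, 5]
    h, p = stats.kruskal(x, y)
    np.testing.assert_almost_equal(h, 1.0 / 3.0)
    np.testing.assert_almost_equal(p, stats.chi2.sf(1.0 / 3.0, 1))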

Member

I think this is ready to go. Unless I hear otherwise (or unless someone does it before me), I'll push this tomorrow.

Member

Fine with me. As a reminder: this is ticket http://projects.scipy.org/scipy/ticket/1527

Contributor
ckuster commented Oct 5, 2011

Ok by me.

On 10/5/11 8:44 AM, "Warren Weckesser"
reply@reply.github.com
wrote:

I think this is ready to go. Unless I hear otherwise (or unless someone
does it before me), I'll push this tomorrow.

Reply to this email directly or view it on GitHub:
#87 (comment)

Contributor
ckuster commented Oct 6, 2011

Um... Am I late on this? Sounds good.

This week has been busier than expected.

My concern in testing is that any time the tests are valid for the current implementation, you may as well have run with ANOVA. Should there be a way to request an emulated p-value ala SPSS or similar?

I don't see how that fits with the current implementation. I know that method is slow, but should it be available in scipy?

Chris

On Oct 3, 2011, at 9:14 PM, Warren Weckesserreply@reply.github.com wrote:

Looks good. I can add some tests for kruskal if you'd like.

Reply to this email directly or view it on GitHub:
#87 (comment)

Member

What do you mean by an "emulated p-value ala SPSS or similar"? I have never heard "emulated" in the context of p-values.

Even in the current implementation the test relies less on distributional assumptions than the standard ANOVA, since it uses ranks instead of the original observed values. Or not?

There are some efforts to develop permutation/bootstrap based tests in other python packages. (edit: mainly permutation)

Member

My concern in testing is that any time the tests are valid for the current implementation, you may as well have run with ANOVA. Should there be a way to request an emulated p-value ala SPSS or similar? I don't see how that fits with the current implementation. I know that method is slow, but should it be available in scipy? Chris

Sorry, I don't understand your concern. The tests that I wrote simply check that the implementation of the Kruskal-Wallis test in stats.kruskal is correct.

Contributor
ckuster commented Oct 6, 2011

The p-values used in the scipy version of the non-parametric tests are calculated using functions that are reasonable approximations for large enough data sets.

For very small data sets, the correct p-value is pretty easy to find. First, you calculate the statistic for the data you have, and then you look at every possible combination of ranks and count how many of those combinations have a statistic at least as extreme as the one for your data.


Example:

data: 1,4,5/2,3 (statistic = 10/3, the mean rank of the first group)
comb: 1,2,3/4,5 (statistic = 2)
      1,2,4/3,5 (statistic = 7/3)
      1,2,5/3,4 (statistic = 8/3)
      1,3,4/2,5 (statistic = 8/3)
      1,3,5/2,4 (statistic = 3)
      1,4,5/2,3 (statistic = 10/3)
      2,3,4/1,5 (statistic = 3)
      2,3,5/1,4 (statistic = 10/3)
      2,4,5/1,3 (statistic = 11/3)
      3,4,5/1,2 (statistic = 4)

4 of the 10 possible rank combinations have a statistic of 10/3 or larger, so the p-value for the data is 0.4.
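
To make that concrete, a short sketch (a hypothetical helper, not part of the PR) that enumerates all rank combinations and uses the mean rank of the first group as the statistic, as in the example above:

import itertools
import numpy as np

def exact_pvalue(group1_ranks, n_total):
    # enumerate every possible assignment of ranks to the first group and
    # count how often the statistic is at least as extreme as observed
    observed = np.mean(group1_ranks)
    combos = list(itertools.combinations(range(1, n_total + 1), len(group1_ranks)))
    count = sum(1 for c in combos if np.mean(c) >= observed)
    return count / float(len(combos))

print(exact_pvalue([1, 4, 5], 5))   # 0.4, i.e. 4 of the 10 combinations

The same enumeration idea works with the Kruskal-Wallis H itself as the statistic; only the function that computes the statistic changes.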


Looking at all combinations gets too expensive well before the function approximation is valid. In these in-between cases you can approximate the p-value by repeatedly creating sets with the same number of data points as in your experiment but with randomly assigned ranks. For each of these random sets you calculate the test statistic, and the p-value is then the fraction of the random sets with a statistic at least as extreme as the one for your data.

The current implementation is fine for large non-normal data sets (which is one of the reasons to use non-parametric tests), but gives the wrong p-value for small data sets (the other reason).

"Emulated p-value" is probably not the correct terminology, but this feature does exist in some statistics packages. I have used it in SPSS (or PASW or whatever they call it now).

It is very slow, and will not return the same p-value every time (because of the randomness), but the p-value you obtain is more accurate than the result from the functional approximation.

sorry... that is more than I meant to type...

On Oct 5, 2011, at 10:02 PM, Josef Perktold wrote:

What do you mean with an "emulated p-value ala SPSS or similar"? I never heard emulated in the context of p-values.

Even in the current implementation the test relies less on distributional assumptions than the standard ANOVA, since it uses ranks instead of the original observed values. Or not?

There are some efforts to develop bootstrap based tests in other python packages.

Reply to this email directly or view it on GitHub:
#87 (comment)



Contributor
ckuster commented Oct 6, 2011

I have no problem with your tests.

I'm saying that the implementation uses an approximation of the p-value that is invalid for mid-sized data sets (it says this in the function comments), and I wonder whether there should be new functionality in the stats package to compute reasonable p-values in these cases. I'm pretty sure it would have to be a separate function.

On Oct 5, 2011, at 10:07 PM, Warren Weckesser wrote:

My concern in testing is that any time the tests are valid for the current implementation, you may as well have run with ANOVA. Should there be a way to request an emulated p-value ala SPSS or similar? I don't see how that fits with the current implementation. I know that method is slow, but should it be available in scipy? Chris

Sorry, I don't understand your concern. The tests that I wrote simply check that the implementation of the Kruskal-Wallis test in stats.kruskal is correct.

Reply to this email directly or view it on GitHub:
#87 (comment)



Member

Christopher,

Thanks, I like verbose answers, it's a good summary and I copied it to ticket #1527 so it's easier to find.

That's essentially what I meant with permutation and bootstrap tests.

The only exact test currently in scipy.stats is fisher_exact, which is actually very fast because it uses an efficient algorithm. It would be useful to have more like these in scipy.stats, but for a generic approach that can work with different tests I personally prefer to work in statsmodels, because there is less pressure to get it right the first time.

If you are working on any of these tests, then improvements or enhancements will be very welcome, so we can get stats in python closer to (what was formerly known as) SPSS or similar.

Contributor
ckuster commented Oct 6, 2011

Josef,

Here is some code to approximate the p-values "more correctly". It would
replace the last line in kruskal.

I am posting this as commentary instead of a code change because I DON'T
THINK IT SHOULD BE INCORPORATED INTO THE CURRENT IMPLEMENTATION :-). This
method is slow. In fact, SPSS allows you to choose a maximum amount of
time it should spend trying to do the calculation. Also, the current
implementation gives an over-estimate of the p-value in the cases where it
isn't valid. I think that isn't so bad since if you have little enough
data that the chi-squared distribution is a poor approximation, you should
be more conservative in your conclusions anyway. Of course... I'm a
Numerical Analyst/Mathematical Modeler, not a Statistician, so my
philosophical views on this matter may not be reasonable.

Anyway... Code.

# functional approximation (current implementation)
#    pval = chisqprob(h, df)
# approximation via simulation (alternative, very slow implementation)
tot_count = 0
pas_count = 0
# There are a few ways of deciding when to end the following loop
#     1) convergence of p-value (this could take a long time)
#     2) maximum iteration count
#     3) maximum time spent
# I think SPSS (now PASW) has at least 1 and 3 from this list as options
while tot_count < 100000:  # this is the stupid way
    np.random.shuffle(ranked)
    ssbn = 0.0
    # Compute sum^2/n for each group of the shuffled ranks
    for i in range(na):
        ssbn = ssbn + square_of_sums(ranked[j[i]:j[i+1]]) / float(n[i])
    totaln = np.sum(n)
    htest = 12.0 / (totaln * (totaln + 1)) * ssbn - 3 * (totaln + 1)
    htest = htest / float(T)
    tot_count += 1
    if htest >= h:
        pas_count += 1
pval = float(pas_count) / float(tot_count)
return h, pval

-Chris

On 10/6/11 8:26 AM, "Josef Perktold"
reply@reply.github.com
wrote:

Christopher,

Thanks, I like verbose answers, it's a good summary and I copied it to
ticket #1527 so it's easier to find.

That's essentially what I meant with permutation and bootstrap tests.

The only exact test currently in scipy.stats is fisher_exact. Which is
actually very fast because it uses an efficient algorithm. It would be
useful to have more like these in scipy.stats, but for a generic approach
that can work with different tests, personally, I prefer to work in
statsmodels, because there is less pressure to get it right the first
time.

If you are working on any of these tests, then improvements or
enhancements will be very welcome, so we can get stats in python closer
to (what was formerly known as) SPSS or similar.

Reply to this email directly or view it on GitHub:
#87 (comment)

Member

Hiding a discussion like this in the comments of a pull request is not the best way to keep it findable in the future (and it is difficult to copy code from here). The statsmodels or scipy-user mailing lists would be more appropriate.

I ran a few examples with your code, with 3 groups of 10 observations each, with data that was either squared-normal or uniform-integer distributed, with different means or scales. My (unsystematic) impression was that the difference between the chisquare and the random-permutation p-values is in the 0.005 range. I saw similar results with another test before. For most purposes this difference will not matter.

Adding permutation p-values to scipy.stats would be possible even if they are more expensive to calculate, but it would be good to identify cases when it really matters, and have guidelines in the docstring, or even automatic switching.

My impression is that some of these tests have good correction factors that make the asymptotic distribution already a good approximation in small samples (>5 or >10 per group?)

Where I would really like to see permutation, bootstrap or empirical p-values is in tests that don't have a standard asymptotic distribution, e.g. (my favorites) goodness of fit tests with estimated parameters (anderson-darling, kolmogorov-smirnov, ...)
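
Just to make that concrete, a rough sketch (not scipy code, and not a proposal for this PR) of a parametric-bootstrap p-value for a normality Kolmogorov-Smirnov test with estimated parameters, where the parameters are re-estimated on every simulated sample:

import numpy as np
from scipy import stats

def ks_norm_bootstrap_pvalue(x, nsim=1000, seed=0):
    # observed KS statistic, with loc/scale estimated from the data
    rng = np.random.RandomState(seed)
    loc, scale = np.mean(x), np.std(x)
    d_obs = stats.kstest(x, 'norm', args=(loc, scale))[0]
    count = 0
    for _ in range(nsim):
        sim = rng.normal(loc, scale, size=len(x))
        # re-estimate on the simulated sample so the null distribution of
        # the statistic reflects the estimation step
        d_sim = stats.kstest(sim, 'norm', args=(np.mean(sim), np.std(sim)))[0]
        if d_sim >= d_obs:
            count += 1
    return count / float(nsim)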

These last points are my personal preference and opinion, notwithstanding that I welcome and review all enhancements and pull requests for scipy.stats.

@WarrenWeckesser WarrenWeckesser pushed a commit to WarrenWeckesser/scipy that referenced this pull request Oct 7, 2011
Warren Weckesser BUG: stats: Merge pull request #87 (handle array_like in kruskal, improve code in kruskal and f_oneway)
ea3d301
@WarrenWeckesser WarrenWeckesser merged commit 67ba130 into scipy:master Oct 7, 2011