Fix mstats.kurtosistest, and test coverage for skewtest/normaltest #3008

Merged
merged 6 commits into scipy:master from rgommers:pull-2673 Nov 10, 2013

Conversation

Owner

rgommers commented Oct 19, 2013

Supersedes gh-2673

Coverage Status

Coverage remained the same when pulling 2dabd6d on rgommers:pull-2673 into b7aa678 on scipy:master.

@josef-pkt josef-pkt commented on the diff Oct 20, 2013

scipy/stats/mstats_basic.py
@@ -1682,10 +1681,15 @@ def kurtosistest(a, axis=0):
A = 6.0 + 8.0/sqrtbeta1 * (2.0/sqrtbeta1 + np.sqrt(1+4.0/(sqrtbeta1**2)))
term1 = 1 - 2./(9.0*A)
denom = 1 + x*ma.sqrt(2/(A-4.0))
- denom[denom < 0] = masked
+ if np.ma.isMaskedArray(denom):
+ # For multi-dimensional array input
+ denom[denom < 0] = masked
@josef-pkt

josef-pkt Oct 20, 2013

Member

I don't see masked defined in this function.

@rgommers

rgommers Oct 20, 2013

Owner
In [1]: np.ma.masked
Out[1]: masked
@josef-pkt

josef-pkt Oct 20, 2013

Member

I finally found it; I had searched for masked with additional whitespace.

@josef-pkt

josef-pkt Oct 21, 2013

Member

I don't see, from reading the code, when denom < 0 can happen.
However, I think this could be replaced by a check for n < 5, or that could be added as an additional condition for cases that should be masked.

The same lines are missing in skewtest.

Member

josef-pkt commented Oct 20, 2013

Change looks good; I'm not sure the functions are good.

Do they behave correctly if the number of valid observations in a column is too small?
(The warning is for when the approximation of the distribution of the test statistic is not good. If the number of valid observations is very small, then the calculations don't make sense: nobs - 3, ...)

The stats versions raise an exception when the sample is too small.
I don't see that the mstats versions check this, and I doubt it's handled implicitly by the masking.

Owner

rgommers commented Oct 20, 2013

@evgeniburovski sent me a PR to make the small-sample behavior identical. EDIT: that PR is merged into this one.

I didn't intend to close the Statistics Review tickets on these functions yet, just fix them to not be hopelessly broken and finish up gh-2673.

Coverage Status

Coverage remained the same when pulling ae17bb44647dc714d28a2cc66de493ba12e27a2a on rgommers:pull-2673 into b7aa678 on scipy:master.

Member

josef-pkt commented Oct 21, 2013

I just prefer that we get functions fixed so we don't have to look at them again for a few years, unless it's just a quick one-line fix (which this was initially).

Owner

rgommers commented Oct 21, 2013

@josef-pkt I agree in principle, but for me the priority should be to fix up the stats functions first and the mstats versions after that. So I fixed the obvious issues and added decent tests. If you want the full review/doc/fixes then this PR may sit here for a while.

katherinehuang and others added some commits Aug 1, 2013

@katherinehuang @rgommers katherinehuang BUG: fix bug in mstats.kurtosistest
Deleted astype(), because it is not an attribute of int (which could
be returned instead of an ndarray).
7e7aa7d
@rgommers rgommers BUG: fix several bugs in the mstats normality tests.
Also provide decent test coverage.
ff7470e
@rgommers rgommers TST: clean up test_mstats_basic.py for docstrings in tests 2d87ba7
@rgommers rgommers BUG: fix errors in mstats.ttest_rel and mstats.ttest_ind.
Change default of axis kw to 0 in ttest_rel.  This is OK without warning
because the function didn't work before anyway.

Also provide basic test coverage (comparison against nonmasked version).
Testing with various masked array inputs to be done.

Closes gh-3047.
32051ab
@rgommers rgommers BUG: fix issues in mstats.ttest_1samp. Add axis keyword. 9511568
Owner

rgommers commented Nov 10, 2013

Rebased and fixed ttest issues in gh-3047 plus a bunch of other ones in the ttest functions.

@josef-pkt I'd like to merge this. Can't fix every last thing about these mstats functions now but that's no good reason to not merge bug fixes that are good to go.

Member

josef-pkt commented Nov 10, 2013

The only question: is it "standard" to return (nan, nan) for empty arrays?
(I would have returned an empty array.)

 +    if a.size == 0 or b.size == 0:
 +        return (np.nan, np.nan)

Otherwise looks fine (I'm not going to look for missing pieces).

Coverage Status

Coverage remained the same when pulling c175623 on rgommers:pull-2673 into 8594f29 on scipy:master.

Owner

rgommers commented Nov 10, 2013

It depends on the function. It was already returning nan (or crashing in a few cases), so I didn't think too hard about it. But I think nan makes sense here. Empty usually makes sense for functions that return an array of the same shape as the input array. Here t and prob are scalars (or 1-D arrays for n-D input); returning empty instead of a scalar result would be odd.
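The scalar-versus-array distinction can be illustrated with plain numpy, using np.mean as a stand-in for the returned statistic:

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Full reduction: one scalar per call, like t and prob for 1-D input.
whole = np.mean(x)

# Reduction along an axis of n-D input: a 1-D array of results,
# one entry per data set.
per_column = np.mean(x, axis=0)
```

For a scalar-returning reduction, substituting an empty array would change the result's type and shape contract, which is the oddness being pointed out.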


Owner

rgommers commented Nov 10, 2013

Travis error was a server glitch, can be ignored.

Member

josef-pkt commented Nov 10, 2013

I can think of use cases for both, where I want nan and where I want empty. But I don't currently have a use case for empty arrays.

I think then that the masked function should return masked instead of nan in this case, to follow the pattern of mean() versus ma.mean():

>>> np.ma.mean([])
masked
>>> np.ma.sum([])
0.0
>>> np.ma.var([])
0.0
>>> np.ma.std([])
0.0
>>> np.std([])
0.0
>>> np.__version__
'1.6.1'

(I don't know why std is not nan.)

Owner

rgommers commented Nov 10, 2013

Currently masked is never returned, not even if all input is masked:

In [32]: x
Out[32]: 
masked_array(data = [-- -- --],
             mask = [ True  True  True],
       fill_value = 1e+20)


In [33]: stats.mstats.ttest_rel(x, x)
Warning: divide by zero encountered in double_scalars
Out[33]: 
(array(1.0),
 masked_array(data = nan,
             mask = False,
       fill_value = 1e+20)
)

np.ma.mean returns nan, not masked (the 1.6.x change was apparently reverted):

In [36]: np.__version__
Out[36]: '1.5.1'

In [37]: np.ma.mean([])
Warning: invalid value encountered in double_scalars
Out[37]: nan


In [1]: np.__version__
Out[1]: '1.9.0.dev-96dd69c'

In [2]: np.ma.mean([])
/home/rgommers/Code/numpy/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/home/rgommers/Code/numpy/numpy/core/_methods.py:65: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)
Out[2]: nan

So stick to nan?

Member

josef-pkt commented Nov 10, 2013

Stick to nan.
I don't have an opinion, and would just follow np.ma.

Owner

rgommers commented Nov 10, 2013

OK. Thanks for the review.

@rgommers rgommers added a commit that referenced this pull request Nov 10, 2013

@rgommers rgommers Merge pull request #3008 from rgommers/pull-2673
Fix mstats.kurtosistest, and test coverage for skewtest/normaltest
b73ec23

@rgommers rgommers merged commit b73ec23 into scipy:master Nov 10, 2013

1 check failed

default The Travis CI build could not complete due to an error

rgommers deleted the rgommers:pull-2673 branch Nov 10, 2013

Member

WarrenWeckesser commented Nov 10, 2013

Sorry for being late to the party. In ttest_1samp, returning (nan, nan) is not quite right (imho) for an array of size 0 in the n-dimensional case. An n-dimensional input with n > 1 is a collection of data sets that happens to be stored in an array. When axis is not None, the t and p values returned by ttest_1samp are (n-1)-dimensional arrays, holding the corresponding statistics for each data set in the input. This generalizes down to the edge case where one or more dimensions of the input has length 0.

Assume (nan, nan) is the correct result for a single empty data set, and suppose x has shape (3,0). Computing ttest_1samp(x, 0, axis=1) means we want to apply ttest_1samp to 3 data sets, each of which has length 0. So the result should be ([nan, nan, nan], [nan, nan, nan]) (I'm dropping the array or masked array wrappers, and just showing the expected values).

On the other hand, ttest_1samp(x, 0, axis=0) means we are applying the test to an empty collection of data sets. So in that case, the result should be ([], []).

This is how the regular ttest_1samp works:

In [2]: from scipy.stats import ttest_1samp

In [3]: x = np.zeros((3,0))

In [4]: ttest_1samp(x, 0, axis=1)
/home/warren/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:55: RuntimeWarning: invalid value encountered in true_divide
  out=ret, casting='unsafe', subok=False)
/home/warren/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:72: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/warren/local_scipy/lib/python2.7/site-packages/scipy/stats/stats.py:3148: RuntimeWarning: invalid value encountered in true_divide
  denom = np.sqrt(v / float(n))
Out[4]: (array([ nan,  nan,  nan]), array([ nan,  nan,  nan]))

In [5]: ttest_1samp(x, 0, axis=0)
Out[5]: (array([], dtype=float64), array([], dtype=float64))

(A separate issue is to fix the regular ttest_1samp function to get rid of the spurious warnings in this edge case.)

Owner

rgommers commented Nov 25, 2013

Thanks Warren. I won't forget about this one (but a bit short on time now).
