Fix mstats.kurtosistest, and test coverage for skewtest/normaltest #3008

Merged
merged 6 commits into scipy:master from rgommers:pull-2673 Nov 10, 2013

Conversation

Owner

rgommers commented Oct 19, 2013

Supersedes gh-2673

Coverage Status

Coverage remained the same when pulling 2dabd6d on rgommers:pull-2673 into b7aa678 on scipy:master.

@josef-pkt josef-pkt commented on the diff Oct 20, 2013

scipy/stats/mstats_basic.py
@@ -1682,10 +1681,15 @@ def kurtosistest(a, axis=0):
A = 6.0 + 8.0/sqrtbeta1 * (2.0/sqrtbeta1 + np.sqrt(1+4.0/(sqrtbeta1**2)))
term1 = 1 - 2./(9.0*A)
denom = 1 + x*ma.sqrt(2/(A-4.0))
- denom[denom < 0] = masked
+ if np.ma.isMaskedArray(denom):
+ # For multi-dimensional array input
+ denom[denom < 0] = masked
@josef-pkt

josef-pkt Oct 20, 2013

Member

I don't see masked defined in this function.

@rgommers

rgommers Oct 20, 2013

Owner
In [1]: np.ma.masked
Out[1]: masked
@josef-pkt

josef-pkt Oct 20, 2013

Member

I finally found it; I had searched for masked with additional whitespace.

@josef-pkt

josef-pkt Oct 21, 2013

Member

I don't see, from reading the code, when denom < 0 can happen.
However, I think this could be replaced by a check for n < 5, or that could be added as an additional condition for cases that should be masked.

The same lines are missing in skewtest.

Member

josef-pkt commented Oct 20, 2013

Change looks good; I'm not sure the functions are good.

Do they behave correctly if the number of valid observations in a column is too small?
(The warning is for when the approximation of the distribution of the test statistic is not good. If the number of valid observations is very small, then the calculations don't make sense: nobs - 3, ...)

The stats versions raise an exception when the sample is too small.
I don't see that the mstats versions check this, and I doubt it's handled implicitly by the masking.

Owner

rgommers commented Oct 20, 2013

@evgeniburovski sent me a PR to make the small-sample behavior identical. EDIT: that PR is merged into this one.

I didn't intend to close the Statistics Review tickets on these functions yet, just fix them to not be hopelessly broken and finish up gh-2673.

Coverage Status

Coverage remained the same when pulling ae17bb44647dc714d28a2cc66de493ba12e27a2a on rgommers:pull-2673 into b7aa678 on scipy:master.

Member

josef-pkt commented Oct 21, 2013

I just prefer that we get functions fixed so we don't have to look at them again for a few years, unless it's just a quick one-line fix (which this was initially).

Owner

rgommers commented Oct 21, 2013

@josef-pkt I agree in principle, but for me the priority should be to fix up the stats functions first and the mstats versions after that. So I fixed the obvious issues and added decent tests. If you want the full review/doc/fixes then this PR may sit here for a while.

katherinehuang and others added some commits Aug 1, 2013

@katherinehuang @rgommers katherinehuang BUG: fix bug in mstats.kurtosistest
Deleted astype(), because it is not an attribute of int (which could
be returned instead of an ndarray).
7e7aa7d
@rgommers rgommers BUG: fix several bugs in the mstats normality tests.
Also provide decent test coverage.
ff7470e
@rgommers rgommers TST: clean up test_mstats_basic.py for docstrings in tests 2d87ba7
@rgommers rgommers BUG: fix errors in mstats.ttest_rel and mstats.ttest_ind.
Change default of axis kw to 0 in ttest_rel.  This is OK without warning
because the function didn't work before anyway.

Also provide basic test coverage (comparison against nonmasked version).
Testing with various masked array inputs to be done.

Closes gh-3047.
32051ab
@rgommers rgommers BUG: fix issues in mstats.ttest_1samp. Add axis keyword. 9511568
Owner

rgommers commented Nov 10, 2013

Rebased and fixed ttest issues in gh-3047 plus a bunch of other ones in the ttest functions.

@josef-pkt I'd like to merge this. Can't fix every last thing about these mstats functions now but that's no good reason to not merge bug fixes that are good to go.

Member

josef-pkt commented Nov 10, 2013

The only question: is it "standard" to return (nan, nan) for empty arrays?
(I would have returned an empty array.)

 +    if a.size == 0 or b.size == 0:
 +        return (np.nan, np.nan)

Otherwise looks fine (I'm not going to look for missing pieces).

Coverage Status

Coverage remained the same when pulling c175623 on rgommers:pull-2673 into 8594f29 on scipy:master.

Owner

rgommers commented Nov 10, 2013

It depends on the function. It was already returning nan (or crashing in a few cases), so I didn't think too hard about it. But I think nan makes sense here. Empty usually makes sense for functions that return an array of the same shape as the input array. Here t and prob are scalars (or 1-D arrays for n-D input); returning empty instead of a scalar result would be odd.
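The scalar-versus-array distinction can be illustrated with plain numpy, using np.mean as a stand-in for the returned statistic:

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Full reduction: one scalar per call, like t and prob for 1-D input.
whole = np.mean(x)

# Reduction along an axis of n-D input: a 1-D array of results,
# one entry per data set.
per_column = np.mean(x, axis=0)
```

For a scalar-returning reduction, substituting an empty array would change the result's type and shape contract, which is the oddness being pointed out.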


Owner

rgommers commented Nov 10, 2013

Travis error was a server glitch, can be ignored.

Member

josef-pkt commented Nov 10, 2013

I can think of use cases for both, where I want nan and where I want empty. But I don't currently have a use case for empty arrays.

I think then that the masked function should return masked instead of nan in this case, to follow the pattern of mean() versus ma.mean():

>>> np.ma.mean([])
masked
>>> np.ma.sum([])
0.0
>>> np.ma.var([])
0.0
>>> np.ma.std([])
0.0
>>> np.std([])
0.0
>>> np.__version__
'1.6.1'

(I don't know why std is not nan.)

Owner

rgommers commented Nov 10, 2013

Currently masked is never returned, not even if all input is masked:

In [32]: x
Out[32]: 
masked_array(data = [-- -- --],
             mask = [ True  True  True],
       fill_value = 1e+20)


In [33]: stats.mstats.ttest_rel(x, x)
Warning: divide by zero encountered in double_scalars
Out[33]: 
(array(1.0),
 masked_array(data = nan,
             mask = False,
       fill_value = 1e+20)
)

np.ma.mean returns nan, not masked (the 1.6.x change was apparently reverted):

In [36]: np.__version__
Out[36]: '1.5.1'

In [37]: np.ma.mean([])
Warning: invalid value encountered in double_scalars
Out[37]: nan


In [1]: np.__version__
Out[1]: '1.9.0.dev-96dd69c'

In [2]: np.ma.mean([])
/home/rgommers/Code/numpy/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/home/rgommers/Code/numpy/numpy/core/_methods.py:65: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)
Out[2]: nan

So stick to nan?

Member

josef-pkt commented Nov 10, 2013

Stick to nan.
I don't have an opinion, and would just follow np.ma.

Owner

rgommers commented Nov 10, 2013

OK. Thanks for the review.

@rgommers rgommers added a commit that referenced this pull request Nov 10, 2013

@rgommers rgommers Merge pull request #3008 from rgommers/pull-2673
Fix mstats.kurtosistest, and test coverage for skewtest/normaltest
b73ec23

@rgommers rgommers merged commit b73ec23 into scipy:master Nov 10, 2013

1 check failed

default The Travis CI build could not complete due to an error

rgommers deleted the rgommers:pull-2673 branch Nov 10, 2013

Member

WarrenWeckesser commented Nov 10, 2013

Sorry for being late to the party. In ttest_1samp, returning (nan, nan) is not quite right (imho) for an array of size 0 in the n-dimensional case. An n-dimensional input with n > 1 is a collection of data sets that happens to be stored in an array. When axis is not None, the t and p values returned by ttest_1samp are (n-1)-dimensional arrays, holding the corresponding statistics for each data set in the input. This generalizes down to the edge case where one or more dimensions of the input has length 0.

Assume (nan, nan) is the correct result for a single empty data set, and suppose x has shape (3,0). Computing ttest_1samp(x, 0, axis=1) means we want to apply ttest_1samp to 3 data sets, each of which has length 0. So the result should be ([nan, nan, nan], [nan, nan, nan]) (I'm dropping the array or masked array wrappers, and just showing the expected values).

On the other hand, ttest_1samp(x, 0, axis=0) means we are applying the test to an empty collection of data sets. So in that case, the result should be ([], []).

This is how the regular ttest_1samp works:

In [2]: from scipy.stats import ttest_1samp

In [3]: x = np.zeros((3,0))

In [4]: ttest_1samp(x, 0, axis=1)
/home/warren/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:55: RuntimeWarning: invalid value encountered in true_divide
  out=ret, casting='unsafe', subok=False)
/home/warren/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:72: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/warren/local_scipy/lib/python2.7/site-packages/scipy/stats/stats.py:3148: RuntimeWarning: invalid value encountered in true_divide
  denom = np.sqrt(v / float(n))
Out[4]: (array([ nan,  nan,  nan]), array([ nan,  nan,  nan]))

In [5]: ttest_1samp(x, 0, axis=0)
Out[5]: (array([], dtype=float64), array([], dtype=float64))

(A separate issue is to fix the regular ttest_1samp function to get rid of the spurious warnings in this edge case.)

Owner

rgommers commented Nov 25, 2013

Thanks Warren. I won't forget about this one (but a bit short on time now).
