
Fix mstats.kurtosistest, and test coverage for skewtest/normaltest #3008

Merged
merged 6 commits into from
Nov 10, 2013

Conversation

rgommers
Member

Supersedes gh-2673.

@coveralls

Coverage Status

Coverage remained the same when pulling 2dabd6d on rgommers:pull-2673 into b7aa678 on scipy:master.

if np.ma.isMaskedArray(denom):
    # For multi-dimensional array input
    denom[denom < 0] = masked
Member

I don't see masked defined in this function.

Member Author


In [1]: np.ma.masked
Out[1]: masked
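(For context: the diff relies on the module-level `np.ma.masked` constant. A minimal standalone illustration of what assigning it does — the variable name `denom` mirrors the snippet above, the values are made up:)

```python
import numpy as np
from numpy.ma import masked  # the module-level sentinel, i.e. np.ma.masked

denom = np.ma.array([4.0, -1.0, 9.0])
denom[denom < 0] = masked  # entries where denom < 0 become masked in place
print(denom.mask.tolist())  # [False, True, False]
```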

Member


@josef-pkt: see line 45

Member


I finally found it; I had searched for masked with additional whitespace.

Member


Reading the code, I don't see when denom < 0 can occur.
However, I think this could be replaced by n < 5, or that could be added as an additional condition for cases that should be masked.

The same lines are missing in skewtest.

@josef-pkt
Member

The change looks good; I'm not sure the functions themselves are.

Do they behave correctly if the number of valid observations in a column is too small?
(The warning applies when the approximation of the distribution of the test statistic is not good. If the number of valid observations is very small, the calculations themselves don't make sense: nobs - 3, ...)

The stats version raises an exception when the sample is too small.
I don't see the mstats version checking this, and I doubt it's implicitly masked.

@josef-pkt
Member

gh-1950

@rgommers
Member Author

@EvgeniBurovski sent me a PR to make the small-sample behavior identical. EDIT: that PR has been merged into this one.

I didn't intend to close the Statistics Review tickets on these functions yet, just fix them to not be hopelessly broken and finish up gh-2673.

@coveralls

Coverage Status

Coverage remained the same when pulling ae17bb44647dc714d28a2cc66de493ba12e27a2a on rgommers:pull-2673 into b7aa678 on scipy:master.

@josef-pkt
Member

I'd just prefer that we get functions fixed so we don't have to look at them again for a few years, unless it's just a quick one-line fix (which this was initially).

@rgommers
Member Author

@josef-pkt I agree in principle, but for me the priority should be to fix up the stats functions first and the mstats versions after that. So I fixed the obvious issues and added decent tests. If you want the full review/doc/fixes then this PR may sit here for a while.

katherinehuang and others added 5 commits November 10, 2013 14:09
Deleted astype(), because it is not an attribute of int (which could
be returned instead of an ndarray).
Also provide decent test coverage.
Change default of axis kw to 0 in ttest_rel.  This is OK without warning
because the function didn't work before anyway.

Also provide basic test coverage (comparison against nonmasked version).
Testing with various masked array inputs to be done.

Closes gh-3047.
@rgommers
Member Author

Rebased and fixed ttest issues in gh-3047 plus a bunch of other ones in the ttest functions.

@josef-pkt I'd like to merge this. Can't fix every last thing about these mstats functions now but that's no good reason to not merge bug fixes that are good to go.

@josef-pkt
Member

The only question: is it "standard" to return (nan, nan) for empty arrays?
(I would have returned empty.)

    if a.size == 0 or b.size == 0:
        return (np.nan, np.nan)

Otherwise looks fine (I'm not going to look for missing pieces.)

@coveralls

Coverage Status

Coverage remained the same when pulling c175623 on rgommers:pull-2673 into 8594f29 on scipy:master.

@rgommers
Member Author

It depends on the function. It was already returning nan (or crashing in a few cases), so I didn't think too hard about it. But I think nan makes sense here. Empty usually makes sense for functions that return an array of the same shape as the input array. Here t and prob are scalars (or 1-D arrays for n-D input), so returning empty instead of a scalar result would be odd.

@rgommers
Member Author

Travis error was a server glitch, can be ignored.

@josef-pkt
Member

I can think of use cases both where I want nan and where I want empty. But I don't currently have a use case for empty arrays.

I think, then, that the masked function should return masked instead of nan in this case, to follow the pattern of mean() versus ma.mean():

>>> np.ma.mean([])
masked
>>> np.ma.sum([])
0.0
>>> np.ma.var([])
0.0
>>> np.ma.std([])
0.0
>>> np.std([])
0.0
>>> np.__version__
'1.6.1'

(I don't know why std is not nan.)

@rgommers
Member Author

Currently masked is never returned, not even when all input is masked:

In [32]: x
Out[32]: 
masked_array(data = [-- -- --],
             mask = [ True  True  True],
       fill_value = 1e+20)


In [33]: stats.mstats.ttest_rel(x, x)
Warning: divide by zero encountered in double_scalars
Out[33]: 
(array(1.0),
 masked_array(data = nan,
             mask = False,
       fill_value = 1e+20)
)

np.ma.mean returns nan, not masked (the 1.6.x change was apparently reverted):

In [36]: np.__version__
Out[36]: '1.5.1'

In [37]: np.ma.mean([])
Warning: invalid value encountered in double_scalars
Out[37]: nan


In [1]: np.__version__
Out[1]: '1.9.0.dev-96dd69c'

In [2]: np.ma.mean([])
/home/rgommers/Code/numpy/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/home/rgommers/Code/numpy/numpy/core/_methods.py:65: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)
Out[2]: nan

So stick to nan?

@josef-pkt
Member

Stick to nan.
I don't have an opinion, and would just follow np.ma.

@rgommers
Member Author

OK. Thanks for the review.

rgommers added a commit that referenced this pull request Nov 10, 2013
Fix mstats.kurtosistest, and test coverage for skewtest/normaltest
@rgommers rgommers merged commit b73ec23 into scipy:master Nov 10, 2013
@rgommers rgommers deleted the pull-2673 branch November 10, 2013 14:52
@WarrenWeckesser
Member

Sorry for being late to the party. In ttest_1samp, returning (nan, nan) is not quite right (imho) for an array of size 0 in the n-dimensional case. An n-dimensional input with n > 1 is a collection of data sets that happens to be stored in an array. When axis is not None, the t and p values returned by ttest_1samp are (n-1)-dimensional arrays, holding the corresponding statistics for each data set in the input. This generalizes down to the edge case where one or more dimensions of the input has length 0.

Assume (nan, nan) is the correct result for a single empty data set, and suppose x has shape (3,0). Computing ttest_1samp(x, 0, axis=1) means we want to apply ttest_1samp to 3 data sets, each of which has length 0. So the result should be ([nan, nan, nan], [nan, nan, nan]) (I'm dropping the array or masked array wrappers, and just showing the expected values).

On the other hand, ttest_1samp(x, 0, axis=0) means we are applying the test to an empty collection of data sets. So in that case, the result should be ([], []).

This is how the regular ttest_1samp works:

In [2]: from scipy.stats import ttest_1samp

In [3]: x = np.zeros((3,0))

In [4]: ttest_1samp(x, 0, axis=1)
/home/warren/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:55: RuntimeWarning: invalid value encountered in true_divide
  out=ret, casting='unsafe', subok=False)
/home/warren/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py:72: RuntimeWarning: invalid value encountered in true_divide
  out=arrmean, casting='unsafe', subok=False)
/home/warren/local_scipy/lib/python2.7/site-packages/scipy/stats/stats.py:3148: RuntimeWarning: invalid value encountered in true_divide
  denom = np.sqrt(v / float(n))
Out[4]: (array([ nan,  nan,  nan]), array([ nan,  nan,  nan]))

In [5]: ttest_1samp(x, 0, axis=0)
Out[5]: (array([], dtype=float64), array([], dtype=float64))

(A separate issue is to fix the regular ttest_1samp function to get rid of the spurious warnings in this edge case.)
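(Warren's shape reasoning can be checked with plain numpy reductions on the same (3, 0) input — a sketch independent of the actual ttest_1samp code:)

```python
import warnings
import numpy as np

x = np.zeros((3, 0))

with warnings.catch_warnings():
    warnings.simplefilter('ignore', RuntimeWarning)  # empty-slice warnings
    m1 = np.mean(x, axis=1)  # 3 empty data sets -> one nan result per row
    m0 = np.mean(x, axis=0)  # empty collection of data sets -> empty result

print(m1.shape)  # (3,)  with all entries nan
print(m0.shape)  # (0,)
```

The shapes of the reduced arrays match the (nan, nan, nan) versus empty-array behavior described above.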

@rgommers
Member Author

Thanks Warren. I won't forget about this one (but a bit short on time now).
