Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Return array scalars for skew and kurtosis #12634

Closed
wants to merge 6 commits into from

Conversation

Kai-Striega
Copy link
Member

@Kai-Striega Kai-Striega commented Jul 29, 2020

This PR returns array scalars for the mstats version of skew and kurtosis and adds a test that the returned value is indeed a scalar value.

Reference issue

fixes #12548

What does this implement/fix?

The masked skew and kurtosis function have branches where the resultant masked array is not converted to an array scalar leading to a masked array being returned as below:

>>> kurtosis([1,5,4,6,nan], fisher=False, nan_policy="omit")
masked_array(data=2.,  #  Correct value but a masked array
             mask=False,
       fill_value=1e+20)

This PR adds an addition step so that, in the case of a 1d array, a scalar value is returned.

>>> kurtosis([1,5,4,6,nan], fisher=False, nan_policy="omit")
2.0

Additional information

I'm not sure whether this is the correct approach; and haven't worked in the mstats modules enough to know if there is a better approach. I'd appreciate if someone with more experience could give their opinion on whether there is a better way to handle it.

This PR returns array scalars for the mstats version of skew and
kurtosis and adds a test that the returned value is indeed a scalar
value.
This commit changes ``describe(...)`` to use the new array scalars
returned by ``skew(...)`` and ``kurtosis(...)`` instead of the original
masked arrays.
@Kai-Striega
Copy link
Member Author

@WarrenWeckesser I was wondering if you could take a quick look at this? The solution was to changevals -> vals + 0. As vals is a masked array it will convert to a scalar if the number of dimensions is 0. It's passing (the failing tests are due to memory errors) but I'm not sure if it had been better to just use an if statement.

@@ -2300,7 +2300,9 @@ def skew(a, axis=0, bias=True):
m3 = np.extract(can_correct, m3)
nval = ma.sqrt((n-1.0)*n)/(n-2.0)*m3/m2**1.5
np.place(vals, can_correct, nval)
return vals
# Add 0 to ensure a scalar result is returned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the docstring at L2280 should be updated?

    Returns
    -------
    skewness : ndarray
        The skewness of values along an axis, returning 0 where all values are
        equal.

it seems in this case it's returning a scalar instead of a ndarray

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [30]: skew([[1,5,4],[5,1,3],[4,7,nan]], nan_policy="omit") + 0   
Out[30]: 
masked_array(data=[-0.52800498, -0.38180177,  0.        ],
             mask=False,
       fill_value=1e+20)

Will the new code handle this case? maybe vals.data is what we actually want?

Copy link
Member Author

@Kai-Striega Kai-Striega Jul 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit Yes I think the docstring should be updated, but let's get the rest of the discussion done before I make any more changes.

I'm not sure what we want in that case. To me the problem wasn't from the masked array - it is that an array is being returned at all when there is only a single dimension. Like other ufuncs. I don't have a strong preference either way but I'd like to be consistent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vals.data seems to correctly be a scalar vs ndarray. spitballing it looks like

vals = vals.data
if fisher:
    return vals - 3
else:
    return vals

Seems to work

Copy link
Member Author

@Kai-Striega Kai-Striega Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just run the above and it still gives a float value for fisher and an array (not a masked array) for the regular version.

>>> import numpy as np
>>> from scipy.stats import kurtosis
>>> kurtosis([1, 5, 4, 6, np.nan], nan_policy="omit")
-1.0
>>> kurtosis([1, 5, 4, 6, np.nan], fisher=False, nan_policy="omit")
array(2.)

Would we expect both results to be a float instead? i.e.

vals = vals.data
if fisher:
    return vals - 3
else:
    return vals + 0

The masked array's data attribute accessesses the underlying data
and returns it as an ndarray.
def test_skew_omit_nan(self):
# Test that skew returns a scalar when nan_policy is omit
# https://github.com/scipy/scipy/issues/12548
s = stats.skew([1, 5, 4, 6, np.nan], nan_policy='omit')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s = stats.skew([1, 5, 4, 6, np.nan], nan_policy='omit')
s = stats.skew([1, 5, 4, 6, np.nan], fisher=True, nan_policy='omit')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion is primarily to make explicit that both options for fisher arg are tested

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rlucas7 sorry about taking so long to reply - I've been away for a long time, but I'm trying to get more involved again now. stats.skew doesn't actually accept a fisher kwarg, only kurtosis. Perhaps this should be changed but that should be another PR.

Copy link
Member

@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kai-Striega thanks for this changeset

This looks like it fixes the issue and makes masked arrays consistent in shape.

I left 1 suggestion inline. The rest of the changes seem ok.

Also can you add something to the docstring to indicate that the skewness is -3, 0 depending on fisher Truthyness when n=1.

Something like:

When `a` has 1 record skew returns a -3 or 0 if fisher is true or false, respectively.
When `a` is empty skew returns a `nan`. 

and similarly for kurtosis?

Other than those small docstring changes this seems to be good.

def test_skew_omit_nan(self):
# Test that skew returns a scalar when nan_policy is omit
# https://github.com/scipy/scipy/issues/12548
s = stats.skew([1, 5, 4, 6, np.nan], nan_policy='omit')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion is primarily to make explicit that both options for fisher arg are tested

@mdhaber
Copy link
Contributor

mdhaber commented Feb 20, 2022

@Kai-Striega I don't think this is needed now that the decorator has been applied to skew and kurtosis.

Now in main:

import numpy as np
from scipy.stats import kurtosis
kurtosis([1, 5, 4, 6, np.nan], fisher=False, nan_policy="omit")  # 2.0

Please re-open if you think the masked versions themselves should be fixed, but I think we're working toward deprecating mstats.

@mdhaber mdhaber closed this Feb 20, 2022
@Kai-Striega Kai-Striega deleted the fix_mstats_array_scalar branch February 20, 2022 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scipy.stats.skew returning MaskedArray
4 participants