BUG: stats.skew: unreliable when all values in axis slice are identical #15554

ronzhou · 2022-02-08T15:45:36Z

Describe your issue.

computing the skewness of a numpy ndarray will produce two result with following method for a column in ndarray (x):

skew(x[:, i])
skew(x)[i]

the issues happens when this column has equal float value

Reproducing Code Example

import numpy as np
from scipy.stats import skew

a = np.random.randn(100, 10)
a[:, 0] = 1.01

print(f'result in matrix value {skew(a)[0]}')   # -1.0
print(f'result in array value {skew(a[:, 0])}') # 0.0

Error message

No error message

SciPy/NumPy/Python version information

1.7.3 1.22.0 sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)

tupui · 2022-02-08T20:46:32Z

Hi @ronzhou, thanks for reporting. I can reproduce and indeed there is an issue here.

It's a resolution issue when we compute the moments with _moment.

from scipy.stats._stats_py import _moment

_moment(a[:, 0], 2, axis, mean=a[:, 0].mean(axis=0, keepdims=True))
# 4.930380657631324e-32

_moment(a, 2, axis, mean=a.mean(axis=0, keepdims=True))[0]
# 1.7749370367472766e-30

And this slight difference makes it fail here:

zero = (m2 <= (np.finfo(m2.dtype).resolution * mean.squeeze(axis))**2)
vals = np.where(zero, 0, m3 / m2**1.5)

And the root cause is from NumPy actually.

import numpy.testing as npt
npt.assert_equal(a.mean(axis=0, keepdims=True)[0][0], a[:, 0].mean(axis=0, keepdims=True)[0])

# Items are not equal:
#  ACTUAL: 1.0100000000000013
#  DESIRED: 1.0100000000000002

tupui · 2022-02-08T21:05:29Z

From NumPy's doc

For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given. When axis is given, it will depend on which axis is summed. Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory. Note that the exact precision may vary depending on other parameters.

So in the end we have to update our check. I am not sure where this (np.finfo(m2.dtype).resolution * mean.squeeze(axis))**2) is coming from.

@peterbell10 I see you touch this area recently. Would you know? Can we bump the tolerance a bit here? I always thought it does not make much sense to go bellow the machine precision.

ronzhou · 2022-02-09T04:46:30Z

Hi @tupui , thanks for your response.

In a 3 dim array, this is not an issue.

import numpy as np
from scipy.stats import skew

a = np.random.randn(10, 5, 3)
a[0, :, 0] = 1.01
print(skew(a, axis =1)[0, 0]) # 0.0

I use scipy.stats.skew to compute skewness along axis 0 in a 2d array. Currently, I use the method below to avoid this issue,

a = np.random.randn(100, 10)
a[:, 0] = 1.01

result = skew(a) 
result[np.nanmin(a, axis = 0) == np.nanmax(a, axis = 0)] = 0

Is there any better solution to avoid it temporarily? Thanks for your help!

tupui · 2022-02-09T16:58:50Z

Is there any better solution to avoid it temporarily? Thanks for your help!

Better I don't know but you can try playing with the resolution if your application allows that: skew(a.astype('float32'))

ronzhou added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Feb 8, 2022

tupui added the scipy.stats label Feb 8, 2022

mdhaber mentioned this issue Mar 15, 2022

A Solid Foundation for Statistics in Python with SciPy mdhaber/scipy#26

Closed

mdhaber changed the title ~~BUG: wrong result in skewness ndarray computation~~ BUG: stats.skew: unreliable when all values in axis slice are identical Mar 23, 2022

mdhaber mentioned this issue Mar 31, 2022

MAINT: stats._moment: warn when catastrophic cancellation occurs #15905

Merged

tirthasheshpatel closed this as completed in #15905 Apr 27, 2022

mdhaber mentioned this issue Apr 29, 2022

MAINT: stats: replace np.var with _moment(..., 2) to warn on constant input #16055

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: stats.skew: unreliable when all values in axis slice are identical #15554

BUG: stats.skew: unreliable when all values in axis slice are identical #15554

ronzhou commented Feb 8, 2022

tupui commented Feb 8, 2022 •

edited

tupui commented Feb 8, 2022 •

edited

ronzhou commented Feb 9, 2022

tupui commented Feb 9, 2022

BUG: stats.skew: unreliable when all values in axis slice are identical #15554

BUG: stats.skew: unreliable when all values in axis slice are identical #15554

Comments

ronzhou commented Feb 8, 2022

Describe your issue.

Reproducing Code Example

Error message

SciPy/NumPy/Python version information

tupui commented Feb 8, 2022 • edited

tupui commented Feb 8, 2022 • edited

ronzhou commented Feb 9, 2022

tupui commented Feb 9, 2022

tupui commented Feb 8, 2022 •

edited

tupui commented Feb 8, 2022 •

edited