Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: stats.skew: unreliable when all values in axis slice are identical #15554

Closed
ronzhou opened this issue Feb 8, 2022 · 4 comments · Fixed by #15905
Closed

BUG: stats.skew: unreliable when all values in axis slice are identical #15554

ronzhou opened this issue Feb 8, 2022 · 4 comments · Fixed by #15905
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats

Comments

@ronzhou
Copy link

ronzhou commented Feb 8, 2022

Describe your issue.

computing the skewness of a numpy ndarray will produce two result with following method for a column in ndarray (x):

  1. skew(x[:, i])
  2. skew(x)[i]

the issues happens when this column has equal float value

Reproducing Code Example

import numpy as np
from scipy.stats import skew

a = np.random.randn(100, 10)
a[:, 0] = 1.01

print(f'result in matrix value {skew(a)[0]}')   # -1.0
print(f'result in array value {skew(a[:, 0])}') # 0.0

Error message

No error message

SciPy/NumPy/Python version information

1.7.3 1.22.0 sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)

@ronzhou ronzhou added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Feb 8, 2022
@tupui
Copy link
Member

tupui commented Feb 8, 2022

Hi @ronzhou, thanks for reporting. I can reproduce and indeed there is an issue here.

It's a resolution issue when we compute the moments with _moment.

from scipy.stats._stats_py import _moment

_moment(a[:, 0], 2, axis, mean=a[:, 0].mean(axis=0, keepdims=True))
# 4.930380657631324e-32

_moment(a, 2, axis, mean=a.mean(axis=0, keepdims=True))[0]
# 1.7749370367472766e-30

And this slight difference makes it fail here:

zero = (m2 <= (np.finfo(m2.dtype).resolution * mean.squeeze(axis))**2)
vals = np.where(zero, 0, m3 / m2**1.5)

And the root cause is from NumPy actually.

import numpy.testing as npt
npt.assert_equal(a.mean(axis=0, keepdims=True)[0][0], a[:, 0].mean(axis=0, keepdims=True)[0])

# Items are not equal:
#  ACTUAL: 1.0100000000000013
#  DESIRED: 1.0100000000000002

@tupui
Copy link
Member

tupui commented Feb 8, 2022

From NumPy's doc

For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given. When axis is given, it will depend on which axis is summed. Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory. Note that the exact precision may vary depending on other parameters.

So in the end we have to update our check. I am not sure where this (np.finfo(m2.dtype).resolution * mean.squeeze(axis))**2) is coming from.

@peterbell10 I see you touch this area recently. Would you know? Can we bump the tolerance a bit here? I always thought it does not make much sense to go bellow the machine precision.

@ronzhou
Copy link
Author

ronzhou commented Feb 9, 2022

Hi @tupui , thanks for your response.

In a 3 dim array, this is not an issue.

import numpy as np
from scipy.stats import skew

a = np.random.randn(10, 5, 3)
a[0, :, 0] = 1.01
print(skew(a, axis =1)[0, 0]) # 0.0

I use scipy.stats.skew to compute skewness along axis 0 in a 2d array. Currently, I use the method below to avoid this issue,

a = np.random.randn(100, 10)
a[:, 0] = 1.01

result = skew(a) 
result[np.nanmin(a, axis = 0) == np.nanmax(a, axis = 0)] = 0

Is there any better solution to avoid it temporarily? Thanks for your help!

@tupui
Copy link
Member

tupui commented Feb 9, 2022

Is there any better solution to avoid it temporarily? Thanks for your help!

Better I don't know but you can try playing with the resolution if your application allows that: skew(a.astype('float32'))

@mdhaber mdhaber changed the title BUG: wrong result in skewness ndarray computation BUG: stats.skew: unreliable when all values in axis slice are identical Mar 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants