-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add handling of NaN in RuntimeWarning except clause #10140
Conversation
Thanks! But in scipy functions, we generally don't ignore We have been adding a parameter called |
@WarrenWeckesser Sure, my changes are not ignoring them, simply make |
@oleksandr-pavlyk, ah right. I read the change too quickly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the treatment of nan should go after the treatment of less than O and the check on ddof. Otherwise the Type of Exception returned for negative data depends on whether a nan is also present in the data.
@oleksandr-pavlyk thanks for taking the time to bring this up. The warning occurs in
Also possibly related: numpy/numpy#9513, numpy/numpy#8230 |
In Intel distribution for Python numpy.log raises RuntimeWarning when input contains np.nan, and the gstd in that case return None. ``` import numpy as np from scipy.stats import gstd a = np.array([[1, 1, 1, 16], [np.nan, 1, 2, 3]]) print(gstd(a, axis=1)) # prints None, instead of np.array([4., np.nan]) ```
Also, the last clause (warning is harmless) was returning Null (I suspect this was unintentional), so I added return of actual computation.
44bb4ba
to
34d7ec6
Compare
@Kai-Striega @pvanmulbregt I have made a change to address your concerns. Please look it over. |
scipy/stats/stats.py
Outdated
a_nan = np.isnan(a) | ||
a_nan_any = a_nan.any() | ||
if a_nan.any(): | ||
return np.exp(np.std(log(a, where=~a_nan), axis=axis, ddof=ddof)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check for nan should go after the less equal 0 check. Otherwise the type of exception for negative data will depend on whether a nan is also present.
scipy/stats/stats.py
Outdated
a_nan_any = a_nan.any() | ||
if a_nan.any(): | ||
return np.exp(np.std(log(a, where=~a_nan), axis=axis, ddof=ddof)) | ||
elif ((a_nan_any and np.less_equal(a[~a_nan], 0).any()) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the above if statement a_nan_any
will always be False. So the statement becomes:
(False and np.less_equal(a[~a_nan], 0).any()) or (not False and np.less_equal(a, 0).any()
<==> (False) or (np.less_equal(a, 0).any())
<==> np.less_equal(a, 0).any()
.
Without the above if statement it looks like a good solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I follow. The conditional reads:
( (a_nan_any and np.less_equal(a[~a_nan], 0).any() ) or
(not a_nan_any and np.less_equal(a, 0).any() ) )
If a_nan_any
is True, we apply mask selector and only check non-nan numbers with zero.
if a_nan_any
is False, we avoid applying the mask, and associated extra copying, and compare all elements with zero.
I thought, however, that I had removed the previous if
statement, but evidently not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed the change deleting the leading if
branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could have made that clearer.
The code goes something like this:
a_nan = np.isnan(a)
a_nan_any = a_nan.any()
if a_nan.any():
# do stuff....
# If we get to here a_nan.any() is False.
# As a_nan_any = a_nan.any(), a_nan_any will also be False.
elif ((a_nan_any and np.less_equal(a[~a_nan], 0).any()) or
(not a_nan_any and np.less_equal(a, 0).any())):
# handle negativity check
If a_nan.any()
is True the first block is run, so, to reach the second statement, a_nan.any()
must be False. As a_nan_any = a_nan.any()
, a_nan_any
must also be False, otherwise the second would not be reached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you. I realized that moments after I posted the comment, and pushes changes removing the 'if a_nan_any:` branch of the if statement.
raise ValueError( | ||
'Non positive value encountered. The geometric standard ' | ||
'deviation is defined for strictly positive values only.') | ||
elif 'Degrees of freedom <= 0 for slice' == str(w): | ||
raise ValueError(w) | ||
else: | ||
# Remaining warnings don't need to be exceptions. | ||
warnings.warn(w) | ||
return np.exp(np.std(log(a, where=~a_nan), axis=axis, ddof=ddof)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick confirmation; this avoids log dealing with the nan value that caused the spurious warning we're trying to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
Note that in current master have reached this branch we reissue the warning and return None, which is unexpected. I was expecting the result of gstd instead.
With this change, with or without use of where
, an actual result would be returned.
@oleksandr-pavlyk I'm happy with the PR now. @pvanmulbregt would you please take a look at the PR again? |
a_nan = np.isnan(a) | ||
a_nan_any = a_nan.any() | ||
if ((a_nan_any and np.less_equal(a[~a_nan], 0).any()) or | ||
(not a_nan_any and np.less_equal(a, 0).any())): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the two cases (a
contains NAN
s, a
does not contain NAN
s) need to be treated separately in the if
clause? I.e. Is this following snippet equivalent? (It also may remove the need for the a_nan_any
variable)
a_nan = np.isnan(a)
if np.less_equal(a[~a_nan], 0).any():
raise ValueError(
'Non positive value encountered. The geometric standard '
'deviation is defined for strictly positive values only.')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for efficiency sake.
In [1]: import numpy as np
In [3]: a = np.take([1.0, 2.0, 3.0], np.random.choice(3, 10**6))
In [4]: %timeit np.less(a, 0).any()
418 µs ± 2.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: a_nan = np.isnan(a)
In [7]: %timeit np.less(a[~a_nan], 0).any()
2.86 ms ± 89.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If there are no nans present, we do not want to run mask selection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. The difference in time is small for a
with a thousand elements but is quite substantial by the time a
has a million elements. Can you add a comment in the code indicating that's why the calculation s being structured in this manner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is actually even better to avoid computing np.less(a[~a_nan], 0)
at all, and do
np.less(np.nanmin(a), 0)
instead:
In [6]: a = np.take([1.0, np.nan, 3.0], np.random.choice(3, 10**6))
In [7]: %timeit np.less(np.nanmin(a), 0)
4.75 ms ± 9.82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: a_nan = np.isnan(a)
In [9]: %timeit np.less(a[~a_nan], 0).any()
8.23 ms ± 7.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: (np.less(a[~a_nan], 0).any(), np.less(np.nanmin(a), 0))
Out[10]: (False, False)
It np.nanmin
is a little slower than bare np.less
for array of normal reals:
In [11]: a = np.take([1.0, 2.0, 3.0], np.random.choice(3, 10**6))
In [12]: %timeit np.less(np.nanmin(a), 0)
777 µs ± 922 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [13]: %timeit np.less(a, 0).any()
429 µs ± 363 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that.
np.less_equal(a, 0).any()
warns if it encounters np.nan
, and expecting it less_equal
to correctly decide that np.nan < 0
is False
may be asking a bit much on all platforms.
np.nanmin(a)
avoids the warning. It's definitely slower than np.min(a)
so separating the two paths seems reasonable, even in a catch Exception block, as an array containing np.nan
may be quite a common occurrence.
[This does raise the question as to what arrays might generate a RuntimeWarning
but don't contain negative or nan
values and hence get called a second time at the end of the catch block, with the same arguments as the first time.]
…nan], 0).any() since the former is about twice faster
As @pvanmulbregt and I have approved the changes. I'm happy to merge this unless anyone voices any other objections. |
Merged. Thanks, @oleksandr-pavlyk. |
In Intel distribution for Python numpy.log raises RuntimeWarning
when input contains np.nan, and the gstd in that case return None.
This happens on MacOSX.
Also when it does, the
np.less_equal
throws another warning when processingnp.nan
: