
ENH: add handling of NaN in RuntimeWarning except clause #10140

Merged (4 commits) on Jul 3, 2019

Conversation

oleksandr-pavlyk (Contributor)

In the Intel Distribution for Python, numpy.log raises a RuntimeWarning when the input contains np.nan, and gstd in that case returns None. This happens on macOS.

```
import numpy as np
from scipy.stats import gstd

a = np.array([[1, 1, 1, 16], [np.nan, 1, 2, 3]])
print(gstd(a, axis=1))  # prints None, instead of np.array([4., np.nan])
```

Also, when it does, np.less_equal throws another warning when processing np.nan:

```
In [4]: gstd(np.array([[1, 1, 1, 16], [np.nan, 1, 2, 3]]), axis=1)
env_root/t_scipy_rc/bin/ipython:23: RuntimeWarning: invalid value encountered in less_equal
env_root/t_scipy_rc/bin/ipython:31: RuntimeWarning: invalid value encountered in log
```

@WarrenWeckesser (Member)

Thanks! But in scipy functions, we generally don't ignore nan by default, and doing so here creates an inconsistent API. The correct behavior here is that, for example, gstd([1, 2, np.nan]) returns nan.

We have been adding a parameter called nan_policy to many of the functions to control how nan is handled. nan is ignored when nan_policy is set to "omit".
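As an aside, a minimal sketch of the nan_policy convention, using scipy.stats.skew, which already takes the parameter (gstd itself did not accept nan_policy at the time of this PR):

```
import numpy as np
from scipy import stats

a = np.array([1.0, 2.0, np.nan, 4.0])

print(stats.skew(a))                      # nan: the default policy is 'propagate'
print(stats.skew(a, nan_policy='omit'))   # nan is ignored, statistic computed on the rest
# stats.skew(a, nan_policy='raise')       # would raise a ValueError
```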

@oleksandr-pavlyk (Contributor, Author)

@WarrenWeckesser Sure, my changes do not ignore them; they simply make TestGeometricStandardDeviation::test_propogates_nan_values pass.

@WarrenWeckesser (Member)

@oleksandr-pavlyk, ah right. I read the change too quickly.

@pvanmulbregt (Contributor) left a comment

I think that the treatment of nan should go after the treatment of values less than 0 and the check on ddof. Otherwise the type of exception raised for negative data depends on whether a nan is also present in the data.

@Kai-Striega (Member) commented May 9, 2019

@oleksandr-pavlyk thanks for taking the time to bring this up.

The warning occurs in np.less_equal(a, 0), so if we are to avoid the warning we must check for nan before it is called. @pvanmulbregt is correct that doing the nan check first will introduce new issues w.r.t. negatives/ddof, so I agree that's probably not the best way forward.

np.less_equal's handling of nan values is an open issue on NumPy (see: numpy/numpy#8945) where the warning is treated as a bug. Looking at the referenced issues it seems to be platform/version dependent and difficult to reproduce. I'm not really sure how we should go forward from here.

Also possibly related: numpy/numpy#9513, numpy/numpy#8230
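For anyone trying to reproduce this, a minimal sketch; whether the comparison actually warns depends on the platform and NumPy build, per the issues above:

```
import warnings
import numpy as np

a = np.array([np.nan, 1.0, 2.0, 3.0])

with warnings.catch_warnings():
    # Escalate RuntimeWarning to an error so a warning platform is easy to spot.
    warnings.simplefilter('error', RuntimeWarning)
    try:
        np.less_equal(a, 0).any()
    except RuntimeWarning as w:
        print('less_equal warned:', w)  # e.g. invalid value encountered in less_equal
```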

In the Intel Distribution for Python, numpy.log raises a RuntimeWarning when the input contains np.nan, and gstd in that case returns None.

```
import numpy as np
from scipy.stats import gstd

a = np.array([[1, 1, 1, 16], [np.nan, 1, 2, 3]])
print(gstd(a, axis=1)) # prints None, instead of np.array([4., np.nan])
```
Also, the last clause (warning is harmless) was returning None (I suspect this was unintentional), so I added a return of the actual computation.
@oleksandr-pavlyk (Contributor, Author)

@Kai-Striega @pvanmulbregt I have made a change to address your concerns. Please look it over.

```
a_nan = np.isnan(a)
a_nan_any = a_nan.any()
if a_nan.any():
    return np.exp(np.std(log(a, where=~a_nan), axis=axis, ddof=ddof))
```
Member

The check for nan should go after the less equal 0 check. Otherwise the type of exception for negative data will depend on whether a nan is also present.

```
a_nan_any = a_nan.any()
if a_nan.any():
    return np.exp(np.std(log(a, where=~a_nan), axis=axis, ddof=ddof))
elif ((a_nan_any and np.less_equal(a[~a_nan], 0).any()) or
```
Member

With the above if statement, a_nan_any will always be False when the elif is reached, so the condition becomes:

```
(False and np.less_equal(a[~a_nan], 0).any()) or (not False and np.less_equal(a, 0).any())
<==> (False) or (np.less_equal(a, 0).any())
<==> np.less_equal(a, 0).any()
```

Without the above if statement it looks like a good solution.

@oleksandr-pavlyk (Contributor, Author)

I am not sure I follow. The conditional reads:

```
((a_nan_any and np.less_equal(a[~a_nan], 0).any()) or
 (not a_nan_any and np.less_equal(a, 0).any()))
```

If a_nan_any is True, we apply the mask selector and only compare the non-nan numbers with zero.

If a_nan_any is False, we avoid applying the mask (and the associated extra copying) and compare all elements with zero.

I thought, however, that I had removed the previous if statement, but evidently not.

@oleksandr-pavlyk (Contributor, Author)

I pushed the change deleting the leading if branch.

Member

I could have made that clearer.

The code goes something like this:

```
a_nan = np.isnan(a)
a_nan_any = a_nan.any()
if a_nan.any():
    # do stuff....
# If we get to here, a_nan.any() is False.
# As a_nan_any = a_nan.any(), a_nan_any will also be False.
elif ((a_nan_any and np.less_equal(a[~a_nan], 0).any()) or
      (not a_nan_any and np.less_equal(a, 0).any())):
    # handle negativity check
```

If a_nan.any() is True the first block is run, so, to reach the second statement, a_nan.any() must be False. As a_nan_any = a_nan.any(), a_nan_any must also be False, otherwise the second would not be reached.

@oleksandr-pavlyk (Contributor, Author)

Yes, thank you. I realized that moments after I posted the comment, and pushed changes removing the `if a_nan_any:` branch of the if statement.

```
    raise ValueError(
        'Non positive value encountered. The geometric standard '
        'deviation is defined for strictly positive values only.')
elif 'Degrees of freedom <= 0 for slice' == str(w):
    raise ValueError(w)
else:
    # Remaining warnings don't need to be exceptions.
    warnings.warn(w)
return np.exp(np.std(log(a, where=~a_nan), axis=axis, ddof=ddof))
```
Member

Just a quick confirmation: this avoids log dealing with the nan value that caused the spurious warning we're trying to avoid.

@oleksandr-pavlyk (Contributor, Author)

Yes.

Note that in current master, having reached this branch, we reissue the warning and return None, which is unexpected. I was expecting the result of gstd instead.

With this change, with or without the use of where, an actual result is returned.
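As a minimal illustration of the where= mechanism being relied on here (with an explicit out= buffer added for clarity; the call in the PR does not pass out=), np.log is only evaluated at the unmasked positions, so the nan entries never reach the ufunc and cannot trigger the warning:

```
import numpy as np

a = np.array([np.nan, 1.0, 2.0, 3.0])
mask = ~np.isnan(a)

# log is computed only where mask is True; the masked positions keep
# whatever is in the out buffer (initialised to nan here).
out = np.full_like(a, np.nan)
np.log(a, where=mask, out=out)
print(out)   # [       nan 0.         0.69314718 1.09861229]
```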

@Kai-Striega (Member)

@oleksandr-pavlyk I'm happy with the PR now.

@pvanmulbregt would you please take a look at the PR again?

```
a_nan = np.isnan(a)
a_nan_any = a_nan.any()
if ((a_nan_any and np.less_equal(a[~a_nan], 0).any()) or
        (not a_nan_any and np.less_equal(a, 0).any())):
```
Contributor

Do the two cases (a contains NaNs, a does not contain NaNs) need to be treated separately in the if clause? I.e., is the following snippet equivalent? (It would also remove the need for the a_nan_any variable.)

```
a_nan = np.isnan(a)
if np.less_equal(a[~a_nan], 0).any():
    raise ValueError(
        'Non positive value encountered. The geometric standard '
        'deviation is defined for strictly positive values only.')
```

@oleksandr-pavlyk (Contributor, Author)

Yes, for efficiency's sake.

```
In [1]: import numpy as np

In [3]: a = np.take([1.0, 2.0, 3.0], np.random.choice(3, 10**6))

In [4]: %timeit np.less(a, 0).any()
418 µs ± 2.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: a_nan = np.isnan(a)

In [7]: %timeit np.less(a[~a_nan], 0).any()
2.86 ms ± 89.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

If there are no nans present, we do not want to run mask selection.

Contributor

Thanks. The difference in time is small when a has a thousand elements but is quite substantial by the time a has a million elements. Can you add a comment in the code indicating that's why the calculation is being structured in this manner?

@oleksandr-pavlyk (Contributor, Author)

It is actually even better to avoid computing np.less(a[~a_nan], 0) at all, and do
np.less(np.nanmin(a), 0) instead:

```
In [6]: a = np.take([1.0, np.nan, 3.0], np.random.choice(3, 10**6))

In [7]: %timeit np.less(np.nanmin(a), 0)
4.75 ms ± 9.82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: a_nan = np.isnan(a)

In [9]: %timeit np.less(a[~a_nan], 0).any()
8.23 ms ± 7.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: (np.less(a[~a_nan], 0).any(), np.less(np.nanmin(a), 0))
Out[10]: (False, False)
```

But np.nanmin is a little slower than bare np.less for an array of ordinary (non-nan) reals:

```
In [11]: a = np.take([1.0, 2.0, 3.0], np.random.choice(3, 10**6))

In [12]: %timeit np.less(np.nanmin(a), 0)
777 µs ± 922 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [13]: %timeit np.less(a, 0).any()
429 µs ± 363 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
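For concreteness, a minimal sketch of what a nanmin-based check could look like (a hypothetical helper, not the exact code that was merged; the gstd check uses less-equal-zero semantics, so zero is rejected as well):

```
import numpy as np

def _check_strictly_positive(a):
    # Hypothetical helper. np.nanmin ignores nan entries, so no boolean mask
    # or copy of `a` is needed, and the comparison never sees a nan, which is
    # what triggered the "invalid value encountered in less_equal" warning.
    if np.less_equal(np.nanmin(a), 0):
        raise ValueError(
            'Non positive value encountered. The geometric standard '
            'deviation is defined for strictly positive values only.')

_check_strictly_positive(np.array([1.0, np.nan, 3.0]))   # passes silently
```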

Contributor

I like that.
np.less_equal(a, 0).any() warns if it encounters np.nan, and expecting less_equal to correctly decide that np.nan <= 0 is False may be asking a bit much on all platforms.
np.nanmin(a) avoids the warning. It's definitely slower than np.min(a), so separating the two paths seems reasonable, even in an except block, as an array containing np.nan may be quite a common occurrence.
[This does raise the question as to which arrays might generate a RuntimeWarning yet contain no negative or nan values, and hence get computed a second time at the end of the except block with the same arguments as the first time.]

…nan], 0).any() since the former is about twice faster
@Kai-Striega (Member)

As @pvanmulbregt and I have approved the changes, I'm happy to merge this unless anyone voices any other objections.

@Kai-Striega merged commit 1238d2d into scipy:master on Jul 3, 2019
@Kai-Striega (Member)

Merged. Thanks, @oleksandr-pavlyk.

@pv added this to the 1.4.0 milestone on Jul 8, 2019