Bugfix #8103: scipy.stats.boxcox - np.nan produces warnings #8108
scipy/stats/morestats.py
Outdated
        raise ValueError("Data must be positive.")

    if lmbda is not None:  # single transformation
-       return special.boxcox(x, lmbda)
+       return special.boxcox(x, lmbda)  # propagates nan's
I think it should be
    if nan_policy == 'propagate':
        return special.boxcox(x, lmbda)
    else:
        return special.boxcox(x[not_nan_mask], lmbda)
It won't broadcast lmbda correctly, but that doesn't seem like an intended use case. @josef-pkt or @ev-br would know better than me though, so we should see what they say.
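For readers following along, here is a minimal sketch of the branch suggested above. It is not the code in the PR: the helper name boxcox_fixed_lambda is hypothetical, and the mask is computed inside the helper for the sake of a self-contained example.

```python
import numpy as np
from scipy import special


def boxcox_fixed_lambda(x, lmbda, nan_policy="propagate"):
    """Hypothetical helper: Box-Cox with a given lambda and a nan policy."""
    x = np.asarray(x, dtype=float)
    if nan_policy == "propagate":
        # nan in, nan out; the output keeps the shape of x.
        return special.boxcox(x, lmbda)
    if nan_policy == "omit":
        # Drop nans first; the output is shorter than x.
        not_nan_mask = ~np.isnan(x)
        return special.boxcox(x[not_nan_mask], lmbda)
    raise ValueError("nan_policy must be 'propagate' or 'omit'")


x = np.array([1.0, np.nan, 4.0])
propagated = boxcox_fixed_lambda(x, 0.5)
omitted = boxcox_fixed_lambda(x, 0.5, nan_policy="omit")
```

The broadcasting caveat shows up in the 'omit' branch: once x is masked it is shorter, so an array-valued lmbda with the original shape of x would no longer line up with it.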
Good catch, thank you.
To get the same behavior you suggested, I moved the two lines that omit nan values further up, so they run earlier than before.
Please check whether this meets your expectations. The effect should be the same, but no additional "if" is needed (less complexity).
Looks good to me, but I forgot to ask you to add a regression test (my bad). It would probably be fine to take your original example from #8103 and check that you get the right results with the various possible nan policies.
About broadcasting: given that the docstring says x is 1-dimensional, broadcasting won't be available. I think an extra "omit-fit" nan policy would be useful in this case; it is currently not available, AFAICS. (E.g. that would make it easier to combine the transformed array with some original data such as a pandas DataFrame.)
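A rough sketch of what such an "omit-fit" policy could mean, assuming it fits lambda on the non-nan values and then transforms the full array so that positions are preserved. The helper name boxcox_omit_fit is hypothetical and not part of scipy.

```python
import numpy as np
from scipy import special, stats


def boxcox_omit_fit(x):
    """Hypothetical "omit-fit": estimate lambda without nans, transform all of x."""
    x = np.asarray(x, dtype=float)
    not_nan_mask = ~np.isnan(x)
    # Estimate lambda on the non-nan observations only.
    _, lmbda = stats.boxcox(x[not_nan_mask])
    # Transform the full array; nans stay nan and keep their positions.
    return special.boxcox(x, lmbda), lmbda


x = np.array([1.5, 2.0, np.nan, 3.7, 8.1, 0.9, 5.2])
y, lmbda = boxcox_omit_fit(x)
```

Because y has the same length as x, it can be assigned back as a column next to the original data without re-aligning indices.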
The docstring is missing a See Also to the scipy.special function, which users can use if they don't need to estimate lambda and have broadcasting.
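As a small illustration of the broadcasting that scipy.special.boxcox offers when lambda is already known (the input values here are made up for the example):

```python
import numpy as np
from scipy import special

x = np.array([1.0, 4.0, 9.0])
lmbdas = np.array([0.0, 0.5])

# Broadcast a column of data against a row of lambdas: result has shape (3, 2).
# Column 0 is log(x) (lambda == 0); column 1 is (sqrt(x) - 1) / 0.5.
grid = special.boxcox(x[:, np.newaxis], lmbdas)
```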
@person142 Tests were added, but I don't know how to run them locally on my Windows machine. Otherwise, I will wait until CI runs and checks all tests.
@josef-pkt
@stefansimik but AFAICS, you don't have an option for computing lmbda without nans, but returning the array with nans. Or do I misread the changes?
Yes, that's true, you are right. That can happen: computing lmbda without nans, but returning transformed values with nans. In my understanding, the calculation of lambda itself is not supposed to be parametrized by nan_policy. But maybe it should be; I am open if you can see some practical use case. I can see some value in consistency if we added an additional parameter lmbda_nan_policy with default value 'omit' and other possible value 'raise'. The option 'propagate' is useless here. But I have another suggestion that would fix and answer all these questions. I will write it here in a new comment from my PC...
Guys, I would like to suggest to go in different direction - let's be open-minded for a while and imagine, that How it will work and handle all possible scenarios:
What are main benefits:
On top of that, I believe, that function like
If the function has logical nan handling, let's use it. If someone wants to block nan's, or application domain requires it - then nan's can be easily removed (it is one liner code). Do you know anybody, who would require to put some nan_policy into Probably not many if at all. And in special cases, where it is needed - one can remove nan's before calling np.log(...). So let's get rid of the If one sends array with nan's into
Let's make it work the same way as How would you feel if:
vs.
This is my suggestion: let's get rid of nan_policy and always propagate nans.
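To make the np.log comparison concrete, a minimal numpy-only sketch of "always propagate, and strip nans yourself when you need to" (the sample values are made up):

```python
import numpy as np

x = np.array([1.0, np.nan, np.e])

# Elementwise functions such as np.log simply propagate nan.
logged = np.log(x)

# Callers who do not want nans can remove them with a one-liner first.
cleaned = x[~np.isnan(x)]
logged_clean = np.log(cleaned)
```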
@stefansimik There is a difference between scipy.stats and scipy.special. In this case, stats.boxcox as a simple transformation is now essentially obsolete given the scipy.special function. The advantage is that we estimate and transform at the same time, as a convenience function that combines two separate steps. (Note: my initial suggestion was to use 'ignore' instead of 'propagate' as an option to not even check for nans if checking for nans is relatively expensive for a function that itself does just some simple computation. I'm slowly starting to change my mind about whether that's a useful default if the function has to do more than some very simple computation.)
@josef-pkt
I can understand, that in scipy.stats - there are hypothesis-tests and many other stats-functions, where nan_policy is relevant and nature of the problem makes it useful. It is important to answer these questions, to resolve this issue:
I mean, boxcox(..) without nan_policy can work great, because
I know, stats.boxcox() does two things at once, which makes things more complex, but:
It can be expected, that most usages of stats.boxcox() will be straightforward =
I see it very similarly for 'omit' option, because one can easily write one liner - Each mentioned use-case can be done without nan_policy easily. It is generally usable and powerful without nan_policy. I see no need to make things more complex. Everything I look at tells me there is no significant or practical added value in a nan_policy parameter for stats.boxcox() that would compensate for the higher complexity. For case of boxcox() method, I see nan_policy as
I am open-minded, but in this light it does not look worth it.
@josef-pkt @person142 |
I will defer to whatever @josef-pkt thinks, since he is much wiser in the ways of |
@person142 @josef-pkt could you help us resolve this issue by giving your opinion, please?
@josef-pkt when you have time, could you please have a look at this?
I still have problems with this, and I'm still not sure what the appropriate solution should be. The underlying problem also affects other functions, e.g.
Those functions are elementwise like ufuncs, where nan propagation is useful and the usual default. However, they need intermediate results where a single nan destroys the results.
The "obvious" (most common usage) solution would be to behave like ufuncs and always propagate nan in the elementwise part, but use nan-robust computation in the reduce statistics. The trimming functions are a bit in between, because they explicitly exclude some elements by definition.
For box-cox, an issue similar to nan handling is what to do with non-positive values; options would be to drop them or add a constant to make them positive. That is also left to the user, and the function currently hardcodes "raise".
Aside: I just saw that sigmaclip is missing in the function list at
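The "propagate in the elementwise part, nan-robust in the reduce step" idea could look roughly like the following. This is only a sketch on a z-score-style standardization chosen for illustration; the helper name standardize_nan_robust is hypothetical and is not an existing scipy function.

```python
import numpy as np


def standardize_nan_robust(x):
    """Hypothetical helper: nan-robust statistics, nan-propagating output."""
    x = np.asarray(x, dtype=float)
    # Reduce step: ignore nans so that a single nan does not destroy the statistics.
    mean = np.nanmean(x)
    std = np.nanstd(x)
    # Elementwise step: behaves like a ufunc, so nan inputs stay nan.
    return (x - mean) / std


x = np.array([1.0, 2.0, np.nan, 3.0])
z = standardize_nan_robust(x)
```

With plain np.mean and np.std in the reduce step, every element of z would be nan, which is the failure mode described above.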
Closes #8103