Bugfix #8103: scipy.stats.boxcox - np.nan produces warnings #8108

stefansimik · 2017-10-30T21:42:11Z

…houldn't

person142 · 2017-10-31T22:17:19Z

scipy/stats/morestats.py

        raise ValueError("Data must be positive.")

    if lmbda is not None:  # single transformation
-        return special.boxcox(x, lmbda)
+        return special.boxcox(x, lmbda)  # propagates nan's


I think it should be

if nan_policy == 'propagate': return special.boxcox(x, lmbda) else: return special.boxcox(x[not_nan_mask], lmbda)

It won't broadcast lmbda correctly, but that doesn't seem like an intended use case. @josef-pkt or @ev-br would know better than me though, so we should see what they say.

Great finding, thank you.
To achieve the same functionality as you suggested, I just moved 2 lines of code (for omitting nan values) more to the top, so it executes earlier than before.
Check please, if it meets your expectations - the effect should be the same,
but no additional IF is needed (less complexity).

person142 · 2017-11-01T22:23:47Z

Looks good to me, but I forgot to ask you to add a regression test (my bad). Would probably be fine to take your original example from #8103 and check that you get the right results with the various possible nan policies.

josef-pkt · 2017-11-02T12:24:52Z

about broadcasting: given that the docstring says x is 1-dim, it won't be available.
e.g. taking the mask indexed array would ravel.

I think an extra "omit-fit" nan-policy would be useful in this case, which is now not available, AFAICS.
That is, omit the nans in the estimation of lmbda, but propagate the nans to the transformed array.

(E.g. that would make it easier to combine the transformed array with some original data like a pandas DataFrame).

josef-pkt · 2017-11-02T12:27:46Z

The docstring is missing a See Also to the scipy.special function, which users can use if they don't need to estimate lambda and have broadcasting.

stefansimik · 2017-11-03T16:58:35Z

@person142 Tests were added, but I don't know how to run them locally on my Windows machine.
I expect, it requires some special setup to compile whole scipy first, install it, and so on...
If you know some tutorial how to do it locally, let me know.

Otherwise, I will wait until CI runs and checks all tests.

stefansimik · 2017-11-03T17:04:32Z

@josef-pkt
My understanding is, that

lmbda be calculated only using real values i.e. nan values are not usable in calculating lmbda itself.
in this context - nan_policy parameter logically applies only to the transformation of the values - and not calculation of lmbda -- but here, I agree, this should be more clearly stated in the docs and I would be happy to do it, if you agree

josef-pkt · 2017-11-03T17:11:09Z

@stefansimik but AFAICS, you don't have an option for computing lmbda without nans, but returning the array with nans. Or do I misread the changes?

stefansimik · 2017-11-03T18:11:04Z

@josef-pkt

Yes, that's true - you are right. That can happen = computing lmbda without nan's, but returning transformed values with nan's.

In my understanding, the calculation of lambda itself is not supposed to be parametrized by nan_policy. But maybe it should be, I am open, if you can see some practical use-case.

I can see some value in consistency, if we added additional param lmbda_nan_policy, that would have default value 'omit', and other possible value 'raise'. Option 'propagate' is useless here.

But I have other suggestion to fix and answer all these questions. I will write it here in new comment from my PC...

stefansimik · 2017-11-03T21:01:59Z

@josef-pkt @person142

Guys, I would like to suggest to go in different direction - let's be open-minded for a while and imagine, that boxcox(...) method will have no new parameter(s) like nan_policy, but propagates nan's always -- the same way like np.log(...) does.

How it will work and handle all possible scenarios:

nan propagation scenario: this is default and most common behavior.
nan omit scenario: --> one can remove nan's if this is requirement from application point of view
nan raise scenario: --> one can check nan's and raise exception is requirement from application point of view

What are main benefits:

simple, straightforward, working and intuitive result for majority of use-cases
- the same way, like we can log-transform, we can boxcox-transform
less parameters, less complexity, less thinking
less maintenance

On top of that, I believe, that function like np.log(), or boxcox():

should do only core logic and return results. Nothing more, nothing less.
- that means, such methods should not add too much on top of basic functionality = it should not do complex checking of inputs, nor support multiple input check scenarios to meet various expectations of different domains. This is responsibility of applications, which uses boxcox() method to make sure it prepares proper inputs - and this is based on application domain knowledge.

I believe, that preparing proper inputs or various input checking options should be part of the application, not part of the core math/stats function itself.

If the function has logical nan handling, let's use it. If someone wants to block nan's, or application domain requires it - then nan's can be easily removed (it is one liner code).

Do you know anybody, who would require to put some nan_policy into np.log(...) method?

Probably not many if at all. And in special cases, where it is needed - one can remove nan's before calling np.log(...).

So let's get rid of the nan_policy. As we dont need nan_policy in np.log(...), the same way we don't need nan_policy in boxcox().

If one sends array with nan's into boxcox(...):

then lmbda will be calculated from values other than nan's
- this is the only logical thing we can do
transformation will be done on values also other than nan's
- intuitive and logical handling
where nan's is, there it remains
- common expected handling

Let's make it work the same way as np.log(...).
Let's make it practical, simple and intuitive - as Python is expected to be.

How would you feel if:

you have complex boxcox function, where you have to study and evaluate additional
parameters like nan_policy for transformation and another nan_policy for calculation lmbda?

vs.

you have simple boxcox() that just works. Transparent working by propagating nan's.
Simple intuitive behavior works better, than rich and complex.

This my suggestion: let's get rid of nan_policy and always propagate nan's.
But you guys have the last word of course ;)

josef-pkt · 2017-11-04T14:08:50Z

@stefansimik There is a difference between scipy.stats and scipy.special.
The scipy special functions are elementwise and behave like you describe,
The scipy.stats functions need to do some further processing over the entire array in almost all cases. So the analogy to np.log is not appropriate and e.g. np.mean versus np.nanmean would be a better analogy.

In this case, stats.boxcox as a simple transformation is now essentially obsolete given the scipy.special function. The advantage is that we estimate and transform at the same time, as a convenience function that combines two separate steps.
Similar in hypothesis tests or most other stats functions we have to let the user decide how to handle nan values, because simply propagating them throughout is not possible.

(Note: my initial suggestion was to use 'ignore' instead of 'propagate' as an option to not even check for nans if checking for nans is relatively expensive for a function that itself does just some simple computation. I'm slowly starting to change my mind about whether that's a useful default if the function has to do more than some very simple computation.)

stefansimik · 2017-11-04T17:55:34Z

@josef-pkt
Thank you. I learnt a lot from your detailed explanation.

Similar in hypothesis tests or most other stats functions we have to let the user decide how to handle nan values, because simply propagating them throughout is not possible.

I can understand, that in scipy.stats - there are hypothesis-tests and many other stats-functions, where nan_policy is relevant and nature of the problem makes it useful.

It is important to answer these questions, to resolve this issue:

But are we sure, this is case also for scipy.stats.boxcox(..) ?
Why we need new nan_policy_param, when every scenario can be done nicely without it?
Is there any imaginable scenario, where nan_policy would be must-have at all?
What is the significant practical value, that compensates for new added complexity of nan_policy, that would make life of developer easier?

I mean, boxcox(..) without nan_policy can work great, because

it just works for all use-cases (propagate / omit / raise scenarios)
it is simple, practically working and transparent choice.
it adheres to the principle of least astonishment - it produces intuitive result, that is expected.

I know, stats.boxcox() does 2 thing at once, which makes thing more complex, but:

calculation of lambda has to be done by omitting nan's - always.
- There is no other option = there is nothing to configure. Why configure this at all , if there no option? Let's just state it clearly in docs, how it works.
transformation can be done with nan's - by propagating nan's.
- Why to add options for throwing exceptions here, or omit nan's, when everyone can do it itself - in special cases, they really want it? - manually before calling boxcox(...). Nobody prevents them.

It can be expected, that most usages of stats.boxcox() will be straightforward =

get results for boxcox() and continue in the workflow. No raising exceptions. Anyway - in cases, someone wants to raise, he is free to do it. So what is motivation to incorporate such logic into the core of the boxcox(...) function?

I see it very similarly for 'omit' option, because one can easily write one liner - x[~is.nan(x)] - and remove nan's before calling boxcox() if it is required by application logic. It is trivial to do, so I don't see added value in incorporating such functionality in the core of the boxcox() function.

Each mentioned use-case can be done without nan_policy easily. It is generally usable and powerful without nan_policy. Isee no need to make things more complex.

Everything where I look says to me there is no significant nor practical added value in nan_policy parameter for stats.boxcox(), that would compensate for higher complexity.

For case of boxcox() method, I see nan_policy as

a little overthinked solution, trying to solve trivial functionality of nan-checking scenarios
it does not bring significant or widely practical value
concept of nan_policy for boxcox() can in ideal world save developer this 1 line: x[~is.nan(x)] or raising exception - in special cases, where he wants it, but pollutes the core functionality with new complexity and maintenance forever.

I am open-minded, but in this light - it looks lit not worth it.
Boxcox() without nan_policy works nicely for all use-cases.

stefansimik · 2017-11-07T21:47:31Z

@josef-pkt @person142
Hi guys, any comments or suggestions? Which option do you prefer - with or without nan_policy?
Your opinion is important now, so we can finalize this issue and make scipy even better ;)

person142 · 2017-11-07T23:24:50Z

I will defer to whatever @josef-pkt thinks, since he is much wiser in the ways of scipy.stats than I am.

stefansimik · 2017-11-08T03:49:50Z

@person142 @josef-pkt
Thank you. I agree and will do the same.
Despite any opinions, I am still learning scipy so I will be happy to update
code based on @josef-pkt decision.
No more comments, just implementation of final solution. :)

@josef-pkt could you help us to resolve this issue, based on your opinion, please?

ilayn · 2018-04-02T06:14:52Z

@josef-pkt when you have time, could you please have a look at this?

josef-pkt · 2018-04-02T14:53:36Z

I still have problems with this, and I'm still not sure what the appropriate solution should be.

The underlying problem also affects other functions, e.g.
https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L2246
(zscore, zmap, obrientransform and possibly the trimming functions are the current elementwise functions, besides box-cox if lambda is estimated.)

Those function are elementwise like ufuncs where nan propagation is a useful and the usual default. However, they need intermediate results where a single nan destroys the results.

The "obvious" (most common usage) solution would be to behave like ufuncs and always propagate nan in the elementwise part, but use nan robust computation in the reduce statistics.
But I'm not sure what the options for this should be, and whether decisions should be hard coded instead of optional.
The most useful option IMO would be raise and fit-omit. AFAIK, all current usage for nan-policy are for reduce functions. So, there is no established pattern for ufunc_like functions that also need a reduce.

The trimming functions are a bit in between, because they explicitly exclude some elements by definition.

For box-cox a similar issue as nan handling would be what to do with non-positive values, options would be to drop or add a constant to make them positive. That is also left to the user, and the function currently hardcodes "raise".

aside: I just saw that sigmaclip is missing in the function list at
https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L87

Bugfix #8103: scipy.stats.boxcox - np.nan produces warnings, but it s…

c26e70d

…houldn't

stefansimik mentioned this pull request Oct 30, 2017

scipy.stats.boxcox - np.nan produces warnings, but it shouldn't #8103

Closed

rgommers added maintenance Items related to regular maintenance tasks scipy.stats labels Oct 31, 2017

person142 reviewed Oct 31, 2017

View reviewed changes

Bugfix #8103: scipy.stats.boxcox - omitting NaN's moved to better place

ccfc1d4

Bugfix #8103: scipy.stats.boxcox - added tests for nan_policy

a1c1f66

Bugfix #8103: scipy.stats.boxcox - tests fixed after failure in CI

9e64ca3

josef-pkt mentioned this pull request Apr 3, 2018

BUG/ENH: stats nan-policy with elementswise function #8669

Open

rgommers deleted the branch scipy:master January 3, 2022 16:32

rgommers closed this Jan 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfix #8103: scipy.stats.boxcox - np.nan produces warnings #8108

Bugfix #8103: scipy.stats.boxcox - np.nan produces warnings #8108

stefansimik commented Oct 30, 2017 •

edited by ilayn

Loading

person142 Oct 31, 2017

stefansimik Nov 1, 2017 •

edited

Loading

person142 commented Nov 1, 2017

josef-pkt commented Nov 2, 2017

josef-pkt commented Nov 2, 2017

stefansimik commented Nov 3, 2017

stefansimik commented Nov 3, 2017

josef-pkt commented Nov 3, 2017

stefansimik commented Nov 3, 2017

stefansimik commented Nov 3, 2017 •

edited

Loading

josef-pkt commented Nov 4, 2017

stefansimik commented Nov 4, 2017 •

edited

Loading

stefansimik commented Nov 7, 2017 •

edited

Loading

person142 commented Nov 7, 2017

stefansimik commented Nov 8, 2017 •

edited

Loading

ilayn commented Apr 2, 2018

josef-pkt commented Apr 2, 2018

Bugfix #8103: scipy.stats.boxcox - np.nan produces warnings #8108

Bugfix #8103: scipy.stats.boxcox - np.nan produces warnings #8108

Conversation

stefansimik commented Oct 30, 2017 • edited by ilayn Loading

person142 Oct 31, 2017

Choose a reason for hiding this comment

stefansimik Nov 1, 2017 • edited Loading

Choose a reason for hiding this comment

person142 commented Nov 1, 2017

josef-pkt commented Nov 2, 2017

josef-pkt commented Nov 2, 2017

stefansimik commented Nov 3, 2017

stefansimik commented Nov 3, 2017

josef-pkt commented Nov 3, 2017

stefansimik commented Nov 3, 2017

stefansimik commented Nov 3, 2017 • edited Loading

josef-pkt commented Nov 4, 2017

stefansimik commented Nov 4, 2017 • edited Loading

stefansimik commented Nov 7, 2017 • edited Loading

person142 commented Nov 7, 2017

stefansimik commented Nov 8, 2017 • edited Loading

ilayn commented Apr 2, 2018

josef-pkt commented Apr 2, 2018

stefansimik commented Oct 30, 2017 •

edited by ilayn

Loading

stefansimik Nov 1, 2017 •

edited

Loading

stefansimik commented Nov 3, 2017 •

edited

Loading

stefansimik commented Nov 4, 2017 •

edited

Loading

stefansimik commented Nov 7, 2017 •

edited

Loading

stefansimik commented Nov 8, 2017 •

edited

Loading