Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix #8103: scipy.stats.boxcox - np.nan produces warnings #8108

Closed
wants to merge 4 commits into from
Closed

Bugfix #8103: scipy.stats.boxcox - np.nan produces warnings #8108

wants to merge 4 commits into from

Conversation

stefansimik
Copy link

@stefansimik stefansimik commented Oct 30, 2017

Closes #8103

raise ValueError("Data must be positive.")

if lmbda is not None: # single transformation
return special.boxcox(x, lmbda)
return special.boxcox(x, lmbda) # propagates nan's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be

if nan_policy == 'propagate':
    return special.boxcox(x, lmbda)
else:
    return special.boxcox(x[not_nan_mask], lmbda)

It won't broadcast lmbda correctly, but that doesn't seem like an intended use case. @josef-pkt or @ev-br would know better than me though, so we should see what they say.

Copy link
Author

@stefansimik stefansimik Nov 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great finding, thank you.
To achieve the same functionality as you suggested, I just moved 2 lines of code (for omitting nan values) more to the top, so it executes earlier than before.
Check please, if it meets your expectations - the effect should be the same,
but no additional IF is needed (less complexity).

@person142
Copy link
Member

Looks good to me, but I forgot to ask you to add a regression test (my bad). Would probably be fine to take your original example from #8103 and check that you get the right results with the various possible nan policies.

@josef-pkt
Copy link
Member

about broadcasting: given that the docstring says x is 1-dim, it won't be available.
e.g. taking the mask indexed array would ravel.

I think an extra "omit-fit" nan-policy would be useful in this case, which is now not available, AFAICS.
That is, omit the nans in the estimation of lmbda, but propagate the nans to the transformed array.

(E.g. that would make it easier to combine the transformed array with some original data like a pandas DataFrame).

@josef-pkt
Copy link
Member

The docstring is missing a See Also to the scipy.special function, which users can use if they don't need to estimate lambda and have broadcasting.

@stefansimik
Copy link
Author

@person142 Tests were added, but I don't know how to run them locally on my Windows machine.
I expect, it requires some special setup to compile whole scipy first, install it, and so on...
If you know some tutorial how to do it locally, let me know.

Otherwise, I will wait until CI runs and checks all tests.

@stefansimik
Copy link
Author

@josef-pkt
My understanding is, that

  • lmbda be calculated only using real values i.e. nan values are not usable in calculating lmbda itself.
  • in this context - nan_policy parameter logically applies only to the transformation of the values - and not calculation of lmbda -- but here, I agree, this should be more clearly stated in the docs and I would be happy to do it, if you agree

@josef-pkt
Copy link
Member

@stefansimik but AFAICS, you don't have an option for computing lmbda without nans, but returning the array with nans. Or do I misread the changes?

@stefansimik
Copy link
Author

@josef-pkt

Yes, that's true - you are right. That can happen = computing lmbda without nan's, but returning transformed values with nan's.

In my understanding, the calculation of lambda itself is not supposed to be parametrized by nan_policy. But maybe it should be, I am open, if you can see some practical use-case.

I can see some value in consistency, if we added additional param lmbda_nan_policy, that would have default value 'omit', and other possible value 'raise'. Option 'propagate' is useless here.

But I have other suggestion to fix and answer all these questions. I will write it here in new comment from my PC...

@stefansimik
Copy link
Author

stefansimik commented Nov 3, 2017

@josef-pkt @person142

Guys, I would like to suggest to go in different direction - let's be open-minded for a while and imagine, that boxcox(...) method will have no new parameter(s) like nan_policy, but propagates nan's always -- the same way like np.log(...) does.

How it will work and handle all possible scenarios:

  • nan propagation scenario: this is default and most common behavior.
  • nan omit scenario: --> one can remove nan's if this is requirement from application point of view
  • nan raise scenario: --> one can check nan's and raise exception is requirement from application point of view

What are main benefits:

  • simple, straightforward, working and intuitive result for majority of use-cases
    • the same way, like we can log-transform, we can boxcox-transform
  • less parameters, less complexity, less thinking
  • less maintenance

On top of that, I believe, that function like np.log(), or boxcox():

  • should do only core logic and return results. Nothing more, nothing less.
    • that means, such methods should not add too much on top of basic functionality = it should not do complex checking of inputs, nor support multiple input check scenarios to meet various expectations of different domains. This is responsibility of applications, which uses boxcox() method to make sure it prepares proper inputs - and this is based on application domain knowledge.

I believe, that preparing proper inputs or various input checking options should be part of the application, not part of the core math/stats function itself.

If the function has logical nan handling, let's use it. If someone wants to block nan's, or application domain requires it - then nan's can be easily removed (it is one liner code).

Do you know anybody, who would require to put some nan_policy into np.log(...) method?

Probably not many if at all. And in special cases, where it is needed - one can remove nan's before calling np.log(...).

So let's get rid of the nan_policy. As we dont need nan_policy in np.log(...), the same way we don't need nan_policy in boxcox().

If one sends array with nan's into boxcox(...):

  • then lmbda will be calculated from values other than nan's
    • this is the only logical thing we can do
  • transformation will be done on values also other than nan's
    • intuitive and logical handling
  • where nan's is, there it remains
    • common expected handling

Let's make it work the same way as np.log(...).
Let's make it practical, simple and intuitive - as Python is expected to be.

How would you feel if:

  • you have complex boxcox function, where you have to study and evaluate additional
    parameters like nan_policy for transformation and another nan_policy for calculation lmbda?

vs.

  • you have simple boxcox() that just works. Transparent working by propagating nan's.
    Simple intuitive behavior works better, than rich and complex.

This my suggestion: let's get rid of nan_policy and always propagate nan's.
But you guys have the last word of course ;)

@josef-pkt
Copy link
Member

@stefansimik There is a difference between scipy.stats and scipy.special.
The scipy special functions are elementwise and behave like you describe,
The scipy.stats functions need to do some further processing over the entire array in almost all cases. So the analogy to np.log is not appropriate and e.g. np.mean versus np.nanmean would be a better analogy.

In this case, stats.boxcox as a simple transformation is now essentially obsolete given the scipy.special function. The advantage is that we estimate and transform at the same time, as a convenience function that combines two separate steps.
Similar in hypothesis tests or most other stats functions we have to let the user decide how to handle nan values, because simply propagating them throughout is not possible.

(Note: my initial suggestion was to use 'ignore' instead of 'propagate' as an option to not even check for nans if checking for nans is relatively expensive for a function that itself does just some simple computation. I'm slowly starting to change my mind about whether that's a useful default if the function has to do more than some very simple computation.)

@stefansimik
Copy link
Author

stefansimik commented Nov 4, 2017

@josef-pkt
Thank you. I learnt a lot from your detailed explanation.

Similar in hypothesis tests or most other stats functions we have to let the user decide how to handle nan values, because simply propagating them throughout is not possible.

I can understand, that in scipy.stats - there are hypothesis-tests and many other stats-functions, where nan_policy is relevant and nature of the problem makes it useful.

It is important to answer these questions, to resolve this issue:

  • But are we sure, this is case also for scipy.stats.boxcox(..) ?
  • Why we need new nan_policy_param, when every scenario can be done nicely without it?
  • Is there any imaginable scenario, where nan_policy would be must-have at all?
  • What is the significant practical value, that compensates for new added complexity of nan_policy, that would make life of developer easier?

I mean, boxcox(..) without nan_policy can work great, because

  • it just works for all use-cases (propagate / omit / raise scenarios)
  • it is simple, practically working and transparent choice.
  • it adheres to the principle of least astonishment - it produces intuitive result, that is expected.

I know, stats.boxcox() does 2 thing at once, which makes thing more complex, but:

  • calculation of lambda has to be done by omitting nan's - always.
    • There is no other option = there is nothing to configure. Why configure this at all , if there no option? Let's just state it clearly in docs, how it works.
  • transformation can be done with nan's - by propagating nan's.
    • Why to add options for throwing exceptions here, or omit nan's, when everyone can do it itself - in special cases, they really want it? - manually before calling boxcox(...). Nobody prevents them.

It can be expected, that most usages of stats.boxcox() will be straightforward =

  • get results for boxcox() and continue in the workflow. No raising exceptions. Anyway - in cases, someone wants to raise, he is free to do it. So what is motivation to incorporate such logic into the core of the boxcox(...) function?

I see it very similarly for 'omit' option, because one can easily write one liner - x[~is.nan(x)] - and remove nan's before calling boxcox() if it is required by application logic. It is trivial to do, so I don't see added value in incorporating such functionality in the core of the boxcox() function.

Each mentioned use-case can be done without nan_policy easily. It is generally usable and powerful without nan_policy. Isee no need to make things more complex.

Everything where I look says to me there is no significant nor practical added value in nan_policy parameter for stats.boxcox(), that would compensate for higher complexity.

For case of boxcox() method, I see nan_policy as

  • a little overthinked solution, trying to solve trivial functionality of nan-checking scenarios
  • it does not bring significant or widely practical value
  • concept of nan_policy for boxcox() can in ideal world save developer this 1 line: x[~is.nan(x)] or raising exception - in special cases, where he wants it, but pollutes the core functionality with new complexity and maintenance forever.

I am open-minded, but in this light - it looks lit not worth it.
Boxcox() without nan_policy works nicely for all use-cases.

@stefansimik
Copy link
Author

stefansimik commented Nov 7, 2017

@josef-pkt @person142
Hi guys, any comments or suggestions? Which option do you prefer - with or without nan_policy?
Your opinion is important now, so we can finalize this issue and make scipy even better ;)

@person142
Copy link
Member

I will defer to whatever @josef-pkt thinks, since he is much wiser in the ways of scipy.stats than I am.

@stefansimik
Copy link
Author

stefansimik commented Nov 8, 2017

@person142 @josef-pkt
Thank you. I agree and will do the same.
Despite any opinions, I am still learning scipy so I will be happy to update
code based on @josef-pkt decision.
No more comments, just implementation of final solution. :)

@josef-pkt could you help us to resolve this issue, based on your opinion, please?

@ilayn
Copy link
Member

ilayn commented Apr 2, 2018

@josef-pkt when you have time, could you please have a look at this?

@josef-pkt
Copy link
Member

I still have problems with this, and I'm still not sure what the appropriate solution should be.

The underlying problem also affects other functions, e.g.
https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L2246
(zscore, zmap, obrientransform and possibly the trimming functions are the current elementwise functions, besides box-cox if lambda is estimated.)

Those function are elementwise like ufuncs where nan propagation is a useful and the usual default. However, they need intermediate results where a single nan destroys the results.

The "obvious" (most common usage) solution would be to behave like ufuncs and always propagate nan in the elementwise part, but use nan robust computation in the reduce statistics.
But I'm not sure what the options for this should be, and whether decisions should be hard coded instead of optional.
The most useful option IMO would be raise and fit-omit. AFAIK, all current usage for nan-policy are for reduce functions. So, there is no established pattern for ufunc_like functions that also need a reduce.

The trimming functions are a bit in between, because they explicitly exclude some elements by definition.

For box-cox a similar issue as nan handling would be what to do with non-positive values, options would be to drop or add a constant to make them positive. That is also left to the user, and the function currently hardcodes "raise".

aside: I just saw that sigmaclip is missing in the function list at
https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L87

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Items related to regular maintenance tasks scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants