BUG: Yeo-Johnson Power Transformer gives Numpy warning #18389
Comments
@lsorber ping
Those values look reasonable, but in fact, that data set runs into a fundamental limitation of 64-bit floating point when one attempts to find the optimal (i.e. maximum likelihood) Yeo-Johnson parameter.

In other words, the result of the optimal Yeo-Johnson transform for this data set is not representable with 64-bit floats. In practice, a user can avoid this problem by using an additional preprocessing step to scale the data. For example, the data can be normalized to have mean 0 and standard deviation 1 before applying the transform.

Of course, whether or not such a step is valid for a user depends on why they are transforming the data in the first place and what they are going to do with the transformed data.
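For instance, a minimal sketch of that scaling workaround on the data from this issue:

```python
import numpy as np
from scipy.stats import yeojohnson

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])
# Standardise to mean 0, standard deviation 1 before transforming,
# which keeps the optimal lambda (and the transformed values) small.
x_scaled = (x - x.mean()) / x.std()
xt, lmb = yeojohnson(x_scaled)
```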
@WarrenWeckesser what do you recommend for this issue? Document this limitation?
The proposed fix for
Scikit-learn and scipy have very similar implementations of Yeo-Johnson that are both prone to two sources of overflow:

1. Overflow in the log likelihood: the variance of the transformed data can exceed the double-precision range even when the transformed data itself is still representable.
2. Overflow of the transform itself: during optimisation, sufficiently large values of lambda produce transformed data that is not representable.

Both types of overflow can be mitigated. The fix that I proposed for scikit-learn (scikit-learn/scikit-learn#26188) resolves them as follows:

1. By computing the variance of the transformed data in log space.
2. By adding a regularisation term that penalises large exponents in the transformed data, which steers the optimiser away from extreme values of lambda.
The scikit-learn maintainers proposed to fix the issue in scipy and then have scikit-learn depend on its improved implementation.
Yes we did, in the hope of serving the greater community and having only one Yeo-Johnson implementation, i.e. the one in scipy. I hope the fix/PR of @lsorber is welcome here.
Item 1 of the proposed change makes sense, and in fact, we can do even better by using the approach in scipy/scipy/stats/_continuous_distns.py, lines 1948 to 1964 at commit b0d1a4c.
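In that spirit, a variance computed entirely in log space might look like the following sketch (assuming strictly positive values; mixed signs take more care):

```python
import numpy as np
from scipy.special import logsumexp

def log_var(logx):
    """Return log(Var(x)) given logx = log(x), without forming x itself.

    Sketch for strictly positive x, using Var(x) = E[x^2] - E[x]^2.
    """
    n = len(logx)
    log_mean = logsumexp(logx) - np.log(n)         # log E[x]
    log_mean_sq = logsumexp(2 * logx) - np.log(n)  # log E[x^2]
    # E[x^2] >= E[x]^2, so the argument of log1p lies in [-1, 0).
    return log_mean_sq + np.log1p(-np.exp(2 * log_mean - log_mean_sq))
```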
So +1 for item 1.

For item 2, can you give a more careful explanation and justification of the proposed regularization? Is this your own invention, or is it discussed in the literature? Do other libraries do something similar? If we implement this regularization (or some other version of it), we should make it optional, and keep the default behavior (i.e. standard maximum likelihood estimation) as is. I added maximum likelihood estimation of the Yeo-Johnson parameter to scipy.
Item 1: Even better, great!

Item 2: First, a minimal working example that illustrates the problem:

```python
import numpy as np
from scipy.stats import yeojohnson, yeojohnson_llf

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])
xt, lmb = yeojohnson(x)
log_likelihood = yeojohnson_llf(lmb, x)
# RuntimeWarning: overflow encountered in power:
#     out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda
# xt = array([2.01405528e+154, 5.67790578e+153, 1.74805534e+154,
#             1.87644716e+154, 2.31954974e+154])
# lmb = 47.23900784138352
# log_likelihood = -13.9893342
```

Notice that the lambda that maximises the log likelihood is rather large, and that the transformed data's average exponent has grown from about 3 to 154. This exponent is already beyond what single precision can represent, and is getting close to what double precision can represent. By adding a regularisation term that sums the squared exponents of the transformed data to the objective, the result becomes:

```python
# xt = array([349.48615808, 342.69449234, 348.71975305, 349.10303333, 350.25194217])
# lmb = 0.7292862554669863
# log_likelihood = -15.2971991
```

The transformed data are now in a more reasonable range, at the cost of a 10% decrease in log likelihood. The regularisation weight could be zero by default so as not to change the current behaviour.

This regularisation term is not standard AFAIK. I proposed it for scikit-learn to address the issues described above. Another way to deal with the overflow is to change the optimiser from Brent to a bounded method, so that the search space for lambda can be restricted.

Related to this issue: I came across a relatively recent paper [1] that proves that the Yeo-Johnson negative log likelihood objective is convex, and proposes to replace the Brent optimiser with a more robust exponential search. In Figure 2, they also illustrate overflow issues with Brent optimisation of the Yeo-Johnson log likelihood that are similar to the ones described in this issue.
@lsorber could you submit the PR with at least item 1 and the change to
@lsorber @lorentzenchr were you interested in submitting the fix?
Yes, I'd like to help implement the fix according to your proposal. I haven't had much time the past few weeks, but will try to fit it in over the coming weeks!
@WarrenWeckesser @mdhaber @lorentzenchr I implemented fixes (1) and (2) in #18852. These should resolve the two sources of overflow. |
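With a scipy build that includes #18852, the example from this thread can be sanity-checked by promoting warnings to errors:

```python
import warnings
import numpy as np
from scipy.stats import yeojohnson, yeojohnson_llf

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])
with warnings.catch_warnings():
    warnings.simplefilter("error")  # any overflow warning becomes an error
    xt, lmb = yeojohnson(x)
    log_likelihood = yeojohnson_llf(lmb, x)
```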
Describe your issue.
scipy.stats.yeojohnson can overflow for reasonable input. This is the same issue as scikit-learn/scikit-learn#23319.
Reproducing Code Example
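The minimal example from the discussion above reproduces the warning:

```python
import numpy as np
from scipy.stats import yeojohnson

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])
xt, lmb = yeojohnson(x)
```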
Error message
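As shown in the discussion above:

```
RuntimeWarning: overflow encountered in power
    out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda
```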
SciPy/NumPy/Python version and system information