
BUG: Yeo-Johnson Power Transformer gives Numpy warning #18389

Closed
lorentzenchr opened this issue Apr 30, 2023 · 12 comments · Fixed by #18852
Labels: defect (a clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats

Comments

@lorentzenchr
Contributor

Describe your issue.

scipy.stats.yeojohnson can overflow for reasonable input.

This is the same issue as scikit-learn/scikit-learn#23319.

Reproducing Code Example

import numpy as np
from scipy.stats import yeojohnson


x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0, 2009.0, 1980.0, 1999.0, 2007.0, 1991.0])
yeojohnson(x)

Error message

python3.9/site-packages/numpy/core/_methods.py:239: RuntimeWarning: overflow encountered in multiply
  x = um.multiply(x, x, out=x)
python3.9/site-packages/numpy/core/_methods.py:250: RuntimeWarning: overflow encountered in reduce
  ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)
Out[4]: 
(array([1.63003810e+154, 4.59874518e+153, 1.41487438e+154, 1.51873102e+154,
        1.87712777e+154, 1.87712777e+154, 9.45185553e+153, 1.48330911e+154,
        1.79094464e+154, 1.22758933e+154]),
 47.211105636582914)

SciPy/NumPy/Python version and system information

numpy 1.24.1
scipy 1.10.1
@lorentzenchr
Contributor Author

@lsorber ping

@WarrenWeckesser
Member

WarrenWeckesser commented Apr 30, 2023

Those values look reasonable, but in fact, that data set runs into a fundamental limitation of 64 bit floating point when one attempts to find the optimal (i.e. maximum likelihood) Yeo-Johnson parameter.

Using mpmath, one can compute the actual optimal parameter $\hat{\lambda}$ to be approximately 99.26. (I'm adding this to mpsci, but I haven't pushed it to github yet.) The problem is that for each value in that data set, $(x_i + 1)^\hat{\lambda}$ overflows when computed with 64 bit floats. E.g.

In [15]: x
Out[15]: 
array([2003., 1950., 1997., 2000., 2009., 2009., 1980., 1999., 2007.,
       1991.])

In [16]: (x + 1)**99.26
<ipython-input-16-42b558e52fe2>:1: RuntimeWarning: overflow encountered in power
  (x + 1)**99.26
Out[16]: array([inf, inf, inf, inf, inf, inf, inf, inf, inf, inf])

In other words, the result of the optimal Yeo-Johnson transform for this data set is not representable with 64 bit floats.


In practice, a user can avoid this problem by using an additional preprocessing step to scale the data. For example, the data can be normalized to have mean 0 and standard deviation 1 before applying yeojohnson. In scikit-learn, one would use sklearn.preprocessing.StandardScaler. In SciPy, scipy.stats.zscore can be used:

In [23]: from scipy.stats import zscore, yeojohnson

In [24]: z = zscore(x)

In [25]: yeojohnson(z)
Out[25]: 
(array([ 0.68110757, -1.01124976,  0.16177483,  0.39752566,  1.40110131,
         1.40110131, -0.54744521,  0.31375023,  1.13770228, -0.18007565]),
 2.387535792418072)

Of course, whether or not such a step is valid for a user depends on why they are transforming the data in the first place and what they are going to do with the transformed data.

@mdhaber
Contributor

mdhaber commented Apr 30, 2023

@WarrenWeckesser what do you recommend for this issue? Document this limitation?

@rkern
Member

rkern commented Apr 30, 2023

The proposed fix for sklearn's implementation is here: scikit-learn/scikit-learn#26188

@lsorber
Contributor

lsorber commented Apr 30, 2023

Scikit-learn and scipy have very similar implementations of Yeo-Johnson that are both prone to two sources of overflow:

  1. The first source of overflow is caused by certain invocations of np.power with unbalanced bases and exponents.
  2. The second source of overflow is more difficult and occurs when λ grows for a marginal gain in log likelihood. In this case, applying the Yeo-Johnson transform on the input will cause overflow.

Both types of overflow can be mitigated. The fix that I proposed for scikit-learn (scikit-learn/scikit-learn#26188) resolves them as follows:

  1. For the first source of overflow: instances of np.power are replaced by instances of np.exp so that bases and exponents are combined before being exponentiated.
  2. For the second source of overflow: I proposed to regularize the exponents of the transformed input so that they do not blow up for a marginal gain in log likelihood. The regularization term has a small weight relative to the log likelihood objective, and so generally does not affect the result.

The scikit-learn maintainers proposed to fix the issue in scipy and then have scikit-learn depend on its improved implementation.
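As a minimal sketch of item 1 (my own illustration, assuming nothing about the PR's actual code): combining the base and exponent in log space before exponentiating avoids the problematic `np.power` calls, and `expm1` also keeps precision when the power is close to 1.

```python
import numpy as np

# Illustrative sketch only, not the code from the PR: compute
# (x + 1)**lmbda - 1 via expm1/log1p instead of np.power.
def yj_naive(x, lmbda):
    # Direct np.power form used by the current implementations.
    return (np.power(x + 1, lmbda) - 1) / lmbda

def yj_stable(x, lmbda):
    # exp(lmbda * log1p(x)) equals (x + 1)**lmbda, but the product
    # lmbda * log1p(x) is formed first, so intermediate quantities
    # only overflow when the final result itself overflows.
    return np.expm1(lmbda * np.log1p(x)) / lmbda

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])
print(yj_naive(x, 2.0))
print(yj_stable(x, 2.0))  # agrees with the naive form for moderate lmbda
```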

@lorentzenchr
Contributor Author

> The scikit-learn maintainers proposed to fix the issue in scipy and then have scikit-learn depend on its improved implementation.

Yes, we did, in the hope of serving the greater community and ending up with only one yeojohnson implementation, i.e. the one in SciPy.

I hope the fix/PR by @lsorber is welcome here.

@WarrenWeckesser
Member

WarrenWeckesser commented May 5, 2023

Item 1 of the proposed change makes sense, and in fact, we can do even better by using expm1. That is, we can reformulate (x + 1)**lambda - 1 as expm1(lambda*log1p(x)). This is what I've used in my ufunc implementation of yeo_johnson in ufunclab, and in mpsci I created a function for this specific operation called pow1pm1. And in #18269, I added a private version of this function:

def _pow1pm1(x, y):
    """
    Compute (1 + x)**y - 1.

    Uses expm1 and xlog1py to avoid loss of precision when
    (1 + x)**y is close to 1.

    Note that the inverse of this function with respect to x is
    ``_pow1pm1(x, 1/y)``.  That is, if

        t = _pow1pm1(x, y)

    then

        x = _pow1pm1(t, 1/y)
    """
    return np.expm1(sc.xlog1py(y, x))

So +1 for item 1.

For item 2, can you give a more careful explanation and justification of the proposed regularization? Is this your own invention, or is this discussed in any literature? Do other libraries do something similar?

If we implement this regularization (or some other version of it), we should make it optional, and keep the default behavior--i.e. standard maximum likelihood estimation--as is.


I added maximum likelihood estimation of the Yeo-Johnson parameter to mpsci, so we can see what happens when we go beyond the limits of 64 bit floating point. Here's the result with the problematic example given in the issue description:

In [32]: from mpmath import mp

In [33]: from mpsci.stats import yeojohnson, yeojohnson_mle

In [34]: mp.dps = 60

In [35]: x = [2003.0, 1950.0, 1997.0, 2000.0, 2009.0, 2009.0, 1980.0, 1999.0, 2007.0, 1991.0]

In [36]: lam = yeojohnson_mle(x)

In [37]: lam
Out[37]: mpf('99.2603430377095692214027096696779458486113245271318824001869314')

As noted earlier, lam is approximately 99.26. Applying the transformation to x with this parameter results in values that exceed the limits of 64 bit floating point:

In [38]: y = [yeojohnson(t, lam) for t in x]

In [39]: y
Out[39]: 
[mpf('5.63279975560728670409184723524702574286732961584874460239970401e+325'),
 mpf('3.93812476150628325049969640780086269978542991319206107433295936e+324'),
 mpf('4.18277195895845678666896113914815260903692857590066570907339741e+325'),
 mpf('4.8544798025568013606749153785769021529514883960848775935065347e+325'),
 mpf('7.57875745783412654673196268708113987074929199857455776564246893e+325'),
 mpf('7.57875745783412654673196268708113987074929199857455776564246893e+325'),
 mpf('1.79105212742691223323284175703197212837363318845136841674533767e+325'),
 mpf('4.61948940758876337368825468096648068687477286403608397083764839e+325'),
 mpf('6.86566966230914862048107869976874181132980715203295942852252195e+325'),
 mpf('3.10323961316675736639444415229487957559815712139825415660650761e+325')]

@lsorber
Contributor

lsorber commented May 6, 2023

@WarrenWeckesser

Item 1: Even better, great!

Item 2: First, a minimal working example that illustrates the problem:

import numpy as np
from scipy.stats import yeojohnson, yeojohnson_llf
x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])
xt, lmb = yeojohnson(x)
log_likelihood = yeojohnson_llf(lmb, x)  # RuntimeWarning: overflow encountered in power: out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda

# xt = array([2.01405528e+154, 5.67790578e+153, 1.74805534e+154, 1.87644716e+154, 2.31954974e+154])
# lmb = 47.23900784138352
# log_likelihood = -13.9893342

Notice that the lambda that maximises the log likelihood is rather large, and that the transformed data's average exponent has grown from about 3 to 154. This exponent is already beyond what single precision can represent, and is getting close to what double precision can represent.

By adding a regularisation term that sums the squared exponents of the transformed data xt with a weight of 1e-4, we obtain the following result:

# xt = array([349.48615808, 342.69449234, 348.71975305, 349.10303333, 350.25194217])
# lmb = 0.7292862554669863
# log_likelihood = -15.2971991

The transformed data are now in a more reasonable range, at the cost of a 10% decrease in log likelihood. The regularisation weight could be zero by default so as not to change the current behaviour. This regularisation term is not standard AFAIK. I proposed it for scikit-learn to address the issues described above.
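For concreteness, one way such a penalized objective could be written (a sketch of the idea only; the exact regularisation term and weights in the scikit-learn PR may differ):

```python
import numpy as np
from scipy import optimize, stats

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])

def penalized_neg_llf(lmbda, data, reg_weight=1e-4):
    # Negative Yeo-Johnson log likelihood plus a small penalty on the
    # base-10 exponents of the transformed data, so that lambda is not
    # driven to extreme values for a marginal likelihood gain.
    xt = stats.yeojohnson(data, lmbda)
    penalty = reg_weight * np.sum(np.log10(np.abs(xt) + 1.0) ** 2)
    return -stats.yeojohnson_llf(lmbda, data) + penalty

res = optimize.minimize_scalar(
    penalized_neg_llf, args=(x,), bounds=(-10.0, 45.0), method="bounded"
)
print(res.x)  # regularised optimum; the unregularised MLE is ~47.2
```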

Another way to deal with the overflow is to change the optimiser from optimize.brent to optimize.fminbound (which also uses Brent's method) and pre-compute bounds on lambda that would prevent overflow from occurring in the optimisation. This is something I would consider regardless of the decision to add the regularisation term above.


Related to this issue: I came across a relatively recent paper [1] that proves that the Yeo-Johnson negative log likelihood objective is convex, and proposes to replace the Brent optimiser with a more robust exponential search. In Figure 2, they also illustrate overflow issues with Brent optimisation of the Yeo-Johnson log likelihood that are similar to the ones described in this issue.

[1] https://arxiv.org/abs/2210.01639

@mdhaber
Contributor

mdhaber commented Jun 1, 2023

@lsorber could you submit the PR with at least item 1 and the change to fminbound? Perhaps include the regularization idea in a separate commit so it can be considered separately. It sounds like a good practical solution, but we'll need to run it by the mailing list.

@mdhaber
Contributor

mdhaber commented Jun 13, 2023

@lsorber @lorentzenchr were you interested in submitting the fix?

@lsorber
Contributor

lsorber commented Jun 13, 2023

Yes, I'd like to help implement the fix according to your proposal. I haven't had much time the past few weeks, but will try to fit it in over the coming weeks!

@lsorber
Contributor

lsorber commented Jul 10, 2023

@WarrenWeckesser @mdhaber @lorentzenchr I implemented fixes (1) and (2) in #18852. These should resolve the two sources of overflow.
