
BUG: Yeo-Johnson Power Transformer gives Numpy warning #18389

Closed
lorentzenchr opened this issue Apr 30, 2023 · 12 comments · Fixed by #18852
Labels: defect (a clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats

Comments

@lorentzenchr
Contributor

Describe your issue.

scipy.stats.yeojohnson can overflow for reasonable input.

This is the same issue as scikit-learn/scikit-learn#23319.

Reproducing Code Example

import numpy as np
from scipy.stats import yeojohnson


x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0, 2009.0, 1980.0, 1999.0, 2007.0, 1991.0])
yeojohnson(x)

Error message

python3.9/site-packages/numpy/core/_methods.py:239: RuntimeWarning: overflow encountered in multiply
  x = um.multiply(x, x, out=x)
python3.9/site-packages/numpy/core/_methods.py:250: RuntimeWarning: overflow encountered in reduce
  ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)
Out[4]: 
(array([1.63003810e+154, 4.59874518e+153, 1.41487438e+154, 1.51873102e+154,
        1.87712777e+154, 1.87712777e+154, 9.45185553e+153, 1.48330911e+154,
        1.79094464e+154, 1.22758933e+154]),
 47.211105636582914)

SciPy/NumPy/Python version and system information

numpy 1.24.1
scipy 1.10.1
@lorentzenchr
Contributor Author

@lsorber ping

@WarrenWeckesser
Member

WarrenWeckesser commented Apr 30, 2023

Those values look reasonable, but in fact, that data set runs into a fundamental limitation of 64 bit floating point when one attempts to find the optimal (i.e. maximum likelihood) Yeo-Johnson parameter.

Using mpmath, one can compute the actual optimal parameter $\hat{\lambda}$ to be approximately 99.26. (I'm adding this to mpsci, but I haven't pushed it to github yet.) The problem is that for each value in that data set, $(x_i + 1)^\hat{\lambda}$ overflows when computed with 64 bit floats. E.g.

In [15]: x
Out[15]: 
array([2003., 1950., 1997., 2000., 2009., 2009., 1980., 1999., 2007.,
       1991.])

In [16]: (x + 1)**99.26
<ipython-input-16-42b558e52fe2>:1: RuntimeWarning: overflow encountered in power
  (x + 1)**99.26
Out[16]: array([inf, inf, inf, inf, inf, inf, inf, inf, inf, inf])

In other words, the result of the optimal Yeo-Johnson transform for this data set is not representable with 64 bit floats.


In practice, a user can avoid this problem by using an additional preprocessing step to scale the data. For example, the data can be normalized to have mean 0 and standard deviation 1 before applying yeojohnson. In scikit-learn, one would use sklearn.preprocessing.StandardScaler. In SciPy, scipy.stats.zscore can be used:

In [23]: from scipy.stats import zscore, yeojohnson

In [24]: z = zscore(x)

In [25]: yeojohnson(z)
Out[25]: 
(array([ 0.68110757, -1.01124976,  0.16177483,  0.39752566,  1.40110131,
         1.40110131, -0.54744521,  0.31375023,  1.13770228, -0.18007565]),
 2.387535792418072)

Of course, whether or not such a step is valid for a user depends on why they are transforming the data in the first place and what they are going to do with the transformed data.

@mdhaber
Contributor

mdhaber commented Apr 30, 2023

@WarrenWeckesser what do you recommend for this issue? Document this limitation?

@rkern
Member

rkern commented Apr 30, 2023

The proposed fix for sklearn's implementation is here: scikit-learn/scikit-learn#26188

@lsorber
Contributor

lsorber commented Apr 30, 2023

Scikit-learn and scipy have very similar implementations of Yeo-Johnson that are both prone to two sources of overflow:

  1. The first source of overflow is caused by certain invocations of np.power with unbalanced bases and exponents.
  2. The second source of overflow is more difficult and occurs when λ grows for a marginal gain in log likelihood. In this case, applying the Yeo-Johnson transform on the input will cause overflow.

Both types of overflow can be mitigated. The fix that I proposed for scikit-learn (scikit-learn/scikit-learn#26188) resolves them as follows:

  1. For the first source of overflow: instances of np.power are replaced by instances of np.exp so that bases and exponents are combined before being exponentiated.
  2. For the second source of overflow: I proposed to regularize the exponents of the transformed input so that they do not blow up for a marginal gain in log likelihood. The regularization term has a small weight relative to the log likelihood objective, and so generally does not affect the result.

The scikit-learn maintainers proposed to fix the issue in scipy and then have scikit-learn depend on its improved implementation.
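As a minimal sketch of item 1 (my own illustration, assuming nothing about the PR's actual code): combining the base and exponent in log space before exponentiating avoids the problematic `np.power` calls, and `expm1` also keeps precision when the power is close to 1.

```python
import numpy as np

# Illustrative sketch only, not the code from the PR: compute
# (x + 1)**lmbda - 1 via expm1/log1p instead of np.power.
def yj_naive(x, lmbda):
    # Direct np.power form used by the current implementations.
    return (np.power(x + 1, lmbda) - 1) / lmbda

def yj_stable(x, lmbda):
    # exp(lmbda * log1p(x)) equals (x + 1)**lmbda, but the product
    # lmbda * log1p(x) is formed first, so intermediate quantities
    # only overflow when the final result itself overflows.
    return np.expm1(lmbda * np.log1p(x)) / lmbda

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])
print(yj_naive(x, 2.0))
print(yj_stable(x, 2.0))  # agrees with the naive form for moderate lmbda
```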

@lorentzenchr
Contributor Author

> The scikit-learn maintainers proposed to fix the issue in scipy and then have scikit-learn depend on its improved implementation.

Yes, we did, in the hope of serving the greater community and ending up with only one yeojohnson implementation, i.e. the one in SciPy.

I hope the fix/PR by @lsorber is welcome here.

@WarrenWeckesser
Member

WarrenWeckesser commented May 5, 2023

Item 1 of the proposed change makes sense, and in fact, we can do even better by using expm1. That is, we can reformulate (x + 1)**lambda - 1 as expm1(lambda*log1p(x)). This is what I've used in my ufunc implementation of yeo_johnson in ufunclab, and in mpsci I created a function for this specific operation called pow1pm1. And in #18269, I added a private version of this function:

def _pow1pm1(x, y):
    """
    Compute (1 + x)**y - 1.

    Uses expm1 and xlog1py to avoid loss of precision when
    (1 + x)**y is close to 1.

    Note that the inverse of this function with respect to x is
    ``_pow1pm1(x, 1/y)``.  That is, if

        t = _pow1pm1(x, y)

    then

        x = _pow1pm1(t, 1/y)
    """
    return np.expm1(sc.xlog1py(y, x))

So +1 for item 1.

For item 2, can you give a more careful explanation and justification of the proposed regularization? Is this your own invention, or is this discussed in any literature? Do other libraries do something similar?

If we implement this regularization (or some other version of it), we should make it optional, and keep the default behavior--i.e. standard maximum likelihood estimation--as is.


I added maximum likelihood estimation of the Yeo-Johnson parameter to mpsci, so we can see what happens when we go beyond the limits of 64 bit floating point. Here's the result with the problematic example given in the issue description:

In [32]: from mpmath import mp

In [33]: from mpsci.stats import yeojohnson, yeojohnson_mle

In [34]: mp.dps = 60

In [35]: x = [2003.0, 1950.0, 1997.0, 2000.0, 2009.0, 2009.0, 1980.0, 1999.0, 2007.0, 1991.0]

In [36]: lam = yeojohnson_mle(x)

In [37]: lam
Out[37]: mpf('99.2603430377095692214027096696779458486113245271318824001869314')

As noted earlier, lam is approximately 99.26. Applying the transformation to x with this parameter results in values that exceed the limits of 64 bit floating point:

In [38]: y = [yeojohnson(t, lam) for t in x]

In [39]: y
Out[39]: 
[mpf('5.63279975560728670409184723524702574286732961584874460239970401e+325'),
 mpf('3.93812476150628325049969640780086269978542991319206107433295936e+324'),
 mpf('4.18277195895845678666896113914815260903692857590066570907339741e+325'),
 mpf('4.8544798025568013606749153785769021529514883960848775935065347e+325'),
 mpf('7.57875745783412654673196268708113987074929199857455776564246893e+325'),
 mpf('7.57875745783412654673196268708113987074929199857455776564246893e+325'),
 mpf('1.79105212742691223323284175703197212837363318845136841674533767e+325'),
 mpf('4.61948940758876337368825468096648068687477286403608397083764839e+325'),
 mpf('6.86566966230914862048107869976874181132980715203295942852252195e+325'),
 mpf('3.10323961316675736639444415229487957559815712139825415660650761e+325')]

@lsorber
Contributor

lsorber commented May 6, 2023

@WarrenWeckesser

Item 1: Even better, great!

Item 2: First, a minimal working example that illustrates the problem:

import numpy as np
from scipy.stats import yeojohnson, yeojohnson_llf
x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])
xt, lmb = yeojohnson(x)
log_likelihood = yeojohnson_llf(lmb, x)  # RuntimeWarning: overflow encountered in power: out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda

# xt = array([2.01405528e+154, 5.67790578e+153, 1.74805534e+154, 1.87644716e+154, 2.31954974e+154])
# lmb = 47.23900784138352
# log_likelihood = -13.9893342

Notice that the lambda that maximises the log likelihood is rather large, and that the transformed data's average exponent has grown from about 3 to 154. This exponent is already beyond what single precision can represent, and is getting close to what double precision can represent.

By adding a regularisation term that sums the squared exponents of the transformed data xt with a weight of 1e-4, we obtain the following result:

# xt = array([349.48615808, 342.69449234, 348.71975305, 349.10303333, 350.25194217])
# lmb = 0.7292862554669863
# log_likelihood = -15.2971991

The transformed data are now in a more reasonable range, at the cost of a 10% decrease in log likelihood. The regularisation weight could be zero by default so as not to change the current behaviour. This regularisation term is not standard AFAIK. I proposed it for scikit-learn to address the issues described above.
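For concreteness, one way such a penalized objective could be written (a sketch of the idea only; the exact regularisation term and weights in the scikit-learn PR may differ):

```python
import numpy as np
from scipy import optimize, stats

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0])

def penalized_neg_llf(lmbda, data, reg_weight=1e-4):
    # Negative Yeo-Johnson log likelihood plus a small penalty on the
    # base-10 exponents of the transformed data, so that lambda is not
    # driven to extreme values for a marginal likelihood gain.
    xt = stats.yeojohnson(data, lmbda)
    penalty = reg_weight * np.sum(np.log10(np.abs(xt) + 1.0) ** 2)
    return -stats.yeojohnson_llf(lmbda, data) + penalty

res = optimize.minimize_scalar(
    penalized_neg_llf, args=(x,), bounds=(-10.0, 45.0), method="bounded"
)
print(res.x)  # regularised optimum; the unregularised MLE is ~47.2
```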

Another way to deal with the overflow is to change the optimiser from optimize.brent to optimize.fminbound (which also uses Brent's method) and pre-compute bounds on lambda that would prevent overflow from occurring in the optimisation. This is something I would consider regardless of the decision to add the regularisation term above.


Related to this issue: I came across a relatively recent paper [1] that proves that the Yeo-Johnson negative log likelihood objective is convex, and proposes to replace the Brent optimiser with a more robust exponential search. In Figure 2, they also illustrate overflow issues with Brent optimisation of the Yeo-Johnson log likelihood that are similar to the ones described in this issue.

[1] https://arxiv.org/abs/2210.01639

@mdhaber
Contributor

mdhaber commented Jun 1, 2023

@lsorber could you submit the PR with at least item 1 and the change to fminbound? Perhaps include the regularization idea in a separate commit so it can be considered separately. It sounds like a good practical solution, but we'll need to run it by the mailing list.

@mdhaber
Contributor

mdhaber commented Jun 13, 2023

@lsorber @lorentzenchr were you interested in submitting the fix?

@lsorber
Contributor

lsorber commented Jun 13, 2023

Yes, I'd like to help implement the fix according to your proposal. I haven't had much time the past few weeks, but will try to fit it in over the coming weeks!

@lsorber
Contributor

lsorber commented Jul 10, 2023

@WarrenWeckesser @mdhaber @lorentzenchr I implemented fixes (1) and (2) in #18852. These should resolve the two sources of overflow.
