'stats.boxcox' returns all the same values #6873

Closed
KarlQu1990 opened this issue Dec 20, 2016 · 21 comments · Fixed by #10072
Labels
defect (A clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats
@KarlQu1990

I applied stats.boxcox to my data and the returned values are all the same, which seems really unreasonable! It returns the same result in scipy 0.18.1 and scipy 0.17.1. The optimal lambda in my case is -5.501196436791543.
You can download my data (data.txt).
I also tried the boxcox function in R and it returned a reasonable result. So I think the boxcox_normmax function needs more testing and should be used carefully.

@ev-br
Member

ev-br commented Jan 8, 2017

@Qukaiyi could you please add a minimal reproducible example, something we can directly run? Also it would be helpful if you explain what precisely you think is wrong with scipy.stats.boxcox. If there's a discrepancy with R, please also add specific R commands you're using.
As written, this report is not actionable.

@ev-br added the scipy.stats and query (A question or suggestion that requires further information) labels Jan 8, 2017
@KarlQu1990
Copy link
Author

@ev-br I'm sorry. Here is my example in Python and R:
First, load my example data from here (example_data.txt):

In [1]: import numpy as np
   ...: data = np.genfromtxt('example_data.txt', delimiter=',')
   ...: print(data)
   ...: print('count:', len(data))
Out[1]: [   15957.   112079.  1039553. ...,  3166209.   175913.   159211.]
count: 414

Now, I use scipy.stats.boxcox and scipy.stats.boxcox_normmax:

In [2]: import scipy.stats
   ...: scipy.stats.boxcox(data)
Out[2]: (array([ 0.18177864,  0.18177864,  0.18177864, ...,  0.18177864,
         0.18177864,  0.18177864]), -5.501196436791543)

In [3]: scipy.stats.boxcox_normmax(data, method='all')
Out[3]: array([-8.12354424, -5.50119644])

We see that boxcox_normmax chooses a lambda far outside the (-2, 2) interval, and the transformed values are all the same.

And the following is the result in R:

library(forecast)
data = scan('example_data.txt',sep=',')

lmba1 = BoxCox.lambda(data,method="guerrero") ;lmba1
[1] -0.178229
lmba2 = BoxCox.lambda(data,method="loglik") ;lmba2
[1] -0.05

BoxCox(data,lambda=lmba1 )
  [1] 4.610912 4.904361 5.135813 5.102642 4.957027 5.020616 4.963566 4.804480
  [9] 5.108647 4.977802 4.947819 5.064398 5.146567 4.871337 5.185068 5.072204
 [17] 5.026774 5.183878 4.762700 5.057233 4.925822 5.004983 4.944475 4.748675
 [25] 5.123928 4.809082 4.930418 5.030869 5.061115 4.828695 5.159066 5.156983
......

BoxCox(data,lambda=lmba2 )
  [1]  7.672289  8.817117  9.995678  9.804398  9.057570  9.367238  9.088384
  [8]  8.394422  9.838348  9.156260  9.014549  9.594710 10.059748  8.672853
 [15] 10.298290  9.636629  9.398482 10.290689  8.228768  9.556607  8.913490
 [22]  9.288967  8.999032  8.174477  9.926068  8.413047  8.934409  9.419391
......

The following is a short description of R's BoxCox.lambda:

Description
If method=="guerrero", Guerrero's (1993) method is used,
where lambda minimizes the coefficient of variation for subseries of x.

If method=="loglik", the value of lambda is chosen to maximize the profile 
log likelihood of a linear model fitted to x. For non-seasonal data, a linear
time trend is fitted while for seasonal data, a linear time trend with seasonal
dummy variables is used.

Usage
BoxCox.lambda(x, method=c("guerrero","loglik"), lower=-1, upper=2)

(Note that BoxCox.lambda constrains lambda to the [lower, upper] = [-1, 2] interval by default, which keeps R away from the extreme negative lambdas in the first place.)

Last, I run Python again with the lambdas from R and get the following results:

In [4]: scipy.stats.boxcox(data, -0.178229)
Out[4]: 
[ 4.61091154  4.90436036  5.13581228 ...,  5.22132242  4.95889455   4.94720076]

In [5]: scipy.stats.boxcox(data, -0.05)
Out[5]: 
[  7.67228932   8.81711694   9.99567795 ...,  10.53756266   9.06635173   9.01167889]

We can see that the results of R and Python are almost the same when using the same lambda.
I hope this helps you find the problem.
Thanks for your time!

@WarrenWeckesser added the defect (A clear bug or issue that prevents SciPy from being installed or used as expected) label Jan 9, 2017
@WarrenWeckesser
Member

Apparently the code that computes the MLE of lambda in boxcox_normmax allows an intermediate value of lambda to become negative enough (and the values in the input array are big enough) that the terms in x**lambda all become less than the machine epsilon, and then x**lambda - 1 is an array of all -1:

In [139]: x[:6]
Out[139]: array([   15957.,   112079.,  1039553.,   711775.,   173111.,   307382.])

In [140]: x[:6]**-5
Out[140]: 
array([  9.66593287e-22,   5.65429890e-26,   8.23695740e-31,
         5.47376754e-30,   6.43245378e-27,   3.64424729e-28])

In [141]: x[:6]**-5 - 1
Out[141]: array([-1., -1., -1., -1., -1., -1.])

When this happens, any further calculation performed by boxcox_normmax is basically garbage--all precision has been lost.

Note that you get several warnings when you compute boxcox_normmax(x, method='mle'):

In [142]: boxcox_normmax(x, method='mle')
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/stats/morestats.py:901: RuntimeWarning: divide by zero encountered in log
  llf -= N / 2.0 * np.log(np.sum((y - y_mean)**2. / N, axis=0))
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/optimize/optimize.py:1851: RuntimeWarning: invalid value encountered in double_scalars
  p = (x - v) * tmp2 - (x - w) * tmp1
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/optimize/optimize.py:1852: RuntimeWarning: invalid value encountered in double_scalars
  tmp2 = 2.0 * (tmp2 - tmp1)
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/optimize/optimize.py:1850: RuntimeWarning: invalid value encountered in double_scalars
  tmp2 = (x - v) * (fx - fw)
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/optimize/optimize.py:1849: RuntimeWarning: invalid value encountered in double_scalars
  tmp1 = (x - w) * (fx - fv)
Out[142]: -5.501196436791543

A somewhat surprising work-around is to tweak the brack argument of boxcox_normmax. I found that changing either end point by just a small amount was sufficient:

In [143]: boxcox_normmax(x, brack=(-1.9, 2.0),  method='mle')
Out[143]: -0.051654325496244997

In [144]: boxcox_normmax(x, brack=(-2.0, 1.99),  method='mle')
Out[144]: -0.051654321046728034

So instead of using boxcox(x), you can use:

In [162]: lmax = boxcox_normmax(x, brack=(-1.9, 2.0),  method='mle')

In [163]: lmax
Out[163]: -0.051654325496244997

In [164]: y = boxcox(x, lmax)

In [165]: y[:4]
Out[165]: array([ 7.61609432,  8.74095402,  9.89497438,  9.70797019])

@WarrenWeckesser removed the query (A question or suggestion that requires further information) label Jan 9, 2017
@KarlQu1990
Author

@WarrenWeckesser Thank you for your quick reply!
Your suggestion of tweaking the brack argument is a fine workaround and works for me now.
Thanks a lot!

@WarrenWeckesser
Member

@Qukaiyi Thanks for providing the data and the additional information. That made it easy to track down the problem.

In case anyone else is wondering about the discrepancy between R and scipy in the estimate of lambda: apparently the R function BoxCox.lambda only returns a single digit of precision. For example:

> w = c(123, 3445, 456, 4323, 445, 333, 231, 324, 1010, 876, 448, 501, 876, 777, 39)
> BoxCox.lambda(w, method='loglik')
[1] 0.05
> BoxCox.lambda(c(100, 150, 160, 165, 190, 201), method='loglik')
[1] 2 

In scipy:

In [273]: w = np.array([123, 3445, 456, 4323, 445, 333, 231, 324, 1010, 876, 448, 501, 876, 777, 39])

In [274]: boxcox_normmax(w, method='mle')
Out[274]: 0.045188742484661037

In [275]: boxcox_normmax([100, 150, 160, 165, 190, 201], method='mle')
Out[275]: 2.3893769498007482

@KarlQu1990
Author

@WarrenWeckesser Got it! SciPy has higher precision than R.

@omtinez

omtinez commented Oct 7, 2017

When should we expect a fix for this? What is the proposed solution?

@rgommers
Member

@omtinez only when someone gets around to fixing it; no one is working on it at the moment, so no timeline can be given.

@omtinez

omtinez commented Oct 29, 2017

Would you accept a pull request for this? My proposed solution would be to find the minimum and maximum lambdas that would not lead to loss of precision, then use the triplet form of brack which, if I'm reading the documentation correctly, should guarantee that the output is within range.
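
For illustration, here is how the triplet form can be exercised directly through optimize.brent on the Box-Cox log-likelihood (a sketch of the idea, not the actual PR; the bracket end points are hard-coded here, whereas the real fix would compute safe bounds from the data, and brent raises an error if the triplet does not actually bracket the optimum):

import numpy as np
from scipy import stats, optimize

x = np.genfromtxt('example_data.txt', delimiter=',')

def neg_llf(lmbda, data):
    # boxcox_normmax(method='mle') minimizes this same quantity internally.
    return -stats.boxcox_llf(lmbda, data)

# Triplet brack (xa, xb, xc): brent requires neg_llf(xb) < neg_llf(xa) and
# neg_llf(xb) < neg_llf(xc), which pins the minimum inside (xa, xc) and
# keeps the search away from the precision cliff at very negative lambda.
lmax = optimize.brent(neg_llf, brack=(-1.0, 0.0, 2.0), args=(x,))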

@rgommers
Member

Would you accept a pull request for this?

Yes, that would be great!

@bartolkaruza

I'm running into this issue and the proposed workaround does not seem to work for my case, unfortunately.

from scipy import stats

serie = [4640, 4460, 4500, 4980, 5120, 4940, 4540, 5000, 5160, 4800, 5300, 4800, 4660, 4480, 4900, 4860, 5860, 5680, 4700, 5260, 5240, 6160, 5040, 4800, 4680, 4700, 5300, 4780, 5880, 7980, 6800, 6240, 5340, 5300, 7020, 4680, 4680, 5080, 5860, 5440, 6320, 5440, 6900, 6780, 5460, 5100, 5720, 4760, 4960, 4760, 6860, 6900, 5900, 7440, 4820, 6000, 7360, 5820, 5920, 4720, 4820, 5440, 5400, 6940, 5440, 5440, 5980, 6040, 5640, 6240, 5840, 4760, 5120, 5020, 5400, 5140, 5440, 6260, 5060, 5860, 6500, 5300, 4520, 4360, 4600, 4560, 4680, 5100, 5000, 6420, 5940, 5860, 6080, 6060, 5320, 4540, 5160, 5080, 5220, 5200, 5080, 5700, 6060, 5600, 5580, 5880, 5240, 4720, 4560, 5060, 4300, 5120, 5340, 6140, 5600, 5960, 6140, 6380, 4920, 4460, 4700, 5000, 5040, 4880, 4840, 5860]

lmax = stats.boxcox_normmax(serie, brack=(-1.9, 2.1),  method='mle')
serie_bc = stats.boxcox(serie, lmax)

results in

[0.45081487 0.45081487 0.45081487 0.45081487 0.45081487 0.45081487
 ... (all 126 values are identical) ...
 0.45081487 0.45081487 0.45081487 0.45081487 0.45081487 0.45081487]

@bartolkaruza

Here is another case:

serie = [2980, 2930, 2740, 4400, 6580, 3600, 3120, 6210, 7590, 4920, 4650, 3270, 2880, 2760, 2590, 2650, 3930, 6700, 3020, 4130, 4800, 5170, 4820, 3250, 2980, 2770, 2560, 2570, 2820, 6130, 3460, 4140, 5330, 4750, 4710, 3410, 2760, 2850, 2730, 2650, 3080, 3370, 4940, 3680, 4530, 4210, 3930, 2960, 2840, 2570, 2570, 2530, 2570, 3160, 2890, 2840, 3810, 3800, 3040, 2660, 2640, 2560, 2540, 2580, 2550, 2550, 2630, 3060, 3150, 3490, 3070, 2770, 2660, 2590, 2550, 2520, 2530, 2620, 2630, 2960, 3360, 3080, 2790, 2640, 2540, 2580, 2560, 2520, 2540, 2670, 2540, 2710, 2980, 3000, 2870, 2670, 2580, 2560, 2530, 2580, 2540, 2560, 2630, 2740, 2940, 3160, 2970, 2780, 2540, 2570, 2530, 2520, 2530, 2670, 2570, 2730, 2950, 3140, 2780, 2660, 2620, 2560, 2510, 2530, 2510, 2560]

@Lokdal

Lokdal commented Sep 13, 2018

I have this problem too!

omtinez pushed a commit to omtinez/scipy that referenced this issue Sep 14, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform when all transformed values are equal
omtinez added a commit to omtinez/scipy that referenced this issue Sep 14, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform for valid corner cases
* Adds test for Box-Cox transform for invalid corner cases
@omtinez

omtinez commented Sep 15, 2018

I finally had some time to look at this. Both the MLE and the Pearson version of this function were broken for different (but similar) reasons. The PR fixes the main problems, but some inputs still result in numerical instability (e.g. [0.5]). This is due to the optimizer getting too greedy, but fixing that is beyond the scope of this PR.
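
The gist of the short-circuit is easiest to see from the caller's side (a sketch of the idea, not the PR diff itself):

import numpy as np
from scipy import stats

x = np.genfromtxt('example_data.txt', delimiter=',')
y, lmax = stats.boxcox(x)
# If the transform collapses to a constant vector, the optimizer fell
# off the precision cliff and the fitted lambda cannot be trusted.
if np.ptp(y) == 0:
    raise ValueError("Box-Cox transform collapsed to a constant; "
                     "lambda = %g is unreliable" % lmax)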

@rgommers
Member

This is due to the optimizer getting too greedy, but fixing that is beyond the scope of this PR.

Thanks @omtinez. I agree that that's better left for another PR. Yours should close this issue, and then a new one can be opened for remaining edge cases.

omtinez added a commit to omtinez/scipy that referenced this issue Sep 15, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform for valid corner cases
* Adds test for Box-Cox transform for invalid corner cases
@tmcken

tmcken commented Oct 31, 2018

Are the two lines of the annotated code below correct? Take the first one for example: according to the Yeo-Johnson definition, the else branch should cover lmbda != 0, so the if should test lmbda == 0. Since lmbda can take on both positive and negative values, lmbda < 1e-19 is a very poor substitute for that test, is it not?

Perhaps these two lines should instead read as follows?
abs(lmbda) < 1e-19
abs(lmbda - 2) < 1e-19

Am I missing something?

def _yeo_johnson_transform(self, x, lmbda):
    out = np.zeros(shape=x.shape, dtype=x.dtype)
    pos = x >= 0  # binary mask
    # when x >= 0
    if lmbda < 1e-19:  # <------- is this correct???
        out[pos] = np.log(x[pos] + 1)
    else:  # lmbda != 0
        out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda
    # when x < 0
    if lmbda < 2 - 1e-19:  # <------- is this correct???
        out[~pos] = -(np.power(-x[~pos] + 1, 2 - lmbda) - 1) / (2 - lmbda)
    else:  # lmbda == 2
        out[~pos] = -np.log(-x[~pos] + 1)
    return out
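
In other words, something like this (an untested sketch of the suggested fix, pulled out of its class for brevity):

import numpy as np

def yeo_johnson_transform_fixed(x, lmbda):
    # Sketch only: use a symmetric tolerance around the singular
    # points lmbda == 0 and lmbda == 2, as suggested above.
    out = np.zeros(shape=x.shape, dtype=x.dtype)
    pos = x >= 0  # binary mask
    # when x >= 0
    if abs(lmbda) < 1e-19:  # lmbda == 0 (within tolerance)
        out[pos] = np.log1p(x[pos])
    else:  # lmbda != 0
        out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda
    # when x < 0
    if abs(lmbda - 2) < 1e-19:  # lmbda == 2 (within tolerance)
        out[~pos] = -np.log1p(-x[~pos])
    else:  # lmbda != 2
        out[~pos] = -(np.power(-x[~pos] + 1, 2 - lmbda) - 1) / (2 - lmbda)
    return out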

omtinez added a commit to omtinez/scipy that referenced this issue Nov 6, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform for valid corner cases
* Adds test for Box-Cox transform for invalid corner cases
@omtinez

omtinez commented Nov 30, 2018

Can we reopen this issue? It has not been fixed; it is still waiting for #9271 to be merged.

For anyone who wants a workaround, you can clone my fork, since the PR with the fix is not getting any traction: https://github.com/omtinez/scipy

@KarlQu1990 reopened this Nov 30, 2018
@77QingLiu

I have the same problem too.

@77QingLiu

Is there a temporary solution for this issue? Thanks.

omtinez added a commit to omtinez/scipy that referenced this issue Dec 10, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform for valid corner cases
* Adds test for Box-Cox transform for invalid corner cases
@WarrenWeckesser
Member

For the record, these questions on Stack Overflow are likely the same issue:

WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Apr 17, 2019
This change reformulates the expression for the log-likelihood
function computed by boxcox_llf in a way that is mathematically
equivalent to the old version but avoids the loss of precision
that can occur in the subtraction y**λ - 1.

For conciseness, let T(y; λ) be the Box-Cox transformation

    T(y; λ) = { (y**λ - 1)/λ  if λ ≠ 0
              { ln(y)         if λ = 0

As explained in a comment in scipygh-6873, a problem arises if y
is sufficiently large and λ is sufficiently negative (e.g.
5000**-5 is 3.2e-19).  When y**λ approaches the floating point
epsilon, the subtraction y**λ - 1 suffers catastrophic loss of
precision.  When this occurs in the log-likelihood function,
it results in the optimizer returning garbage.

The log-likelihood function (for a vector Y) is

    L(Y; λ) = -n*ln(var(T(Y; λ)))/2 + (λ - 1)*sum(ln(Y))

where n is the length of Y.  The Box-Cox transformation T only
appears as the argument to var; we only use the transform to
compute the variance of the transformed data.  The variance is
invariant with respect to a constant shift, so, assuming λ ≠ 0,

    var(T(y; λ)) = var(Y**λ/λ - 1/λ) = var(Y**λ/λ)

That is, we can compute var(T(y; λ)) without the subtraction
in the Box-Cox transformation.

Closes scipygh-6873.
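
A minimal numeric illustration of why the reformulation helps (not the committed code; the sample values are the first few entries of the reporter's data):

import numpy as np

x = np.array([15957., 112079., 1039553., 711775., 173111., 307382.])
lmbda = -5.5

# Naive transform: x**lmbda is far below machine epsilon, so the
# subtraction leaves an array of exactly -1.0 divided by lmbda.
naive = (x**lmbda - 1.0) / lmbda
print(np.var(naive))      # 0.0 -- all precision lost

# Shift-invariant form: dropping the constant -1/lmbda leaves the
# variance unchanged, and x**lmbda/lmbda is still representable.
shifted = x**lmbda / lmbda
print(np.var(shifted))    # tiny but nonzero -- information preserved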
@WarrenWeckesser
Member

Proposed solution: #10072
