'stats.boxcox' returns all the same values #6873

Closed
KarlQu1990 opened this issue Dec 20, 2016 · 21 comments · Fixed by #10072
Labels
defect (A clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats
@KarlQu1990

I applied stats.boxcox to my data and the returned values are all the same, which seems really unreasonable! It returns the same result in scipy 0.18.1 and scipy 0.17.1. The optimal lambda in my case is -5.501196436791543.
You can download my data (data.txt).
I also tried the boxcox function in R and it returned a reasonable result. So I think the boxcox_normmax function needs more testing and should be used carefully.

@ev-br
Member

ev-br commented Jan 8, 2017

@Qukaiyi could you please add a minimal reproducible example, something we can directly run? Also it would be helpful if you explain what precisely you think is wrong with scipy.stats.boxcox. If there's a discrepancy with R, please also add specific R commands you're using.
As written, this report is not actionable.

@ev-br added the scipy.stats and query (A question or suggestion that requires further information) labels Jan 8, 2017
@KarlQu1990
Copy link
Author

@ev-br I'm sorry. Here is my example in Python and R:
First, load my example data from here (example_data.txt):

In [1]: import numpy as np
   ...: data = np.genfromtxt('example_data.txt', delimiter=',')
   ...: print(data)
   ...: print('count:', len(data))
Out[1]: [   15957.   112079.  1039553. ...,  3166209.   175913.   159211.]
count: 414

Now, I use scipy.stats.boxcox and scipy.stats.boxcox_normmax:

In [2]: import scipy.stats
   ...: scipy.stats.boxcox(data)
Out[2]: (array([ 0.18177864,  0.18177864,  0.18177864, ...,  0.18177864,
         0.18177864,  0.18177864]), -5.501196436791543)

In [3]: scipy.stats.boxcox_normmax(data, method='all')
Out[3]: array([-8.12354424, -5.50119644])

We see that boxcox_normmax chooses a lambda far outside the (-2, 2) interval, and the transformed values are all the same.

And the following is the result in R:

library(forecast)
data = scan('example_data.txt',sep=',')

lmba1 = BoxCox.lambda(data,method="guerrero") ;lmba1
[1] -0.178229
lmba2 = BoxCox.lambda(data,method="loglik") ;lmba2
[1] -0.05

BoxCox(data,lambda=lmba1 )
  [1] 4.610912 4.904361 5.135813 5.102642 4.957027 5.020616 4.963566 4.804480
  [9] 5.108647 4.977802 4.947819 5.064398 5.146567 4.871337 5.185068 5.072204
 [17] 5.026774 5.183878 4.762700 5.057233 4.925822 5.004983 4.944475 4.748675
 [25] 5.123928 4.809082 4.930418 5.030869 5.061115 4.828695 5.159066 5.156983
......

BoxCox(data,lambda=lmba2 )
  [1]  7.672289  8.817117  9.995678  9.804398  9.057570  9.367238  9.088384
  [8]  8.394422  9.838348  9.156260  9.014549  9.594710 10.059748  8.672853
 [15] 10.298290  9.636629  9.398482 10.290689  8.228768  9.556607  8.913490
 [22]  9.288967  8.999032  8.174477  9.926068  8.413047  8.934409  9.419391
......

The following is a short description of R's BoxCox.lambda:

Description
If method=="guerrero", Guerrero's (1993) method is used,
where lambda minimizes the coefficient of variation for subseries of x.

If method=="loglik", the value of lambda is chosen to maximize the profile 
log likelihood of a linear model fitted to x. For non-seasonal data, a linear
time trend is fitted while for seasonal data, a linear time trend with seasonal
dummy variables is used.

Usage
BoxCox.lambda(x, method=c("guerrero","loglik"), lower=-1, upper=2)

(Note that BoxCox.lambda constrains lambda to the [lower, upper] = [-1, 2] interval by default, which keeps R away from the extreme negative lambdas in the first place.)

Last, I run Python again with the lambdas from R and get the following results:

In [4]: scipy.stats.boxcox(data, -0.178229)
Out[4]: 
[ 4.61091154  4.90436036  5.13581228 ...,  5.22132242  4.95889455   4.94720076]

In [5]: scipy.stats.boxcox(data, -0.05)
Out[5]: 
[  7.67228932   8.81711694   9.99567795 ...,  10.53756266   9.06635173   9.01167889]

We can see that the results of R and Python are almost the same when using the same lambda.
I hope this helps you find the problem.
Thanks for your time!

@WarrenWeckesser added the defect (A clear bug or issue that prevents SciPy from being installed or used as expected) label Jan 9, 2017
@WarrenWeckesser
Member

Apparently the code that computes the MLE of lambda in boxcox_normmax allows an intermediate value of lambda to become negative enough (and the values in the input array are big enough) that the terms in x**lambda all become less than the machine epsilon, and then x**lambda - 1 is an array of all -1:

In [139]: x[:6]
Out[139]: array([   15957.,   112079.,  1039553.,   711775.,   173111.,   307382.])

In [140]: x[:6]**-5
Out[140]: 
array([  9.66593287e-22,   5.65429890e-26,   8.23695740e-31,
         5.47376754e-30,   6.43245378e-27,   3.64424729e-28])

In [141]: x[:6]**-5 - 1
Out[141]: array([-1., -1., -1., -1., -1., -1.])

When this happens, any further calculation performed by boxcox_normmax is basically garbage--all precision has been lost.

Note that you get several warnings when you compute boxcox_normmax(x, method='mle'):

In [142]: boxcox_normmax(x, method='mle')
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/stats/morestats.py:901: RuntimeWarning: divide by zero encountered in log
  llf -= N / 2.0 * np.log(np.sum((y - y_mean)**2. / N, axis=0))
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/optimize/optimize.py:1851: RuntimeWarning: invalid value encountered in double_scalars
  p = (x - v) * tmp2 - (x - w) * tmp1
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/optimize/optimize.py:1852: RuntimeWarning: invalid value encountered in double_scalars
  tmp2 = 2.0 * (tmp2 - tmp1)
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/optimize/optimize.py:1850: RuntimeWarning: invalid value encountered in double_scalars
  tmp2 = (x - v) * (fx - fw)
/Users/warren/miniconda3scipy/lib/python3.5/site-packages/scipy-0.19.0.dev0+677074c-py3.5-macosx-10.6-x86_64.egg/scipy/optimize/optimize.py:1849: RuntimeWarning: invalid value encountered in double_scalars
  tmp1 = (x - w) * (fx - fv)
Out[142]: -5.501196436791543

A somewhat surprising work-around is to tweak the brack argument of boxcox_normmax. I found that changing either end point by just a small amount was sufficient:

In [143]: boxcox_normmax(x, brack=(-1.9, 2.0),  method='mle')
Out[143]: -0.051654325496244997

In [144]: boxcox_normmax(x, brack=(-2.0, 1.99),  method='mle')
Out[144]: -0.051654321046728034

So instead of using boxcox(x), you can use:

In [162]: lmax = boxcox_normmax(x, brack=(-1.9, 2.0),  method='mle')

In [163]: lmax
Out[163]: -0.051654325496244997

In [164]: y = boxcox(x, lmax)

In [165]: y[:4]
Out[165]: array([ 7.61609432,  8.74095402,  9.89497438,  9.70797019])

@WarrenWeckesser removed the query (A question or suggestion that requires further information) label Jan 9, 2017
@KarlQu1990
Author

@WarrenWeckesser Thank you for your quick reply!
Your suggestion of tweaking the brack argument is a fine workaround and works for me now.
Thanks a lot!

@WarrenWeckesser
Member

@Qukaiyi Thanks for providing the data and the additional information. That made it easy to track down the problem.

In case anyone else is wondering about the discrepancy between R and scipy in the estimate of lambda: apparently the R function BoxCox.lambda only returns a single digit of precision. For example:

> w = c(123, 3445, 456, 4323, 445, 333, 231, 324, 1010, 876, 448, 501, 876, 777, 39)
> BoxCox.lambda(w, method='loglik')
[1] 0.05
> BoxCox.lambda(c(100, 150, 160, 165, 190, 201), method='loglik')
[1] 2 

In scipy:

In [273]: w = np.array([123, 3445, 456, 4323, 445, 333, 231, 324, 1010, 876, 448, 501, 876, 777, 39])

In [274]: boxcox_normmax(w, method='mle')
Out[274]: 0.045188742484661037

In [275]: boxcox_normmax([100, 150, 160, 165, 190, 201], method='mle')
Out[275]: 2.3893769498007482

@KarlQu1990
Author

@WarrenWeckesser Got it! SciPy has higher precision than R.

@omtinez

omtinez commented Oct 7, 2017

When should we expect a fix for this? What is the proposed solution?

@rgommers
Member

@omtinez only when someone gets around to fixing it; no one is working on it at the moment, so no timeline can be given.

@omtinez

omtinez commented Oct 29, 2017

Would you accept a pull request for this? My proposed solution would be to find the minimum and maximum lambdas that would not lead to loss of precision, then use the triplet form of brack which, if I'm reading the documentation correctly, should guarantee that the output is within range.
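
For illustration, here is how the triplet form can be exercised directly through optimize.brent on the Box-Cox log-likelihood (a sketch of the idea, not the actual PR; the bracket end points are hard-coded here, whereas the real fix would compute safe bounds from the data, and brent raises an error if the triplet does not actually bracket the optimum):

import numpy as np
from scipy import stats, optimize

x = np.genfromtxt('example_data.txt', delimiter=',')

def neg_llf(lmbda, data):
    # boxcox_normmax(method='mle') minimizes this same quantity internally.
    return -stats.boxcox_llf(lmbda, data)

# Triplet brack (xa, xb, xc): brent requires neg_llf(xb) < neg_llf(xa) and
# neg_llf(xb) < neg_llf(xc), which pins the minimum inside (xa, xc) and
# keeps the search away from the precision cliff at very negative lambda.
lmax = optimize.brent(neg_llf, brack=(-1.0, 0.0, 2.0), args=(x,))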

@rgommers
Member

Would you accept a pull request for this?

Yes, that would be great!

@bartolkaruza

I'm running into this issue and the proposed workaround does not seem to work for my case, unfortunately.

from scipy import stats

serie = [4640, 4460, 4500, 4980, 5120, 4940, 4540, 5000, 5160, 4800, 5300, 4800, 4660, 4480, 4900, 4860, 5860, 5680, 4700, 5260, 5240, 6160, 5040, 4800, 4680, 4700, 5300, 4780, 5880, 7980, 6800, 6240, 5340, 5300, 7020, 4680, 4680, 5080, 5860, 5440, 6320, 5440, 6900, 6780, 5460, 5100, 5720, 4760, 4960, 4760, 6860, 6900, 5900, 7440, 4820, 6000, 7360, 5820, 5920, 4720, 4820, 5440, 5400, 6940, 5440, 5440, 5980, 6040, 5640, 6240, 5840, 4760, 5120, 5020, 5400, 5140, 5440, 6260, 5060, 5860, 6500, 5300, 4520, 4360, 4600, 4560, 4680, 5100, 5000, 6420, 5940, 5860, 6080, 6060, 5320, 4540, 5160, 5080, 5220, 5200, 5080, 5700, 6060, 5600, 5580, 5880, 5240, 4720, 4560, 5060, 4300, 5120, 5340, 6140, 5600, 5960, 6140, 6380, 4920, 4460, 4700, 5000, 5040, 4880, 4840, 5860]

lmax = stats.boxcox_normmax(serie, brack=(-1.9, 2.1),  method='mle')
serie_bc = stats.boxcox(serie, lmax)

results in

[0.45081487 0.45081487 0.45081487 0.45081487 0.45081487 0.45081487
 ... (all 126 values are identical) ...
 0.45081487 0.45081487 0.45081487 0.45081487 0.45081487 0.45081487]

@bartolkaruza

Here is another case:

serie = [2980, 2930, 2740, 4400, 6580, 3600, 3120, 6210, 7590, 4920, 4650, 3270, 2880, 2760, 2590, 2650, 3930, 6700, 3020, 4130, 4800, 5170, 4820, 3250, 2980, 2770, 2560, 2570, 2820, 6130, 3460, 4140, 5330, 4750, 4710, 3410, 2760, 2850, 2730, 2650, 3080, 3370, 4940, 3680, 4530, 4210, 3930, 2960, 2840, 2570, 2570, 2530, 2570, 3160, 2890, 2840, 3810, 3800, 3040, 2660, 2640, 2560, 2540, 2580, 2550, 2550, 2630, 3060, 3150, 3490, 3070, 2770, 2660, 2590, 2550, 2520, 2530, 2620, 2630, 2960, 3360, 3080, 2790, 2640, 2540, 2580, 2560, 2520, 2540, 2670, 2540, 2710, 2980, 3000, 2870, 2670, 2580, 2560, 2530, 2580, 2540, 2560, 2630, 2740, 2940, 3160, 2970, 2780, 2540, 2570, 2530, 2520, 2530, 2670, 2570, 2730, 2950, 3140, 2780, 2660, 2620, 2560, 2510, 2530, 2510, 2560]

@Lokdal

Lokdal commented Sep 13, 2018

I have this problem too!

omtinez pushed a commit to omtinez/scipy that referenced this issue Sep 14, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform when all transformed values are equal
omtinez added a commit to omtinez/scipy that referenced this issue Sep 14, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform for valid corner cases
* Adds test for Box-Cox transform for invalid corner cases
@omtinez

omtinez commented Sep 15, 2018

I finally had some time to look at this. Both the MLE and the Pearson version of this function were broken for different (but similar) reasons. The PR fixes the main problems, but some inputs still result in numerical instability (e.g. [0.5]). This is due to the optimizer getting too greedy, but fixing that is beyond the scope of this PR.
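
The gist of the short-circuit is easiest to see from the caller's side (a sketch of the idea, not the PR diff itself):

import numpy as np
from scipy import stats

x = np.genfromtxt('example_data.txt', delimiter=',')
y, lmax = stats.boxcox(x)
# If the transform collapses to a constant vector, the optimizer fell
# off the precision cliff and the fitted lambda cannot be trusted.
if np.ptp(y) == 0:
    raise ValueError("Box-Cox transform collapsed to a constant; "
                     "lambda = %g is unreliable" % lmax)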

@rgommers
Member

This is due to the optimizer getting too greedy, but fixing that is beyond the scope of this PR.

Thanks @omtinez. I agree that that's better left for another PR. Yours should close this issue, and then a new one can be opened for remaining edge cases.

omtinez added a commit to omtinez/scipy that referenced this issue Sep 15, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform for valid corner cases
* Adds test for Box-Cox transform for invalid corner cases
@tmcken

tmcken commented Oct 31, 2018

Are the two lines of the annotated code below correct? Take the first one for example: according to the Yeo-Johnson definition, the else branch should cover lmbda != 0, so the if should test lmbda == 0. Since lmbda can take on both positive and negative values, lmbda < 1e-19 is a very poor substitute for that test, is it not?

Perhaps these two lines should instead read as follows?
abs(lmbda) < 1e-19
abs(lmbda - 2) < 1e-19

Am I missing something?

def _yeo_johnson_transform(self, x, lmbda):
    out = np.zeros(shape=x.shape, dtype=x.dtype)
    pos = x >= 0  # binary mask
    # when x >= 0
    if lmbda < 1e-19:  # <------- is this correct???
        out[pos] = np.log(x[pos] + 1)
    else:  # lmbda != 0
        out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda
    # when x < 0
    if lmbda < 2 - 1e-19:  # <------- is this correct???
        out[~pos] = -(np.power(-x[~pos] + 1, 2 - lmbda) - 1) / (2 - lmbda)
    else:  # lmbda == 2
        out[~pos] = -np.log(-x[~pos] + 1)
    return out
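
In other words, something like this (an untested sketch of the suggested fix, pulled out of its class for brevity):

import numpy as np

def yeo_johnson_transform_fixed(x, lmbda):
    # Sketch only: use a symmetric tolerance around the singular
    # points lmbda == 0 and lmbda == 2, as suggested above.
    out = np.zeros(shape=x.shape, dtype=x.dtype)
    pos = x >= 0  # binary mask
    # when x >= 0
    if abs(lmbda) < 1e-19:  # lmbda == 0 (within tolerance)
        out[pos] = np.log1p(x[pos])
    else:  # lmbda != 0
        out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda
    # when x < 0
    if abs(lmbda - 2) < 1e-19:  # lmbda == 2 (within tolerance)
        out[~pos] = -np.log1p(-x[~pos])
    else:  # lmbda != 2
        out[~pos] = -(np.power(-x[~pos] + 1, 2 - lmbda) - 1) / (2 - lmbda)
    return out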

omtinez added a commit to omtinez/scipy that referenced this issue Nov 6, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform for valid corner cases
* Adds test for Box-Cox transform for invalid corner cases
@omtinez

omtinez commented Nov 30, 2018

Can we reopen this issue? It has not been fixed; it is still waiting for #9271 to be merged.

For anyone who wants a workaround, you can clone my fork, since the PR with the fix is not getting any traction: https://github.com/omtinez/scipy

@KarlQu1990 reopened this Nov 30, 2018
@77QingLiu

I have the same problem too.

@77QingLiu

Is there a temporary solution for this issue? Thanks.

omtinez added a commit to omtinez/scipy that referenced this issue Dec 10, 2018
* Fixes scipy#6873
* Short-circuits MLE and Pearson in certain edge cases
* Adds test for Box-Cox transform for valid corner cases
* Adds test for Box-Cox transform for invalid corner cases
@WarrenWeckesser
Member

For the record, these questions on Stack Overflow are likely the same issue:

WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Apr 17, 2019
This change reformulates the expression for the log-likelihood
function computed by boxcox_llf in a way that is mathematically
equivalent to the old version but avoids the loss of precision
that can occur in the subtraction y**λ - 1.

For conciseness, let T(y; λ) be the Box-Cox transformation

    T(y; λ) = { (y**λ - 1)/λ  if λ ≠ 0
              { ln(y)         if λ = 0

As explained in a comment in scipygh-6873, a problem arises if y
is sufficiently large and λ is sufficiently negative (e.g.
5000**-5 is 3.2e-19).  When y**λ approaches the floating point
epsilon, the subtraction y**λ - 1 suffers catastrophic loss of
precision.  When this occurs in the log-likelihood function,
it results in the optimizer returning garbage.

The log-likelihood function (for a vector Y) is

    L(Y; λ) = -n*ln(var(T(Y; λ)))/2 + (λ - 1)*sum(ln(Y))

where n is the length of Y.  The Box-Cox transformation T only
appears as the argument to var; we only use the transform to
compute the variance of the transformed data.  The variance is
invariant with respect to a constant shift, so, assuming λ ≠ 0,

    var(T(y; λ)) = var(Y**λ/λ - 1/λ) = var(Y**λ/λ)

That is, we can compute var(T(y; λ)) without the subtraction
in the Box-Cox transformation.

Closes scipygh-6873.
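
A minimal numeric illustration of why the reformulation helps (not the committed code; the sample values are the first few entries of the reporter's data):

import numpy as np

x = np.array([15957., 112079., 1039553., 711775., 173111., 307382.])
lmbda = -5.5

# Naive transform: x**lmbda is far below machine epsilon, so the
# subtraction leaves an array of exactly -1.0 divided by lmbda.
naive = (x**lmbda - 1.0) / lmbda
print(np.var(naive))      # 0.0 -- all precision lost

# Shift-invariant form: dropping the constant -1/lmbda leaves the
# variance unchanged, and x**lmbda/lmbda is still representable.
shifted = x**lmbda / lmbda
print(np.var(shifted))    # tiny but nonzero -- information preserved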
@WarrenWeckesser
Member

Proposed solution: #10072
