
BUG: fix overflow in stats.yeojohnson #18852

Merged

10 commits merged into scipy:main on Aug 4, 2023

Conversation

@lsorber (Contributor) commented Jul 10, 2023

Reference issue

Closes #18389.

What does this implement/fix?

This PR fixes two sources of overflow in stats.yeojohnson:

  1. The first source of overflow is caused by certain invocations of np.power with unbalanced bases and exponents.
  2. The second source of overflow occurs when the optimal transform is beyond what floating point can represent.
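As a hedged illustration of the second failure mode (the data values below come from gh-18389 and are quoted later in this thread; the exact warnings emitted depend on the SciPy version):

import numpy as np
from scipy import stats

# The optimal lambda for this data is large (around 99), so naively transforming
# the data overflows float64 while searching for the optimum.
x = np.array([2003., 1950., 1997., 2000., 2009., 2009., 1980., 1999., 2007., 1991.])
lmbda = stats.yeojohnson_normmax(x)  # emitted overflow RuntimeWarnings before this fix
print(lmbda)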

@melissawm (Contributor):

Docs build failure is unrelated here.

@mdhaber (Contributor) commented Jul 10, 2023

Item 1 is simple and uncontroversial; I could merge it today if it were a separate PR.

For the second part, it looks like you used the analytical bounds approach (mentioned in #18389 (comment)) instead of the "regularization term" (what you originally proposed in that issue), right?

In that case, you mentioned that this problem is convex, so do I understand that this doesn't change behavior for already working code because any local optimum is a global optimum? Is it strictly convex such that the solution is unique? Can you provide some background about how the bounds are derived and add some comments in the code about the magic numbers (e.g. 20)?

Re: the regularization term - it sounds like that is a distinct enhancement that one could make to find a suboptimal - but nearly optimal - solution that avoids overflow? Or is it no longer necessary if you have these analytical bounds?

Thanks!

@lsorber (Contributor, Author) commented Jul 11, 2023

Item 1 is simple and uncontroversial; I could merge it today if it were a separate PR.

👍

For the second part, it looks like you used the analytical bounds approach (mentioned in #18389 (comment)) instead of the "regularization term" (what you originally proposed in that issue), right?

That's right. With the regularisation approach, the trouble is that it would also change users' existing solutions. The advantage of the bounded optimisation is that it doesn't change users' solutions (given that the Yeo-Johnson negative log likelihood is strictly convex).

In that case, you mentioned that this problem is convex, so do I understand that this doesn't change behavior for already working code because any local optimum is a global optimum? Is it strictly convex such that the solution is unique?

It is indeed strictly convex [1] and therefore existing solutions should not change when using a different optimiser.

Can you provide some background about how the bounds are derived and add some comments in the code about the magic numbers (e.g. 20)?

Sure, here's a commented version of the implementation:

# Convert input data to a NumPy array, without copying the data if possible.
x = np.asarray(x)

# Take the data's dtype if already in floating point, else assume conversion to float64 by the optimiser.
dtype = x.dtype if np.issubdtype(x.dtype, np.floating) else np.float64

# Given that floating point is bounded by (finfo.tiny, finfo.max), we don't want to go to the edges all the way.
# Instead, we'll use (finfo.tiny - finfo.eps, finfo.max + finfo.eps) so that there's a small buffer on either side.
b = np.log(np.finfo(dtype).eps)

# We're optimising lambda for the given data in `x`, but later on the same Yeo-Johnson transform may be used on new data.
# We don't know what values we'll observe then, but we could assume that those values will be less than Q3 + 10 * 1.5 * IQR, where Q3 + 1.5 * IQR is a standard way to determine an outlier threshold on `x`, and 10 is a safety factor.
# We can approximate this formula as follows: Q3 < max(x) and IQR < max(x), so we have max(x) + 10 * 1.5 * max(x) = 16 * max(x).
# Then we round this to get an even 20 * max(x).
# We take the absolute value to deal with negative data, and use `np.nanmax` to avoid early failure.
max_x = 20 * np.nanmax(np.abs(x))

# The Yeo-Johnson transform `((x + 1) ** lam - 1) / lam` can be approximated by `(x + 1) ** lam`.
# The lambda resulting in the smallest (largest) output C from the approximate transform is then `log(C) / log(max(x) + 1)`.
# We divide this by a factor of 2 because the Yeo-Johnson log likelihood function calculates the variance of the transformed data, which could also overflow.
# The factor of 2 could be avoided by improving the log likelihood implementation, but users would then still be unable to compute the variance of the transformed data without overflow.
# The adjustment needed for negative data is minor and isn't worth a special case, so we use the same bounds as we do for positive data.
lb = (np.log(np.finfo(dtype).tiny) - b) / 2 / np.log(max_x + 1)
ub = (np.log(np.finfo(dtype).max) + b) / 2 / np.log(max_x + 1)

# We set the `optimize.fminbound` tolerance to match that of `optimize.brent` so that we get the same result if the optimal lambda lies within the bounds.
tol_brent = 1.48e-08

return optimize.fminbound(_neg_llf, lb, ub, args=(x,), xtol=tol_brent)

[1] https://arxiv.org/pdf/2210.01639.pdf
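For concreteness, here is a quick numeric evaluation of those bound formulas for float64 (the assumed max |x| of 100 is only an example, not a value from the PR):

import numpy as np

fi = np.finfo(np.float64)
b = np.log(fi.eps)
max_x = 20 * 100.0                                    # suppose the largest |x| observed is 100
lb = (np.log(fi.tiny) - b) / 2 / np.log(max_x + 1)    # ≈ -44.2
ub = (np.log(fi.max) + b) / 2 / np.log(max_x + 1)     # ≈ 44.3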

@mdhaber (Contributor) commented Jul 11, 2023

Given that floating point is bounded by (finfo.tiny, finfo.max), we don't want to go to the edges all the way. Instead, we'll use (finfo.tiny - finfo.eps, finfo.max + finfo.eps) so that there's a small buffer on either side.

  1. If you don't want to go to the edges (only approach them from within, I'm assuming), wouldn't it be the other way around - add to tiny and subtract from max?
    Also, finfo.tiny + finfo.eps == finfo.eps, and finfo.max - finfo.eps == finfo.max. Please consider np.nextafter if you only need to go one floating point number over.
    Then again, since you're working with logs, and $\log(a) + \log(b) = \log(ab)$, I don't think the comment and code match as they are. Which is intended?

Q3 + 10 * 1.5 * IQR

  2. This explains one magic number with two new ones : ) What are 10 and 1.5? (Perhaps there is something like Chebyshev's inequality but for quartiles that I'm not familiar with?)

The lambda resulting in the smallest (largest) output C from the approximate transform

  3. Ok, so the point is to find bounds on lambda such that 1) the transformed data does not go out of bounds and 2) new data is unlikely to go out of bounds (according to some argument about population based on the original data).
    So just to check my understanding, it sounds like part 2 of the sentence could affect existing code because your bounds are a little more conservative than they need to be just to transform the existing data? (That's probably ok as long as it's not too conservative.)

  4. Also, this is based on an approximation of the transform rather than the transform itself. Is the approximation conservative in the sense that if the approximately transformed data stays in bounds, the exactly transformed data is guaranteed to stay within bounds?

@mdhaber added the "maintenance" label (Items related to regular maintenance tasks) on Jul 12, 2023
@mdhaber (Contributor) commented Jul 13, 2023

@lsorber there are several maintainers at the SciPy conference right now, so I think we can merge this shortly after we hear back from you about the questions above. Thanks!

@lsorber (Contributor, Author) commented Jul 14, 2023

Given that floating point is bounded by (finfo.tiny, finfo.max), we don't want to go to the edges all the way. Instead, we'll use (finfo.tiny - finfo.eps, finfo.max + finfo.eps) so that there's a small buffer on either side.

  1. If you don't want to go to the edges (only approach them from within, I'm assuming), wouldn't it be the other way around - add to tiny and subtract from max?
    Also, finfo.tiny + finfo.eps == finfo.eps, and finfo.max - finfo.eps == finfo.max. Please consider np.nextafter if you only need to go one floating point number over.
    Then again, since you're working with logs, and log(a) + log(b) = log(ab), I don't think the comment and code match as they are. Which is intended?

Apologies for not being clear. Those formulas were intended to be in log space, so the float exponent range would be (log10(tiny) - log10(eps), log10(max) + log10(eps)). For double precision, this would evaluate to the base-10 exponent range (-308 - (-16), 308 + (-16)) = (-292, 292).
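A quick numeric check of that exponent range (float64):

import numpy as np

fi = np.finfo(np.float64)
print(np.log10(fi.tiny) - np.log10(fi.eps))  # ≈ -292
print(np.log10(fi.max) + np.log10(fi.eps))   # ≈ 292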

Q3 + 10 * 1.5 * IQR

  2. This explains one magic number with two new ones : ) What are 10 and 1.5? (Perhaps there is something like Chebyshev's inequality but for quartiles that I'm not familiar with?)

The Q3 + 1.5 * (Q3 - Q1) threshold is a classic way to define outliers for any dataset [1] [2], and is for example used to draw the whiskers of Box plots [3]. The resulting Yeo-Johnson transform should however also be applicable to a dataset's outliers without over- or underflow. Therefore, we need a threshold beyond which we almost certainly won't observe any values in the future. That's why as a first step, I increased Tukey's constant k by an order of magnitude from 1.5 to 15. Then, to avoid actually computing Q3 and the IQR, I replaced those values by the maximum observed value in the data. And finally, I rounded 16 to 20.

The effect of doing this is that if you fit a Yeo-Johnson transform on a dataset, you may still safely apply it to values that are up to 20 times larger than the largest observed value in the dataset. If a dataset's largest value is 100, that means the optimal transform is computable without overflow for values up to 2000. At the end of the day, the multiplier 20 is indeed arbitrary. You could argue that we should increase it to, say, 100, which I think is defensible, but it comes at the cost of reducing the domain in which to find an optimal lambda.

The lambda resulting in the smallest (largest) output C from the approximate transform

  3. Ok, so the point is to find bounds on lambda such that 1) the transformed data does not go out of bounds and 2) new data is unlikely to go out of bounds (according to some argument about population based on the original data).
    So just to check my understanding, it sounds like part 2 of the sentence could affect existing code because your bounds are a little more conservative than they need to be just to transform the existing data? (That's probably ok as long as it's not too conservative.)

That's correct. For datasets that are prone to over- or underflow, I think most users would rather trade off a small bit of likelihood to reduce the chance of over- or underflow to near zero. The 20 multiplier is a trade-off between those two goals: achievable likelihood and the risk of over- or underflow.

  4. Also, this is based on an approximation of the transform rather than the transform itself. Is the approximation conservative in the sense that if the approximately transformed data stays in bounds, the exactly transformed data is guaranteed to stay within bounds?

The approximation is as conservative as it can be in that we give lambda access to the full range of floating point values, modulo the space we reserve for three factors:

  1. We reserve log10(eps) on either side of the exponent range so that we don't inadvertently create over- or underflow by allowing transformations all the way up to the edge of the range.
  2. We reserve some space in the exponent range to allow safe transformation of new data that is outside of the range of observed values.
  3. We reserve about half of what remains to allow safe computation of the variance of the transformed data.

That said, there is an additional effect of dividing by lambda that is not accounted for. It would take some further analysis to make sure the proposed bounds cover every possible source of over- and underflow. Without writing up a full analysis, I'm pretty confident that overflow is avoided in all cases, but that underflow can still happen in some extreme cases (see below).

I added an additional commit with some improvements:

  1. Implementation:
    a. I upgraded the bounds calculation itself to use np.log1p as well.
    b. The bounds for negative data are now explicitly handled as a special case.
  2. Tests:
    a. I added overflow tests for extremely large positive data. Overflow does not occur even for data as large as 1e300.
    b. I added underflow tests for extremely small positive data. The current bounds prevent underflow for data up to 1e-150. If the data is even smaller than that, underflow may occur as a result of the division by lambda that is ignored by the bounds approximation.
    c. I added over- and underflow tests for negative data.

[1] https://en.wikipedia.org/wiki/Interquartile_range#Outliers
[2] https://en.wikipedia.org/wiki/Outlier#Tukey's_fences
[3] https://en.wikipedia.org/wiki/Box_plot#Whiskers
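A hedged reproduction of the checks described under "Tests" above (the magnitudes 1e300 and 1e-150 are taken from the text; the actual tests added in the PR may be structured differently):

import numpy as np
from scipy import stats

# Extremely large positive data: no overflow expected for magnitudes up to about 1e300.
stats.yeojohnson_normmax(np.array([1e300, 2e300, 3e300]))
# Extremely small positive data: no underflow expected for magnitudes down to about 1e-150.
stats.yeojohnson_normmax(np.array([1e-150, 2e-150, 3e-150]))
# Negative data: handled via the reflected bounds.
stats.yeojohnson_normmax(np.array([-1e300, -2e300, -3e300]))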

@mdhaber (Contributor) left a review comment:

Thanks for the replies @lsorber. I think this is close. Much of what I'd suggest is just to make the intent of the code more obvious to future readers by refactoring slightly and including code comments.

BTW I thought I'd mention that there does seem to be an inverse to the exact transformation (Corrected per comment below.):
https://www.wolframalpha.com/input?i=solve+%28%28x%2B1%29%5El-1%29%2Fl+%3D+c+for+l

import numpy as np
from scipy import special
x = 25
c = 1000
# Branch k=-1 found by trial and error. Is it always the same?
l = np.real(-special.lambertw(- ((x+1)**(-1/c)*special.log1p(x))/c, k=-1)/special.log1p(x)) - 1/c  # 2.3873997498774027
((x + 1)**l-1)/l  # 1000.0

@lorentzenchr do these comments look appropriate to you? Do you have any others?

Inline review comments on scipy/stats/_morestats.py (4) and scipy/stats/tests/test_morestats.py (3): outdated, resolved.
@lsorber (Contributor, Author) commented Jul 25, 2023

BTW I thought I'd mention that there does seem to be an inverse to the exact transformation:
https://www.wolframalpha.com/input?i=solve+%28x%2B1%29%5E%28l-1%29%2Fl+%3D+c+for+l

There's a small error in that formula, I believe it should be: https://www.wolframalpha.com/input?i=solve+%28%28x%2B1%29%5El-1%29%2Fl+%3D+c+for+l

I don't expect the exact inverse to yield a significantly larger search space for lambda though, so I'm not sure it's worth the additional complexity in the implementation.

@lorentzenchr (Contributor) left a review comment:

Some comments, likely mostly reflecting my own lack of understanding.

scipy/stats/_morestats.py (resolved):

     return loglike


-def yeojohnson_normmax(x, brack=(-2, 2)):
+def yeojohnson_normmax(x, brack=None):
Review comment (Contributor):

Just as a notification: this is a public API function and a slight change of the default parameters.

Reply (Contributor):

We discussed in #18852 (comment) and the reply that although this looks like a change, it should not materially change the solution found because the problem is strictly convex.

Reply from @lsorber (Contributor, Author):

In addition to @mdhaber's comment: if a brack is supplied, the implementation uses optimize.brent as before.

Further inline review comments on scipy/stats/_morestats.py and scipy/stats/tests/test_morestats.py: resolved.
Comment on lines 1698 to 1699:

if np.any(x < 0):
    lb, ub = 2 - ub, 2 - lb
Review comment from @lorentzenchr (Contributor):

Did you mean np.all(x < 0)?
Or should it read

lb, ub = min(lb, 2 - ub), max(ub, 2 - lb)

?

Reply from @lsorber (Contributor, Author):

I used np.any because it is slightly more defensive in case there are non-finite values such as NaN in the data. It could be np.all if you prefer that. The min/max suggestion you make is compact, but I'm not sure it's equivalent.

Reply (Contributor):

I'm not sure if those reflection formulas apply if only a single value in x is < 0.

Reply from @lsorber (Contributor, Author):

I improved the bounds so that the cases {all positive, all negative, mixed positive & negative} data are handled separately:

# Compute lb, ub for all positive data
# Reflect the bounds if the data is all negative
if np.all(x < 0):
    lb, ub = 2 - ub, 2 - lb
# Take the most conservative of the positive and negative bounds if the data has mixed signs
elif np.any(x < 0):
    lb, ub = max(2 - ub, lb), min(2 - lb, ub)

lsorber and others added 2 commits July 27, 2023 13:55
@lsorber (Contributor, Author) commented Jul 27, 2023

Added a test for the case where the input data is all-zero.

@mdhaber (Contributor) commented Jul 28, 2023

I happened to notice that the case of all integers gives an error in main (which could be fixed by ensuring that the dtype is at least float), so I tested with this PR.

from scipy import stats
stats.yeojohnson_normmax([1, 2, 3.])  # 0.5907038225450774
stats.yeojohnson_normmax([1, 2, 3])  # 69.11083988572132

Good that there's no error; surprising that it affects the result!

Besides that, it looks like there is still #18852 (comment), which is something I was wondering about and need to look at more closely.

@lsorber (Contributor, Author) commented Jul 28, 2023

I happened to notice that the case of all integers gives an error in main (which could be fixed by ensuring that the dtype is at least float), so I tested with this PR.

from scipy import stats
stats.yeojohnson_normmax([1, 2, 3.])  # 0.5907038225450774
stats.yeojohnson_normmax([1, 2, 3])  # 69.11083988572132

Good that there's no error; surprising that it affects the result!

Besides that, it looks like there is still #18852 (comment), which is something I was wondering about and need to look at more closely.

I added a test for integer data to prevent regressions for integer data in the future.

EDIT: Improved this test to actually cover the discrepancy you noticed. This PR now addresses more than just overflow/underflow, but also corrects the handling of all-zero input, non-finite input, and integer input.
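A minimal sketch of the kind of dtype promotion described above (the helper name is hypothetical; the PR's actual handling lives inside yeojohnson_normmax):

import numpy as np

def _as_float(x):
    # Promote integer (and other non-floating) input to float64 so that
    # integer and float inputs take the same code path and yield the same lambda.
    x = np.asarray(x)
    return x if np.issubdtype(x.dtype, np.floating) else x.astype(np.float64)

_as_float([1, 2, 3]).dtype   # float64
_as_float([1, 2, 3.]).dtype  # float64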

# Compute the bounds by approximating the inverse of the Yeo-Johnson
# transform on the smallest and largest floating point exponents, given
# the largest data we expect to observe. See [1] for further details.
# [1] https://github.com/scipy/scipy/pull/18852
Review comment (Contributor):

Could you link to the specific comment in the PR?

Reply from @lsorber (Contributor, Author):

Done

@mdhaber (Contributor) left a review comment:

@lsorber I'm willing to merge this as-is because overall I think it's a big improvement, but I think it's worth a closer look at the bounds when there is both positive and negative data. Shall I merge this and open an issue about it? That way, at least this much is done and we can just tweak the bounds. Or would you prefer to keep working here?

There is definitely more we can do, though.

  1. I think we need to somehow indicate when the returned value of lambda is at one of the bounds, in which case it is (almost certainly) not the truly optimal value of lambda. This is the case for the MRE in #18389 (comment) (BUG: Yeo-Johnson Power Transformer gives Numpy warning).
  2. It's worth reconsidering whether we need to adjust the bounds by log(epsilon). That might be a bit too conservative.
  3. The calculation of the likelihood only requires the log of the variance of the transformed data. I think we can calculate that from the log of the transformed data directly using the logsumexp trick (applied in at least two ways). It seems to me that the user should have the option to calculate the truly optimal value of lambda, even if the transformed data cannot be represented in float64.

Inline review comment on scipy/stats/tests/test_morestats.py: resolved.
Comment on lines +1704 to +1705:

elif np.any(x < 0):
    lb, ub = max(2 - ub, lb), min(2 - lb, ub)
@mdhaber (Contributor) commented Jul 30, 2023:

Consider the following:

import numpy as np
from scipy import stats
rng = np.random.default_rng(483258345)
x = rng.random(size=1000)
# x[0] = -1
x[-1] = 1e200
stats.yeojohnson_normmax(x)

With the line commented out, this solves successfully. With the introduction of a negative element (no matter how slightly negative), we get ValueError: The lower bound exceeds the upper bound..

Clearly, this example was constructed to exploit the issue, but the main idea is that

lb, ub = max(2 - ub, lb), min(2 - lb, ub)

is conservative in the sense that it avoids overflow, but it can be overly-restrictive when there is some positive data and some negative data. Before this line, lb and ub tend to be of roughly equal magnitude and opposite sign, so the effect of this line is roughly that it increases the lower bound by approximately 2, leaving the upper bound unchanged.

@lorentzenchr 's suggestion was

lb, ub = min(2 - ub, lb), max(2 - lb, ub)

But I imagine that could be non-conservative (w.r.t. overflow) in some cases because ultimately it has the effect of increasing the upper bound by approximately 2, leaving the lower bound unchanged.

When there is data of large magnitude, the bounds can be quite tight, so adding 2 can make a big difference.

Maybe the most correct thing is to look at the magnitudes of the most negative and most positive elements separately?
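A rough sketch of that per-sign idea (the helper names and the reuse of the approximate log-based bound from earlier in this thread are assumptions, not code from the PR):

import numpy as np

def _approx_bounds(max_abs, dtype=np.float64):
    # Approximate lambda bounds for all-positive data with largest magnitude max_abs,
    # following the log-based derivation discussed above.
    fi = np.finfo(dtype)
    b = np.log(fi.eps)
    denom = 2 * np.log(20 * max_abs + 1)
    return (np.log(fi.tiny) - b) / denom, (np.log(fi.max) + b) / denom

def _bounds_per_sign(x):
    # Intersect bounds derived from the most positive and the most negative observations separately.
    x = np.asarray(x, dtype=np.float64)
    lb, ub = -np.inf, np.inf
    pos, neg = x[x > 0], x[x < 0]
    if pos.size:
        lb_p, ub_p = _approx_bounds(pos.max())
        lb, ub = max(lb, lb_p), min(ub, ub_p)
    if neg.size:
        # For negative data the transform uses the exponent 2 - lambda, so reflect the bounds.
        lb_n, ub_n = _approx_bounds(-neg.min())
        lb, ub = max(lb, 2 - ub_n), min(ub, 2 - lb_n)
    return lb, ub

With the constructed example above (x mostly in [0, 1), one element of -1 and one of 1e200), the negative side then only contributes a loose constraint, so the bounds should stay non-empty.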

@lsorber (Contributor, Author) commented Aug 4, 2023:

The reason this happens is that we currently assume that if you observe {-1, 1e200}, you may also observe -1e200 in the future. This could be avoided by making different assumptions for the positive and negative values separately as you suggest (although even then, we're making assumptions on future values).

@mdhaber (Contributor) commented Jul 31, 2023

Here's the gist of how we can compute the LLF without overflow.

import numpy as np
from scipy import stats
from scipy import special
import matplotlib.pyplot as plt
rng = np.random.default_rng(483258345)
x = rng.random(size=10)
lmb = stats.yeojohnson_normmax(x)

def _log_yeojohnson(x, lmb):
    logm1 = np.full_like(x, np.pi*1j, dtype=np.complex128)  # log(-1)
    return np.real(special.logsumexp([lmb*special.log1p(x), logm1], axis=0)
                   - np.log(lmb+0j))

lyj = _log_yeojohnson(x, lmb)
yj0 = stats.yeojohnson(x, lmb)
lyj0 = np.log(yj0)
np.testing.assert_allclose(lyj, lyj0)

def _log_mean(logx):
    return special.logsumexp(logx) - np.log(len(logx))

lyj_mean = _log_mean(lyj)
lyj_mean0 = np.log(np.mean(yj0))
np.testing.assert_allclose(lyj_mean, lyj_mean0)

def _log_var(logx):
    logmean = _log_mean(logx)
    pij = np.full_like(logx, np.pi*1j, dtype=np.complex128)
    logxmu = special.logsumexp([logx, logmean+pij], axis=0)
    return np.real(special.logsumexp(2*logxmu) - np.log(len(logx)))

lyj_var = _log_var(lyj)
lyj_var0 = np.log(np.var(yj0))

np.testing.assert_allclose(lyj_var, lyj_var0)

def _yeo_johnson_llf(lmb, x):
    # only for positive x right now
    n = len(x)
    logvar = _log_var(_log_yeojohnson(x, lmb))
    return -n/2*logvar + (lmb-1)*special.log1p(x).sum()

llf = _yeo_johnson_llf(lmb, x)
llf0 = stats.yeojohnson_llf(lmb, x)
np.testing.assert_allclose(llf, llf0)

# As an example, compute the LLF for the data in gh-18389 as a function of lambda

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0, 2009.0, 1980.0, 1999.0, 2007.0, 1991.0])
lmbs = np.linspace(98, 101)
llfs = []
for lmb in lmbs:
    llfs.append(_yeo_johnson_llf(lmb, x))

plt.plot(lmbs, llfs)
plt.xlabel('lambda')
plt.ylabel('(positive) LLF')
plt.show()
[Plot: the (positive) LLF as a function of lambda for the gh-18389 data, peaking near 99.3.]

This shows that the true maximum for the data occurs near 99.3 as in #18389 (comment).

So I'd suggest that we:

  • merge this PR
  • Fix this shortcoming: #18852 (comment)
  • Warn the user when the returned value of lambda is not optimal
  • Add an option to yeojohnson_normmax to return the true optimal value of lambda (default) or the best that will not cause the transformed data to overflow

@lsorber (Contributor, Author) commented Aug 4, 2023

So I'd suggest that we:

  • merge this PR
  • Fix this shortcoming: #18852 (comment)
  • Warn the user when the returned value of lambda is not optimal
  • Add an option to yeojohnson_normmax to return the true optimal value of lambda (default) or the best that will not cause the transformed data to overflow
  1. Agree.
  2. My preference would be to create a separate issue to compute the exact over/underflow-avoiding bounds for lambda.
  3. No opinion.
  4. Agree with the suggested improvement. I would however prefer that by default the optimal lambda that does not cause over/underflow is returned, assuming that the most common use case will be to take that lambda and use it to transform floating point data.

@mdhaber mdhaber merged commit 79b72d2 into scipy:main Aug 4, 2023
23 of 25 checks passed
@mdhaber (Contributor) commented Aug 4, 2023

Ok, I'll open an issue. Re: 2, though, previously it sounded like you did not want to use the exact formula. You do now? (Would you try implementing that? I'm not sure that Wolfram Alpha is inverting it correctly! I needed to use the $n=-1$ branch of the Lambert W.) Also, even with that formula, we'd also need to adjust the way signed data is treated, right?

@lsorber (Contributor, Author) commented Aug 6, 2023

Ok, I'll open an issue.

👍

Re: 2, though, previously it sounded like you did not want to use the exact formula. You do now?

For the purposes of this particular PR I thought that going after an exact solution would lead us down a pretty deep rabbit hole that I wasn't prepared to dive into. However, for a follow-up issue I do think it would be nice to have an exact solution that covers the remaining edge cases.

(Would you try implementing that? I'm not sure that Wolfram Alpha is inverting it correctly! I needed to use the n=−1 branch of the Lambert W.)

Let me know if you would like any help with the follow-up issue. I have a pretty busy schedule, so it might go a bit slower with me involved though.

Also, even with that formula, we'd also need to adjust the way signed data is treated, right?

Yes, true. Using an exact formula for lambda would also interact with the improved bounds logic, so I thought it might be nice to improve both the formula for lambda and the bounds logic in one go.

@mdhaber (Contributor) commented Aug 6, 2023

Using an exact formula for lambda would also interact...

Is there interaction? ISTM you'd calculate the domain of $\lambda$ for positive and negative observations separately, then take the intersection of the two domains.

If the branch of Lambert W needed stays the same (I'll check), it seems like it's just:

  • Treat the maximum positive observation as you do now except use the exact inverse transformation instead of the approximate inverse. (Should also check whether the maximum observation is 0 or 2 and treat those cases.) A rough sketch of this step follows the list.
  • Do the equivalent thing for the most negative observation.
  • The upper bound of $\lambda$ is the value that would cause overflow for the most positive observation or the value that would cause overflow for the most negative observation, whichever is smaller.
  • The lower bound of $\lambda$ is the value that would cause underflow for the most positive observation or the value that would cause underflow for the most negative observation, whichever is greater.
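Below is a rough sketch of that first step only (the overflow bound from the most positive observation). The helper name and the choice of target value are assumptions, and the k=-1 branch of the Lambert W comes from the earlier snippet in this thread; it would still need to be verified across the full parameter range:

import numpy as np
from scipy import special

def _lambda_for_output(x_max, c):
    # Solve ((x_max + 1)**lam - 1)/lam = c for lam via the Lambert W function
    # (the same inversion as the earlier Wolfram Alpha link).
    log1p_x = special.log1p(x_max)
    w = special.lambertw(-(x_max + 1)**(-1/c) * log1p_x / c, k=-1)
    return np.real(-w / log1p_x) - 1/c

x_max = 1e5                                  # largest positive observation (example value)
c_max = np.sqrt(np.finfo(np.float64).max)    # target output, with headroom for the variance
ub_pos = _lambda_for_output(x_max, c_max)    # candidate upper bound, ≈ 31 for these values

The equivalent calculations for underflow and for the most negative observation would then be combined by taking the tightest bounds, as described in the last two bullets.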

The thing is, if I do it then we have to wait for a review, which would also take a while, probably. When do you think you could draft it? Or perhaps @lorenzo @dschmitz89 @OmarManzoor would be interested?

@dschmitz89 (Contributor):

I am not familiar with the Yeo-Johnson transform, so I cannot really help out here.

@lsorber (Contributor, Author) commented Aug 7, 2023

Using an exact formula for lambda would also interact...

Is there interaction? ISTM you'd calculate the domain of λ for positive and negative observations separately, then take the intersection of the two domains.

If the branch of Lambert W needed stays the same (I'll check), it seems like it's just:

  • Treat the maximum positive observation as you do now except use the exact inverse transformation instead of the approximate inverse. (Should also check whether the maximum observation is 0 or 2 and treat those cases.)
  • Do the equivalent thing for the most negative observation.
  • The upper bound of λ is the value that would cause overflow for the most positive observation or the value that would cause overflow for the most negative observation, whichever is smaller.
  • The lower bound of λ is the value that would cause underflow for the most positive observation or the value that would cause underflow for the most negative observation, whichever is greater.

LGTM, that's about what I had in mind as well. There is some interaction in the sense that it's easier to apply steps (3) and (4) with the exact bounds because we wouldn't have to work around the possibility that the lower bound is greater than the upper bound.

The thing is, if I do it then we have to wait for a review, which would also take a while, probably. When do you think you could draft it? Or perhaps @lorenzo @dschmitz89 @OmarManzoor would be interested?

If you'd like me to help, I may be able to have a first draft in a week or two. I also have time at the end of the month to address the review comments and finalise the PR.

@mdhaber (Contributor) commented Aug 7, 2023

Sure @lsorber that would be great.

@lsorber lsorber deleted the fix-yeo-johnson-overflow branch September 4, 2023 06:38
@mdhaber (Contributor) commented Sep 11, 2023

@lsorber friendly ping!

@mdhaber mdhaber added this to the 1.12.0 milestone Sep 11, 2023
@lsorber (Contributor, Author) commented Sep 16, 2023

Hi @mdhaber, thanks for the reminder. I had begun working on this, but it was less straightforward than I expected. I'll create a PR when it's closer to a review-ready state.

@lsorber (Contributor, Author) commented Jan 3, 2024

Hi @mdhaber, apologies that it took longer than expected, but I finally got around to wrapping up my work on the exact bounds PR, see #19802. As I mentioned earlier, using an exact formula is not as straightforward as it may seem for a variety of reasons:

  1. Under- and overflow can occur in both the full transform and in its numerator, and ideally both should be avoided.
  2. Other parts of SciPy such as scipy.optimize.fminbound are not equipped to avoid under- and overflow.
  3. The exact inverse has several solutions (branches) to choose from when inverting for lambda.
