
BUG: fix overflow in stats.yeojohnson #18852

Merged

10 commits merged into scipy:main on Aug 4, 2023

Conversation

@lsorber (Contributor) commented Jul 10, 2023

Reference issue

Closes #18389.

What does this implement/fix?

This PR fixes two sources of overflow in stats.yeojohnson:

  1. The first source of overflow is caused by certain invocations of np.power with unbalanced bases and exponents.
  2. The second source of overflow occurs when the optimal transform is beyond what floating point can represent.
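As a hedged illustration of the second failure mode (the data values below come from gh-18389 and are quoted later in this thread; the exact warnings emitted depend on the SciPy version):

import numpy as np
from scipy import stats

# The optimal lambda for this data is large (around 99), so naively transforming
# the data overflows float64 while searching for the optimum.
x = np.array([2003., 1950., 1997., 2000., 2009., 2009., 1980., 1999., 2007., 1991.])
lmbda = stats.yeojohnson_normmax(x)  # emitted overflow RuntimeWarnings before this fix
print(lmbda)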

@melissawm (Contributor):

Docs build failure is unrelated here.

@mdhaber (Contributor) commented Jul 10, 2023

Item 1 is simple and uncontroversial; I could merge it today if it were a separate PR.

For the second part, it looks like you used the analytical bounds approach (mentioned in #18389 (comment)) instead of the "regularization term" (what you originally proposed in that issue), right?

In that case, you mentioned that this problem is convex, so do I understand that this doesn't change behavior for already working code because any local optimum is a global optimum? Is it strictly convex such that the solution is unique? Can you provide some background about how the bounds are derived and add some comments in the code about the magic numbers (e.g. 20)?

Re: the regularization term - it sounds like that is a distinct enhancement that one could make to find a suboptimal - but nearly optimal - solution that avoids overflow? Or is it no longer necessary if you have these analytical bounds?

Thanks!

@lsorber (Contributor, Author) commented Jul 11, 2023

Item 1 is simple and uncontroversial; I could merge it today if it were a separate PR.

👍

For the second part, it looks like you used the analytical bounds approach (mentioned in #18389 (comment)) instead of the "regularization term" (what you originally proposed in that issue), right?

That's right. With the regularisation approach, the trouble is that it would also change users' existing solutions. The advantage of the bounded optimisation is that it doesn't change users' solutions (given that the Yeo-Johnson negative log likelihood is strictly convex).

In that case, you mentioned that this problem is convex, so do I understand that this doesn't change behavior for already working code because any local optimum is a global optimum? Is it strictly convex such that the solution is unique?

It is indeed strictly convex [1] and therefore existing solutions should not change when using a different optimiser.

Can you provide some background about how the bounds are derived and add some comments in the code about the magic numbers (e.g. 20)?

Sure, here's a commented version of the implementation:

# Convert input data to a NumPy array, without copying the data if possible.
x = np.asarray(x)

# Take the data's dtype if already in floating point, else assume conversion to float64 by the optimiser.
dtype = x.dtype if np.issubdtype(x.dtype, np.floating) else np.float64

# Given that floating point is bounded by (finfo.tiny, finfo.max), we don't want to go to the edges all the way.
# Instead, we'll use (finfo.tiny - finfo.eps, finfo.max + finfo.eps) so that there's a small buffer on either side.
b = np.log(np.finfo(dtype).eps)

# We're optimising lambda for the given data in `x`, but later on the same Yeo-Johnson transform may be used on new data.
# We don't know what values we'll observe then, but we could assume that those values will be less than Q3 + 10 * 1.5 * IQR, where Q3 + 1.5 * IQR is a standard way to determine an outlier threshold on `x`, and 10 is a safety factor.
# We can approximate this formula as follows: Q3 < max(x) and IQR < max(x), so we have max(x) + 10 * 1.5 * max(x) = 16 * max(x).
# Then we round this to get an even 20 * max(x).
# We take the absolute value to deal with negative data, and use `np.nanmax` to avoid early failure.
max_x = 20 * np.nanmax(np.abs(x))

# The Yeo-Johnson transform `((x + 1) ** lam - 1) / lam` can be approximated by `(x + 1) ** lam`.
# The lambda resulting in the smallest (largest) output C from the approximate transform is then `log(C) / log(max(x) + 1)`.
# We divide this by a factor of 2 because the Yeo-Johnson log likelihood function calculates the variance of the transformed data, which could also overflow.
# The factor of 2 could be avoided by improving the log likelihood implementation, but users would then still be unable to compute the variance of the transformed data without overflow.
# The adjustment needed for negative data is minor and isn't worth a special case, so we use the same bounds as we do for positive data.
lb = (np.log(np.finfo(dtype).tiny) - b) / 2 / np.log(max_x + 1)
ub = (np.log(np.finfo(dtype).max) + b) / 2 / np.log(max_x + 1)

# We set the `optimize.fminbound` tolerance to match that of `optimize.brent` so that we get the same result if the optimal lambda lies within the bounds.
tol_brent = 1.48e-08

return optimize.fminbound(_neg_llf, lb, ub, args=(x,), xtol=tol_brent)

[1] https://arxiv.org/pdf/2210.01639.pdf
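For concreteness, here is a quick numeric evaluation of those bound formulas for float64 (the assumed max |x| of 100 is only an example, not a value from the PR):

import numpy as np

fi = np.finfo(np.float64)
b = np.log(fi.eps)
max_x = 20 * 100.0                                    # suppose the largest |x| observed is 100
lb = (np.log(fi.tiny) - b) / 2 / np.log(max_x + 1)    # ≈ -44.2
ub = (np.log(fi.max) + b) / 2 / np.log(max_x + 1)     # ≈ 44.3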

@mdhaber (Contributor) commented Jul 11, 2023

Given that floating point is bounded by (finfo.tiny, finfo.max), we don't want to go to the edges all the way. Instead, we'll use (finfo.tiny - finfo.eps, finfo.max + finfo.eps) so that there's a small buffer on either side.

  1. If you don't want to go to the edges (only approach them from within, I'm assuming), wouldn't it be the other way around - add to tiny and subtract from max?
    Also, finfo.tiny + finfo.eps == finfo.eps, and finfo.max - finfo.eps == finfo.max. Please consider np.nextafter if you only need to go one floating point number over.
    Then again, since you're working with logs, and $\log(a) + \log(b) = \log(ab)$, I don't think the comment and code match as they are. Which is intended?

Q3 + 10 * 1.5 * IQR

  2. This explains one magic number with two new ones : ) What are 10 and 1.5? (Perhaps there is something like Chebyshev's inequality but for quartiles that I'm not familiar with?)

The lambda resulting in the smallest (largest) output C from the approximate transform

  3. Ok, so the point is to find bounds on lambda such that 1) the transformed data does not go out of bounds and 2) new data is unlikely to go out of bounds (according to some argument about population based on the original data).
    So just to check my understanding, it sounds like part 2 of the sentence could affect existing code because your bounds are a little more conservative than they need to be just to transform the existing data? (That's probably ok as long as it's not too conservative.)

  4. Also, this is based on an approximation of the transform rather than the transform itself. Is the approximation conservative in the sense that if the approximately transformed data stays in bounds, the exactly transformed data is guaranteed to stay within bounds?

@mdhaber added the "maintenance" label (Items related to regular maintenance tasks) on Jul 12, 2023
@mdhaber (Contributor) commented Jul 13, 2023

@lsorber there are several maintainers at the SciPy conference right now, so I think we can merge this shortly after we hear back from you about the questions above. Thanks!

@lsorber (Contributor, Author) commented Jul 14, 2023

Given that floating point is bounded by (finfo.tiny, finfo.max), we don't want to go to the edges all the way. Instead, we'll use (finfo.tiny - finfo.eps, finfo.max + finfo.eps) so that there's a small buffer on either side.

  1. If you don't want to go to the edges (only approach them from within, I'm assuming), wouldn't it be the other way around - add to tiny and subtract from max?
    Also, finfo.tiny + finfo.eps == finfo.eps, and finfo.max - finfo.eps == finfo.max. Please consider np.nextafter if you only need to go one floating point number over.
    Then again, since you're working with logs, and log(a) + log(b) = log(ab), I don't think the comment and code match as they are. Which is intended?

Apologies for not being clear. Those formulas were intended to be in log space, so the float exponent range would be (log10(tiny) - log10(eps), log10(max) + log10(eps)). For double precision, this would evaluate to the base-10 exponent range (-308 - (-16), 308 + (-16)) = (-292, 292).
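A quick numeric check of that exponent range (float64):

import numpy as np

fi = np.finfo(np.float64)
print(np.log10(fi.tiny) - np.log10(fi.eps))  # ≈ -292
print(np.log10(fi.max) + np.log10(fi.eps))   # ≈ 292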

Q3 + 10 * 1.5 * IQR

  2. This explains one magic number with two new ones : ) What are 10 and 1.5? (Perhaps there is something like Chebyshev's inequality but for quartiles that I'm not familiar with?)

The Q3 + 1.5 * (Q3 - Q1) threshold is a classic way to define outliers for any dataset [1] [2], and is for example used to draw the whiskers of Box plots [3]. The resulting Yeo-Johnson transform should however also be applicable to a dataset's outliers without over- or underflow. Therefore, we need a threshold beyond which we almost certainly won't observe any values in the future. That's why as a first step, I increased Tukey's constant k by an order of magnitude from 1.5 to 15. Then, to avoid actually computing Q3 and the IQR, I replaced those values by the maximum observed value in the data. And finally, I rounded 16 to 20.

The effect of doing this is that if you fit a Yeo-Johnson transform on a dataset, you may still safely apply it to values that are up to 20 times larger than the largest observed value in the dataset. If a dataset's largest value is 100, that means the optimal transform is computable without overflow for values up to 2000. At the end of the day, the multiplier 20 is indeed arbitrary. You could argue that we should increase it to, say, 100, which I think is defensible, but it comes at the cost of reducing the domain in which to find an optimal lambda.

The lambda resulting in the smallest (largest) output C from the approximate transform

  3. Ok, so the point is to find bounds on lambda such that 1) the transformed data does not go out of bounds and 2) new data is unlikely to go out of bounds (according to some argument about population based on the original data).
    So just to check my understanding, it sounds like part 2 of the sentence could affect existing code because your bounds are a little more conservative than they need to be just to transform the existing data? (That's probably ok as long as it's not too conservative.)

That's correct. For datasets that are prone to over- or underflow, I think most users would rather trade off a small bit of likelihood to reduce the chance of over- or underflow to near zero. The 20 multiplier is a trade-off between those two goals: achievable likelihood and the risk of over- or underflow.

  4. Also, this is based on an approximation of the transform rather than the transform itself. Is the approximation conservative in the sense that if the approximately transformed data stays in bounds, the exactly transformed data is guaranteed to stay within bounds?

The approximation is as conservative as it can be in that we give lambda access to the full range of floating point values, modulo the space we reserve for three factors:

  1. We reserve log10(eps) on either side of the exponent range so that we don't inadvertently create over- or underflow by allowing transformations all the way up to the edge of the range.
  2. We reserve some space in the exponent range to allow safe transformation of new data that is outside of the range of observed values.
  3. We reserve about half of what remains to allow safe computation of the variance of the transformed data.

That said, there is an additional effect of dividing by lambda that is not accounted for. It would take some further analysis to make sure the proposed bounds cover every possible source of over- and underflow. Without writing up a full analysis, I'm pretty confident that overflow is avoided in all cases, but that underflow can still happen in some extreme cases (see below).

I added an additional commit with some improvements:

  1. Implementation:
    a. I upgraded the bounds calculation itself to use np.log1p as well.
    b. The bounds for negative data are now explicitly handled as a special case.
  2. Tests:
    a. I added overflow tests for extremely large positive data. Overflow does not occur even for data as large as 1e300.
    b. I added underflow tests for extremely small positive data. The current bounds prevent underflow for data up to 1e-150. If the data is even smaller than that, underflow may occur as a result of the division by lambda that is ignored by the bounds approximation.
    c. I added over- and underflow tests for negative data.

[1] https://en.wikipedia.org/wiki/Interquartile_range#Outliers
[2] https://en.wikipedia.org/wiki/Outlier#Tukey's_fences
[3] https://en.wikipedia.org/wiki/Box_plot#Whiskers
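A hedged reproduction of the checks described under "Tests" above (the magnitudes 1e300 and 1e-150 are taken from the text; the actual tests added in the PR may be structured differently):

import numpy as np
from scipy import stats

# Extremely large positive data: no overflow expected for magnitudes up to about 1e300.
stats.yeojohnson_normmax(np.array([1e300, 2e300, 3e300]))
# Extremely small positive data: no underflow expected for magnitudes down to about 1e-150.
stats.yeojohnson_normmax(np.array([1e-150, 2e-150, 3e-150]))
# Negative data: handled via the reflected bounds.
stats.yeojohnson_normmax(np.array([-1e300, -2e300, -3e300]))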

@mdhaber (Contributor) left a review comment:

Thanks for the replies @lsorber. I think this is close. Much of what I'd suggest is just to make the intent of the code more obvious to future readers by refactoring slightly and including code comments.

BTW I thought I'd mention that there does seem to be an inverse to the exact transformation (Corrected per comment below.):
https://www.wolframalpha.com/input?i=solve+%28%28x%2B1%29%5El-1%29%2Fl+%3D+c+for+l

import numpy as np
from scipy import special
x = 25
c = 1000
# Branch k=-1 found by trial and error. Is it always the same?
l = np.real(-special.lambertw(- ((x+1)**(-1/c)*special.log1p(x))/c, k=-1)/special.log1p(x)) - 1/c  # 2.3873997498774027
((x + 1)**l-1)/l  # 1000.0

@lorentzenchr do these comments look appropriate to you? Do you have any others?

Inline review comments on scipy/stats/_morestats.py (4) and scipy/stats/tests/test_morestats.py (3): outdated, resolved.
@lsorber (Contributor, Author) commented Jul 25, 2023

BTW I thought I'd mention that there does seem to be an inverse to the exact transformation:
https://www.wolframalpha.com/input?i=solve+%28x%2B1%29%5E%28l-1%29%2Fl+%3D+c+for+l

There's a small error in that formula, I believe it should be: https://www.wolframalpha.com/input?i=solve+%28%28x%2B1%29%5El-1%29%2Fl+%3D+c+for+l

I don't expect the exact inverse to yield a significantly larger search space for lambda though, so I'm not sure it's worth the additional complexity in the implementation.

@lorentzenchr (Contributor) left a review comment:

Some comments, likely mostly reflecting my own lack of understanding.

scipy/stats/_morestats.py (resolved):

     return loglike


-def yeojohnson_normmax(x, brack=(-2, 2)):
+def yeojohnson_normmax(x, brack=None):
Review comment (Contributor):

Just as a notification: this is a public API function and a slight change of the default parameters.

Reply (Contributor):

We discussed in #18852 (comment) and the reply that although this looks like a change, it should not materially change the solution found because the problem is strictly convex.

Reply from @lsorber (Contributor, Author):

In addition to @mdhaber's comment: if a brack is supplied, the implementation uses optimize.brent as before.

Further inline review comments on scipy/stats/_morestats.py and scipy/stats/tests/test_morestats.py: resolved.
Comment on lines 1698 to 1699:

if np.any(x < 0):
    lb, ub = 2 - ub, 2 - lb
Review comment from @lorentzenchr (Contributor):

Did you mean np.all(x < 0)?
Or should it read

lb, ub = min(lb, 2 - ub), max(ub, 2 - lb)

?

Reply from @lsorber (Contributor, Author):

I used np.any because it is slightly more defensive in case there are non-finite values such as NaN in the data. It could be np.all if you prefer that. The min/max suggestion you make is compact, but I'm not sure it's equivalent.

Reply (Contributor):

I'm not sure if those reflection formulas apply if only a single value in x is < 0.

Reply from @lsorber (Contributor, Author):

I improved the bounds so that the cases {all positive, all negative, mixed positive & negative} data are handled separately:

# Compute lb, ub for all positive data
# Reflect the bounds if the data is all negative
if np.all(x < 0):
    lb, ub = 2 - ub, 2 - lb
# Take the most conservative of the positive and negative bounds if the data has mixed signs
elif np.any(x < 0):
    lb, ub = max(2 - ub, lb), min(2 - lb, ub)

lsorber and others added 2 commits July 27, 2023 13:55
@lsorber (Contributor, Author) commented Jul 27, 2023

Added a test for the case where the input data is all-zero.

@mdhaber (Contributor) commented Jul 28, 2023

I happened to notice that the case of all integers gives an error in main (which could be fixed by ensuring that the dtype is at least float), so I tested with this PR.

from scipy import stats
stats.yeojohnson_normmax([1, 2, 3.])  # 0.5907038225450774
stats.yeojohnson_normmax([1, 2, 3])  # 69.11083988572132

Good that there's no error; surprising that it affects the result!

Besides that, it looks like there is still #18852 (comment), which is something I was wondering about and need to look at more closely.

@lsorber (Contributor, Author) commented Jul 28, 2023

I happened to notice that the case of all integers gives an error in main (which could be fixed by ensuring that the dtype is at least float), so I tested with this PR.

from scipy import stats
stats.yeojohnson_normmax([1, 2, 3.])  # 0.5907038225450774
stats.yeojohnson_normmax([1, 2, 3])  # 69.11083988572132

Good that there's no error; surprising that it affects the result!

Besides that, it looks like there is still #18852 (comment), which is something I was wondering about and need to look at more closely.

I added a test for integer data to prevent regressions for integer data in the future.

EDIT: Improved this test to actually cover the discrepancy you noticed. This PR now addresses more than just overflow/underflow, but also corrects the handling of all-zero input, non-finite input, and integer input.
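A minimal sketch of the kind of dtype promotion described above (the helper name is hypothetical; the PR's actual handling lives inside yeojohnson_normmax):

import numpy as np

def _as_float(x):
    # Promote integer (and other non-floating) input to float64 so that
    # integer and float inputs take the same code path and yield the same lambda.
    x = np.asarray(x)
    return x if np.issubdtype(x.dtype, np.floating) else x.astype(np.float64)

_as_float([1, 2, 3]).dtype   # float64
_as_float([1, 2, 3.]).dtype  # float64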

# Compute the bounds by approximating the inverse of the Yeo-Johnson
# transform on the smallest and largest floating point exponents, given
# the largest data we expect to observe. See [1] for further details.
# [1] https://github.com/scipy/scipy/pull/18852
Review comment (Contributor):

Could you link to the specific comment in the PR?

Reply from @lsorber (Contributor, Author):

Done

@mdhaber (Contributor) left a review comment:

@lsorber I'm willing to merge this as-is because overall I think it's a big improvement, but I think it's worth a closer look at the bounds when there is both positive and negative data. Shall I merge this and open an issue about it? That way, at least this much is done and we can just tweak the bounds. Or would you prefer to keep working here?

There is definitely more we can do, though.

  1. I think we need to somehow indicate when the returned value of lambda is at one of the bounds, in which case it is (almost certainly) not the truly optimal value of lambda. This is the case for the MRE in #18389 (comment) (BUG: Yeo-Johnson Power Transformer gives Numpy warning).
  2. It's worth reconsidering whether we need to adjust the bounds by log(epsilon). That might be a bit too conservative.
  3. The calculation of the likelihood only requires the log of the variance of the transformed data. I think we can calculate that from the log of the transformed data directly using the logsumexp trick (applied in at least two ways). It seems to me that the user should have the option to calculate the truly optimal value of lambda, even if the transformed data cannot be represented in float64.

Inline review comment on scipy/stats/tests/test_morestats.py: resolved.
Comment on lines +1704 to +1705:

elif np.any(x < 0):
    lb, ub = max(2 - ub, lb), min(2 - lb, ub)
@mdhaber (Contributor) commented Jul 30, 2023:

Consider the following:

import numpy as np
from scipy import stats
rng = np.random.default_rng(483258345)
x = rng.random(size=1000)
# x[0] = -1
x[-1] = 1e200
stats.yeojohnson_normmax(x)

With the line commented out, this solves successfully. With the introduction of a negative element (no matter how slightly negative), we get ValueError: The lower bound exceeds the upper bound..

Clearly, this example was constructed to exploit the issue, but the main idea is that

lb, ub = max(2 - ub, lb), min(2 - lb, ub)

is conservative in the sense that it avoids overflow, but it can be overly-restrictive when there is some positive data and some negative data. Before this line, lb and ub tend to be of roughly equal magnitude and opposite sign, so the effect of this line is roughly that it increases the lower bound by approximately 2, leaving the upper bound unchanged.

@lorentzenchr 's suggestion was

lb, ub = min(2 - ub, lb), max(2 - lb, ub)

But I imagine that could be non-conservative (w.r.t. overflow) in some cases because ultimately it has the effect of increasing the upper bound by approximately 2, leaving the lower bound unchanged.

When there is data of large magnitude, the bounds can be quite tight, so adding 2 can make a big difference.

Maybe the most correct thing is to look at the magnitudes of the most negative and most positive elements separately?
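A rough sketch of that per-sign idea (the helper names and the reuse of the approximate log-based bound from earlier in this thread are assumptions, not code from the PR):

import numpy as np

def _approx_bounds(max_abs, dtype=np.float64):
    # Approximate lambda bounds for all-positive data with largest magnitude max_abs,
    # following the log-based derivation discussed above.
    fi = np.finfo(dtype)
    b = np.log(fi.eps)
    denom = 2 * np.log(20 * max_abs + 1)
    return (np.log(fi.tiny) - b) / denom, (np.log(fi.max) + b) / denom

def _bounds_per_sign(x):
    # Intersect bounds derived from the most positive and the most negative observations separately.
    x = np.asarray(x, dtype=np.float64)
    lb, ub = -np.inf, np.inf
    pos, neg = x[x > 0], x[x < 0]
    if pos.size:
        lb_p, ub_p = _approx_bounds(pos.max())
        lb, ub = max(lb, lb_p), min(ub, ub_p)
    if neg.size:
        # For negative data the transform uses the exponent 2 - lambda, so reflect the bounds.
        lb_n, ub_n = _approx_bounds(-neg.min())
        lb, ub = max(lb, 2 - ub_n), min(ub, 2 - lb_n)
    return lb, ub

With the constructed example above (x mostly in [0, 1), one element of -1 and one of 1e200), the negative side then only contributes a loose constraint, so the bounds should stay non-empty.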

@lsorber (Contributor, Author) commented Aug 4, 2023:

The reason this happens is that we currently assume that if you observe {-1, 1e200}, you may also observe -1e200 in the future. This could be avoided by making different assumptions for the positive and negative values separately as you suggest (although even then, we're making assumptions on future values).

@mdhaber (Contributor) commented Jul 31, 2023

Here's the gist of how we can compute the LLF without overflow.

import numpy as np
from scipy import stats
from scipy import special
import matplotlib.pyplot as plt
rng = np.random.default_rng(483258345)
x = rng.random(size=10)
lmb = stats.yeojohnson_normmax(x)

def _log_yeojohnson(x, lmb):
    logm1 = np.full_like(x, np.pi*1j, dtype=np.complex128)  # log(-1)
    return np.real(special.logsumexp([lmb*special.log1p(x), logm1], axis=0)
                   - np.log(lmb+0j))

lyj = _log_yeojohnson(x, lmb)
yj0 = stats.yeojohnson(x, lmb)
lyj0 = np.log(yj0)
np.testing.assert_allclose(lyj, lyj0)

def _log_mean(logx):
    return special.logsumexp(logx) - np.log(len(logx))

lyj_mean = _log_mean(lyj)
lyj_mean0 = np.log(np.mean(yj0))
np.testing.assert_allclose(lyj_mean, lyj_mean0)

def _log_var(logx):
    logmean = _log_mean(logx)
    pij = np.full_like(logx, np.pi*1j, dtype=np.complex128)
    logxmu = special.logsumexp([logx, logmean+pij], axis=0)
    return np.real(special.logsumexp(2*logxmu) - np.log(len(logx)))

lyj_var = _log_var(lyj)
lyj_var0 = np.log(np.var(yj0))

np.testing.assert_allclose(lyj_var, lyj_var0)

def _yeo_johnson_llf(lmb, x):
    # only for positive x right now
    n = len(x)
    logvar = _log_var(_log_yeojohnson(x, lmb))
    return -n/2*logvar + (lmb-1)*special.log1p(x).sum()

llf = _yeo_johnson_llf(lmb, x)
llf0 = stats.yeojohnson_llf(lmb, x)
np.testing.assert_allclose(llf, llf0)

# As an example, compute the LLF for the data in gh-18389 as a function of lambda

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0, 2009.0, 1980.0, 1999.0, 2007.0, 1991.0])
lmbs = np.linspace(98, 101)
llfs = []
for lmb in lmbs:
    llfs.append(_yeo_johnson_llf(lmb, x))

plt.plot(lmbs, llfs)
plt.xlabel('lambda')
plt.ylabel('(positive) LLF')
plt.show()
[Plot: the (positive) LLF as a function of lambda for the gh-18389 data, peaking near 99.3.]

This shows that the true maximum for the data occurs near 99.3 as in #18389 (comment).

So I'd suggest that we:

  • merge this PR
  • Fix this shortcoming: #18852 (comment)
  • Warn the user when the returned value of lambda is not optimal
  • Add an option to yeojohnson_normmax to return the true optimal value of lambda (default) or the best that will not cause the transformed data to overflow

@lsorber (Contributor, Author) commented Aug 4, 2023

So I'd suggest that we:

  • merge this PR
  • Fix this shortcoming: #18852 (comment)
  • Warn the user when the returned value of lambda is not optimal
  • Add an option to yeojohnson_normmax to return the true optimal value of lambda (default) or the best that will not cause the transformed data to overflow
  1. Agree.
  2. My preference would be to create a separate issue to compute the exact over/underflow-avoiding bounds for lambda.
  3. No opinion.
  4. Agree with the suggested improvement. I would however prefer that by default the optimal lambda that does not cause over/underflow is returned, assuming that the most common use case will be to take that lambda and use it to transform floating point data.

@mdhaber mdhaber merged commit 79b72d2 into scipy:main Aug 4, 2023
23 of 25 checks passed
@mdhaber (Contributor) commented Aug 4, 2023

Ok, I'll open an issue. Re: 2, though, previously it sounded like you did not want to use the exact formula. You do now? (Would you try implementing that? I'm not sure that Wolfram Alpha is inverting it correctly! I needed to use the $n=-1$ branch of the Lambert W.) Also, even with that formula, we'd also need to adjust the way signed data is treated, right?

@lsorber (Contributor, Author) commented Aug 6, 2023

Ok, I'll open an issue.

👍

Re: 2, though, previously it sounded like you did not want to use the exact formula. You do now?

For the purposes of this particular PR I thought that going after an exact solution would lead us down a pretty deep rabbit hole that I wasn't prepared to dive into. However, for a follow-up issue I do think it would be nice to have an exact solution that covers the remaining edge cases.

(Would you try implementing that? I'm not sure that Wolfram Alpha is inverting it correctly! I needed to use the n=−1 branch of the Lambert W.)

Let me know if you would like any help with the follow-up issue. I have a pretty busy schedule, so it might go a bit slower with me involved though.

Also, even with that formula, we'd also need to adjust the way signed data is treated, right?

Yes, true. Using an exact formula for lambda would also interact with the improved bounds logic, so I thought it might be nice to improve both the formula for lambda and the bounds logic in one go.

@mdhaber (Contributor) commented Aug 6, 2023

Using an exact formula for lambda would also interact...

Is there interaction? ISTM you'd calculate the domain of $\lambda$ for positive and negative observations separately, then take the intersection of the two domains.

If the branch of Lambert W needed stays the same (I'll check), it seems like it's just:

  • Treat the maximum positive observation as you do now except use the exact inverse transformation instead of the approximate inverse. (Should also check whether the maximum observation is 0 or 2 and treat those cases.) A rough sketch of this step follows the list.
  • Do the equivalent thing for the most negative observation.
  • The upper bound of $\lambda$ is the value that would cause overflow for the most positive observation or the value that would cause overflow for the most negative observation, whichever is smaller.
  • The lower bound of $\lambda$ is the value that would cause underflow for the most positive observation or the value that would cause underflow for the most negative observation, whichever is greater.
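Below is a rough sketch of that first step only (the overflow bound from the most positive observation). The helper name and the choice of target value are assumptions, and the k=-1 branch of the Lambert W comes from the earlier snippet in this thread; it would still need to be verified across the full parameter range:

import numpy as np
from scipy import special

def _lambda_for_output(x_max, c):
    # Solve ((x_max + 1)**lam - 1)/lam = c for lam via the Lambert W function
    # (the same inversion as the earlier Wolfram Alpha link).
    log1p_x = special.log1p(x_max)
    w = special.lambertw(-(x_max + 1)**(-1/c) * log1p_x / c, k=-1)
    return np.real(-w / log1p_x) - 1/c

x_max = 1e5                                  # largest positive observation (example value)
c_max = np.sqrt(np.finfo(np.float64).max)    # target output, with headroom for the variance
ub_pos = _lambda_for_output(x_max, c_max)    # candidate upper bound, ≈ 31 for these values

The equivalent calculations for underflow and for the most negative observation would then be combined by taking the tightest bounds, as described in the last two bullets.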

The thing is, if I do it then we have to wait for a review, which would also take a while, probably. When do you think you could draft it? Or perhaps @lorenzo @dschmitz89 @OmarManzoor would be interested?

@dschmitz89 (Contributor):

I am not familiar with the Yeo-Johnson transform, so I cannot really help out here.

@lsorber (Contributor, Author) commented Aug 7, 2023

Using an exact formula for lambda would also interact...

Is there interaction? ISTM you'd calculate the domain of λ for positive and negative observations separately, then take the intersection of the two domains.

If the branch of Lambert W needed stays the same (I'll check), it seems like it's just:

  • Treat the maximum positive observation as you do now except use the exact inverse transformation instead of the approximate inverse. (Should also check whether the maximum observation is 0 or 2 and treat those cases.)
  • Do the equivalent thing for the most negative observation.
  • The upper bound of λ is the value that would cause overflow for the most positive observation or the value that would cause overflow for the most negative observation, whichever is smaller.
  • The lower bound of λ is the value that would cause underflow for the most positive observation or the value that would cause underflow for the most negative observation, whichever is greater.

LGTM, that's about what I had in mind as well. There is some interaction in the sense that it's easier to apply steps (3) and (4) with the exact bounds because we wouldn't have to work around the possibility that the lower bound is greater than the upper bound.

The thing is, if I do it then we have to wait for a review, which would also take a while, probably. When do you think you could draft it? Or perhaps @lorenzo @dschmitz89 @OmarManzoor would be interested?

If you'd like me to help, I may be able to have a first draft in a week or two. I also have time at the end of the month to address the review comments and finalise the PR.

@mdhaber (Contributor) commented Aug 7, 2023

Sure @lsorber that would be great.

@lsorber lsorber deleted the fix-yeo-johnson-overflow branch September 4, 2023 06:38
@mdhaber (Contributor) commented Sep 11, 2023

@lsorber friendly ping!

@mdhaber mdhaber added this to the 1.12.0 milestone Sep 11, 2023
@lsorber (Contributor, Author) commented Sep 16, 2023

Hi @mdhaber, thanks for the reminder. I had begun working on this, but it was less straightforward than I expected. I'll create a PR when it's closer to a review-ready state.

@lsorber (Contributor, Author) commented Jan 3, 2024

Hi @mdhaber, apologies that it took longer than expected, but I finally got around to wrapping up my work on the exact bounds PR, see #19802. As I mentioned earlier, using an exact formula is not as straightforward as it may seem for a variety of reasons:

  1. Under- and overflow can occur in both the full transform and in its numerator, and ideally both should be avoided.
  2. Other parts of SciPy such as scipy.optimize.fminbound are not equipped to avoid under- and overflow.
  3. The exact inverse has several solutions (branches) to choose from when inverting for lambda.
