Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix negative overflow in stats.boxcox_normmax #19691

Merged
merged 9 commits into from Dec 19, 2023

Conversation

xuefeng-xu
Copy link
Contributor

@xuefeng-xu xuefeng-xu commented Dec 13, 2023

Reference issue

Follow up #19604 (The prior fix positive overflow, this fix negative overflow)
Towards #19016

What does this implement/fix?

Negative overflow #19631 (comment)

There are two cases to check, see #19631 (comment) for more detail.

  • When xmax > 1 and lmb > 1, check if BC(xmax) > ymax
  • When xmin < 1 and lmb < 0, check if BC(xmin) < -ymax

Additional information

The second check is improved from xmin < 1 and lmb < 1 to xmin < 1 and lmb < 0
See #19631 (comment) for explanation and

from scipy.special import boxcox
print(boxcox(np.finfo(np.float64).smallest_normal, 0))
# -708.3964185322641

@lucascolley lucascolley added maintenance Items related to regular maintenance tasks scipy.stats labels Dec 13, 2023
@mdhaber
Copy link
Contributor

mdhaber commented Dec 13, 2023

@xuefeng-xu can you explain the changes after the first commit? What cases did it not consider?

@xuefeng-xu
Copy link
Contributor Author

The following mainly use the fact that the transformation function is increasing in both lmb and x

The possible positive overflow case is when xmax > 1 and lmb > 1
The other cases are impossible to cause positive overflow

  • xmax < 1, BC(xmax) < BC(lmb, x=1) = 0 (the upper bound is 0)
  • xmax > 1 and lmb < 1: BC(xmax) < BC(lmb=1, xmax) = xmax-1 (the upper bound is xmax-1)
    The output is finite in these cases since the input x must be finite.

The possible negtive overflow case is when xmin < 1 and lmb < 0
The other cases are impossible to cause negative overflow

  • xmin > 1, BC(xmin) > BC(lmb, x=1) = 0 (the lower bound is 0)
  • xmin < 1 and lmb > 0, BC(xmin) > BC(lmb=0, xmin) = log xmin (the lower bound is log xmin)
    Let xmin be the smallest positive representable value, we can see the minimum value is finite (approximatly -709).

@xuefeng-xu
Copy link
Contributor Author

Actually, I want to improve the negative overflow case from xmin < 1 and lmb < 0 to xmin < 1 and lmb < -1
Since BC(xmin) > BC(lmb=-1, xmin) = 1-1/xmin, the lower bound is 1-1/xmin.

But because smallest_normal is not actually the smallest positive representable value in a NumPy floating point type, it might still cause negative overflow.

print(np.finfo(np.float64).smallest_normal) # 2.2250738585072014e-308
print(boxcox(np.finfo(np.float64).smallest_normal, -1)) # -4.4942328371556665e+307
print(boxcox(1e-323,-1)) # -inf

@mdhaber
Copy link
Contributor

mdhaber commented Dec 13, 2023

Thanks, but that says what you did, whereas I'm asking about the difference between what you did and what I had. For instance, is there an example case that your commits fixed that f393477 did not? Also, why move the test case from the class that tests boxcox_normmax, where these changes are actually made, to the class that tests boxcox?

@xuefeng-xu
Copy link
Contributor Author

xuefeng-xu commented Dec 13, 2023

is there an example case that your commits fixed that f393477 did not?

No. The first commit consider the following two cases, I just narrow the scope.

  • When lmb > 0, check if BC(xmax) > ymax
  • When lmb < 0, check if BC(xmin) < -ymax

why move the test case from the class that tests boxcox_normmax, where these changes are actually made, to the class that tests boxcox?

Previously in #19604, I added the test in for boxcox. I will move this test to boxcox_normmax.

@mdhaber
Copy link
Contributor

mdhaber commented Dec 13, 2023

I just narrow the scope.

Do you mean it's more efficient (e.g. it doesn't perform an expensive calculation if it doesn't need to)? Before I review your code in detail, I need to better understand the motivation for the extra complexity. I try to avoid quadruply-nested if statements when possible. If there was a four-line change that fixed the bug, I need to understand why the 20-line change is better before I dig into it.

@xuefeng-xu
Copy link
Contributor Author

it doesn't perform expensive calculations if it doesn't need to

Yes, that is my intention. Although it seems that the calculation might not be expensive.

@mdhaber
Copy link
Contributor

mdhaber commented Dec 13, 2023

Yes, that is my intention. Although it seems that the calculation might not be expensive.

I see. So you were using theory to avoid actually performing a boxcox transformation when you could tell just from the data that the boxcox transformation would not overflow.

Now that I understand the intent, I could take the time to check that this implementation is indeed faster, verify the mathematical argument for avoiding the direct overflow check, and then I could review the code to see that it correctly implements the math. But I think there is a balance to be struck between code performance, code complexity, and review time. If you'd like to go this route, can you demonstrate that the performance improvement is worth that extra work?

Considering that the function also performs a comparatively expensive optimization, my guess is that these extra checks to avoid a call to boxcox with scalar arguments turns will not noticably reduce execution time. If that's correct, I would prefer to go with a simpler solution that I have already verified.

@xuefeng-xu
Copy link
Contributor Author

Thanks, I'll also do some checks. And I'm happy to switch to the simpler version if there's no noticable performance improvement.

@xuefeng-xu
Copy link
Contributor Author

Hi @mdhaber, I've simplified the code. The main difference from f393477 is to check the two conditions that may cause overflow.

sign_lmbm1 = np.sign(res - 1)
x_treme = np.max(x) if np.any(sign_lmbm1 > 0) else np.min(x)

# There are two conditions of overflow to check
# 1. x>1, lmb>1; 2. x<1, lmb<1
mask = False
if np.any((x_treme - 1) * sign_lmbm1 > 0):
    mask = special.boxcox(x_treme, res) * sign_lmbm1 > ymax

For the negtive overflow case, I didn't use the previous proprosed condition (x<1, lmd<0) because it will introduce code complexity.

@mdhaber
Copy link
Contributor

mdhaber commented Dec 18, 2023

I see, but again, if f393477 worked and has already been reviewed by me, you need to explain why / demonstrate that the modifications are beneficial. We seem to be having trouble communicating since we've gone back and forth a few times on this point. I'll try a few different ways of phrasing my request. Answers to any of these questions might get at what I'm looking for.

  • What motivated you to change f393477? Why did that not solve the problem completely?
  • You said: "I'm happy to switch to the simpler version if there's no noticable performance improvement." So is there a notable performance improvement? (If so, show me, please!)
  • In your opinion, is 4c0c0c8 simpler, faster, more readable, or (insert good quality here) than f393477? How do these changes improve what was there?

Thanks!

@xuefeng-xu
Copy link
Contributor Author

xuefeng-xu commented Dec 18, 2023

What motivated you to change f393477?

One advantage I think is that the overflow analysis is base on theory, i.e. the choice of x_treme and lmb is based on the lemma of Box-Cox function. First, check if lmb is large than 1 instead of 0 to decide positive or negative overflow (see #19631 (comment) lemma 2). Second, the if statement (x_treme - 1) * sign_lmbm1 > 0 didn't bring much complexity in the code.

Moreover, it can also applied in YJ with only little modifications. Since the YJ transformation function is more complicate. If we can leverage some nice properties, the analysis would be much eaiser.

if np.any(x_treme * sign_lmbm1 > 0): # 1. x>0, lmb>1; 2. x<0, lmb<1
    mask = _yeojohnson_transform(x_treme, res) * sign_lmbm1 > ymax

So is there a notable performance improvement?

I didn't find a notable speed improvement.

@mdhaber
Copy link
Contributor

mdhaber commented Dec 18, 2023

the overflow analysis is base on theory

Both are based on theory. I think that a consequence of lemma 5 is that the maximum value of $y$ with negative $\lambda$ occurs when $x$ is as large as possible and $\lambda \approx 0$. Likewise, the minimum value of $y$ with positive $\lambda$ occurs when $x$ is as small as possible and $\lambda \approx 0$.

import special
special.boxcox(1e300, -1e-300) # ~690
special.boxcox(1e-300, 1e-300) # ~-690

Neither of these are close to overflowing. Consequently, there cannot be positive overflow when $\lambda &lt; 0$, and there cannot be negative overflow when $\lambda &gt; 0$. This is why f393477 used the sign of lmbda.

Moreover, it can also applied in YJ with only little modifications.

We're trying backport a patch into 1.12. Since we're short on time, I want to do the simplest thing possible that is correct rather than exploring lots of options. Does that make sense?

In the interest of quickly backporting this fix before release of 1.12, please use f393477 if it is correct, making only necessary changes (like adjusting the warning). If it's correct, it seems only natural to do so, since it existed first, the diff is minimal, and i have already reviewed it. We can adapt to the case of user-specified ymax and harmonize with YJ later.

@xuefeng-xu
Copy link
Contributor Author

Ok, let's get this move forward. I have reverted to f393477 and only updated the warning messge.

@mdhaber
Copy link
Contributor

mdhaber commented Dec 19, 2023

Thanks @xuefeng-xu for the improvements! If there is a better way to perform the checks, you can change it (with objective justification) in gh-19631, but it's good to keep this simple for the backport. I'll just run the Cirrus checks and squash merge when they pass.

@mdhaber mdhaber added the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label Dec 19, 2023
@mdhaber mdhaber added this to the 1.12.0 milestone Dec 19, 2023
@mdhaber mdhaber merged commit a725b5a into scipy:main Dec 19, 2023
29 checks passed
tylerjereddy pushed a commit to tylerjereddy/scipy that referenced this pull request Dec 19, 2023
* MAINT: stats.boxcox_normmax: avoid negative overflow

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
@tylerjereddy tylerjereddy removed the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Items related to regular maintenance tasks scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants