
MAINT: don't override sf for loguniform/reciprocal distribution #18614

Merged
5 commits merged into scipy:main from the loguniform_sf branch on Jun 5, 2023

Conversation

OmarManzoor (Contributor)

Reference issue

Towards: gh-17832

What does this implement/fix?

  • Adds an override of the survival function (_sf) for the loguniform/reciprocal distribution
  • Adds a reference distribution
  • Adds tests for the survival function

Additional information

cc: @mdhaber

@OmarManzoor (Contributor Author)

@mdhaber I think the _sf method is not being picked up when we call loguniform.sf. That's why the tests are failing.
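
For context, a minimal sketch (not the scipy source) of how an _sf override is normally picked up: the public sf method on rv_continuous validates its arguments and then dispatches to the instance's _sf. The demo_gen class below is a made-up toy distribution, not part of scipy.

from scipy import stats

class demo_gen(stats.rv_continuous):
    # Toy distribution: uniform on [0, 1], with an explicit survival function.
    def _cdf(self, x):
        return x

    def _sf(self, x):
        return 1.0 - x

demo = demo_gen(a=0.0, b=1.0, name="demo")
print(demo.sf(0.25))  # 0.75, computed through the _sf override rather than 1 - cdf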

@OmarManzoor (Contributor Author)

[Plot: stats.loguniform.sf on main vs. the PR formula vs. the mpmath reference for a=1, b=5, on log-log axes]

Code

from scipy import stats
import numpy as np
from mpmath import mp
import matplotlib.pyplot as plt
# ReferenceDistribution lives in the scipy test suite; import path as used
# later in this thread.
from scipy.stats.tests.test_generation.reference_distributions import ReferenceDistribution

mp.dps = 50

class Reciprocal(ReferenceDistribution):
    """
    Also the LogUniform distribution
    """
    def __init__(self, *, a, b):
        super().__init__(a=a, b=b)

    def _pdf(self, x, a, b):
        return mp.one / (x * (mp.log(b) - mp.log(a)))
    
    def _cdf(self, x, a, b):
        return (mp.log(x) - mp.log(a)) / (mp.log(b) - mp.log(a))

def loguniform_sf(x, a, b):
    return (np.log(b) - np.log(x)) / (np.log(b) - np.log(a))
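
# Derivation of the closed form above (for a <= x <= b):
#   sf(x) = 1 - cdf(x) = 1 - (log(x) - log(a)) / (log(b) - log(a))
#         = (log(b) - log(x)) / (log(b) - log(a))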

a, b = 1.0, 5.0
x = np.logspace(-3, 2)
log_uniform = Reciprocal(a=a, b=b)
mpmath_values = np.array([log_uniform.sf(_x) for _x in x], np.float64)
plt.loglog(x, stats.loguniform.sf(x, a, b), label="loguniform main", ls="dashed")
plt.loglog(x, loguniform_sf(x, a, b), label="loguniform pr", ls="dotted")
plt.loglog(x, mpmath_values, label="loguniform mpmath", ls="dashdot")
plt.legend()
plt.show()

@mdhaber (Contributor) commented Jun 2, 2023

I think the plot shows that the survival function in main has a maximum of 1, right? That is the maximum of any survival function. If the other formulations exceed that, it is not an advantage or a disadvantage; I assume the formulation just doesn't inherently return 0 outside the support of the distribution.

Please look for improvement near the right end of the support. There is unlikely to be a problem with 1-cdf at the left end.

@OmarManzoor (Contributor Author) commented Jun 2, 2023

> I think the plot shows that the survival function in main has a maximum of 1, right? That is the maximum of any survival function. If the other formulations exceed that, it is not an advantage or a disadvantage; I assume the formulation just doesn't inherently return 0 outside the support of the distribution.
>
> Please look for improvement near the right end of the support. There is unlikely to be a problem with 1-cdf at the left end.

I see. That means the difference between mpmath/PR and main towards the left end is not significant. Should we close this PR then?

@mdhaber (Contributor) commented Jun 2, 2023

Not necessarily. Look near the right side of the support. All distributions without custom _sf suffer there because that is where the CDF is close to one, and the default implementation is to subtract the CDF from 1. When you subtract a number close to 1 from 1, you lose precision due to catastrophic cancellation. If your custom _sf avoids this, then it will provide a substantial improvement.
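
As a quick illustration of the cancellation (this example uses the standard normal, which does ship a dedicated _sf; it is not taken from this thread):

from scipy import stats

# Far in the right tail the true survival probability is tiny.
print(stats.norm.sf(10))        # ~7.6e-24, via the dedicated _sf
print(1 - stats.norm.cdf(10))   # 0.0 -- subtracting a cdf this close to 1 loses everything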

@mdhaber added the "enhancement" (A new feature or improvement) label on Jun 2, 2023
@OmarManzoor (Contributor Author) commented Jun 2, 2023

> Not necessarily. Look near the right side of the support. All distributions without custom _sf suffer there because that is where the CDF is close to one, and the default implementation is to subtract the CDF from 1. When you subtract a number close to 1 from 1, you lose precision due to catastrophic cancellation. If your custom _sf avoids this, then it will provide a substantial improvement.

Okay thank you, I will try that.

@OmarManzoor (Contributor Author)

> Not necessarily. Look near the right side of the support. All distributions without custom _sf suffer there because that is where the CDF is close to one, and the default implementation is to subtract the CDF from 1. When you subtract a number close to 1 from 1, you lose precision due to catastrophic cancellation. If your custom _sf avoids this, then it will provide a substantial improvement.

I tried various values of a and b over different ranges of x, but the right side of the curve aligns very well between main and mpmath/PR. Do you have any suggestions for values I could try that might show an improvement?

@mdhaber (Contributor) commented Jun 3, 2023

> Do you have any suggestions for values I could try which might show an improvement?

It turns out that I don't. The PR doesn't really avoid catastrophic cancellation after all. You need a more accurate way to calculate log(b) - log(x) when x is close to b. This is possible (e.g. with asymptotic expansion) but it's not really worth it IMO because there is limited precision for user input when x is close to b, b != 0 (b > a > 0 is a requirement of the distribution). If the user really wants to study the SF really close to the right endpoint, they would need to be able to specify b - x (small relative to b) rather than x and b separately, and that is not possible with the current infrastructure.

import numpy as np
from scipy import stats, special
import matplotlib.pyplot as plt
from mpmath import mp

from scipy.stats.tests.test_generation.reference_distributions import ReferenceDistribution
mp.dps = 100

class Reciprocal(ReferenceDistribution):
    def __init__(self, *, a, b):
        super().__init__(a=a, b=b)

    def _pdf(self, x, a, b):
        return mp.one / (x * (mp.log(b) - mp.log(a)))

    def _sf(self, x, a, b):
        res = (mp.log(b) - mp.log(x)) / (mp.log(b) - mp.log(a))
        return res

def loguniform_sf(x, a, b):  # PR
    return (np.log(b) - np.log(x)) / (np.log(b) - np.log(a))

def loguniform_sf2(x, a, b):  # "PR2"
    def log_delta(a, b):
        # Wolfram Alpha "asymptotic expansion log(b)-log(b-x) as x approaches 0"
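        # i.e. with d = b - a:  log(b) - log(b - d) = -log(1 - d/b)
        #                                           = sum_{i>=1} (d/b)**i / i
        # (series truncated below after 19 terms; valid for 0 <= d < b)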
        x = b - a
        i = np.arange(1, 20)[:, np.newaxis]
        return np.sum(x ** i / (i * b ** i), axis=0)

    @np.vectorize
    def log_delta2(a, b):
        # log_delta with logsumexp trick to prevent overflow for large x
        x = b - a
        i = np.arange(1, 100)
        return np.exp(
            special.logsumexp(i * np.log(x) - np.log(i) - i * np.log(b)))

    return log_delta2(x, b)/(np.log(b) - np.log(a))

a, b = 1, 5
x = np.logspace(-13, -1)

# a, b = 1e-21, 1e-10
# x = np.logspace(-20, -11)
log_uniform = Reciprocal(a=a, b=b)

mpmath = log_uniform.sf(b-x, dtype=object)
main = stats.loguniform.sf(b-x, a, b)
pr = loguniform_sf(b-x, a, b)
pr2 = loguniform_sf2(b-x, a, b)

err_main = np.abs((main - mpmath) / mpmath).astype(np.float64)
err_pr = np.abs((pr - mpmath) / mpmath).astype(np.float64)
err_pr2 = np.abs((pr2 - mpmath) / mpmath).astype(np.float64)

plt.loglog(x, err_main, label="main", ls="dashdot")
plt.loglog(x, err_pr, label="PR", ls="dotted")
plt.loglog(x, err_pr2, label="PR2")

plt.legend()
plt.show()

[Plot: relative error of main, PR, and PR2 against the mpmath reference as x approaches b, on log-log axes]

I'd suggest changing this PR to replace the custom _sf with a comment in the code pointing the reader to this PR. It would also be helpful if you could figure out a way to make the code in gh-17832 indicate that an override has been considered but rejected (e.g. with a different mark in the table).

If you're interested in doing more of these, I'd suggest going systematically through the distributions to prioritize those that really need improvement. Also, before implementing the override, please write tests (that don't pass in main) so that the goal of the override is explicit. This will give us some assurance that the improvement will justify the effort that goes into it. Not everything you study will lead to a PR, but the work is still useful if you keep track of it in gh-17832 (or a document linked from there).
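
For readers unfamiliar with the workflow being suggested, here is a rough sketch of what such a goal-setting accuracy test could look like; the reference is computed directly with mpmath, and the tolerance and evaluation points are placeholders rather than values known to fail on main.

import numpy as np
from mpmath import mp
from numpy.testing import assert_allclose
from scipy import stats

mp.dps = 50

def _sf_reference(x, a, b):
    # Exact closed form sf = (log b - log x)/(log b - log a), in high precision.
    return float((mp.log(b) - mp.log(x)) / (mp.log(b) - mp.log(a)))

def test_loguniform_sf_right_tail():
    a, b = 1.0, 5.0
    x = b - np.logspace(-13, -1)  # points just inside the right endpoint
    expected = [_sf_reference(xi, a, b) for xi in x]
    # Placeholder tolerance; a real test would pick one that the proposed
    # override meets and the 1 - cdf fallback in main does not.
    assert_allclose(stats.loguniform.sf(x, a, b), expected, rtol=1e-13)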

@OmarManzoor (Contributor Author)

> Do you have any suggestions for values I could try which might show an improvement?
>
> It turns out that I don't. The PR doesn't really avoid catastrophic cancellation after all. You need a more accurate way to calculate log(b) - log(x) when x is close to b. This is possible (e.g. with asymptotic expansion) but it's not really worth it IMO because there is limited precision for user input when x is close to b, b != 0 (b > a > 0 is a requirement of the distribution). If the user really wants to study the SF really close to the right endpoint, they would need to be able to specify b - x (small relative to b) rather than x and b separately, and that is not possible with the current infrastructure.
>
> I'd suggest changing this PR to replace the custom _sf with a comment in the code pointing the reader to this PR. It would also be helpful if you could figure out a way to make the code in gh-17832 indicate that an override has been considered but rejected (e.g. with a different mark in the table).
>
> If you're interested in doing more of these, I'd suggest going systematically through the distributions to prioritize those that really need improvement. Also, before implementing the override, please write tests (that don't pass in main) so that the goal of the override is explicit. This will give us some assurance that the improvement will justify the effort that goes into it. Not everything you study will lead to a PR, but the work is still useful if you keep track of it in gh-17832 (or a document linked from there).

Thank you for the detailed explanation. I understand that we can leave out the addition of sf for this distribution.

Comment on lines 8406 to 8408
Details related to the decision of not defining
the survival function for this distribution can be
found in the PR: https://github.com/scipy/scipy/pull/18614

Please make this a (code) comment to developers reviewing the code, not users reviewing the documentation. Users would likely take this to mean that the survival function of this distribution is not implemented, which is not what we mean to convey. They don't need to know whether _sf is overridden; only developers do.

@mdhaber merged commit 5a1ad4c into scipy:main on Jun 5, 2023
24 checks passed
@OmarManzoor deleted the loguniform_sf branch on June 5, 2023, 05:04
@mdhaber added this to the 1.12.0 milestone on Sep 11, 2023
@tylerjereddy added the "maintenance" (Items related to regular maintenance tasks) label and removed the "enhancement" (A new feature or improvement) label on Dec 8, 2023
@tylerjereddy changed the title from "ENH: override sf for loguniform/reciprocal distribution" to "MAINT: don't override sf for loguniform/reciprocal distribution" on Dec 8, 2023
@tylerjereddy (Contributor)

I adjusted the label to maintenance given that we only add a comment here (noticed while going through PRs of interest for the release notes).

Labels: maintenance (Items related to regular maintenance tasks), scipy.stats