
MAINT: don't override sf for loguniform/reciprocal distribution #18614

Merged
5 commits merged into scipy:main from the loguniform_sf branch on Jun 5, 2023

Conversation

OmarManzoor (Contributor)

Reference issue

Towards: gh-17832

What does this implement/fix?

  • Adds an override of the survival function (_sf) for the loguniform/reciprocal distribution
  • Adds a reference distribution
  • Adds tests for the survival function

Additional information

cc: @mdhaber

@OmarManzoor (Contributor Author)

@mdhaber I think the _sf method is not being picked up when we call loguniform.sf. That's why the tests are failing.
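
For context, a minimal sketch (not the scipy source) of how an _sf override is normally picked up: the public sf method on rv_continuous validates its arguments and then dispatches to the instance's _sf. The demo_gen class below is a made-up toy distribution, not part of scipy.

from scipy import stats

class demo_gen(stats.rv_continuous):
    # Toy distribution: uniform on [0, 1], with an explicit survival function.
    def _cdf(self, x):
        return x

    def _sf(self, x):
        return 1.0 - x

demo = demo_gen(a=0.0, b=1.0, name="demo")
print(demo.sf(0.25))  # 0.75, computed through the _sf override rather than 1 - cdf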

@OmarManzoor (Contributor Author)

[Plot: stats.loguniform.sf on main vs. the PR formula vs. the mpmath reference for a=1, b=5, on log-log axes]

Code

from scipy import stats
import numpy as np
from mpmath import mp
import matplotlib.pyplot as plt
# ReferenceDistribution lives in the scipy test suite; import path as used
# later in this thread.
from scipy.stats.tests.test_generation.reference_distributions import ReferenceDistribution

mp.dps = 50

class Reciprocal(ReferenceDistribution):
    """
    Also the LogUniform distribution
    """
    def __init__(self, *, a, b):
        super().__init__(a=a, b=b)

    def _pdf(self, x, a, b):
        return mp.one / (x * (mp.log(b) - mp.log(a)))
    
    def _cdf(self, x, a, b):
        return (mp.log(x) - mp.log(a)) / (mp.log(b) - mp.log(a))

def loguniform_sf(x, a, b):
    return (np.log(b) - np.log(x)) / (np.log(b) - np.log(a))
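
# Derivation of the closed form above (for a <= x <= b):
#   sf(x) = 1 - cdf(x) = 1 - (log(x) - log(a)) / (log(b) - log(a))
#         = (log(b) - log(x)) / (log(b) - log(a))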

a, b = 1.0, 5.0
x = np.logspace(-3, 2)
log_uniform = Reciprocal(a=a, b=b)
mpmath_values = np.array([log_uniform.sf(_x) for _x in x], np.float64)
plt.loglog(x, stats.loguniform.sf(x, a, b), label="loguniform main", ls="dashed")
plt.loglog(x, loguniform_sf(x, a, b), label="loguniform pr", ls="dotted")
plt.loglog(x, mpmath_values, label="loguniform mpmath", ls="dashdot")
plt.legend()
plt.show()

@mdhaber (Contributor) commented Jun 2, 2023

I think the plot shows that the survival function in main has a maximum of 1, right? That is the maximum of any survival function. If the other formulations exceed that, it is not an advantage or a disadvantage; I assume the formulation just doesn't inherently return 0 outside the support of the distribution.

Please look for improvement near the right end of the support. There is unlikely to be a problem with 1-cdf at the left end.

@OmarManzoor (Contributor Author) commented Jun 2, 2023

> I think the plot shows that the survival function in main has a maximum of 1, right? That is the maximum of any survival function. If the other formulations exceed that, it is not an advantage or a disadvantage; I assume the formulation just doesn't inherently return 0 outside the support of the distribution.
>
> Please look for improvement near the right end of the support. There is unlikely to be a problem with 1-cdf at the left end.

I see. That means the difference between mpmath/PR and main towards the left end is not significant. Should we close this PR then?

@mdhaber (Contributor) commented Jun 2, 2023

Not necessarily. Look near the right side of the support. All distributions without custom _sf suffer there because that is where the CDF is close to one, and the default implementation is to subtract the CDF from 1. When you subtract a number close to 1 from 1, you lose precision due to catastrophic cancellation. If your custom _sf avoids this, then it will provide a substantial improvement.
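
As a quick illustration of the cancellation (this example uses the standard normal, which does ship a dedicated _sf; it is not taken from this thread):

from scipy import stats

# Far in the right tail the true survival probability is tiny.
print(stats.norm.sf(10))        # ~7.6e-24, via the dedicated _sf
print(1 - stats.norm.cdf(10))   # 0.0 -- subtracting a cdf this close to 1 loses everything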

@mdhaber added the "enhancement" (A new feature or improvement) label on Jun 2, 2023
@OmarManzoor (Contributor Author) commented Jun 2, 2023

> Not necessarily. Look near the right side of the support. All distributions without custom _sf suffer there because that is where the CDF is close to one, and the default implementation is to subtract the CDF from 1. When you subtract a number close to 1 from 1, you lose precision due to catastrophic cancellation. If your custom _sf avoids this, then it will provide a substantial improvement.

Okay thank you, I will try that.

@OmarManzoor (Contributor Author)

> Not necessarily. Look near the right side of the support. All distributions without custom _sf suffer there because that is where the CDF is close to one, and the default implementation is to subtract the CDF from 1. When you subtract a number close to 1 from 1, you lose precision due to catastrophic cancellation. If your custom _sf avoids this, then it will provide a substantial improvement.

I tried various values of a and b over different ranges of x, but the right side of the curve aligns very well between main and mpmath/PR. Do you have any suggestions for values I could try that might show an improvement?

@mdhaber (Contributor) commented Jun 3, 2023

> Do you have any suggestions for values I could try which might show an improvement?

It turns out that I don't. The PR doesn't really avoid catastrophic cancellation after all. You need a more accurate way to calculate log(b) - log(x) when x is close to b. This is possible (e.g. with asymptotic expansion) but it's not really worth it IMO because there is limited precision for user input when x is close to b, b != 0 (b > a > 0 is a requirement of the distribution). If the user really wants to study the SF really close to the right endpoint, they would need to be able to specify b - x (small relative to b) rather than x and b separately, and that is not possible with the current infrastructure.

import numpy as np
from scipy import stats, special
import matplotlib.pyplot as plt
from mpmath import mp

from scipy.stats.tests.test_generation.reference_distributions import ReferenceDistribution
mp.dps = 100

class Reciprocal(ReferenceDistribution):
    def __init__(self, *, a, b):
        super().__init__(a=a, b=b)

    def _pdf(self, x, a, b):
        return mp.one / (x * (mp.log(b) - mp.log(a)))

    def _sf(self, x, a, b):
        res = (mp.log(b) - mp.log(x)) / (mp.log(b) - mp.log(a))
        return res

def loguniform_sf(x, a, b):  # PR
    return (np.log(b) - np.log(x)) / (np.log(b) - np.log(a))

def loguniform_sf2(x, a, b):  # "PR2"
    def log_delta(a, b):
        # Wolfram Alpha "asymptotic expansion log(b)-log(b-x) as x approaches 0"
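        # i.e. with d = b - a:  log(b) - log(b - d) = -log(1 - d/b)
        #                                           = sum_{i>=1} (d/b)**i / i
        # (series truncated below after 19 terms; valid for 0 <= d < b)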
        x = b - a
        i = np.arange(1, 20)[:, np.newaxis]
        return np.sum(x ** i / (i * b ** i), axis=0)

    @np.vectorize
    def log_delta2(a, b):
        # log_delta with logsumexp trick to prevent overflow for large x
        x = b - a
        i = np.arange(1, 100)
        return np.exp(
            special.logsumexp(i * np.log(x) - np.log(i) - i * np.log(b)))

    return log_delta2(x, b)/(np.log(b) - np.log(a))

a, b = 1, 5
x = np.logspace(-13, -1)

# a, b = 1e-21, 1e-10
# x = np.logspace(-20, -11)
log_uniform = Reciprocal(a=a, b=b)

mpmath = log_uniform.sf(b-x, dtype=object)
main = stats.loguniform.sf(b-x, a, b)
pr = loguniform_sf(b-x, a, b)
pr2 = loguniform_sf2(b-x, a, b)

err_main = np.abs((main - mpmath) / mpmath).astype(np.float64)
err_pr = np.abs((pr - mpmath) / mpmath).astype(np.float64)
err_pr2 = np.abs((pr2 - mpmath) / mpmath).astype(np.float64)

plt.loglog(x, err_main, label="main", ls="dashdot")
plt.loglog(x, err_pr, label="PR", ls="dotted")
plt.loglog(x, err_pr2, label="PR2")

plt.legend()
plt.show()

[Plot: relative error of main, PR, and PR2 against the mpmath reference as x approaches b, on log-log axes]

I'd suggest changing this PR to replace the custom _sf with a comment in the code pointing the reader to this PR. It would also be helpful if you could figure out a way to make the code in gh-17832 indicate that an override has been considered but rejected (e.g. with a different mark in the table).

If you're interested in doing more of these, I'd suggest going systematically through the distributions to prioritize those that really need improvement. Also, before implementing the override, please write tests (that don't pass in main) so that the goal of the override is explicit. This will give us some assurance that the improvement will justify the effort that goes into it. Not everything you study will lead to a PR, but the work is still useful if you keep track of it in gh-17832 (or a document linked from there).
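
For readers unfamiliar with the workflow being suggested, here is a rough sketch of what such a goal-setting accuracy test could look like; the reference is computed directly with mpmath, and the tolerance and evaluation points are placeholders rather than values known to fail on main.

import numpy as np
from mpmath import mp
from numpy.testing import assert_allclose
from scipy import stats

mp.dps = 50

def _sf_reference(x, a, b):
    # Exact closed form sf = (log b - log x)/(log b - log a), in high precision.
    return float((mp.log(b) - mp.log(x)) / (mp.log(b) - mp.log(a)))

def test_loguniform_sf_right_tail():
    a, b = 1.0, 5.0
    x = b - np.logspace(-13, -1)  # points just inside the right endpoint
    expected = [_sf_reference(xi, a, b) for xi in x]
    # Placeholder tolerance; a real test would pick one that the proposed
    # override meets and the 1 - cdf fallback in main does not.
    assert_allclose(stats.loguniform.sf(x, a, b), expected, rtol=1e-13)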

@OmarManzoor (Contributor Author)

> Do you have any suggestions for values I could try which might show an improvement?
>
> It turns out that I don't. The PR doesn't really avoid catastrophic cancellation after all. You need a more accurate way to calculate log(b) - log(x) when x is close to b. This is possible (e.g. with asymptotic expansion) but it's not really worth it IMO because there is limited precision for user input when x is close to b, b != 0 (b > a > 0 is a requirement of the distribution). If the user really wants to study the SF really close to the right endpoint, they would need to be able to specify b - x (small relative to b) rather than x and b separately, and that is not possible with the current infrastructure.
>
> I'd suggest changing this PR to replace the custom _sf with a comment in the code pointing the reader to this PR. It would also be helpful if you could figure out a way to make the code in gh-17832 indicate that an override has been considered but rejected (e.g. with a different mark in the table).
>
> If you're interested in doing more of these, I'd suggest going systematically through the distributions to prioritize those that really need improvement. Also, before implementing the override, please write tests (that don't pass in main) so that the goal of the override is explicit. This will give us some assurance that the improvement will justify the effort that goes into it. Not everything you study will lead to a PR, but the work is still useful if you keep track of it in gh-17832 (or a document linked from there).

Thank you for the detailed explanation. I understand that we can leave out the addition of sf for this distribution.

Comment on lines 8406 to 8408
Details related to the decision of not defining
the survival function for this distribution can be
found in the PR: https://github.com/scipy/scipy/pull/18614

Please make this a (code) comment to developers reviewing the code, not users reviewing the documentation. Users would likely take this to mean that the survival function of this distribution is not implemented, which is not what we mean to convey. They don't need to know whether _sf is overridden; only developers do.

@mdhaber merged commit 5a1ad4c into scipy:main on Jun 5, 2023
24 checks passed
@OmarManzoor deleted the loguniform_sf branch on June 5, 2023, 05:04
@mdhaber added this to the 1.12.0 milestone on Sep 11, 2023
@tylerjereddy added the "maintenance" (Items related to regular maintenance tasks) label and removed the "enhancement" (A new feature or improvement) label on Dec 8, 2023
@tylerjereddy changed the title from "ENH: override sf for loguniform/reciprocal distribution" to "MAINT: don't override sf for loguniform/reciprocal distribution" on Dec 8, 2023
@tylerjereddy (Contributor)

I adjusted the label to maintenance given that we only add a comment here (noticed while going through PRs of interest for the release notes).

Labels: maintenance (Items related to regular maintenance tasks), scipy.stats