Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: stats.tukeylambda: Bad behavior of the cdf() and sf() methods in the tails. #21370

Open
WarrenWeckesser opened this issue Aug 12, 2024 · 6 comments
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.special scipy.stats

Comments

@WarrenWeckesser
Copy link
Member

WarrenWeckesser commented Aug 12, 2024

Describe your issue.

When lam < 0, as x gets larger, tukeylambda.cdf(x, lam) saturates at 0.9999999999999929 instead of 1 and tukeylambda.sf(x, lam) levels off at 7.105427357601002e-15 instead of decreasing to 0. tukeylambda.pdf(x, lam) levels off at 5.9894274089194295e-22 in the positive and negative tails.

The source of the problem is in the bisection method that is used in special.tklmbda, which can be found in https://github.com/scipy/scipy/blob/main/scipy/special/xsf/cephes/tukey.h. That code needs to be refined.

Reproducing Code Example

In [6]: tukeylambda.cdf([1e4, 1e6, 1e8, 1e10, 1e12, 1e24], -0.5).tolist()
Out[6]: 
[0.999999960015991,
 0.9999999999959996,
 0.9999999999999929,
 0.9999999999999929,
 0.9999999999999929,
 0.9999999999999929]

In [7]: tukeylambda.sf([1e4, 1e6, 1e8, 1e10, 1e12, 1e24], -0.5).tolist()
Out[7]: 
[3.998400899263288e-08,
 4.000355602329364e-12,
 7.105427357601002e-15,
 7.105427357601002e-15,
 7.105427357601002e-15,
 7.105427357601002e-15]

In [8]: tukeylambda.pdf([1e4, 1e6, 1e8, 1e10, 1e12, 1e24], -0.5).tolist()
Out[8]: 
[7.995203177218486e-12,
 8.001066830697682e-18,
 5.9894274089194295e-22,
 5.9894274089194295e-22,
 5.9894274089194295e-22,
 5.9894274089194295e-22]
SciPy/NumPy/Python version and system information
1.15.0.dev0+1377.235e778 2.0.0 sys.version_info(major=3, minor=12, micro=4, releaselevel='final', serial=0)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /usr/include/x86_64-linux-gnu/openblas-pthread/
    lib directory: /usr/lib/x86_64-linux-gnu/openblas-pthread/
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 NO_CBLAS=
      NO_LAPACK= NO_LAPACKE=1 NO_AFFINITY=1 USE_OPENMP=0 generic MAX_THREADS=64
    pc file directory: /usr/lib/x86_64-linux-gnu/pkgconfig
    version: 0.3.20
  lapack:
    detection method: pkgconfig
    found: true
    include directory: /usr/include/x86_64-linux-gnu/openblas-pthread/
    lib directory: /usr/lib/x86_64-linux-gnu/openblas-pthread/
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 NO_CBLAS=
      NO_LAPACK= NO_LAPACKE=1 NO_AFFINITY=1 USE_OPENMP=0 generic MAX_THREADS=64
    pc file directory: /usr/lib/x86_64-linux-gnu/pkgconfig
    version: 0.3.20
  pybind11:
    detection method: config-tool
    include directory: unknown
    name: pybind11
    version: 2.13.1
Compilers:
  c:
    commands: cc
    linker: ld.bfd
    name: gcc
    version: 11.4.0
  c++:
    commands: c++
    linker: ld.bfd
    name: gcc
    version: 11.4.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.10
  fortran:
    commands: gfortran
    linker: ld.bfd
    name: gcc
    version: 11.4.0
  pythran:
    include directory: ../../../../../py3.12.4/lib/python3.12/site-packages/pythran
    version: 0.16.1
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
  cross-compiled: false
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
Python Information:
  path: /home/warren/py3.12.4/bin/python3
  version: '3.12'
@WarrenWeckesser WarrenWeckesser added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats labels Aug 12, 2024
@fancidev
Copy link
Contributor

fancidev commented Aug 14, 2024

Interesting catch!

It seems Newton’s method may work for this problem and may consume fewer iterations, but a closer look is probably necessary to make sure it is robust for different inputs.

@WarrenWeckesser
Copy link
Member Author

I have a fix in progress. Currently I'm using the toms748_solve function from Boost/math, and so far it is working well. I still have some details to work out, so a pull request is not imminent.

@fancidev
Copy link
Contributor

fancidev commented Aug 15, 2024

Btw, when $p$ is close to $0.5$, the numerical accuracy of tukeylambda.ppf has room for improvement. For a contrived example, the relative error of tukeylambda(-0.2).ppf(np.nextafter(0.5,0)) is 56%. For a less contrived example, the relative error of tukeylambda(0.2).ppf(0.5001) is 3e-13.

@WarrenWeckesser
Copy link
Member Author

@fancidev I noticed that too. In fact, it was while investigating that behavior that I ran into the loss of precision of special.logit, resulting in #21388.

I'm experimenting with a technique that gives good results over a wide range of λ, but I think it will be slow, even when implemented in C++. It will be part of the changes that I am working on for tukeylambda. If you have any ideas for improving the numerical behavior of tukeylambda.ppf, let us know!

@fancidev
Copy link
Contributor

fancidev commented Aug 15, 2024

If you have any ideas for improving the numerical behavior of tukeylambda.ppf, let us know!

For $p$ close to 0.5 and $\lambda$ not very large (in absolute value), Taylor expansion around 0.5 could help to improve the accuracy of tukeylambda.ppf,

$$ Q(p) = \frac{1}{\lambda}\left[p^\lambda-(1-p)^\lambda\right] $$

For $0 &lt; p &lt; 1$, let $u := p-\frac{1}{2}$, so $-\frac{1}{2} &lt; u &lt; \frac{1}{2}$. Using the Taylor series $(1+x)^a = \sum_{k=0}^\infty {a \choose k} x^k$, which converges for $|x| &lt; 1$, the quantile function can be written as

$$ \begin{align*} Q (u) &= \frac{1}{\lambda}\left[ \left(\frac12+u\right)^\lambda - \left(\frac12-u\right)^\lambda\right] \\ &= \frac{1}{\lambda \cdot 2^\lambda}\left[ \left(1+2u\right)^\lambda - \left(1-2u\right)^\lambda\right] \\ &= \frac{1}{\lambda \cdot 2^\lambda}\left[ \sum_{k=0}^\infty {\lambda \choose k } (2u)^k - \sum_{k=0}^\infty {\lambda \choose k } (-2u)^k\right] \\ &= \frac{1}{2^{\lambda-1}}\sum_{k=0}^\infty \frac{1}{\lambda} {\lambda \choose 2k+1 } (2u)^{2k+1} \\ &= \frac{1}{2^{\lambda-1}} \left[ (2u) + \frac{(\lambda-1)(\lambda-2)}{6} (2u)^3 + \frac{(\lambda-1)(\lambda-2)(\lambda-3)(\lambda-4)}{120}(2u)^5 + \cdots \right] \end{align*} $$

It remains to determine the domain of $\lambda$ and $p$ for which the Taylor expansion is applicable, as well as the number of terms needed to achieve desired accuracy. Hopefully outside this domain the direct formula works well enough.


Wolfram Alpha suggests that an alternative first-order approximation for $p \approx 0.5$ is $Q(p) \approx 2^{-\lambda} \mathrm{logit}(p)$.

@fancidev
Copy link
Contributor

fancidev commented Aug 16, 2024

For $\lambda &lt; 0$ and $p \le 0.5$, an accurate formula for tukeylambda.ppf(p, lambda) appears to be

$$ \begin{align*} Q(p) &= \frac{1}{\lambda}\left[p^\lambda-(1-p)^\lambda\right]\\ &= \frac{1}{\lambda} p^\lambda\left[1-\left(\frac{1-p}{p}\right)^\lambda\right]\\ &=\frac{1}{\lambda} p^\lambda\left[1-e^{\lambda \ln\frac{1-p}{p}} \right] \\ &= \frac{1}{\lambda} p^\lambda\left[1-e^{-\lambda t } \right] \end{align*} $$

where $t \equiv \ln [p/(1-p)]$. The last term is evaluated using expm1. t is evaluated using logit (following gh-21388). Numerical experiments show that this expression achieves relative error of a few machine epsilon, provided no premature overflow occurs. Premature overflow occurs if $p^\lambda$ overflows but $Q(p)$ doesn’t.

For $\lambda &gt; 0$ and $p \le 0.5$, a similar derivation yields the formula

$$ \begin{align*} Q(p) &= \frac{1}{\lambda}\left[p^\lambda-(1-p)^\lambda\right]\\ &= \frac{1}{\lambda} (1-p)^\lambda\left[\left(\frac{p}{1-p}\right)^\lambda-1\right]\\ &=\frac{1}{\lambda} (1-p)^\lambda\left[e^{\lambda \ln\frac{p}{1-p}} -1\right] \\ &= \frac{1}{\lambda} (1-p)^\lambda\left[e^{\lambda t } -1\right] \end{align*} $$

The last term is evaluated using expm1. The term $(1-p)^\lambda$ would be evaluated using the (non-existent) pow1p function. If such a function exists, then numerical experiments suggest that the expression achieves relative error of a few machine epsilon if the result does not underflow. (An example that underflows is $Q(10/74; \lambda=5000) \approx -1.1\times 10^{-319}$.)

If $p &gt; 0.5$, it is accurate to evaluate $Q(p)$ by $Q(p)=-Q(1-p)$.


pow1p(x,a) is defined as $(1+x)^a$ for $x \ge -1$ and $a &gt; 0$. Write $1+x=s+t$ where $|t|\ll |s|$; this may be done using the Fast2Sum algorithm. Then

$$ (1+x)^a = (s+t)^a=s^a \left(1+\frac{t}{s}\right)^a=s^a e^{a\ln\left(1+\frac{t}{s}\right)} $$

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.special scipy.stats
Projects
None yet
Development

No branches or pull requests

3 participants