
ENH: Implement ppf and isf for invgauss #13665

Closed (25 commits)
Conversation

zoj613
Contributor

@zoj613 zoj613 commented Mar 9, 2021

Reference issue

Follow-up to #13616; closes #13666.

What does this implement/fix?

This PR adds ppf and isf methods to the invgauss distribution.

Additional information

This PR implements ppf and isf for invgauss and also improves the pdf method so that it returns the correct value when the mean is infinite. The implementation is described in [1]. The tests are based on the examples presented in the article.
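For context on the mu = infinity case (a sketch, not code from this PR): as mu grows, the term ((x - mu)/mu)**2 in the pdf tends to 1, so the density converges to the standard Lévy density. A large finite mu already agrees closely:

```python
import numpy as np
from scipy import stats

x = np.array([0.5, 1.0, 2.0, 5.0])

# invgauss pdf: 1/sqrt(2*pi*x**3) * exp(-((x - mu)/mu)**2 / (2*x)).
# As mu -> inf the exponent tends to -1/(2*x), which gives the
# standard Levy pdf 1/sqrt(2*pi*x**3) * exp(-1/(2*x)).
near_limit = stats.invgauss.pdf(x, mu=1e9)
levy = stats.levy.pdf(x)

print(np.max(np.abs(near_limit / levy - 1)))  # relative difference on the order of 1/mu
```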

References

[1] Giner, Goknur and G. Smyth. “statmod: Probability Calculations for the Inverse Gaussian Distribution.” R J. 8 (2016): 339.
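For comparison with what the PR replaces: when a distribution defines no dedicated _ppf, SciPy falls back to numerically inverting the cdf. A simplified sketch of that idea using bracketing root-finding (this is illustrative only; the algorithm in [1] instead uses Newton iterations with carefully chosen starting values):

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def invgauss_ppf_numeric(p, mu):
    """Invert the invgauss cdf by bracketing root-finding (illustrative)."""
    def solve_one(pi):
        # Grow the bracket until cdf(hi) >= pi, then solve cdf(x) = pi.
        lo, hi = 1e-12, 1.0
        while stats.invgauss.cdf(hi, mu) < pi:
            hi *= 2.0
        return brentq(lambda x: stats.invgauss.cdf(x, mu) - pi, lo, hi)
    return np.array([solve_one(pi) for pi in np.atleast_1d(p)])

p = np.array([0.1, 0.5, 0.9])
print(invgauss_ppf_numeric(p, mu=1.0))
print(stats.invgauss.ppf(p, mu=1.0))  # should agree closely
```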

@zoj613 zoj613 marked this pull request as ready for review March 9, 2021 14:48
@zoj613
Contributor Author

zoj613 commented Mar 9, 2021

Base branch values are shown below; the values from this PR appear in the unit tests:

pdf

In [4]: invgauss.pdf([-1, 0, 1, 2, np.inf, np.nan], mu=np.inf)
/home/scipy/scipy/stats/_continuous_distns.py:3495: RuntimeWarning: invalid value encountered in true_divide
  return 1.0/np.sqrt(2*np.pi*x**3.0)*np.exp(-1.0/(2*x)*((x-mu)/mu)**2)
Out[4]: array([ 0.,  0., nan, nan,  0., nan])

ppf

In [12]:         p1 = invgauss.cdf(invgauss.ppf(p, mu=1), mu=1)
    ...:         abs_error = np.abs(p - p1)
    ...:         # five-number summary of the absolute error
    ...:         np.percentile(abs_error, [0, 25, 50, np.mean(abs_error), 75, 100])
Out[12]: 
array([0.00000000e+00, 0.00000000e+00, 2.16840434e-19, 0.00000000e+00,
       3.46944695e-18, 6.66133815e-16])

In [14]:         q = invgauss.ppf(p, mu=1)
    ...:         q1 = invgauss.ppf(invgauss.cdf(q, mu=1), mu=1)
    ...:         rel_error = np.abs(q1 - q) / q
    ...:         np.percentile(rel_error, [0, 25, 50, np.mean(rel_error), 75, 100])
Out[14]: 
array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       2.35639297e-16, 3.20712499e-14])

isf

In [18]: invgauss.isf(1e-16, mu=1.05) / 0.7
Out[18]: 99.96907854049056  # incorrect, it is supposed to be 98.47

In [19]: invgauss.isf(1e-17, mu=1.05) / 0.7
Out[19]: 142.85714285714286  # does not overflow but incorrect value. it should be 105.4208

speed

In [23]: %timeit invgauss.ppf(p, mu=1)
52.3 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: %timeit invgauss.isf(p, mu=1)
53.8 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@zoj613
Contributor Author

zoj613 commented Mar 9, 2021

Speed of this PR compared to base branch:

In [4]: %timeit invgauss.ppf(p, mu=1)
15.7 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit invgauss.isf(p, mu=1)
14.2 ms ± 99.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

about 3.3 times faster on my machine.

@zoj613
Contributor Author

zoj613 commented Mar 10, 2021

@mdhaber I think the latest commit should fix the build errors. Please take a look when you can. The issue was a RuntimeWarning raised because the test passes a NaN where ppf expects a number between 0 and 1. Since passing a NaN was intentional, I ignored the RuntimeWarning in the test case.

See

cond1 = (0 < q) & (q < 1)
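That mask is what routes out-of-domain arguments to NaN while keeping the boundary values exact. For illustration, the expected behavior of ppf across the (0, 1) boundary:

```python
import numpy as np
from scipy import stats

q = np.array([np.nan, -0.5, 0.0, 0.5, 1.0, 1.5])
out = stats.invgauss.ppf(q, mu=1)

# q outside [0, 1] (and NaN) -> NaN; q == 0 -> lower support bound (0);
# q == 1 -> upper support bound (inf); interior q -> a finite quantile.
print(out)
```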

@mdhaber
Contributor

mdhaber commented Mar 10, 2021

norm and cauchy don't seem to produce warnings with nan input, so it might be better to do the comparison with np.errstate(invalid='ignore') to avoid the warning being raised.
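Concretely, the suggestion is to wrap just the NaN-involving comparison in an errstate context, so any floating-point "invalid" warning is suppressed locally (a sketch, not the PR's code):

```python
import numpy as np

q = np.array([np.nan, 0.3, 0.7])

# Comparisons against NaN can emit "invalid value encountered" warnings
# in some environments; errstate silences them only inside this block.
with np.errstate(invalid='ignore'):
    cond1 = (0 < q) & (q < 1)

print(cond1)  # NaN compares False, so only the two valid entries pass
```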

@zoj613
Contributor Author

zoj613 commented Mar 10, 2021

norm and cauchy don't seem to produce warnings with nan input, so it might be better to do the comparison with np.errstate(invalid='ignore') to avoid the warning being raised.

Are you referring to them not raising the warning in the build environment? I did not see a test with NaN input in there for cauchy. Note that no RuntimeWarning is raised when calling the method in a Python session locally, only in the build environment.

@zoj613
Contributor Author

zoj613 commented Mar 14, 2021

The CircleCI failure is not related to this PR. It seems to be an issue with pythran not being installed in the build environment.

@mdhaber
Contributor

mdhaber commented Mar 14, 2021

Yes, I've gotten that, too.

@zoj613
Contributor Author

zoj613 commented Mar 14, 2021

Yes, I've gotten that, too.

Seems like all it needed was a rebase, but the rebase introduced other unrelated failures on Windows.

@mdhaber
Contributor

mdhaber commented Mar 16, 2021

That one is intermittent, so I re-ran it for you. Fingers crossed.

Update: tests passing. I think that will encourage another maintainer to take a look. I will pass on this one as I reviewed/merged gh-13616 independently, and it would be good for another stats maintainer to become familiar with these changes before the next release.

@rgommers rgommers added the enhancement (A new feature or improvement) and scipy.stats labels Mar 20, 2021
@zoj613
Contributor Author

zoj613 commented Apr 9, 2021

That one is intermittent, so I re-ran it for you. Fingers crossed.

Update: tests passing. I think that will encourage another maintainer to take a look. I will pass on this one as I reviewed/merged gh-13616 independently, and it would be good for another stats maintainer to become familiar with these changes before the next release.

Thanks for the update. In that case I might as well ping @andyfaff and @ev-br for a possible review.

Member

@ilayn ilayn left a comment


Some non-stats related code comments.

scipy/stats/_continuous_distns.py: 8 review comment threads (outdated, resolved)
@zoj613
Contributor Author

zoj613 commented Apr 10, 2021

@ilayn I addressed all the comments; please take another look when you can.

@zoj613
Contributor Author

zoj613 commented Apr 10, 2021

Also, I can't build locally due to this error:

---> 16 from scipy.stats._qmc_cy import (
     17     _cy_wrapper_centered_discrepancy,
     18     _cy_wrapper_wrap_around_discrepancy,

ModuleNotFoundError: No module named 'scipy.stats._qmc_cy'

Any idea how I can fix this @tupui ?

@tupui
Member

tupui commented Apr 10, 2021

ModuleNotFoundError: No module named 'scipy.stats._qmc_cy'

Any idea how I can fix this @tupui ?

You might just have to merge master here. I am on master and I can build.

@zoj613
Contributor Author

zoj613 commented Apr 10, 2021

ModuleNotFoundError: No module named 'scipy.stats._qmc_cy'
Any idea how I can fix this @tupui ?

You might just have to merge master here. I am on master and I can build.

Maybe I'm missing a step, because even after rebasing on the main branch, clearing the build, and re-building, I get this error when importing invgauss locally.

----> 1 from scipy.stats import invgauss

~/scipy/scipy/stats/__init__.py in <module>
    441 from .kde import gaussian_kde
    442 from . import mstats
--> 443 from . import qmc
    444 from ._multivariate import *
    445 from . import contingency

~/scipy/scipy/stats/qmc.py in <module>
    232 
    233 """
--> 234 from ._qmc import *

~/scipy/scipy/stats/_qmc.py in <module>
     14     _categorize, initialize_direction_numbers, _MAXDIM, _MAXBIT
     15 )
---> 16 from scipy.stats._qmc_cy import (
     17     _cy_wrapper_centered_discrepancy,
     18     _cy_wrapper_wrap_around_discrepancy,

ModuleNotFoundError: No module named 'scipy.stats._qmc_cy'

I suppose it doesn't matter since it has nothing to do with this PR.

@tupui
Member

tupui commented Apr 10, 2021

How are you building and which platform are you on?

I just checked out your branch, built it, and could run from scipy.stats import invgauss.
Are you certain you have pulled everything from master?

@zoj613
Contributor Author

zoj613 commented Apr 10, 2021

How are you building and which platform are you on?

I just checked out your branch, built it, and could run from scipy.stats import invgauss.
Are you certain you have pulled everything from master?

I think I did pull everything, because I rebased on the upstream master branch and it says this branch is up to date on the forked repo. I'm on Manjaro Linux 64-bit. I guess it's not too big of a deal because the tests run; I'm just unable to import in IPython locally. Another issue is the failure to build docs; I noticed this is the case for other open PRs as well.

@tupui
Member

tupui commented Apr 10, 2021

How are you building and which platform are you on?
I just checked out your branch, built and I could import from scipy.stats import invgauss.
Are you certain you have pulled everything from master?

I think I did pull everything, because I rebased on the upstream master branch and it says this branch is up to date on the forked repo. I'm on Manjaro Linux 64-bit. I guess it's not too big of a deal because the tests run; I'm just unable to import in IPython locally. Another issue is the failure to build docs; I noticed this is the case for other open PRs as well.

The doc build error is unrelated. We are having timeouts in the console. We still need to figure this out.

Do you actually have the files scipy/stats/_qmc_cy.pyx and the .cxx, .pyi ones?

@zoj613
Contributor Author

zoj613 commented Apr 10, 2021

Do you actually have the files scipy/stats/_qmc_cy.pyx and the .cxx, .pyi ones?

Yes, a shell listing returns:

$ ls scipy/stats/_qmc*
_qmc_cy.cxx  _qmc_cy.pyi  _qmc_cy.pyx  _qmc.py      _qmc.pyi  

@zoj613
Contributor Author

zoj613 commented Aug 3, 2022

@mdhaber I've done some initial updates to the code after going over the paper again. The tests in test_ppf_isf pass except for the cases starting at line 2276. For some reason I don't get the correct values; I'll investigate why later in the week and also address the comments you left.

@mdhaber
Contributor

mdhaber commented Aug 3, 2022

Before working too much, maybe try out Boost's implementation? You can follow the example of the new skewnorm PR, or I can try it if you'd prefer.

Comment on lines 2276 to 2280
assert np.isclose(stats.invgauss.ppf(0.00013, mu=1, scale=3),
                  0.15039762631802803, atol=2.22e-16)
# test if it returns right tail values accurately
assert np.isclose(stats.invgauss.isf(1e-20, mu=1.5, scale=1 / 0.7),
                  126.3493, atol=2.22e-16)
Contributor


I think there is a different meaning of scale between the paper and SciPy, because these aren't passing, right? SciPy's distribution infrastructure handles scale for all distributions. There is no need to test with scales other than 1. If you try a scale other than 1 and it doesn't match what is in a paper, it's more likely that the parameterization is different.
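The claim that SciPy's scale is a plain linear stretch is easy to verify: for a loc=0 distribution, ppf with scale=s equals s times the unscaled ppf, so a mismatch against a paper at scale != 1 points to a different parameterization rather than a bug. A quick check:

```python
import numpy as np
from scipy import stats

p, mu, s = 0.25, 1.5, 3.0

# SciPy applies scale after the shape: X_scaled = scale * X,
# so every quantile is multiplied by the scale.
scaled = stats.invgauss.ppf(p, mu, scale=s)
stretched = s * stats.invgauss.ppf(p, mu)

print(scaled, stretched)  # agree to floating-point precision
```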

@mdhaber
Contributor

mdhaber commented Aug 3, 2022

@zoj613 For this one, I'd also suggest that we close this and I'll open a PR that would use Boost for ppf/isf. I don't think I'd add the tests because the distribution infrastructure already performs the basic test to confirm that the methods are consistent with one another (e.g. see the ppf/cdf test here), tests of the domain (e.g. what happens with input NaNs), etc. Beyond that, it's a test of whether Boost's distributions are accurate or not. We can add additional tests, if we are concerned about the accuracy, but we don't need to add all the basic checks for each distribution separately.

The other part of this PR is behavior when np.isposinf(mu). This PR only adds it for pdf, ppf, and isf, it looks like? What about the cdf, moments, rvs, etc.? I'd suggest that we not worry about making this change. It is very time-consuming and error-prone to add special cases like this to all methods of all distributions, especially when the changes are mixed in with others (like adding ppf/isf; it should be a separate PR, if anything). I think the time is better spent fixing other issues and higher-impact enhancements for now, and then building a mechanism for these special cases into the overhauled distribution infrastructure (see gh-15928). When that's done, we can define special cases like this for the distribution at a high level (e.g. if np.isposinf, use invgauss methods instead) so that we won't need to add _lazywhere to every method and we won't have to write tests for every distribution separately.
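For illustration of the kind of per-method special-casing being discussed (hypothetical code, not from this PR or SciPy's internals), a mask-based dispatch that falls back to the Lévy density when mu is infinite, which is the mu -> inf limit of invgauss:

```python
import numpy as np
from scipy import stats

def invgauss_pdf_special(x, mu):
    """Hypothetical sketch: handle mu = inf in a single method.

    The point above is that repeating this masking in every method of
    every distribution is error-prone, hence the preference for a
    central mechanism in the distribution infrastructure.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    x, mu = np.broadcast_arrays(x, mu)
    out = np.empty_like(x)
    inf_mask = np.isposinf(mu)
    # The mu -> inf limit of invgauss is the standard Levy distribution.
    out[inf_mask] = stats.levy.pdf(x[inf_mask])
    out[~inf_mask] = stats.invgauss.pdf(x[~inf_mask], mu[~inf_mask])
    return out

print(invgauss_pdf_special([1.0, 2.0], mu=np.inf))
```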

@zoj613
Contributor Author

zoj613 commented Aug 4, 2022

@zoj613 For this one, I'd also suggest that we close this and I'll open a PR that would use Boost for ppf/isf. I don't think I'd add the tests because the distribution infrastructure already performs the basic test to confirm that the methods are consistent with one another (e.g. see the ppf/cdf test here), tests of the domain (e.g. what happens with input NaNs), etc. Beyond that, it's a test of whether Boost's distributions are accurate or not. We can add additional tests, if we are concerned about the accuracy, but we don't need to add all the basic checks for each distribution separately.

The other part of this PR is behavior when np.isposinf(mu). This PR only adds it for pdf, ppf, and isf, it looks like? What about the cdf, moments, rvs, etc? I'd suggest that we not worry about making this change. It is very time-consuming and error-prone to add special cases like this to all methods of all distributions, especially with the changes are mixed in with others (like adding ppf/isf - it should be a separate PR, if anything). I think the time is better spent fixing other issues and higher-impact enhancements for now, then build a mechanism for these special cases into the overhauled distribution infrastructure (see gh-15928). When that's done, we can define special cases like this for the distribution at a high level (e.g. if np.isposinf, use invgauss methods instead) so that we won't need to add _lazywhere to every method and we won't have to write tests for every distribution separately.

If the Boost implementation provides comparable performance, then I'm fine with closing this in favor of a simpler implementation. I do wonder whether Boost uses the same algorithm as the paper, because if not, accuracy issues might need to be addressed.

4.92818217e-16], atol=1e-12)

# tests if algorithm does not diverge for small probabilities.
assert np.isclose(stats.invgauss.ppf(0.00013, mu=1, scale=3),
Contributor


My guess for why these fail is that the parameterization used in the paper may be different than ours. That is, scale means something different in the paper. Ours is just a linear stretching of the real line, and the implementation is common to all distributions.

Comment on lines +2263 to +2264
assert_allclose(fns, [0., 0., 1.27054940e-21, 1.96413583e-17,
2.16840434e-19, 2.22044605e-16], atol=1e-08)
Contributor


Using Boost, fns is:

array([0.00000000e+00, 0.00000000e+00, 2.96461532e-21, 0.00000000e+00,
       2.71050543e-19, 1.66533454e-16])

abs_error is:

array([2.96461532e-21, 5.59041745e-20, 2.71050543e-19, 2.16840434e-19,
       5.20417043e-18, 2.77555756e-17, 1.66533454e-16, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00])

Comment on lines +2272 to +2273
assert_allclose(fns2, [0., 0., 0., 9.15882970e-17, 1.16804249e-16,
4.92818217e-16], atol=1e-08)
Contributor

@mdhaber mdhaber Aug 5, 2022


Using Boost, fns2, is:

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       3.53458946e-16, 6.57090956e-16])

rel_error is:

array([3.58337988e-16, 4.45142890e-16, 3.53458946e-16, 0.00000000e+00,
       1.15801436e-16, 1.16804249e-16, 6.57090956e-16, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00])

)
assert_allclose(fns2, [0., 0., 1.17819649e-16, 0.,
3.50412747e-16, 1.68867330e-11], atol=1e-08)

Contributor

@mdhaber mdhaber Aug 5, 2022


Yes, Boost would pass all the tests added here (except it doesn't handle the special case of mu='inf', which is a separate thing). It's also quite fast:

import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
mu = 1
rvs = stats.invgauss.rvs(mu, size=1000, random_state=rng)
%timeit stats.invgauss.ppf(rvs, mu)  # 885 µs with Boost, 7.26 ms this PR

@zoj613
Contributor Author

zoj613 commented Aug 5, 2022

Boost seems like a big improvement, especially given the simplicity it provides, so I'm happy for us to close this PR in favor of the Boost implementation.

@mdhaber
Contributor

mdhaber commented Aug 5, 2022

Here's a somewhat brutal test:

import numpy as np
from scipy import stats

az = np.logspace(-3, 3, 11)
pz = np.logspace(-8, -1, 8)
pz = np.concatenate([pz, [0.5], 1-pz[::-1]])[:, np.newaxis]

def relative_error(res, ref):
    return np.float64(abs((res-ref)/ref))

xz = stats.invgauss.ppf(pz, az)  # time this step
res = stats.invgauss.cdf(xz, az)
err = relative_error(res, pz)

Times:
Main: 399 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This PR: 151 ms ± 891 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Boost: 482 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Histogram of the log of the roundtrip relative error (image not reproduced here).

Number of NaNs produced by ppf:
Main: 0
This PR: 65
Boost: 26 # it has trouble when mu is too small

I think I'll open the Boost PR and, when the result is NaN, revert to the default implementation.
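The "revert to the default on NaN" idea can be sketched generically (the names and the brentq fallback here are illustrative, not the actual PR code): evaluate the fast path first, then recompute only the entries that came back NaN from valid inputs:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def ppf_with_fallback(p, mu):
    p = np.asarray(p, dtype=float)
    # Fast path: the vectorized ppf (Boost-backed in the proposed PR).
    res = stats.invgauss.ppf(p, mu)
    # NaN outputs for valid in-domain inputs mean the fast path failed there.
    bad = np.isnan(res) & ~np.isnan(p) & (p > 0) & (p < 1)
    for i in np.flatnonzero(bad):
        # Robust (slow) path: bracketed inversion of the cdf.
        res[i] = brentq(lambda t: stats.invgauss.cdf(t, mu) - p[i],
                        1e-300, 1e300)
    return res

p = np.array([0.1, 0.5, 0.9, np.nan])
print(ppf_with_fallback(p, mu=1.0))
```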

Successfully merging this pull request may close these issues.

ENH: invgauss.pdf should return correct output when mu=infinity
7 participants