Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: improve performance of the skewnorm cdf computation #15486

Closed
wants to merge 1 commit into from

Conversation

zoj613
Copy link
Contributor

@zoj613 zoj613 commented Jan 29, 2022

Reference issue

closes #15473

What does this implement/fix?

Implements a method described in [1]

Additional information

As noted in the corresponding issue, this approach is much faster since it does not rely on integrating the pdf directly. Plus it does not suffer from the issues of the alternative implementation using Owen's T function, discussed in #8473.

References

[1] Amsler, C., Papadopoulos, A. & Schmidt, P. Evaluating the cdf of the Skew Normal distribution. Empir Econ 60,3171–3202 (2021). https://doi.org/10.1007/s00181-020-01868-6

CC @WarrenWeckesser @mdhaber @ev-br

@zoj613
Copy link
Contributor Author

zoj613 commented Jan 29, 2022

PS: I cant run the tests locally because of the build issues outlined here: #8325 (comment), but the manual tests I ran as standalone function suggest that the implementation is correct.

@zoj613
Copy link
Contributor Author

zoj613 commented Jan 29, 2022

Currently failing for the values (x=-4, a=2):

AssertionError: 
Not equal to tolerance rtol=1e-08, atol=0

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 3.21992328e-22
Max relative difference: 0.03960623
 x: array(8.451832e-21) # got
 y: array(8.12984e-21)  # expected

@tupui tupui added enhancement A new feature or improvement scipy.stats labels Jan 30, 2022
Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this could provide a nice speedup!

Is skewnorm accuracy well tested already? If not, I suppose it's OK not to add new tests. But since this is for speed, can you show some time comparisons in the PR?

Next time I review I'll need to check this against the reference and look at the test failure. Usually nonzero atol is OK, so I wonder why that test doesn't have one. Depends on what it's testing.

scipy/stats/_continuous_distns.py Show resolved Hide resolved
scipy/stats/_continuous_distns.py Show resolved Hide resolved
Comment on lines +7079 to +7080
* we use equation 14 when Q < 0, which is an approximation for the
lower tail of the distribution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i haven't read the paper yet, but is it more accurate or much faster than equation 3? Is it a good approximation for all values of the shape parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several algorithms outlined in the paper that have differing accuracy and speed, depending on the sign and absolute values of the shape parameter. Unfortunately, It seems rather hard or impractical to account for all different values while choosing the "best" algorithm. Equation 14 is just as accurate as the current implementation but as seen in my previous comment here: #15486 (comment), the accuracy isn't equal. The hard part is knowing which algorithm to choose from given the combination of x and shape and whether or not that combination represents a point at the tail of the distribution.

scipy/stats/_continuous_distns.py Show resolved Hide resolved
scipy/stats/_continuous_distns.py Show resolved Hide resolved
scipy/stats/_continuous_distns.py Show resolved Hide resolved
scipy/stats/_continuous_distns.py Show resolved Hide resolved
scipy/stats/_continuous_distns.py Show resolved Hide resolved
scipy/stats/_continuous_distns.py Show resolved Hide resolved
@mdhaber
Copy link
Contributor

mdhaber commented Aug 3, 2022

@zoj613 I started working on this to help finish it up, but I noticed that it fails our existing accuracy checks - partially due to the use of the approximation, and partially because of (1 - cdf) if orig_shape < 0.

It turns out that Boost has an implementation of the skewnorm distribution that is much faster than we can do here, so I am proposing we use it in gh-16770. We could still use some of the ideas from the paper, but I think it would need to be in a new PR. Would it be ok if we close this one in favor of gh-16770?

Boost also has the inverse Gaussian. If it looks like it would take care of gh-13665/gh-13666, shall I close gh-13665?

@zoj613
Copy link
Contributor Author

zoj613 commented Aug 3, 2022

@mdhaber Yes im fine with closing this. #16770 looks a lot simpler.

@mdhaber mdhaber closed this Aug 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: skewnorm.cdf is very slow. Consider a much more efficient approach
3 participants