BUG: stats.nct.pdf inconsistent behavior when compared to MATLAB and R #19348
Comments
Thanks for reporting the issue. It is a bug, and it is still present in the development version of SciPy. Here's a simplified example, with integer parameter values so it is easy to compare to the result computed by Wolfram Alpha:
With Wolfram Alpha, |
too much boost ?
|
Boost isn't used by |
The development version of boost/math gives the correct result in this case.
As noted in the comment in the SciPy code, the boost implementation has issues in the left tail. |
@WarrenWeckesser are you willing to open an issue with Boost about the left tail of the pdf? It would be nice if we could switch to Boost's implementation after all rather than stitching ours and theirs together. Here's a demonstration in SciPy (using Boost for the PDF):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

dist = stats.nct(8, 8.5)
x = np.linspace(-3, 5)
plt.semilogy(x, dist.pdf(x))
plt.show()
```

Or maybe in the meantime just going with Boost is the lesser of two evils. |
One possible candidate for change compared to my version is (because the pdf code itself is unchanged) |
Thanks @josef-pkt, that is in fact the source of the problem reported here. When the
The problem is that the underlying boost function
I'll see if we can work around the boost bug in our wrapper of
The problem with the PDF calculation in the left tail is a separate issue. @mdhaber, if you already have the evidence, go ahead and file a boost issue. I probably won't get to that in the immediate future. |
I ask because I don't ever work with C++. We'll see if they'll take the report in Python form. |
Fixes #1035. See also scipy/scipy#19348. Accuracy in the left tail is still poor, and the reflection formula appears to be to blame, as its use causes the series to cancel out the first term, but it appears we have no real choice in the matter here. At least we now get a few digits correct.
There is a fix for the left tail issue in the works at the Boost end now; however, the accuracy is still poor and I see no easy fix at present. Would I be correct in thinking that a handful of correct digits is "good enough" for most stats usages? |
Probably (although we are regularly surprised at what is requested). In any case, I think that fixing that discontinuity would be enough of an improvement to switch to the Boost implementation so we can fix the garbage output reported in this issue. Update: when we do fix this, check the cases in gh-19450. |
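As a quick way to look for the kind of left-tail discontinuity discussed above, one can evaluate the PDF on a dense grid and inspect ratios of neighbouring values. This is an illustrative sketch, not the check SciPy itself uses; the grid and the `nct(8, 8.5)` parameters are just the ones from the plot earlier in the thread:

```python
import numpy as np
from scipy import stats

# Evaluate the nct PDF on a dense grid covering the left tail and the bulk.
x = np.linspace(-3, 5, 801)
p = stats.nct.pdf(x, 8, 8.5)

# A healthy PDF here is finite and non-negative, and it changes smoothly;
# a large ratio between neighbouring positive values flags a discontinuity.
pos = p[p > 0]
ratios = np.maximum(pos[1:] / pos[:-1], pos[:-1] / pos[1:])
print(np.isfinite(p).all(), ratios.max())
```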
Can I add that if one wants to use SciPy to develop methods used in validation, then accuracy is important? I was a user of Minitab in a previous life in quality, and developing sampling plans was part of the job. Accuracy similar to Minitab would be preferable. |
@jzmaddock is this now already part of the SciPy release? I am running into the same problems. Running SciPy 1.12.0. |
I can't comment on SciPy, but this would have been in the last Boost release (1.84), so it probably depends what Boost release is installed when SciPy is built? |
SciPy was recently updated to use 1.83 on main, so would need an update before the next release. It sounds like that isn't trivial and introduces some errors which would need resolving. |
Anything on our side that we need to look at? |
From looking at the source code, I managed to fix this myself by converting the calculation into log space before converting back. The problem with the scipy implementation (at least for me) was that the numbers got exceptionally large before being reduced down. I've written my own implementation for a project in which I'm using PyTorch, but the idea is the same:

```python
import torch

Pi = torch.acos(torch.zeros(1)) * 2


def nct_pdf(x, df, nc, norm=True):
    """Non-central t distribution probability density function.

    There are problems with the scipy implementation, so this function fixes
    those by performing most of the calculations in log space.
    Most of the function is also implemented in pytorch.
    Maths taken from: https://en.wikipedia.org/wiki/Noncentral_t-distribution#Probability_density_function

    Args:
        x (torch.tensor): Values to evaluate the pdf at.
        df (int): Degrees of freedom.
        nc (float): Non-centrality parameter.

    Returns:
        pdf (torch.tensor): Probability density at each x.
    """
    n = torch.tensor(df)
    nc = torch.tensor(nc)
    x2 = x ** 2
    mu2 = nc ** 2
    mu2x2 = mu2 * x2
    _2 = torch.tensor(2)
    # Student t(x; mu = 0), kept in log space
    lgamma2 = torch.lgamma(n / 2)
    lgamma12 = torch.lgamma((n + 1) / 2)
    stmu_z = (
        lgamma12 - lgamma2 - ((n + 1) / 2) * torch.log(1 + x2 / n)
        - 0.5 * torch.log(n * Pi) - mu2 / 2
    )
    # A_v(x; mu) and B_v(x; mu) terms
    # (loghyp1f1 is a log-space confluent hypergeometric helper, defined elsewhere)
    zterm = mu2x2 / (2 * (x2 + n))
    Av = loghyp1f1((df + 1) / 2, .5, zterm)
    Bv = (
        torch.log(torch.sqrt(_2) * nc * x / torch.sqrt(x2 + n))
        + torch.lgamma(n / 2 + 1) - lgamma12
        + loghyp1f1(df / 2 + 1, 1.5, zterm)
    )
    pdf = torch.exp(stmu_z + Av) + torch.exp(stmu_z + Bv)
    pdf = torch.nan_to_num(pdf)
    if norm:
        pdf /= torch.sum(pdf)
        pdf = torch.nan_to_num(pdf)
    return torch.clamp(pdf, 0, 1).to(x.dtype)
```

where the function

Happy to submit a pull request if this would be useful. |
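The definition of the `loghyp1f1` helper used in the snippet above was lost from the comment as archived. For completeness, here is one possible pure-Python sketch (an assumption, not the commenter's actual code): a log-space summation of the ₁F₁ series, valid for positive `a`, `b`, `z`, which keeps the intermediate values small even when the hypergeometric value itself is astronomically large:

```python
import math

def loghyp1f1(a, b, z, max_terms=100000):
    """log(1F1(a; b; z)) for a, b, z > 0, summing the series in log space."""
    if z == 0:
        return 0.0  # 1F1(a; b; 0) = 1
    log_term = 0.0  # log of the k = 0 term, which is 1
    log_sum = 0.0   # log of the partial sum so far
    for k in range(1, max_terms):
        # term_k = term_{k-1} * (a + k - 1) / (b + k - 1) * z / k
        log_term += (math.log(a + k - 1) - math.log(b + k - 1)
                     + math.log(z) - math.log(k))
        # log-sum-exp update: log(exp(log_sum) + exp(log_term))
        m = max(log_sum, log_term)
        log_sum = m + math.log(math.exp(log_sum - m) + math.exp(log_term - m))
        if log_term < log_sum - 36:  # term is < ~1e-16 of the running sum
            break
    return log_sum
```

Sanity checks: 1F1(1; 1; z) = e^z, so `loghyp1f1(1, 1, 2)` should return 2, and the large value quoted later in this thread, 1F1(13; 1.5; 61) ≈ 1.355e39, comes back as a modest logarithm of about 90.1.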
I think I get two extra errors with Boost 1.84. Not sure which end they are due to. |
If SciPy is just calling the Boost implementation then we shouldn't spuriously overflow - if we do it's a bug and we'll fix that.
I fear someone will have to explain that output to this non-Python person ;) |
Is scipy just calling boost? I've found some nct pdf code here - if this is being used it might benefit from performing the bulk of the calculations in log space. |
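To illustrate why log space helps here (a generic sketch, not the linked SciPy code): the gamma factors appearing in the nct PDF formula overflow a double long before the PDF value itself would, while their logarithms stay comfortably small.

```python
import math

df = 400.0

# Direct evaluation: Gamma(df + 1) = 400!, around 1e868, far beyond
# the ~1.8e308 limit of a double, so it overflows.
try:
    direct = math.gamma(df + 1)
except OverflowError:
    direct = float("inf")

# The log-space equivalent is a perfectly ordinary number (~2000).
log_g = math.lgamma(df + 1)

print(direct, log_g)
```

Ratios of such factors are then formed by subtracting logarithms and exponentiating only at the end, which is exactly the strategy of the PyTorch snippet earlier in the thread.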
The new errors with Boost.Math 1.84 are unrelated to this issue. They originate from Boost's implementation of the inverse survival function of the inverse Gaussian distribution, see here. Wald is a special case of the inverse Gaussian distribution for
Side note: the RuntimeError in |
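For reference, the Wald/inverse-Gaussian relationship mentioned above can be checked directly: in SciPy's parametrization, `wald` is `invgauss` with `mu` fixed at 1, so the two PDFs should agree pointwise (a quick illustrative check):

```python
import numpy as np
from scipy import stats

# scipy.stats.wald is the inverse Gaussian distribution with mu = 1,
# so its PDF should match invgauss.pdf(x, 1.0) everywhere.
x = np.linspace(0.1, 5, 50)
print(np.allclose(stats.wald.pdf(x), stats.invgauss.pdf(x, 1.0)))  # prints True
```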
Hmmm, I don't have a mac here, but I have tried to reproduce and completely failed on Windows/MSVC, @mborland are you able to try on your Mac? Here's the test program I used, it tests every representable q value from 0.09999999 to 0.10000001, and takes a good couple of hours to run even in release mode. There is a bit of "wobble" in the quantiles returned (up to 9ULP on my machine), but that's to be expected given that (a) the root finder stops once it's "close enough" and (b) there may be multiple abscissa values that are equally correct. Which is to say, the quantile doesn't quite always monotonically decrease for increasing q.
|
Update: that was tested against develop, there was an old root finding bug which resurfaced in 1.84 causing excessive iterations to be used, I need to double check that... |
This is our bad: it's a resurfacing of boostorg/math#184 as a result of accepting a PR we shouldn't have :( The issue is now better tested and fixed again in 1.85 which will be out shortly (I hope). Apologies all round. |
Thanks for taking a look so quickly! |
thanks a lot @jzmaddock ! I think branching for the next SciPy version is happening very soon, so 1.85 may not make it in this time around - it should make it for the next minor release though, which by the sounds of it should help address this issue. Hope that answers your question @boersmamarcel ! |
This also seems to be fixed in

```python
from scipy import stats
from scipy.special import hyp1f1

print(stats.nct.pdf(8, 24, 13))  # 0.0017644351379121057
print(hyp1f1(13.0, [1.499999, 1.5, 1.500001], 61.0))  # [1.35509151e+39 1.35508577e+39 1.35508004e+39]
```

Thanks to everyone who worked on this! |
Describe your issue.
Hello,
When computing stats.nct.pdf with large t and nc values, the function returns 'inf'.
This behavior is inconsistent with both MATLAB and R. This "bug" appeared with SciPy 1.10 (the behavior was consistent before).
I'm not sure whether it's truly a bug or whether MATLAB and R are both wrong in not returning inf.
The following R code (equivalent to the Python code) returns 0.0043403 instead of 'inf' (R version 4.3.1):
Same value is returned in MATLAB (R2023b):
Reproducing Code Example
Error message
SciPy/NumPy/Python version and system information