Skip to content

truncnorm rvs() produces junk for ranges in the tail (Trac #962) #1489

Closed
scipy-gitbot opened this Issue Apr 25, 2013 · 8 comments

2 participants

@scipy-gitbot

Original ticket http://projects.scipy.org/scipy/ticket/962 on 2009-06-05 by trac user foo, assigned to unknown.

Even this fails completely:

>>> truncnorm(10, 15).rvs()
1.7976931348623157e+308
>>>

Ideally, this should work properly very far in the tail, e.g. truncnorm(10000, 20000).rvs(), and should never return values outside the provided range.

Tested only on x86-64, stock Ubuntu Jaunty.

@scipy-gitbot

@josef-pkt wrote on 2009-06-07

I think the main interesting part is why it doesn't cause an exception or produces nan.
The next best thing is to get a number that clearly signals something is wrong.

These functions are not designed to be used beyond the limit of the numerical precision, the calculations are done in double, and here we are far outside of the usual range for which the truncated normal is designed

just some examples

>>> stats.truncnorm.pdf(10, 10,15)
inf
>>> stats.truncnorm.pdf(11, 10,15)
inf
>>> stats.truncnorm.pdf(9, 10,15)
0.0

in this range of the standard normal the pdf is tiny

stats.norm.pdf(8)
5.0522710835368927e-015
stats.norm.pdf(10)
7.6945986267064199e-023
stats.norm.pdf(15)
5.5307095498444164e-050
from scipy import integrate
integrate.quad(stats.norm.pdf, 10, 15)
(7.6198530247681341e-024, 1.1818646176835592e-024)

the cdf is different from one (scipy.special) up to 8.2

1-stats.norm._cdf(7)
1.2798651027878805e-012
1-stats.norm._cdf(8)
6.6613381477509392e-016
1-stats.norm._cdf(8.2)
1.1102230246251565e-016
1-stats.norm._cdf(8.5)
0.0
1-stats.norm._cdf(10)
0.0

The reason for the random number is in the _ppf method that is used for the sampling.
It calls the private _ppf of the normal distribution for speed reason, which however doesn't check for the boundary explicitely, and return 1e+308 instead of inf.

stats.truncnorm.ppf(0.5, 10,15)
1.7976931348623157e+308

q=0.75;stats.norm._ppf(qstats.norm._cdf(15) + stats.norm._cdf(10)(1.0-q))
1.7976931348623157e+308
stats.norm._ppf(1)
1.7976931348623157e+308
stats.norm._ppf(1-1e-15)
7.9414444874159793
stats.norm._ppf(1-1e-16)
8.2095361516013874
stats.norm._ppf(1-1e-20)
1.7976931348623157e+308
stats.norm.ppf(1)
inf

The only alternative to fix this, that I see, would be to return a nan. One possibility is in the _argcheck, but we don't have the "tradition" to add everywhere a check and warning that an operation goes beyond machine precision. The warning is only used in some obvious cases like numerical integration. The second possibility is to check if the random number generator should return a nan, but this is a generic function that needs to work for all/many distributions. In no case would I recommend to return a valid number, since the number will not reflect the actual distribution.

If you really want a distribution build on this range of the normal distribution, the I would recommend mpmath which I think is able to handle this.

So for me this would be a "wontfix", unless someone finds a reasonable solution.

@scipy-gitbot

@pv wrote on 2009-06-07

That number looks a bit like FLOAT_MAX. It's possible that one of the underlying Fortran codes uses this special value instead of Inf to signify infinity.

This can be (and should be) fixed, if we can find out what the underlying function in scipy.special is.

@scipy-gitbot

@josef-pkt wrote on 2009-06-08

The inverse of the normal cdf is ndtri, which returns floatmax instead of inf.

>>> special.ndtri(1)
1.7976931348623157e+308

>>> special.ndtri(1-1e-16)
8.2095361516013874
>>> special.ndtri(1-1e-20)
1.7976931348623157e+308

Note the overflow also occurs at -inf

>>> special.ndtri(0)
-1.7976931348623157e+308

>>> special.ndtri(1e-30)
-11.464024688443613
>>> special.ndtri(1e-50)
-14.933337534788487
>>> special.ndtri(1e-300)
-37.047096299361201
>>> special.ndtri(1e-310)
-1.7976931348623157e+308

The precision at q=0 is much higher, so maybe the symmetry could be exploited? Might be worth checking, but a double won't be able to store 1-1e-30).

If special.ndtri would return inf instead, then the random variable returned in the original case would also return inf.

To return nan instead of inf, I think, we could replace _ppf by ppf in the generic _rvs. But this will likely be a large speed penalty for all generic random number generation, and I would switch to it only if there are other cases that are closer to the standard use cases.

    def _rvs(self, *args):
        ## Use basic inverse cdf algorithm for RV generation as default.
        U = mtrand.sample(self._size)
        Y = self._ppf(U,*args)
        return Y

Also precision in extreme tail behavior is good only in some distributions.
In many distributions, the generic calculation of ppf cannot be calculated for percentile q arbitrarily close to zero. The root finding algorithm needs to be bound away from 1 if the upper support of the distribution is inf.

@scipy-gitbot

@josef-pkt wrote on 2009-06-08

After my previous comment about symmetry, I checked the truncated distribution in the negative tail (since the normal distribution is symmetric), where there is no 1 to cover up the small numbers, and everything seems to work.

So it seems possible to work with the negative tail and just reverse the sign of the random variable. Whether the numbers are actually correct, needs checking.

>>> stats.truncnorm.pdf([-15,-14,-13,-12,-11,-10],-15, -10)
array([  7.25828901e-27,   1.43914398e-20,   1.04973518e-14,
         2.81683089e-09,   2.78065633e-04,   1.00980932e+01])
>>> stats.truncnorm.cdf([-15,-14,-13,-12,-11,-10],-15, -10)
array([  0.00000000e+00,   1.02279311e-21,   8.02792965e-16,
         2.33138632e-10,   2.50747563e-05,   1.00000000e+00])
>>> stats.truncnorm(-15, -10).rvs(10)
array([-10.21470333, -10.10066439, -10.01707767, -10.13847477,
       -10.16639336, -10.1053179 , -10.0017454 , -10.07044094,
       -10.10234868, -10.0281755 ])
>>> stats.truncnorm(-15, -10).stats('mvsk')
(array(-10.098093233962588), array(0.0094453778248606568), array(-1.9460313392723905), array(5.5803224183801934))
@scipy-gitbot

@josef-pkt wrote on 2009-06-08

just a basic check on the numbers for the truncnorm in the negative tail looks good

>>> integrate.quad(stats.norm.pdf, -15, -10)
(7.6198530247681341e-024, 1.1818646176835592e-024)
>>> integrate.quad(lambda x: stats.truncnorm.pdf(x, -15,-10), -15, -10)
(1.0000000000000082, 9.9518341175930435e-013)
>>> stats.truncnorm._pdf(np.linspace(-10.2,-10,20),-15, -10)
array([  1.33956725,   1.49131702,   1.66007346,   1.84772151,
         2.05635269,   2.28828741,   2.54609976,   2.83264494,
         3.15108954,   3.50494515,   3.89810548,   4.33488745,
         4.82007665,   5.3589776 ,   5.95746936,   6.62206702,
         7.35998967,   8.17923557,   9.08866523,  10.09809323])
>>> stats.truncnorm.veccdf([-15,-14,-13,-12,-11,-10],-15, -10)
array([  0.00000000e+00,   1.02279311e-21,   8.02792965e-16,
         2.33138632e-10,   2.50747563e-05,   1.00000000e+00])
>>> stats.truncnorm.veccdf(np.linspace(-10.2,-10,20),-15, -10)
array([ 0.13010258,  0.14498782,  0.16155838,  0.180003  ,  0.20053137,
        0.22337637,  0.24879661,  0.27707925,  0.30854313,  0.34354219,
        0.38246931,  0.42576057,  0.47389989,  0.52742426,  0.58692949,
        0.65307654,  0.72659861,  0.80830893,  0.89910937,  1.        ])
>>> stats.truncnorm.cdf(np.linspace(-10.2,-10,20),-15, -10)
array([ 0.13010258,  0.14498782,  0.16155838,  0.180003  ,  0.20053137,
        0.22337637,  0.24879661,  0.27707925,  0.30854313,  0.34354219,
        0.38246931,  0.42576057,  0.47389989,  0.52742426,  0.58692949,
        0.65307654,  0.72659861,  0.80830893,  0.89910937,  1.        ])
>>> stats.truncnorm.cdf(-10-1e-10,-15, -10)
0.99999999899019443
@scipy-gitbot

trac user foo wrote on 2009-06-08

I used GSL (particularly http://www.gnu.org/software/gsl/manual/html_node/The-Gaussian-Tail-Distribution.html) with a quick & dirty rejection sampling solution for my purposes, but GSL is GPL, so it wouldn't be suitable for scipy.

Now that I looked the algorithm used by GSL up though, I see it's simple enough. It's described at http://cg.scs.carleton.ca/~luc/chapter_nine.pdf, p.381 of the book, or p.3 of the pdf.

@scipy-gitbot

@josef-pkt wrote on 2009-06-08

It would be possible to include a normal tail distribution as a separate distribution, which would be relatively easy if the negative tail behavior of the truncated normal distribution is correct, i.e. if scipy.special for cdf and ppf in the left tail is accurate at the desired precision.

If the ppf (special.ndtri) is accurate in the left tail, then the current direct generation of the random variables can be used (using ppf). Adding a special random number generator for a new distribution would be more appropriately done in numpy.random, since stats.distributions until now has no structure for distribution specific methods for random number generation, it either uses numpy.random or the generic ppf method, (I would find it useful to have a semi-generic rejection sampling method, if that exists.)

I checked Johnson, Kotz and Balakrishnan Vol.2 again, and I didn't see any information about a normal tail distribution, the examples for the doubly truncated normal are all in the "fat" part of the normal distribution. I didn't know about gaussian tail distribution in gsl (which I don't have).

So, if someone has a good reference for the properties of the normal tail distribution and can provide a numerically accurate pdf and cdf, or test the left tail of the normal distribution related functions in scipy.special, then we can easily add it as a new distribution, either as a doubly truncated or as singly truncated normal distribution.
(Isn't there also a convergence result, some distribution that converges to the (conditional) tail of the normal distribution? I don't remember the reference right now)

However, as explained above, we cannot "abuse" the right tail of the truncated normal in this way.

How accurate are the following?

>>> stats.norm.pdf(-10.2)
1.0207305594306101e-023
>>> stats.norm._cdf(-10)
7.6198530241604696e-024
>>> stats.norm.pdf(-10.2)/stats.norm._cdf(-10)  # conditional left tail pdf
1.3395672543737427

>>> stats.norm._cdf(-20)
2.7536241186061556e-089
>>> stats.norm.pdf(-20)
5.5209483621597635e-088
>>> stats.norm.pdf(-20.2)/stats.norm._cdf(-20)  # conditional left tail pdf
0.35995251388498711
>>> stats.norm.pdf(-21)/stats.norm._cdf(-20)
2.5065256268967195e-008
@rgommers
SciPy member

Fixed by gh-2494:

>>> stats.truncnorm(10, 15).rvs()
10.182087940310346
@rgommers rgommers closed this May 19, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.