Improve precision of stats.norm.logcdf #188

Merged
merged 4 commits into from May 5, 2012

Conversation

Projects
None yet
3 participants
@rgommers
Member

rgommers commented Apr 1, 2012

This adds a log_ndtr function to scipy.special and uses it in stats.norm. See ticket 1614.

scipy/special/cephes/ndtr.c
+
+double log_ndtr(double a) {
+ double pi = 3.14159265358979323846264338327;
+ if (a > -10) {

This comment has been minimized.

@pv

pv Apr 1, 2012

Member

How was this cutoff chosen, and how big is the discontinuity across it? There probably should be a test case checking this.

@pv

pv Apr 1, 2012

Member

How was this cutoff chosen, and how big is the discontinuity across it? There probably should be a test case checking this.

This comment has been minimized.

@rgommers

rgommers Apr 1, 2012

Member

The discontinuity is about 0.0097. The pdf attached to the ticket has an error estimate, did you see that? It actually suggests that this approximation should be used for a cutoff closer to 0 that the current choice.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

norm = stats.norm()

x = np.linspace(-9.9, -10.1, 300)
y = norm.logcdf(x)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y, 'b-')

plt.show()

print 'Discontinuity around -10: ', norm.logcdf(-10+1e-10) - norm.logcdf(-10-1e-10)
@rgommers

rgommers Apr 1, 2012

Member

The discontinuity is about 0.0097. The pdf attached to the ticket has an error estimate, did you see that? It actually suggests that this approximation should be used for a cutoff closer to 0 that the current choice.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

norm = stats.norm()

x = np.linspace(-9.9, -10.1, 300)
y = norm.logcdf(x)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y, 'b-')

plt.show()

print 'Discontinuity around -10: ', norm.logcdf(-10+1e-10) - norm.logcdf(-10-1e-10)

This comment has been minimized.

@pv

pv Apr 1, 2012

Member

The error made by choosing the cutoff is of order 1/(2 z^2), and ndtr runs out of floating point numbers around z ~ -37, so getting very good accuracy by adjusting the cutoff is unfortunately not possible here. Adding the correction terms from the sum should help, however.

@pv

pv Apr 1, 2012

Member

The error made by choosing the cutoff is of order 1/(2 z^2), and ndtr runs out of floating point numbers around z ~ -37, so getting very good accuracy by adjusting the cutoff is unfortunately not possible here. Adding the correction terms from the sum should help, however.

scipy/stats/tests/test_distributions.py
+ -2893.25, -3205.30, -3533.35, -3877.40, -4237.44, -4613.48,
+ -5005.52, -5413.56, -5837.60, -6277.64, -6733.67]
+
+ assert_allclose(stats.norm().logcdf(x), expected, atol=0.01)

This comment has been minimized.

@pv

pv Apr 1, 2012

Member

More decimals could be useful to have here.

@pv

pv Apr 1, 2012

Member

More decimals could be useful to have here.

@pv

This comment has been minimized.

Show comment
Hide comment
@pv

pv Apr 1, 2012

Member

Ok, the relative accuracy of this is not so good, around 1e-4. I'd prefer at least sqrt(eps).

I think this patch must be revised to properly implement the asymptotic expansion of log(ndtr(z)) = log(.5*erfc(-z/sqrt(2))). The relevant formula is 7.1.23 in Abramowitz&Stegun, which gives:

log ndtr(z) = -.5*log(2*pi) - log(-z) - z**2/2 + log(1 + sum_{m=1}^infty (-1)**m 1*3*...*(2m-1)/z**(2*m))

The last log(1 + ...) could maybe also be taken by log1p which I think is in npymath. The number of terms should be limited so that the last term taken is smaller than the machine epsilon.

EDIT: many edits to this comment, due to exceeding number of typos in the formula :/

Member

pv commented Apr 1, 2012

Ok, the relative accuracy of this is not so good, around 1e-4. I'd prefer at least sqrt(eps).

I think this patch must be revised to properly implement the asymptotic expansion of log(ndtr(z)) = log(.5*erfc(-z/sqrt(2))). The relevant formula is 7.1.23 in Abramowitz&Stegun, which gives:

log ndtr(z) = -.5*log(2*pi) - log(-z) - z**2/2 + log(1 + sum_{m=1}^infty (-1)**m 1*3*...*(2m-1)/z**(2*m))

The last log(1 + ...) could maybe also be taken by log1p which I think is in npymath. The number of terms should be limited so that the last term taken is smaller than the machine epsilon.

EDIT: many edits to this comment, due to exceeding number of typos in the formula :/

scheinatadobe and others added some commits Apr 9, 2012

TST: increase test accuracy for logcdf of normal distribution.
Also some minor style changes to implementation.
@rgommers

This comment has been minimized.

Show comment
Hide comment
@rgommers

rgommers Apr 9, 2012

Member

Issues should be addressed now. Andrew reported relative accuracy of 1e-14, for me it's about 1e-11. Test precision bumped up to atol=1e-8.

Member

rgommers commented Apr 9, 2012

Issues should be addressed now. Andrew reported relative accuracy of 1e-14, for me it's about 1e-11. Test precision bumped up to atol=1e-8.

@rgommers

This comment has been minimized.

Show comment
Hide comment
@rgommers

rgommers May 5, 2012

Member

@pv: are you OK with merging this?

Member

rgommers commented May 5, 2012

@pv: are you OK with merging this?

@pv

This comment has been minimized.

Show comment
Hide comment
@pv

pv May 5, 2012

Member

I believe this is correct as is stands. +1

Member

pv commented May 5, 2012

I believe this is correct as is stands. +1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment