scipy.special.smirnov and scipy.special.smirnovi have accuracy & convergence issues #7456

pvanmulbregt · 2017-06-02T22:27:24Z

While investigating #7426 and #7455 I noticed four related issues with scipy.special.smirnov and scipy.special.smirnovi, the functions that form the basis for scipy.stats.ksone.

S1: scipy.special.smirnovi(n, p) returns NAN for p close to 0
S2: scipy.special.smirnovi(n, p) returns incorrect values for p close to 1
S3: scipy.special.smirnovi(n, p) not monotonic for p close to 1
S4: scipy.special.smirnov(n, p) not monotonic for increasing n

In detail:
S1: scipy.special.smirnovi returns NAN for p close to 0

>>> scipy.special.smirnovi(1, np.linspace(0, 0.5, 26, endpoint=True))
array([  nan,   nan,   nan,   nan,   nan,   nan,   nan,  0.86,  0.84,
        0.82,  0.8 ,  0.78,  0.76,  0.74,  0.72,  0.7 ,  0.68,  0.66,
        0.64,  0.62,  0.6 ,  0.58,  0.56,  0.54,  0.52,  0.5 ])
>>> # Correct answer should be 1-p
... # Example for n=2
... p01 = np.linspace(0.0, 0.1, 11, endpoint=True)
>>> y01 = scipy.special.smirnovi(2, p01)
>>> y01
array([        nan,         nan,  0.85857864,  0.82679492,  0.8       ,
        0.7763932 ,  0.75505103,  0.73542487,  0.71715729,  0.7       ,
        0.68377223])
>>> # Correct answer should be 1-sqrt(p)
... y01 - (1-np.sqrt(p01))
array([             nan,              nan,   9.08451092e-12,
        -1.76464399e-11,  -1.40549794e-11,  -1.13044019e-11,
        -4.77495821e-12,   1.42458267e-11,   1.38629108e-11,
         9.03543906e-12,   2.10510498e-11])

S1 Summary: Some unexpected NANs are returned, but the non-NAN values that are returned are OK.

S2: scipy.special.smirnovi returns incorrect values for p close to 1

>>> p = np.linspace(0.5, 1.0, 26, endpoint=True)
>>> scipy.special.smirnovi(1, p)
array([ 0.5       ,  0.48      ,  0.46      ,  0.44      ,  0.42      ,
        0.4       ,  0.38      ,  0.36      ,  0.34      ,  0.32      ,
        0.3       ,  0.28      ,  0.26      ,  0.24      ,  0.22      ,
        0.2       ,  0.18      ,  0.16      ,  0.14      ,  0.09233075,
        0.06970634,  0.49327356,  0.        ,  0.        ,  0.        ,  0.        ])
>>> scipy.special.smirnovi(1, p) - (1-p)
array([  2.01649808e-12,   2.02221573e-12,   1.62669878e-12,
         6.26237950e-12,   3.12183612e-12,   1.09701137e-12,
         1.94971816e-12,   2.44942955e-12,   1.71884729e-12,
         3.00703906e-13,   3.59712260e-14,  -2.89490654e-13,
         9.67503855e-13,  -9.54680779e-13,   2.58731925e-12,
        -2.62931343e-12,   5.48847079e-12,  -4.43137194e-12,
        -6.39346909e-12,  -2.76692450e-02,  -3.02936628e-02,
         4.13273557e-01,  -6.00000000e-02,  -4.00000000e-02,
        -2.00000000e-02,   0.00000000e+00])
>>> p11 = np.linspace(0.9, 1.0, 11, endpoint=True)
>>> y11 = scipy.special.smirnovi(2, p11)
>>> expected = (-1 + np.sqrt(5-4*p11))/2
>>> y11 - expected
array([  3.22282478e-12,   3.07653902e-12,  -5.51529163e-09,
        -1.77132899e-02,  -2.03289598e-02,   8.84626156e-03,
        -3.85164807e-02,  -2.91502622e-02,  -1.96152423e-02,
        -9.90195136e-03,   0.00000000e+00])

The first few values in each array look fine. The zeros returned at the end of the first array are wrong for p < 1 (the correct value is 1-p), and the values in the middle have lost most of their accuracy.

S2 Summary: smirnovi(n, p) returns inaccurate non-NAN values as p->1.
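For reference, the closed forms used as expected values in S1 and S2 can be cross-checked against a direct evaluation of the one-sided Smirnov sum. This is a minimal pure-Python sketch, not the cephes implementation. The names `smirnov_sf_direct`, `smirnovi_n1` and `smirnovi_n2` are made up for illustration, and the direct sum is only suitable for small n.

```python
import math


def smirnov_sf_direct(n, x):
    """Direct sum for the one-sided Smirnov survival function (small n only).

    smirnov(n, x) = x * sum_{v=0}^{floor(n(1-x))}
                        C(n, v) * (x + v/n)**(v-1) * (1 - x - v/n)**(n-v)
    """
    if x <= 0.0:
        return 1.0
    if x >= 1.0:
        return 0.0
    total = 0.0
    for v in range(int(math.floor(n * (1.0 - x))) + 1):
        a = x + v / n
        b = max(1.0 - x - v / n, 0.0)  # guard against tiny negative rounding
        total += math.comb(n, v) * a ** (v - 1) * b ** (n - v)
    return x * total


def smirnovi_n1(p):
    """Closed-form inverse for n=1: smirnov(1, x) = 1 - x."""
    return 1.0 - p


def smirnovi_n2(p):
    """Closed-form inverse for n=2.

    smirnov(2, x) = (1 - x)**2     for x >= 0.5  (p <= 0.25)
    smirnov(2, x) = 1 - x - x**2   for x <= 0.5  (p >= 0.25)
    """
    if p <= 0.25:
        return 1.0 - math.sqrt(p)
    return (-1.0 + math.sqrt(5.0 - 4.0 * p)) / 2.0


for p in (0.02, 0.25, 0.9, 0.95):
    x = smirnovi_n2(p)
    print(p, x, smirnov_sf_direct(2, x))
```

The two branches of `smirnovi_n2` meet at p=0.25 (x=0.5), which is why 1-sqrt(p) is the right reference near p=0 and (-1+sqrt(5-4p))/2 is the right reference near p=1.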

S3: scipy.special.smirnovi not monotonic for p close to 1

smirnovi(n, p) should be monotonic in p. [It arises from a probability distribution. (1-smirnov(n, x)) is a probability CDF, hence smirnovi(n, p) should be monotonically decreasing as p->1.]

>>> p = np.linspace(0.89, 0.93, 21, endpoint=True)
>>> y = scipy.special.smirnovi(1, p)
>>> for idx, p_ in enumerate(p[1:]):
...     print('IDx=%2d p=%f y=%f %s' % (idx, p_, y[idx], 'Bad' if y[idx]>y[idx-1] else ''))
... 
IDx= 0 p=0.892000 y=0.076595 Bad
IDx= 1 p=0.894000 y=0.074029 
IDx= 2 p=0.896000 y=0.071589 
IDx= 3 p=0.898000 y=0.069259 
IDx= 4 p=0.900000 y=0.067027 
IDx= 5 p=0.902000 y=0.069706 Bad
IDx= 6 p=0.904000 y=0.071481 Bad
IDx= 7 p=0.906000 y=0.071802 Bad
IDx= 8 p=0.908000 y=0.066801 
IDx= 9 p=0.910000 y=0.052894 
IDx=10 p=0.912000 y=0.089193 Bad
IDx=11 p=0.914000 y=0.046641 
IDx=12 p=0.916000 y=0.073143 Bad
IDx=13 p=0.918000 y=0.090877 Bad
IDx=14 p=0.920000 y=0.226680 Bad
IDx=15 p=0.922000 y=0.493274 Bad
IDx=16 p=0.924000 y=0.157108 
IDx=17 p=0.926000 y=0.122748 
IDx=18 p=0.928000 y=0.000000 
IDx=19 p=0.930000 y=0.000000

However, the returned values bounce around as p->1.
I've also observed that it can take many iterations to converge.

S3 Summary: smirnovi(n, p) is not monotonic in p, which breaks random variate generation for the distribution.
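Note that the loop in the transcript pairs p[idx+1] with y[idx], and at idx=0 it compares y[0] against y[-1], so the first "Bad" flag is an artifact of the indexing. A stricter check compares each value only with its predecessor. The helper below is a small sketch of that check; `increasing_steps` is a hypothetical name, and the sample values are copied from the first seven rows of the output above.

```python
def increasing_steps(values):
    """Return the indices i >= 1 where values[i] > values[i - 1].

    For a function that should be monotonically decreasing,
    every returned index marks a violation.
    """
    return [i for i in range(1, len(values)) if values[i] > values[i - 1]]


# y values copied from the first seven rows of the S3 output.
y_sample = [0.076595, 0.074029, 0.071589, 0.069259, 0.067027, 0.069706, 0.071481]
print(increasing_steps(y_sample))  # → [5, 6]
```

An empty list means the sequence never increases, which is what a correct smirnovi(n, p) should give on an increasing grid of p.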

S4: smirnov(n, p) not monotonically decreasing for increasing n.
As n increases, smirnov(n, p) should decrease due to its probability interpretation. (The probability of a gap of size x goes down as more samples are observed.)

>>> for n in range(1000, 1020, 1):
...    print(n, np.log(scipy.special.smirnov(n, 0.5)))
...    if n == 1012:
...        print('---')
... 
1000 -606.21577633
1001 -606.381127708
1002 -608.814843287
1003 -608.976583833
1004 -611.488568812
1005 -611.594745951
1006 -614.143940759
1007 -614.244098843
1008 -616.824189779
1009 -618.894519775
1010 -619.581637548
1011 -621.655946257
1012 -622.31942097
---
1013 -539.447655421
1014 -539.979953359
1015 -540.512251298
1016 -541.044549237
1017 -541.576847175
1018 -542.109145113
1019 -542.641443052

The values decrease until n=1012, at which point there is a jump up, followed by decreases from then on. The jump is very obvious in a plot over the whole of [0,1], so it is not just an issue with one particular value of x.

import numpy as np
import matplotlib.pyplot as plt
import scipy.special

x01 = np.linspace(0, 1.0, 1001, endpoint=True)
for n in range(900, 1025, 50):
    # Solid lines: n = 900, 950, 1000
    plt.plot(x01, np.log(scipy.special.smirnov(n, x01)), ls='-', label='sm(%d)' % n)
for n in range(1050, 1300, 50):
    # Dotted lines: n = 1050, ..., 1250
    plt.plot(x01, np.log(scipy.special.smirnov(n, x01)), ls=':', label='sm(%d)' % n)
plt.xlim([0, 1])
plt.legend()
plt.grid()

S4 Summary: Discontinuity between n=1012 and 1013.

Reproducing code example:

import sys
import numpy as np
import scipy
import scipy.special

print('S1: scipy.special.smirnovi(n, p) returns NAN for p close to 0')
print(scipy.special.smirnovi(1, np.linspace(0, 0.5, 26, endpoint=True)))
print(scipy.special.smirnovi(2, np.linspace(0.0, 0.1, 11, endpoint=True)))

print('S2: scipy.special.smirnovi returns incorrect values for p close to 1')
p = np.linspace(0.5, 1.0, 26, endpoint=True)
print(scipy.special.smirnovi(1, p))
print(scipy.special.smirnovi(1, p) - (1-p))
p11 = np.linspace(0.9, 1.0, 11, endpoint=True)
y11 = scipy.special.smirnovi(2, p11)
expected = (-1 + np.sqrt(5-4*p11))/2
print(y11 - expected)

print('S3: scipy.special.smirnovi not monotonic for p close to 1')
p = np.linspace(0.89, 0.93, 21, endpoint=True)
y = scipy.special.smirnovi(1, p)
for idx, p_ in enumerate(p[1:]):
    print('IDx=%2d p=%f y=%f %s' % (idx, p_, y[idx], 'Bad' if y[idx]>y[idx-1] else ''))

print('S4: smirnov(n, p) not monotonic for increasing n')
for n in range(700, 1300, 50):
   print(n, np.log(scipy.special.smirnov(n, 0.5)))

for n in range(1000, 1020, 1):
   print(n, np.log(scipy.special.smirnov(n, 0.5)))
   if n == 1012:
       print('---')

The examples above often use n=1, but examples for higher n also exist.

Scipy/Numpy/Python version information:

0.19.0 1.12.1 sys.version_info(major=3, minor=6, micro=1, releaselevel='final', serial=0)
('0.13.3', '1.8.2', sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0))


pvanmulbregt · 2017-06-02T22:30:53Z

Added the image showing the change in behavior as n increases.
Solid lines show np.log(scipy.special.smirnov(n, x)) for n <= 1000, dotted lines for n >= 1050.

…scipy.stats.ksone) scipy.special.{smirnov,smirnovi} have accuracy & convergence issues incorrect values for scipy.stats.ksone.pdf() Add a smirnovp(), derivative of smirnov(), function. Use smirnovp() in smirnovi() instead of an asymptotic. If the initial estimate for x in smirnovi() is not going to work for Newton-Raphson (because it is on a "bad" part of the curve), do some bisection steps to get a better estimate, one which is on the correct side of the desired value. Replace the relative stopping criteria with that used in np.allclose(). With this in place, the algorithm seems to converge in about 5 iterations (10 function evaluations) for most values of p. (The most I've observed is 18 iterations, 31 functions). Exposing smirnovp() allows its use in scipy.stats.ksone() which addresses the inaccurate pdf(). Requires changes in many cephes infrastructure files, adding doc, exports and tests.

pvanmulbregt · 2017-06-05T21:14:12Z

Causes:
S1,S2,S3:
scipy.special.smirnovi(n, p) is implemented in C in scipy/special/cephes/kolmogorov.c. The code generates an initial estimate for x, then performs root-finding to refine it.

It uses the asymptotic approximation (smirnov(n, x) ~ exp(-2*n*x**2)) for an initial estimate of x. This asymptotic may be reasonable for large n, but isn't very close for small n, and can lead to sub-optimal starting points for the later root-finding, for both small and large n.
It uses the Newton-Raphson (NR) algorithm for its root-finding. This can lead to subsequent estimates outside the [0, 1] interval, which are reported as OVERFLOW errors.
It doesn't use the actual derivative but instead uses the same above asymptotic to approximate the derivative. This isn't close enough for some values of n, p, so the convergence can be very slow, ~500+ iterations. (smirnov(n, _) has a wickedly small derivative for x >> 1/sqrt(n), and this wreaks havoc with NR if the estimate is on the wrong side of the actual value.)

In particular, the convergence is neither guaranteed nor quadratic.
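To see how poor the asymptotic starting point can be for small n, invert smirnov(n, x) ~ exp(-2*n*x**2) to get x0 = sqrt(-log(p)/(2n)). The sketch below only evaluates that formula; it does not reproduce the C code, and `asymptotic_start` and `p_where_start_leaves_unit_interval` are hypothetical names. The starting point lies outside [0, 1] whenever p < exp(-2n), which is consistent with the NaN pattern in S1 (n=1: NaN for p <= 0.12, fine at 0.14; n=2: NaN for p <= 0.01, fine at 0.02). Whether that is the actual trigger inside kolmogorov.c is an assumption.

```python
import math


def asymptotic_start(n, p):
    """Initial estimate of x from smirnov(n, x) ~ exp(-2*n*x**2), for 0 < p <= 1."""
    return math.sqrt(-math.log(p) / (2.0 * n))


def p_where_start_leaves_unit_interval(n):
    """Below this p, the asymptotic estimate exceeds 1."""
    return math.exp(-2.0 * n)


for n in (1, 2):
    print(n, p_where_start_leaves_unit_interval(n))

# n=1: the true answer is 1 - p, so the asymptotic start is far off.
for p in (0.12, 0.14, 0.5, 0.9):
    print(p, asymptotic_start(1, p), 1.0 - p)
```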

S4. The algorithm used to compute smirnov() changes between n=1012 & 1013.

nCv, the number of combinations of v items from n objects, is no longer representable as a 64-bit floating point number for n=1013, v=516, and this multiplier appears in the calculation. So the code switches to a slightly different algorithm for n >= 1013.
The second algorithm computes a particular multiplication as the exp of a sum of logs. The log(nCv) factor becomes a difference of lgam(_) values, hence it is not as accurate as the first algorithm for small n.

This switch in and of itself should result in only small differences. In reality, however, the first algorithm already fails for n much lower than 1013.
smirnov() computes a sum of triple products. The issue is that the triple product is a product of two potentially very small pow() values and a potentially large multiplier. If the result of either pow() is 0, then the large multiplier cannot be used to recover the true value. As such, the first algorithm often picks up the first term in the sum of triple products, but doesn't accumulate the subsequent much larger terms. If working with sums of logs as in the second approach, these don't underflow, allowing a later valid exponentiation.

I.e. the first algorithm, the one for small n, is currently underreporting the probability for some range of values of n and x.
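The underflow can be illustrated on a single term of the sum. The sketch below is pure Python rather than the cephes code, and `term_direct` and `term_log` are hypothetical names. For n=1000, x=0.5, v=100 the factor (1 - x - v/n)**(n-v) = 0.4**900 underflows to 0.0, so the direct product is 0 even though the true term is many orders of magnitude larger than the v=0 term.

```python
import math


def term_direct(n, v, x):
    """One term C(n,v) * (x + v/n)**(v-1) * (1 - x - v/n)**(n-v), computed directly."""
    return math.comb(n, v) * (x + v / n) ** (v - 1) * (1.0 - x - v / n) ** (n - v)


def term_log(n, v, x):
    """Natural log of the same term, using lgamma for log C(n, v)."""
    log_binom = math.lgamma(n + 1) - math.lgamma(v + 1) - math.lgamma(n - v + 1)
    return (log_binom
            + (v - 1) * math.log(x + v / n)
            + (n - v) * math.log(1.0 - x - v / n))


n, x = 1000, 0.5
print(term_direct(n, 0, x))    # tiny but nonzero
print(term_direct(n, 100, x))  # 0.0: the second power underflows
print(term_log(n, 0, x), term_log(n, 100, x))
print(math.exp(term_log(n, 100, x)))  # representable once the logs are summed first
```

Summing the logs before exponentiating keeps the large binomial coefficient and the tiny powers from being evaluated separately, which is the property of the second algorithm described above.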

…scipy.stats.ksone) scipy.special.{smirnov,smirnovi} have accuracy & convergence issues incorrect values for scipy.stats.ksone.pdf() Add a smirnovp(), derivative of smirnov(), function. Use smirnovp() in smirnovi() instead of an asymptotic. If the initial estimate for x in smirnovi() is not going to work for Newton-Raphson (because it is on a "bad" part of the curve), perform some bisection steps to get a better estimate, one which is on the correct side of the desired value. Replace the relative stopping criteria with the same logic used in np.allclose(). With these changes, the algorithm seems to converge in about 5 NR iterations (10 function evaluations) for most values of p. Exposing smirnovp() allows its use in scipy.stats.ksone() which addresses the inaccurate pdf(). Requires changes in many cephes infrastructure files, adding doc, exports.
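The combination described in this commit message (Newton-Raphson with the true derivative, a bisection fallback, and an np.allclose-style stopping rule) can be sketched as follows. This is an illustration under stated assumptions, not the code in kolmogorov.c: `solve_decreasing` is a hypothetical helper, the tolerances are arbitrary, and the n=2 closed form stands in for smirnov() and smirnovp().

```python
import math


def sf_n2(x):
    """smirnov(2, x) in closed form."""
    return (1.0 - x) ** 2 if x >= 0.5 else 1.0 - x - x * x


def sf_n2_prime(x):
    """Derivative of sf_n2 (stand-in for smirnovp)."""
    return -2.0 * (1.0 - x) if x >= 0.5 else -1.0 - 2.0 * x


def solve_decreasing(f, fprime, target, lo=0.0, hi=1.0,
                     rtol=1e-12, atol=1e-14, max_iter=100):
    """Find x in [lo, hi] with f(x) == target, for a decreasing f.

    Newton steps are accepted only if they stay inside the current
    bracket; otherwise a bisection step is taken instead.
    """
    x = 0.5 * (lo + hi)
    for _ in range(max_iter):
        fx = f(x) - target
        if fx == 0.0:
            return x
        if fx > 0.0:
            lo = x  # f is decreasing, so the root lies to the right
        else:
            hi = x
        d = fprime(x)
        x_new = x - fx / d if d != 0.0 else None
        if x_new is None or not (lo <= x_new <= hi):
            x_new = 0.5 * (lo + hi)  # bisection fallback
        # Same form as np.allclose: |a - b| <= atol + rtol * |b|
        if abs(x_new - x) <= atol + rtol * abs(x_new):
            return x_new
        x = x_new
    return x


for p in (0.01, 0.5, 0.99):
    print(p, solve_decreasing(sf_n2, sf_n2_prime, p))
```

Keeping a bracket means an overshooting Newton step can never leave [0, 1], which addresses the out-of-range estimates described under S1 to S3.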

* smirnovi_stability: Addresses scipy#7456, scipy#7426, scipy#7455 (scipy.special.smirnov, scipy.stats.ksone) scipy.special.{smirnov,smirnovi} have accuracy & convergence issues incorrect values for scipy.stats.ksone.pdf()

This gets rid of a lot of unused functionality, simplifying the code quite a bit, plus fixes incorrect handling of endianness of output arrays, making the fixes in scipy#7456 unnecessary. For simplicity, some performance is lost in the case of a non-aligned or byteswapped user provided output array.


* smirnovi_stability: Added smirnovc/smirnovci, the CDF for the Smirnov probability distribution. Addresses scipy#7456, scipy#7426, scipy#7455 (scipy.special.smirnov, scipy.stats.ksone) scipy.special.{smirnov,smirnovi} have accuracy & convergence issues; incorrect values for scipy.stats.ksone.pdf()

…ogorov. Addresses scipy#7426, scipy#7455, scipy#7456, scipy#7492. scipy.special.cephes: A rewrite of the algorithms in the Kolmogorov-Smirnov statistics package. Kolmogorov: Improve efficiency of kolmogorov(). Fixed NANs, false convergence, unbounded computation in kolmogi(). Added dedicated functions (kolmogc/kolmogci/kolmogp) to compute the CDF, PPF and PDF. Smirnov: Improve accuracy of smirnov(). Computation suffered from underflow and/or denormalization, resulting in probabilities occasionally off by orders of magnitude. The SF/CDF is now usually within 1 ULP, the PDF within a few ULP. Fixed NANs, convergence failures, false convergence in smirnovi(). Added dedicated functions (smirnovc/smirnovci/smirnovp) to compute the CDF, PPF and PDF. Added additional testing of K-S, removed xfail on previous tests. scipy.stats: ksone and kstwobign classes. Added dedicated _sf, _pdf, _isf methods to call new K-S functionality. Updated _cdf and _ppf methods to make appropriate calls. The accuracy improvements to smirnov() required increased precision in some parts of the computation. Ported most of the dd_real part of the qd package and used it in kolmogorov.c

…ogorov. Addresses scipy#7426, scipy#7455, scipy#7456, scipy#7492. scipy.special.cephes: A rewrite of the algorithms in the Kolmogorov-Smirnov statistics package. Kolmogorov: Improve efficiency of kolmogorov(). Fixed NANs, false convergence, unbounded computation in kolmogi(). Added dedicated functions (kolmogc/kolmogci/kolmogp) to compute the CDF, PPF and PDF. Smirnov: Improve accuracy of smirnov(). Computation suffered from underflow and/or denormalization, resulting in probabilities occasionally off by orders of magnitude. The SF/CDF is now usually within 1 ULP, the PDF within a few ULP. Fixed NANs, convergence failures, false convergence in smirnovi(). Added dedicated functions (smirnovc/smirnovci/smirnovp) to compute the CDF, PPF and PDF. Added additional testing of K-S, removed xfail on previous tests. scipy.stats: ksone and kstwobign classes. Added dedicated _sf, _pdf, _isf methods to call new K-S functionality. Updated _cdf and _ppf methods to make appropriate calls. The accuracy improvements to smirnov() required increased precision in some parts of the computation. Uses the recent port of the dd_real part of the qd package. Allow FuncData testing to specify a list/array of dtypes to apply. FuncData.check(dtypes=[int, float, float]) will treat the first 3 columns as type int, float, float respectively, rather than treating all columns as the underlying type of the data. Needed for testing some C functions that take an int as a parameter.

rgommers · 2018-10-30T05:29:44Z

Fixed by gh-8737

rgommers added the defect and scipy.special labels Jun 5, 2017

pvanmulbregt mentioned this issue Jun 11, 2017

Fix #7456, #7426, #7455 scipy.special.smirnovi & scipy.stats.ksone #7481

Closed

jaimefrio mentioned this issue Jul 8, 2017

MAINT: Simplify ndimage array converters. #7573

Merged

pvanmulbregt mentioned this issue Mar 1, 2018

BUG: Improve numerical precision/convergence failures of smirnov/kolmogorov #8503

Closed

pvanmulbregt mentioned this issue Apr 9, 2018

BUG: Improve numerical precision/convergence failures of smirnov/kolmogorov #8699

Closed

pvanmulbregt mentioned this issue Apr 16, 2018

BUG: Improve numerical precision/convergence failures of smirnov/kolmogorov #8737

Merged

rgommers added this to the 1.2.0 milestone Oct 30, 2018

rgommers closed this as completed Oct 30, 2018

pvanmulbregt mentioned this issue Feb 9, 2019

Master Tracker for some Kolmogorov-Smirnov test Issues #9797

Closed
