bugs stats entropy #2765

Closed
josef-pkt opened this Issue Aug 21, 2013 · 8 comments

4 participants

@josef-pkt
SciPy member

Running the script at the bottom with Python 3.3 and scipy 0.12.0
shows the following differences between the reported entropy and the one calculated by numerical integration (the generic code).

This should have fewer false positives than the tests I did before with comparison to a large random sample (theoretical entropy versus sample entropy).

distributions raising an exception

[['reciprocal', (0.0062309367010521255, 1.0062309367010522)], 
['truncnorm', (-1.0978730080013919, 2.730675410903198)]]

distribution with more than 0.1 relative tolerance
the last value is what the distribution returns
the second-to-last value is from numerical integration

(not shown: a very large number of warnings)

['alpha', (3.570477051665046,), -1.036769456792257, nan]
['dgamma', (1.1023326088288166,), 1.7876451235296962, nan]
['erlang', (20,), 2.899928334598662, -16.100071665401337]
['exponpow', (2.697119160358469,), -0.023913603818474717, -0.028157783503781696]
['gamma', (1.9932305483800778,), 1.5748065290920485, 0.58157598071197558]
['genextreme', (-0.1,), 1.6262384754944788, nan]
['gumbel_l', (), 1.5634607039735562, 1.0608407169541685]
['gumbel_r', (), 1.5634607039735562, 1.0608407169541685]
['invgauss', (0.14546264555347513,), -1.5749344358698636, nan]
['ksone', (1000,), nan, nan]
['kstwobign', (), nan, nan]
['levy', (), 3.3244828015117034, -0.028980334288262362]
['levy_l', (), 3.3244828015117034, -0.028980334288262362]
['logistic', (), 1.984183489108535, 1.0]
['recipinvgauss', (0.6300426780936912,), 1.7750403907054204, nan]
['wald', (), 0.8769456078722748, nan]

Notes:

  • I wrote the script originally for scipy 0.9 and python 2.6 (no xlogy)
  • it uses an old copy of the distribution cases from the test suite that is importable from the statsmodels test suite.
  • the cont_entropy function is almost completely a copy of the generic code in stats.distributions
# -*- coding: utf-8 -*-
"""

Created on Wed Aug 21 12:06:48 2013
Author: Josef Perktold
"""

import numpy as np
from scipy import special, integrate, stats

# Shim for old scipy without special.xlogy; the small offset avoids log(0),
# at the cost of a tiny bias where y == 0.
def xlogy(x, y):
    return x * np.log(y + 1e-30)


def cont_entropy(self, *args):
    def integ(x):
        val = self._pdf(x, *args)
        return xlogy(val, val)

    entr = -integrate.quad(integ, self.a, self.b)[0]
    if not np.isnan(entr):
        return entr
    else:
        # retry with quantile-based limits if the integration ran into problems
        low, upp = self.ppf([0.001, 0.999], *args)
        upper = upp if np.isinf(self.b) else self.b
        lower = low if np.isinf(self.a) else self.a
        return -integrate.quad(integ, lower, upper)[0]

print(cont_entropy(stats.rayleigh))
print(stats.rayleigh._entropy())

from statsmodels.sandbox.distributions.tests.distparams import distcont, distdiscrete

res = []
res_raised = []
for distname, distargs in distcont[:]:
    # newer scipy uses invgauss instead of invnorm
    if distname == 'invnorm':
        distname = 'invgauss'
    distfn = getattr(stats, distname)

    try:
        e1 = cont_entropy(distfn, *distargs)
        e2 = distfn._entropy(*distargs)
        res.append([distname, distargs, e1, e2])
    except Exception:
        # print('exception in', distfn, distargs)
        res_raised.append([distname, distargs])

print("distributions raising an exception")
print(res_raised)

print("distribution with more than 0.1 relative tolerance")        
for d in res:
    if not np.allclose(d[-2], d[-1], rtol=0.1):
        print(d)
@josef-pkt
SciPy member

As a cross-reference: gh-1329 has a similar list with (possibly) remaining bugs for the higher moments and the stats/skew/kurtosis methods.

@WarrenWeckesser

I added the easy-fix label, because most of these are either already working or will be easy to fix.
Here's an update on the distributions @josef-pkt shows above:

Appear OK in master
* alpha
* dgamma
* genextreme
* invgauss
* levy
* levy_l
* wald
* recipinvgauss
* reciprocal
* truncnorm

Easy fix
* gamma: The formula in gamma._entropy is wrong (see the Wikipedia article on the gamma distribution). Fixing this will also fix erlang.
* gumbel_r: The entropy is just gam + 1 where gam is the Euler constant 0.577... (See the wikipedia article for the Gumbel distribution.) The code has this hardcoded as 1.06... I don't know where that number comes from. gumbel_l is just gumbel_r with x -> -x, so its entropy is the same.
* logistic: The entropy of the standard distribution is simply 2, but logistic._entropy returns 1.
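The closed forms above can be checked against a small numerical integration of -pdf*ln(pdf). This is a sketch written for this comment (num_entropy is a local helper, not scipy API); the gamma shape parameter is the one from the report above:

```python
import numpy as np
from scipy import integrate, special, stats

def num_entropy(dist, *args):
    # differential entropy by numerical integration of -pdf * ln(pdf);
    # special.xlogy returns 0 where the pdf is 0, so the tails are safe
    f = lambda x: -special.xlogy(dist.pdf(x, *args), dist.pdf(x, *args))
    lo, hi = dist.ppf(1e-12, *args), dist.ppf(1 - 1e-12, *args)
    return integrate.quad(f, lo, hi)[0]

a = 1.9932305483800778
# gamma(a), scale 1: a + ln(Gamma(a)) + (1 - a) * psi(a)
gamma_closed = a + special.gammaln(a) + (1 - a) * special.psi(a)
gumbel_closed = np.euler_gamma + 1   # same for gumbel_r and gumbel_l
logistic_closed = 2.0

print(gamma_closed, num_entropy(stats.gamma, a))
print(gumbel_closed, num_entropy(stats.gumbel_r))
print(logistic_closed, num_entropy(stats.logistic))
```

The gamma closed form agrees with the numerically integrated 1.5748... from the list in the report, confirming it is gamma._entropy that is off.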

Needs further investigation
* exponpow ("overflow encountered")
* ksone (Perhaps the shape parameter 1000 is just too big? Using 10 returns a finite number)
* kstwobign

For some of the distributions, there is a known explicit formula for the entropy that is not used (e.g. levy). An additional easy task would be to add an _entropy method that uses the explicit formula wherever possible. If anyone does this, note that _entropy should calculate the entropy for the standard distribution (loc=0, scale=1).
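For levy, for instance, the standard-distribution entropy has the closed form (1 + 3*gamma + ln(16*pi))/2 with gamma the Euler-Mascheroni constant (per the Wikipedia article on the Lévy distribution), which matches the numerically integrated value in the report. A sketch of the cross-check (the integral is split at 10 only to help quad with the heavy tail):

```python
import numpy as np
from scipy import integrate, special, stats

# closed-form differential entropy of the standard (loc=0, scale=1) Levy
# distribution: (1 + 3*gamma + ln(16*pi)) / 2
levy_closed = 0.5 * (1 + 3 * np.euler_gamma + np.log(16 * np.pi))

# cross-check by numerical integration; xlogy handles pdf == 0 at x = 0
f = lambda x: -special.xlogy(stats.levy.pdf(x), stats.levy.pdf(x))
levy_num = integrate.quad(f, 0, 10)[0] + integrate.quad(f, 10, np.inf)[0]

print(levy_closed, levy_num)  # both near 3.3245
```

An _entropy method returning levy_closed would replace the generic integration entirely for this distribution.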

reciprocal and truncnorm raise an exception in Josef's code because it bypasses a required call to _argcheck. In those distributions, _argcheck creates additional attributes that are used in the pdf and entropy calculations.
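To illustrate: going through the public entropy method, which runs _argcheck first and so sets those attributes, avoids the exception for both distributions (parameter values taken from the failing cases reported above; assumes a current scipy):

```python
import numpy as np
from scipy import stats

# shape parameters from the two failing cases in the report
h_tn = stats.truncnorm.entropy(-1.0978730080013919, 2.730675410903198)
h_rp = stats.reciprocal.entropy(0.0062309367010521255, 1.0062309367010522)

# both are finite once _argcheck has run
print(h_tn, h_rp)
```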

@josef-pkt
SciPy member

Thanks for updating this.

ksone and kstwobign can be ignored.
They are written specifically for the Kolmogorov-Smirnov tests and only implement sf and ppf, which are what the kstests use. IIRC, they fail all over the place. I looked a bit at the ksone case: a lower shape parameter helps, but it's still strange (the support looks more like (0, 1] in the examples I tried).
In any case it's a failure of the generic code when only cdf or sf is defined, not an entropy failure. I think we don't have any other distributions like these.
(I've never seen ksone used outside of kstest; kstwobign would be worth having available as a functioning distribution, but no closed-form solutions are available for any method.)

@rgommers
SciPy member

After merging gh-2774, this is what's left:

distributions raising an exception
[['reciprocal', (0.0062309367010521255, 1.0062309367010522)], 
 ['truncnorm', (-1.0978730080013919, 2.730675410903198)]]

distribution with more than 0.1 relative tolerance
['ksone', (1000,), nan, nan]
['kstwobign', (), nan, nan]
['ncf', (27, 27, 0.41578441799226107), 0.49078270223194753, -0.0]
@ev-br
SciPy member

It can probably be closed now: the truncnorm and reciprocal failures are explained by Warren above, ksone/kstwobign are special anyway, and ncf seems to have an accuracy issue, special-cased in https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L428

@josef-pkt
SciPy member

@EvgeniBurovski Can you open a new issue for numerical problems in ncf? I don't have an overview for which methods the numerical problems are serious.

Then we can close this.

As we get closer to being bug-free, I would prefer issues for individual distributions that still need attention (instead of these generic "entropy or moments don't work" issues).

@ev-br
SciPy member

@josef-pkt done. (not sure if any of these is high priority)

@rgommers
SciPy member

Thanks, closing.

@rgommers rgommers closed this Sep 15, 2013