
bugs stats entropy #2765

Closed
josef-pkt opened this issue Aug 21, 2013 · 9 comments · Fixed by #2774
Labels: defect, good first issue, scipy.stats

@josef-pkt
Member

Running the script at the bottom with Python 3.3 and SciPy 0.12.0
shows the following differences between the reported entropy and the one calculated by numerical integration (generic code).

This should have fewer false positives than the tests I did before, which compared against a large random sample (theoretical entropy versus sample entropy).

distributions raising an exception

[['reciprocal', (0.0062309367010521255, 1.0062309367010522)], 
['truncnorm', (-1.0978730080013919, 2.730675410903198)]]

distributions with more than 0.1 relative tolerance
(the last value is what the distribution returns;
the second-to-last value is from numerical integration)

(not shown: a very large number of warnings)

['alpha', (3.570477051665046,), -1.036769456792257, nan]
['dgamma', (1.1023326088288166,), 1.7876451235296962, nan]
['erlang', (20,), 2.899928334598662, -16.100071665401337]
['exponpow', (2.697119160358469,), -0.023913603818474717, -0.028157783503781696]
['gamma', (1.9932305483800778,), 1.5748065290920485, 0.58157598071197558]
['genextreme', (-0.1,), 1.6262384754944788, nan]
['gumbel_l', (), 1.5634607039735562, 1.0608407169541685]
['gumbel_r', (), 1.5634607039735562, 1.0608407169541685]
['invgauss', (0.14546264555347513,), -1.5749344358698636, nan]
['ksone', (1000,), nan, nan]
['kstwobign', (), nan, nan]
['levy', (), 3.3244828015117034, -0.028980334288262362]
['levy_l', (), 3.3244828015117034, -0.028980334288262362]
['logistic', (), 1.984183489108535, 1.0]
['recipinvgauss', (0.6300426780936912,), 1.7750403907054204, nan]
['wald', (), 0.8769456078722748, nan]

Notes:

  • I wrote the script originally for scipy 0.9 and python 2.6 (no xlogy)
  • it uses an old copy of the distribution cases from the test suite that is importable from the statsmodels test suite.
  • the cont_entropy function is almost completely a copy of the generic code in stats.distributions
# -*- coding: utf-8 -*-
"""

Created on Wed Aug 21 12:06:48 2013
Author: Josef Perktold
"""

import numpy as np
from scipy import special, integrate, stats

# Fallback for old SciPy versions without special.xlogy;
# the small epsilon avoids log(0).
def xlogy(x, y):
    return x * np.log(y + 1e-30)


def cont_entropy(self, *args):
    # Numerical differential entropy: -integral of pdf * log(pdf) over
    # the support; near-copy of the generic code in stats.distributions.
    def integ(x):
        val = self._pdf(x, *args)
        return xlogy(val, val)

    entr = -integrate.quad(integ, self.a, self.b)[0]
    if not np.isnan(entr):
        return entr
    else:
        # retry with truncated limits if the integration had problems
        low, upp = self.ppf([0.001, 0.999], *args)
        lower = low if np.isinf(self.a) else self.a
        upper = upp if np.isinf(self.b) else self.b
        return -integrate.quad(integ, lower, upper)[0]

print(cont_entropy(stats.rayleigh))
print(stats.rayleigh._entropy())

from statsmodels.sandbox.distributions.tests.distparams import distcont, distdiscrete

res = []
res_raised = []
for distname, distargs in distcont:
    # newer scipy uses invgauss instead of invnorm
    if distname == 'invnorm':
        distname = 'invgauss'
    distfn = getattr(stats, distname)

    try:
        e1 = cont_entropy(distfn, *distargs)
        e2 = distfn._entropy(*distargs)
        res.append([distname, distargs, e1, e2])
    except Exception:
        # print('exception in', distfn, distargs)
        res_raised.append([distname, distargs])

print("distributions raising an exception")
print(res_raised)

print("distribution with more than 0.1 relative tolerance")
for d in res:
    if not np.allclose(d[-2], d[-1], rtol=0.1):
        print(d)
@josef-pkt
Member Author

As a cross-reference: gh-1329 has a similar list of (possibly) remaining bugs for the higher moments and stats (skew, kurtosis).

@WarrenWeckesser
Member

I added the easy-fix label, because most of these are either already working or will be easy to fix.
Here's an update on the distributions @josef-pkt shows above:

Appear OK in master
* alpha
* dgamma
* genextreme
* invgauss
* levy
* levy_l
* wald
* recipinvgauss
* reciprocal
* truncnorm

Easy fix
* gamma: The formula in gamma._entropy is wrong (see the Wikipedia article). Fixing this will also fix erlang.
* gumbel_r: The entropy is just gam + 1, where gam is the Euler constant 0.577... (See the Wikipedia article for the Gumbel distribution.) The code has this hardcoded as 1.06...; I don't know where that number comes from. gumbel_l is just gumbel_r with x -> -x, so its entropy is the same.
* logistic: The entropy of the standard distribution is simply 2, but logistic._entropy returns 1.
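
The closed forms above can be checked directly against numerical integration (a sketch; the integration limits are chosen by hand to comfortably cover the supports):

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import xlogy

# Differential entropy by direct numerical integration of
# -integral pdf(x) * log(pdf(x)) dx over a wide interval.
def numerical_entropy(dist, lo, hi):
    pdf = dist.pdf
    return integrate.quad(lambda x: -xlogy(pdf(x), pdf(x)), lo, hi)[0]

print(numerical_entropy(stats.gumbel_r, -20, 60))  # close to euler_gamma + 1
print(np.euler_gamma + 1)                          # 1.5772156649...
print(numerical_entropy(stats.logistic, -60, 60))  # close to 2.0
```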

Needs further investigation
* exponpow ("overflow encountered")
* ksone (Perhaps the shape parameter 1000 is just too big? Using 10 returns a finite number)
* kstwobign

For some of the distributions, there is a known explicit formula for the entropy that is not used (e.g. levy). An additional easy task would be to add an _entropy method that uses the explicit formula wherever possible. If anyone does this, note that _entropy should calculate the entropy for the standard distribution (loc=0, scale=1).
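
As an illustration of that pattern, here is a hypothetical explicit entropy for the standard Lévy distribution, using the closed form (1 + 3*gamma + ln(16*pi)) / 2 from the Wikipedia article; its value, 3.3244828..., also matches the numerical-integration column for levy in the report above:

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import xlogy

# Hypothetical explicit entropy for the standard Levy distribution
# (loc=0, scale=1); gamma is the Euler-Mascheroni constant.
def levy_entropy():
    return 0.5 * (1.0 + 3.0 * np.euler_gamma + np.log(16.0 * np.pi))

# Cross-check against -integral pdf * log(pdf) over the support (0, inf).
pdf = stats.levy.pdf
numeric = integrate.quad(lambda x: -xlogy(pdf(x), pdf(x)), 0, np.inf)[0]
print(levy_entropy(), numeric)  # both close to 3.3244828...
```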

reciprocal and truncnorm raise an exception in Josef's code because it bypasses a required call to _argcheck. In those distributions, _argcheck creates additional attributes that are used in the pdf and entropy calculations.
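
A sketch of the difference: the public API runs _argcheck first, so the same shape parameters that raise in the script work through a frozen distribution (the shapes are the ones from the original report):

```python
import numpy as np
from scipy import stats

# The public entropy() call goes through _argcheck, which for truncnorm
# (and reciprocal) sets up the internal attributes that the pdf and
# entropy calculations need; the script above called _pdf/_entropy
# directly and skipped that step.
h_truncnorm = stats.truncnorm(-1.0978730080013919, 2.730675410903198).entropy()
h_reciprocal = stats.reciprocal(0.0062309367010521255, 1.0062309367010522).entropy()
print(h_truncnorm, h_reciprocal)  # finite values, no exception
```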

@josef-pkt
Member Author

Thanks for updating this.

ksone and kstwobign can be ignored.
They are written specifically for the Kolmogorov-Smirnov tests and only implement sf and ppf, which are what kstest uses. IIRC, they fail all over the place. I looked a bit at the ksone case: a lower parameter helps, but it's still strange (the support looks more like (0, 1] in the examples I tried).
In any case it's a failure of the generic code when only cdf or sf is defined, not an entropy failure. I don't think we have any other distribution like these.
(I've never seen ksone used outside of kstest; kstwobign would be worth having available as a functioning distribution, but no closed-form solutions are available for any method.)

@rgommers
Member

After merging gh-2774, this is what's left:

distributions raising an exception
[['reciprocal', (0.0062309367010521255, 1.0062309367010522)], 
 ['truncnorm', (-1.0978730080013919, 2.730675410903198)]]

distribution with more than 0.1 relative tolerance
['ksone', (1000,), nan, nan]
['kstwobign', (), nan, nan]
['ncf', (27, 27, 0.41578441799226107), 0.49078270223194753, -0.0]

@ev-br
Member

ev-br commented Sep 15, 2013

It can probably be closed now: the truncnorm and reciprocal failures are explained by Warren above, ksone/kstwobign are special anyway, and ncf seems to have an accuracy issue, which is special-cased in https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L428

@josef-pkt
Member Author

@EvgeniBurovski Can you open a new issue for the numerical problems in ncf? I don't have an overview of which methods have serious numerical problems.

Then we can close this.

As we get closer to being bug-free, I would prefer issues for individual distributions that still need attention (instead of these generic "entropy or moments don't work" issues).

@ev-br
Member

ev-br commented Sep 15, 2013

@josef-pkt done. (Not sure if any of these is high priority.)

@rgommers
Member

Thanks, closing.

@pvanmulbregt
Contributor

#8737 fixes the entropy for kstwobign (accuracy of special.{kolmogi,kolmogp}).
#8737 and #9661 together fix the entropy for ksone (accuracy of special.{smirnovi,smirnovp} and setting an appropriate right endpoint in ksone_gen).
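
On a SciPy that includes both fixes, the previously-nan entropies can be spot-checked (a sketch; n=1000 is the shape parameter from the original report):

```python
import numpy as np
from scipy import stats

# Both of these returned nan in the original report; with the
# special.kolmog*/smirnov* accuracy fixes and the corrected ksone
# support endpoint, they should come back finite.
h_ksone = stats.ksone(1000).entropy()
h_kstwobign = stats.kstwobign.entropy()
print(h_ksone, h_kstwobign)
```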
