
Improve distribution fitting as described in tickets #1553 and #1675 #462

Closed
wants to merge 3 commits into from

4 participants

@pv
SciPy member
pv commented

The idea here is good.

However, I think the penalty should not go into the nnlf() method, as inf is the correct (even if not useful) value to return. Rather, add a new method _penalized_nnlf that computes the out-of-bounds penalized nnlf, use this method in the fit() routine instead of nnlf().
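To illustrate the split being proposed (a toy sketch with hypothetical names, not the actual scipy code): nnlf keeps returning inf outside the support, while the penalized variant stays finite and grows with the number of out-of-support points:

```python
import math

def nnlf(x, loc, scale):
    # plain negative log-likelihood of a unit exponential on [0, inf)
    if scale <= 0:
        return math.inf
    z = [(xi - loc) / scale for xi in x]
    if any(zi <= 0 for zi in z):
        return math.inf          # correct, but gives the optimizer nothing to work with
    return sum(z) + len(z) * math.log(scale)

def penalized_nnlf(x, loc, scale, penalty=100.0):
    # same likelihood, but each out-of-support point adds a finite penalty
    if scale <= 0:
        return math.inf
    z = [(xi - loc) / scale for xi in x]
    good = [zi for zi in z if zi > 0]
    n_bad = len(z) - len(good)
    return sum(good) + len(good) * math.log(scale) + n_bad * penalty
```

With an invalid loc, nnlf is inf for every nearby candidate, while penalized_nnlf still decreases as fewer points are violated.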

@pbrod

Ok. Restored nnlf routine and added _penalized_nnlf.

@pv
SciPy member
pv commented

Another alternative could be to use minimize with constraints. The penalty will probably be OK, but a constrained minimizer might be more robust.

@josef-pkt
SciPy member

This should get at least one test case, and also a check of what happens if goodargs is empty.

I don't think it's possible to set up the constraints on the parameters themselves without recoding the conditions from _argcheck for the cases where they are not the default > 0. Although I don't have much experience with constrained solvers.

Also, as far as I know, constrained solvers require "valid" starting values, i.e. starting values that satisfy the constraints.
One of the big advantages of Per's proposal is that in some cases we can still recover from starting values that are invalid. Currently they return inf; after this change they would return large but finite numbers that increase with the number of violations. That would help us get around some of the awful starting values that we have.
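A toy illustration of that last point (hypothetical code, not from the patch): with inf, every invalid candidate looks equally bad; a penalty proportional to the violation count gives the minimizer a slope back toward the valid region:

```python
def violation_cost(data, loc, penalty=100.0):
    # toy stand-in for the out-of-support term of a penalized objective:
    # each data point at or below loc counts as one violation
    n_bad = sum(1 for xi in data if xi <= loc)
    return n_bad * penalty

data = [1.0, 2.0, 3.0]
# moving loc from an invalid value toward a valid one strictly decreases
# the cost, so a downhill search can escape the invalid region
costs = [violation_cost(data, loc) for loc in (3.5, 2.5, 1.5, 0.5)]
```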

@josef-pkt josef-pkt commented on the diff
scipy/stats/distributions.py
@@ -1689,12 +1690,38 @@ def nnlf(self, theta, x):
if not self._argcheck(*args) or scale <= 0:
return inf
x = asarray((x-loc) / scale)
- cond0 = (x <= self.a) | (x >= self.b)
+ cond0 = (x <= self.a) | (self.b <= x )
@josef-pkt SciPy member

we could avoid these checks for the unbounded support cases self.a = -inf, self.b=+inf

@josef-pkt SciPy member

wrong line, I meant in penalized_nnlf

@pbrod
pbrod added a note

What do you mean? How can we avoid the checks?

@josef-pkt
SciPy member

about constraints for solvers:
_argcheck is only relevant for the distributions whose .a, .b depend on the shape parameters.
In most cases, we only have conditions on loc and scale (often only loc?).
fmin_slsqp can handle general nonlinear constraints; we would have up to 2*len(x) constraints of the form a <= x <= b.
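For reference, a minimal fmin_slsqp sketch with one inequality constraint (illustrative only; as discussed, this does not by itself handle the _argcheck conditions):

```python
from scipy.optimize import fmin_slsqp

# minimize (x - 2)^2 subject to x >= 1; inequality constraint
# functions must return values >= 0 at feasible points
xopt = fmin_slsqp(lambda x: (x[0] - 2.0) ** 2,
                  x0=[0.0],
                  ieqcons=[lambda x: x[0] - 1.0],
                  iprint=0)
```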

lognormal might be a simple test case (with true loc > 0, so we have invalid default starting values).
(I don't remember what I looked at, when the ticket was opened.)

@pv pv commented on an outdated diff
scipy/stats/distributions.py
((14 lines not shown))
+ ''' Return negative loglikelihood function,
+ i.e., - sum (log pdf(x, theta),axis=0)
+ where theta are the parameters (including loc and scale)
+ '''
+ try:
+ loc = theta[-2]
+ scale = theta[-1]
+ args = tuple(theta[:-2])
+ except IndexError:
+ raise ValueError("Not enough input arguments.")
+ if not self._argcheck(*args) or scale <= 0:
+ return inf
+ x = asarray((x-loc) / scale)
+ cond0 = (x <= self.a) | (self.b <= x )
+ if (any(cond0)):
+ goodargs = argsreduce(1 - cond0, *((x,)))
@pv SciPy member
pv added a note

1 - cond0 -> ~cond0 as it's a boolean operation
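The point in short: on NumPy boolean arrays, 1 - cond0 silently becomes an integer array, while ~cond0 stays a boolean mask:

```python
import numpy as np

cond0 = np.array([True, False, True])
int_version = 1 - cond0   # integer array: [0, 1, 0]
bool_mask = ~cond0        # boolean mask: [False, True, False]
```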

@pv
SciPy member
pv commented

Ah OK, now I understand the issue with constraints, it's not as simple as I thought... As a general method, penalization is probably the simplest approach.

With some additional tests (e.g. for the bad cases in the tickets), I think we want to merge this.

@josef-pkt
SciPy member

an additional comment:
In some distributions we can still get many local minima (I think pareto is an example). If fit converges after this pull request, it might still return only one of the local minima (instead of nans, or the starting values, as now).

I think a warning in the docstring about that would be helpful.

Aside (not for this PR):
it might be interesting to check whether basinhopping is properly connected as a choice of minimizer, and whether we would gain a few more distributions that fit correctly by using it.
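A standalone basinhopping sketch on a two-minimum objective (illustrative; whether it is wired into fit as a minimizer choice is exactly the open question above):

```python
from scipy.optimize import basinhopping

# quartic with two global minima at x = -1 and x = +1; a plain local
# descent from x0 = 3.0 finds only the nearest basin, basinhopping
# perturbs and re-minimizes to explore several basins
def f(x):
    return (x[0] ** 2 - 1.0) ** 2

res = basinhopping(f, x0=[3.0], niter=50)
```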

@pbrod pbrod committed 89a601c
Simplified the fit method.
Made sure the fit_loc_scale method returns finite values in order to avoid failures in the fitting of the distributions
@josef-pkt josef-pkt commented on the diff
scipy/stats/distributions.py
((12 lines not shown))
+
+ def _penalized_nnlf(self, theta, x):
+ ''' Return negative loglikelihood function,
+ i.e., - sum (log pdf(x, theta),axis=0)
+ where theta are the parameters (including loc and scale)
+ '''
+ try:
+ loc = theta[-2]
+ scale = theta[-1]
+ args = tuple(theta[:-2])
+ except IndexError:
+ raise ValueError("Not enough input arguments.")
+ if not self._argcheck(*args) or scale <= 0:
+ return inf
+ x = asarray((x-loc) / scale)
+ cond0 = (x <= self.a) | (self.b <= x )
@josef-pkt SciPy member

To avoid this in unnecessary cases (t distribution, ...), add in front of this part:

if np.isneginf(self.a).all() and np.isinf(self.b).all(): 

or something like this

Or maybe this is already handled somewhere else?

@pv
SciPy member
pv commented

Here's a fixed branch https://github.com/pv/scipy-work/compare/pr-462

I fixed the above comments, and re-enabled the continuous distribution fitting tests (test runtime roughly <2min in total, so OK). After this PR, the parameter fits succeed for 29 more distributions as compared to Scipy 0.12.0.
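A smoke test in the spirit of the re-enabled tests (a sketch; lognorm with loc > 0 is the kind of problem case mentioned earlier in this thread):

```python
import numpy as np
from scipy import stats

data = stats.lognorm.rvs(0.5, loc=2.0, scale=1.0, size=500, random_state=123)

# the true loc > 0 makes the default starting values invalid; with the
# penalized objective the fit still returns finite parameter estimates
shape, loc, scale = stats.lognorm.fit(data)
```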

Merge? Y/N?

@rgommers
SciPy member

Merging this sounds good to me. Tested, no issues seen. A few minor comments:

    # Skip failing fits unless overrided

typo: overrided --> overridden

    xfail = True
    try: xfail = not int(os.environ['SCIPY_XFAIL'])
    except: pass 

PEP8: add some line breaks. Also, should SCIPY_XFAIL be documented somewhere?
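For instance, the snippet could be unfolded along these lines (a sketch; the narrowed except clause covers the unset-variable and non-integer cases that the bare except: was catching):

```python
import os

# skip failing fits unless overridden via the SCIPY_XFAIL environment variable
xfail = True
try:
    xfail = not int(os.environ['SCIPY_XFAIL'])
except (KeyError, ValueError):
    pass
```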

@josef-pkt
SciPy member

Looks fine to me; getting 29 additional cases to work is pretty good.

I don't understand why (0 < Shat) is used for Shat. In which case can we get a negative result after sqrt?

@pbrod

It is just to make sure it is not zero.

@pbrod pbrod closed this
@pbrod pbrod reopened this
@josef-pkt
SciPy member

shouldn't it then be not (0 <= Shat)?

update: I was reading this back to front; all OK
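The guard in context (toy numbers, hypothetical helper name): a constant sample has mu2hat = 0, so the sqrt gives Shat = 0, which the check replaces with 1:

```python
import math

def safe_loc_scale(muhat, mu2hat, mu, mu2):
    # mirrors the patched end of fit_loc_scale: muhat/mu2hat are sample
    # mean/variance, mu/mu2 the distribution's standardized moments
    Shat = math.sqrt(mu2hat / mu2)
    Lhat = muhat - Shat * mu
    if not math.isfinite(Lhat):
        Lhat = 0
    if not (math.isfinite(Shat) and 0 < Shat):
        Shat = 1
    return Lhat, Shat

# constant data: zero sample variance collapses the scale estimate to 0,
# and the guard falls back to Shat = 1
Lhat, Shat = safe_loc_scale(muhat=5.0, mu2hat=0.0, mu=0.0, mu2=1.0)
```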

@josef-pkt
SciPy member

Looks ready to merge to me

@pv
SciPy member
pv commented

Merged in b4e235c

@pv pv closed this
@ClemensFMN ClemensFMN referenced this pull request
Commit has since been removed from the repository and is no longer available.
@argriffing argriffing referenced this pull request
Merged

improved stats mle fit #4759

Commits
  1. Mar 10, 2013 — @pbrod
  2. Mar 13, 2013 — @pbrod
  3. Mar 29, 2013 — @pbrod: Simplified the fit method. Made sure fit_loc_scale method returns finite values in order to avoid failure in the fitting of the distributions
Showing with 35 additions and 7 deletions.
  1. +35 −7 scipy/stats/distributions.py
scipy/stats/distributions.py
@@ -1677,9 +1677,10 @@ def _nnlf(self, x, *args):
return -sum(self._logpdf(x, *args),axis=0)
def nnlf(self, theta, x):
- # - sum (log pdf(x, theta),axis=0)
- # where theta are the parameters (including loc and scale)
- #
+ ''' Return negative loglikelihood function,
+ i.e., - sum (log pdf(x, theta),axis=0)
+ where theta are the parameters (including loc and scale)
+ '''
try:
loc = theta[-2]
scale = theta[-1]
@@ -1689,12 +1690,35 @@ def nnlf(self, theta, x):
if not self._argcheck(*args) or scale <= 0:
return inf
x = asarray((x-loc) / scale)
- cond0 = (x <= self.a) | (x >= self.b)
+ cond0 = (x <= self.a) | (self.b <= x )
if (any(cond0)):
return inf
else:
N = len(x)
- return self._nnlf(x, *args) + N*log(scale)
+ return self._nnlf(x, *args) + N * log(scale)
+
+ def _penalized_nnlf(self, theta, x):
+ ''' Return negative loglikelihood function,
+ i.e., - sum (log pdf(x, theta),axis=0)
+ where theta are the parameters (including loc and scale)
+ '''
+ try:
+ loc = theta[-2]
+ scale = theta[-1]
+ args = tuple(theta[:-2])
+ except IndexError:
+ raise ValueError("Not enough input arguments.")
+ if not self._argcheck(*args) or scale <= 0:
+ return inf
+ x = asarray((x-loc) / scale)
+ cond0 = (x <= self.a) | (self.b <= x )
+ Nbad = sum(cond0)
+ loginf = log(floatinfo.machar.xmax)
+ if Nbad>0:
+ x = argsreduce(~cond0, x)[0]
+
+ N = len(x)
+ return self._nnlf(x, *args) + N*log(scale) + Nbad * 100.0 * loginf
# return starting point for fit (shape arguments + loc + scale)
def _fitstart(self, data, args=None):
@@ -1719,7 +1743,7 @@ def _reduce_func(self, args, kwds):
x0.append(args[n])
if len(fixedn) == 0:
- func = self.nnlf
+ func = self._penalized_nnlf
restore = None
else:
if len(fixedn) == len(index):
@@ -1737,7 +1761,7 @@ def restore(args, theta):
def func(theta, x):
newtheta = restore(args[:], theta)
- return self.nnlf(newtheta, x)
+ return self._penalized_nnlf(newtheta, x)
return x0, func, restore, args
@@ -1845,6 +1869,10 @@ def fit_loc_scale(self, data, *args):
mu2hat = tmp.var()
Shat = sqrt(mu2hat / mu2)
Lhat = muhat - Shat*mu
+ if not np.isfinite(Lhat):
+ Lhat = 0
+ if not (np.isfinite(Shat) and (0 < Shat)) :
+ Shat = 1
return Lhat, Shat
@np.deprecate