Improve distribution fitting as described in tickets #1553 and #1675 #462

Closed
pv (Member) commented Mar 11, 2013

The idea here is good.

However, I think the penalty should not go into the nnlf() method, as inf is the correct (even if not useful) value to return. Rather, add a new method _penalized_nnlf that computes the out-of-bounds penalized nnlf, and use this method in the fit() routine instead of nnlf().

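To make the proposal concrete, a minimal standalone sketch of such a penalized objective (the function name, signature, and penalty magnitude here are illustrative assumptions, not the code from this PR):

    import numpy as np

    def penalized_nnlf(logpdf, x, a, b):
        # Illustrative sketch: rather than returning inf when observations
        # fall outside the support (a, b), drop those points from the
        # likelihood and add a finite penalty that grows with the number of
        # violations, so the optimizer keeps a direction back to feasibility.
        x = np.asarray(x)
        bad = (x <= a) | (b <= x)
        n_bad = np.count_nonzero(bad)
        nll = -np.sum(logpdf(x[~bad]))                 # nnlf over good points
        penalty = n_bad * np.log(np.finfo(float).max)  # large but finite
        return nll + penalty

Started from infeasible parameters, a simplex search then sees finite values that decrease as violations disappear, instead of a flat inf plateau.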

pbrod (Contributor) commented Mar 13, 2013

OK. Restored the nnlf routine and added _penalized_nnlf.


pv (Member) commented Mar 21, 2013

Another alternative could be to use minimize with constraints. The penalty will probably be OK, but a constrained minimizer might be more robust.


josef-pkt (Member) commented Mar 21, 2013

This should get at least one test case, and also a check of what happens when goodargs is empty.

I don't think it's possible to express the constraints on the parameters themselves without recoding the conditions from _argcheck for the cases where they are not the default > 0. (Although I don't have much experience with the constraint solvers.)

Also, as far as I know, constraint solvers require "valid" starting values, i.e. starting values that satisfy the constraints.
One of the big advantages of Per's proposal is that in some cases we can still recover from invalid starting values. Currently they return inf; after this change they would return large but finite numbers that increase with the number of violations. That would help us get around some of the awful starting values we currently have.


@@ -1689,12 +1690,38 @@ def nnlf(self, theta, x):
         if not self._argcheck(*args) or scale <= 0:
             return inf
         x = asarray((x-loc) / scale)
-        cond0 = (x <= self.a) | (x >= self.b)
+        cond0 = (x <= self.a) | (self.b <= x )

josef-pkt (Member) commented Mar 21, 2013

we could avoid these checks for the unbounded support cases self.a = -inf, self.b = +inf

josef-pkt (Member) commented Mar 21, 2013

wrong line, I meant in penalized_nnlf

pbrod (Contributor) commented Mar 29, 2013

What do you mean? How can we avoid the checks?

josef-pkt (Member) commented Mar 22, 2013

About constraints for solvers:
_argcheck is only relevant for the distributions whose .a, .b depend on the shape parameters.
In most cases we only have conditions on loc and scale (often only loc?).
fmin_slsqp can handle general nonlinear constraints; we would have up to 2*len(x) constraints of the form a <= x <= b.

lognormal might be a simple test case (with true loc > 0, so we have invalid default starting values).
(I don't remember what I looked at when the ticket was opened.)

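As a rough illustration of this alternative (an assumed sketch, not code from this PR: the objective wrapper, the constraint function, and the lognormal test setup are invented for the example), a constrained fit via fmin_slsqp might look like:

    import numpy as np
    from scipy import stats, optimize

    np.random.seed(0)
    # Lognormal data with true loc > 0, the test case suggested above.
    data = stats.lognorm.rvs(0.5, loc=2.0, scale=1.0, size=200)

    def objective(theta):
        # nnlf expects theta = (shapes..., loc, scale); it is still inf
        # whenever a point sits on or outside the support boundary, so
        # SLSQP has to stay strictly feasible throughout.
        return stats.lognorm.nnlf(theta, data)

    def ieqcons(theta):
        # SLSQP keeps every entry of this array >= 0: scale > 0, shape > 0,
        # and x > loc for each observation (the lower half of a <= x <= b;
        # the lognormal support is unbounded above).
        s, loc, scale = theta
        return np.concatenate(([scale, s], data - loc))

    # Starting values must already satisfy the constraints, which is exactly
    # the limitation discussed above.
    theta0 = [1.0, data.min() - 1.0, 1.0]
    theta_hat = optimize.fmin_slsqp(objective, theta0, f_ieqcons=ieqcons,
                                    iprint=0)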

scipy/stats/distributions.py

+        x = asarray((x-loc) / scale)
+        cond0 = (x <= self.a) | (self.b <= x )
+        if (any(cond0)):
+            goodargs = argsreduce(1 - cond0, *((x,)))

pv (Member) commented Mar 22, 2013

1 - cond0 -> ~cond0, as it's a boolean operation
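
A small demonstration of the difference (not from the PR; there the value goes through argsreduce, which treats it as a condition, but used directly as an index the distinction is visible):

    import numpy as np

    x = np.array([0.5, 2.0, 3.5])
    cond0 = (x <= 1.0) | (x >= 3.0)   # array([ True, False,  True])

    print(x[~cond0])      # [ 2.]            boolean mask: keeps good points
    print(1 - cond0)      # [0 1 0]          silently an integer array now
    print(x[1 - cond0])   # [ 0.5  2.   0.5] fancy indexing by position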

pv (Member) commented Mar 22, 2013

Ah OK, now I understand the issue with constraints; it's not as simple as I thought. As a general method, penalization is probably the simplest approach.

With some additional tests (e.g. for the bad cases in the tickets), I think we want to merge this.


josef-pkt (Member) commented Mar 22, 2013

An additional comment:
For some distributions we can still get many local minima (I think pareto is an example). Even when fit converges after this pull request, it might still return only one of the local minima (instead of NaNs, or the starting values, as now).

I think a warning about this in the docstring would be helpful.

Aside (not for this PR):
it might be interesting to check whether basinhopping is properly connected as a choice of minimizer, and whether using it would gain us a few more distributions that fit correctly.

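A hedged sketch of what that check could start from, using fit's optimizer argument (the wrapper below and the exact calling convention, optimizer(func, x0, args=..., disp=0), are assumptions for illustration):

    import numpy as np
    from scipy import stats, optimize

    np.random.seed(0)
    data = stats.pareto.rvs(2.5, size=500)

    def basinhopping_optimizer(func, x0, args=(), disp=0):
        # Adapt scipy.optimize.basinhopping to the callable interface that
        # rv_continuous.fit expects from its `optimizer` keyword.
        res = optimize.basinhopping(func, x0, minimizer_kwargs={"args": args})
        return res.x

    b, loc, scale = stats.pareto.fit(data, optimizer=basinhopping_optimizer)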

Simplified the fit method.
Made sure the fit_loc_scale method returns finite values, to avoid failures when fitting the distributions.
+        if not self._argcheck(*args) or scale <= 0:
+            return inf
+        x = asarray((x-loc) / scale)
+        cond0 = (x <= self.a) | (self.b <= x )

josef-pkt (Member) commented Mar 29, 2013

To avoid this in unnecessary cases (the t distribution, ...), add in front of this part

    if np.isneginf(self.a).all() and np.isinf(self.b).all():

or something like this.

Or maybe this is already handled somewhere else?
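
In other words, something like this standalone sketch of the suggested short-circuit (the helper name is made up):

    import numpy as np

    def out_of_support(x, a, b):
        # Distributions supported on the whole real line (t, normal, ...)
        # can never have out-of-support observations, so skip the mask.
        if np.isneginf(a) and np.isinf(b):
            return np.zeros(np.shape(x), dtype=bool)
        return (x <= a) | (b <= x)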

pv (Member) commented Apr 14, 2013

Here's a fixed branch: https://github.com/pv/scipy-work/compare/pr-462

I fixed the above comments and re-enabled the continuous distribution fitting tests (total test runtime is roughly under 2 minutes, so OK). After this PR, parameter fits succeed for 29 more distributions compared to SciPy 0.12.0.

Merge? Y/N?


rgommers (Member) commented Apr 14, 2013

Merging this sounds good to me. Tested, no issues seen. A few minor comments:

    # Skip failing fits unless overrided

typo: overrided --> overridden

    xfail = True
    try: xfail = not int(os.environ['SCIPY_XFAIL'])
    except: pass 

PEP8: add some line breaks. Also, should SCIPY_XFAIL be documented somewhere?

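For reference, the snippet after that request might look like this (a sketch of the suggested cleanup, not the committed code; the explicit exception types are an added assumption):

    import os

    # Expected-failure fits are skipped unless overridden through the
    # SCIPY_XFAIL environment variable.
    xfail = True
    try:
        xfail = not int(os.environ['SCIPY_XFAIL'])
    except (KeyError, ValueError):
        pass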

josef-pkt (Member) commented Apr 14, 2013

Looks fine to me; getting 29 additional cases to work is pretty good.

I don't understand why (0 < Shat) is used for Shat. For which case can we have a negative result after sqrt?


pbrod (Contributor) commented Apr 14, 2013

It is just to make sure it is not zero.

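A sketch of the kind of guard being discussed (the helper and the fallback value are illustrative); note that not (0 < Shat) is also true for NaN, since NaN comparisons are false, so one test filters zero, negative, and NaN estimates alike:

    import numpy as np

    def safe_scale(Shat):
        # Reject a zero, negative, NaN, or infinite scale estimate and fall
        # back to 1.0, so fit() always starts from finite, valid values
        # (cf. the fit_loc_scale commit message above).
        if not (0 < Shat < np.inf):
            return 1.0
        return Shat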

pbrod closed this Apr 14, 2013

pbrod reopened this Apr 14, 2013

josef-pkt (Member) commented Apr 15, 2013

Shouldn't it then be not (0 <= Shat)?

Update: I was reading this back to front; all OK.


josef-pkt (Member) commented Apr 15, 2013

Looks ready to merge to me.


pv (Member) commented May 1, 2013

Merged in b4e235c


pv closed this May 1, 2013

argriffing referenced this pull request on Apr 20, 2015: improved stats mle fit #4759 (merged)
