GLM.fit does not allow start_params #443

Closed
jseabold opened this Issue Aug 24, 2012 · 6 comments

Projects

None yet

2 participants

@jseabold
Member

GLM fit signature should allow starting parameters to be specified. This came up as I was reading about how IRLS is not a robust fitting algorithm.

http://www.r-bloggers.com/how-robust-is-logistic-regression/

@josef-pkt
Member

reading a bit how the initial estimate works, I think
mu = self.family.starting_mu(self.endog)
should depend on data_weights. It uses y.mean() instead of weighted mean in Binomial.

an example that breaks with the default starting values is at a comment on the original post
http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/

@josef-pkt
Member

discrete Logit fails with singular matrix exception when using the default newton method.

method='ncg' converges very fast (10 iterations) to the correct solution (result of R glm with adjusted starting values)

@josef-pkt
Member

example script for the case in the comment

# -*- coding: utf-8 -*-
"""

Created on Sat Aug 25 10:14:55 2012
Author: Josef Perktold

convergence failure of GLM Binomial and Logit
example based on comment in 
http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/

Note: example has one value with predicted probability zero
>>> (reslogit.predict() < 1e-10).sum()
1
>>> reslogit.predict().min()
9.6285423080907814e-233

method='ncg' looks good

"""

import numpy as np
from numpy.testing import assert_almost_equal, assert_equal

import statsmodels.api as sm

y2 = np.array('0 1 0 0 0 1'.split(), int)
wt = np.array([50,1,50,1,5,10])
y2 = np.repeat(y2, wt)
x2 = np.repeat([0,0,0.001,100,-1,-1], wt)
assert_equal(y2.sum(), 11)
assert_equal(len(y2), 117)
#y2.mean() is 11./117 = 0.094017094017094016

resglm2 = sm.GLM(y2, sm.add_constant(x2, prepend=True), family=sm.families.Binomial()).fit()
print resglm2.params #nan

try:
    reslogit = sm.Logit(y2, sm.add_constant(x2, prepend=True)).fit()
except np.linalg.linalg.LinAlgError:
    print 'numpy.linalg.linalg.LinAlgError'
#numpy.linalg.linalg.LinAlgError: Singular matrix

reslogit = sm.Logit(y2, sm.add_constant(x2, prepend=True)).fit(method='nm')
#Warning: Maximum number of iterations has been exceeded
reslogit = sm.Logit(y2, sm.add_constant(x2, prepend=True)).fit(method='ncg')
#from R glm with starting values (-4, -5)
#prints:
rglm_params = [-4.603050220003291, -5.296345452725506]
rglm_aic = 34.31049560846446
assert_almost_equal(reslogit.params, rglm_params, decimal=8)
assert_almost_equal(reslogit.aic, rglm_aic, decimal=12)

#with starting values
reslogit_s = sm.Logit(y2, sm.add_constant(x2, prepend=True)).fit(
                                        start_params=[-4,-5])

assert_almost_equal(reslogit_s.params, rglm_params, decimal=8)
assert_almost_equal(reslogit_s.aic, rglm_aic, decimal=12)
@jseabold jseabold added a commit to jseabold/statsmodels that referenced this issue Apr 18, 2014
@jseabold jseabold ENH: Allow start_params in GLM. Closes #443. 5d89811
@josef-pkt
Member

For adding the start_params I think @yizhong 's change in line 386 is the right way.

unit tests could just check the params in the fit history of the results instance.

@yizhong pull request would be very welcome

@jseabold
Member

I think this failing case is orthogonal to the start_params issue. Though since I think things like this [1] are boolean indices anyway then we should index and have two sums right (or use nansum)? This (almost) solves the issue here. I want to refactor all of this now after looking at it again 5 years later...

[1] https://github.com/statsmodels/statsmodels/blob/master/statsmodels/genmod/families/family.py#L779

@jseabold jseabold added a commit to jseabold/statsmodels that referenced this issue Apr 18, 2014
@jseabold jseabold ENH: Allow start_params in GLM. Closes #443. 30e7327
@josef-pkt
Member

AFAIU from the comments, the original blog post was about starting values and non-convergence. That should be a new issue.

@jseabold jseabold added a commit to jseabold/statsmodels that referenced this issue Apr 18, 2014
@jseabold jseabold ENH: Allow start_params in GLM. Closes #443. f510afa
@jseabold jseabold closed this in 524aa85 Apr 18, 2014
@PierreBdR PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
@jseabold jseabold ENH: Allow start_params in GLM. Closes #443. 06013c3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment