Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GLM.fit does not allow start_params #443

Closed
jseabold opened this issue Aug 24, 2012 · 6 comments

Comments

Projects
None yet
2 participants
@jseabold
Copy link
Member

commented Aug 24, 2012

GLM fit signature should allow starting parameters to be specified. This came up as I was reading about how IRLS is not a robust fitting algorithm.

http://www.r-bloggers.com/how-robust-is-logistic-regression/

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Aug 25, 2012

reading a bit how the initial estimate works, I think
mu = self.family.starting_mu(self.endog)
should depend on data_weights. It uses y.mean() instead of weighted mean in Binomial.

an example that breaks with the default starting values is at a comment on the original post
http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Aug 25, 2012

discrete Logit fails with singular matrix exception when using the default newton method.

method='ncg' converges very fast (10 iterations) to the correct solution (result of R glm with adjusted starting values)

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Aug 25, 2012

example script for the case in the comment

# -*- coding: utf-8 -*-
"""

Created on Sat Aug 25 10:14:55 2012
Author: Josef Perktold

convergence failure of GLM Binomial and Logit
example based on comment in 
http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/

Note: example has one value with predicted probability zero
>>> (reslogit.predict() < 1e-10).sum()
1
>>> reslogit.predict().min()
9.6285423080907814e-233

method='ncg' looks good

"""

import numpy as np
from numpy.testing import assert_almost_equal, assert_equal

import statsmodels.api as sm

y2 = np.array('0 1 0 0 0 1'.split(), int)
wt = np.array([50,1,50,1,5,10])
y2 = np.repeat(y2, wt)
x2 = np.repeat([0,0,0.001,100,-1,-1], wt)
assert_equal(y2.sum(), 11)
assert_equal(len(y2), 117)
#y2.mean() is 11./117 = 0.094017094017094016

resglm2 = sm.GLM(y2, sm.add_constant(x2, prepend=True), family=sm.families.Binomial()).fit()
print resglm2.params #nan

try:
    reslogit = sm.Logit(y2, sm.add_constant(x2, prepend=True)).fit()
except np.linalg.linalg.LinAlgError:
    print 'numpy.linalg.linalg.LinAlgError'
#numpy.linalg.linalg.LinAlgError: Singular matrix

reslogit = sm.Logit(y2, sm.add_constant(x2, prepend=True)).fit(method='nm')
#Warning: Maximum number of iterations has been exceeded
reslogit = sm.Logit(y2, sm.add_constant(x2, prepend=True)).fit(method='ncg')
#from R glm with starting values (-4, -5)
#prints:
rglm_params = [-4.603050220003291, -5.296345452725506]
rglm_aic = 34.31049560846446
assert_almost_equal(reslogit.params, rglm_params, decimal=8)
assert_almost_equal(reslogit.aic, rglm_aic, decimal=12)

#with starting values
reslogit_s = sm.Logit(y2, sm.add_constant(x2, prepend=True)).fit(
                                        start_params=[-4,-5])

assert_almost_equal(reslogit_s.params, rglm_params, decimal=8)
assert_almost_equal(reslogit_s.aic, rglm_aic, decimal=12)
@josef-pkt

This comment has been minimized.

Copy link
Member

commented Apr 18, 2014

For adding the start_params I think @yizhong 's change in line 386 is the right way.

unit tests could just check the params in the fit history of the results instance.

@yizhong pull request would be very welcome

@jseabold

This comment has been minimized.

Copy link
Member Author

commented Apr 18, 2014

I think this failing case is orthogonal to the start_params issue. Though since I think things like this [1] are boolean indices anyway then we should index and have two sums right (or use nansum)? This (almost) solves the issue here. I want to refactor all of this now after looking at it again 5 years later...

[1] https://github.com/statsmodels/statsmodels/blob/master/statsmodels/genmod/families/family.py#L779

jseabold added a commit to jseabold/statsmodels that referenced this issue Apr 18, 2014

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Apr 18, 2014

AFAIU from the comments, the original blog post was about starting values and non-convergence. That should be a new issue.

jseabold added a commit to jseabold/statsmodels that referenced this issue Apr 18, 2014

@jseabold jseabold closed this in 524aa85 Apr 18, 2014

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.