QUESTION: sm.Logit fails, while sm.GLM with Binomial link does not #6644

emilmirzayev opened this issue Apr 14, 2020 · 4 comments

@emilmirzayev (Contributor)
Describe the bug

This is more of a question, and maybe a potential issue (please correct me where I am wrong).

I was trying to fit a simple logit model using the sm.Logit approach, but I was getting this:

lm.fit(maxiter= 10000)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.664288
         Iterations: 10000
Traceback (most recent call last):

  File "<ipython-input-64-b39a0ea8a7d2>", line 1, in <module>
    lm.fit(maxiter= 10000)

  File "C:\Users\Emil\anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 1913, in fit
    disp=disp, callback=callback, **kwargs)

  File "C:\Users\Emil\anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 216, in fit
    disp=disp, callback=callback, **kwargs)

  File "C:\Users\Emil\anaconda3\lib\site-packages\statsmodels\base\model.py", line 533, in fit
    Hinv = np.linalg.inv(-retvals['Hessian']) / nobs

  File "<__array_function__ internals>", line 6, in inv

  File "C:\Users\Emil\anaconda3\lib\site-packages\numpy\linalg\linalg.py", line 547, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)

  File "C:\Users\Emil\anaconda3\lib\site-packages\numpy\linalg\linalg.py", line 97, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")

Which, AFAIK, means that my endog has some issues. So I looked at it:

 y.describe()

count    676.000000
mean       0.573964
std        0.494865
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: benchmark_better, dtype: float64

However, when I try sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.logit())), it works as expected:

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:       benchmark_better   No. Observations:                  676
Model:                            GLM   Df Residuals:                      654
Model Family:                Binomial   Df Model:                           21
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -449.06
Date:                Tue, 14 Apr 2020   Deviance:                       898.12
Time:                        15:28:45   Pearson chi2:                     677.
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         

AFAIK, the results of sm.Logit and sm.GLM with a Binomial family and logit link should be allclose. If one works, why not the other?

Code Sample, a copy-pastable example if possible

Because this is mostly a question, I presume no code sample is required.
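For reference, a minimal sketch of the two calls being compared (assuming y is the 0/1 benchmark_better series and X is the corresponding design matrix, both as described above):

import statsmodels.api as sm

# discrete-model route: fails here with LinAlgError (singular Hessian)
lm = sm.Logit(y, X)
res_logit = lm.fit(maxiter=10000)

# GLM route with the same logit link: converges via IRLS
glm = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.logit()))
res_glm = glm.fit()
print(res_glm.summary())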


Expected Output

Results

Output of import statsmodels.api as sm; sm.show_versions()


INSTALLED VERSIONS
------------------
Python: 3.7.7.final.0

statsmodels
===========

Installed: 0.11.0 (C:\Users\Emil\anaconda3\lib\site-packages\statsmodels)

Required Dependencies
=====================

cython: 0.29.15 (C:\Users\Emil\anaconda3\lib\site-packages\Cython)
numpy: 1.18.1 (C:\Users\Emil\anaconda3\lib\site-packages\numpy)
scipy: 1.4.1 (C:\Users\Emil\anaconda3\lib\site-packages\scipy)
pandas: 1.0.3 (C:\Users\Emil\anaconda3\lib\site-packages\pandas)
    dateutil: 2.8.1 (C:\Users\Emil\anaconda3\lib\site-packages\dateutil)
patsy: 0.5.1 (C:\Users\Emil\anaconda3\lib\site-packages\patsy)

Optional Dependencies
=====================

matplotlib: 3.1.3 (C:\Users\Emil\anaconda3\lib\site-packages\matplotlib)
    backend: module://ipykernel.pylab.backend_inline 
cvxopt: Not installed
joblib: 0.14.1 (C:\Users\Emil\anaconda3\lib\site-packages\joblib)

Developer Tools
================

IPython: 7.13.0 (C:\Users\Emil\anaconda3\lib\site-packages\IPython)
    jinja2: 2.11.1 (C:\Users\Emil\anaconda3\lib\site-packages\jinja2)
sphinx: 2.4.4 (C:\Users\Emil\anaconda3\lib\site-packages\sphinx)
    pygments: 2.6.1 (C:\Users\Emil\anaconda3\lib\site-packages\pygments)
pytest: 5.4.1 (C:\Users\Emil\anaconda3\lib\site-packages\pytest)
virtualenv: Not installed
@josef-pkt (Member)

The main reason for these kinds of problems is different optimizers.
GLM defaults to iteratively reweighted least squares (IRLS), which is fairly robust when the Hessian is not well behaved away from the optimum during the optimization.
The default Newton method in the discrete models fails in that case. (The Newton method has a small ridge factor in the optimization to handle small problems, but that optimizer is numerically not very robust.)

We can change the optimizer in discrete models like Logit to one that is more robust, e.g. "nm" or "bfgs". If it is just a problem away from the optimum, then running e.g. newton after nm has converged (or after enough iterations in nm) will improve the estimate locally, because it uses derivative information.
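A minimal sketch of that two-stage approach (assuming y and X from the report; "nm" and "newton" are standard method options of Logit.fit):

import statsmodels.api as sm

model = sm.Logit(y, X)  # y, X as in the report above

# first pass: Nelder-Mead, derivative-free and robust far from the optimum
res_nm = model.fit(method="nm", maxiter=5000)

# second pass: Newton, refining locally from the nm solution
res = model.fit(method="newton", start_params=res_nm.params)
print(res.summary())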

@josef-pkt (Member)

Also, the handling of edge cases is still partially different between GLM and the discrete models;
e.g., AFAIR, GLM Binomial has better protection against division by zero and extreme probabilities (0 or 1) than Logit.
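To illustrate the kind of guard meant here (a hypothetical helper, not the actual GLM internals): clipping fitted probabilities away from 0 and 1 keeps the log-likelihood finite.

import numpy as np

def clipped_loglike_obs(y, p, eps=1e-10):
    # hypothetical illustration: keep log() arguments away from 0
    p = np.clip(p, eps, 1 - eps)
    return y * np.log(p) + (1 - y) * np.log(1 - p)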

@emilmirzayev (Contributor, author)

Thank you for your reply. I have several questions:

  1. If IRLS does not fail and works, can I trust the results? I use Logit heavily in my paper; this is the first time I have had problems with it and switched to the GLM implementation.
  2. The second part of your first comment is unclear to me. Can we apply two different optimizers to the same instance, one after the other, to ensure convergence (this is what I understood)?
  3. Just out of curiosity, why is IRLS not available for Logit?

Thanks beforehand!

@josef-pkt (Member)

  1. Whenever there are problems, it is good to check the results. The problem could be in the optimization and go away once we are close to the optimum, or there could be some fundamental problem with the data, e.g. a singular or near-singular exog, or perfect separation in Logit and similar models.
    One basic check for convergence is that results.model.score(results.params) is close enough to zero and the Hessian is invertible. In that case we have a local optimum, and in standard cases it will also be the global optimum (theoretically the log-likelihood for Logit is concave, AFAIR); see the sketch after this list.

    Specifically, IRLS might not break on a singular exog because we use pinv or an equivalent of it. There might still be problems in estimating cov_params.
    (In OLS/WLS, cov_params could be rank deficient because of the pinv. I'm not sure that this doesn't apply to GLM.)

  2. Yes, e.g.

res0 = model.fit(method="nm")
results = model.fit(method="newton", start_params=res0.params)

By default, when a GLM is fit with a scipy optimizer, it first runs a few iterations of IRLS to get better starting values for the gradient-based optimizers.

  3. Different frameworks.
    GLM is historically estimated by IRLS; we added the scipy optimizers a few years ago.
    Logit is traditionally a nonlinear MLE, estimated with standard optimizers.

    The discrete models don't have a common framework to add IRLS generically, and adding it to individual models would require model-specific work, which would just duplicate GLM.
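A minimal sketch of the convergence check from point 1 (assuming res is a fitted Logit result; score and hessian are the standard model methods):

import numpy as np

# gradient of the log-likelihood at the estimate; should be near zero
grad = res.model.score(res.params)
print("max |score|:", np.max(np.abs(grad)))

# the Hessian at the estimate should be invertible;
# np.linalg.inv raises LinAlgError("Singular matrix") otherwise
np.linalg.inv(res.model.hessian(res.params))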
