QUESTION: sm.Logit fails, while sm.GLM with Binomial link does not #6644

emilmirzayev opened this issue Apr 14, 2020 · 4 comments

@emilmirzayev (Contributor)
Describe the bug

This is more of a question, and maybe a potential issue (please correct me where I am wrong).

I was trying to fit a simple logit model using the sm.Logit approach, but I was getting this:

lm.fit(maxiter= 10000)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.664288
         Iterations: 10000
Traceback (most recent call last):

  File "<ipython-input-64-b39a0ea8a7d2>", line 1, in <module>
    lm.fit(maxiter= 10000)

  File "C:\Users\Emil\anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 1913, in fit
    disp=disp, callback=callback, **kwargs)

  File "C:\Users\Emil\anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 216, in fit
    disp=disp, callback=callback, **kwargs)

  File "C:\Users\Emil\anaconda3\lib\site-packages\statsmodels\base\model.py", line 533, in fit
    Hinv = np.linalg.inv(-retvals['Hessian']) / nobs

  File "<__array_function__ internals>", line 6, in inv

  File "C:\Users\Emil\anaconda3\lib\site-packages\numpy\linalg\linalg.py", line 547, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)

  File "C:\Users\Emil\anaconda3\lib\site-packages\numpy\linalg\linalg.py", line 97, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")

Which, AFAIK, means that my endog has some issues. So I looked at it:

 y.describe()

count    676.000000
mean       0.573964
std        0.494865
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: benchmark_better, dtype: float64

However, when I try sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.logit())), it works as expected:

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:       benchmark_better   No. Observations:                  676
Model:                            GLM   Df Residuals:                      654
Model Family:                Binomial   Df Model:                           21
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -449.06
Date:                Tue, 14 Apr 2020   Deviance:                       898.12
Time:                        15:28:45   Pearson chi2:                     677.
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         

AFAIK, the results of sm.Logit and sm.GLM with a Binomial family and logit link should be allclose. If one works, why not the other?

Code Sample, a copy-pastable example if possible

Because this is mostly a question, I presume no code sample is required.
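For reference, a minimal sketch of the two calls being compared (assuming y is the 0/1 benchmark_better series and X is the corresponding design matrix, both as described above):

import statsmodels.api as sm

# discrete-model route: fails here with LinAlgError (singular Hessian)
lm = sm.Logit(y, X)
res_logit = lm.fit(maxiter=10000)

# GLM route with the same logit link: converges via IRLS
glm = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.logit()))
res_glm = glm.fit()
print(res_glm.summary())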


Expected Output

Results

Output of import statsmodels.api as sm; sm.show_versions()


INSTALLED VERSIONS
------------------
Python: 3.7.7.final.0

statsmodels
===========

Installed: 0.11.0 (C:\Users\Emil\anaconda3\lib\site-packages\statsmodels)

Required Dependencies
=====================

cython: 0.29.15 (C:\Users\Emil\anaconda3\lib\site-packages\Cython)
numpy: 1.18.1 (C:\Users\Emil\anaconda3\lib\site-packages\numpy)
scipy: 1.4.1 (C:\Users\Emil\anaconda3\lib\site-packages\scipy)
pandas: 1.0.3 (C:\Users\Emil\anaconda3\lib\site-packages\pandas)
    dateutil: 2.8.1 (C:\Users\Emil\anaconda3\lib\site-packages\dateutil)
patsy: 0.5.1 (C:\Users\Emil\anaconda3\lib\site-packages\patsy)

Optional Dependencies
=====================

matplotlib: 3.1.3 (C:\Users\Emil\anaconda3\lib\site-packages\matplotlib)
    backend: module://ipykernel.pylab.backend_inline 
cvxopt: Not installed
joblib: 0.14.1 (C:\Users\Emil\anaconda3\lib\site-packages\joblib)

Developer Tools
================

IPython: 7.13.0 (C:\Users\Emil\anaconda3\lib\site-packages\IPython)
    jinja2: 2.11.1 (C:\Users\Emil\anaconda3\lib\site-packages\jinja2)
sphinx: 2.4.4 (C:\Users\Emil\anaconda3\lib\site-packages\sphinx)
    pygments: 2.6.1 (C:\Users\Emil\anaconda3\lib\site-packages\pygments)
pytest: 5.4.1 (C:\Users\Emil\anaconda3\lib\site-packages\pytest)
virtualenv: Not installed
@josef-pkt (Member)

The main reason for these kinds of problems is different optimizers.
GLM defaults to iteratively reweighted least squares (IRLS), which is fairly robust when the Hessian is not well behaved away from the optimum during the optimization.
The default Newton method in the discrete models fails in that case. (The Newton method has a small ridge factor in the optimization to handle small problems, but that optimizer is numerically not very robust.)

We can change the optimizer in discrete models like Logit to one that is more robust, e.g. "nm" or "bfgs". If it is just a problem away from the optimum, then running e.g. newton after nm has converged (or after enough iterations in nm) will improve the estimate locally, because it uses derivative information.
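A minimal sketch of that two-stage approach (assuming y and X from the report; "nm" and "newton" are standard method options of Logit.fit):

import statsmodels.api as sm

model = sm.Logit(y, X)  # y, X as in the report above

# first pass: Nelder-Mead, derivative-free and robust far from the optimum
res_nm = model.fit(method="nm", maxiter=5000)

# second pass: Newton, refining locally from the nm solution
res = model.fit(method="newton", start_params=res_nm.params)
print(res.summary())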

@josef-pkt (Member)

Also, the handling of edge cases is still partially different between GLM and the discrete models;
e.g., AFAIR, GLM Binomial has better protection against division by zero and extreme probabilities (0 or 1) than Logit.
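To illustrate the kind of guard meant here (a hypothetical helper, not the actual GLM internals): clipping fitted probabilities away from 0 and 1 keeps the log-likelihood finite.

import numpy as np

def clipped_loglike_obs(y, p, eps=1e-10):
    # hypothetical illustration: keep log() arguments away from 0
    p = np.clip(p, eps, 1 - eps)
    return y * np.log(p) + (1 - y) * np.log(1 - p)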

@emilmirzayev (Contributor, author)

Thank you for your reply. I have several questions:

  1. If IRLS does not fail and works, can I trust the results? I use Logit heavily in my paper; this is the first time I have had problems with it and switched to the GLM implementation.
  2. The second part of your first comment is unclear to me. Can we apply two different optimizers to the same instance, one after the other, to ensure convergence (this is what I understood)?
  3. Just out of curiosity, why is IRLS not available for Logit?

Thanks beforehand!

@josef-pkt (Member)

  1. Whenever there are problems, it is good to check the results. The problem could be in the optimization and go away once we are close to the optimum, or there could be some fundamental problem with the data, e.g. a singular or near-singular exog, or perfect separation in Logit and similar models.
    One basic check for convergence is that results.model.score(results.params) is close enough to zero and the Hessian is invertible. In that case we have a local optimum, and in standard cases it will also be the global optimum (theoretically the log-likelihood for Logit is concave, AFAIR); see the sketch after this list.

    Specifically, IRLS might not break on a singular exog because we use pinv or an equivalent of it. There might still be problems in estimating cov_params.
    (In OLS/WLS, cov_params could be rank deficient because of the pinv. I'm not sure that this doesn't apply to GLM.)

  2. Yes, e.g.

res0 = model.fit(method="nm")
results = model.fit(method="newton", start_params=res0.params)

By default, when a GLM is fit with a scipy optimizer, it first runs a few iterations of IRLS to get better starting values for the gradient-based optimizers.

  3. Different frameworks.
    GLM is historically estimated by IRLS; we added the scipy optimizers a few years ago.
    Logit is traditionally a nonlinear MLE, estimated with standard optimizers.

    The discrete models don't have a common framework to add IRLS generically, and adding it to individual models would require model-specific work, which would just duplicate GLM.
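A minimal sketch of the convergence check from point 1 (assuming res is a fitted Logit result; score and hessian are the standard model methods):

import numpy as np

# gradient of the log-likelihood at the estimate; should be near zero
grad = res.model.score(res.params)
print("max |score|:", np.max(np.abs(grad)))

# the Hessian at the estimate should be invertible;
# np.linalg.inv raises LinAlgError("Singular matrix") otherwise
np.linalg.inv(res.model.hessian(res.params))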
