# Generalized Linear Models

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Loading Data

In [2]:

star98 = sm.datasets.star98.load_pandas().data
formula = 'SUCCESS ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
           PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
data = star98[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP',
              'PCTCHRT', 'PCTYRRND', 'PERMINTE', 'AVYRSEXP', 'AVSALK',
              'PERSPENK', 'PTRATIO', 'PCTAF']].copy()
endog = data['NABOVE'] / (data['NABOVE'] + data.pop('NBELOW'))
del data['NABOVE']
data['SUCCESS'] = endog

## Fitting a Generalized Linear Model

In [3]:
model = smf.glm(formula=formula, data=data, family=sm.families.Binomial())

In [4]:
model = model.fit()
print_model = model.summary()
print(print_model)

                 Generalized Linear Model Regression Results                  
Dep. Variable:                SUCCESS   No. Observations:                  303
Model:                            GLM   Df Residuals:                      282
Model Family:                Binomial   Df Model:                           20
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -127.33
Date:                Sun, 07 Apr 2024   Deviance:                       8.5477
Time:                        02:36:02   Pearson chi2:                     8.48
No. Iterations:                     4   Pseudo R-squ. (CS):             0.1115
Covariance Type:            nonrobust                                         
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept               

For an outcome with K categories, the statsmodels output contains only K-1 equations (two in this case), each giving coefficients relative to a reference group. In the abalone example, the reference group is Female. Each equation models the log of a ratio of two probabilities: the probability of belonging to the group of interest versus the probability of belonging to the reference group. A coefficient is therefore the change in that log-ratio per unit increase in its predictor. There are two sets of coefficients: the first, marked SEX=Infant, compares Infant with Female, and the second, marked SEX=Male, compares Male with Female.
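As a sketch of what one of these K-1 equations looks like, the general multinomial-logit form is written below. The predictors $x_1, \dots, x_p$ and the coefficient symbols $\beta_{k,j}$ are generic placeholders (an assumption for illustration); the actual predictors of the abalone model are not listed here.

$$
\log\frac{P(\text{SEX}=k \mid x)}{P(\text{SEX}=\text{Female} \mid x)} = \beta_{k,0} + \beta_{k,1}x_1 + \dots + \beta_{k,p}x_p, \qquad k \in \{\text{Infant}, \text{Male}\}
$$

With Female as the reference group, there is one such equation for Infant and one for Male, which is why only two sets of coefficients are reported.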

## Accessing Model Parameters

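In statsmodels, the `fit()` method returns a results object (for `glm`, a `GLMResultsWrapper`). The model coefficients, standard errors, p-values, etc., are all available from this object. Note that the cell above rebinds the name `model` to it, so `model` now refers to the fitted results rather than the unfitted model.

Conveniently, when the model is fitted from a pandas DataFrame, these values are returned as pandas Series (or as a DataFrame, in the case of `conf_int()`) with the parameter name as the index.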

In [5]:
model.params

Intercept                   0.403664
LOWINC                     -0.020396
PERASIAN                    0.015865
PERBLACK                   -0.019802
PERHISP                    -0.009589
PCTCHRT                    -0.002218
PCTYRRND                   -0.002167
PERMINTE                    0.106822
AVYRSEXP                   -0.041119
PERMINTE:AVYRSEXP          -0.003065
AVSALK                      0.013091
PERMINTE:AVSALK            -0.001899
AVYRSEXP:AVSALK             0.000766
PERMINTE:AVYRSEXP:AVSALK    0.000060
PERSPENK                   -0.309703
PTRATIO                     0.009565
PERSPENK:PTRATIO            0.006611
PCTAF                      -0.014274
PERSPENK:PCTAF              0.010513
PTRATIO:PCTAF              -0.000114
PERSPENK:PTRATIO:PCTAF     -0.000246
dtype: float64
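Because the model uses the logit link, each coefficient is a change in the log-odds of SUCCESS per unit increase in its predictor, and exponentiating it gives an odds ratio. The following is a minimal sketch using three values copied from the output above. The variables were picked because they do not appear in any interaction term of the formula, so their coefficients can be read on their own. The names `coefs` and `odds_ratios` are introduced here for illustration.

```python
import math

# Coefficients copied from the model.params output above.
# These three predictors are not part of any interaction term.
coefs = {"LOWINC": -0.020396, "PERASIAN": 0.015865, "PERBLACK": -0.019802}

# exp(beta) is the multiplicative change in the odds per unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    pct = (ratio - 1.0) * 100.0
    print(f"{name}: odds ratio {ratio:.4f} ({pct:+.2f}% change in odds per unit)")
```

With the fitted results object at hand, the same transformation can be applied to every coefficient at once with `np.exp(model.params)`.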

Here are some of the relevant attributes and methods of the results object for a logistic regression.


|Attr/func|Description|
| ------------- |-------------|
|params|Estimated model parameters. Appears as coef when calling summary() on a fitted model|
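|bse|Standard errors of the parameter estimates. Appears as std err in summary()|
|tvalues|Each coefficient's test statistic (params / bse). For a GLM, summary() labels this column z|
|pvalues|Two-sided p-value for each coefficient's test statistic (one value per coefficient, not one for the whole model)|
|conf_int(alpha)|Method that calculates the confidence interval for each estimated parameter; alpha=0.05 gives a 95% interval. To call: model.conf_int(0.05)|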

