# Generalized Linear Models for Binary Data (Example1)

### Intro and objectives


### In this lab you will learn:
1. examples of generalized linear models
2. how to fit these models in Python


## What I hope you'll get out of this lab
* The feeling that you'll "know where to start" when you need to fit generalized linear models
* Worked Examples
* How to interpret the results obtained

In [1]:
!pip install wooldridge
!pip install linearmodels
import wooldridge as woo
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import linearmodels as plm
import numpy as np
from scipy import stats


# Example. Will you make it to graduate school?

#### We are interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average) and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don’t admit, is a binary variable.

In [4]:
AdmissionsDataFrame=pd.read_csv('https://raw.githubusercontent.com/thousandoaks/M4DS202/main/data/admissions.csv')


In [13]:
AdmissionsDataFrame.head(20)

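|   | admit | gre | gpa | rank |
|---|-------|-----|------|------|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.0 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |
| 5 | 1 | 760 | 3.0 | 2 |
| 6 | 1 | 560 | 2.98 | 1 |
| 7 | 0 | 400 | 3.08 | 2 |
| 8 | 1 | 540 | 3.39 | 3 |
| 9 | 0 | 700 | 3.92 | 2 |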


In [6]:
AdmissionsDataFrame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   admit   400 non-null    int64  
 1   gre     400 non-null    int64  
 2   gpa     400 non-null    float64
 3   rank    400 non-null    int64  
dtypes: float64(1), int64(3)
memory usage: 12.6 KB


In [7]:
# Compute the correlation matrix
corr = AdmissionsDataFrame[['admit','gre','gpa','rank']].corr()

round(corr,3)

|       | admit | gre | gpa | rank |
|-------|-------|-----|-----|------|
| admit | 1.0 | 0.184 | 0.178 | -0.243 |
| gre   | 0.184 | 1.0 | 0.384 | -0.123 |
| gpa   | 0.178 | 0.384 | 1.0 | -0.057 |
| rank  | -0.243 | -0.123 | -0.057 | 1.0 |


#### We observe a weak positive correlation (0.184) between the response variable "admit" and "gre".

#### We observe a weak positive correlation (0.178) between the response variable "admit" and "gpa".

#### We observe a weak negative correlation (-0.243) between the response variable "admit" and "rank".

## 1. The model

#### Given the binary nature of the response variable "admit", we select a logit model.
#### We assume that the response variable "admit" is a random variable following a Bernoulli distribution, that is, a binomial distribution with a single trial.


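#### We assume a logit link function. This means that the log odds of the outcome are modeled as a linear combination of the variables "gre", "gpa" and "rank".

$ \operatorname{logit}(P(admit=1)) = \log\left(\frac{P(admit=1)}{P(admit=0)}\right) = \beta_0 + \beta_1 \cdot gre + \beta_2 \cdot gpa + \beta_3 \cdot rank $

#### In the estimation that follows, "rank" is entered as a categorical variable through C(rank), so the single term for rank is replaced by one indicator coefficient for each of ranks 2, 3 and 4, with rank 1 as the base category.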
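To make the link function concrete, here is a minimal sketch of the logit and its inverse in plain Python. The function names `logit` and `inverse_logit` are illustrative choices, not part of the lab's code.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log odds."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map log odds back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Round trip for a few probabilities: p -> log odds -> p
for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(f"p={p:.1f}  log odds={z:+.3f}  back to p={inverse_logit(z):.3f}")
```

A probability of 0.5 corresponds to log odds of zero. Probabilities above 0.5 give positive log odds and probabilities below 0.5 give negative log odds.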




## 2. Estimation of the parameters
#### In this case we rely on maximum likelihood estimation (MLE). OLS is not appropriate here because the response is binary and the model is nonlinear in the parameters.
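As a rough illustration of what maximum likelihood estimation does for a logit model, the sketch below maximizes the Bernoulli log-likelihood with Newton-Raphson steps. It runs on synthetic data (not the admissions data), the helper name `fit_logit_newton` is hypothetical, and it is not meant to reproduce the exact optimizer statsmodels uses.

```python
import numpy as np

def fit_logit_newton(X, y, n_iter=25, tol=1e-10):
    """Maximize the logit log-likelihood with Newton-Raphson (illustrative sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # predicted probabilities
        gradient = X.T @ (y - p)                 # first derivative of the log-likelihood
        hessian = -(X.T * (p * (1.0 - p))) @ X   # second derivative of the log-likelihood
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                       # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data with known coefficients (assumed values, for illustration only)
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

beta_hat = fit_logit_newton(X, y)
print("estimated coefficients:", beta_hat)
```

At the maximum the gradient of the log-likelihood is approximately zero, which is the condition the loop iterates toward.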

In [9]:
reg1 = smf.logit(formula='admit ~ gre+gpa+C(rank)', data=AdmissionsDataFrame)

# We fit the model
results1 = reg1.fit()
results1.summary()

Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6


|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | admit | No. Observations: | 400 |
| Model: | Logit | Df Residuals: | 394 |
| Method: | MLE | Df Model: | 5 |
| Date: | Thu, 19 Jan 2023 | Pseudo R-squ.: | 0.08292 |
| Time: | 07:39:07 | Log-Likelihood: | -229.26 |
| converged: | True | LL-Null: | -249.99 |
| Covariance Type: | nonrobust | LLR p-value: | 7.578e-08 |

|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.9900 | 1.140 | -3.500 | 0.000 | -6.224 | -1.756 |
| C(rank)[T.2] | -0.6754 | 0.316 | -2.134 | 0.033 | -1.296 | -0.055 |
| C(rank)[T.3] | -1.3402 | 0.345 | -3.881 | 0.000 | -2.017 | -0.663 |
| C(rank)[T.4] | -1.5515 | 0.418 | -3.713 | 0.000 | -2.370 | -0.733 |
| gre | 0.0023 | 0.001 | 2.070 | 0.038 | 0.000 | 0.004 |
| gpa | 0.8040 | 0.332 | 2.423 | 0.015 | 0.154 | 1.454 |

## 3. Model interpretation.

#### 1. All predictors are statistically significant at the 5% level (every p-value is below 0.05).


#### To interpret the estimates, remember that each coefficient is the change in the log odds of the response for a one-unit increase in that independent variable, holding the other variables fixed.

#### 2. For every one-unit increase in gre, the log odds of admission (versus non-admission) increase by 0.0023.

#### 3. For a one-unit increase in gpa, the log odds of being admitted to graduate school increase by 0.804.

#### 4. The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1 (the base category), changes the log odds of admission by -0.675.

#### 5. Having attended an undergraduate institution with rank of 3, versus an institution with a rank of 1, changes the log odds of admission by -1.34.

#### 6. Having attended an undergraduate institution with rank of 4, versus an institution with a rank of 1, changes the log odds of admission by -1.55.
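Log odds are hard to read directly, so it often helps to exponentiate the coefficients into odds ratios and to compute a predicted probability. The sketch below uses the rounded coefficients copied from the summary table. The applicant profile (gre 600, gpa 3.5) and the helper name `predicted_probability` are hypothetical choices for illustration. With the fitted object, `np.exp(results1.params)` should give the odds ratios at full precision.

```python
import math

# Rounded coefficients copied from the summary table
coef = {
    "Intercept": -3.9900,
    "rank2": -0.6754,
    "rank3": -1.3402,
    "rank4": -1.5515,
    "gre": 0.0023,
    "gpa": 0.8040,
}

# Odds ratio = exp(coefficient): multiplicative change in the odds of admission
odds_ratios = {name: math.exp(b) for name, b in coef.items() if name != "Intercept"}
for name, ratio in odds_ratios.items():
    print(f"{name:>6}: odds ratio = {ratio:.3f}")

def predicted_probability(gre, gpa, rank):
    """Predicted admission probability; rank 1 is the base category."""
    log_odds = coef["Intercept"] + coef["gre"] * gre + coef["gpa"] * gpa
    if rank in (2, 3, 4):
        log_odds += coef[f"rank{rank}"]
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical applicant with the same scores, from a rank 1 and a rank 2 institution
p_rank1 = predicted_probability(gre=600, gpa=3.5, rank=1)
p_rank2 = predicted_probability(gre=600, gpa=3.5, rank=2)
print(f"rank 1: {p_rank1:.3f}   rank 2: {p_rank2:.3f}")
```

An odds ratio below 1 means lower odds of admission than the base category. For example, the rank 2 coefficient of -0.675 corresponds to roughly half the odds of a rank 1 institution.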