# Logistic Regression 

Logistic regression predicts the probability that a binary event occurs.


## 1. Statsmodels

The Spector and Mazzeo dataset consists of 32 observations used to assess whether a Personalized System of Instruction (PSI) program improved students' grades.

The response (endogenous) variable is binary and indicates whether the student's grade improved.
The predictor (exogenous) variables are:
<ul>
<li>PSI: a binary indicator of whether the participant used the program</li>
<li>GPA: grade point average</li>
<li>TUCE: economics test score</li>
</ul>
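
For reference, the logit model fitted in the next cell can be written as follows. The symbols $\beta_0, \dots, \beta_3$ are notation introduced here for illustration; they correspond to the intercept and the three predictor coefficients.

$$
P(\text{GRADE} = 1 \mid \text{GPA}, \text{TUCE}, \text{PSI}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{GPA} + \beta_2\,\text{TUCE} + \beta_3\,\text{PSI})}}
$$

Equivalently, the log-odds of improvement are linear in the predictors:

$$
\log\frac{P(\text{GRADE} = 1)}{1 - P(\text{GRADE} = 1)} = \beta_0 + \beta_1\,\text{GPA} + \beta_2\,\text{TUCE} + \beta_3\,\text{PSI}
$$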

In [1]:
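import numpy as np
import statsmodels.api as sm

# Load the Spector and Mazzeo data (endog = GRADE; exog = GPA, TUCE, PSI)
spector_data = sm.datasets.spector.load()

# Append an intercept column; prepend=False places it after the predictors
spector_data.exog = sm.add_constant(spector_data.exog, prepend=False)

# Fit the logit model by maximum likelihood
log_reg = sm.Logit(spector_data.endog, spector_data.exog).fit()

print(log_reg.summary())   # full regression table
print(log_reg.params)      # fitted coefficients (log-odds scale)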


Optimization terminated successfully.
         Current function value: 0.402801
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                  GRADE   No. Observations:                   32
Model:                          Logit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Wed, 03 May 2023   Pseudo R-squ.:                  0.3740
Time:                        11:34:34   Log-Likelihood:                -12.890
converged:                       True   LL-Null:                       -20.592
Covariance Type:            nonrobust   LLR p-value:                  0.001502
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
GPA            2.8261      1.263      2.238      0.025       0.351       5.301
TUCE           0.0952      0.
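
The coefficients in the table are on the log-odds scale, so exponentiating one gives an odds ratio. The sketch below does this for the GPA row only, using the values printed above (coefficient 2.8261, 95% interval 0.351 to 5.301). The variable names are illustrative, and the numbers are copied by hand rather than read from `log_reg`.

```python
import math

# Values copied from the GPA row of the summary table above
gpa_coef = 2.8261
gpa_ci_low, gpa_ci_high = 0.351, 5.301

# exp(coefficient) is the multiplicative change in the odds of
# grade improvement per one-point increase in GPA, other predictors fixed
odds_ratio = math.exp(gpa_coef)
or_ci_low = math.exp(gpa_ci_low)
or_ci_high = math.exp(gpa_ci_high)

print(f"GPA odds ratio: {odds_ratio:.2f}")
print(f"95% interval:   {or_ci_low:.2f} to {or_ci_high:.2f}")
```

The point estimate comes out at roughly 17, but the interval is very wide, which is expected with only 32 observations.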

## 2. SciKit-Learn

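For this example, we'll generate a synthetic, imbalanced binary classification dataset with sklearn's `make_classification`. The positive class (about 15% of samples) stands in for people carrying a disease.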

In [11]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=9)
print(f"Percentage of people carrying the disease: {100*y.mean():.2f}%")

Percentage of people carrying the disease: 15.20%


In [14]:
# Split the data into test and train 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.1, shuffle=True, random_state=10)

# Fit logistic regression
from sklearn.linear_model import LogisticRegression

mdl = LogisticRegression(penalty='l2')          # L2 (ridge-style) regularization
mdl = mdl.fit(X_train, y_train)
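
A fitted `LogisticRegression` also exposes class probabilities through `predict_proba`, and `predict` amounts to thresholding the positive-class probability at 0.5. The following is a minimal, self-contained sketch that rebuilds the same data and split as above. The names `proba`, `y_pred_default`, and `y_pred_manual` are introduced here for illustration, and the model uses sklearn's default settings (which include an L2 penalty).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Recreate the dataset and split used in the cells above
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=9)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=10)

mdl = LogisticRegression().fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class
proba = mdl.predict_proba(X_test)[:, 1]

# predict() labels a sample positive when that probability exceeds 0.5
y_pred_default = mdl.predict(X_test)
y_pred_manual = (proba > 0.5).astype(int)

print("Predictions agree:", (y_pred_default == y_pred_manual).all())
```

With an imbalanced outcome, a threshold other than 0.5 may suit the application better; working from `proba` directly makes that choice explicit.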

In [19]:
from sklearn.metrics import class_likelihood_ratios

y_pred = mdl.predict(X_test)
pos_LR, neg_LR = class_likelihood_ratios(y_test, y_pred)
print(f"LR+: {pos_LR:.3f}")

LR+: 31.500
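
The positive likelihood ratio is sensitivity divided by the false positive rate, and the negative likelihood ratio is the false negative rate divided by specificity. The sketch below works through that arithmetic by hand. The confusion-matrix counts are hypothetical: the source does not show the actual confusion matrix, and these were chosen only because they are one combination on a 100-sample test set that gives an LR+ of 31.5.

```python
# Hypothetical confusion-matrix counts (NOT taken from the run above)
tp, fn = 6, 10    # diseased samples: correctly flagged / missed
fp, tn = 1, 83    # healthy samples: wrongly flagged / correctly cleared

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

# LR+ = sensitivity / (1 - specificity)
pos_lr = sensitivity / (1 - specificity)
# LR- = (1 - sensitivity) / specificity
neg_lr = (1 - sensitivity) / specificity

print(f"LR+: {pos_lr:.3f}")
print(f"LR-: {neg_lr:.3f}")
```

An LR+ well above 1 means a positive prediction raises the odds that a sample is truly positive; an LR- close to 1 means a negative prediction tells us comparatively little.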
