Logistic regression using scikit-learn and statsmodels (based on An Introduction to Statistical Learning)

In [None]:
import numpy as np 
import pandas as pd
import statsmodels.api as sm 
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from collections import Counter
from ISLP.models import summarize
from ISLP import confusion_table
import seaborn as sns

In [71]:
#load datasets
large = pd.read_csv('../datasets/full_cleaned_dataset.csv')
small = pd.read_csv('../datasets/1std_dataset.csv')

In [72]:
#dropping unnecessary columns (z_score, index, and year columns)
small.drop(columns=['z_score', 'year'], inplace=True)
large.drop(columns=['Unnamed: 0', 'year'], inplace=True)

#drop any non-numeric columns from both datasets
#getting lists of numeric columns
numeric_columns = large.select_dtypes(include=np.number).columns

#dropping non-numeric columns from large and small datasets
large = large[numeric_columns]
small = small[numeric_columns]

#move the target (distressed) out of the dataset
large_target = large.pop('distressed')
small_target = small.pop('distressed')

#turn dfs into numpy arrays
nplarge = large.to_numpy()
npsmall = small.to_numpy()

In [78]:
# logistic regression
scaler = StandardScaler()
# standardize the features (mean 0, variance 1)
X = scaler.fit_transform(npsmall.copy())
y = small_target.copy()
glm = sm.GLM(y, X, family=sm.families.Binomial())
results = glm.fit()

print(results.summary())

  t = np.exp(-z)
  special.gammaln(n - y + 1) + y * np.log(mu / (1 - mu + 1e-20)) +
  special.gammaln(n - y + 1) + y * np.log(mu / (1 - mu + 1e-20)) +


                 Generalized Linear Model Regression Results                  
Dep. Variable:             distressed   No. Observations:                  667
Model:                            GLM   Df Residuals:                      567
Model Family:                Binomial   Df Model:                           99
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                    nan
Date:                Fri, 11 Apr 2025   Deviance:                   3.6244e-09
Time:                        12:50:45   Pearson chi2:                 1.81e-09
No. Iterations:                    30   Pseudo R-squ. (CS):                nan
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1           326.1650   3.81e+06   8.57e-05      1.0

The results of this first model are bad. The pseudo R2 is reported as nan rather than a usable value, and every p-value is 1, which indicates that logistic regression might not be a good fit for our data.

There is a perfect separation warning. This could be solved by using a different model (Firth logistic regression is usually recommended, but it's not part of the libraries I'm using). Another option would be to remove the variables that are causing the separation, but it is difficult to figure out which variables those are.
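Since Firth regression is not available here, one alternative worth trying is ordinary L2-penalized logistic regression. It is a different technique from Firth's correction, but the penalty also keeps the coefficients finite when the classes are perfectly separable. A minimal sketch, assuming scikit-learn's LogisticRegression and a synthetic stand-in dataset; X_demo, y_demo, and the choices C=1.0 and class_weight="balanced" are illustrative assumptions, not values taken from this notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the real dataset):
# many features, few positives, and the classes are perfectly separable.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))
y_demo = np.zeros(200, dtype=int)
y_demo[:5] = 1
X_demo[:5, 0] += 10.0  # feature 0 alone separates the two classes

X_scaled = StandardScaler().fit_transform(X_demo)

# scikit-learn applies an L2 penalty by default; a smaller C shrinks harder.
# class_weight="balanced" reweights the rare class instead of resampling it.
clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(X_scaled, y_demo)

max_abs_coef = float(np.abs(clf.coef_).max())
print(f"largest |coefficient|: {max_abs_coef:.3f}")
```

On the real data this would be fit on the standardized X and y from the cell above. Note that scikit-learn does not report standard errors or p-values, so this trades the inference table for a stable fit.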

We can try a second model using a Gaussian distribution instead of a binomial one.

There seem to be two issues with the data that must be solved:
1. The data is incredibly imbalanced. There are only a few distressed observations (8 out of 667). Resampling (oversampling the distressed observations) might help.
2. The data seems to have multicollinearity. I should try to remove highly correlated features and build a new model on the reduced dataset.
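For the second point, a simple correlation filter could look roughly like this. It is only a sketch: the helper name drop_highly_correlated, the 0.9 threshold, and the toy columns are my own assumptions, and it only catches pairwise correlation, not a column that is a combination of several others.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of each pair whose |Pearson correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy frame (hypothetical columns): "b" is almost a copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=100),
    "c": rng.normal(size=100),
})
reduced = drop_highly_correlated(toy, threshold=0.9)
print(list(reduced.columns))  # ['a', 'c']
```

Applied to this notebook, the function would be called on the small dataframe before it is converted to a numpy array and standardized.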

In [76]:
#resampling using SMOTE
small_res, y_res = SMOTE().fit_resample(X, y)

print(f"Original dataset shape {Counter(y)}")
print(f"Resampled dataset shape {Counter(y_res)}")

Original dataset shape Counter({0.0: 659, 1.0: 8})
Resampled dataset shape Counter({0.0: 659, 1.0: 659})


We now have a 50/50 split between distressed and non-distressed observations.

In [75]:
glm = sm.GLM(y_res, small_res, family=sm.families.Binomial())
results = glm.fit()

print(results.summary())

  t = np.exp(-z)


                 Generalized Linear Model Regression Results                  
Dep. Variable:             distressed   No. Observations:                 1318
Model:                            GLM   Df Residuals:                     1218
Model Family:                Binomial   Df Model:                           99
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                    nan
Date:                Fri, 11 Apr 2025   Deviance:                       23026.
Time:                        12:42:21   Pearson chi2:                 1.13e+18
No. Iterations:                   100   Pseudo R-squ. (CS):                nan
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1          1.345e+15    1.2e+08   1.12e+07      0.0

  special.gammaln(n - y + 1) + y * np.log(mu / (1 - mu + 1e-20)) +
  special.gammaln(n - y + 1) + y * np.log(mu / (1 - mu + 1e-20)) +


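Simply adding more distressed observations to the dataset didn't help the model: the fit ran for 100 iterations, the log-likelihood is still nan, and the coefficient and Pearson chi2 values are enormous. The underlying collinearity issues should still be present, even if no error about them is raised anymore.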
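One way to check whether collinearity is really still there is to look at the rank and condition number of the design matrix directly. A rough sketch using only numpy; the helper collinearity_report and the toy matrix are hypothetical and only illustrate the idea:

```python
import numpy as np

def collinearity_report(X: np.ndarray) -> dict:
    """Summarize how close a design matrix is to being rank deficient."""
    n_rows, n_cols = X.shape
    return {
        "n_rows": n_rows,
        "n_cols": n_cols,
        "rank": int(np.linalg.matrix_rank(X)),
        "condition_number": float(np.linalg.cond(X)),
    }

# Toy design matrix (hypothetical): the third column is the sum of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X_toy = np.column_stack([base, base[:, 0] + base[:, 1]])
report = collinearity_report(X_toy)
print(report)
```

A rank below the number of columns, or a very large condition number, means some features are (nearly) linear combinations of others. In this notebook the same call could be made on small_res to see whether the resampled features are affected.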