Logistic regression using scikit-learn and statsmodels (based on An Introduction to Statistical Learning)

In [None]:
import numpy as np 
import pandas as pd
import math
import statsmodels.api as sm 
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from collections import Counter
from ISLP.models import summarize
from ISLP import confusion_table
import seaborn as sns

In [None]:
#load datasets
large = pd.read_csv('../datasets/full_cleaned_dataset.csv')
small = pd.read_csv('../datasets/1std_dataset.csv')

In [None]:
#dropping unnecessary columns (z_score, index and year columns)
small.drop(columns=['z_score', 'year'], inplace=True)
large.drop(columns=['Unnamed: 0', 'year'], inplace=True)

#drop any non-numeric columns from both datasets
#getting the list of numeric columns
numeric_columns = large.select_dtypes(include=np.number).columns

#dropping non-numeric columns from large and small datasets
large = large[numeric_columns]
small = small[numeric_columns]

#move the target (distressed) out of the dataset
large_target = large.pop('distressed')
small_target = small.pop('distressed')

#turn dfs into numpy arrays
nplarge = large.to_numpy()
npsmall = small.to_numpy()

In [None]:
# logistic regression
scaler = StandardScaler()
# standardize the features (mean 0, variance 1)
X = scaler.fit_transform(npsmall.copy())
y = small_target.copy()
glm = sm.GLM(y, X, family=sm.families.Binomial())
results = glm.fit()

print(results.summary())

The results of this first model are bad. The low R2 and the p-values all being 1 indicate that logistic regression might not be a good fit for our data.

There is a perfect separation warning. This could be solved by using a different model (Firth logistic regression is usually recommended, but it's not part of the libraries I'm using). Another option would be to remove the variables that are causing the separation, but it is difficult to figure out which variables those are.
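To illustrate why perfect separation breaks the fit: when a feature splits the two classes cleanly, the log-likelihood keeps increasing as the coefficient grows, so no finite maximum-likelihood estimate exists. The toy data and the helper `log_likelihood` below are made up for illustration; they are not part of the dataset or of statsmodels.

```python
import numpy as np

# Toy, perfectly separated data: every x < 0 has y = 0, every x > 0 has y = 1
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(beta, x, y):
    """Bernoulli log-likelihood of a one-feature, no-intercept logistic model."""
    z = beta * x
    # log(sigmoid(z)) = -log(1 + exp(-z)) and log(1 - sigmoid(z)) = -log(1 + exp(z)),
    # both computed stably with logaddexp
    return float(np.sum(y * -np.logaddexp(0, -z) + (1 - y) * -np.logaddexp(0, z)))

betas = [0.5, 1, 2, 5, 10, 20]
lls = [log_likelihood(b, x, y) for b in betas]
for b, ll in zip(betas, lls):
    print(f"beta={b:>4}: log-likelihood={ll:.6f}")
```

The log-likelihood approaches 0 from below as `beta` grows and never reaches a maximum, which is why the optimizer reports separation instead of converging to sensible estimates.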

There seem to be two issues with the data that must be solved:
1. The data is heavily imbalanced. There are only a few distressed observations (8 out of 667). Resampling (oversampling the distressed observations) might help. This can be achieved with SMOTE.
2. The data seems to have multicollinearity. I should try to remove features that are highly correlated with each other and build a new model on the reduced dataset. This can be achieved by checking the variance inflation factor (VIF) of each feature.

In [None]:
#resampling using SMOTE
small_res, y_res = SMOTE().fit_resample(X, y)

print(f"Original dataset shape {Counter(y)}")
print(f"Resampled dataset shape {Counter(y_res)}")

The resampled data now has a 50/50 split between distressed and non-distressed observations.
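For intuition about what the resampling does: SMOTE creates synthetic minority observations by interpolating between a minority point and one of its nearest minority neighbours. The function `smote_like_samples` below is a simplified, hypothetical sketch of that idea in plain NumPy; it is not the imblearn implementation, which handles more details.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    # pairwise Euclidean distances between the minority points
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # pick a random base point and one of its k nearest neighbours
    base = rng.integers(0, n, size=n_new)
    nn = neighbours[base, rng.integers(0, k, size=n_new)]
    # place the new point somewhere on the segment between the two
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[nn] - X_min[base])

# 8 made-up minority observations with 2 features
X_min = np.random.default_rng(1).normal(size=(8, 2))
new_points = smote_like_samples(X_min, n_new=20)
print(new_points.shape)
```

Because every synthetic point lies between two existing minority points, oversampling this way adds no information beyond the original 8 observations.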

In [None]:
glm = sm.GLM(y_res, small_res, family=sm.families.Binomial())
results = glm.fit()

print(results.summary())

Simply adding more distressed observations to the dataset didn't help the model. The underlying collinearity issues should still be present, even though no error is reported anymore.

To address that, we can calculate the variance inflation factor (VIF) to find collinear features.

In [None]:
# recreate a df with original features including the target variable
vif_small = small.copy()
vif_small['distressed'] = small_target.copy()

# calculate VIF for each feature
vals = [VIF(vif_small, i)
        for i in range(1, vif_small.shape[1])]
vif = pd.DataFrame({'vif': vals},
                   index=vif_small.columns[1:])
vif = vif.sort_values('vif', ascending=False)

vif = vif['vif'].round(3)

vif



Looking at the VIF values, we can see that some features end up with an infinite VIF, which means they are (numerically) exact linear combinations of other features. Those features can be removed from the dataset: minorityInterest, totalLiabilitiesAndTotalEquity, totalEquity, totalLiabilitiesAndStockholdersEquity, totalStockholdersEquity, grossProfit, costOfRevenue, revenue.

Some other features have very high VIFs as well; we'll check those again after the features with infinite VIFs have been removed.
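As a sketch of why some VIFs are infinite: the VIF of feature j is 1 / (1 - R2_j), where R2_j comes from regressing feature j on all other features, so an exact linear relation gives R2_j = 1. The example uses made-up numbers and a hypothetical helper `vif_numpy` (which adds an intercept to each regression, unlike the cell above). Gross profit = revenue - cost of revenue is the standard accounting identity; that the dataset's columns follow it is an assumption.

```python
import numpy as np

def vif_numpy(X):
    """VIF_j = 1 / (1 - R2_j), regressing column j on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        # treat an R2 within numerical tolerance of 1 as perfect collinearity
        out[j] = np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)
    return out

# Made-up data: gross_profit is an exact linear combination of the first two
rng = np.random.default_rng(0)
revenue = rng.uniform(100, 200, size=200)
cost = rng.uniform(40, 90, size=200)
gross_profit = revenue - cost
unrelated = rng.normal(size=200)

vifs = vif_numpy(np.column_stack([revenue, cost, gross_profit, unrelated]))
print(dict(zip(["revenue", "cost", "gross_profit", "unrelated"], vifs)))
```

All three features involved in the identity get an infinite VIF, while the unrelated feature stays close to 1. Dropping any one of the three is enough to break the exact dependence.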

In [None]:
# removing the features with infinite VIF values
toremove = ['minorityInterest', 'totalLiabilitiesAndTotalEquity', 'totalEquity',
            'totalLiabilitiesAndStockholdersEquity', 'totalStockholdersEquity', 'grossProfit', 'costOfRevenue', 'revenue']

for i in toremove:
    if i in vif_small.columns:
        vif_small.drop(columns=[i], inplace=True)
    else:
        print(f"{i} not in columns")


In [None]:
# calculate VIFs again
vals = [VIF(vif_small, i)
        for i in range(1, vif_small.shape[1])]
vif = pd.DataFrame({'vif': vals},
                   index=vif_small.columns[1:])

vif = vif.sort_values('vif', ascending=False)

vif = vif['vif'].round(3)

vif

Some VIF values are still very high, but let's first fit a model without the infinite-VIF features to see whether there are improvements.

In [None]:
#load and prepare dataset
small2 = pd.read_csv('../datasets/1std_dataset.csv')
small2.drop(columns=['z_score', 'year'], inplace=True)
small2.drop(columns=toremove, inplace=True)

#remove all non-numeric columns from the dataset
numeric_columns = small2.select_dtypes(include=np.number).columns
small2 = small2[numeric_columns]

# create y, the target variable and X, the features
X = small2.copy()
X.drop(columns=['distressed'], inplace=True)
y = small2.pop('distressed')

# oversample the minority class with synthetic observations using SMOTE
X, y = SMOTE().fit_resample(X, y)

# run the logistic regression model
glm = sm.GLM(y, X, family=sm.families.Binomial())
results = glm.fit()

print(results.summary())

Since the model is still very bad, let's try to remove all features with a VIF higher than 10. This will remove most of the features.
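Dropping every feature with a VIF above 10 in one go can remove more than necessary, because removing one feature of a collinear group lowers the VIFs of the others. An alternative, sketched here with hypothetical helpers `vif_series` and `prune_by_vif` on made-up data, is to drop only the feature with the highest VIF and recompute until all remaining VIFs are at or below the threshold. This is a different procedure from the one used in the next cell.

```python
import numpy as np
import pandas as pd

def vif_series(df):
    """VIF per column, regressing each column on the others plus an intercept."""
    X = df.to_numpy(dtype=float)
    n = X.shape[0]
    vals = {}
    for j, name in enumerate(df.columns):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vals[name] = np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)
    return pd.Series(vals)

def prune_by_vif(df, threshold=10.0):
    """Repeatedly drop the single highest-VIF column until all are <= threshold."""
    df = df.copy()
    while df.shape[1] > 1:
        vifs = vif_series(df)
        if vifs.max() <= threshold:
            break
        df = df.drop(columns=vifs.idxmax())
    return df

# Made-up data: c is almost exactly a + b, d is independent
rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 300))
toy = pd.DataFrame({"a": a, "b": b, "c": a + b + rng.normal(scale=0.01, size=300), "d": d})

print(vif_series(toy).round(1))   # a, b and c all exceed the threshold
kept = prune_by_vif(toy)
print(list(kept.columns))         # only one of the collinear group is dropped
```

In this toy case a one-shot cut at 10 would remove all of `a`, `b` and `c`, while the iterative version keeps two of them.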

In [None]:
vals = [VIF(X, i)
        for i in range(1, X.shape[1])]
vif = pd.DataFrame({'vif': vals},
                   index=X.columns[1:])

# keep only the features with a VIF above 10
vif = vif[vif['vif'] > 10]
vif = vif.sort_values('vif', ascending=False)

toremove2 = list(vif.index)

In [None]:
#load and prepare dataset
small3 = pd.read_csv('../datasets/1std_dataset.csv')
small3.drop(columns=['z_score', 'year'], inplace=True)
small3.drop(columns=toremove, inplace=True)
small3.drop(columns=toremove2, inplace=True)

#remove all non-numeric columns from the dataset
numeric_columns = small3.select_dtypes(include=np.number).columns
small3 = small3[numeric_columns]

# create y, the target variable and X, the features
X = small3.copy()
X.drop(columns=['distressed'], inplace=True)
y = small3.pop('distressed')

# oversample the minority class with synthetic observations using SMOTE
X, y = SMOTE().fit_resample(X, y)

# run the logistic regression model
glm = sm.GLM(y, X, family=sm.families.Binomial())
results = glm.fit()

print(results.summary())

We finally have something resembling a regular model.
We can standardize the data and run the model again; this should give more readable results.

In [None]:
# standardize the features (mean 0, variance 1)
X = scaler.fit_transform(X.copy())
y = y.copy()
glm = sm.GLM(y, X, family=sm.families.Binomial())
results = glm.fit()

print(results.summary())

R2 is 0.4, which is not that great considering how many features we dropped. I would suggest that logistic regression is simply not a good fit for the data we have at hand. But let's do some checks with the model to get more insights.
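For context on the 0.4: assuming the R2 in the summary is the Cox-Snell pseudo R2 (the statsmodels GLM summary labels it "Pseudo R-squ. (CS)"), its maximum is below 1. The sketch computes the measure from the log-likelihoods and shows the upper bound for a perfectly balanced binary target, as produced by SMOTE. The function name `cox_snell_r2` and the sample size are illustrative.

```python
import numpy as np

def cox_snell_r2(llf, llnull, n):
    """Cox-Snell pseudo R2 from the fitted and null log-likelihoods."""
    return 1.0 - np.exp(2.0 * (llnull - llf) / n)

# With statsmodels, llf and llnull would come from results.llf and results.llnull.
# Here: a balanced 0/1 target, so the null model predicts 0.5 for every row.
n = 1000  # illustrative sample size
llnull = n * np.log(0.5)

# A perfect model has log-likelihood 0, which gives the largest possible value
upper_bound = cox_snell_r2(0.0, llnull, n)
print(upper_bound)  # approximately 0.75
```

Under that assumption, a value of 0.4 should be read against a ceiling of about 0.75 rather than 1.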