The aim of this assignment is to compare the performance of random forests and logistic regression trained on data prepared with two different imputation strategies for numerical variables.

- Strategy 1: Impute missing data with the median and add missing indicators.

- Strategy 2: Impute missing data with a value at the extremes of the distribution.
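
To make the two strategies concrete, here is a minimal sketch on a toy column using plain pandas. The column name `x` and its values are made up for illustration. The end-of-tail value is taken as mean + 3 standard deviations, which I believe matches the default Gaussian, right-tail behaviour of feature-engine's `EndTailImputer`; treat that as an assumption.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made-up numbers).
toy = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 100.0, np.nan]})

# Strategy 1: flag the missing rows, then fill them with the median.
strategy_1 = toy.copy()
strategy_1["x_na"] = strategy_1["x"].isna().astype(int)
strategy_1["x"] = strategy_1["x"].fillna(toy["x"].median())

# Strategy 2: fill with a value at the right end of the distribution,
# here mean + 3 * std of the observed values (assumed Gaussian rule).
tail_value = toy["x"].mean() + 3 * toy["x"].std()
strategy_2 = toy.copy()
strategy_2["x"] = strategy_2["x"].fillna(tail_value)

print(strategy_1)
print(strategy_2)
```

With strategy 1 the imputed rows blend into the bulk of the data and only the indicator tells them apart; with strategy 2 the imputed value itself sits beyond the largest observed value.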

Let's get started.

The data was introduced in the following article: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0254030

And it can be downloaded from Kaggle: https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction.

I processed the data and uploaded it to the repo for this assignment.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_validate

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

from feature_engine.imputation import (
    MeanMedianImputer,
    EndTailImputer,
    AddMissingIndicator,
)

In [2]:
data = pd.read_csv("../taiwan_na.csv")

data.head()

Unnamed: 0,bankrupt,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.40575,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.82789,0.290202,0.026601,,1,0.016469
1,1,0.464291,0.538214,0.51673,0.610235,0.610235,0.998946,0.79738,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.60145,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.77467,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.9987,0.796967,0.808966,0.30335,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.03549


In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("bankrupt", axis=1),
    data["bankrupt"],
    test_size=0.33,
    random_state=42,
)

# Imputers

## Set up the imputers

In [4]:
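# strategy 1: median imputation + missing indicators
# missing_only=True adds indicators only for variables with NA in the train set
na_ind = AddMissingIndicator(missing_only=True)
# note: despite its name, mean_imp imputes with the median
mean_imp = MeanMedianImputer(imputation_method="median")

# strategy 2: end of tail imputation
# with default settings this should be mean + 3 * std (Gaussian rule, right tail)
end_imp = EndTailImputer()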

## Transform the datasets

In [5]:
# strategy 1:

X_train_1 = na_ind.fit_transform(X_train)
X_train_1 = mean_imp.fit_transform(X_train_1)

X_test_1 = na_ind.transform(X_test)
X_test_1 = mean_imp.transform(X_test_1)

X_test_1.head()

Unnamed: 0,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,Continuous interest rate (after tax),...,Equity to Liability,Pre-tax net Interest Rate_na,Persistent EPS in the Last Four Seasons_na,Net Value Growth Rate_na,Cash Reinvestment %_na,Contingent liabilities/Net worth_na,Net profit before tax/Paid-in capital_na,Inventory/Current Liability_na,Total expense/Assets_na,Interest Coverage Ratio (Interest expense to EBIT)_na
239,0.434456,0.481247,0.498742,0.596326,0.596326,0.998791,0.797012,0.809041,0.303237,0.781291,...,0.087378,0,0,0,0,0,0,1,0,0
2850,0.542534,0.571413,0.590663,0.603417,0.603417,0.999041,0.797476,0.809375,0.303526,0.781638,...,0.028519,0,0,0,0,0,1,0,0,0
2687,0.584897,0.631433,0.617057,0.610567,0.609954,0.999079,0.797542,0.809422,0.30356,0.781684,...,0.048876,0,0,0,0,0,0,0,0,0
6500,0.436942,0.490951,0.482413,0.607987,0.607951,0.998921,0.797265,0.809187,0.303408,0.781435,...,0.014691,0,0,0,1,0,1,0,0,0
2684,0.506898,0.565526,0.561754,0.608693,0.608693,0.999103,0.797538,0.809447,0.303503,0.781699,...,0.019245,0,0,0,0,0,0,0,0,1


In [6]:
# strategy 2:

X_train_2 = end_imp.fit_transform(X_train)
X_test_2 = end_imp.transform(X_test)

X_test_2.head()

Unnamed: 0,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,Continuous interest rate (after tax),...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
239,0.434456,0.481247,0.498742,0.596326,0.596326,0.998791,0.797012,0.809041,0.303237,0.781291,...,0.765336,0.001373,0.626305,0.596326,0.838369,0.275936,0.026791,0.565157,1,0.087378
2850,0.542534,0.571413,0.590663,0.603417,0.603417,0.999041,0.797476,0.809375,0.303526,0.781638,...,0.817797,0.00101,0.625384,0.603415,0.841846,0.279975,0.026904,0.565645,1,0.028519
2687,0.584897,0.631433,0.617057,0.610567,0.609954,0.999079,0.797542,0.809422,0.30356,0.781684,...,0.847518,0.001218,0.623886,0.610563,0.843304,0.277186,0.026792,0.565161,1,0.048876
6500,0.436942,0.490951,0.482413,0.607987,0.607951,0.998921,0.797265,0.809187,0.303408,0.781435,...,0.76765,0.000978,0.623608,0.607985,0.834479,0.29639,0.026615,0.564153,1,0.014691
2684,0.506898,0.565526,0.561754,0.608693,0.608693,0.999103,0.797538,0.809447,0.303503,0.781699,...,0.810394,0.003965,0.620144,0.608693,0.84178,0.285421,0.027121,0.602907,1,0.019245


# Compare the performance of random forests

Train two random forest models using the datasets obtained after the different imputation strategies.

Evaluate their performance using the roc-auc.

Use cross-validation to obtain the roc-auc value over the train set.
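
As a reminder of what the metric measures: the roc-auc equals the probability that a randomly chosen positive (bankrupt) case receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force on made-up labels and scores. `pairwise_auc` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def pairwise_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 3 of 4 pairs are ordered correctly: 0.75
```

A value of 0.5 therefore corresponds to random ranking and 1.0 to a perfect separation of the two classes.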

In [7]:
# Strategy 1: 

rf_model = RandomForestClassifier(n_estimators=50, random_state=42)

model = cross_validate(
    rf_model,
    X_train_1,
    y_train,
    cv=3,
    scoring="roc_auc",
)

print("roc_auc train: ", np.mean(model['test_score']), " + or - ", np.std(model['test_score']))

roc_auc train:  0.9123935455337899  + or -  0.01942445746350038


In [8]:
# cross_validate fits clones of rf_model, so refit it on the whole train set
rf_model.fit(X_train_1, y_train)

print("roc_auc test ", roc_auc_score(y_test, rf_model.predict_proba(X_test_1)[:,1]))

roc_auc test  0.9346560846560847


In [9]:
# strategy 2: 

rf_model = RandomForestClassifier(n_estimators=50, random_state=42)

model = cross_validate(
    rf_model,
    X_train_2,
    y_train,
    cv=3,
    scoring="roc_auc",
)

print("roc_auc train: ", np.mean(model['test_score']), " + or - ", np.std(model['test_score']))

roc_auc train:  0.9235294263921517  + or -  0.005186402208675665


In [10]:
rf_model.fit(X_train_2, y_train)

print("roc_auc test ", roc_auc_score(y_test, rf_model.predict_proba(X_test_2)[:,1]))

roc_auc test  0.9214115036695681


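Strategy 2 shows less variability across the cross-validation folds (std of about 0.005 versus 0.019), so it may be the more stable option. However, the mean cross-validated roc-auc is similar for both strategies (0.924 versus 0.912), and strategy 1 scores slightly higher on the test set (0.935 versus 0.921). Overall, the imputation strategy does not seem to have a major effect on random forests for this dataset.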
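
One rough way to judge whether the gap between the two strategies matters is to compare the difference in mean cross-validated roc-auc with the fold-to-fold spread. The sketch below does this with the numbers printed above, rounded to four decimals. With only 3 folds this is a sanity check, not a statistical test.

```python
# Mean and std of the 3-fold roc-auc, rounded from the outputs above.
cv_results = {
    "strategy 1 (median + indicators)": (0.9124, 0.0194),
    "strategy 2 (end of tail)": (0.9235, 0.0052),
}

(mean_1, std_1), (mean_2, std_2) = cv_results.values()

difference = abs(mean_1 - mean_2)
largest_std = max(std_1, std_2)

print(f"difference in mean roc-auc: {difference:.4f}")
print(f"largest fold-to-fold std:   {largest_std:.4f}")
print("within noise" if difference <= largest_std else "larger than noise")
```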

# Compare the performance of logistic regression

Train two logistic regression models using the datasets obtained after the different imputation strategies.

Evaluate their performance using the roc-auc.

Use cross-validation to obtain the roc-auc value over the train set.
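
The cells in this notebook fit the imputers once on the whole train set before cross-validating. An alternative, sketched below, wraps imputation and model in a scikit-learn `Pipeline` so that the imputer is refit inside every cross-validation fold. This is not what the notebook does: the sketch uses scikit-learn's `SimpleImputer(add_indicator=True)` as a stand-in for strategy 1, and synthetic data (`X_demo`, `y_demo`) instead of the bankruptcy dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the train set: 300 rows, 4 features, ~10% missing.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)  # target from complete data
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan          # then knock out values

# Median imputation + missing indicators, refit within each fold.
pipe = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LogisticRegression(max_iter=1000),
)

cv_scores = cross_validate(pipe, X_demo, y_demo, cv=3, scoring="roc_auc")
print(
    "roc_auc cv: ", cv_scores["test_score"].mean(),
    " + or - ", cv_scores["test_score"].std(),
)
```

The design choice here is that each fold's validation rows never influence the medians used to impute that fold's training rows.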

In [11]:
# strategy 1: 

logit = LogisticRegression(random_state=42, max_iter=1000)

model = cross_validate(
    logit,
    X_train_1,
    y_train,
    cv=3,
    scoring="roc_auc",
)

print("roc_auc train: ", np.mean(model['test_score']), " + or - ", np.std(model['test_score']))

roc_auc train:  0.6023959798877377  + or -  0.018267680924627716


In [12]:
logit.fit(X_train_1, y_train)

print("roc_auc test ", roc_auc_score(y_test, logit.predict_proba(X_test_1)[:,1]))

roc_auc test  0.5338055413324231


In [13]:
# strategy 2: 

logit = LogisticRegression(random_state=42, max_iter=1000)

model = cross_validate(
    logit,
    X_train_2,
    y_train,
    cv=3,
    scoring="roc_auc",
)

print("roc_auc train: ", np.mean(model['test_score']), " + or - ", np.std(model['test_score']))

roc_auc train:  0.6054907195267467  + or -  0.01065059919857885


In [14]:
logit.fit(X_train_2, y_train)

print("roc_auc test ", roc_auc_score(y_test, logit.predict_proba(X_test_2)[:,1]))

roc_auc test  0.5481709051601524


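Logistic regression does not perform well regardless of the imputation strategy used: the cross-validated roc-auc is about 0.60 for both strategies, and the test roc-auc is 0.53 to 0.55, barely better than random guessing (0.5). The performance on the test set is also lower than the cross-validated performance on the train set.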