# ISLP - Chapter 8 - Exercise 12
### Author: pzuehlke

We will analyze the real-world Kaggle dataset titled
[Company Bankruptcy Prediction](https://storage.googleapis.com/kaggle-data-sets/1111894/1938459/compressed/data.csv.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20250225%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20250225T001219Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=b45e0eb6fb03be22611fc0abcf9fc809378116514f7f35c7e7a34572e3d1759b98aa5db7bc4706108e456a604e64366d9a12594d9f6ab287e552e16037bdc209d1ef784daf1aa6769e59ff2f98e04333e09357cc8ebd781d1fb9ee3b7a3229bc4db5e5fe0153644b58a57bf0e1225312422434899ec3510a8ee34a5667336138f1ba2b18147ce80640ec5f71fcc8f38726d52c8c43044c5e3faa2b19881e03ebf7c5fd607a1fafac1ba1bff861a383dafcffaf038dec1436d30c0c0bda6f47b7db9e707b9b48cbfd62d0efa7fb440d0078eb05dcc5a53c38af4747ca3b263b7c73c693830cc8f94a31183c8037e5f6f25837e7c1803140cc0b30dcb5ee16ef37) with the aim of predicting company bankruptcy by means of several financial metrics about it. The companies are from Taiwan and the data were collected from 1999 to 2009. The original paper is
[Financial ratios and corporate governance indicators in bankruptcy prediction: A comprehensive study](https://www.sciencedirect.com/science/article/abs/pii/S0377221716000412).

In [None]:
from collections import Counter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_curve, auc
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier

We begin by loading the data and displaying basic information about it.

In [37]:
data = pd.read_csv("Company_Bankruptcy.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6819 entries, 0 to 6818
Data columns (total 96 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Bankrupt?                                                 6819 non-null   int64  
 1    ROA(C) before interest and depreciation before interest  6819 non-null   float64
 2    ROA(A) before interest and % after tax                   6819 non-null   float64
 3    ROA(B) before interest and depreciation after tax        6819 non-null   float64
 4    Operating Gross Margin                                   6819 non-null   float64
 5    Realized Sales Gross Margin                              6819 non-null   float64
 6    Operating Profit Rate                                    6819 non-null   float64
 7    Pre-tax net Interest Rate                                6819 non-null   float64
 8    After-tax net Int

There are $ 96 $ columns, of which the first, `Bankrupt?`, is the response. Let's
check that it really is a binary categorical variable, as its integer type
suggests, and compute the proportion of companies that went bankrupt:

In [38]:
# Single quotes inside the f-strings keep this valid on Python versions before 3.12.
print(f"Unique values for `Bankrupt?`: {data['Bankrupt?'].unique()}")
print(f"Proportion of bankrupt companies in the period: {data['Bankrupt?'].mean():.4f}")

Unique values for `Bankrupt?`: [1 0]
Proportion of bankrupt companies in the period: 0.0323


Only about $ 3.2 \% $ of the companies went bankrupt, so the classes are highly imbalanced. Let's check for missing values:

In [None]:
missing_values = data.isnull().sum()
print("\nMissing values per column:")
print(missing_values[missing_values > 0])


Missing values per column:
Series([], dtype: int64)


Great, there are no missing values. Let's determine the fifteen variables that are most correlated with bankruptcy in absolute value (we display sixteen rows because the first one is the response itself):

In [None]:
correlations = data.corr()["Bankrupt?"]
sorted_correlations = correlations.abs().sort_values(ascending=False)
sorted_correlations.head(16)

Bankrupt?                                                   1.000000
 Net Income to Total Assets                                 0.315457
 ROA(A) before interest and % after tax                     0.282941
 ROA(B) before interest and depreciation after tax          0.273051
 ROA(C) before interest and depreciation before interest    0.260807
 Net worth/Assets                                           0.250161
 Debt ratio %                                               0.250161
 Persistent EPS in the Last Four Seasons                    0.219560
 Retained Earnings to Total Assets                          0.217779
 Net profit before tax/Paid-in capital                      0.207857
 Per Share Net profit before tax (Yuan ¥)                   0.201395
 Current Liability to Assets                                0.194494
 Working Capital to Total Assets                            0.193083
 Net Income to Stockholder's Equity                         0.180987
 Borrowing dependency             

In [48]:
y = data["Bankrupt?"]
X = data.drop(["Bankrupt?"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

In [55]:
n_samples = len(y_train)
n_classes = 2
class_counts = Counter(y_train)
class_weights = {class_label: n_samples / (n_classes * count) for class_label, count in class_counts.items()}
print("class weights for addressing imbalance:")
print(class_weights)

class weights for addressing imbalance:
{0: 0.5166702749512881, 1: 15.496753246753247}
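
The weights above follow the formula $ n_{\text{samples}} / (n_{\text{classes}} \cdot \text{count}) $, which is the same heuristic that scikit-learn applies under `class_weight="balanced"`. As a minimal sketch, using a small made-up label vector `y_toy` (an assumption for illustration, not the bankruptcy data), we can check that the manual computation agrees with `compute_class_weight`:

```python
from collections import Counter

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (hypothetical): 97 negatives and 3 positives.
y_toy = np.array([0] * 97 + [1] * 3)

# Manual weights: n_samples / (n_classes * class_count).
counts = Counter(y_toy.tolist())
manual_weights = {
    label: len(y_toy) / (len(counts) * count) for label, count in counts.items()
}

# scikit-learn's "balanced" heuristic applied to the same labels.
classes = np.unique(y_toy)
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
sklearn_weights = dict(zip(classes.tolist(), balanced.tolist()))

print(manual_weights)
print(sklearn_weights)
```

If the two dictionaries agree, passing `class_weight="balanced"` to an estimator that accepts it should be equivalent to passing the dictionary computed by hand.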


First, let's fit a basic decision tree without any pruning or restrictions:

In [52]:
dt_classifier = DecisionTreeClassifier(random_state=0)
dt_classifier.fit(X_train, y_train)

dt_train_score = dt_classifier.score(X_train, y_train)
dt_test_score = dt_classifier.score(X_test, y_test)

print(f"Unpruned decision tree training accuracy: {dt_train_score:.4f}")
print(f"Unpruned decision tree testing accuracy: {dt_test_score:.4f}")

Unpruned decision tree training accuracy: 1.0000
Unpruned decision tree testing accuracy: 0.9531


This test accuracy looks great at first sight, but it is misleading.
Recall that only about $ 3.2\% $ of all companies went bankrupt, so we would
obtain a higher accuracy (about $ 96.8\% $) by simply predicting that no company goes bankrupt.
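
To make the comparison concrete, here is a minimal sketch with a hypothetical label vector `y_toy` whose positive rate is set to $ 3.2\% $ to mimic the data (it is not the actual test set). A classifier that always predicts the majority class attains an accuracy equal to the proportion of that class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 32 positives among 1000 observations (3.2%).
y_toy = np.array([1] * 32 + [0] * 968)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class, i.e. "not bankrupt".
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

On these toy labels the baseline accuracy is $ 0.968 $, which exceeds the $ 0.9531 $ obtained by the unpruned tree even though the baseline never identifies a single bankruptcy.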

Below we define two helper functions for comparing the performance of the
models that we will fit:

In [None]:
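def get_error_rate(model, X_test, y_test, threshold=0.2):
    """Return the error rate when class 1 is predicted if P(class 1) > threshold."""
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob > threshold).astype(int)
    # Use the thresholded predictions; model.score relies on model.predict
    # and would therefore ignore the threshold.
    error_rate = np.mean(y_pred != np.asarray(y_test))
    return error_rate


def get_confusion_matrix(model, X_test, y_test, threshold=0.2):
    """Return the confusion matrix for predictions thresholded at `threshold`."""
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob > threshold).astype(int)
    conf_matrix = confusion_matrix(y_test, y_pred)
    return conf_matrix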
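
As a rough illustration of why the threshold matters, the sketch below fits a shallow tree on synthetic imbalanced data generated with `make_classification` and compares the confusion matrices at thresholds $ 0.5 $ and $ 0.2 $. All names and parameters here are assumptions chosen for illustration and are not part of the bankruptcy analysis. Lowering the threshold can only add predicted positives, so the number of true positives cannot decrease.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (hypothetical stand-in, roughly 5% positives).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)

# A shallow tree, so that leaf probabilities are not all exactly 0 or 1.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
y_prob = tree.predict_proba(X_te)[:, 1]

# Confusion matrix (rows: true class, columns: predicted class) per threshold.
matrices = {}
for threshold in (0.5, 0.2):
    y_pred = (y_prob > threshold).astype(int)
    matrices[threshold] = confusion_matrix(y_te, y_pred, labels=[0, 1])
    print(f"threshold={threshold}:\n{matrices[threshold]}")
```

The price of a lower threshold is that false positives cannot decrease either, so the choice depends on how costly a missed bankruptcy is relative to a false alarm.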