## Company Bankruptcy Prediction

## Table of Contents

<ul>
    <li><a href='#intro'>Introduction</a></li>
    <li><a href='#wrangle'>Data Wrangling</a></li>
    <li><a href='#eda'>Exploratory Data Analysis</a></li>
    <li><a href='#conclusion'>Conclusion</a></li>
</ul>

<a id='intro'></a>
## Introduction
The data were collected from the Taiwan Economic Journal for the years 1999 to 2009. 

In this notebook, I built a LightGBM classifier to predict company bankruptcy from financial features. On a held-out 20% split of the SMOTEENN-resampled data, the model reaches an accuracy of 0.99 and an F1-score of 0.99. I then used SHAP to explain the model's predictions.

I also built a [Data Visualization Web Application](https://bankruptcy-visualization.herokuapp.com/) with Streamlit and Heroku. In the application, you can select any two features to generate a scatterplot, with colors representing bankruptcy status.

<a id='wrangle'></a>
## Data Wrangling

In [None]:
# Import packages

## general packages
import os
import numpy as np
import pandas as pd

## Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.transform import factor_cmap, jitter
from bokeh.layouts import row

## Machine learning
import time
from sklearn.model_selection import train_test_split
### Oversampling
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
### LightGBM
import lightgbm as lgb
### Metrics
from sklearn.metrics import roc_auc_score, precision_score, classification_report
### Feature Selection
import shap

In [None]:
# Upload dataset
df = pd.read_csv('data.csv')
df

In [None]:
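df2 = df.copy()
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df2['Bankrupt?'] = df2['Bankrupt?'].replace({0: 'No', 1: 'Yes'})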
df2

In [None]:
df.describe()

Normalize each feature

There are 95 variables describing the condition of companies, plus one column "Bankrupt?" as the label.

The number of records is 6819.

Next, I'll check for duplicate rows and null values.

In [None]:
# check duplicates
df.duplicated().sum()

In [None]:
# check null values
df.isnull().any().any()

In [None]:
df.info()
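
The checks above can be gathered into one small helper. The sketch below runs on a made-up toy frame rather than the bankruptcy data, and `quality_report` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize duplicate rows and missing values in a DataFrame."""
    null_counts = frame.isnull().sum()
    return {
        "n_rows": len(frame),
        "n_duplicate_rows": int(frame.duplicated().sum()),
        "n_null_cells": int(null_counts.sum()),
        "columns_with_nulls": null_counts[null_counts > 0].index.tolist(),
    }


# Toy data (made up for illustration): one duplicated row, one missing value
toy = pd.DataFrame({"a": [1.0, 1.0, 2.0, None], "b": [0.5, 0.5, 0.7, 0.9]})
report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the real data would give the same information as the three cells above in a single dictionary.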

### Class Balancing

In [None]:
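# Sort by class value so the renaming does not depend on which class is more frequent
labels = df['Bankrupt?'].value_counts().sort_index()
labels = labels.rename(index={0: 'No', 1: 'Yes'})
labels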

To address the class imbalance, I'll use SMOTEENN, which oversamples the minority class with SMOTE and then removes noisy samples with Edited Nearest Neighbours (ENN).
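
To make the two steps concrete, here is a simplified sketch of the ideas behind them. It is not imbalanced-learn's implementation: `smote_like` and `enn_keep_mask` are hypothetical helpers, the ENN step uses a plain majority vote, and the data are made-up Gaussian blobs.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Column 0 is the query point itself (no exact duplicates here), so drop it
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


def enn_keep_mask(X, y, k=3):
    """Edited Nearest Neighbours idea (majority-vote variant, odd k):
    keep a sample only if its label matches most of its k neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neighbours = nn.kneighbors(X, return_distance=False)[:, 1:]
    majority = (y[neighbours].mean(axis=1) > 0.5).astype(int)
    return majority == y


# Toy data (made up): 200 majority points and 20 minority points
rng = np.random.default_rng(28)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(3.0, 1.0, size=(20, 2))

# Step 1: oversample the minority class up to the majority size
X_syn = smote_like(X_min, n_new=180)
X_all = np.vstack([X_maj, X_min, X_syn])
y_all = np.array([0] * 200 + [1] * 200)

# Step 2: drop samples that disagree with their neighbourhood
keep = enn_keep_mask(X_all, y_all)
X_clean, y_clean = X_all[keep], y_all[keep]
print(np.bincount(y_all), np.bincount(y_clean))
```

Because the cleaning step can remove points from either class, the resampled classes are not guaranteed to end up exactly equal in size.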

In [None]:
ori_X = df.drop(['Bankrupt?'], axis=1)
ori_y = df['Bankrupt?']

smote_enn = SMOTEENN(random_state=28)
X, y = smote_enn.fit_resample(ori_X, ori_y)

In [None]:
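# After resampling, the more frequent class may change, so sort by class value before renaming
new_labels = y.value_counts().sort_index()
new_labels = new_labels.rename(index={0: 'No', 1: 'Yes'})
new_labels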

In [None]:
fig = plt.figure(figsize=[10, 6])
gs = fig.add_gridspec(1, 2)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])

ax0.pie(labels, labels=labels.index, pctdistance=0.5, autopct='%.1f%%')
ax1.pie(new_labels, labels=new_labels.index, pctdistance=0.5, autopct='%.1f%%')

ax0.set_title('Distribution of Labels before SMOTEENN')
ax1.set_title('Distribution of Labels after SMOTEENN');

<a id='eda'></a>
## Exploratory Data Analysis

### Modeling

Split the data with scikit-learn, then train with LightGBM's native training API (`lgb.train`):

In [None]:
# Split data into random train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=28)
# Set the dataset
d_train = lgb.Dataset(X_train, label=y_train)
d_test = lgb.Dataset(X_test, label=y_test)
# Specify parameters
params = {'boosting':'gbdt',
          'max_bin':512,
          'num_leaves':10,
          'learning_rate':0.03,
          'objective':'binary',
          'force_col_wise':True,
          'metric':'binary_logloss'}

# Train
lgbm = lgb.train(params, d_train, 1000)
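
Note that the cell above splits the data after resampling, so synthetic samples can appear in both the training and test sets. A common alternative is to split first and resample only the training part. The sketch below illustrates that ordering on made-up data, using random oversampling from `sklearn.utils.resample` as a stand-in for SMOTEENN so that it depends only on scikit-learn. All variable names here are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (made up): 950 negatives, 50 positives
rng = np.random.default_rng(28)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.array([0] * 950 + [1] * 50)

# 1) Split first, stratifying so both parts keep the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=28
)

# 2) Resample the training part only (stand-in: random oversampling)
minority = y_tr == 1
n_majority = int((~minority).sum())
X_up, y_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=n_majority, random_state=28,
)
X_tr_bal = np.vstack([X_tr[~minority], X_up])
y_tr_bal = np.concatenate([y_tr[~minority], y_up])

print(np.bincount(y_tr_bal))  # balanced training set
print(np.bincount(y_te))      # test set keeps the original imbalance
```

With this ordering, the test metrics describe performance on data distributed like the original records rather than on resampled ones.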

In [None]:
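y_prob = lgbm.predict(X_test)          # predicted probability of class 1
y_pred = (y_prob > 0.5).astype(int)    # hard labels at a 0.5 threshold

In [None]:
# roc_auc_score expects (y_true, y_score); scoring probabilities is more informative than scoring hard labels
roc_auc_score(y_test, y_prob)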

In [None]:
print(classification_report(y_test, y_pred))
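
As a reminder of what the columns in the report mean, the sketch below recomputes precision, recall, and F1 for the positive class from raw counts and compares them with scikit-learn. The labels are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (made up for illustration)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_hat = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Counts for the positive class
tp = int(((y_hat == 1) & (y_true == 1)).sum())  # predicted 1, actually 1
fp = int(((y_hat == 1) & (y_true == 0)).sum())  # predicted 1, actually 0
fn = int(((y_hat == 0) & (y_true == 1)).sum())  # predicted 0, actually 1

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# The hand-computed values should agree with scikit-learn
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```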

### Feature selection

In [None]:
# Load JS visualization code
shap.initjs()

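# Explain the model's predictions using SHAP
# Note: SHAP values are computed on X, the full resampled dataset (train and test)
explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(X)

In [None]:
# Index [1] selects the values for the "bankrupt" class; this assumes a SHAP
# version that returns one array per class for binary LightGBM models
shap_values[1].shape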

#### The total impact of features on the model

In [None]:
# The SHAP values were computed on X, so pass X (not X_train) as the matching features
shap.summary_plot(shap_values, features=X, feature_names=X.columns, plot_type='bar')

This plot ranks the features by importance in descending order. For example, the top feature, "Continuous interest rate", contributes more to the model than the second feature, "Total debt/Total net worth". To understand the direction of each feature's relationship with the label, I plotted the effect of these features on every record in the resampled data, as shown below:

#### The impact of features on the model for individual data

In [None]:
shap.summary_plot(shap_values[1], features=X, feature_names=X.columns)

This plot shows whether each feature relates positively or negatively to the label. As before, features are ranked in descending order of importance. Each dot represents one record in the resampled data. The color encodes the value of the feature (red for high, blue for low), and the horizontal axis shows the feature's effect on the model's prediction. For example, a high continuous interest rate lowers the predicted probability of bankruptcy, so there is a negative relationship between continuous interest rate and bankruptcy.
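
Both summary plots order features by the size of their absolute SHAP values, averaged (or summed) over records. The sketch below shows that ranking on a made-up SHAP matrix. The feature names are placeholders, not columns from this dataset.

```python
import numpy as np

# Made-up SHAP matrix: 4 records x 3 features (names are placeholders)
feature_names = ["f_a", "f_b", "f_c"]
shap_matrix = np.array([
    [0.2, -1.0, 0.1],
    [-0.4, 0.8, 0.0],
    [0.3, -0.9, -0.1],
    [-0.1, 1.1, 0.2],
])

# Global importance: mean absolute contribution per feature
mean_abs = np.abs(shap_matrix).mean(axis=0)

# Rank features from most to least important
order = np.argsort(mean_abs)[::-1]
ranking = [(feature_names[i], float(mean_abs[i])) for i in order]
print(ranking)
```

Taking the absolute value first matters: the contributions of `f_b` have mixed signs and would nearly cancel in a plain mean, even though that feature moves the prediction the most.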

#### The effect of a single feature across the whole dataset

In [None]:
for name in X.columns:
    shap.dependence_plot(name, shap_values[1], X)

Next, I'll explore the effect of features on the prediction of each record, such as the first one with index 0:

In [None]:
shap.force_plot(explainer.expected_value[1], shap_values[1][0,:], X.iloc[0,:])

How about the second record?

In [None]:
shap.force_plot(explainer.expected_value[1], shap_values[1][1,:], X.iloc[1,:])

The predictions for both records are lower than the base value, which means that these two records were classified as "not bankrupt". Features that push the prediction toward "not bankrupt" are shown in blue.
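
The force plots rely on the additivity property of Shapley values: the base value plus the sum of a record's feature contributions equals the model output for that record. The sketch below checks this on a tiny hand-written function by enumerating all feature coalitions. It is a brute-force illustration, not the algorithm `shap.TreeExplainer` uses. `exact_shapley`, `toy_model`, and the single all-zero baseline are assumptions made for this example.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one record x by enumerating all coalitions.
    Features outside a coalition are replaced by the baseline values."""
    n = len(x)

    def value(coalition):
        z = baseline.copy()
        idx = list(coalition)
        z[idx] = x[idx]
        return predict(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


def toy_model(z):
    """Hypothetical model with one interaction term between features 0 and 2."""
    return 2.0 * z[0] - z[1] + z[0] * z[2]


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)

phi = exact_shapley(toy_model, x, baseline)
base_value = toy_model(baseline)

print(phi)
# Additivity: base value + contributions reproduces the model output
print(base_value + phi.sum(), toy_model(x))
```

The interaction term is split evenly between the two features involved, and negative contributions play the role of the blue segments in the force plot.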

### Visualization

First, let's take a look at the relationship between the top two variables:
  - Continuous interest rate
  - Total debt/Total net worth

More visualizations are available in the [Heroku app](https://bankruptcy-visualization.herokuapp.com).

In [None]:
output_notebook()

In [None]:
# Map each label to a color once and reuse it in both figures
colormap = {0: 'green', 1: 'red'}
colors = [colormap[x] for x in df['Bankrupt?']]

# Figure 1: continuous interest rate vs. total debt / total net worth
# (width/height replace plot_width/plot_height, which newer Bokeh releases no longer accept)
p1 = figure(width=500, height=500)
p1.scatter(df[' Continuous interest rate (after tax)'],
           df[' Total debt/Total net worth'],
           size=10, line_color='black', fill_color=colors, fill_alpha=0.2)

p1.xaxis.axis_label = 'Continuous interest rate (after tax)'
p1.yaxis.axis_label = 'Total debt/Total net worth'

# Figure 2: continuous interest rate vs. retained earnings to total assets
p2 = figure(width=500, height=500)
p2.scatter(df[' Continuous interest rate (after tax)'],
           df[' Retained Earnings to Total Assets'],
           size=10, line_color='black', fill_color=colors, fill_alpha=0.2)

p2.xaxis.axis_label = 'Continuous interest rate (after tax)'
p2.yaxis.axis_label = 'Retained Earnings to Total Assets'

# Show both figures side by side
show(row(p1, p2))

<a id='conclusion'></a>
## Conclusion

The LightGBM model performs well, with an accuracy of 0.99 and an F1-score of 0.99 on the held-out split of the resampled data. This model could help investors identify companies with potential before making business decisions.