# Predicting company bankruptcy 
-- Creating as much value as possible within a time limit of 1 day.

## Context
The data was collected from the Taiwan Economic Journal from 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange.

**In the article by Deron Liang, Chia-Chi Lu, Chih-Fong Tsai and Guan-An Shih (Financial ratios and corporate governance indicators in bankruptcy prediction: A comprehensive study, 2016), I found some additional information on how this dataset was collected.**

1. "The sample companies had to have at least three years of complete public information before the occurrence of the financial crisis." However, there was no information on the year the data were taken from (year of the bankruptcy (Y), Y-1, Y-2, or Y-3).

2. The resulting sample includes:
    * companies from the manufacturing industry, composed of industrial and electronics companies (346 companies),
    * the service industry, composed of shipping, tourism, and retail companies (39 companies),
    * and others (93 companies), but no financial companies.



## Project Overview

**Goals:** As a former Restructuring consultant, I thought it would be interesting:
* to see whether we can accurately predict which companies will face bankruptcy in the future with Machine Learning.
* to leverage this model to create some additional value, such as developing a credit scoring tool (not performed in this notebook).

**However, I will perform those tasks in depth on a future dataset that I will scrape myself on European or North American companies. Important information is missing from this dataset (such as the company's industry). Furthermore, I would like to scrape data from up to 1 year prior to the actual bankruptcy.**

**Constraint:** Due to the dataset limitations, I decided to spend only 1 day working on this project.
My goals are:
* To get used to working and publishing on Kaggle (as it is the first project that I will share on it),
* To see how much value I could produce in a limited timeframe,
* To increase the number of datasets and challenges I will be able to work with.


**edit 16/06:**

I created a GitHub repository to deploy this model with Flask/Heroku: https://github.com/AymericPeltier/Bankruptcy_Prediction
* Currently waiting for support on how to fix a dependency issue with numpy (mlk not supported yet)
* Furthermore, I updated 2/3 of the code to improve the f1_score, especially for the Logistic Regression (which is the model I deployed)

=> Please refer to version 7 for what I achieved in only one day.

**edit 19/06:**
* Added classification report / confusion matrix for the NN

# Index:

- [Describe the data](#p2)
- [Data pre-processing](#p3)
- [Weak learners: Decision Tree and Logistic Regression](#p4)
- [Shallow Neural Network (SNN) with Keras](#p5)
- [Conclusion](#p6)

In [None]:
#Let's load all the required libraries

#General
import numpy as np
import pandas as pd


#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Building the models
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, precision_recall_curve, f1_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_selection import RFECV, f_classif, VarianceThreshold, SelectKBest, f_regression
from sklearn.ensemble import RandomForestClassifier


#deep learning keras
import keras
from keras.models import Sequential
from keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from keras.layers import Dropout

 [Back to top](#Index:) 
<a id='p2'></a>
### Describe the Data

**Data type and quality:**
- There are 96 columns (95 input features + 1 output feature) in the dataset, and 6819 rows (=companies)
- 93 input features are numerical & 2 are categorical (one of which has only one value)
- There is no missing data, no null values, no duplicated rows 

**Out of the 6819 companies in the dataset:**
- 6599 (97%) did not go bankrupt
- 220 (3%) went bankrupt


**First impressions:**
* The dataset is really clean, there is no need for extensive data cleaning.

* Looking at the column names and taking into account that these are financial ratios, I can already guess that many features are highly correlated.

* We can see a strong class imbalance. Accuracy (the share of correct predictions) is not a relevant evaluation metric here, as any model could achieve 97% accuracy by simply predicting that no company will go bankrupt. I would also argue that correctly predicting that a company will go bankrupt matters more than correctly predicting that a company will not, because action must be taken in the former case. The F1 score therefore seems more relevant.
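
To make the point about accuracy concrete, here is a minimal sketch using only the class counts quoted above (6599 vs 220). The "model" is a hypothetical baseline that always predicts non-bankruptcy; it is an illustration, not something trained in this notebook.

```python
# Class counts taken from the dataset description above.
n_negative, n_positive = 6599, 220

# Hypothetical baseline: predict "not bankrupt" for every company.
y_true = [0] * n_negative + [1] * n_positive
y_pred = [0] * len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
# F1 = tp / (tp + 0.5 * (fp + fn)); taken as 0 when there are no true positives.
f1 = tp / (tp + 0.5 * (fp + fn)) if tp else 0.0

print(f"accuracy = {accuracy:.3f}, f1 = {f1:.3f}")
```

The baseline looks excellent on accuracy while never detecting a single bankruptcy, which is exactly what the F1 score of the positive class exposes.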


In [None]:
df = pd.read_csv("/kaggle/input/company-bankruptcy-prediction/data.csv")
columns = df.columns
print(df.shape)
print("total null values", df.isnull().sum().sum())
print("total potential duplicated rows", df.duplicated().sum())
print("----------------------------------")
Bankrupt, Bankrupt_perc = (df["Bankrupt?"].value_counts(), round(df["Bankrupt?"].value_counts(normalize=True),2))
display(Bankrupt, Bankrupt_perc)
print("----------------------------------")
df.info()

 [Back to top](#Index:) 
<a id='p3'></a>
### Data Pre-processing & Features Selection

In this section, we will prepare our data for consumption.
* We will normalize our data

* We will select a reduced number of features (to avoid overfitting and make the models simple to implement if needed)

**Selection process:**
* First, among features that are highly correlated with each other (absolute correlation above a threshold; the code below uses 0.65), we want to keep the one most correlated with bankruptcy and drop the others *(e.g. ROA(B) and ROA(A) have 0.99 correlation to ROA(C), and ROA(C) has the highest correlation to "Bankrupt?" => drop ROA(A) and ROA(B))*. -- 34 features were dropped doing so.

* Then, we use feature ranking with recursive feature elimination and cross-validated selection of the best number of features. As my goal is to make something easy to understand and simple to use, I'll restrict the features to those with the highest importance. -- at this stage, 9 features were selected.

* Finally, after looking at their SelectKBest score and their boxplot (data distribution), we will decide whether to use any more features. -- 6 features were kept.
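
The first step of the selection process can be sketched as follows. This is only an illustration of the idea (keep the feature most correlated with the target, drop its near-duplicates), not the exact code used in this notebook: the function name `drop_correlated`, the synthetic data and the threshold value are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_correlated(X, y, threshold=0.7):
    """Return the columns to drop so that no kept pair exceeds `threshold`.

    Features are visited from the most to the least correlated with the
    target; a feature is dropped when it is too correlated with a feature
    that has already been kept.
    """
    target_corr = X.corrwith(y).abs().sort_values(ascending=False)
    feature_corr = X.corr().abs()
    kept, dropped = [], []
    for col in target_corr.index:
        if any(feature_corr.loc[col, k] > threshold for k in kept):
            dropped.append(col)
        else:
            kept.append(col)
    return dropped


# Synthetic data: "b" is a noisy copy of "a", "c" is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X_demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})
y_demo = pd.Series((a + rng.normal(scale=0.5, size=500) > 1).astype(int))

print(drop_correlated(X_demo, y_demo))
```

Only one of the two near-duplicate columns is dropped, and the independent column is always kept.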
 

**At the end of our steps, we have 6 features remaining:**
* ROA(C), 
* Net Value Per Share (B), 
* Debt ratio %, 
* Total Asset Turnover, 
* Cash/Total Assets, 
* Equity to Liability

I decided to drop 'Accounts Receivable Turnover', 'Total income/Total expense', and 'Operating Profit Rate' as those features might lead to overfitting. With more data, it would be interesting to check whether they are actually useful.

**Without going into too much depth because of the time constraint, the remaining features seem to efficiently capture the main factors that lead a company to bankruptcy:**
* **ROA & Total Asset Turnover** indicate how effective the company's investment in assets is. They also give an indication of profitability (ROA is based on net income, Total Asset Turnover on net sales).
* **Net value and debt ratio** relate total assets to total debt (= financial risk).
* **Cash/Total Assets** may indicate a degree of safety, provided it is neither too low nor too high.
* **Equity to Liability** gives a picture of the leverage being used.
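
As a quick illustration of what these ratios measure, here is a small worked example. The figures are hypothetical and the formulas are the usual textbook definitions; the dataset's exact definitions may differ (for instance, ROA(C) is computed before interest and depreciation), and Net Value Per Share is left out because it needs a share count.

```python
# Hypothetical balance-sheet and income-statement figures (same currency unit).
company = {
    "net_income": 12.0,
    "net_sales": 150.0,
    "cash": 20.0,
    "total_assets": 200.0,
    "total_liabilities": 120.0,
}
equity = company["total_assets"] - company["total_liabilities"]

# Textbook definitions, used here as an assumption.
ratios = {
    "roa": company["net_income"] / company["total_assets"],
    "total_asset_turnover": company["net_sales"] / company["total_assets"],
    "debt_ratio": company["total_liabilities"] / company["total_assets"],
    "cash_to_total_assets": company["cash"] / company["total_assets"],
    "equity_to_liability": equity / company["total_liabilities"],
}

for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```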

Overall, I think that this dataset is missing some very important information:
* the company's industry: financial ratios can vary strongly from one industry to another.
* time series of the ratios: most of the value of this exercise would lie in predicting the risk of bankruptcy as early as possible.

**While I think we could achieve an even better selection with more time and domain knowledge, this set of features seems quite decent.**

**I won't be eliminating any outlier values for multiple reasons:**
* The minority class sample is already quite small, and I am unwilling to reduce it even further.

* We have no information regarding the company's industry, and financial ratios can vary strongly from one industry to another. Those outliers may actually be normal.

* Bankrupt companies could present some highly skewed financial ratios.

However, with more time for consideration and data exploration, I might make a different choice.

**I did not perform any feature engineering as all the financial ratios are already quite precise and cover many if not all of the main financial aspects of a company.**

In [None]:
#Drop constant columns (if any)
var_thres = VarianceThreshold(threshold=0).fit(df)
constant_columns = [column for column in df.columns
                    if column not in df.columns[var_thres.get_support()]]
for feature in constant_columns:
     print(feature)
df.drop(constant_columns,axis=1, inplace=True)

In [None]:
#Normalize data for faster processing
def data_scaling(DataFrame):
    scaler = StandardScaler()
    DataFrame.iloc[:,1:] = scaler.fit_transform(DataFrame.iloc[:,1:])
    return(DataFrame)
df_ml = data_scaling(df)

#Split dataframe for feature selection
X = df_ml.drop(columns=["Bankrupt?"])
y = df_ml["Bankrupt?"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
df_train =pd.concat([y_train, X_train], axis=1)


#Sort columns from the least correlated to the most correlated with the target
df_train_corr = df_train.corr()
df_train_corr = df_train_corr.reindex(df_train_corr["Bankrupt?"].abs().sort_values(ascending=True).index).T
column_names = np.array(df_train_corr.columns)
df_train= df_train.reindex(columns=column_names)

#Isolate the input features which have a high correlation between themselves
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr
corr_features = correlation(X_train, 0.65)
display(len(corr_features))

#Now let's go ahead and drop them:
X_train.drop(corr_features, axis=1, inplace=True)
X_test.drop(corr_features, axis=1, inplace=True)
df_train.drop(corr_features,axis=1, inplace=True)
df_train.head()

In [None]:
#Feature selection for logistic regression
#Use rfecv to determine which features should remain in our model
lgr = LogisticRegression(max_iter=1000)
rfecv = RFECV(estimator=lgr, step=5, cv=5, scoring='f1')
rfecv = rfecv.fit(X_train, y_train)

select_features_rfecv = pd.DataFrame({'Features': list(X_train.columns),
                                  'Ranking_cv': rfecv.ranking_})
select_features_rfecv = select_features_rfecv.sort_values(by='Ranking_cv')

#Let's also get their univariate ANOVA F-score (f_classif)
select_features = SelectKBest(score_func=f_classif, k=10).fit(X_train, y_train)
select_features_kbest = pd.DataFrame({'Features': list(X_train.columns),
                                  'Scores': select_features.scores_})
#sort_values returns a new DataFrame, so the result has to be assigned
select_features_kbest = select_features_kbest.sort_values(by='Scores', ascending=False)

#Let's display the result
select_features_rfecv.merge(select_features_kbest, how='left', on='Features').head(20)

In [None]:
# Select the 9 most important features. We are very likely to remove the 3 features with scores lower than 1, but I am interested
# in peeking at their basic information.
features = ["Bankrupt?"] + list(select_features_rfecv["Features"].iloc[:9])
#.copy() so that later in-place operations act on an independent DataFrame
final_df = df.loc[:, df.columns.isin(features)].copy()
round(final_df.describe(),2)

In [None]:
#change name for clarity purpose:
final_df = final_df.rename(columns={' ROA(C) before interest and depreciation before interest': ' ROA(C)'})

#plotting boxplot to look for outliers:
fig, saxis = plt.subplots(3, 3, figsize=(15,15))
#columns[0] is "Bankrupt?", so the 9 features are columns[1:10]
for ax, column in zip(saxis.flat, final_df.columns[1:10]):
    sns.boxplot(x="Bankrupt?", y=column, data=final_df, ax=ax)
    ax.set_title(f"Bankrupt vs {column}")

**Looking at the boxplots, it appears that the 3 features with the lowest k score matter mainly because of outliers and do not provide decisive information. Thus, I will drop them.**

In [None]:
#In the end, the 3 features with low k scores seem to matter mostly because of outliers.
final_df = final_df.drop(columns=[' Accounts Receivable Turnover',
                                  ' Total income/Total expense',
                                  ' Operating Profit Rate'])
final_df.shape

 [Back to top](#Index:) 
<a id='p4'></a>

### Weak learners: Decision Tree and Logistic Regression

**Disclaimer: this section is more of a training ground, as I did not implement the most efficient models (considering what we know so far). I am convinced that XGBoost or a Random Forest would achieve more satisfactory results. Please skip to the Shallow Neural Network if you are more interested in performance!**

First, let's see what we can achieve using some simple models.

#### Logistic Regression:

**Steps taken:**
* We will once again use a stratified approach to limit the variance of our results
* We will plot the Precision-Recall curve to better evaluate how well our model predicts whether a company will go bankrupt.

**Results:**
* The f1 score is around 0.4 with class weights, which is 0.2 higher than when using a RandomOverSampler (previous version)
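
One compact way to combine stratified folds, per-fold scaling and class weights is a scikit-learn `Pipeline`. The sketch below is an assumption-laden illustration rather than the code of this notebook: it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `f1_per_fold` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (about 5% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)

# The scaler is refit on the training part of every fold, so the held-out
# part never influences the scaling parameters.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 6}),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1_per_fold = cross_val_score(pipeline, X_demo, y_demo, scoring="f1", cv=cv)

print("F1 per fold:", f1_per_fold.round(2))
print("mean F1:", round(float(f1_per_fold.mean()), 2))
```

The scores on this synthetic data say nothing about the bankruptcy dataset; the point is only the structure of the evaluation.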


In [None]:
#Split the df:
X = final_df.drop(columns=["Bankrupt?"])
y = final_df["Bankrupt?"]

#metrics (imported here so the function names are not shadowed by the score lists)
from sklearn.metrics import accuracy_score, f1_score
accuracy_scores = []
f1_scores = []

#kfold
kfold = StratifiedKFold(n_splits=5,shuffle=True)

y_real = []
y_proba = []
plt.figure(figsize=(12,12))
#Let's use a Logistic function to check the results:
for i, (train_fold_index, test_fold_index) in enumerate(kfold.split(X, y)):
    #split the data
    X_train_fold, X_test_fold = X.iloc[train_fold_index], X.iloc[test_fold_index]
    y_train_fold, y_test_fold = y.iloc[train_fold_index], y.iloc[test_fold_index]
    #fit the scaler on the training fold only, then apply it to both folds
    scaler = StandardScaler().fit(X_train_fold)
    X_train_fold = scaler.transform(X_train_fold)
    X_test_fold = scaler.transform(X_test_fold)
    #upsample the data - deleted in favor of class weights:
    #X_train_fold_upsample, y_train_fold_upsample = ros.fit_resample(X_train_fold, y_train_fold)
    
    #fit the model
    model2 = LogisticRegression(solver='lbfgs',max_iter=1000, class_weight = {0:1 , 1:6}).fit(X_train_fold, y_train_fold)
    y_pred_fold = model2.predict(X_test_fold)
    decision = model2.decision_function(X_test_fold)
    
    #Score the fitted model on the held-out fold:
    accuracy_scores.append(accuracy_score(y_test_fold, y_pred_fold))
    f1_scores.append(f1_score(y_test_fold, y_pred_fold))
    
    #plot PR-curve
    precision, recall, thresholds = precision_recall_curve(y_test_fold, decision)
    lab = 'Fold %d AUC=%.4f' % (i+1, auc(recall, precision))
    plt.step(recall, precision, label=lab)
    y_real.append(y_test_fold)
    y_proba.append(decision)

#plotting the overall PR curve from the pooled out-of-fold predictions
y_real = np.concatenate(y_real)
y_proba = np.concatenate(y_proba)
precision, recall, _ = precision_recall_curve(y_real, y_proba)
lab = 'Overall AUC=%.4f' % (auc(recall, precision))
plt.step(recall, precision, label=lab, lw=2, color='black')

plt.title('Precision Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc='upper right', fontsize='small')
print('f1_score:', round(np.mean(f1_scores), 2))

#### Decision Tree classifiers:

**Steps taken:**
* I corrected the previous version by adding a quick feature selection step using a random forest (instead of Logistic Regression, which was a mistake),
* Added a grid search (GridSearchCV) to see if we could improve the f1 score,
* Also removed the class-imbalance resampling used in the previous version (as explained before, it was merely for practice).

**Results:**
* The f1 score is stable when the depth is over 5,
* The average f1 score is around 0.3.
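
When hyperparameters are tuned with a grid search, one way to keep the reported score honest is nested cross-validation: the grid search runs inside each outer training fold, so the outer test fold is never used for tuning. The sketch below shows the idea on synthetic data; it is not what this notebook does, and the reduced parameter grid and the `*_demo` names are assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: pick the hyperparameters on the outer training fold only.
search = GridSearchCV(
    DecisionTreeClassifier(class_weight={0: 1, 1: 6}, random_state=0),
    param_grid={"max_depth": [3, 5, 8], "max_leaf_nodes": [10, 20]},
    scoring="f1",
    cv=inner_cv,
)

# Outer loop: score the tuned model on data it has never seen.
nested_f1 = cross_val_score(search, X_demo, y_demo, scoring="f1", cv=outer_cv)
print("nested F1 per outer fold:", nested_f1.round(2))
```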


In [None]:
#Updated this part with a new feature selection based on tree models (I was curious to see if it would make a difference, considering tree models did not achieve a high f1 previously)
#https://bmcgenomdata.biomedcentral.com/articles/10.1186/s12863-018-0633-8
#Research suggests that RF does not scale well to high-dimensional, highly correlated data, so we kept the first step.

#Let's now do the same with trees:
#Use rfecv to determine which features should remain in our model
rf = RandomForestClassifier(max_depth=10, random_state=0, n_estimators = 100).fit(X_train, y_train)
rfecv = RFECV(estimator=rf, step=5, cv=5, scoring='f1')
rfecv = rfecv.fit(X_train, y_train)

select_features_rfecv = pd.DataFrame({'Features': list(X_train.columns),
                                  'Ranking_cv': rfecv.ranking_})
select_features_rfecv = select_features_rfecv.sort_values(by='Ranking_cv')

#Let's also get their univariate ANOVA F-score (f_classif)
select_features = SelectKBest(score_func=f_classif, k=10).fit(X_train, y_train)
select_features_kbest = pd.DataFrame({'Features': list(X_train.columns),
                                  'Scores': select_features.scores_})
#sort_values returns a new DataFrame, so the result has to be assigned
select_features_kbest = select_features_kbest.sort_values(by='Scores', ascending=False)

#Let's display the result
select_features_rfecv.merge(select_features_kbest, how='left', on='Features').head(40)


In [None]:
#Keep the 19 best-ranked features from the RFECV above
features = ["Bankrupt?"] + list(select_features_rfecv["Features"].iloc[:19])
final_df = df.loc[:, df.columns.isin(features)].copy()
#Split the df:
X = final_df.drop(columns=["Bankrupt?"])
y = final_df["Bankrupt?"]

#gridsearchcv
param_dict = {'max_leaf_nodes': [3,10,20,30,40], 'min_samples_split': [2, 3, 4,8], 'max_depth': [5,8,11,14,17,20]}
grid = GridSearchCV(DecisionTreeClassifier(class_weight = {0:1 , 1:6}), param_grid=param_dict, cv=5,verbose=1,n_jobs=-1, scoring='f1')
grid.fit(X, y)
model_grid = grid.best_estimator_


#imported here so the function names are not shadowed by earlier variables
from sklearn.metrics import accuracy_score, f1_score

kfold = StratifiedKFold(n_splits=5,shuffle=True)
list_tree_score ={"accuracy":[],
                 "f1_score": []}


for train_fold_index, test_fold_index in kfold.split(X, y):
    #split the data
    X_train_fold, X_test_fold = X.iloc[train_fold_index], X.iloc[test_fold_index]
    y_train_fold, y_test_fold = y.iloc[train_fold_index], y.iloc[test_fold_index]
    #fit the scaler on the training fold only, then apply it to both folds
    scaler = StandardScaler().fit(X_train_fold)
    X_train_fold = scaler.transform(X_train_fold)
    X_test_fold = scaler.transform(X_test_fold)
    
    #upsample the data:
    #X_train_fold_upsample, y_train_fold_upsample = ros.fit_resample(X_train_fold, y_train_fold)
    
    #fit the model
    model = model_grid.fit(X_train_fold, y_train_fold)
    
    #Score the fitted model on the held-out fold:
    y_pred_fold = model.predict(X_test_fold)
    list_tree_score["accuracy"].append(accuracy_score(y_test_fold, y_pred_fold))
    list_tree_score["f1_score"].append(f1_score(y_test_fold, y_pred_fold))
 
mean_accuracy = np.mean(list_tree_score["accuracy"])
mean_f1 = np.mean(list_tree_score["f1_score"])

print(f"the accuracy of the model is {round(mean_accuracy, 2)}")
print(f"the f1_score of the model is {round(mean_f1, 2)}")


 [Back to top](#Index:) 
<a id='p5'></a>
### Shallow Neural Network (SNN) with Keras

Now that we have noticed that simple models do not perform well in predicting the positive class (=companies going bankrupt), let's use Shallow Neural Networks.

**Model considerations:**
* A Shallow Neural Network will be sufficient to deal with this problem
* We will train the model 10 times with different splits (as the dataset is small) to reduce the variance of the result
* **We assign the following class weights during training: 1 for the non-bankruptcy class and 4 for the bankruptcy class.** The reason is that false negatives (missed bankruptcies) are more problematic than false positives

**Results:**
* The f1 score of the SNN using our 6 selected features varies between 0.35 and 0.45 depending on the run, due to the small number of positive samples.
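
To show what a 1:4 class weight does to the training signal, here is a simplified sketch of a class-weighted binary cross-entropy. It is a hand-written illustration with a single output probability, not Keras's internal implementation, and the function name `weighted_bce` is an assumption made for the example.

```python
import math


def weighted_bce(y_true, p_pred, class_weight):
    """Mean binary cross-entropy where each sample is scaled by its class weight."""
    eps = 1e-7  # keeps log() away from 0
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weight[y] * loss
    return total / len(y_true)


weights = {0: 1, 1: 4}
# Same confidence error in both cases: probability 0.2 assigned to the true class.
missed_bankruptcy = weighted_bce([1], [0.2], weights)  # false negative
false_alarm = weighted_bce([0], [0.8], weights)        # false positive

print(missed_bankruptcy / false_alarm)
```

With these weights, a missed bankruptcy costs about four times as much as an equally confident false alarm, which pushes the network towards detecting the minority class.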


In [None]:
#We will now build a simple Neural Network using Keras
def classification_model():
    #initiating the model
    model = Sequential()
    model.add(Dense(input_dim=X_train.shape[1], units=6, activation="relu"))
    
    #for loop to add layers (can be fine tuned): 
    for i in range(2):
        model.add(Dense(units=6, activation="relu"))
        #If we add more layers, we would want to avoid overfitting with a Dropout layer model.add(Dropout(.1))
    
    #last layer (2 categories)
    model.add(Dense(units=2, activation='sigmoid'))
    
    #compile the model
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

In [None]:
#set class weights to offset the class imbalance:
weights = {0:1 , 1:4}
list_scores_keras =[]
confusion_matrix_average1 = np.zeros((2, 2))

#Let's create a loop to check our average accuracy:

for i in range(10):
    
    #Let's split the data into training and testing data
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,train_size=0.75, random_state=0+i)
    
    #Now let's one hot encode outputs
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)

    #Scale the features (scaler fit on the training set only); this also converts the DataFrames to arrays:
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    
    #Initiate the model:
    model = classification_model()
    #fitting the model:
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30, batch_size=15, verbose=0,class_weight=weights)

    #confusion matrix:
    model_pred = model.predict(X_test)
    matrix = confusion_matrix(y_test.argmax(axis=1), model_pred.argmax(axis=1))
    confusion_matrix_average1 = confusion_matrix_average1 + matrix
    print(f"loop {i} classification report:", classification_report(y_test.argmax(axis=1), model_pred.argmax(axis=1)))
    
    #evaluate the model (f1 of the positive class):
    tn, fp, fn, tp = matrix.ravel()
    f1 = tp / (tp + 0.5 * (fp + fn))
    list_scores_keras.append(f1)
    
#Let's plot the result to visualize how robust our model is:
fig = plt.figure()
ax = plt.axes()
ax.plot(list_scores_keras, scaley=False)
ax.set_title(f"The average f1_score of the model is {round(np.mean(list_scores_keras),3)}")
plt.xlabel('loop number')
plt.ylabel("f1_score")

print('--------------------------------------------------------')
#average confusion matrix over the 10 runs
display(confusion_matrix_average1 / 10)
print('--------------------------------------------------------')
#print("last epoch classification report:", classification_report(y_test.argmax(axis=1), model_pred.argmax(axis=1)))

 [Back to top](#Index:) 
<a id='p6'></a>
### Conclusion

Based on what we have done previously, we can conclude that the 6 features selected are quite representative of the risk of going bankrupt.
* ROA(C), 
* Net Value Per Share (B), 
* Debt ratio %, 
* Total Asset Turnover, 
* Cash/Total Assets, 
* Equity to Liability

**Overall, the Logistic Regression and the SNN performed decently. Considering that the relationship between some features and bankruptcy is not linear (some have an optimum and a negative impact in both tails), it would be interesting to check how well more advanced tree models perform.**

