# <center>BankruptcyClassifier </center>
<img src="https://www.popoptiq.com/wp-content/uploads/2018/06/business-man-watching-company-go-bankrupt-pop.jpg"/>

<a id="Table-Of-Contents"></a>
# Table Of Contents
* [Table Of Contents](#Table-Of-Contents)
* [Task Details](#Task-Details)
* [Importing Libraries](#Importing-Libraries)
* [Notes](#Notes)
    - [Context](#Context)
    - [Target](#Target)
    - [Attribute-Information](#Attribute-Information)
* [Read in Data](#Read-in-Data)
    - [Data.csv](#Data.csv)
* [Preprocessing Data](#Preprocessing-Data)
    - [Train-Test Stratified Split](#Train-Test-Stratified-Split)
* [Initial Models](#Initial-Models)
* [LightGBM Classifier](#LightGBM-Classifier)
    - [Tuning LightGBM](#Tuning-LightGBM)
    - [Model Metrics](#Model-Metrics)
    - [Bayesian Optimization](#Bayesian-Optimization)
    - [Feature Importance](#Feature-Importance)
* [LightGBM Model Performance](#LightGBM-Model-Peformance)
    - [ROC Curve](#ROC-Curve)
    - [Confusion Matrix](#Confusion-Matrix)
* [Conclusion](#Conclusion)

<a id="Task-Details"></a>
# Task Details
Our top priority in this business problem is to identify companies in bankruptcy.

## Evaluation
Models are evaluated using the F1-score.
The F1-score is defined as the harmonic mean of precision and recall:
<img src="https://datascience103579984.files.wordpress.com/2019/04/capture4-17.png"/>
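
Written out, the harmonic mean is $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. As a minimal sketch of that definition, the score can be computed directly from true-positive, false-positive and false-negative counts. The function name `f1_from_counts` and the example counts are illustrative assumptions, not results from this notebook.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 bankrupt companies caught, 20 false alarms, 14 missed
precision, recall, f1 = f1_from_counts(tp=30, fp=20, fn=14)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1-score by maximizing only precision or only recall.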

<a id="Importing-Libraries"></a>
# Importing Libraries

In [None]:
#%% Imports

# Basic Imports 
import numpy as np
import pandas as pd

# Plotting 
from matplotlib import pyplot
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
%matplotlib inline

# Preprocessing
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder

# Metrics 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report,accuracy_score, recall_score, roc_auc_score, precision_score
from sklearn.metrics import roc_curve, auc

# ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from lightgbm import LGBMClassifier 

# Model Tuning 
from bayes_opt import BayesianOptimization

# Feature Importance 
import shap

# Ignore Warnings 
import warnings
warnings.filterwarnings('ignore')

<a id="Notes"></a>
# Notes

## Context
The data were collected from the Taiwan Economic Journal for the years 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange.

## Target
Y - Bankrupt?: Class label: 0 - Not Bankrupt , 1 - Bankrupt

## Attribute Information
X1 - ROA(C) before interest and depreciation before interest: Return On Total Assets(C)  
X2 - ROA(A) before interest and % after tax: Return On Total Assets(A)  
X3 - ROA(B) before interest and depreciation after tax: Return On Total Assets(B)  
X4 - Operating Gross Margin: Gross Profit/Net Sales  
X5 - Realized Sales Gross Margin: Realized Gross Profit/Net Sales  
X6 - Operating Profit Rate: Operating Income/Net Sales  
X7 - Pre-tax net Interest Rate: Pre-Tax Income/Net Sales  
X8 - After-tax net Interest Rate: Net Income/Net Sales  
X9 - Non-industry income and expenditure/revenue: Net Non-operating Income Ratio  
X10 - Continuous interest rate (after tax): Net Income-Exclude Disposal Gain or Loss/Net Sales  
X11 - Operating Expense Rate: Operating Expenses/Net Sales  
X12 - Research and development expense rate: (Research and Development Expenses)/Net Sales  
X13 - Cash flow rate: Cash Flow from Operating/Current Liabilities  
X14 - Interest-bearing debt interest rate: Interest-bearing Debt/Equity  
X15 - Tax rate (A): Effective Tax Rate  
X16 - Net Value Per Share (B): Book Value Per Share(B)  
X17 - Net Value Per Share (A): Book Value Per Share(A)  
X18 - Net Value Per Share (C): Book Value Per Share(C)  
X19 - Persistent EPS in the Last Four Seasons: EPS-Net Income  
X20 - Cash Flow Per Share  
X21 - Revenue Per Share (Yuan ¥): Sales Per Share  
X22 - Operating Profit Per Share (Yuan ¥): Operating Income Per Share  
X23 - Per Share Net profit before tax (Yuan ¥): Pretax Income Per Share  
X24 - Realized Sales Gross Profit Growth Rate  
X25 - Operating Profit Growth Rate: Operating Income Growth  
X26 - After-tax Net Profit Growth Rate: Net Income Growth  
X27 - Regular Net Profit Growth Rate: Continuing Operating Income after Tax Growth  
X28 - Continuous Net Profit Growth Rate: Net Income-Excluding Disposal Gain or Loss Growth  
X29 - Total Asset Growth Rate: Total Asset Growth  
X30 - Net Value Growth Rate: Total Equity Growth  
X31 - Total Asset Return Growth Rate Ratio: Return on Total Asset Growth  
X32 - Cash Reinvestment %: Cash Reinvestment Ratio  
X33 - Current Ratio  
X34 - Quick Ratio: Acid Test  
X35 - Interest Expense Ratio: Interest Expenses/Total Revenue  
X36 - Total debt/Total net worth: Total Liability/Equity Ratio  
X37 - Debt ratio %: Liability/Total Assets  
X38 - Net worth/Assets: Equity/Total Assets  
X39 - Long-term fund suitability ratio (A): (Long-term Liability+Equity)/Fixed Assets  
X40 - Borrowing dependency: Cost of Interest-bearing Debt  
X41 - Contingent liabilities/Net worth: Contingent Liability/Equity  
X42 - Operating profit/Paid-in capital: Operating Income/Capital  
X43 - Net profit before tax/Paid-in capital: Pretax Income/Capital  
X44 - Inventory and accounts receivable/Net value: (Inventory+Accounts Receivables)/Equity  
X45 - Total Asset Turnover  
X46 - Accounts Receivable Turnover  
X47 - Average Collection Days: Days Receivable Outstanding  
X48 - Inventory Turnover Rate (times)  
X49 - Fixed Assets Turnover Frequency  
X50 - Net Worth Turnover Rate (times): Equity Turnover  
X51 - Revenue per person: Sales Per Employee  
X52 - Operating profit per person: Operation Income Per Employee  
X53 - Allocation rate per person: Fixed Assets Per Employee  
X54 - Working Capital to Total Assets  
X55 - Quick Assets/Total Assets  
X56 - Current Assets/Total Assets  
X57 - Cash/Total Assets  
X58 - Quick Assets/Current Liability  
X59 - Cash/Current Liability  
X60 - Current Liability to Assets  
X61 - Operating Funds to Liability  
X62 - Inventory/Working Capital  
X63 - Inventory/Current Liability  
X64 - Current Liabilities/Liability  
X65 - Working Capital/Equity  
X66 - Current Liabilities/Equity  
X67 - Long-term Liability to Current Assets  
X68 - Retained Earnings to Total Assets  
X69 - Total income/Total expense  
X70 - Total expense/Assets  
X71 - Current Asset Turnover Rate: Current Assets to Sales  
X72 - Quick Asset Turnover Rate: Quick Assets to Sales  
X73 - Working capital Turnover Rate: Working Capital to Sales  
X74 - Cash Turnover Rate: Cash to Sales  
X75 - Cash Flow to Sales  
X76 - Fixed Assets to Assets  
X77 - Current Liability to Liability  
X78 - Current Liability to Equity  
X79 - Equity to Long-term Liability  
X80 - Cash Flow to Total Assets  
X81 - Cash Flow to Liability  
X82 - CFO to Assets  
X83 - Cash Flow to Equity  
X84 - Current Liability to Current Assets  
X85 - Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0 otherwise  
X86 - Net Income to Total Assets  
X87 - Total assets to GNP price  
X88 - No-credit Interval  
X89 - Gross Profit to Sales  
X90 - Net Income to Stockholder's Equity  
X91 - Liability to Equity  
X92 - Degree of Financial Leverage (DFL)  
X93 - Interest Coverage Ratio (Interest expense to EBIT)  
X94 - Net Income Flag: 1 if Net Income is Negative for the last two years, 0 otherwise  
X95 - Equity to Liability  

## Source
Deron Liang and Chih-Fong Tsai, deronliang '@' gmail.com; cftsai '@' mgt.ncu.edu.tw, National Central University, Taiwan
The data was obtained from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Taiwanese+Bankruptcy+Prediction

## Relevant Papers
Liang, D., Lu, C.-C., Tsai, C.-F., and Shih, G.-A. (2016) Financial Ratios and Corporate Governance Indicators in Bankruptcy Prediction: A Comprehensive Study. European Journal of Operational Research, vol. 252, no. 2, pp. 561-572.
https://www.sciencedirect.com/science/article/pii/S0377221716000412

<a id="Read-in-Data"></a>
# Read in Data

<a id="data.csv"></a>
## Data.csv

In [None]:
#%% Read data.csv
data_csv = pd.read_csv('../input/company-bankruptcy-prediction/data.csv')

# Rename column names
# Y = target, X1..Xn = features
column_names = ['Y']
column_names = column_names + ['X' + str(num) for num in range(1,len(data_csv.columns))]
column_names_df = pd.DataFrame({"Var_name": column_names,"Description": data_csv.columns})

data_csv.columns = column_names 
data_csv.info(verbose = True,show_counts = True)

In [None]:
column_names_df.style

<a id="Preprocessing-Data"></a>
# Preprocessing Data

In [None]:
for int_column in data_csv.select_dtypes(include="int64"):
    print(data_csv[int_column].value_counts())
    print("\n")

# drop column X94 (Net Income Flag): it is constant (every value is 1), so it carries no information
data_csv = data_csv.drop("X94",axis = "columns")

# this dataset is imbalanced: 6599 zeros (not bankrupt) and 220 ones (bankrupt)

Label encoding is not needed here because every column is already numeric. The categorical variables can stay as int64 since they are binary flags.

<a id="#Train-Test-Stratified-Split"></a>
## Train-Test Stratified Split

In [None]:
# Create test and train set 80-20

#%% Separate features and target from data_csv
X = data_csv.drop("Y",axis = "columns")
y = data_csv["Y"]

#%%  train-test stratified split using 80-20 
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.2, random_state = 0, shuffle= True,stratify = y)

# By using a stratified split, the ratio of 1s to 0s is consistent between the two splits
print(train_y.value_counts())
print("\n")
print(valid_y.value_counts())
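
To illustrate why stratification keeps the class ratio consistent, here is a minimal pure-Python sketch that splits each class separately. It is an illustration only, not scikit-learn's implementation; the helper name `stratified_split` is hypothetical, and the label list simply mirrors the dataset's class counts (6,599 non-bankrupt and 220 bankrupt companies).

```python
import random


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with the dataset's class counts: 0 = not bankrupt, 1 = bankrupt
labels = [0] * 6599 + [1] * 220
train_idx, test_idx = stratified_split(labels)
train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
print("train:", len(train_idx), "rows,", train_pos, "bankrupt")
print("valid:", len(test_idx), "rows,", test_pos, "bankrupt")
```

With only 220 positive cases, an unstratified random split could leave the validation set with noticeably more or fewer bankrupt companies than expected, which would make the F1-score for class 1 less stable.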

# Model Metrics
I've created helper functions for evaluating a model's performance with metrics such as the confusion matrix, accuracy, recall, and F1-score.

In [None]:
# Great Function found on Kaggle for plotting a Confusion Matrix
# https://www.kaggle.com/grfiv4/plot-a-confusion-matrix
def plot_confusion_matrix_kaggle(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title:        the text to display at the top of the matrix

    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues

    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix_kaggle(cm           = cm,                  # confusion matrix created by
                                                                     # sklearn.metrics.confusion_matrix
                                 normalize    = True,                # show proportions
                                 target_names = y_labels_vals,       # list of names of the classes
                                 title        = best_estimator_name) # title of graph

    Citation
    --------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]


    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.show()
    
# binarize an array based on a threshold 
def binarizeArray(array,threshold = 0.5):
    return [0 if num < threshold else 1 for num in array]
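
The sketch below shows how thresholding and the confusion matrix fit together: predicted probabilities are binarized, the four confusion-matrix cells are counted, and accuracy and recall follow from those counts. The helper names and the toy labels and scores are assumptions made for illustration; the tuple order `(tn, fp, fn, tp)` matches what `confusion_matrix(y_true, y_pred).ravel()` returns for binary labels.

```python
def binarize(scores, threshold=0.5):
    """Same rule as binarizeArray: scores below the threshold become 0."""
    return [0 if score < threshold else 1 for score in scores]


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp


# Toy data (hypothetical): four healthy companies and two bankrupt ones
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.2, 0.7, 0.3]

for threshold in (0.5, 0.25):
    tn, fp, fn, tp = confusion_counts(y_true, binarize(scores, threshold))
    accuracy = (tn + tp) / len(y_true)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: accuracy={accuracy:.3f} recall={recall:.3f}")
```

Lowering the threshold catches both bankrupt companies in this toy example but produces an extra false alarm, which is the trade-off the threshold search later in the notebook tries to balance.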

<a id="Initial Models"></a>
# Initial Models
I applied several machine learning algorithms to see which model performs best on this dataset. The techniques applied in this section are listed below.

1. Logistic Regression
2. XGBoost Classifier
3. Random Forest Classifier
4. LightGBM Classifier

In [None]:
# Create initial models
LogReg = LogisticRegression(random_state=0).fit(train_X, train_y)
XGBClass = xgb.XGBClassifier(eval_metric  = "logloss", max_depth=5, learning_rate=0.01, n_estimators=100, gamma=0, 
                        min_child_weight=1, subsample=0.8, colsample_bytree=0.8, reg_alpha=0.005,seed = 0).fit(train_X,train_y)
RFClass = RandomForestClassifier(n_estimators = 50, max_depth = 50,n_jobs = -1, random_state = 0).fit(train_X,train_y)
LGBMClass = LGBMClassifier(random_state=0).fit(train_X, train_y)

In [None]:
# Initial models' performance on the validation data 
    
pred_y = LogReg.predict(valid_X)
print("                    Logistic Regression")
print(classification_report(valid_y,pred_y,digits=3))

pred_y = XGBClass.predict(valid_X)
print("                    XGBoost Classifier")
print(classification_report(valid_y,pred_y,digits=3))

pred_y = RFClass.predict(valid_X)
print("                    Random Forest Classifier")
print(classification_report(valid_y,pred_y,digits=3))

pred_y = LGBMClass.predict(valid_X)
print("                    LightGBM Classifier")
print(classification_report(valid_y,pred_y,digits=3))

Focus on the F1-score for class 1 (bankrupt). The reports show that it is difficult to identify which companies go bankrupt.

The LightGBM Classifier performed best, so I proceeded with hyperparameter optimization for LightGBM.

<a id="LightGBM-Classifier"></a>
# LightGBM Classifier

In [None]:
##% parameter tuning for lightgbm 
# store the categorical feature names as a list (empty here, since no column has object dtype)
cat_features = data_csv.select_dtypes(['object']).columns.to_list()
# print(cat_features)

# Create the LightGBM data containers
# Make sure that cat_features are used
train_lgbdata=lgb.Dataset(train_X,label=train_y, categorical_feature = cat_features,free_raw_data=False)
test_lgbdata=lgb.Dataset(valid_X,label=valid_y, categorical_feature = cat_features,free_raw_data=False)

<a id="Bayesian-Optimization"></a>
## Bayesian Optimization

In [None]:
# https://github.com/fmfn/BayesianOptimization
def search_best_param(X,y,cat_features):
    
    trainXY = lgb.Dataset(data=X, label=y,categorical_feature = cat_features,free_raw_data=False)
    # define the lightGBM cross validation
    def lightGBM_CV(max_depth, num_leaves, n_estimators, learning_rate, subsample, colsample_bytree, 
                lambda_l1, lambda_l2, min_child_weight):
    
        params = {'boosting_type': 'gbdt', 'objective': 'binary', 'metric':'auc', 'verbose': -1,
                  'early_stopping_round':100}
        
        params['max_depth'] = int(round(max_depth))
        params["num_leaves"] = int(round(num_leaves))
        params["n_estimators"] = int(round(n_estimators))
        params['learning_rate'] = learning_rate
        params['subsample'] = subsample
        params['colsample_bytree'] = colsample_bytree
        params['lambda_l1'] = max(lambda_l1, 0)
        params['lambda_l2'] = max(lambda_l2, 0)
        params['min_child_weight'] = min_child_weight
    
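        # 'verbose': -1 in params keeps the output quiet
        score = lgb.cv(params, trainXY, nfold=5, seed=1, stratified=True, metrics=['auc'])
        # the result key is 'auc-mean' or 'valid auc-mean' depending on the LightGBM version
        auc_key = next(key for key in score if key.endswith('auc-mean'))
        return np.mean(score[auc_key]) # maximize auc-mean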

    # use bayesian optimization to search for the best hyper-parameter combination
    lightGBM_Bo = BayesianOptimization(lightGBM_CV, 
                                       {
                                          'max_depth': (5, 50),
                                          'num_leaves': (20, 100),
                                          'n_estimators': (50, 500),
                                          'learning_rate': (0.01, 0.3),
                                          'subsample': (0.7, 0.8),
                                          'colsample_bytree' :(0.5, 0.99),
                                          'lambda_l1': (0, 5),
                                          'lambda_l2': (0, 3),
                                          'min_child_weight': (2, 50) 
                                      },
                                       random_state = 1,
                                       verbose = 3
                                      )
    np.random.seed(1)
    
    lightGBM_Bo.maximize(init_points= 5, n_iter=5) # 5 + 5, 10 iterations 
    # n_iter: How many steps of bayesian optimization you want to perform. The more steps the more likely to find a good maximum you are.
    # init_points: How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
    # more iterations more time spent searching 
    
    params_set = lightGBM_Bo.max['params']
    
    # get the params of the maximum target     
    max_target = -np.inf
    for i in lightGBM_Bo.res: # loop through all recorded optimization results
        if i['target'] > max_target:
            params_set = i['params']
            max_target = i['target']
    
    params_set.update({'verbose': -1})
    params_set.update({'metric': 'auc'})
    params_set.update({'boosting_type': 'gbdt'})
    params_set.update({'objective': 'binary'})
    
    params_set['max_depth'] = int(round(params_set['max_depth']))
    params_set['num_leaves'] = int(round(params_set['num_leaves']))
    params_set['n_estimators'] = int(round(params_set['n_estimators']))
    params_set['seed'] = 1 #set seed
    
    return params_set
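
Bayesian optimization proposes real-valued points, so integer hyperparameters have to be rounded and the L1/L2 penalties clamped at zero before they reach LightGBM, as the function above does. A minimal standalone sketch of that post-processing step follows; the helper name `to_lightgbm_params` and the sample suggestion are hypothetical and only illustrate the idea.

```python
INT_PARAMS = ("max_depth", "num_leaves", "n_estimators")
NON_NEGATIVE_PARAMS = ("lambda_l1", "lambda_l2")


def to_lightgbm_params(suggested):
    """Copy an optimizer suggestion and make it valid for LightGBM."""
    params = dict(suggested)  # leave the optimizer's own dict untouched
    for name in INT_PARAMS:
        if name in params:
            params[name] = int(round(params[name]))
    for name in NON_NEGATIVE_PARAMS:
        if name in params:
            params[name] = max(params[name], 0)
    return params


# Hypothetical suggestion from the optimizer (all values are floats)
suggested = {
    "max_depth": 17.6,
    "num_leaves": 42.3,
    "n_estimators": 250.7,
    "learning_rate": 0.05,
    "lambda_l1": -0.2,
}
params = to_lightgbm_params(suggested)
print(params)
```

Rounding inside the objective means several nearby continuous points map to the same model, so the optimizer may occasionally re-evaluate an identical configuration.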

In [None]:
# Search for best param on Training Set
best_params = search_best_param(train_X,train_y,cat_features)

In [None]:
# Print best_params
for key, value in best_params.items():
    print(key, ' : ', value)

<a id="Tuning-LightGBM"></a>
## Tuning LightGBM

In [None]:
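# Train lgbm_best using the best params found from Bayesian Optimization
# Early stopping and periodic logging are configured through callbacks
lgbm_best = lgb.train(best_params,
                 train_lgbdata,
                 num_boost_round = 100,
                 valid_sets = test_lgbdata,
                 callbacks = [lgb.early_stopping(stopping_rounds = 100),
                              lgb.log_evaluation(period = 25)]
                 )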

<a id="Feature-Importance "></a>
## Feature Importance 

In [None]:
##% Feature Importance 
# https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
lgb.plot_importance(lgbm_best,figsize=(25,20),max_num_features = 10)

In [None]:
##% Feature Importance using shap package 
# import shap
shap_values = shap.TreeExplainer(lgbm_best).shap_values(valid_X)
shap.summary_plot(shap_values, valid_X)

<a id="LightGBM-Model-Peformance "></a>
# LightGBM Model Performance 

<a id="ROC-Curve"></a>
## ROC Curve

In [None]:
#%% ROC Curve for the validation data
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
# from sklearn.metrics import roc_curve, auc

y_probas = lgbm_best.predict(valid_X) 

fpr, tpr, _ = roc_curve(valid_y, y_probas)
roc_auc = auc(fpr, tpr)

plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.4f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for validation data')
plt.legend(loc="lower right")
plt.show()

In [None]:
#%% Plot ROC curve and find best threshold to binarize the predictions from LGBM
# https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/
pred_y = lgbm_best.predict(valid_X)
# calculate roc curves
fpr, tpr, thresholds = roc_curve(valid_y, pred_y)
# calculate the g-mean for each threshold
gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
# plot the roc curve for the model
pyplot.figure(num=0, figsize=[6.4, 4.8])
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.', label='LightGBM')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
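
The threshold search above can be sketched without any plotting. The G-mean is the geometric mean of the true positive rate and the true negative rate, $\sqrt{TPR \cdot (1 - FPR)}$, and the best threshold is the one that maximizes it. The function below is a simplified illustration, assuming both classes are present and treating a score at or above the threshold as a positive prediction; the name `best_gmean_threshold` and the toy data are hypothetical.

```python
import math


def best_gmean_threshold(y_true, scores):
    """Return (threshold, gmean) maximizing sqrt(TPR * (1 - FPR))."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_threshold, best_gmean = None, -1.0
    # Every distinct score is a candidate threshold
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        tpr = tp / n_pos
        fpr = fp / n_neg
        gmean = math.sqrt(tpr * (1 - fpr))
        if gmean > best_gmean:
            best_threshold, best_gmean = threshold, gmean
    return best_threshold, best_gmean


# Toy data (hypothetical): three bankrupt companies among eight
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.30, 0.60, 0.15, 0.80]
threshold, gmean = best_gmean_threshold(y_true, scores)
print(f"Best Threshold={threshold:.2f}, G-Mean={gmean:.3f}")
```

On imbalanced data the selected threshold is often well below 0.5, because catching the rare positive class is worth accepting some extra false positives.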

<a id="Confusion-Matrix"></a>
## Confusion Matrix

In [None]:
#%% Plot Confusion Matrix with best threshold
pref_y_bin = binarizeArray(pred_y,thresholds[ix])
cm = confusion_matrix(valid_y,pref_y_bin)
plot_confusion_matrix_kaggle(cm =cm, 
                      normalize    = False,
                      target_names = ['Not Bankrupt(0)', 'Bankrupt(1)'],
                      title        = "Confusion Matrix")
print(classification_report(valid_y,pref_y_bin))
print("Accuracy: %.2f%%" % (accuracy_score(valid_y, pref_y_bin)*100.0))
print("Recall: %.2f%%" % ((recall_score(valid_y,pref_y_bin))*100.0))

<a id="Conclusion"></a>
# Conclusion

**Conclusion**
* Good dataset to work with, enough observations to create a model
* Difficult to achieve a good model metric as dataset is imbalanced

**Closing Remarks**  
* Please comment and like the notebook if it is of use to you! Have a wonderful year! 

**More Notebooks** 

**Regression Notebooks**   
[https://www.kaggle.com/josephchan524/housepricesregressor-using-lightgbm](https://www.kaggle.com/josephchan524/housepricesregressor-using-lightgbm)

[https://www.kaggle.com/josephchan524/tabularplaygroundregressor-using-lightgbm-feb2021](https://www.kaggle.com/josephchan524/tabularplaygroundregressor-using-lightgbm-feb2021)

[https://www.kaggle.com/josephchan524/studentperformanceregressor-rmse-12-26-r2-0-26](https://www.kaggle.com/josephchan524/studentperformanceregressor-rmse-12-26-r2-0-26)


**Classification Notebooks**  
[https://www.kaggle.com/josephchan524/hranalytics-lightgbm-classifier-auc-80](https://www.kaggle.com/josephchan524/hranalytics-lightgbm-classifier-auc-80)

[https://www.kaggle.com/josephchan524/bankchurnersclassifier-recall-97-accuracy-95](https://www.kaggle.com/josephchan524/bankchurnersclassifier-recall-97-accuracy-95)

[https://www.kaggle.com/josephchan524/tabularplaygroundclassifier-using-lightgbm-mar2021](https://www.kaggle.com/josephchan524/tabularplaygroundclassifier-using-lightgbm-mar2021)  

3-16-2020
Joseph