# Project: Company Bankruptcy Prediction

# About the dataset:

Link to the dataset on Kaggle: https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction

Context
The data were collected from the Taiwan Economic Journal for the years 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange.

Attribute Information
Version 2: Updated column names and description to make the data easier to understand (Y = Output feature, X = Input features)

Y - Bankrupt?: Class label

X1 - ROA(C) before interest and depreciation before interest: Return On Total Assets(C)

X2 - ROA(A) before interest and % after tax: Return On Total Assets(A)

X3 - ROA(B) before interest and depreciation after tax: Return On Total Assets(B)

X4 - Operating Gross Margin: Gross Profit/Net Sales

X5 - Realized Sales Gross Margin: Realized Gross Profit/Net Sales

X6 - Operating Profit Rate: Operating Income/Net Sales

X7 - Pre-tax net Interest Rate: Pre-Tax Income/Net Sales

X8 - After-tax net Interest Rate: Net Income/Net Sales

X9 - Non-industry income and expenditure/revenue: Net Non-operating Income Ratio

X10 - Continuous interest rate (after tax): Net Income-Exclude Disposal Gain or Loss/Net Sales

X11 - Operating Expense Rate: Operating Expenses/Net Sales

X12 - Research and development expense rate: (Research and Development Expenses)/Net Sales

X13 - Cash flow rate: Cash Flow from Operating/Current Liabilities

X14 - Interest-bearing debt interest rate: Interest-bearing Debt/Equity

X15 - Tax rate (A): Effective Tax Rate

X16 - Net Value Per Share (B): Book Value Per Share(B)

X17 - Net Value Per Share (A): Book Value Per Share(A)

X18 - Net Value Per Share (C): Book Value Per Share(C)

X19 - Persistent EPS in the Last Four Seasons: EPS-Net Income

X20 - Cash Flow Per Share

X21 - Revenue Per Share (Yuan ¥): Sales Per Share

X22 - Operating Profit Per Share (Yuan ¥): Operating Income Per Share

X23 - Per Share Net profit before tax (Yuan ¥): Pretax Income Per Share

X24 - Realized Sales Gross Profit Growth Rate

X25 - Operating Profit Growth Rate: Operating Income Growth

X26 - After-tax Net Profit Growth Rate: Net Income Growth

X27 - Regular Net Profit Growth Rate: Continuing Operating Income after Tax Growth

X28 - Continuous Net Profit Growth Rate: Net Income-Excluding Disposal Gain or Loss Growth

X29 - Total Asset Growth Rate: Total Asset Growth

X30 - Net Value Growth Rate: Total Equity Growth

X31 - Total Asset Return Growth Rate Ratio: Return on Total Asset Growth

X32 - Cash Reinvestment %: Cash Reinvestment Ratio

X33 - Current Ratio

X34 - Quick Ratio: Acid Test

X35 - Interest Expense Ratio: Interest Expenses/Total Revenue

X36 - Total debt/Total net worth: Total Liability/Equity Ratio

X37 - Debt ratio %: Liability/Total Assets

X38 - Net worth/Assets: Equity/Total Assets

X39 - Long-term fund suitability ratio (A): (Long-term Liability+Equity)/Fixed Assets

X40 - Borrowing dependency: Cost of Interest-bearing Debt

X41 - Contingent liabilities/Net worth: Contingent Liability/Equity

X42 - Operating profit/Paid-in capital: Operating Income/Capital

X43 - Net profit before tax/Paid-in capital: Pretax Income/Capital

X44 - Inventory and accounts receivable/Net value: (Inventory+Accounts Receivables)/Equity

X45 - Total Asset Turnover

X46 - Accounts Receivable Turnover

X47 - Average Collection Days: Days Receivable Outstanding

X48 - Inventory Turnover Rate (times)

X49 - Fixed Assets Turnover Frequency

X50 - Net Worth Turnover Rate (times): Equity Turnover

X51 - Revenue per person: Sales Per Employee

X52 - Operating profit per person: Operation Income Per Employee

X53 - Allocation rate per person: Fixed Assets Per Employee

X54 - Working Capital to Total Assets

X55 - Quick Assets/Total Assets

X56 - Current Assets/Total Assets

X57 - Cash/Total Assets

X58 - Quick Assets/Current Liability

X59 - Cash/Current Liability

X60 - Current Liability to Assets

X61 - Operating Funds to Liability

X62 - Inventory/Working Capital

X63 - Inventory/Current Liability

X64 - Current Liabilities/Liability

X65 - Working Capital/Equity

X66 - Current Liabilities/Equity

X67 - Long-term Liability to Current Assets

X68 - Retained Earnings to Total Assets

X69 - Total income/Total expense

X70 - Total expense/Assets

X71 - Current Asset Turnover Rate: Current Assets to Sales

X72 - Quick Asset Turnover Rate: Quick Assets to Sales

X73 - Working capitcal Turnover Rate: Working Capital to Sales

X74 - Cash Turnover Rate: Cash to Sales

X75 - Cash Flow to Sales

X76 - Fixed Assets to Assets

X77 - Current Liability to Liability

X78 - Current Liability to Equity

X79 - Equity to Long-term Liability

X80 - Cash Flow to Total Assets

X81 - Cash Flow to Liability

X82 - CFO to Assets

X83 - Cash Flow to Equity

X84 - Current Liability to Current Assets

X85 - Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0 otherwise

X86 - Net Income to Total Assets

X87 - Total assets to GNP price

X88 - No-credit Interval

X89 - Gross Profit to Sales

X90 - Net Income to Stockholder's Equity

X91 - Liability to Equity

X92 - Degree of Financial Leverage (DFL)

X93 - Interest Coverage Ratio (Interest expense to EBIT)

X94 - Net Income Flag: 1 if Net Income is Negative for the last two years, 0 otherwise

X95 - Equity to Liability

Source
Deron Liang and Chih-Fong Tsai, deronliang '@' gmail.com; cftsai '@' mgt.ncu.edu.tw, National Central University, Taiwan
The data was obtained from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Taiwanese+Bankruptcy+Prediction

Relevant Papers
Liang, D., Lu, C.-C., Tsai, C.-F., and Shih, G.-A. (2016) Financial Ratios and Corporate Governance Indicators in Bankruptcy Prediction: A Comprehensive Study. European Journal of Operational Research, vol. 252, no. 2, pp. 561-572.
https://www.sciencedirect.com/science/article/pii/S0377221716000412

## Action plan:
Predict the minority class "bankrupt" in the imbalanced dataset with a random forest classifier, comparing several techniques for handling class imbalance: naive (random) undersampling, SMOTE oversampling and cost-sensitive learning.

# STEP 1: Learning the dataset and feature engineering

In [None]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats # for Q-Q plots

from sklearn.model_selection import (
    train_test_split,
    GridSearchCV
)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

from sklearn.preprocessing import MinMaxScaler

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings("ignore")

In [None]:
# importing the dataset
df = pd.read_csv("../input/company-bankruptcy-prediction/data.csv")

Exploring the dataset and doing feature engineering

In [None]:
# showing first five rows of the dataset
df.head()

In [None]:
# showing the column names
# list(df.columns)

In [None]:
# cleaning column names
# the original loop dropped the first character of every feature name,
# presumably a leading space in the raw CSV header; stripping whitespace
# does the same job without cutting into names that have no such space
df.columns = df.columns.str.strip()
# list(df.columns)

In [None]:
# showing column types and non-null counts
# df.info()

In [None]:
# showing statistical data of the dataset
# df.describe()

In [None]:
# counting rows with at least one missing value
df.isnull().any(axis = 1).sum()

In [None]:
# checking duplicated rows
df.duplicated().sum()

### Exploring the dataset

In [None]:
# plotting Bankrupt? column

# define figure size
plt.figure(figsize=(1, 3))

# histogram
sns.histplot(df['Bankrupt?'], bins=2);
plt.title('Bankrupt?')

In [None]:
# counting bankrupted and non-bankrupted companies
df['Bankrupt?'].value_counts()

In [None]:
# counting percentage of negative (0) and positive (1) values 
df['Bankrupt?'].value_counts(normalize=True)

Discussion: The dataset is imbalanced.
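
To put a number on "imbalanced", the minority share and the majority-to-minority ratio can be computed directly from the labels. The sketch below uses hypothetical toy labels (`y_demo`) and a hypothetical helper (`class_balance`), not the actual counts of this dataset.

```python
import numpy as np

def class_balance(y):
    """Return (counts per class, minority share, majority-to-minority ratio)."""
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    minority_share = counts.min() / counts.sum()
    imbalance_ratio = counts.max() / counts.min()
    counts_by_class = dict(zip(classes.tolist(), counts.tolist()))
    return counts_by_class, minority_share, imbalance_ratio

# Hypothetical labels: 95 non-bankrupt (0) and 5 bankrupt (1) companies
y_demo = np.array([0] * 95 + [1] * 5)
counts_by_class, minority_share, imbalance_ratio = class_balance(y_demo)
print(counts_by_class, minority_share, imbalance_ratio)
```

In the notebook the same helper could be called as `class_balance(df['Bankrupt?'])`.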

### Exploring the variables

Visualize data columns

Explore distribution, skewness, outliers and other statistical properties

Looking at the distributions of the variables to see which imputation to use

In [None]:
# function to create histogram, Q-Q plot and boxplot


def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

    # histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[variable], bins=30)
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()

In [None]:
# a function for plotting a column of df and reporting its skewness and kurtosis

def draw_and_test(variable):
    # 'variable' is the column name (avoids shadowing the built-in str)
    # plotting variable
    diagnostic_plots(df, variable)

    # skewness and excess kurtosis as rough indicators of non-normality
    print(variable)
    skewness = df[variable].skew()
    print('Skewness is {:.2f}'.format(skewness))
    kurtosis = df[variable].kurtosis()
    print('Kurtosis is {:.2f}'.format(kurtosis))

In [None]:
# plotting and testing a single variable
# variable = 'ROA(C) before interest and depreciation before interest'
# draw_and_test(variable)

In [None]:
# plotting all the independent variables
for column in df.columns[1:]:
    draw_and_test(column)

In [None]:
# listing values of the variable Net Income Flag, whose skewness and kurtosis are both zero
df['Net Income Flag'].value_counts()

In [None]:
# dropping the column because it holds a single value and carries no information
df.drop(columns=['Net Income Flag'], inplace=True)
print("Column 'Net Income Flag' deleted - single value variable")

Discussion: many variables are highly skewed and heavy-tailed. The extreme values may belong to the minority class or may simply be noise, so removing them as noise risks distorting the minority class.

Question: should the skewness be fixed or left as is?
Answer: first, let's try to work with the original dataset.
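
For reference, here is a minimal sketch of what "fixing" skewness could look like, should it be needed later. It applies scikit-learn's `PowerTransformer` (Yeo-Johnson) to a synthetic log-normal variable; the data and variable names are assumptions for illustration, and the notebook itself continues with the untransformed features.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed variable (assumption: stands in for a skewed ratio)
rng = np.random.default_rng(0)
x_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Yeo-Johnson also handles zero and negative values, unlike a plain log
pt = PowerTransformer(method="yeo-johnson")
x_transformed = pt.fit_transform(x_skewed)

skew_before = float(skew(x_skewed[:, 0]))
skew_after = float(skew(x_transformed[:, 0]))
print('Skewness before: {:.2f}, after: {:.2f}'.format(skew_before, skew_after))
```

In a real pipeline the transformer would be fitted on the training set only and then applied to the test set, in the same way as the scaler.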

### Separating train and test sets

In [None]:
# separating dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels=['Bankrupt?'], axis=1),  # drop the target
    df['Bankrupt?'],  # just the target
    test_size=0.3,
    random_state=42)

X_train.shape, X_test.shape
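
With a rare positive class, a plain random split can leave the train and test sets with slightly different bankruptcy rates. One common alternative, shown here only as a sketch on hypothetical toy data, is to pass `stratify` to `train_test_split` so that both parts keep the class proportions. The results reported in this notebook come from the unstratified split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 3% positive class
X_demo = np.arange(2000, dtype=float).reshape(1000, 2)
y_demo = np.array([0] * 970 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.3,
    random_state=42,
    stratify=y_demo)  # keep the class proportions in both parts

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(X_tr.shape, X_te.shape, train_rate, test_rate)
```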

### Scaling

In [None]:
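# scaling features to [0, 1] for the distance-based (KNN) steps below, such as SMOTE;
# the scaler is fitted on the train set only to avoid leaking test information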
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### Random Undersampling

[RandomUnderSampler](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html)

In [None]:
rus = RandomUnderSampler(
    sampling_strategy='auto',  # samples only from majority class
    random_state=42,  # for reproducibility
    replacement=True # if it should resample with replacement
)  

X_resampled_rus, y_resampled_rus = rus.fit_resample(X_train, y_train)

In [None]:
# size of undersampled data

X_resampled_rus.shape, y_resampled_rus.shape

In [None]:
# number of positive class in original dataset
y_train.value_counts()

In [None]:
# final data size is 2 times the number of observations
# with positive class:

y_train.value_counts()[1] * 2
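
The idea behind naive undersampling can be sketched in a few lines of NumPy: keep every minority row and draw the same number of majority rows. This is an illustrative re-implementation under simplifying assumptions (binary labels, a hypothetical helper `random_undersample`), not the imbalanced-learn code.

```python
import numpy as np

def random_undersample(X, y, minority_label=1, replacement=True, seed=42):
    """Keep every minority row and draw an equal number of majority rows."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # draw as many majority rows as there are minority rows
    drawn = rng.choice(majority_idx, size=len(minority_idx), replace=replacement)
    keep = np.concatenate([drawn, minority_idx])
    return X[keep], y[keep]

# Hypothetical toy data: 90 majority rows and 10 minority rows
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)
X_small, y_small = random_undersample(X_demo, y_demo)
print(X_small.shape, np.bincount(y_small))
```

The result has twice as many rows as the minority class, which is the size check performed in the cell above.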

### Oversampling: SMOTE

Creates new samples by interpolation of samples of the minority class and any of its k nearest neighbours (also from the minority class). K is typically 5.

[SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html)
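
The interpolation step described above can be written as `x_new = x_i + lam * (x_nn - x_i)` with `lam` drawn uniformly from [0, 1]. The following NumPy sketch generates one synthetic point that way; the helper `smote_like_sample` and the toy data are hypothetical, and the real SMOTE implementation handles more cases (sampling strategy, many samples at once, neighbour search structures).

```python
import numpy as np

def smote_like_sample(X_min, k=5, seed=None):
    """Create one synthetic point from minority samples X_min (n_samples > k)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    i = rng.integers(len(X_min))                   # pick a minority sample
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]        # k nearest, skipping the closest (the point itself)
    j = rng.choice(neighbours)                     # pick one neighbour at random
    lam = rng.random()                             # interpolation factor in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])

# Hypothetical minority-class samples: 20 points with 3 features
X_min_demo = np.random.default_rng(0).normal(size=(20, 3))
new_point = smote_like_sample(X_min_demo, k=5, seed=1)
print(new_point)
```

Because the new point lies on the segment between two existing minority samples, it always stays inside the range already covered by the minority class.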

In [None]:
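sm = SMOTE(
    sampling_strategy='auto',  # samples only the minority class
    random_state=42,  # for reproducibility
    k_neighbors=5  # number of minority-class neighbours used for interpolation
)
# n_jobs is omitted: recent imbalanced-learn releases no longer accept it for SMOTE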

X_resampled_smote, y_resampled_smote = sm.fit_resample(X_train, y_train)

In [None]:
# size of oversampled data

X_resampled_smote.shape, y_resampled_smote.shape

### Machine learning performance comparison

Let's compare model performance with and without resampling.

In [None]:
# function to train a random forest and evaluate its performance

def run_randomForests(X_train, X_test, y_train, y_test):
    
    rf = RandomForestClassifier(n_estimators=200, random_state=42, max_depth=4, n_jobs=4)
    rf.fit(X_train, y_train)

    print('Train set')
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
    
    print('Test set')
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

In [None]:
# evaluating performance of algorithm built
# using imbalanced dataset

run_randomForests(X_train,
                  X_test,
                  y_train,
                  y_test)

In [None]:
# evaluating performance of algorithm built
# using undersampled dataset

run_randomForests(X_resampled_rus,
                  X_test,
                  y_resampled_rus,
                  y_test)

In [None]:
# evaluating performance of algorithm built
# using oversampled dataset

run_randomForests(X_resampled_smote,
                  X_test,
                  y_resampled_smote,
                  y_test)

Discussion: Naive undersampling did not improve performance on the minority class compared with the imbalanced data, and valuable information about the majority class was lost. SMOTE slightly improved performance on the minority class compared with the imbalanced data.
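
ROC-AUC summarises ranking quality over both classes, so it can help to also look at metrics that focus on the positive (bankrupt) class. The sketch below is one possible way to do that with scikit-learn; the helper `minority_report`, the 0.5 threshold and the toy scores are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_score,
    recall_score
)

def minority_report(y_true, proba, threshold=0.5):
    """Recall, precision and average precision for the positive class."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    y_pred = (proba >= threshold).astype(int)
    return {
        'recall': recall_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'average_precision': average_precision_score(y_true, proba),
    }

# Hypothetical labels and predicted probabilities of class 1
y_true_demo = [0, 0, 0, 1, 1]
proba_demo = [0.1, 0.6, 0.2, 0.7, 0.4]
report = minority_report(y_true_demo, proba_demo)
print(report)
```

In the notebook, `proba` would be `rf.predict_proba(X_test)[:, 1]` for each of the three training variants.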

### Estimating the Cost with Cross-Validation

In [None]:
# setting up initial random forest

rf = RandomForestClassifier(n_estimators=50,
                            random_state=42,
                            max_depth=2,
                            n_jobs=4,
                            class_weight=None)

In [None]:
# setting up parameter search grid
# including class weight

param_grid = {
  'n_estimators': [10, 50, 100, 200],
  'max_depth': [None, 2, 3, 4],
  'class_weight': [None, {0:1, 1:10}, {0:1, 1:100}],
}

In [None]:
search = GridSearchCV(estimator=rf,
                      scoring='roc_auc',
                      param_grid=param_grid,
                      cv=2,
                     ).fit(X_train, y_train)

In [None]:
search.best_score_

In [None]:
search.best_params_

In [None]:
search.best_estimator_

In [None]:
search.score(X_test, y_test)

Discussion: The cost-sensitive method slightly improved the performance.
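
The grid above tries hand-picked class weights (`{0:1, 1:10}` and `{0:1, 1:100}`). A common alternative is the "balanced" heuristic, which weights each class by `n_samples / (n_classes * class_count)`. The sketch below computes these weights on hypothetical toy labels; it was not part of the original search.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 95 non-bankrupt (0) and 5 bankrupt (1) companies
y_demo = np.array([0] * 95 + [1] * 5)
classes = np.array([0, 1])

# weight of a class = n_samples / (n_classes * number of samples in that class)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

`RandomForestClassifier` also accepts `class_weight='balanced'` directly, so the string could be added to the `class_weight` list in `param_grid` as one more candidate.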

## Summary: Bankruptcy was predicted with a ROC-AUC of 93.8% using cost-sensitive class weighting in a random forest classifier.