# ML project for bankruptcy

The dataset was downloaded from Kaggle: https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction

The data were collected from the Taiwan Economic Journal for the years 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange.


Content

1. The dataset contains the following information:
Y - Bankrupt?: Class label
X1 - ROA(C) before interest and depreciation before interest: Return On Total Assets(C)
X2 - ROA(A) before interest and % after tax: Return On Total Assets(A)
X3 - ROA(B) before interest and depreciation after tax: Return On Total Assets(B)
X4 - Operating Gross Margin: Gross Profit/Net Sales
X5 - Realized Sales Gross Margin: Realized Gross Profit/Net Sales
X6 - Operating Profit Rate: Operating Income/Net Sales
X7 - Pre-tax net Interest Rate: Pre-Tax Income/Net Sales
X8 - After-tax net Interest Rate: Net Income/Net Sales
X9 - Non-industry income and expenditure/revenue: Net Non-operating Income Ratio
X10 - Continuous interest rate (after tax): Net Income-Exclude Disposal Gain or Loss/Net Sales
X11 - Operating Expense Rate: Operating Expenses/Net Sales
X12 - Research and development expense rate: (Research and Development Expenses)/Net Sales
X13 - Cash flow rate: Cash Flow from Operating/Current Liabilities
X14 - Interest-bearing debt interest rate: Interest-bearing Debt/Equity
X15 - Tax rate (A): Effective Tax Rate
X16 - Net Value Per Share (B): Book Value Per Share(B)
X17 - Net Value Per Share (A): Book Value Per Share(A)
X18 - Net Value Per Share (C): Book Value Per Share(C)
X19 - Persistent EPS in the Last Four Seasons: EPS-Net Income
X20 - Cash Flow Per Share
X21 - Revenue Per Share (Yuan ¥): Sales Per Share
X22 - Operating Profit Per Share (Yuan ¥): Operating Income Per Share
X23 - Per Share Net profit before tax (Yuan ¥): Pretax Income Per Share
X24 - Realized Sales Gross Profit Growth Rate
X25 - Operating Profit Growth Rate: Operating Income Growth
X26 - After-tax Net Profit Growth Rate: Net Income Growth
X27 - Regular Net Profit Growth Rate: Continuing Operating Income after Tax Growth
X28 - Continuous Net Profit Growth Rate: Net Income-Excluding Disposal Gain or Loss Growth
X29 - Total Asset Growth Rate: Total Asset Growth
X30 - Net Value Growth Rate: Total Equity Growth
X31 - Total Asset Return Growth Rate Ratio: Return on Total Asset Growth
X32 - Cash Reinvestment %: Cash Reinvestment Ratio
X33 - Current Ratio
X34 - Quick Ratio: Acid Test
X35 - Interest Expense Ratio: Interest Expenses/Total Revenue
X36 - Total debt/Total net worth: Total Liability/Equity Ratio
X37 - Debt ratio %: Liability/Total Assets
X38 - Net worth/Assets: Equity/Total Assets
X39 - Long-term fund suitability ratio (A): (Long-term Liability+Equity)/Fixed Assets
X40 - Borrowing dependency: Cost of Interest-bearing Debt
X41 - Contingent liabilities/Net worth: Contingent Liability/Equity
X42 - Operating profit/Paid-in capital: Operating Income/Capital
X43 - Net profit before tax/Paid-in capital: Pretax Income/Capital
X44 - Inventory and accounts receivable/Net value: (Inventory+Accounts Receivables)/Equity
X45 - Total Asset Turnover
X46 - Accounts Receivable Turnover
X47 - Average Collection Days: Days Receivable Outstanding
X48 - Inventory Turnover Rate (times)
X49 - Fixed Assets Turnover Frequency
X50 - Net Worth Turnover Rate (times): Equity Turnover
X51 - Revenue per person: Sales Per Employee
X52 - Operating profit per person: Operation Income Per Employee
X53 - Allocation rate per person: Fixed Assets Per Employee
X54 - Working Capital to Total Assets
X55 - Quick Assets/Total Assets
X56 - Current Assets/Total Assets
X57 - Cash/Total Assets
X58 - Quick Assets/Current Liability
X59 - Cash/Current Liability
X60 - Current Liability to Assets
X61 - Operating Funds to Liability
X62 - Inventory/Working Capital
X63 - Inventory/Current Liability
X64 - Current Liabilities/Liability
X65 - Working Capital/Equity
X66 - Current Liabilities/Equity
X67 - Long-term Liability to Current Assets
X68 - Retained Earnings to Total Assets
X69 - Total income/Total expense
X70 - Total expense/Assets
X71 - Current Asset Turnover Rate: Current Assets to Sales
X72 - Quick Asset Turnover Rate: Quick Assets to Sales
X73 - Working capitcal Turnover Rate: Working Capital to Sales
X74 - Cash Turnover Rate: Cash to Sales
X75 - Cash Flow to Sales
X76 - Fixed Assets to Assets
X77 - Current Liability to Liability
X78 - Current Liability to Equity
X79 - Equity to Long-term Liability
X80 - Cash Flow to Total Assets
X81 - Cash Flow to Liability
X82 - CFO to Assets
X83 - Cash Flow to Equity
X84 - Current Liability to Current Assets
X85 - Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0 otherwise
X86 - Net Income to Total Assets
X87 - Total assets to GNP price
X88 - No-credit Interval
X89 - Gross Profit to Sales
X90 - Net Income to Stockholder's Equity
X91 - Liability to Equity
X92 - Degree of Financial Leverage (DFL)
X93 - Interest Coverage Ratio (Interest expense to EBIT)
X94 - Net Income Flag: 1 if Net Income is Negative for the last two years, 0 otherwise
X95 - Equity to Liability

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline


from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split,cross_val_score

In [2]:
pd.set_option("display.max_columns", None)

In [3]:
# Adjust this path to wherever the Kaggle data.csv file is saved locally;
# read_csv raises FileNotFoundError if the file is not at this location.
df = pd.read_csv('/Users/apple/Documents/Data science/ML project/data.csv')
df

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.columns

In [None]:
df['Bankrupt?'].value_counts()

In [None]:
df.isnull().sum()

In [None]:
# pass the column as a keyword; recent seaborn versions reject it as a positional argument
sns.countplot(x='Bankrupt?', data=df)

NOTE: The dataset is highly unbalanced. This can be fixed with an adjustment, but I'm taking the dataset as it is, because the main goal of this ML project is to practice ML algorithms.
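
One possible form of such an adjustment is sketched below on synthetic data, not on the bankruptcy dataset. It keeps the class ratio the same in both splits with `stratify` and lets the classifier reweight the classes with `class_weight='balanced'`. The data and the names `X_demo`, `y_demo` and `weighted_lr` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an unbalanced dataset (roughly 5% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# stratify=y_demo keeps the positive rate (nearly) equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
print('positive rate train/test:', y_tr.mean(), y_te.mean())

# class_weight='balanced' weights each class inversely to its frequency.
weighted_lr = LogisticRegression(class_weight='balanced', max_iter=1000)
weighted_lr.fit(X_tr, y_tr)
print('predicted positives in test set:', int(weighted_lr.predict(X_te).sum()))
```

Whether reweighting actually helps on the real data would have to be checked with a metric that looks at the minority class, such as recall.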

# Pipelines

In [None]:
# We create the preprocessing pipeline for the numeric features (no categorical transformer is defined here).
numeric_features = [ ' ROA(C) before interest and depreciation before interest',
       ' ROA(A) before interest and % after tax',
       ' ROA(B) before interest and depreciation after tax',
       ' Operating Gross Margin', ' Realized Sales Gross Margin',
       ' Operating Profit Rate', ' Pre-tax net Interest Rate',
       ' After-tax net Interest Rate',
       ' Non-industry income and expenditure/revenue',
       ' Continuous interest rate (after tax)', ' Operating Expense Rate',
       ' Research and development expense rate', ' Cash flow rate',
       ' Interest-bearing debt interest rate', ' Tax rate (A)',
       ' Net Value Per Share (B)', ' Net Value Per Share (A)',
       ' Net Value Per Share (C)', ' Persistent EPS in the Last Four Seasons',
       ' Cash Flow Per Share', ' Revenue Per Share (Yuan ¥)',
       ' Operating Profit Per Share (Yuan ¥)',
       ' Per Share Net profit before tax (Yuan ¥)',
       ' Realized Sales Gross Profit Growth Rate',
       ' Operating Profit Growth Rate', ' After-tax Net Profit Growth Rate',
       ' Regular Net Profit Growth Rate', ' Continuous Net Profit Growth Rate',
       ' Total Asset Growth Rate', ' Net Value Growth Rate',
       ' Total Asset Return Growth Rate Ratio', ' Cash Reinvestment %',
       ' Current Ratio', ' Quick Ratio', ' Interest Expense Ratio',
       ' Total debt/Total net worth', ' Debt ratio %', ' Net worth/Assets',
       ' Long-term fund suitability ratio (A)', ' Borrowing dependency',
       ' Contingent liabilities/Net worth',
       ' Operating profit/Paid-in capital',
       ' Net profit before tax/Paid-in capital',
       ' Inventory and accounts receivable/Net value', ' Total Asset Turnover',
       ' Accounts Receivable Turnover', ' Average Collection Days',
       ' Inventory Turnover Rate (times)', ' Fixed Assets Turnover Frequency',
       ' Net Worth Turnover Rate (times)', ' Revenue per person',
       ' Operating profit per person', ' Allocation rate per person',
       ' Working Capital to Total Assets', ' Quick Assets/Total Assets',
       ' Current Assets/Total Assets', ' Cash/Total Assets',
       ' Quick Assets/Current Liability', ' Cash/Current Liability',
       ' Current Liability to Assets', ' Operating Funds to Liability',
       ' Inventory/Working Capital', ' Inventory/Current Liability',
       ' Current Liabilities/Liability', ' Working Capital/Equity',
       ' Current Liabilities/Equity', ' Long-term Liability to Current Assets',
       ' Retained Earnings to Total Assets', ' Total income/Total expense',
       ' Total expense/Assets', ' Current Asset Turnover Rate',
       ' Quick Asset Turnover Rate', ' Working capitcal Turnover Rate',
       ' Cash Turnover Rate', ' Cash Flow to Sales', ' Fixed Assets to Assets',
       ' Current Liability to Liability', ' Current Liability to Equity',
       ' Equity to Long-term Liability', ' Cash Flow to Total Assets',
       ' Cash Flow to Liability', ' CFO to Assets', ' Cash Flow to Equity',
       ' Current Liability to Current Assets', ' Liability-Assets Flag',
       ' Net Income to Total Assets', ' Total assets to GNP price',
       ' No-credit Interval', ' Gross Profit to Sales',
       " Net Income to Stockholder's Equity", ' Liability to Equity',
       ' Degree of Financial Leverage (DFL)',
       ' Interest Coverage Ratio (Interest expense to EBIT)',
        ' Equity to Liability']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])
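
Note that `ColumnTransformer` drops every column that is not listed, because `remainder='drop'` is its default. The `numeric_features` list above does not contain ' Net Income Flag' (X94), so that column never reaches the classifiers; whether this was intended is not stated. The toy frame below (hypothetical column names) is a small sketch of the difference between the default and `remainder='passthrough'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: two ratio columns plus one flag column.
toy = pd.DataFrame({
    'ratio_a': [0.1, 0.4, 0.7, 1.0],
    'ratio_b': [2.0, 4.0, 6.0, 8.0],
    'flag': [0, 1, 0, 1],
})

# Default remainder='drop': columns that are not listed are removed.
drop_ct = ColumnTransformer([('num', StandardScaler(), ['ratio_a', 'ratio_b'])])
dropped = drop_ct.fit_transform(toy)

# remainder='passthrough' appends the unlisted columns unchanged.
keep_ct = ColumnTransformer(
    [('num', StandardScaler(), ['ratio_a', 'ratio_b'])],
    remainder='passthrough',
)
kept = keep_ct.fit_transform(toy)

print(dropped.shape)  # (4, 2)
print(kept.shape)     # (4, 3)
```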

# 1. ML- Logistic regression

In [None]:
#import the model, the metrics and the timer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import time

In [None]:
#defining dependent and independent variables
x = df.drop('Bankrupt?', axis=1)
y = df['Bankrupt?']

In [None]:
#splitting data into training and testing set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

In [None]:
#training model
lr = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter = 10000))])
lr.fit(x_train, y_train)


#getting confusion matrix
y_pred = lr.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
lra = accuracy_score(y_test,y_pred)
print('accuracy score = ',lra)

#check for time use
start=time.time()
lr = LogisticRegression(max_iter = 10000)
lr.fit(x_train,y_train)
end=time.time()
lrt=end-start
print ('The time (seconds) of execution of logistic regression is: ', lrt)
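
The timing block above refits a bare `LogisticRegression` on unscaled features, so it measures a slightly different model from the pipeline that produced the accuracy score. A minimal sketch of timing the pipeline itself, with fit and predict measured separately, could look like the following. The helper `time_fit_predict` and the synthetic data are hypothetical and only serve the illustration.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def time_fit_predict(model, X_train, y_train, X_test):
    """Return (fit_seconds, predict_seconds) for any estimator or pipeline."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds


# Synthetic data stands in for the bankruptcy features.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', LogisticRegression(max_iter=10000))])

fit_s, pred_s = time_fit_predict(pipe, X_demo[:400], y_demo[:400], X_demo[400:])
print('fit: %.4f s, predict: %.4f s' % (fit_s, pred_s))
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring short durations.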

# 2. ML- KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
#check for optimal k neighbor
for i in range (1,10): 
    knnmodel= KNeighborsClassifier(n_neighbors=i)
    knnmodel.fit(x_train,y_train)
    predictions=knnmodel.predict(x_test)
    acc_score=accuracy_score(y_test,predictions)
    print (i,acc_score)

k = 2 already gives high enough accuracy, so I use n_neighbors = 2.
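
The loop above selects k from test-set accuracy on unscaled features, while the final model is a pipeline with scaling, and the test set is then no longer an independent check. A sketch of an alternative, assuming synthetic data in place of `x_train` and `y_train`, chooses k by cross-validation with `cross_val_score` (already imported at the top of the notebook) and scales inside the pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for x_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

cv_scores = {}
for k in range(1, 10):
    knn_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                               ('classifier', KNeighborsClassifier(n_neighbors=k))])
    # Mean accuracy over 5 folds; the scaler is refit inside each fold.
    cv_scores[k] = cross_val_score(knn_pipe, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print('best k by cross-validation:', best_k)
```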

In [None]:
knn = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', KNeighborsClassifier(n_neighbors = 2, metric = 'minkowski',p = 2))])
knn.fit(x_train,y_train)

#confusion matrix
y_pred=knn.predict(x_test)
cm=confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
knna = accuracy_score(y_test,y_pred)
print('accuracy score = ',knna)

#check for time use
start=time.time()
knn= KNeighborsClassifier(n_neighbors = 2, metric = 'minkowski',p = 2)
knn.fit(x_train,y_train)
end=time.time()
knnt=end-start
print ('The time (seconds) of execution of KNN is: ', knnt)

# 3. ML- Decision Tree

In [None]:
#training model
from sklearn.tree import DecisionTreeClassifier
dt = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', DecisionTreeClassifier())])
dt.fit(x_train,y_train)

#getting confusion matrix
y_pred = dt.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
dta = accuracy_score(y_test,y_pred)
print('accuracy score = ',dta)

#check for time use
start=time.time()
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)
end=time.time()
dtt=end-start
print ('The time (seconds) of execution of decision tree is: ', dtt)


# 4. ML- Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

#n_estimators: no. of decision trees; 100 by default
#n_jobs: The number of jobs to run in parallel. None means 1 , -1 means using all processors

rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(n_estimators = 100, random_state = 0))])
rf.fit(x_train,y_train)

#getting confusion matrix
y_pred = rf.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
rfa = accuracy_score(y_test,y_pred)
print('accuracy score = ',rfa)

#check for time use
start=time.time()
rf = RandomForestClassifier(n_estimators = 100, random_state = 0)
rf.fit(x_train,y_train)
end=time.time()
rft=end-start
print ('The time (seconds) of execution of Random Forest is: ', rft)

# 5. ML- AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

#n_estimators: The maximum number of estimators at which boosting is terminated. 
ada = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=100,random_state=10))])
ada.fit(x_train, y_train)

#getting confusion matrix
y_pred = ada.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
adaa = accuracy_score(y_test,y_pred)
print('accuracy score = ',adaa)


#check for time use
start=time.time()
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=100,random_state=10)
ada.fit(x_train, y_train)
end=time.time()
adat=end-start
print ('The time (seconds) of execution of Adaboost is: ', adat)

# Conclusion

In [None]:
plt.figure(figsize= (8,7))
ac = [lra,knna,dta,rfa,adaa]
name = ['Logistic Regression','knn','Decision Tree', 'Random Forest','AdaBoost']
sns.barplot(x = ac,y = name,palette='pastel')
plt.title("Plotting the Model Accuracies", fontsize=16, fontweight="bold")

A bar chart isn't a good graphical representation here, because the model accuracies are all quite similar, so I decided to use a scatterplot of accuracy against execution time.

In [None]:
models= pd.DataFrame({
    'Model': ['Logistic Regression', 'KNN Classifier', 'Decision Tree','Random Forest','Ada Boost'],
    'Score': [lra,knna,dta,rfa,adaa],
    'Execution Time':[lrt,knnt,dtt,rft,adat]})
models

In [None]:
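fig, ax = plt.subplots()

# plt.cm is used because the name cm was reassigned to a confusion matrix above
colormap = plt.cm.viridis
colorlist = [colors.rgb2hex(colormap(i)) for i in np.linspace(0, 0.9, len(models['Model']))]

for i, c in enumerate(colorlist):
    # separate names so the feature matrix x and the labels y are not overwritten
    score = models['Score'][i]
    exec_time = models['Execution Time'][i]
    label = models['Model'][i]

    ax.scatter(score, exec_time, label=label, s=50, linewidth=0.1, color=c)

ax.set_xlabel('Accuracy score')
ax.set_ylabel('Execution time (seconds)')
ax.legend()

plt.show()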

As the table above indicates, all models have a high accuracy score. The KNN classifier takes the least execution time, so if I had to choose a model for predicting bankruptcy, I would choose the KNN classifier. Note that the measured times cover only the `fit` call of each model.
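
Because the dataset is highly unbalanced, accuracy alone may overstate how well a model finds bankrupt companies. The sketch below uses made-up labels (not results from this notebook) to show that a predictor which always outputs the majority class still reaches high accuracy while its recall on the minority class is zero.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: 95 healthy companies (0) and 5 bankrupt ones (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_always_zero = [0] * 100

print(accuracy_score(y_true, y_always_zero))           # 0.95
print(recall_score(y_true, y_always_zero))             # 0.0
print(balanced_accuracy_score(y_true, y_always_zero))  # 0.5
```

Looking at the confusion matrices printed earlier, or at recall for the bankrupt class, would therefore be a useful complement to the accuracy comparison.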