# Introduction
The data were collected from the Taiwan Economic Journal for the years 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange.
# Purpose
The dataset has many features (96), so it is an excellent opportunity to put into practice dimensionality reduction techniques, EDA and, finally, a machine learning prediction model.
# Table of contents
1. [Data Loading and Data Cleaning](#1.-Data-Loading-and-Data-Cleaning)
2. [Model Based Feature Selection](#2.-Model-Based-Feature-Selection)
3. [Descriptive Analysis](#3.-Descriptive-Analysis)
4. [Data Analysis](#4.-Data-Analysis)
5. [Predicting bankruptcy](#5.-Predicting-bankruptcy)
6. [Conclusions](#6.-Conclusions)


In [None]:
# runtime
import timeit

# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

# preprocessing
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

# Ml model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

# np.warnings was removed in NumPy 1.24; use the standard library module instead
import warnings
warnings.filterwarnings('ignore')

# 1. Data Loading and Data Cleaning
In this step we are just going to check whether we have any nulls and look at the shape of the dataset. Descriptive analytics wouldn't make sense yet, since we are going to drop lots of features in step 2.

In [None]:
bank = pd.read_csv('../input/company-bankruptcy-prediction/data.csv')

print(bank.isnull().values.any())
print(bank.shape)

bank

Great! Let's start dropping features.

# 2. Model Based Feature Selection
Model-based feature selection uses a supervised machine learning model to judge the importance of each feature and keeps only the most important ones. In this case, we are going to use a random forest classifier, since it usually yields good results and this is a classification task.
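To make the idea concrete, here is a minimal sketch of the elimination loop that recursive feature elimination performs: fit the model, drop the feature with the lowest importance, and repeat until the desired number of features is left. The helper name `manual_rfe` and the synthetic data are assumptions made for illustration only; they are not part of the dataset or of scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: drop the least important column, one per round."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        forest = RandomForestClassifier(n_estimators=100, random_state=42)
        forest.fit(X[:, remaining], y)
        # Position (within `remaining`) of the weakest feature in this round
        weakest = int(np.argmin(forest.feature_importances_))
        remaining.pop(weakest)
    return remaining

# Synthetic data (assumption): only columns 2 and 4 carry information
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 6))
y_demo = (X_demo[:, 2] + X_demo[:, 4] > 0).astype(int)

kept = manual_rfe(X_demo, y_demo, n_features_to_select=2)
print("kept columns:", kept)
```

`RFE` wraps this kind of loop and also exposes the result through `get_support()`, which is what the cells in this section use.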

In [None]:
# training set
X = bank.iloc[:,1:].values
y = bank.iloc[:,0].values  # keep the target 1-D; a column vector triggers a DataConversionWarning in scikit-learn

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# determining optimal number of features
n_features = [5, 10, 15, 20, 25, 30, 35, 40]
for i in n_features:
    # Building the model based feature selection
    select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=i)

    select.fit(X_train, y_train)

    mask = select.get_support()

    X_train_rfe = select.transform(X_train)
    X_test_rfe = select.transform(X_test)

    score = RandomForestClassifier().fit(X_train_rfe, y_train).score(X_test_rfe, y_test)
    
    print("Test score: {:.3f}".format(score), " number of features: {}".format(i))



There is not much difference between the scores for the different numbers of features. We are going to work with 15 features, since it is a 'workable' number and has a good score. Let's run the algorithm again and get the features.

In [None]:
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=15)

select.fit(X_train, y_train)

mask = select.get_support()

X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)

score = RandomForestClassifier().fit(X_train_rfe, y_train).score(X_test_rfe, y_test)

print("Test score: {:.3f}".format(score), " number of features: {}".format(15))

features = pd.DataFrame({'features':list(bank.iloc[:,1:].keys()), 'select':list(mask)})
features = list(features[features['select']==True]['features'])
features.append('Bankrupt?')

Let's see the DataFrame we are going to work with and its stats.

In [None]:
bank = bank[features]
bank

# 3. Descriptive Analysis

Now that we have a more workable number of features, let's take a look at their stats

## 3.1. Target Variable

In [None]:
sns.countplot(data=bank, x='Bankrupt?', palette='bwr')
plt.show()

bank.groupby('Bankrupt?').size()

We have highly unbalanced data. This is a problem because a machine learning algorithm can score well just by predicting the majority class, which is likely the reason we got such good results in section 2. In this case we are going to try oversampling the data: making synthetic data out of the smaller class (1).
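A small sketch shows why accuracy alone can look good on unbalanced data. The class counts below are hypothetical and chosen only for illustration; they are not the real counts of this dataset.

```python
import numpy as np

# Hypothetical labels: 970 healthy companies (0) and 30 bankrupt ones (1)
y_true = np.array([0] * 970 + [1] * 30)

# A "model" that ignores every feature and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the bankrupt class: share of real bankruptcies that were caught
recall_bankrupt = (y_pred[y_true == 1] == 1).mean()

print("accuracy: {:.3f}".format(accuracy))
print("recall on bankrupt class: {:.3f}".format(recall_bankrupt))
```

With these made-up counts the accuracy is 0.970 while not a single bankruptcy is detected, so a per-class measure such as recall is worth checking next to the plain score.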

## 3.2. Features

In [None]:
bank.hist(figsize=(20,20), edgecolor='white')
plt.show()

Most of the features are rich in outliers, and in some others the values fall into just one bin. Let's take a closer look at ' Non-industry income and expenditure/revenue'.

In [None]:
bins = pd.cut(bank[' Non-industry income and expenditure/revenue'], bins=10)
bins = pd.DataFrame(bins)
bins.value_counts()

In [None]:
lower = bank[' Non-industry income and expenditure/revenue'] >0.3025
upper = bank[' Non-industry income and expenditure/revenue'] <0.3045

close = bank[lower & upper]
print('Rows with outliers: {}'.format(bank.shape[0]))
print('Rows without outliers: {}'.format(close.shape[0]))
print('information lost = {} rows'.format(bank.shape[0]-close.shape[0]))
close[' Non-industry income and expenditure/revenue'].hist(edgecolor='white')

Once the outliers are removed, the distribution looks roughly normal, but the full data is heavily influenced by the outliers. Therefore, when analysing the data, it will be better to use the median as our measure of central tendency.
Additionally, when we are building our model, we could try taking these outliers out just to see if we can get a better result.
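The bounds used above were picked by eye from the bins. As an alternative, a generic rule such as the 1.5 × IQR fences could be applied to any column. The following is only a sketch of that rule on made-up values; the helper `iqr_mask` and the sample data are assumptions, not part of the notebook.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Hypothetical helper: True for values inside the k * IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

# Made-up column: values packed around 0.30 plus two extreme points
rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(0.30, 0.001, size=500), [0.0, 1.0]])

mask = iqr_mask(values)
print("rows kept: {} of {}".format(mask.sum(), values.size))
print("mean   with / without outliers: {:.4f} / {:.4f}".format(
    values.mean(), values[mask].mean()))
print("median with / without outliers: {:.4f} / {:.4f}".format(
    np.median(values), np.median(values[mask])))
```

The median barely moves when the extreme points are dropped, while the mean shifts, which is the reason for preferring the median in the analysis.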

In [None]:
display(bank.describe())
bank.shape

## 3.3. Correlations

In [None]:
fig, ax = plt.subplots(figsize=(14,12))

sns.heatmap(bank.corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True), annot=True)


We have some interesting correlations. Let's inspect the top 3 and see if we can find any bankruptcy pattern

In [None]:
fig, ax = plt.subplots(1,3, figsize=(20, 6))

sns.scatterplot(data=bank, x=' Net profit before tax/Paid-in capital', y=' Persistent EPS in the Last Four Seasons', hue='Bankrupt?', ax=ax[0])
sns.scatterplot(data=bank, x=' Persistent EPS in the Last Four Seasons', y=' Net Value Per Share (A)', hue='Bankrupt?', ax=ax[1])
sns.scatterplot(data=bank, x=" Net Income to Stockholder's Equity", y=' Borrowing dependency', hue='Bankrupt?', ax=ax[2])

In [None]:
bank.info()

We start to see some patterns:
- Companies with a low 'Net profit before tax/Paid-in capital', 'Persistent EPS in the Last Four Seasons' and 'Net Value Per Share (A)' tend to go bankrupt
- 'Borrowing dependency' has bankrupt companies distributed across its whole range, but the companies that do not go bankrupt are located around 0.4. A value around 0.4 doesn't guarantee safety from bankruptcy, since a lot of companies went bankrupt with this value, but a higher or lower value seems to be critical, since there aren't any companies operating at those values. The same goes for "Net Income to Stockholder's Equity", but around 0.8

## Descriptive Analysis Conclusions
- We have highly unbalanced data. Therefore, we are going to try applying oversampling
- Most of the features have outliers. The median will be a better measure for the analysis and, also, taking some outliers out will be a good idea when building the model
- Companies with a low 'Net profit before tax/Paid-in capital', 'Persistent EPS in the Last Four Seasons' and 'Net Value Per Share (A)' tend to go bankrupt. **A KNN algorithm should yield good results since the clusters are so evident**
- A 'Borrowing dependency' of around 0.4 is a good level to operate at, but it doesn't completely save you from bankruptcy
- A "Net Income to Stockholder's Equity" of around 0.8 is a good level to operate at, but it doesn't completely save you from bankruptcy

# 4. Data Analysis
Let's compare the median of each feature for bankrupt and non-bankrupt companies to see whether we can find further tendencies.

In [None]:
central = bank.groupby('Bankrupt?').median().reset_index()
features = list(central.keys()[1:])

fig, ax = plt.subplots(5,3, figsize=(20,20))

ax = ax.ravel()
position = 0

for i in features:
    sns.barplot(data=central, x='Bankrupt?', y=i, ax=ax[position], palette='bwr')
    position += 1
    
plt.show()
display(central)

## Data Analysis Conclusions
Let's mention the most evident tendencies:

Companies with:
- high "Interest-bearing debt interest rate" tend to go bankrupt (≈ 0.000499)
- high "Total debt/Total net worth" tend to go bankrupt (≈ 0.015723)
- high "Fixed Assets Turnover Frequency" tend to go bankrupt (≈ 0.001225)
- low "Cash/Total Assets" tend to go bankrupt (≈ 0.023755)
- low "Equity to Liability" tend to go bankrupt (≈ 0.018662)

Also, these indicators should be enough to build a reliable model, since the trend is very clear. Let's build our model.

# 5. Predicting bankruptcy

## 5.1 KNN
If we recall the conclusions of sections 3 and 4, a KNN algorithm with the features 'Net profit before tax/Paid-in capital', 'Persistent EPS in the Last Four Seasons', "Interest-bearing debt interest rate", "Total debt/Total net worth", "Fixed Assets Turnover Frequency", "Cash/Total Assets" and "Equity to Liability" should do the work. Let's go and try.

We also have to take into account that we are dealing with highly imbalanced data, so oversampling will be part of the preprocessing phase (part 2, next week).
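Until that part is written, here is a minimal sketch of what the oversampling step could look like. Note that it uses plain random oversampling (duplicating minority rows), which is a simpler technique than generating synthetic samples; the helper `random_oversample` and the toy arrays are assumptions for illustration only.

```python
import numpy as np

def random_oversample(X, y, random_state=42):
    """Hypothetical helper: duplicate minority rows until classes are balanced."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    indices = []
    for label, count in zip(classes, counts):
        label_idx = np.flatnonzero(y == label)
        if count < target:
            # Sample existing rows of this class again, with replacement
            extra = rng.choice(label_idx, size=target - count, replace=True)
            label_idx = np.concatenate([label_idx, extra])
        indices.append(label_idx)
    indices = np.concatenate(indices)
    rng.shuffle(indices)
    return X[indices], y[indices]

# Toy data (assumption): 17 rows of class 0 and 3 rows of class 1
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([0] * 17 + [1] * 3)

X_bal, y_bal = random_oversample(X_demo, y_demo)
print(np.bincount(y_bal))
```

In a real run the function would be applied to `X_train` and `y_train` only, after `train_test_split`, so that the test set keeps the original class balance and contains no duplicated rows.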

In [None]:
model = ['Bankrupt?', ' Net profit before tax/Paid-in capital', ' Persistent EPS in the Last Four Seasons', " Interest-bearing debt interest rate", " Total debt/Total net worth", " Fixed Assets Turnover Frequency", " Cash/Total Assets", " Equity to Liability"]
model = bank[model]
X = model.iloc[:,1:].values
y = model.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

best_n = 0
best_training = 0
best_test = 0

for i in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    
    training = knn.score(X_train, y_train)
    test = knn.score(X_test, y_test)
    
    if test > best_test:
        best_n = i
        best_training = training
        best_test = test

print("best number of neighbors: {}".format(best_n))
print("best training set score : {:.3f}".format(best_training))
print("best test set score: {:.3f}".format(best_test))

In [None]:
start = timeit.default_timer()

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print("training set score : {:.2f}".format(knn.score(X_train, y_train)))
print("test set score: {:.2f}".format(knn.score(X_test, y_test)))

stop = timeit.default_timer()
print('Time: ', stop - start)  
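The notebook imports several scalers, but the KNN cells above use the raw feature values. Since KNN relies on distances, a feature with a much larger range can dominate the neighbour search. The sketch below illustrates this on made-up data with a `StandardScaler` placed in a pipeline; whether scaling helps on the selected columns depends on their actual ranges, which is not checked here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: one informative feature on a small scale and one
# pure-noise feature on a much larger scale
rng = np.random.default_rng(42)
y_demo = np.repeat([0, 1], 200)
informative = np.where(y_demo == 1, 0.8, 0.2) + rng.normal(0, 0.05, size=400)
noise = rng.normal(0, 1000, size=400)
X_demo = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo)

# Same classifier, with and without scaling fitted on the training split
raw_knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=7)).fit(X_tr, y_tr)

raw_score = raw_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print("test score without scaling: {:.2f}".format(raw_score))
print("test score with scaling: {:.2f}".format(scaled_score))
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the preprocessing.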

## 5.2 Gradient Boosting Classifier

Here we first apply a more sophisticated classifier to our reduced data and then to the whole dataset. In the end, we compare the three models.

### 5.3 Gradient Boosting Classifier, reduced features

In [None]:
X = model.iloc[:,1:].values
y = model.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

start = timeit.default_timer()
gbrt = GradientBoostingClassifier(n_estimators=100, random_state=42, max_depth=1).fit(X_train, y_train)

print("training set score : {:.2f}".format(gbrt.score(X_train, y_train)))
print("test set score: {:.2f}".format(gbrt.score(X_test, y_test)))

stop = timeit.default_timer()
print('Time: ', stop - start)  

### 5.4 Gradient Boosting Classifier, all features

In [None]:
bank = pd.read_csv('../input/company-bankruptcy-prediction/data.csv')
X = bank.iloc[:,1:].values
y = bank.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

start = timeit.default_timer()
gbrt = GradientBoostingClassifier(n_estimators=100, random_state=42, max_depth=1).fit(X_train, y_train)

print("training set score : {:.2f}".format(gbrt.score(X_train, y_train)))
print("test set score: {:.2f}".format(gbrt.score(X_test, y_test)))

stop = timeit.default_timer()
print('Time: ', stop - start)  

# 6. Conclusions

- We were able to build three models with an accuracy of 0.97 while significantly reducing the number of features (just 7). This led us to save running time (from 5.47 seconds to only 0.5)
- With the reduced features, we were also able to describe how a company might go bankrupt or not, which makes the model easier to explain. The feature conclusions were:

Companies with:
- high "Interest-bearing debt interest rate" tend to go bankrupt (≈ 0.000499)
- high "Total debt/Total net worth" tend to go bankrupt (≈ 0.015723)
- high "Fixed Assets Turnover Frequency" tend to go bankrupt (≈ 0.001225)
- low "Cash/Total Assets" tend to go bankrupt (≈ 0.023755)
- low "Equity to Liability" tend to go bankrupt (≈ 0.018662)
- low 'Net profit before tax/Paid-in capital', 'Persistent EPS in the Last Four Seasons' and 'Net Value Per Share (A)' tend to go bankrupt