# Bankruptcy Prediction
Why is it necessary?  
Predicting bankruptcy from publicly available information can help you invest wisely, or help you decide whether to trust a company or not.  
It is also advantageous to banks, which can prevent losses by avoiding loans to such companies.  
This data, however, is limited to certain companies and does not represent the complete picture.  

**In this notebook**, I'll demonstrate a rather simple model based on Naive Bayes.  
For those who are unaware of the concept, do read about it, as it is very simple and is based on high school probability concepts.
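
As a rough illustration of the idea behind Naive Bayes, here is a toy calculation using Bayes' rule with two binary features. Every probability and feature name below is a made-up number for illustration, not a value taken from the dataset.

```python
# Toy Naive Bayes calculation with made-up probabilities (illustrative only).
# Classes: bankrupt (1) or not bankrupt (0). Two hypothetical binary features.
prior = {1: 0.05, 0: 0.95}

# P(feature is "on" | class). Treating the features as independent given the
# class is the "naive" part of Naive Bayes.
p_high_debt = {1: 0.8, 0: 0.2}
p_low_profit = {1: 0.7, 0: 0.1}

def posterior(high_debt, low_profit):
    """Return P(class | features) for both classes via Bayes' rule."""
    scores = {}
    for c in (0, 1):
        like_debt = p_high_debt[c] if high_debt else 1 - p_high_debt[c]
        like_profit = p_low_profit[c] if low_profit else 1 - p_low_profit[c]
        scores[c] = prior[c] * like_debt * like_profit
    total = sum(scores.values())
    # normalise so the two posteriors sum to 1
    return {c: s / total for c, s in scores.items()}

post = posterior(high_debt=True, low_profit=True)
print(post)
```

Even though bankruptcy is rare in this toy setup, observing both warning signs makes the bankrupt class the more probable one.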

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns

from sklearn.model_selection import *
from sklearn.feature_selection import *
from sklearn.metrics import *
from sklearn.preprocessing import *
from sklearn.ensemble import *
from sklearn.decomposition import PCA
from sklearn.naive_bayes import BernoulliNB

# display complete rows and columns instead of truncating them with ...
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

FOLDS = 5

# Importing Data

In [None]:
data = pd.read_csv('../input/company-bankruptcy-prediction/data.csv')
data.head()

In [None]:
data.describe()

# Data Plots
Pandas itself supports many of the matplotlib plotting functions, and we'll use those.

In [None]:
data.hist(figsize=(40,60))
plt.show()

If you observe carefully, most of the columns are skewed.  
This means most of the data points lie close to the column's minimum or maximum.  
For example, observe **Operating Expense Rate** and **Research and development expense rate**.  
As a result, the model will not be able to distinguish well between yes and no, since most values lie towards one end.  
Hence we'll look at how to reduce skewness.  

In [None]:
data.corrwith(data['Bankrupt?'])

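Correlation measures how strongly, and in which direction, one column moves together with another; it does not by itself show that one causes the other.  
`+1` means a direct relationship, i.e. an increase in the column goes with an increase in the label value.  
`-1` means just the opposite, and values near `0` mean little linear relationship.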
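
To make the sign of the correlation concrete, here is a small sketch on synthetic data. The column names and the noise levels are assumptions chosen for illustration; they are not part of the bankruptcy dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# hypothetical binary label, playing the role of 'Bankrupt?'
label = pd.Series(rng.integers(0, 2, size=500), name="label")

toy = pd.DataFrame({
    "rises_with_label": label + rng.normal(0, 0.5, size=500),
    "falls_with_label": -label + rng.normal(0, 0.5, size=500),
    "unrelated": rng.normal(0, 1, size=500),
})

# Pearson correlation of every column with the label
corr = toy.corrwith(label)
print(corr.sort_values())
```

Sorting the result, as done here, is a handy way to see the most negatively and most positively correlated columns at the two ends.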

# Understanding Data

In [None]:
data.isna().sum()

As we observe, there are no missing values in the dataset.

In [None]:
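def corr_skew(X):
    """Reduce skewness: log-transform right-skewed columns, cube left-skewed ones."""
    # skewness of every column, as a DataFrame with columns 'index' and 'skew'
    s = X.skew().reset_index().rename(columns = {0:'skew'})

    pos = list(s[s['skew']>=1]['index'].values)   # strongly right-skewed columns
    neg = list(s[s['skew']<=-1]['index'].values)  # strongly left-skewed columns

    X[pos] = np.log1p(X[pos])  # log(1 + x), so a value of 0 does not break the log
    X[neg] = (X[neg])**3
    return X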

The above function finds the skewness of each column.  
Skewness of 1 or more is corrected using `log`, whereas skewness of -1 or less is corrected using `cube`.  
**Note: in the log I've used +1 to prevent taking the log of 0, if any.**
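
The effect of both corrections can be checked on synthetic data. The sketch below assumes a log-normal sample as the right-skewed column and a Beta(5, 1) sample as the left-skewed one; both are stand-ins, not columns from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# right-skewed stand-in: most values are small, with a long tail of large ones
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))
# left-skewed stand-in in [0, 1]: most values sit close to 1
left_skewed = pd.Series(rng.beta(5, 1, size=2000))

right_before = right_skewed.skew()
right_after = np.log1p(right_skewed).skew()   # log(1 + x)

left_before = left_skewed.skew()
left_after = (left_skewed ** 3).skew()        # cube

print(f"right-skewed: {right_before:.2f} -> {right_after:.2f}")
print(f"left-skewed:  {left_before:.2f} -> {left_after:.2f}")
```

In both cases the skewness should move closer to 0, although neither transform is guaranteed to make a column symmetric.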

In [None]:
y  = data['Bankrupt?']
X  = data.drop(['Bankrupt?'],axis=1)
X = corr_skew(X)

# Visualise data and labels

In [None]:
def plot_PCA(X,y):
    
    pca = PCA()
    X = pca.fit_transform(X)

    # 2D plot
    plt.scatter(X[:,0],X[:,1],c=y,cmap=ListedColormap(['b','r']))
    plt.show()    

    # 3D plot
    fig = plt.figure()
    ax = fig.add_subplot(111, projection = '3d')
    ax.scatter(X[:,0],X[:,1],X[:,2],c=y,cmap=ListedColormap(['b','r']))
    plt.show()

The above function reduces our dataset to its principal components so that we can plot it.  
This gives us a fairly good idea of how well our model will be able to fit it.  
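
How much a 2D or 3D PCA plot can be trusted depends on how much variance the first few components capture. A minimal sketch, assuming synthetic data built from 3 hidden factors (the data and variable names are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# hypothetical data: 10 features that are noisy mixtures of 3 hidden factors
factors = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X_toy = factors @ mixing + 0.1 * rng.normal(size=(300, 10))

pca = PCA(n_components=3)
X_3d = pca.fit_transform(X_toy)

# fraction of the total variance captured by each kept component
explained = pca.explained_variance_ratio_
print(explained, explained.sum())
```

If the same check on the real data gives a small sum, the plot shows only a small part of the structure and should be read with caution.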

In [None]:
plot_PCA(X,y)

# Model Training

In [None]:
def pred_stratified(X,y):
    X = X.values
    y = y.values

    skf = StratifiedKFold(n_splits=FOLDS)
    aucs = []
    fig, ax = plt.subplots()

    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        model = BernoulliNB(alpha = 10)
        model.fit(X_train,y_train)

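        # score with the positive-class probability so the AUC matches the plotted curve
        y_score = model.predict_proba(X_test)[:,1]
        # plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
        RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax)
        aucs.append(roc_auc_score(y_true=y_test,y_score=y_score))
    plt.show()
    return sum(aucs)/len(aucs)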

Let's break down the function above.  
1. Stratified K fold:  
You might have used train test split previously. It randomly splits the data into two parts.  
Stratified sampling, however, keeps the test set similar to the train set, i.e. the ratio of each target label is the same in train and in test.  
This is an important sampling technique and a good practice to implement.
2. Folds:
To observe performance, we split using the stratified method 5 times and average the scores to get an overall result. 
3. Scaling:
This is done to normalise the data. The scaler is fit on the training fold only and then applied to the test fold.
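
The stratification step described above can be checked on a small example. The sketch below assumes hypothetical labels with 5% positives; the feature matrix is a placeholder because only the labels matter for splitting.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# hypothetical imbalanced labels: 50 positives out of 1000 (5%)
y_toy = np.array([1] * 50 + [0] * 950)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features

skf = StratifiedKFold(n_splits=5)
test_ratios = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # share of positive labels in this test fold
    test_ratios.append(y_toy[test_idx].mean())

print(test_ratios)
```

Every fold keeps the same share of positives as the full label vector, which a plain random split does not guarantee when one class is rare.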

In [None]:
pred_stratified(X,y)

# Feature Reduction
As we see, the train data has 90+ features. This makes training costly.  
Moreover, many of these columns have the same correlation with the target, which points towards multicollinearity.  
A simple way to understand this: if two columns have exactly the same influence on the target, why not drop one and multiply the other by 2?  
**The multiplication by 2 is not actually done; it is reflected automatically in the equation by the coefficients.**  
This means we can achieve a similar score even with a reduced number of columns. 
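
One common way to spot such redundant columns is to look at the correlation between features themselves. This is a sketch on synthetic data, not the method used in this notebook; the column names and the 0.95 threshold are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + rng.normal(0, 0.01, size=500),  # nearly a multiple of "a"
    "b": rng.normal(size=500),
})

corr = toy.corr().abs()
# keep only the upper triangle so that each pair of columns is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# a column is redundant if it is highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```

Here only the near-duplicate column is flagged, and dropping it loses almost no information.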

In [None]:
sel = SelectFromModel(RandomForestClassifier(random_state=42))

y  = data['Bankrupt?']
X  = data.drop(['Bankrupt?'],axis=1)
X = corr_skew(X)
sel.fit(X,y)

I've used a random forest to select columns based on their importance. You can try other classifiers as well.  
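
To see what `SelectFromModel` keeps, here is a sketch on synthetic data where only the first two of eight columns influence the label. The data is hypothetical. With a random forest and no explicit `threshold`, scikit-learn keeps the features whose importance is at or above the mean importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 8))
# only the first two columns influence the label
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
selector.fit(X_toy, y_toy)

mask = selector.get_support()                            # True for kept columns
importances = selector.estimator_.feature_importances_   # sums to 1
print(mask)
print(importances.round(3))
```

Passing a different `threshold` (for example `"median"`) changes how many columns survive, so it is worth comparing the score for a few settings.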

In [None]:
features = X.columns[(sel.get_support())]
print(len(features))
features

In [None]:
X = X.filter(items=features)

# Visualise Data and Labels on reduced data

In [None]:
plot_PCA(X,y)

In [None]:
pred_stratified(X,y)

As observed, we've obtained a similar result on the reduced dataset with only about 1/3 of the columns.  
This improves our efficiency and reduces the time spent on training and predicting.  
I hope you learned something from this notebook.  
### Happy Learning