# Company Bankruptcy Prediction

https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction

### Table of contents

1. Data overview
2. Target variable analysis
3. Features-target analysis
4. Multicollinearity
5. Data imbalance
6. SVM
7. Conclusion + links

***note:*** I will not explain the basic statistical / machine learning concepts. Some links are provided in the last section.

***important terminology note:***
- I am going to use these terms interchangeably
- *features = predictor variables = columns*
- *target variables = categories = classes*
- *datapoint = row of the dataset*

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score, make_scorer, accuracy_score, recall_score, precision_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import class_weight
from sklearn.feature_selection import RFE

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
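from sklearn.metrics import ConfusionMatrixDisplay
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator #plot_confusion_matrix was removed in scikit-learn 1.2, this keeps the old name usable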

from imblearn.over_sampling import SMOTE

import random
import itertools

import scipy

import warnings
warnings.filterwarnings("ignore")

## 1. Data overview

In [None]:
df = pd.read_csv("/kaggle/input/company-bankruptcy-prediction/data.csv")
print("Number of datapoints is %.d" %len(df))
df.head()
#df.describe()

It looks like the features are mostly numerical, but I cannot see all 96 columns in the preview, so I am going to look for categorical variables explicitly.

In [None]:
#looking for possibly categorical features
#working definition of "categorical": fewer than 50 distinct values
Filter = [df[col].nunique() < 50 for col in df.columns]

categorical_c = df.loc[:,Filter] #filtering columns in this case

print(categorical_c)

df = df.drop(columns=[' Liability-Assets Flag',' Net Income Flag']) #the positional axis argument was removed in pandas 2.0

df.shape

It turns out that only two features are categorical (Liability-Assets Flag, Net Income Flag) and they also happen to be useless because all datapoints have the same value. All other features are numerical.
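
A column in which every row holds the same value cannot help separate the classes. The following is a minimal sketch of how such columns could be detected, using a small made-up DataFrame (the column names are hypothetical and not taken from the dataset):

```python
import pandas as pd

# toy data: "flag_a" is constant, the other columns vary
toy = pd.DataFrame({
    "flag_a": [1, 1, 1, 1],
    "flag_b": [0, 1, 0, 0],
    "ratio": [0.2, 0.5, 0.1, 0.9],
})

unique_counts = toy.nunique()  # number of distinct values per column
constant_cols = unique_counts[unique_counts == 1].index.tolist()

print(constant_cols)
```

Running the same check on the real `df` would show whether each flag really has a single value.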

## 2. Target variable analysis

The target is a dichotomous variable, so I am going to have a look at the distribution of the two classes.

In [None]:
#analyzing Target variable (Class: 0 = Not Bankrupt, 1 = Bankrupt)

print(df["Bankrupt?"].value_counts())

percentage = df["Bankrupt?"].value_counts()[1]/len(df)
print("Total percentage of bankrupted companies is %.1f" %(percentage*100) + " %.")

sns.countplot(x=df["Bankrupt?"]) #recent seaborn versions require the keyword argument
plt.show()

There is a huge imbalance between the two categories. It turns out that only 3.2% of the companies in this dataset went bankrupt.

## 3. Features-target analysis

Some companies went bankrupt and some did not. I am not an economist and I honestly know very little about the meaning of the predictor variables or about bankruptcy. However, before proceeding with the analysis I would like to see at least some evidence that the variables are related to bankruptcy.

### 3.1 Non-statistical test
Plotting the relative difference between the means of the features for both categories (bankrupted and not bankrupted).

In [None]:
#Variables' effect on class

features = df.columns[1:] #from now on "features" are interchangeable with "columns"

X = df[features]
y = df["Bankrupt?"]

X_0 = X.loc[y==0,:] #not bankrupt
X_1 = X.loc[y==1,:] #bankrupt

X_0_test = X_0.sample(n=220) #random sample of non-bankrupt companies, same size as the bankrupt group

significant_cols = [] #features that have "very different" means
difs=[] #differences between means

for col in X.columns:
    relative_means_difference = (X_1[col].mean() - X_0_test[col].mean()) / X_0_test[col].mean() 
    difs.append([col,relative_means_difference])
    if abs(relative_means_difference)>0.5: #threshold: mean at least 50% greater/smaller
        significant_cols.append(col)


sns.barplot(x=list(range(len(difs))),y=[e[1] for e in difs])
plt.ylim((-1,5)) #this controls the size of the window displayed
plt.xlabel("Features")
plt.ylabel("Relative difference between means")
plt.show()

There are a few features with really big differences and overall around 20 features whose means are more than 50% apart in these two categories.

### 3.2 Monte Carlo Hypothesis Test

#### HYPOTHESIS: There is a difference between bankrupt and non-bankrupt companies
(Null hypothesis: There is no difference between the bankrupt and non-bankrupt companies.)

I am going to generate 1000 samples, each containing 220 datapoints drawn from <code>X</code> (all datapoints, bankrupt or not), and obtain the sampling distribution of the sample mean for each feature. From the observed data (the 220 bankrupt companies) and the sampling distribution I am going to determine the p-value.

***p-value for each feature:*** proportion of sample means that lie farther from the center of the sampling distribution than the mean of the bankrupt companies does
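
To make the definition concrete, here is a minimal sketch of the same idea on synthetic data. The population, the observed mean of 0.5 and the sample size of 200 are made-up values for illustration only, and the comparison uses "at least as far" rather than "strictly farther":

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "population" of 5000 values and a hypothetical observed group mean
population = rng.normal(loc=0.0, scale=1.0, size=5000)
observed_mean = 0.5
sample_size = 200

# sampling distribution of the mean: 1000 random samples of the same size
sample_means = np.array([
    rng.choice(population, size=sample_size, replace=False).mean()
    for _ in range(1000)
])

center = sample_means.mean()
distance = abs(observed_mean - center)

# two-sided p-value: share of sample means at least as far from the center as the observation
p_value = np.mean(np.abs(sample_means - center) >= distance)
print(p_value)
```

A small p-value means that a random group of that size would rarely have a mean as extreme as the observed one.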

In [None]:
#MONTE CARLO HYPOTHESIS TEST

sampling_distribution = {feature: [] for feature in features} #SAMPLING DISTRIBUTION OF SAMPLE MEANS for each feature
bankrupt_means = {feature: X_1[feature].mean() for feature in features} #MEAN of each feature (observed data = bankrupt companies)

for i in range(1000): #sampling from the data 1000 times
    X_sample = X.sample(n=len(X_1)) #n same as the number of bankrupt companies, sampling from X
    sample_means = X_sample.mean() #means of all features at once
    for feature in features:
        sampling_distribution[feature].append(sample_means[feature])

def get_p_value(sampling_distribution, observed):
    """Proportion of sample means that lie farther from the center than the observed mean (two-sided)"""
    sample_means = np.asarray(sampling_distribution)
    center = sample_means.mean() #computed once instead of once per sample mean
    l = abs(observed-center) #distance of observed from the center of the sampling distribution
    return np.mean(np.abs(sample_means-center)>l)

pvalues = {feature: get_p_value(sampling_distribution[feature],bankrupt_means[feature]) for feature in features}

In [None]:
print("Number of significantly different features: %d" %sum(np.array(list(pvalues.values()))<0.05)) #significant = p-value below 0.05
dict(itertools.islice(pvalues.items(),10)) #look at the first 10 features and associated p-values

In [None]:
#Plotting some features and their distribution of sample means + red line with the mean of the observed data (= data of bankrupt companies)

plotted_features = [" Operating Gross Margin", " Interest-bearing debt interest rate",
                    " Inventory/Current Liability", " No-credit Interval"]

fig, axes = plt.subplots(2,2, figsize=(15,8))

for ax, feature in zip(axes.ravel(), plotted_features):
    sns.histplot(sampling_distribution[feature], kde=True, stat="density", ax=ax) #histplot replaces the deprecated distplot
    ax.axvline(x=bankrupt_means[feature], c="r", label="observed mean, p-value %.2f" %pvalues[feature])
    ax.set_xlabel("sample mean of" + feature) #feature names start with a space
    ax.legend(loc='upper left')

plt.show()

I only examined the variables independently, while there are probably many dependencies between them, so I am not going to draw conclusions or perform feature selection based on these p-values.

## 4. Multicollinearity

For every pair of features with an absolute correlation coefficient greater than 0.9, I am going to drop one of the two.
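
A minimal sketch of this kind of filter on synthetic data may help show which column of a correlated pair ends up being dropped. The column names `a`, `a_copy` and `b` are invented for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=200),  # almost perfectly correlated with "a"
    "b": rng.normal(size=200),                              # independent of "a"
})

corr = toy.corr().abs()

# keep only the entries above the diagonal so that each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# a column is dropped if it is highly correlated with any column that comes before it
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is removed.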

In [None]:
#MULTICOLLINEARITY (CORRELATION BETWEEN PREDICTOR VARIABLES)

cor_matrix = X.corr().abs() #features only, so that the target cannot cause a feature to be dropped
cor_matrix.style.background_gradient(sns.light_palette('red', as_cmap=True))

In [None]:
#Dropping correlated data

upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape),k=1).astype(bool)) #upper triangle of the correlation matrix (np.bool was removed in NumPy 1.24)

dropped_cols = set()
for feature in upper_tri.columns:
    if any(upper_tri[feature] > 0.9): #more than 0.9 corr. coefficient -> dropped
        dropped_cols.add(feature)

print("There are %d dropped columns" %len(dropped_cols))

X = X.drop(dropped_cols,axis=1)
X.head()

PCA is a way to decorrelate the data and reduce its dimensionality through a change of basis. I am going to check whether the method is useful for this dataset.

In [None]:
#PCA

scaler = StandardScaler() 
X_for_pca = pd.DataFrame(data=scaler.fit_transform(X),index=X.index,columns=X.columns) #standardized dataset

n_components = 10

pca = PCA(n_components=n_components)
principal_components = pca.fit_transform(X_for_pca)
X_pc = pd.DataFrame(data=principal_components, columns=['PC %d'%d for d in range(n_components)])

print("Explained variance by 10 components %.2f" %sum(pca.explained_variance_ratio_))

With 10 principal components the explained variance is still very low, so I do not find the PCA transformation useful for this data.
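
One way to judge how many components would be needed is to look at the cumulative explained variance. The sketch below uses synthetic, uncorrelated data as a stand-in, and the 90% target is an arbitrary assumption rather than a value used in this notebook:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 20))  # synthetic stand-in, uncorrelated by construction

X_scaled = StandardScaler().fit_transform(X_toy)
pca_full = PCA().fit(X_scaled)  # keep all components

cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# smallest number of components whose cumulative explained variance reaches 90%
n_needed = int(np.argmax(cumulative >= 0.90) + 1)
print(n_needed, cumulative[:10].round(2))
```

When the features are close to uncorrelated, almost all components are needed, which is the situation in which PCA brings little benefit.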

## 5. Data Imbalance

There is a huge imbalance between the classes (only 3.2% of the companies in the dataset went bankrupt). Before training a model I need to deal with this problem, otherwise the model could simply predict that no company goes bankrupt and still look accurate.

I decided to try two ways:
1. ***Introducing weights*** \
Every datapoint from the minority class is considered "more important" than a datapoint from the majority class: the weights for the two classes are inversely proportional to the number of datapoints in each class. This is implemented within the SVM in the next section.

2. ***SMOTE*** \
The Synthetic Minority Over-sampling TEchnique. \
Creates new synthetic datapoints using the k-nearest neighbor algorithm. \
With this method I am going to obtain the dataset where the value counts for both categories are the same.
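
Both ideas can be sketched in a few lines of NumPy. The first part computes weights with the "balanced" formula `n_samples / (n_classes * class_count)`. The second part is a heavily simplified illustration of the interpolation step behind SMOTE, not the imbalanced-learn implementation. The toy labels, the toy points and the helper `synthetic_point` are made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy labels with a strong imbalance: 95 majority, 5 minority
y_toy = np.array([0] * 95 + [1] * 5)

# "balanced" weights: n_samples / (n_classes * count of each class)
counts = np.bincount(y_toy)
class_weights = len(y_toy) / (len(counts) * counts)
print(class_weights)

# toy minority-class points in two dimensions
X_minority = rng.normal(size=(5, 2))

def synthetic_point(X_min, index, k=3):
    """Return a new point on the segment between X_min[index] and one of its k nearest minority neighbours."""
    distances = np.linalg.norm(X_min - X_min[index], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # position 0 is the point itself
    neighbour = X_min[rng.choice(neighbours)]
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return X_min[index] + gap * (neighbour - X_min[index])

new_point = synthetic_point(X_minority, index=0)
print(new_point)
```

The minority class receives a much larger weight than the majority class, and the synthetic point always lies between two existing minority points.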

In [None]:
#DATA IMBALANCE
#SMOTE 

sm = SMOTE(random_state=42)

X_sm, y_sm = sm.fit_resample(X, y)

print('New counts of the 0 and 1 classes:')
y_sm.value_counts()

## 6. SVM 

I am going to train an SVM model: first with the SMOTE dataset, then with the original data (without SMOTE), and lastly with the SMOTE dataset reduced to 10% of its datapoints.

The function <code>train_test_SVM(X,y)</code> has multiple steps:
1. Splitting the data
2. Assigning the weights
3. Creating a <code>Pipeline</code>
4. Using <code>GridSearchCV</code> to find the optimal hyperparameters \
Train the model
5. Score
6. Confusion matrix

The SVM training takes quite long (around 4 minutes for me), for several reasons:
- large number of datapoints (perhaps too many for an SVM)
- <code>GridSearchCV</code> runs cross-validation for every (C, gamma) combination
- training with the <code>'rbf'</code> kernel is slower than with a linear kernel

In [None]:
#SVM

def train_test_SVM(X,y):
    """Function finds the optimal hyperparameters of the SVM, plots the confusion matrix of test data, returns the model"""
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, stratify=y) #stratify only keeps the class proportions equal in the train and test sets
    
    sw_train = class_weight.compute_sample_weight(class_weight = 'balanced', y = y_train) #when the classes are already balanced, sw_train = [1, 1, ..., 1]
    
    steps = [('scaler', StandardScaler()), ('SVM', SVC(cache_size=7000))]
    pipeline = Pipeline(steps)
    
    #parameters' names must match the 'SVM' name in Pipeline followed by two underscores!
    #standard SVM hyperparameters
    param_grid = {
    'SVM__C':[0.01,0.1,1,10],
    'SVM__gamma':[0.1,0.01,0.001,0.0001],
    'SVM__kernel':['rbf']
    }
    
    f1 = make_scorer(f1_score , average='macro')
    grid = GridSearchCV(pipeline,param_grid=param_grid, cv=5, scoring=f1, verbose=0) #verbose controls the training progression display!
    grid.fit(X_train, y_train, SVM__sample_weight = sw_train)
    
    print("best parameters: ")
    print(grid.best_params_)
    
    model = grid.best_estimator_
    y_pred = model.predict(X_test)
    
    print("f1 score is %.2f "%f1_score(y_test, y_pred))
    print("Precision: %.2f" %precision_score(y_test, y_pred))
    print("Recall: %.2f" %recall_score(y_test, y_pred))
    plot_confusion_matrix(model,
                         X_test,
                         y_test,
                         values_format='d')
    return model

In [None]:
model = train_test_SVM(X_sm,y_sm)

In [None]:
#Training and testing without SMOTE

train_test_SVM(X,y)

Without SMOTE the performance is far worse. The model is not "meaningless" as it would be without the weights; however, I suppose the weights alone are simply not enough for such a big imbalance.

In [None]:
#Training and testing with 10% of the data

Xy_sm = pd.concat([X_sm,y_sm],axis=1)
Xy_reduced_1 = Xy_sm[Xy_sm["Bankrupt?"]==1].sample(frac=0.1) #taking 10% of the datapoints "Bankrupt?" = 1
Xy_reduced_0 = Xy_sm[Xy_sm["Bankrupt?"]==0].sample(frac=0.1) #taking 10% of the datapoints "Bankrupt?" = 0
Xy_reduced = pd.concat([Xy_reduced_1,Xy_reduced_0],axis=0) #the dataset is going to be shuffled in train_test_split

y_reduced = Xy_reduced["Bankrupt?"]
X_reduced = Xy_reduced.drop("Bankrupt?",axis=1)

train_test_SVM(X_reduced,y_reduced)

The decision boundary of an SVM is determined only by the support vectors, not by all datapoints, which is probably why the model still works quite well with only 10% of the data. The training is also much faster.
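
The role of the support vectors can be checked on a fitted model through the `n_support_` attribute of `SVC`. The sketch below uses two well-separated synthetic clusters, so the share of support vectors is far smaller than it would be on overlapping data such as this dataset. The cluster positions and hyperparameters are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# two well-separated synthetic clusters, 200 points each
X_toy = np.vstack([
    rng.normal(loc=-3.0, size=(200, 2)),
    rng.normal(loc=3.0, size=(200, 2)),
])
y_toy = np.array([0] * 200 + [1] * 200)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_toy, y_toy)

# only the support vectors enter the decision function
n_support = int(svm.n_support_.sum())
print("support vectors: %d of %d points" % (n_support, len(X_toy)))
```

The same attribute on the models returned by `train_test_SVM` would show how many datapoints actually define the boundary there.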

## 7. Conclusion and links

I preprocessed the data and trained an SVM estimator. I used a few widely used machine learning techniques along the way. However, there are more things that can be done to understand the data more. I would be interested in feature selection as well as applying other machine learning algorithms and observing the differences in their performance. 

### Sources: 
- PCA: https://www.youtube.com/watch?v=lrHboFMio7g&list=PLtV9G3anqB8B5s6FtBmqyc9VeZtlNp8GV&ab_channel=algomanic
- Imbalanced Data: https://towardsdatascience.com/imbalanced-data-when-details-matter-16bd3ec7ef74
- https://medium.com/@radecicdario
- https://brilliant.org/