# <font color='Orange'> ML project </font>

### Introduction:

The data were collected from the Taiwan Economic Journal for the years 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange.

### Objective:

The goal of this notebook is to compare several predictive algorithms and see how well they can predict which companies are likely to go bankrupt in the future.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Data manipulation
import pandas as pd
import numpy as np
import re

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve

# preprocessing and resampling
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# metrics
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

# ML model
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

import sklearn

# np.warnings was removed in recent NumPy releases; use the standard library module
import warnings
warnings.filterwarnings('ignore')

# Import and data cleansing

In [None]:
df = pd.read_csv ('/kaggle/input/company-bankruptcy-prediction/data.csv')
df.rename(columns = {'Bankrupt?':'y'}, inplace = True)
# rename columns
lista_new_col = []
for col in df.columns:
  col = re.sub(' ', '_', col)
  col = re.sub('-', '_', col)
  lista_new_col.append(col)
df.columns = lista_new_col

print ( 'Shape of dataset:', df.shape )
print ( '*' * 50 )
df.head()

In [None]:
df.isna().sum()

There is no missing data in the dataset.

# Data analysis & visualization

In [None]:
df.info()

The dataset consists of 6819 observations, each described by 96 columns (the target `y` included).

In [None]:
df.describe().T

In [None]:
print(df.y.value_counts())
print('-'* 30)
print('Financially stable: ', round(df.y.value_counts()[0]/len(df) * 100, 2), '% of the dataset')
print('Financially unstable: ', round(df.y.value_counts()[1]/len(df) * 100, 2), '% of the dataset')

In [None]:
plt.figure(figsize = (7,4))
sns.countplot(x=df.y)  # pass the column by keyword; recent seaborn versions expect x=
plt.title('Class Distributions Count \n (0: Stable || 1: Unstable)', fontsize=12)
plt.show()

The classes are heavily skewed. We will address this later with the SMOTE algorithm (**S**ynthetic **M**inority **O**versampling **TE**chnique).

Class 0 (stable) represents 96.77% of the dataset, while class 1 (unstable) accounts for only 3.23%.
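
One consequence of this imbalance is that plain accuracy can look excellent for a useless model. The sketch below is a minimal illustration, assuming the class counts implied by the percentages above (about 6599 stable and 220 unstable companies; these counts are derived here, not copied from the notebook output). It scores a "classifier" that always predicts the majority class.

```python
import numpy as np

# Assumed class counts, derived from the 96.77% / 3.23% split of 6819 rows
n_stable, n_unstable = 6599, 220
y_true = np.array([0] * n_stable + [1] * n_unstable)

# A "classifier" that always predicts the majority class (0 = stable)
y_majority = np.zeros_like(y_true)

# Accuracy: share of correct predictions
accuracy = (y_majority == y_true).mean()

# Recall on the unstable class: detected unstable / all unstable
true_pos = np.sum((y_majority == 1) & (y_true == 1))
recall = true_pos / np.sum(y_true == 1)

print(f'Accuracy: {accuracy:.4f}')
print(f'Recall:   {recall:.4f}')
```

The accuracy is high only because the stable class dominates; not a single unstable company is detected, which is why recall is tracked alongside accuracy later on.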

In [None]:
cat_cols = df.select_dtypes(include=['object','category','int64']).columns
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7,4), constrained_layout=True)
ax=ax.flatten()
fig.suptitle('\nCount plot of categorical features\n', size=12)
for x,i in enumerate(cat_cols[1:]):
    sns.countplot(x=df[i], ax=ax[x])

In [None]:
print ( 'The column "Net income flag" has a single value:', df._Net_Income_Flag.unique() )
df.drop ('_Net_Income_Flag', axis=1, inplace=True )

The feature 'Net income flag' always takes the value 1 for all observations, so it can be removed from the model.
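
The same check can be applied to every column at once. The sketch below uses a small made-up frame (the column names `a`, `flag` and `b` are placeholders, not columns of the real dataset) and `DataFrame.nunique` to find features that never change.

```python
import pandas as pd

# Toy frame standing in for the real data; column names are hypothetical
toy = pd.DataFrame({'a': [0.1, 0.5, 0.9],
                    'flag': [1, 1, 1],
                    'b': [3, 4, 3]})

# A column with a single distinct value carries no information for a model
constant_cols = toy.columns[toy.nunique() == 1].tolist()
toy_reduced = toy.drop(columns=constant_cols)

print('Constant columns:', constant_cols)
print('Remaining columns:', toy_reduced.columns.tolist())
```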

In [None]:
# Looking at the histograms of numerical data
df.hist(figsize = (35,30), bins = 50 )
plt.show()

# Re-sampling train_set (SMOTE algorithm)

In [None]:
# SMOTE
X, y = df.iloc [ :, 1: ].values , df.iloc [:, 0].values
X_train, X_test, y_train, y_test = train_test_split ( X, y,
                                                     test_size = 0.3,
                                                     random_state = 1,
                                                     stratify = y)
smt = SMOTE ()
X_train_sm, y_train_sm = smt.fit_resample (X_train, y_train)

print ('# class in y_train:', np.bincount (y_train_sm) )

After applying SMOTE the classes are balanced: each class of the target variable has 4619 observations in the training set.

Let's now fit the machine learning models on the resampled, balanced training set, so that both classes carry the same weight during training.

NB: the classes are balanced only in the training set, i.e. for the training phase. Predictions must be evaluated on unseen data, namely the untouched test set.
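
To make the resampling step more concrete, here is a simplified sketch of the idea behind SMOTE: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is an illustration on made-up 2-D points, not the `imblearn` implementation; the helper name `smote_like_sample` and its parameter `k` are assumptions introduced for this example.

```python
import numpy as np

def smote_like_sample(X_min, k=2, rng=None):
    """Create one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    i = rng.integers(len(X_min))
    # Euclidean distances from sample i to every minority sample
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    # Position 0 of the sort is the sample itself (distance 0), so skip it;
    # this assumes the minority points are distinct
    neighbours = np.argsort(dist)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])

# Made-up minority-class points (hypothetical data)
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_point = smote_like_sample(X_minority, k=2, rng=0)
print(new_point)
```

Because every synthetic point is built from existing minority samples, resampling before the train/test split would let information from test rows leak into training, which is why only the training set is resampled here.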

# Models

We will evaluate performance using the SMOTE-oversampled training data (*X_train_sm* and *y_train_sm*). For this I decided to try several different classifiers:

---


- Logistic Regression
- Decision Tree Classifier
- K-Nearest Neighbors
- Support Vector Machine
- Random Forest Classifier



---



## Logistic Regression

In [None]:
# penalty=None disables regularization (scikit-learn >= 1.2; older versions use penalty='none')
pipeline_lr = make_pipeline ( StandardScaler(),
                          LogisticRegression ( penalty=None, C=1.0, solver='saga', random_state=24 )
                          )
pipeline_lr.fit ( X_train_sm, y_train_sm )
scores = cross_val_score ( estimator=pipeline_lr,
                          X = X_train_sm,
                          y = y_train_sm,
                          cv=5,
                          n_jobs = 2)
y_pred_lr = pipeline_lr.predict (X_test)
print ( 'Accuracy train: %.3f' %pipeline_lr.score (X_train_sm, y_train_sm) )
print ( 'Accuracy cross-validation: %.3f' %scores.mean() )
print ( 'Accuracy test: %.3f' %pipeline_lr.score (X_test, y_test) )

In [None]:
# learning curve
train_sizes, train_scores, test_scores =\
                learning_curve(estimator=pipeline_lr,
                               X=X_train_sm,
                               y=y_train_sm,
                               train_sizes=np.linspace(0.1, 1.0, 10),
                               cv=10,
                               n_jobs=2)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

In [None]:
plt.plot(train_sizes, train_mean,
         color='blue', marker='o',
         markersize=5, label='Training accuracy')

plt.fill_between(train_sizes,
                 train_mean + train_std,
                 train_mean - train_std,
                 alpha=0.15, color='blue')


plt.plot(train_sizes, test_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='Validation accuracy')

plt.fill_between(train_sizes,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')

plt.xlabel('Number of training examples', size=12)
plt.ylabel('Accuracy', size=12)
plt.legend(loc='upper left')
plt.ylim([0.8, 1.03])
plt.axvline(x=6670, color = 'red', linestyle = '--', alpha = 0.5)
plt.axvline(x=7480, color = 'yellow', linestyle = '--', alpha = 0.7)
plt.text(5750, 0.825, 'Underfitting', fontsize=12, color='white', bbox ={'facecolor':'grey', 'pad':2} )
plt.text(7600, 0.96, 'Overfitting', fontsize=12, color='white',bbox ={'facecolor':'grey', 'pad':2} )
plt.title ('Learning curve', size=14)
plt.tight_layout()
plt.grid(False)  # the `b` keyword was removed from recent matplotlib versions
plt.show()

As can be seen from the graph, the model performs well on both training and validation data when it is trained on at least 6670 examples (as indicated by the dashed red vertical line).

Also note, as indicated by the yellow dashed vertical line, that the distance between accuracy in training and accuracy in validation widens with a dataset of more than 7480 examples: an indicator of an increasing level of overfitting.
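
The gap between the two curves can also be inspected numerically instead of by eye. The sketch below runs `learning_curve` on a small synthetic dataset from `make_classification` (an assumption made only to keep the example self-contained; the values it prints say nothing about the bankruptcy data) and reports the difference between mean training and mean validation accuracy at each training size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the bankruptcy dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

sizes, train_scores, val_scores = learning_curve(
    estimator=model, X=X_demo, y=y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# Mean over the CV folds; a large positive gap suggests overfitting
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for n, g in zip(sizes, gap):
    print(f'{int(n):4d} examples -> train/validation gap: {g:.3f}')
```

Applied to the arrays `train_mean` and `test_mean` computed above, the same subtraction gives the gap that the yellow line marks visually.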

## Decision Tree Classifier

In [None]:
pipeline_tree = make_pipeline ( StandardScaler(),
                          DecisionTreeClassifier ( min_samples_split=4, random_state=42 )
                          )
pipeline_tree.fit ( X_train_sm, y_train_sm )
scores = cross_val_score ( estimator=pipeline_tree,
                          X = X_train_sm,
                          y = y_train_sm,
                          cv=5,
                          n_jobs = 2)
y_pred_tree = pipeline_tree.predict (X_test)
print ( 'Accuracy train: %.3f' %pipeline_tree.score (X_train_sm, y_train_sm) )
print ( 'Accuracy cross-validation: %.3f' %scores.mean() )
print ( 'Accuracy test: %.3f' %pipeline_tree.score (X_test, y_test) )

In [None]:
# validation curve
# min_samples_split must be an integer >= 2, so the range starts at 2
param_range = [2,3,4,5,6,7,8,9,10]

train_scores, test_scores = validation_curve(
                estimator=pipeline_tree, 
                X=X_train, 
                y=y_train, 
                param_name='decisiontreeclassifier__min_samples_split', 
                param_range=param_range,
                cv=5)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

In [None]:
plt.plot(param_range, train_mean, 
         color='blue', marker='o', 
         markersize=5, label='Training accuracy')

plt.fill_between(param_range, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')

plt.plot(param_range, test_mean, 
         color='green', linestyle='--', 
         marker='s', markersize=5, 
         label='Validation accuracy')

plt.fill_between(param_range, 
                 test_mean + test_std,
                 test_mean - test_std, 
                 alpha=0.15, color='green')

plt.grid(False)  # the `b` keyword was removed from recent matplotlib versions
plt.legend(loc='lower right')
plt.xlabel('Parameter min_samples_split', size=12)
plt.ylabel('Accuracy', size=12)
plt.title ('Validation curve as a function of the regularization parameter\n min_samples_split\n', size=14)
plt.ylim([0.8, 1.0])
plt.axvline(x=4, color = 'red', linestyle = '--', alpha = 0.5)
plt.text(4.5, 0.9, 'Best min_samples_split : 4', fontsize=12, color='white',bbox ={'facecolor':'grey', 'pad':2} )
plt.tight_layout()
plt.show()

## K-Nearest Neighbors

In [None]:
pipeline_knn = make_pipeline ( StandardScaler(),
                          KNeighborsClassifier (n_neighbors=5, weights='distance', p=2 )
                          )
pipeline_knn.fit ( X_train_sm, y_train_sm )
scores = cross_val_score ( estimator=pipeline_knn,
                          X = X_train_sm,
                          y = y_train_sm,
                          cv=5,
                          n_jobs = 2)
y_pred_knn = pipeline_knn.predict (X_test)
print ( 'Accuracy train: %.3f' %pipeline_knn.score (X_train_sm, y_train_sm) )
print ( 'Accuracy cross-validation: %.3f' %scores.mean() )
print ( 'Accuracy test: %.3f' %pipeline_knn.score (X_test, y_test) )

## Support vector machine

In [None]:
pipeline_svm = make_pipeline ( StandardScaler(),
                          SVC (random_state=1, C=1.0, kernel='rbf', gamma='scale')
                          )
pipeline_svm.fit ( X_train_sm, y_train_sm )
scores = cross_val_score ( estimator=pipeline_svm,
                          X = X_train_sm,
                          y = y_train_sm,
                          cv=5,
                          n_jobs = 2)
y_pred_svm = pipeline_svm.predict (X_test)
print ( 'Accuracy train: %.3f' %pipeline_svm.score (X_train_sm, y_train_sm) )
print ( 'Accuracy cross-validation: %.3f' %scores.mean() )
print ( 'Accuracy test: %.3f' %pipeline_svm.score (X_test, y_test) )

## Ensemble learning - Random Forest

In [None]:
pipeline_rfc = make_pipeline ( StandardScaler(),
                          RandomForestClassifier(criterion='entropy', random_state=1)
                          )
pipeline_rfc.fit ( X_train_sm, y_train_sm )
scores = cross_val_score ( estimator=pipeline_rfc,
                          X = X_train_sm,
                          y = y_train_sm,
                          cv=5,
                          n_jobs = 2)
y_pred_rfc = pipeline_rfc.predict (X_test)
print ( 'Accuracy train: %.3f' %pipeline_rfc.score (X_train_sm, y_train_sm) )
print ( 'Accuracy cross-validation: %.3f' %scores.mean() )
print ( 'Accuracy test: %.3f' %pipeline_rfc.score (X_test, y_test) )

**SUMMARY:** Trained Classifiers Performance Report on Test Set:

In [None]:
pred_list = [y_pred_lr, y_pred_tree, y_pred_knn, y_pred_rfc, y_pred_svm ]
name_clf = [ 'Logistic Regression', 'Decision Tree', 'K-NN', 'Random Forest', 'Support vector machine' ]
for name, y_pred in zip(name_clf, pred_list ):
  print (f'---> {name}')
  print (f'Accuracy: {accuracy_score(y_test, y_pred):.3f}')
  print (f'Recall: {recall_score(y_test, y_pred):.3f}')
  print (f'F1 score: {f1_score(y_test, y_pred):.3f}')
  print ('')

The results show that the model with the highest accuracy is the Random Forest. However, in this case we are more concerned with minimizing the chance of missing companies that really are close to failure. This is why it is better to look at *Recall*, for which the best choice appears to be Logistic Regression.

Now let's optimize its hyperparameters with a grid search.

# Hyperparameter optimization

In [None]:
# classification report before optimizing hyperparameters
label = ['Stable', 'Unstable']
print(classification_report(y_test, y_pred_lr, target_names=label))

In [None]:
# pipeline
pipeline_lr = make_pipeline ( StandardScaler(),
                          LogisticRegression ()
                          )

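# range of values for C
param_range = [ 0.001, 0.01, 0.1, 1.0 ]
# build the parameter grid; lbfgs does not support the l1 penalty,
# so each solver gets its own set of valid penalties
# (penalty=None requires scikit-learn >= 1.2; older versions use 'none')
grid_param = [ { 'logisticregression__C' : param_range,
               'logisticregression__penalty' : ['l2', None],
               'logisticregression__solver' : ['lbfgs'] },
             { 'logisticregression__C' : param_range,
               'logisticregression__penalty' : ['l2', 'l1', None],
               'logisticregression__solver' : ['saga'] } ]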
gs = GridSearchCV ( estimator = pipeline_lr,
                   param_grid = grid_param,
                   scoring = 'recall',
                   cv = 5,
                   refit = True,
                   n_jobs = 2
                   )

gs = gs.fit ( X_train_sm, y_train_sm )

print ( 'Best score: %.3f' %gs.best_score_ )

print ( 'Best hyperparameter:', gs.best_params_ )

y_pred_gs = gs.predict (X_test)

In [None]:
# classification report after optimizing hyperparameters
label = ['Stable', 'Unstable']
print(classification_report(y_test, y_pred_gs, target_names=label))

Now, looking at the confusion matrix:

In [None]:
conf_matrix = confusion_matrix (  y_test, y_pred_gs )

# plot
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(conf_matrix, cmap='CMRmap', alpha=0.7)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center')
plt.xlabel('Predicted label', size = 20)
plt.ylabel('True label', size = 20)
plt.tight_layout()
plt.show()

From the confusion matrix we can conclude that:

- the optimized model makes 221 classification errors

- since we consider it more serious to classify a company as stable when it is actually unstable, our classifier does an excellent job: it makes only 15 mistakes of this type (out of 221 in total) on the entire test set (2046 examples).
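
The figures quoted above can be cross-checked with a few lines of arithmetic. The snippet uses only the numbers stated in the text (221 errors, 15 of them unstable companies predicted as stable, 2046 test examples); the variable names are introduced here for illustration.

```python
# Numbers taken from the text above
n_test = 2046          # size of the test set
total_errors = 221     # misclassified examples
false_negatives = 15   # unstable companies predicted as stable

# The remaining errors are stable companies flagged as unstable
false_positives = total_errors - false_negatives
accuracy = (n_test - total_errors) / n_test
fn_share = false_negatives / total_errors

print(f'False positives: {false_positives}')
print(f'Test accuracy: {accuracy:.3f}')
print(f'Share of errors that are false negatives: {fn_share:.1%}')
```

In other words, most of the remaining errors are false alarms, which is the less costly kind of mistake under the priority stated above.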