## Attribute Information:

The dataset has 95 features in total. A description of each feature is given on the dataset's page.
Before performing EDA blindly, it is important to understand what the data contains.

https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction

First we will train the models on the raw data; then we will use a <b> Feature Selection </b> technique to pick out the most informative features and train on those. Finally, we will compare the models and their accuracy.


## Our Plan



- <b> 1. Observe Dataset </b>


- <b> 2. Exploratory Data Analysis </b>

    - 2.1 Dataset Cleaning
    - 2.2 Check for data imbalance
    


- <b> 3. Data Preprocessing </b>

    - 3.1 Split Training and testing
    - 3.2 Feature Selection with RandomForest
    - 3.3 PCA
    
    
- <b> 4. Models, Hyperparameter Tuning, Cross Validation and Model Evaluation </b>

    - 4.1 Logistic Regression
    - 4.2 Naive Bayes
    - 4.3 K-Nearest Neighbor
    - 4.4 Decision Tree
    - 4.5 Random Forest
    - 4.6 XGBoost
    


# 1. Observe Dataset

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)

In [None]:
path = "../input/company-bankruptcy-prediction/"

In [None]:
df = pd.read_csv(path + 'data.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [None]:
corr = df.corr()

In [None]:
fig, ax = plt.subplots(figsize = (15,15))
sns.heatmap(corr, ax = ax, cmap = 'viridis', linewidth = 0.1)

In [None]:
df.info()

### Observation

- All the features are numerical (int64 or float64)
- All the values are scaled between -1 and 1.
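
A scaling observation like this can be checked from the per-column minima and maxima. The following is a minimal sketch on a small made-up frame; `toy` and its column names are hypothetical stand-ins for `df`, not columns of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; the real column names differ.
toy = pd.DataFrame({
    "ratio_a": [0.10, 0.55, 0.98],
    "ratio_b": [0.00, 0.42, 1.00],
    "amount_c": [12.0, 3500.0, 0.7],
})

# Per-column min and max, then flag columns that fall outside [-1, 1].
ranges = toy.agg(["min", "max"]).T
out_of_range = ranges[(ranges["min"] < -1) | (ranges["max"] > 1)].index.tolist()
print(ranges)
print("Columns outside [-1, 1]:", out_of_range)
```

Running the same two lines on `df` would list any feature that is not on the assumed scale.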

# 2. Exploratory Data Analysis

- 2.1 Checking for data imbalance
- 2.2 Outliers
- 2.3 Filling null values


### 2.1 Checking for Data imbalance

In [None]:
df['Bankrupt?'].value_counts()

In [None]:
print('Financially stable:', round(df['Bankrupt?'].value_counts()[0] / len(df) * 100,2) ,'%')
print('Financially unstable:', round(df['Bankrupt?'].value_counts()[1] / len(df) * 100, 2), '%')

We see the data is highly skewed towards financially stable companies. If we train the model on this dataset, our predictions will be biased towards the financially stable class.

We will balance the dataset, to train our model.

Note: Notice how imbalanced our original dataset is! Most of the companies are financially stable. If we use this dataframe as the base for our predictive models and analysis, we might get a lot of errors, and our algorithms will probably overfit since they will "assume" that most companies are financially stable. But we don't want our model to assume; we want it to detect the patterns that signal bankruptcy!
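
To make the risk concrete, here is a small sketch with made-up labels (the 970/30 split is an assumption for illustration, not the real class counts). A "model" that always predicts the majority class gets a high accuracy while finding no bankrupt company at all.

```python
import numpy as np

# Hypothetical labels: 970 stable companies (0) and 30 bankrupt ones (1).
y_true = np.array([0] * 970 + [1] * 30)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the bankrupt class: fraction of true 1s that were predicted as 1.
recall_bankrupt = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy = {accuracy:.2f}, bankrupt recall = {recall_bankrupt:.2f}")
```

This is why accuracy alone is a weak guide on imbalanced data and why the confusion matrix matters.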

In [None]:
## Visualizing the data

sns.set_theme(context = 'paper')


plt.figure(figsize = (8,8))
sns.countplot(x = 'Bankrupt?', data = df);
plt.title('Class Distributions: \n 0: Financially Stable & 1: Financially Unstable');

#### Splitting the Data (Original DataFrame)


Before proceeding with the <b> RandomUnderSampling </b> technique we have to separate the original dataframe.

<b>Why? </b>

For testing purposes. Remember, although we are splitting the data when implementing Random UnderSampling or OverSampling techniques, we want to test our models on the original testing set, not on the testing set created by either of these techniques. The main goal is to fit the model on the dataframes that were undersampled or oversampled (so that our model can detect the patterns) and test it on the original testing set.
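
The idea can be sketched as "split first, then resample only the training part". The example below uses made-up data (`toy`, `feature`, and `target` are hypothetical names) and a single stratified hold-out split instead of the K-fold split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical imbalanced data: 950 rows of class 0, 50 rows of class 1.
toy = pd.DataFrame({"feature": rng.normal(size=1000),
                    "target": [0] * 950 + [1] * 50})

# 1) Hold out a stratified test set that keeps the original class ratio.
train, test = train_test_split(toy, test_size=0.2, stratify=toy["target"], random_state=0)

# 2) Undersample the majority class in the training part only.
n_minority = int((train["target"] == 1).sum())
balanced_train = pd.concat([
    train[train["target"] == 0].sample(n=n_minority, random_state=0),
    train[train["target"] == 1],
])

print(balanced_train["target"].value_counts().to_dict())
print(test["target"].mean())  # share of class 1 in the untouched test set
```

The model would then be fitted on `balanced_train` and evaluated on `test`, which still reflects the real class ratio.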

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import KFold, StratifiedKFold

print("Financially Stable:", round(df['Bankrupt?'].value_counts()[0] / len(df) * 100, 2), '% of the dataset')
print("Financially Unstable:", round(df['Bankrupt?'].value_counts()[1] / len(df) * 100,2),'% of the dataset')

X = df.drop('Bankrupt?', axis = 1)
y = df['Bankrupt?']

sss = StratifiedKFold(n_splits = 5, random_state = None, shuffle = False)

for train_index, test_index in sss.split(X,y):
    print("\n Train", train_index, "Test", test_index)
    org_Xtrain, org_Xtest = X.iloc[train_index], X.iloc[test_index]
    org_ytrain, org_ytest = y.iloc[train_index], y.iloc[test_index]
    

In [None]:
import numpy as np

In [None]:
## turn into an array

org_Xtrain = org_Xtrain.values
org_Xtest = org_Xtest.values
org_ytrain = org_ytrain.values
org_ytest = org_ytest.values

## See if both the train and test label distribution are similarly distributed 
train_unique_label, train_counts_label = np.unique(org_ytrain, return_counts = True)
test_unique_label, test_counts_label = np.unique(org_ytest, return_counts = True)

print('Label Distributions: \n')
print(train_counts_label / len(org_ytrain))
print(test_counts_label / len(org_ytest))

#### Random Under-Sampling and OverSampling

In this phase of the project we will implement "Random Under Sampling", which basically consists of removing data in order to have a more balanced dataset, thus helping our models avoid overfitting.


In [None]:
## Let's shuffle the data before creating the subsamples

xdf = df.sample(frac = 1)

## amount of Financially unstable data is 220
# sdf = Financially stable
# ndf = Financially unstable

## select from the shuffled frame (xdf) so the 220 stable rows are a random sample
sdf = xdf.loc[xdf['Bankrupt?'] == 0][:220]
ndf = xdf.loc[xdf['Bankrupt?'] == 1]

normal_distributed_df = pd.concat([sdf, ndf])

# Shuffling again

nxdf = normal_distributed_df.sample(frac = 1, random_state = 42)

In [None]:
nxdf.head()

In [None]:
## Checking new dataframe

print("Distribution of the Classes in the subsample dataset")
print(nxdf['Bankrupt?'].value_counts() / len(nxdf))

sns.countplot(x = 'Bankrupt?', data = nxdf)
plt.title("Equally Distributed Class", fontsize = 14)
plt.show()

#### Correlation Matrices

Correlation matrices are essential to understanding our data. We want to know whether there are features that heavily influence whether a specific company goes bankrupt. However, it is important that we use the correct dataframe (the subsample) in order to see which features have a high positive or negative correlation with bankruptcy.
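
Besides the full heatmap, it can help to look only at the correlation of each feature with the target. A minimal sketch on made-up data follows; `toy` and its column names are hypothetical, and on the real data the same expression would use `nxdf` and `'Bankrupt?'`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
target = rng.integers(0, 2, size=500)

# Hypothetical features: one tracks the target, one opposes it, one is pure noise.
toy = pd.DataFrame({
    "pos_feature": target + rng.normal(scale=0.5, size=500),
    "neg_feature": -target + rng.normal(scale=0.5, size=500),
    "noise": rng.normal(size=500),
    "target": target,
})

# Correlation of every feature with the target, sorted from most negative to most positive.
target_corr = toy.corr()["target"].drop("target").sort_values()
print(target_corr)
```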

In [None]:
## make sure we use the subsample in our correlation

f, (ax1, ax2) = plt.subplots(2,1, figsize = (54,50))

## Entire data frame

corr = df.corr()
sns.heatmap(corr, cmap = 'coolwarm_r', annot_kws = {'size': 20}, ax= ax1)
ax1.set_title("Imbalanced Correlated Matrix \n")


sub_sample_corr = nxdf.corr()
sns.heatmap(sub_sample_corr, cmap = 'coolwarm_r', annot_kws = {'size': 20}, ax = ax2)
ax2.set_title("SubSample Correlation Matrix")
plt.show()

In [None]:
nxdf.hist(bins = 50, figsize = (35,20))
plt.show()

We can see a large number of blue and red squares, which indicates that many columns are strongly correlated (positively or negatively) with one another. So we will use PCA (a dimensionality reduction technique).

<b> PCA vs Feature Selection? </b>

https://stackoverflow.com/questions/16249625/difference-between-pca-principal-component-analysis-and-feature-selection

The difference is that PCA will try to reduce dimensionality by exploring how one feature of the data is expressed in terms of the other features (linear dependency). Feature selection instead takes the target into consideration. It will rank your input variables in terms of how useful they are to predict the target value. This is true for univariate feature selection. Multivariate feature selection can also do something that can be considered a form of PCA, in the sense that it will discard some of the features in the input. But don't take this analogy too far.
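
The contrast can be illustrated on made-up data. In this sketch (the data and the choice of `SelectKBest` with `f_classif` are assumptions for illustration, not what this notebook uses), one column has a large variance but ignores the target, while another has a small variance but depends on it.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# Hypothetical data: column 0 has high variance but ignores y,
# column 1 has low variance but depends on y, column 2 is small noise.
X = np.column_stack([
    rng.normal(scale=10.0, size=n),
    y + rng.normal(scale=0.3, size=n),
    rng.normal(scale=0.1, size=n),
])

# PCA never sees y: its first component follows the direction of largest variance.
pca = PCA(n_components=1).fit(X)
pca_top = int(np.argmax(np.abs(pca.components_[0])))

# Univariate feature selection scores each column against y.
selector = SelectKBest(f_classif, k=1).fit(X, y)
selected = int(np.argmax(selector.get_support()))

print("PCA's first component is dominated by column", pca_top)
print("SelectKBest keeps column", selected)
```

PCA favours the high-variance column, while the supervised selector favours the column that actually separates the classes.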

## 3. Data Preprocessing

- Split Training and Testing
- Feature selection with RandomForest
- PCA

#### Split Training and Testing

In [None]:
## this is equally sampled dataset (perfectly balanced target)

X = nxdf.drop(['Bankrupt?'], axis = 1)
y = nxdf['Bankrupt?']

rf_fs_Xtrain, rf_fs_Xtest, rf_fs_ytrain, rf_fs_ytest = train_test_split(X,y, test_size = 0.1, random_state = 1)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

### Feature selection with RandomForest

In [None]:
## modelling with balanced target

model = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
model.fit(rf_fs_Xtrain, rf_fs_ytrain)

sel = SelectFromModel(model)

In [None]:
## balanced target

sel.fit(rf_fs_Xtrain, rf_fs_ytrain)

In [None]:
# balanced

selected_feat= rf_fs_Xtrain.columns[(sel.get_support())]
len(selected_feat)

In [None]:
selected_feat
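
How many features `SelectFromModel` keeps depends on its threshold. As far as the scikit-learn documentation describes it, the default for a random forest is the mean feature importance. The sketch below checks this on made-up data from `make_classification` (the data and sizes are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical stand-in data with a few informative columns.
X_all, y_all = make_classification(n_samples=300, n_features=12, n_informative=3, random_state=0)

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X_all, y_all)

# With the default threshold, a feature is kept when its importance is at least the mean importance.
importances = selector.estimator_.feature_importances_
manual_mask = importances >= importances.mean()
print(selector.get_support().sum(), "of", X_all.shape[1], "features kept")
print((selector.get_support() == manual_mask).all())
```

Passing an explicit `threshold` (for example `"median"`) would change how many features survive.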

In [None]:
## Creating a dataframe for only selected values to train later

rf_fs = pd.DataFrame()

for column in selected_feat:
    if column in nxdf:
        rf_fs[column] = nxdf[column].values
        


In [None]:
rf_fs

### PCA

In [None]:
from sklearn.decomposition import PCA

In [None]:
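n_components = 2
pca = PCA(n_components = n_components)
## fit on the features only; the target column must not be part of the PCA input
pca.fit(nxdf.drop('Bankrupt?', axis = 1))

In [None]:
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(rf_fs_Xtrain.values)

In [None]:
x_pca = pca.transform(nxdf.drop('Bankrupt?', axis = 1))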

In [None]:
x_pca.shape

In [None]:
# PCA scatter plot: one scatter call per class so the legend labels match the points
plt.figure(figsize = (8,8))
stable_mask = (rf_fs_ytrain == 0).values
plt.scatter(X_reduced_pca[stable_mask,0], X_reduced_pca[stable_mask,1], c='blue', label= 'Stable_Company', linewidths=2)
plt.scatter(X_reduced_pca[~stable_mask,0], X_reduced_pca[~stable_mask,1], c='red', label= 'Unstable_Company', linewidths=2)
plt.legend()
plt.show()

## Testing our Models

We will test each dataset (i.e. the normal dataset and the random forest feature selection dataset) with each model.

For comparison we will make a new DataFrame and compare which method performed better.

Also, as this is a classification problem, we will test it with the following algorithms:

- Logistic Regression
- Naive Bayes
- KNN
- Decision Trees
- Random Forest
- XGBoost
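
The comparison described above can also be written as a loop over models and datasets. This is only a sketch on made-up data: the `make_classification` data, the "first 5 columns" stand-in for the selected features, and the two models chosen are all assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; the first 5 columns play the role of "selected" features.
X_all, y_all = make_classification(n_samples=300, n_features=20, random_state=0)
datasets = {"Original_Dataset": X_all, "Selected_Dataset": X_all[:, :5]}
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTrees": DecisionTreeClassifier(random_state=0),
}

rows = {}
for model_name, model in models.items():
    rows[model_name] = {}
    for data_name, X_data in datasets.items():
        # Same random_state keeps the train/test rows identical across datasets.
        X_tr, X_te, y_tr, y_te = train_test_split(X_data, y_all, test_size=0.1, random_state=1)
        rows[model_name][data_name] = model.fit(X_tr, y_tr).score(X_te, y_te)

comparison = pd.DataFrame.from_dict(rows, orient="index")
print(comparison)
```

The resulting frame has one row per model and one column per dataset, the same layout as `model_score`.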

#### Preparing all the dataset for the models

- <b> nxdf </b> is the balanced (undersampled) dataset with all features; it is called the Original dataset below.
- <b> rf_fs </b> is the dataset with the features selected by Random Forest.


In [None]:
## Splitting dataset for Normal data without feature selection

X_train, X_test, y_train, y_test = train_test_split(nxdf.drop('Bankrupt?', axis = 1), nxdf['Bankrupt?'],test_size = 0.1, random_state = 1)

In [None]:
nxdf.shape

In [None]:
rf_fs.shape

- The target feature <b> Bankrupt? </b> has already been dropped from <b> rf_fs </b>. Since nxdf and rf_fs contain the same rows in the same order, they share the same target values, i.e. nxdf['Bankrupt?'], so we will use the target from nxdf for splitting the selected dataset.
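
This works because `train_test_split` picks rows by position, so two calls with the same `test_size` and `random_state` on inputs of equal length select the same rows. A small sketch with made-up frames (`full` and `subset` are hypothetical stand-ins for `nxdf` and `rf_fs`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frames sharing one row order: `full` has every column, `subset` only some.
full = pd.DataFrame({"a": np.arange(20), "b": np.arange(20) * 2, "target": [0, 1] * 10})
subset = full[["a"]].reset_index(drop=True)

# Two independent calls with identical test_size and random_state.
_, full_test, _, y_test_full = train_test_split(
    full.drop("target", axis=1), full["target"], test_size=0.1, random_state=1)
_, subset_test, _, y_test_subset = train_test_split(
    subset, full["target"], test_size=0.1, random_state=1)

# The same rows end up in both test sets, so the targets match position by position.
print(full_test["a"].tolist(), subset_test["a"].tolist())
print(y_test_full.tolist() == y_test_subset.tolist())
```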

In [None]:
## Splitting RandomForest Feature Selection dataset

fs_Xtrain, fs_Xtest, fs_ytrain, fs_ytest = train_test_split(rf_fs, nxdf['Bankrupt?'], test_size = 0.1, random_state = 1)

In [None]:
model_score = pd.DataFrame(columns = ("Original_Dataset","Selected_Dataset"))

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, precision_score, recall_score

### Original Dataset

In [None]:
lrmodel1 = LogisticRegression(max_iter = 1000)
lrmodel1.fit(X_train, y_train)
score1 = lrmodel1.score(X_test, y_test)
lr_pred1 = lrmodel1.predict(X_test)

In [None]:
## Accuracy on Original Dataset without Feature Selection:

print("Score:", score1)

In [None]:
lr_cm1 = confusion_matrix(y_test, lr_pred1, labels = (1,0))

In [None]:
lr_cm1

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(lr_cm1, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()
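
For reading these heatmaps: scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, both ordered as in `labels`. A small sketch with made-up labels (not results from this dataset):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: three actual positives, one of which is missed.
y_actual    = [1, 1, 1, 0, 0, 0, 0]
y_predicted = [1, 1, 0, 0, 0, 0, 1]

cm = confusion_matrix(y_actual, y_predicted, labels=[1, 0])
# Rows index the actual class, columns the predicted class, both in `labels` order.
true_pos, false_neg = cm[0]
false_pos, true_neg = cm[1]
print(cm)
```

In a seaborn heatmap of `cm`, rows are drawn along the y-axis, so the y-axis shows the actual class and the x-axis the predicted class.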

In [None]:
y_test.value_counts()

### Feature Selection Dataset

In [None]:
lrmodel2 = LogisticRegression(max_iter = 1000)
lrmodel2.fit(fs_Xtrain, fs_ytrain)
score2 = lrmodel2.score(fs_Xtest, fs_ytest)
lr_ypred2 = lrmodel2.predict(fs_Xtest)

In [None]:
print("Score", score2)

In [None]:
lr_cm2 = confusion_matrix(fs_ytest, lr_ypred2, labels = (1,0))
print("Confusion Matrix: \n", lr_cm2)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(lr_cm2, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

In [None]:
model_score = pd.concat([model_score, pd.DataFrame({'Original_Dataset':[score1], 'Selected_Dataset': [score2]}, index = ['LogisticRegression'])])

In [None]:
model_score

# Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

### Original dataset

In [None]:
naiveb1 = GaussianNB()

In [None]:
naiveb1.fit(X_train, y_train)
score1 = naiveb1.score(X_test, y_test)
nb_pred1 = naiveb1.predict(X_test)

In [None]:
print("Score:", score1)

In [None]:
nb_cm1 = confusion_matrix(y_test, nb_pred1, labels = (1,0))

In [None]:
print("Confusion Matrix: \n", nb_cm1)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(nb_cm1, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

### Feature Selection Dataset

In [None]:
naiveb2 = GaussianNB()

In [None]:
naiveb2.fit(fs_Xtrain, fs_ytrain)
score2 = naiveb2.score(fs_Xtest, fs_ytest)
nb_pred2 = naiveb2.predict(fs_Xtest)

In [None]:
print("Score:", score2)

In [None]:
nb_cm2 = confusion_matrix(fs_ytest, nb_pred2, labels = [1,0])
print("Confusion Matrix: \n", nb_cm2)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(nb_cm2, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

In [None]:
model_score = pd.concat([model_score, pd.DataFrame({'Original_Dataset': [score1], 'Selected_Dataset': [score2]}, index = ['NaiveBayes'])])

In [None]:
model_score

# KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

### Original Dataset

In [None]:
knn1 = KNeighborsClassifier(n_neighbors = 7)

In [None]:
knn1.fit(X_train, y_train)

In [None]:
score1 = knn1.score(X_test, y_test)
print(score1)

In [None]:
knn_pred1 = knn1.predict(X_test)
knn_cm1 = confusion_matrix(y_test, knn_pred1, labels = (1,0))
print("Confusion Matrix:\n", knn_cm1)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(knn_cm1, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

In [None]:
### Hyperparameter tuning for KNN

In [None]:
error_rate = []

for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(X_train, y_train)
    pred_knn = knn.predict(X_test)
    error_rate.append(np.mean(pred_knn != y_test))

In [None]:
plt.figure(figsize = (8,8))
plt.plot(range(1,40), error_rate, color = 'blue', linestyle = 'dashed', marker = 'o', markerfacecolor = 'red', markersize = 10);
plt.title('Error Rate vs K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
## let's see how much difference it makes

tuned_knn1 = KNeighborsClassifier(n_neighbors = 4)
tuned_knn1.fit(X_train, y_train)

In [None]:
tuned_score1 = tuned_knn1.score(X_test, y_test)
print(tuned_score1)

We can see it's not that different.
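
One caveat with the error-rate plot above is that K is chosen by looking at the test set. A hedged alternative, sketched here on made-up data from `make_classification` (not this dataset), is to pick K with cross-validation on the training data and touch the test set only once at the end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data.
X_all, y_all = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.1, random_state=1)

# Score each K with 5-fold cross-validation on the training data only.
k_values = list(range(1, 40))
cv_error = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
            for k in k_values]
best_k = k_values[int(np.argmin(cv_error))]

# The test set is used once, for the final estimate.
final_score = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr).score(X_te, y_te)
print(best_k, final_score)
```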

### Feature Selection Dataset

In [None]:
knn2 = KNeighborsClassifier(n_neighbors = 7)
knn2.fit(fs_Xtrain, fs_ytrain)

In [None]:
score2 = knn2.score(fs_Xtest, fs_ytest)
print(score2)

In [None]:
knn_pred2 = knn2.predict(fs_Xtest)
knn_cm2 = confusion_matrix(fs_ytest, knn_pred2, labels = (1,0))
print("Confusion Matrix: \n", knn_cm2)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(knn_cm2, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

In [None]:
## Hyperparameter tuning for this

In [None]:
error_rate = []

for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(fs_Xtrain, fs_ytrain)
    pred_knn = knn.predict(fs_Xtest)
    error_rate.append(np.mean(pred_knn != fs_ytest))

In [None]:
plt.figure(figsize = (8,8))
plt.plot(range(1,40), error_rate, color = 'blue', linestyle = 'dashed', marker = 'o', markerfacecolor = 'red', markersize = 10);
plt.title('Error Rate vs K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

Let's test with K = 14

In [None]:
tuned_knn2 = KNeighborsClassifier(n_neighbors = 14)
tuned_knn2.fit(fs_Xtrain, fs_ytrain)

In [None]:
tuned_score2 = tuned_knn2.score(fs_Xtest, fs_ytest)
print(tuned_score2)

In [None]:
tuned_knn_pred2 = tuned_knn2.predict(fs_Xtest)
tuned_cm2 = confusion_matrix(fs_ytest, tuned_knn_pred2, labels = (1,0))
print("Confusion Matrix: \n", tuned_cm2)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(tuned_cm2, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

A bit better but not that great

In [None]:
model_score = pd.concat([model_score, pd.DataFrame({'Original_Dataset': [tuned_score1], 'Selected_Dataset': [tuned_score2]}, index = ['KNN'])])


In [None]:
model_score

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt1 = DecisionTreeClassifier()

### Original data

In [None]:
dt1 = dt1.fit(X_train, y_train)

In [None]:
score1 = dt1.score(X_test, y_test)
print(score1)

In [None]:
dt_pred1 = dt1.predict(X_test)
dt_cm1 = confusion_matrix(y_test, dt_pred1, labels = [1,0])
print("Confusion Matrix: \n", dt_cm1)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(dt_cm1, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

### Feature Selection Data

In [None]:
dt2 = DecisionTreeClassifier()

In [None]:
dt2 = dt2.fit(fs_Xtrain, fs_ytrain)

In [None]:
score2 = dt2.score(fs_Xtest, fs_ytest)
print(score2)

In [None]:
dt_pred2 = dt2.predict(fs_Xtest)
dt_cm2 = confusion_matrix(fs_ytest, dt_pred2, labels = [1,0])
print("Confusion Matrix: \n", dt_cm2)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(dt_cm2, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

In [None]:
model_score = pd.concat([model_score, pd.DataFrame({'Original_Dataset': [score1], 'Selected_Dataset': [score2]}, index = ['DecisionTrees'])])


In [None]:
model_score

# Random Forest

In [None]:
rfclf1 = RandomForestClassifier(n_estimators = 100)

### With Original dataset

In [None]:
rfclf1.fit(X_train, y_train)

In [None]:
score1 = rfclf1.score(X_test, y_test)
print(score1)

In [None]:
rf_pred1 = rfclf1.predict(X_test)
rf_cm1 = confusion_matrix(y_test, rf_pred1, labels = (1,0))
print("Confusion Matrix: \n", rf_cm1)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(rf_cm1, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

### With Selected Features

In [None]:
rfclf2 = RandomForestClassifier(n_estimators = 100)

In [None]:
rfclf2.fit(fs_Xtrain, fs_ytrain)

In [None]:
score2 = rfclf2.score(fs_Xtest, fs_ytest)
print(score2)

In [None]:
rf_pred2 = rfclf2.predict(fs_Xtest)
rf_cm2 = confusion_matrix(fs_ytest, rf_pred2, labels = [1,0])
print("Confusion Matrix: \n", rf_cm2)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(rf_cm2, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

#### Hyperparameter Tuning

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 80, num = 10)]
# Number of features to consider at every split
# ('auto' was removed in recent scikit-learn versions; for classifiers it meant 'sqrt')
max_features = ['sqrt', 'log2']

# Maximum number of levels in tree
max_depth = [2,4]

# Minimum number of samples required to split a node
min_samples_split = [2, 5]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]

# Method of selecting samples for training each tree
bootstrap = [True, False]

In [None]:
# Create the param grid
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(param_grid)
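
Before running a grid search it is worth counting how many candidates the grid contains. The sketch below repeats a grid of the same shape so that it runs on its own; the two `max_features` values are an assumption, and only their number matters for the count.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same shape as the grid above; the values are repeated here so the snippet runs on its own.
grid = {
    "n_estimators": [int(x) for x in np.linspace(start=10, stop=80, num=10)],
    "max_features": ["sqrt", "log2"],
    "max_depth": [2, 4],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}

n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 3  # 3-fold cross-validation fits every candidate three times
print(n_candidates, n_fits)
```

With `cv = 3`, this grid means several hundred model fits, which explains why the search takes noticeably longer than a single forest.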

In [None]:
tuned_rf = RandomForestClassifier()

In [None]:
from sklearn.model_selection import GridSearchCV
rf_Grid = GridSearchCV(estimator = tuned_rf, param_grid = param_grid, cv = 3, verbose=2, n_jobs = 4)

In [None]:
rf_Grid.fit(X_train, y_train)

In [None]:
rf_Grid.best_params_

In [None]:
print (f'Train Accuracy - : {rf_Grid.score(X_train,y_train):.3f}')
print (f'Test Accuracy - : {rf_Grid.score(X_test,y_test):.3f}')

In [None]:
tuned_score2 = rf_Grid.score(X_test, y_test)
print(tuned_score2)

In [None]:
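## the grid search above was fitted on the original features, so its score does not belong in the Selected_Dataset column
model_score = pd.concat([model_score, pd.DataFrame({'Original_Dataset': [score1], 'Selected_Dataset': [score2]}, index = ['RandomForest'])])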

In [None]:
model_score

# XGBoost

In [None]:
from xgboost import XGBClassifier

### With Original Dataset

In [None]:
xgb1 = XGBClassifier(n_estimators = 100)
xgb1.fit(X_train, y_train)

In [None]:
score1 = xgb1.score(X_test, y_test)
print(score1)

In [None]:
xgb_pred1 = xgb1.predict(X_test)
xgb_cm1 = confusion_matrix(y_test, xgb_pred1, labels = [1,0])
print("Confusion Matrix: \n", xgb_cm1)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(xgb_cm1, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

### With Selected Dataset

In [None]:
xgb2 = XGBClassifier(n_estimators = 100)
xgb2.fit(fs_Xtrain, fs_ytrain)

In [None]:
score2 = xgb2.score(fs_Xtest, fs_ytest)
print(score2)

In [None]:
xgb_pred2 = xgb2.predict(fs_Xtest)
xgb_cm2 = confusion_matrix(fs_ytest, xgb_pred2, labels = [1,0])
print("Confusion Matrix: \n", xgb_cm2)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(xgb_cm2, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

### Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
params={
 "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
    
}

In [None]:
tuned_xgb = XGBClassifier()

In [None]:
random_search = RandomizedSearchCV(tuned_xgb, param_distributions = params, n_iter = 5, scoring = 'roc_auc', n_jobs = 1, cv = 5, verbose = 3)

In [None]:
random_search.fit(X_train, y_train)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
tuned_score1 = random_search.score(X_test, y_test)
print(tuned_score1)

In [None]:
print (f'Train ROC AUC - : {random_search.score(X_train,y_train):.3f}')
print (f'Test ROC AUC - : {random_search.score(X_test,y_test):.3f}')
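
Because the search was created with `scoring = 'roc_auc'`, its `score` method reports ROC AUC rather than accuracy, and the two can differ a lot. A small sketch with made-up probabilities (not output from this model):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 1, 1, 1])
proba  = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45])

# Every positive is ranked above every negative, so the ranking metric is perfect.
auc = roc_auc_score(y_true, proba)
# With a 0.5 threshold every row is predicted as class 0, so accuracy is only one half.
acc = accuracy_score(y_true, (proba >= 0.5).astype(int))
print(auc, acc)
```

For a fair comparison with the other models, the same metric should be used in every row of the score table.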

In [None]:
tuned_xgb_pred1 = random_search.predict(X_test)
tuned_xgb_cm1 = confusion_matrix(y_test, tuned_xgb_pred1, labels = [1,0])
print("Confusion Matrix: \n", tuned_xgb_cm1)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(tuned_xgb_cm1, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

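This model looks biased towards the positive class. Note that X_train comes from the balanced subsample (nxdf), so class imbalance in the training data is not the cause. Next we will tune the hyperparameters on the selected-features dataset.

### Hyperparameter tuning for the selected-features dataset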

In [None]:
tuned_xgb2 = XGBClassifier()

In [None]:
random_search2 = RandomizedSearchCV(tuned_xgb2, param_distributions = params, n_iter = 5, scoring = 'roc_auc', n_jobs = 1, cv = 5, verbose = 3)

In [None]:
random_search2.fit(fs_Xtrain, fs_ytrain)

In [None]:
random_search2.best_estimator_

In [None]:
random_search2.best_params_

In [None]:
tuned_score2 = random_search2.score(fs_Xtest, fs_ytest)
print(tuned_score2)

In [None]:
print (f'Train ROC AUC - : {random_search2.score(fs_Xtrain,fs_ytrain):.3f}')
print (f'Test ROC AUC - : {random_search2.score(fs_Xtest,fs_ytest):.3f}')

In [None]:
tuned_xgb_pred2 = random_search2.predict(fs_Xtest)
tuned_xgb_cm2 = confusion_matrix(fs_ytest, tuned_xgb_pred2, labels = [1,0])
print("Confusion Matrix: \n", tuned_xgb_cm2)

In [None]:
x_axis_labels = [1,0]
y_axis_labels = [1,0]

sns.set(font_scale=1.4)
sns.heatmap(tuned_xgb_cm2, xticklabels = x_axis_labels, yticklabels = y_axis_labels, annot = True, annot_kws = {'size': 16})
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

In [None]:
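## the searches score with ROC AUC (scoring = 'roc_auc'), so record accuracy here to stay comparable with the other rows
model_score = pd.concat([model_score, pd.DataFrame({'Original_Dataset': [metrics.accuracy_score(y_test, tuned_xgb_pred1)], 'Selected_Dataset': [metrics.accuracy_score(fs_ytest, tuned_xgb_pred2)]}, index = ['XGBoost'])])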


In [None]:
model_score

As we can see, <b> XGBoost </b> performs best on the <b> selected features </b>.

In [None]:
## Checking Classification report of the best model

print(classification_report(fs_ytest, tuned_xgb_pred2))

In [None]:
## Biased model

print(classification_report(y_test, tuned_xgb_pred1))

In [None]:
### Checking Classification report of the worst model

print(classification_report(fs_ytest, lr_ypred2))