<h1 align="center">  Bank Marketing Data </h1>

#### [UCI: BANK MARKETING DATASET](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) 
The data relate to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y).

# Contents:
1. [Attribute Information](#attribute)
2. [Data Cleaning and Preprocessing](#cleaning)
    -  [Standardization](#scaling)
    -  [Train/Test Split](#split)
3. [Approaches](#approach)
    - [Approach 1: Baseline Model](#approach1)
    - [Approach 2: Oversampling - RandomOverSampler](#approach2)
    - [Approach 3: Undersampling - RandomUnderSampler](#approach3)
    - [Approach 4: Oversampling - SMOTE](#approach4)
4. [Feature Importance and Final Model Selection](#keyfeatures)
    - [Feature Importance in RandomOverSampler](#keyfeatures1)
    - [Feature Importance in RandomOverSampler](#keyfeatures2)

#### Attribute Information:
<a id="attribute"></a>
- Age
- Job - type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
- marital - marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
- Education - Shows the level of education of each customer (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
- Default - Whether a customer has credit in default (categorical: 'no','yes','unknown')
- Housing - Does the customer have a housing loan? (categorical: 'no','yes','unknown')
- Loan - Does the customer have a personal loan? (categorical: 'no','yes','unknown')
- Contact - The contact communication type (categorical: 'cellular','telephone')
- Month - Last contact month of year
- day_of_week - Last contact day of Week
- Duration - Last contact duration in seconds. Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no').
- Campaign - Number of contacts performed for this client during this campaign
- pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client
- poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success') 
- emp.var.rate: employment variation rate - quarterly indicator
- cons.price.idx: consumer price index - monthly indicator
- cons.conf.idx: consumer confidence index - monthly indicator
- euribor3m: euribor 3 month rate - daily indicator
- nr.employed: number of employees - quarterly indicator
- y - has the client subscribed to a term deposit? (binary: 'yes','no')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve, recall_score
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, RandomizedSearchCV, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC 
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from collections import Counter
from pprint import pprint
from scipy import stats
from sklearn.preprocessing import StandardScaler

In [None]:
dataset = pd.read_csv("../input/bank-marketing-campaigns-dataset/bank-additional-full.csv", sep=';')
dataset.head()

# Data Cleaning and Preprocessing
<a id="cleaning"></a>

In [None]:
dataset.info()

*There are 41188 observations and 21 columns: 20 input features plus the target y.*

In [None]:
# check for missing values in any column
dataset.isnull().sum()

*There are no missing values in the dataset.*

In [None]:
dataset.describe()

In [None]:
dataset.hist(bins = 15, figsize = (10,10), xlabelsize = 0.1, ylabelsize = 0.1)
plt.show()

In [None]:
dataset.pdays.value_counts(normalize=True)

*The pdays column shows very little variation: most values are 999, which means the client was not previously contacted. Because it carries little information, we drop it.*

In [None]:
sns.catplot(x='default',hue='y',kind='count',data=dataset)

In [None]:
pd.crosstab(dataset['default'], dataset.y)

*We also drop the default column: nearly all of its values are 'no' or 'unknown', so it carries little information.*

In [None]:
dataset.y.value_counts(normalize=True)

In [None]:
colors = ["#0101DF", "#DF0101"]

# pass the column as a keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='y', data=dataset, palette=colors)
plt.title('Deposit Distribution \n (no vs. yes)', fontsize=14)

*The distribution above shows that the data is imbalanced: there are roughly 8 times as many "no" observations as "yes" observations.*
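
The imbalance can also be expressed as a single ratio. The sketch below uses toy labels (illustrative values, not the real data) in place of `dataset.y`; the names `labels` and `imbalance_ratio` are assumptions made for this example.

```python
from collections import Counter

# Toy labels standing in for dataset.y (illustrative values only)
labels = ["no"] * 8 + ["yes"] * 1

# Count each class, then divide the majority count by the minority count
counts = Counter(labels)
majority = max(counts.values())
minority = min(counts.values())
imbalance_ratio = majority / minority

print(counts)
print(imbalance_ratio)
```

On the real data, the same calculation could be applied to `dataset.y`.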

In [None]:
plt.figure(figsize=(10,10))
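# correlate numeric columns only; recent pandas versions raise an error on non-numeric columns in corr()
sns.heatmap(dataset.select_dtypes(include='number').corr(),square=True,annot=True,cmap= 'twilight_shifted')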

*emp.var.rate, nr.employed and euribor3m are highly correlated. Since multicollinearity is not a problem for every algorithm, we keep them.*
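
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. This is a minimal sketch on synthetic data; the frame `toy`, its columns, and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a multiple of "a", "c" is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2.0 + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the (assumed) threshold of 0.9
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```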

## Standardization
<a id="scaling"></a>

In [None]:
# make a copy of the dataset for scaling
bank_scale=dataset.copy()

# remove 'pdays' and 'default' columns
bank_scale= bank_scale.drop(['pdays', 'default'], axis=1)

# encode the target as 1 ('yes') / 0 ('no')
bank_scale['y'] = bank_scale['y'].map({'yes': 1, 'no': 0})

# standardization for just numerical variables 
categorical_cols= ['job','marital', 'education',  'housing', 'loan', 'contact', 'month', 'day_of_week','poutcome','y']
feature_scale=[feature for feature in bank_scale.columns if feature not in categorical_cols]

scaler=StandardScaler()
scaler.fit(bank_scale[feature_scale])

In [None]:
scaled_data = pd.concat([bank_scale[categorical_cols].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(bank_scale[feature_scale]), columns=feature_scale)],
                    axis=1)

categorical_cols1= ['job','marital', 'education', 'housing', 'loan', 'contact', 'month', 'day_of_week','poutcome']
scaled_data= pd.get_dummies(scaled_data, columns = categorical_cols1, drop_first=True)
scaled_data.head()
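
Standardization rescales each numeric column to zero mean and unit variance, z = (x - mean) / std. The sketch below shows this on toy data and uses one common variation: the scaler is fitted on the training split only and then applied to both splits, so that test-set statistics do not influence the transformation. All names (`X_toy`, `toy_scaler`, and so on) are assumptions for this example and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric features and labels (illustrative values only)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=2)

# Fit on the training split only, then transform both splits with the same statistics
toy_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = toy_scaler.transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

# StandardScaler matches the manual formula z = (x - mean) / std (population std)
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(manual, X_tr_scaled))
```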

## Train/Test Split
<a id="split"></a>

In [None]:
# 'y' is the first column of scaled_data; every other column is a feature
X = scaled_data.drop(columns='y')
Y = scaled_data['y']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=2)
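
With an imbalanced target, a plain random split can leave the two splits with slightly different class ratios. One option, shown here as a sketch on toy labels rather than as the notebook's method, is the `stratify` argument of `train_test_split`, which preserves the class proportions in both splits. The arrays `X_toy` and `y_toy` are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 80/20 class ratio (illustrative values only)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 ratio in both the training and the test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```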

# Approach 1: Baseline Model
<a id="approach1"></a>
*In this approach, we leave the data unbalanced so that we have a baseline for judging whether the models improve after balancing.*

In [None]:
import warnings
warnings.filterwarnings("ignore")
# Tuning parameters for RF (the values used later are the best parameters found by RandomizedSearchCV)
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 200, num = 5)]
# 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
max_features = ['sqrt']
max_depth = [int(x) for x in np.linspace(1, 45, num = 3)]
min_samples_split = [5, 10]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}
tuning_rf = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = random_grid, n_iter = 10, cv = 10, verbose=2, random_state=42, n_jobs = -1, scoring='roc_auc')
tuning_rf.fit(X_train,y_train)
print('Best Parameter for Random Forest', tuning_rf.best_params_, tuning_rf.best_score_)

# Tuning parameter for Tree
# min_samples_split must be at least 2, so the range starts at 2
param_dict= {"criterion": ['gini', 'entropy'],
            "max_depth": range(1,10),
            "min_samples_split": range(2,10),
            "min_samples_leaf": range(1,5)}
tuning_tree = GridSearchCV(DecisionTreeClassifier(random_state=12),  param_grid=param_dict, cv=10, verbose=1, n_jobs=-1)
tuning_tree.fit(X_train,y_train)
print('Best Parameter for Tree', tuning_tree.best_params_, tuning_tree.best_score_)

# Xgboost Parameters
param_xgb = {
 'max_depth':[4,5,6],
 'min_child_weight':[4,5,6],
 'gamma':[i/10.0 for i in range(0,5)]
}
tuning_xgb = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_xgb, scoring='roc_auc',n_jobs=4, cv=5)
tuning_xgb.fit(X_train,y_train)
print('Best Parameter for XGBoost', tuning_xgb.best_params_, tuning_xgb.best_score_)

In [None]:
%%time
# Voting Classifier
clf1 = DecisionTreeClassifier()
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
clf4= KNeighborsClassifier()
clf5= LinearDiscriminantAnalysis()
clf6= XGBClassifier()

# Instantiate the classifiers and make a list
classifiers = [LinearDiscriminantAnalysis(),
               KNeighborsClassifier(),
               GaussianNB(), 
               SVC(kernel='linear'),
               DecisionTreeClassifier(criterion='gini', max_depth=6, min_samples_split=9,min_samples_leaf=2, random_state=12),
               RandomForestClassifier(n_estimators=155, max_features='sqrt', max_depth=45, min_samples_split=10, random_state=27),  # 'sqrt' is what 'auto' meant for classifiers
               XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=5, min_child_weight=4, gamma=0.3, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
               VotingClassifier(estimators = [('DTree', clf1), ('rf', clf2), ('gnb', clf3),  ('knn', clf4),('lda', clf5), ('xgb', clf6)], voting ='soft')]

# Define a result table as a DataFrame
result_table = pd.DataFrame(columns=['classifiers', 'fpr1','tpr1','fpr','tpr','train_accuracy','test_accuracy', 'train_auc', 'test_auc', 'f1_score', 'precision','recall','confusion matrix','Report'])

# Train the models and record the results
for cls in classifiers:
    model = cls.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    train_accuracy= accuracy_score(y_train, y_train_pred)
    test_accuracy= accuracy_score(y_test, y_test_pred)
     
    fpr, tpr, _ = roc_curve(y_test,  y_test_pred)
    fpr1, tpr1, _ = roc_curve(y_train,  y_train_pred)
    
    train_auc = roc_auc_score(y_train, y_train_pred)
    test_auc = roc_auc_score(y_test, y_test_pred)
    
    f1_score= metrics.f1_score(y_test, y_test_pred)
    precision = metrics.precision_score(y_test, y_test_pred)
    recall = metrics.recall_score(y_test, y_test_pred)
    
    conf_mat= confusion_matrix(y_test,y_test_pred)
    report=classification_report(y_test,y_test_pred, digits=3, output_dict=True)
    
    # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate it
    row = pd.DataFrame([{'classifiers':cls.__class__.__name__,
                         'fpr1':fpr1,
                         'tpr1':tpr1,
                         'fpr':fpr,
                         'tpr':tpr,
                         'train_accuracy': train_accuracy,
                         'test_accuracy': test_accuracy,
                         'train_auc':train_auc,
                         'test_auc':test_auc,
                         'f1_score': f1_score,
                         'precision': precision,
                         'recall': recall,
                         'confusion matrix':conf_mat,
                         'Report':report}])
    result_table = pd.concat([result_table, row], ignore_index=True)

# Set name of the classifiers as index labels
result_table.set_index('classifiers', inplace=True)
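
The loop above passes hard class predictions to `roc_curve` and `roc_auc_score`, which yields a ROC curve with a single interior point. ROC analysis is usually based on continuous scores instead. The sketch below compares both on synthetic data; `continuous_scores` is a hypothetical helper written for this example, and the toy model and data are assumptions rather than the notebook's setup. The fallback to `decision_function` covers estimators such as `SVC(kernel='linear')`, which do not expose `predict_proba` by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

def continuous_scores(model, X):
    """Hypothetical helper: return a ranking score for the positive class."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)

# Synthetic imbalanced data (illustrative only)
X_toy, y_toy = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2, stratify=y_toy
)
toy_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC and ROC curve from hard labels versus from continuous scores
auc_from_labels = roc_auc_score(y_te, toy_model.predict(X_te))
auc_from_scores = roc_auc_score(y_te, continuous_scores(toy_model, X_te))
fpr_l, tpr_l, _ = roc_curve(y_te, toy_model.predict(X_te))
fpr_s, tpr_s, _ = roc_curve(y_te, continuous_scores(toy_model, X_te))

print(len(fpr_l), len(fpr_s))
print(auc_from_labels, auc_from_scores)
```

The label-based curve has at most three points, so the two AUC values generally differ; which one to report is a design choice that should be stated alongside the results.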

In [None]:
result_table.rename(index={'VotingClassifier':'Model Ensemble'},inplace=True)
result_table
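
The "Model Ensemble" row comes from a `VotingClassifier` with `voting='soft'`, which averages the predicted class probabilities of its members and picks the class with the highest average. The sketch below illustrates the idea with made-up probabilities for three hypothetical members and contrasts it with majority (hard) voting; none of the numbers come from the notebook's models.

```python
import numpy as np

# Made-up class probabilities [P(class 0), P(class 1)] for three samples
proba_a = np.array([[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]])
proba_b = np.array([[0.60, 0.40], [0.30, 0.70], [0.10, 0.90]])
proba_c = np.array([[0.70, 0.30], [0.60, 0.40], [0.55, 0.45]])

# Soft voting: average the probabilities, then take the most probable class
avg_proba = np.mean([proba_a, proba_b, proba_c], axis=0)
soft_vote = avg_proba.argmax(axis=1)

# Hard voting: each member votes for its own most probable class, majority wins
hard_votes = np.stack([p.argmax(axis=1) for p in (proba_a, proba_b, proba_c)])
hard_vote = (hard_votes.sum(axis=0) >= 2).astype(int)

print(soft_vote)
print(hard_vote)
```

For the third sample the two schemes disagree: two members lean slightly towards class 0, but one confident member pulls the averaged probability towards class 1.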

In [None]:
pd.DataFrame(result_table.iloc[0,12]).transpose()

In [None]:
fig = plt.figure(figsize=(15,10))

for i in range(result_table.shape[0]):
    plt.plot(result_table.iloc[i,]['fpr'], 
             result_table.iloc[i,]['tpr'], 
             label="{}, AUC={:.3f}".format(result_table.index[i], result_table.iloc[i,]['test_auc']))
    
plt.plot([0,1], [0,1], color='orange', linestyle='--')
plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)
plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)
plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')
plt.show()

In [None]:
plt.figure(figsize=(12,7))
plt.plot(result_table.iloc[:,[5,7,8,9,10]])
plt.xlabel('Models')
plt.xticks(rotation=90)
plt.ylabel('Score')
plt.title('Result of Models')
plt.legend(['Accuracy', 'ROC_AUC','F1 score','Precision','Recall'])
plt.show();

In [None]:
plt.figure(figsize=(12,7))
plt.plot(result_table.iloc[:,[7,10]])
plt.xlabel('Models')
plt.xticks(rotation=90)
plt.ylabel('Score')
plt.title('Result of Models')
plt.legend(['ROC_AUC','Recall'])
plt.show();

# Approach 2: Oversampling - RandomOverSampler
<a id="approach2"></a>
*In this approach, to alleviate the effects of imbalance during model training, we oversample with RandomOverSampler, which randomly duplicates minority-class observations until the classes are balanced.*

In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1))) 
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0))) 

from imblearn.over_sampling import RandomOverSampler
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority') 
X_train_over, y_train_over = oversample.fit_resample(X_train, y_train)
  
print('After OverSampling, the shape of X_train: {}'.format(X_train_over.shape)) 
print('After OverSampling, the shape of y_train: {} \n'.format(y_train_over.shape)) 
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1))) 
print("After OverSampling, counts of label '0': {}".format(sum(y_train_over == 0))) 
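
Conceptually, random oversampling draws minority-class rows with replacement until both classes have the same size. The NumPy sketch below illustrates that idea on toy data; `random_oversample` is a hypothetical helper written for this example and is not imblearn's implementation.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Hypothetical helper: duplicate minority rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Number of extra minority rows needed to match the majority class
    n_extra = len(majority_idx) - len(minority_idx)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra_idx])
    return X[keep], y[keep]

# Toy data: 8 majority rows and 2 minority rows (illustrative values only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_over, y_over = random_oversample(X_toy, y_toy)
print(np.bincount(y_over))
```

Because the added rows are exact copies, models that can memorise individual points may overfit the duplicated minority observations.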

*All tuning parameters are chosen based on the best parameters found by RandomizedSearchCV and GridSearchCV.*

In [None]:
# Tuning parameter for RF 
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 200, num = 5)]
# 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
max_features = ['sqrt']
max_depth = [int(x) for x in np.linspace(1, 45, num = 3)]
min_samples_split = [5, 10]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}
tuning_rf = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = random_grid, n_iter = 10, cv = 10, verbose=2, random_state=42, n_jobs = -1, scoring='roc_auc')
tuning_rf.fit(X_train_over,y_train_over)
print('Best Parameter for Random Forest', tuning_rf.best_params_, tuning_rf.best_score_)

# Tuning parameter for Tree
# min_samples_split must be at least 2, so the range starts at 2
param_dict= {"criterion": ['gini', 'entropy'],
            "max_depth": range(1,10),
            "min_samples_split": range(2,10),
            "min_samples_leaf": range(1,5)}
tuning_tree = GridSearchCV(DecisionTreeClassifier(random_state=12),  param_grid=param_dict, cv=10, verbose=1, n_jobs=-1)
tuning_tree.fit(X_train_over,y_train_over)
print('Best Parameter for Tree', tuning_tree.best_params_, tuning_tree.best_score_)

# Xgboost Parameters
param_xgb = {
 'max_depth':[4,5,6],
 'min_child_weight':[4,5,6],
 'gamma':[i/10.0 for i in range(0,5)]}
tuning_xgb = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_xgb, scoring='roc_auc',n_jobs=4, cv=5)
tuning_xgb.fit(X_train_over,y_train_over)
print('Best Parameter for XGBoost', tuning_xgb.best_params_, tuning_xgb.best_score_)

In [None]:
%%time
# Voting Classifier
clf1 = DecisionTreeClassifier()
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
clf4 = KNeighborsClassifier()
clf5= LinearDiscriminantAnalysis()
clf6= XGBClassifier()

# Instantiate the classifiers and make a list
classifiers = [LinearDiscriminantAnalysis(),
               KNeighborsClassifier(),
               GaussianNB(), 
               SVC(kernel='linear'),
               DecisionTreeClassifier(criterion='gini', max_depth=9, min_samples_split=5,min_samples_leaf=1, random_state=12),
               RandomForestClassifier(n_estimators=200, max_features='sqrt', max_depth=45, min_samples_split=5, random_state=27),
               XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=4, min_child_weight=6, gamma=0.4, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
               VotingClassifier(estimators = [('DTree', clf1), ('rf', clf2), ('gnb', clf3), ('knn', clf4), ('lda', clf5), ('xgb', clf6)], voting ='soft')]

# Define a result table as a DataFrame
result_table1 = pd.DataFrame(columns=['classifiers', 'fpr1','tpr1','fpr','tpr','train_accuracy','test_accuracy', 'train_auc', 'test_auc', 'f1_score', 'precision','recall','confusion matrix','Report'])

# Train the models and record the results
for cls in classifiers:
    model = cls.fit(X_train_over, y_train_over)
    y_train_pred = model.predict(X_train_over)
    y_test_pred = model.predict(X_test)
    
    train_accuracy= accuracy_score(y_train_over, y_train_pred)
    test_accuracy= accuracy_score(y_test, y_test_pred)
     
    fpr, tpr, _ = roc_curve(y_test,  y_test_pred)
    fpr1, tpr1, _ = roc_curve(y_train_over,  y_train_pred)
    
    train_auc = roc_auc_score(y_train_over, y_train_pred)
    test_auc = roc_auc_score(y_test, y_test_pred)
    
    f1_score= metrics.f1_score(y_test, y_test_pred)
    precision = metrics.precision_score(y_test, y_test_pred)
    recall = metrics.recall_score(y_test, y_test_pred)
    
    conf_mat= confusion_matrix(y_test,y_test_pred)
    report=classification_report(y_test,y_test_pred, digits=3, output_dict=True)
    
    # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate it
    row = pd.DataFrame([{'classifiers':cls.__class__.__name__,
                         'fpr1':fpr1,
                         'tpr1':tpr1,
                         'fpr':fpr,
                         'tpr':tpr,
                         'train_accuracy': train_accuracy,
                         'test_accuracy': test_accuracy,
                         'train_auc':train_auc,
                         'test_auc':test_auc,
                         'f1_score': f1_score,
                         'precision': precision,
                         'recall': recall,
                         'confusion matrix':conf_mat,
                         'Report':report}])
    result_table1 = pd.concat([result_table1, row], ignore_index=True)

# Set name of the classifiers as index labels
result_table1.set_index('classifiers', inplace=True)

In [None]:
result_table1.rename(index={'VotingClassifier':'Model Ensemble'},inplace=True)
result_table1

In [None]:
fig = plt.figure(figsize=(15,10))

for i in range(result_table1.shape[0]):
    plt.plot(result_table1.iloc[i,]['fpr'], 
             result_table1.iloc[i,]['tpr'], 
             label="{}, AUC={:.3f}".format(result_table1.index[i], result_table1.iloc[i,]['test_auc']))
    
plt.plot([0,1], [0,1], color='orange', linestyle='--')
plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)
plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)
plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')
plt.show()

In [None]:
plt.figure(figsize=(12,7))
plt.plot(result_table1.iloc[:,[5,7,8,9,10]])
plt.xlabel('Models')
plt.xticks(rotation=90)
plt.ylabel('Score')
plt.title('Result of Models')
plt.legend(['Accuracy', 'ROC_AUC','f1_score','precision','recall'])
plt.show();

In [None]:
plt.figure(figsize=(12,7))
plt.plot(result_table1.iloc[:,[7,10]])
plt.xlabel('Models')
plt.xticks(rotation=90)
plt.ylabel('Score')
plt.title('Result of Models')
plt.legend(['ROC_AUC','Recall'])
plt.show();

# Approach 3: Undersampling - RandomUnderSampler
<a id="approach3"></a>
*In this approach, we use undersampling, which randomly removes observations of the majority class to improve the balance across classes, using the RandomUnderSampler class in imblearn.*

In [None]:
print("Before Undersampling, counts of label '1': {}".format(sum(y_train == 1))) 
print("Before Undersampling, counts of label '0': {} \n".format(sum(y_train == 0))) 
  
# import the Random Under Sampler object.
from imblearn.under_sampling import RandomUnderSampler
# create the object.
under_sampler = RandomUnderSampler(random_state=2)
# fit the object to the training data (older imblearn releases called this method fit_sample).
X_train_under, y_train_under = under_sampler.fit_resample(X_train, y_train)
  
print('After Undersampling, the shape of X_train: {}'.format(X_train_under.shape)) 
print('After Undersampling, the shape of y_train: {} \n'.format(y_train_under.shape)) 

print("After Undersampling, counts of label '1': {}".format(sum(y_train_under == 1))) 
print("After Undersampling, counts of label '0': {}".format(sum(y_train_under == 0)))
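
Random undersampling can be pictured as keeping every minority row and drawing an equally sized random subset of the majority rows without replacement. The NumPy sketch below illustrates this on toy data; `random_undersample` is a hypothetical helper for this example, not imblearn's implementation.

```python
import numpy as np

def random_undersample(X, y, minority_label=1, seed=0):
    """Hypothetical helper: drop majority rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Keep as many majority rows as there are minority rows, chosen without replacement
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.sort(np.concatenate([kept_majority, minority_idx]))
    return X[keep], y[keep]

# Toy data: 8 majority rows and 2 minority rows (illustrative values only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_under, y_under = random_undersample(X_toy, y_toy)
print(np.bincount(y_under))
```

The trade-off is that most majority-class observations are discarded, so the models are trained on far less data than in the oversampling approaches.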

In [None]:
# Tuning parameters for RF (the values used later are the best parameters found by RandomizedSearchCV)
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 200, num = 5)]
# 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
max_features = ['sqrt']
max_depth = [int(x) for x in np.linspace(1, 45, num = 3)]
min_samples_split = [5, 10]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}
tuning_rf = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = random_grid, n_iter = 10, cv = 10, verbose=2, random_state=42, n_jobs = -1, scoring='roc_auc')
tuning_rf.fit(X_train_under,y_train_under)
print('Best Parameter for Random Forest', tuning_rf.best_params_, tuning_rf.best_score_)

# Tuning parameter for Tree
# min_samples_split must be at least 2, so the range starts at 2
param_dict= {"criterion": ['gini', 'entropy'],
            "max_depth": range(1,10),
            "min_samples_split": range(2,10),
            "min_samples_leaf": range(1,5)}
tuning_tree = GridSearchCV(DecisionTreeClassifier(random_state=12),  param_grid=param_dict, cv=10, verbose=1, n_jobs=-1)
tuning_tree.fit(X_train_under,y_train_under)
print('Best Parameter for Tree', tuning_tree.best_params_, tuning_tree.best_score_)

# Xgboost Parameters
param_xgb = {
 'max_depth':[4,5,6],
 'min_child_weight':[4,5,6],
 'gamma':[i/10.0 for i in range(0,5)]}
tuning_xgb = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_xgb, scoring='roc_auc',n_jobs=4, cv=5)
tuning_xgb.fit(X_train_under,y_train_under)
print('Best Parameter for XGBoost', tuning_xgb.best_params_, tuning_xgb.best_score_)

In [None]:
%%time
# Voting Classifier
clf1 = DecisionTreeClassifier()
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
clf4 = KNeighborsClassifier()
clf5= LinearDiscriminantAnalysis()
clf6= XGBClassifier()

# Instantiate the classifiers and make a list
classifiers = [LinearDiscriminantAnalysis(),
               KNeighborsClassifier(),
               GaussianNB(), 
               SVC(kernel='linear'),
               DecisionTreeClassifier(criterion='entropy', max_depth=6, min_samples_split=2,min_samples_leaf=4, random_state=12),
               RandomForestClassifier(n_estimators=155, max_features='sqrt', max_depth=45, min_samples_split=10, random_state=27),  # 'sqrt' is what 'auto' meant for classifiers
               XGBClassifier(learning_rate =0.1,n_estimators=140,max_depth=4,min_child_weight=5,gamma=0.4,subsample=0.8,colsample_bytree=0.8,objective= 'binary:logistic',nthread=4,scale_pos_weight=1,seed=27),
               VotingClassifier(estimators = [('DTree', clf1), ('rf', clf2), ('gnb', clf3), ('knn', clf4), ('lda', clf5), ('xgb', clf6)], voting ='soft')]

# Define a result table as a DataFrame
result_table2 = pd.DataFrame(columns=['classifiers', 'fpr1','tpr1','fpr','tpr','train_accuracy','test_accuracy', 'train_auc', 'test_auc', 'f1_score', 'precision','recall','confusion matrix','Report'])

# Train the models and record the results
for cls in classifiers:
    model2 = cls.fit(X_train_under, y_train_under)
    y_train_pred2 = model2.predict(X_train_under)
    y_test_pred2 = model2.predict(X_test)
    
    train_accuracy= accuracy_score(y_train_under, y_train_pred2)
    test_accuracy= accuracy_score(y_test, y_test_pred2)
     
    fpr, tpr, _ = roc_curve(y_test,  y_test_pred2)
    fpr1, tpr1, _ = roc_curve(y_train_under,  y_train_pred2)
    
    train_auc = roc_auc_score(y_train_under, y_train_pred2)
    test_auc = roc_auc_score(y_test, y_test_pred2)
    
    f1_score= metrics.f1_score(y_test, y_test_pred2)
    precision = metrics.precision_score(y_test, y_test_pred2)
    recall = metrics.recall_score(y_test, y_test_pred2)
    
    conf_mat= confusion_matrix(y_test,y_test_pred2)
    report=classification_report(y_test,y_test_pred2, digits=3, output_dict=True)
    
    # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate it
    row = pd.DataFrame([{'classifiers':cls.__class__.__name__,
                         'fpr1':fpr1,
                         'tpr1':tpr1,
                         'fpr':fpr,
                         'tpr':tpr,
                         'train_accuracy': train_accuracy,
                         'test_accuracy': test_accuracy,
                         'train_auc':train_auc,
                         'test_auc':test_auc,
                         'f1_score': f1_score,
                         'precision': precision,
                         'recall': recall,
                         'confusion matrix':conf_mat,
                         'Report':report}])
    result_table2 = pd.concat([result_table2, row], ignore_index=True)

# Set name of the classifiers as index labels
result_table2.set_index('classifiers', inplace=True)

In [None]:
result_table2.rename(index={'VotingClassifier':'Model Ensemble'},inplace=True)
result_table2

In [None]:
fig = plt.figure(figsize=(15,10))

for i in range(result_table2.shape[0]):
    plt.plot(result_table2.iloc[i,]['fpr'], 
             result_table2.iloc[i,]['tpr'], 
             label="{}, AUC={:.3f}".format(result_table2.index[i], result_table2.iloc[i,]['test_auc']))
    
plt.plot([0,1], [0,1], color='orange', linestyle='--')
plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)
plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)
plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')
plt.show()

In [None]:
plt.figure(figsize=(12,7))
plt.plot(result_table2.iloc[:,[5,7,8,9,10]])
plt.xlabel('Models')
plt.xticks(rotation=90)
plt.ylabel('Score')
plt.title('Result of Models')
plt.legend(['Accuracy', 'ROC_AUC','f1_score','precision','recall'])
plt.show();

In [None]:
plt.figure(figsize=(12,7))
plt.plot(result_table2.iloc[:,[7,10]])
plt.xlabel('Models')
plt.xticks(rotation=90)
plt.ylabel('Score')
plt.title('Result of Models')
plt.legend(['ROC_AUC','Recall'])
plt.show();

# Approach 4: Oversampling - SMOTE
<a id="approach4"></a>
*In this approach, we oversample with the Synthetic Minority Oversampling Technique (SMOTE), which creates synthetic minority-class observations by interpolating between existing ones.*

In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1))) 
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0))) 

from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state = 2 )#, sampling_strategy=0.25) 
# fit_resample replaces fit_sample, which older imblearn releases used
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
  
print('After OverSampling, the shape of X_train: {}'.format(X_train_res.shape)) 
print('After OverSampling, the shape of y_train: {} \n'.format(y_train_res.shape)) 
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1))) 
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0))) 
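
The core of SMOTE is an interpolation step: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The NumPy sketch below illustrates only that step on a tiny toy set; `smote_like_samples` is a hypothetical helper written for this example and omits details of the imblearn implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Hypothetical helper: interpolate between minority points and their neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points; ignore self-distances
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    # Indices of the k nearest minority neighbours of each point
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base point and one of its neighbours for every new sample
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # New point = base + lam * (neighbour - base), with lam drawn uniformly from [0, 1)
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Toy minority class: the four corners of the unit square (illustrative values only)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=5)
print(synthetic)
```

Unlike random oversampling, the new rows are not exact copies, which is why the resampled training set contains feature values that never occurred in the original data.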

In [None]:
# Tuning parameters for RF (the values used later are the best parameters found by RandomizedSearchCV)
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 200, num = 5)]
# 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
max_features = ['sqrt']
max_depth = [int(x) for x in np.linspace(1, 45, num = 3)]
min_samples_split = [5, 10]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}
tuning_rf = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = random_grid, n_iter = 10, cv = 10, verbose=2, random_state=42, n_jobs = -1, scoring='roc_auc')
#tuning_rf.fit(X_train_res,y_train_res)
#tuning_rf.best_params_, tuning_rf.best_score_

# Tuning parameter for Tree
# min_samples_split must be at least 2, so the range starts at 2
param_dict= {"criterion": ['gini', 'entropy'],
            "max_depth": range(1,10),
            "min_samples_split": range(2,10),
            "min_samples_leaf": range(1,5)}
tuning_tree = GridSearchCV(DecisionTreeClassifier(random_state=12),  param_grid=param_dict, cv=10, verbose=1, n_jobs=-1)
#tuning_tree.fit(X_train_res,y_train_res)
#tuning_tree.best_params_, tuning_tree.best_score_

# Xgboost Parameters
param_xgb = {
 'max_depth':[4,5,6],
 'min_child_weight':[4,5,6],
 'gamma':[i/10.0 for i in range(0,5)]
}
tuning_xgb = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_xgb, scoring='roc_auc',n_jobs=4, cv=5)
#tuning_xgb.fit(X_train_res,y_train_res)
#tuning_xgb.best_params_, tuning_xgb.best_score_

In [None]:
%%time
# Voting Classifier
clf1 = DecisionTreeClassifier()
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
clf4= KNeighborsClassifier()
clf5= LinearDiscriminantAnalysis()
clf6= XGBClassifier()

# Instantiate the classifiers and make a list
classifiers = [LinearDiscriminantAnalysis(),
               KNeighborsClassifier(),
               GaussianNB(), 
               SVC(kernel='linear'),
               DecisionTreeClassifier(criterion='gini', max_depth=9,min_samples_split=2,min_samples_leaf=3, random_state=12),
               RandomForestClassifier(n_estimators=200, max_features='sqrt', max_depth=45, min_samples_split=5, random_state=27),
               XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=4, min_child_weight=5, gamma=0.4, subsample=0.8, colsample_bytree=0.8, nthread=4,scale_pos_weight=1,seed=27),
               VotingClassifier(estimators = [('DTree', clf1), ('rf', clf2), ('gnb', clf3), ('knn', clf4), ('lda', clf5), ('xgb', clf6)], voting ='soft')]

# Column order of the result table
result_columns = ['classifiers', 'fpr1', 'tpr1', 'fpr', 'tpr', 'train_accuracy', 'test_accuracy',
                  'train_auc', 'test_auc', 'f1_score', 'precision', 'recall', 'confusion matrix', 'Report']

# Train the models and collect one result row per classifier
rows = []
for cls in classifiers:
    model3 = cls.fit(X_train_res, y_train_res)
    y_train_pred3 = model3.predict(X_train_res)
    y_test_pred3 = model3.predict(X_test)

    train_accuracy = accuracy_score(y_train_res, y_train_pred3)
    test_accuracy = accuracy_score(y_test, y_test_pred3)

    fpr, tpr, _ = roc_curve(y_test, y_test_pred3)
    fpr1, tpr1, _ = roc_curve(y_train_res, y_train_pred3)

    train_auc = roc_auc_score(y_train_res, y_train_pred3)
    test_auc = roc_auc_score(y_test, y_test_pred3)

    # Use a local name that does not shadow sklearn's f1_score function
    f1 = metrics.f1_score(y_test, y_test_pred3)
    precision = metrics.precision_score(y_test, y_test_pred3)
    recall = metrics.recall_score(y_test, y_test_pred3)

    conf_mat = confusion_matrix(y_test, y_test_pred3)
    report = classification_report(y_test, y_test_pred3, digits=3, output_dict=True)

    rows.append({'classifiers': cls.__class__.__name__,
                 'fpr1': fpr1,
                 'tpr1': tpr1,
                 'fpr': fpr,
                 'tpr': tpr,
                 'train_accuracy': train_accuracy,
                 'test_accuracy': test_accuracy,
                 'train_auc': train_auc,
                 'test_auc': test_auc,
                 'f1_score': f1,
                 'precision': precision,
                 'recall': recall,
                 'confusion matrix': conf_mat,
                 'Report': report})

# Build the result table once from the collected rows
result_table3 = pd.DataFrame(rows, columns=result_columns)

# Set name of the classifiers as index labels
result_table3.set_index('classifiers', inplace=True)
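
The `VotingClassifier` in the list above is configured with `voting='soft'`, which averages the class probabilities predicted by its base estimators instead of counting their votes. A minimal, standard-library sketch of that difference follows; the three probabilities are made-up values for a single sample, not outputs of the models in this notebook.

```python
# Hypothetical P(class = 1) from three base estimators for one sample
probs = [0.45, 0.45, 0.90]

# Hard voting: each estimator casts a 0/1 vote, majority wins
hard_votes = [1 if p >= 0.5 else 0 for p in probs]
hard_prediction = 1 if sum(hard_votes) > len(hard_votes) / 2 else 0

# Soft voting: average the probabilities, then threshold the average
soft_probability = sum(probs) / len(probs)
soft_prediction = 1 if soft_probability >= 0.5 else 0

print("hard votes:", hard_votes, "->", hard_prediction)
print("mean probability: %.2f -> %d" % (soft_probability, soft_prediction))
```

Two of the three estimators lean toward class 0, so the hard vote returns 0, while the single confident estimator lifts the average probability enough for the soft vote to return 1.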

In [None]:
result_table3.rename(index={'VotingClassifier':'Model Ensemble'},inplace=True)

result_table3

In [None]:
fig = plt.figure(figsize=(15,10))

for i in range(result_table3.shape[0]):
    plt.plot(result_table3.iloc[i,]['fpr'], 
             result_table3.iloc[i,]['tpr'], 
             label="{}, AUC={:.3f}".format(result_table3.index[i], result_table3.iloc[i,]['test_auc']))
    
plt.plot([0,1], [0,1], color='orange', linestyle='--')
plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)
plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)
plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')
plt.show()
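
ROC curves and AUC are defined over continuous scores, so it matters what is passed as the second argument to `roc_curve` and `roc_auc_score`. With 0/1 labels the curve has a single interior point, and the AUC reduces to the average of sensitivity and specificity. The sketch below illustrates this with a pairwise-ranking AUC written in plain Python. The labels and scores are toy values chosen for illustration, and `auc_from_scores` is a hypothetical helper, not part of scikit-learn.

```python
def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumption): true labels and predicted probabilities of class 1
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.45, 0.4, 0.6, 0.55, 0.8, 0.9]

# Hard labels obtained by thresholding the scores at 0.5
labels = [1 if s >= 0.5 else 0 for s in scores]

auc_scores = auc_from_scores(y_true, scores)
auc_labels = auc_from_scores(y_true, labels)

print("AUC from probabilities: %.3f" % auc_scores)
print("AUC from hard labels:   %.3f" % auc_labels)
```

With scikit-learn estimators, scores of this kind would typically come from `predict_proba(X)[:, 1]` or `decision_function(X)`.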

In [None]:
plt.figure(figsize=(12,7))
plt.plot(result_table3[['test_accuracy', 'test_auc', 'f1_score', 'precision', 'recall']])
plt.xlabel('Models')
plt.xticks(rotation=90)
plt.ylabel('Scores')
plt.title('Result of Models')
plt.legend(['Accuracy', 'ROC_AUC','f1_score','precision','recall'])
plt.show();

In [None]:
plt.figure(figsize=(12,7))
plt.plot(result_table3[['test_auc', 'recall']])
plt.xlabel('Models')
plt.xticks(rotation=90)
plt.ylabel('Score')
plt.title('Result of Models')
plt.legend([ 'ROC_AUC','Recall'])
plt.show();

In [None]:
# Baseline Model
result_table.iloc[:,[4,5,6,7,8,9,10]]

In [None]:
# Oversampling with RandomOverSampler
result_table1.iloc[:,[4,5,6,7,8,9,10]]

In [None]:
# Undersampling
result_table2.iloc[:,[4,5,6,7,8,9,10]]

In [None]:
# Oversampling with SMOTE
result_table3.iloc[:,[4,5,6,7,8,9,10]]

# Feature Importance and Final Model Selection
<a id="keyfeatures"></a>

## Feature Importance in RandomOverSampler
<a id="keyfeatures1"></a>

*After comparing the four approaches on AUC, F1, precision, and recall, we select the best model: XGBoost trained on the RandomOverSampler data.*

In [None]:
xgb = XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=4, min_child_weight=6, gamma=0.4, subsample=0.8, colsample_bytree=0.8, nthread=4, scale_pos_weight=1,seed=27)
model_xgb = xgb.fit(X_train_over, y_train_over)
y_train_xgb = model_xgb.predict(X_train_over)
y_test_xgb = model_xgb.predict(X_test)

print(confusion_matrix(y_test,y_test_xgb))
print(classification_report(y_test,y_test_xgb, digits=3))

print('Train accuracy: %0.3f' % accuracy_score(y_train_over, y_train_xgb))
print('Test accuracy: %0.3f' % accuracy_score(y_test, y_test_xgb))

print('Train AUC: %0.3f' % roc_auc_score(y_train_over, y_train_xgb))
print('Test AUC: %0.3f' % roc_auc_score(y_test, y_test_xgb))

### SHAP Values

In [None]:
import shap
expl_xgb = shap.TreeExplainer(model_xgb)
shap_xgb = expl_xgb.shap_values(X_train_over)

In [None]:
shap.summary_plot(shap_xgb, X_train_over, plot_type="bar")

In [None]:
shap.summary_plot(shap_xgb, X_train_over)

In [None]:
# XGBoost's Tree SHAP algorithm computes SHAP values with respect to the margin (log odds),
# not the transformed probability. link='logit' is added to display probability values instead.
shap.initjs()
shap.force_plot(expl_xgb.expected_value, shap_xgb[1050,:], X_train_over.iloc[1050,:], link='logit')

*For example, the force plot above shows the prediction (the output value) for the observation at row position 1050, which is lower than the base value. The base value is “the value that would be predicted if we did not know any features for the current output”; in other words, it is the mean prediction, mean(yhat). Of all the features, duration pushes this prediction lower the most.*
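
The relationship between the base value, the per-feature SHAP values, and the displayed output can be sketched with plain arithmetic. All numbers below are hypothetical placeholders, not values taken from `shap_xgb`. The sketch only illustrates that SHAP values add up to the model's log-odds output and that the logistic function maps log odds to a probability.

```python
import math


def sigmoid(x):
    """Logistic function: maps log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical base value (expected model output) in log-odds units
base_value = -0.2

# Hypothetical SHAP values for one observation, also in log-odds units
contributions = {"duration": -1.5, "housing": 0.3, "age": 0.1}

# Additivity: base value plus all contributions gives the model's margin
margin = base_value + sum(contributions.values())

# Convert both quantities to the probability scale
base_probability = sigmoid(base_value)
output_probability = sigmoid(margin)

# Feature with the largest absolute contribution
strongest = max(contributions, key=lambda name: abs(contributions[name]))

print("margin (log odds): %.2f" % margin)
print("base probability: %.3f, output probability: %.3f" % (base_probability, output_probability))
print("strongest feature:", strongest)
```

In this toy case the large negative contribution of `duration` pulls the output below the base value, which mirrors the reading of the force plot.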

In [None]:
# XGBoost's Tree SHAP algorithm computes SHAP values with respect to the margin (log odds),
# not the transformed probability. link='logit' is added to display probability values instead.
shap.initjs()
shap.force_plot(expl_xgb.expected_value, shap_xgb[4000,:], X_train_over.iloc[4000,:], link='logit')

*The force plot above shows the prediction (the output value) for the observation at row position 4000, which is higher than the base value. Of all the features, duration pushes this prediction higher the most.*

In [None]:
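# Share of positive labels in the oversampled training set, for comparison with the base value in the force plots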
y_train_over.mean()

In [None]:
X_train_over.mean()

In [None]:
X_train_over.iloc[4000,]

### Feature Selection

In [None]:
from sklearn.feature_selection import SelectFromModel
from numpy import sort
# Refit the model on progressively smaller feature subsets, using each importance value as a threshold
thresholds = sort(model_xgb.feature_importances_)
for thresh in thresholds:
    # select the features whose importance is at least the threshold
    selection = SelectFromModel(model_xgb, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train_over)
    # train model on the selected features
    selection_model = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4, min_child_weight=6, gamma=0.4, subsample=0.8, colsample_bytree=0.8, nthread=4, scale_pos_weight=1, seed=27)
    selection_model.fit(select_X_train, y_train_over)
    # evaluate on the test set restricted to the same features
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    auc = roc_auc_score(y_test, y_pred)
    print("Thresh=%.3f, n=%d, AUC: %.2f%%" % (thresh, select_X_train.shape[1], auc * 100.0))
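
The sweep above relies on the selection rule of `SelectFromModel`: a feature is kept when its importance is greater than or equal to the threshold. Using each sorted importance as the threshold therefore drops the least important remaining feature at every step. A plain-Python sketch of that rule follows; the importance values are made up for illustration and are not the fitted model's importances.

```python
# Hypothetical feature importances (assumption, not from model_xgb)
importances = {"duration": 0.40, "age": 0.25, "housing": 0.20, "loan": 0.15}

sweep = []
for thresh in sorted(importances.values()):
    # Keep every feature whose importance is at least the threshold
    kept = [name for name, imp in importances.items() if imp >= thresh]
    sweep.append((thresh, kept))
    print("Thresh=%.2f, n=%d, kept=%s" % (thresh, len(kept), kept))
```

The first iteration keeps all features and the last keeps only the most important one, so the printed AUC values in the notebook loop trace how performance changes as features are removed.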

In [None]:
def select_features(X_train, y_train, X_test):
    # configure to select a subset of features
    fs = SelectFromModel(XGBClassifier(learning_rate=0.1,
                                       n_estimators=140,
                                       max_depth=4,
                                       min_child_weight=6,
                                       gamma=0.4,
                                       subsample=0.8,
                                       colsample_bytree=0.8,
                                       nthread=4,
                                       scale_pos_weight=1,
                                       seed=27), max_features=22).fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
 

In [None]:
X_train_fs, X_test_fs, fs = select_features(X_train_over, y_train_over, X_test)
# fit the model
model = XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=4, min_child_weight=6, gamma=0.4, subsample=0.8, colsample_bytree=0.8, nthread=4, scale_pos_weight=1,seed=27)
model.fit(X_train_fs, y_train_over)
y_train_pred = model.predict(X_train_fs)
# evaluate the model
yhat = model.predict(X_test_fs)

In [None]:
feature_idx = fs.get_support()
feature_name = X_train_over.columns[feature_idx]
feature_name

In [None]:
print(confusion_matrix(y_test,yhat))
print(classification_report(y_test,yhat, digits=3))

print('Train accuracy: %0.3f' % accuracy_score(y_train_over, y_train_pred))
print('Test accuracy: %0.3f' % accuracy_score(y_test, yhat))

print('Train AUC: %0.3f' % roc_auc_score(y_train_over, y_train_pred))
print('Test AUC: %0.3f' % roc_auc_score(y_test, yhat))

In [None]:
shap1 = shap.TreeExplainer(model)
shap_xgb1 = shap1.shap_values(X_train_fs)

In [None]:
shap.summary_plot(shap_xgb1, X_train_fs, plot_type="bar")

In [None]:
shap.summary_plot(shap_xgb1, X_train_fs)

In [None]:
shap.initjs()
shap.force_plot(shap1.expected_value, shap_xgb1[4000,:], X_train_fs[4000,:], link='logit')

## Feature Importance in RandomUnderSampler
<a id="keyfeatures2"></a>
*The second-best model was XGBoost trained on the RandomUnderSampler data, so let's take a look at it as well.*

In [None]:
xgb_under = XGBClassifier(learning_rate =0.1,n_estimators=140,max_depth=4,min_child_weight=5,gamma=0.4,subsample=0.8,colsample_bytree=0.8,nthread=4,scale_pos_weight=1,seed=27)
xgb_under.fit(X_train_under, y_train_under)
y_train_pred= xgb_under.predict(X_train_under)

# making predictions on the testing set 
y_pred = xgb_under.predict(X_test)

print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

print('Train accuracy:%0.3f' %(accuracy_score(y_train_under, y_train_pred)))
print('Test accuracy:%0.3f' %(accuracy_score(y_test, y_pred)))

print('AUC score for train: %0.3f ' % (roc_auc_score(y_train_under,y_train_pred)))
print('AUC score for test: %0.3f' % (roc_auc_score(y_test, y_pred)))

#Draw ROC Curve
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred)
roc_auc = metrics.roc_auc_score(y_test, y_pred)
plt.figure(figsize=(12,6))
plt.plot(fpr,tpr,label="ROC Curve (area = %0.2f)" % roc_auc , color='darkorange')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlabel('False Positive Rate (1-Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

In [None]:
import shap
explainer = shap.TreeExplainer(xgb_under)
shap_under = explainer.shap_values(X_train_under)

In [None]:
shap.summary_plot(shap_under, X_train_under, plot_type="bar")

In [None]:
shap.summary_plot(shap_under, X_train_under)

In [None]:
# XGBoost's Tree SHAP algorithm computes SHAP values with respect to the margin (log odds),
# not the transformed probability. link='logit' is added to display probability values instead.
shap.initjs()
shap.force_plot(explainer.expected_value, shap_under[1050,:], X_train_under.iloc[1050,:], link='logit')

In [None]:
# XGBoost's Tree SHAP algorithm computes SHAP values with respect to the margin (log odds),
# not the transformed probability. link='logit' is added to display probability values instead.
shap.initjs()
shap.force_plot(explainer.expected_value, shap_under[4000,:], X_train_under.iloc[4000,:], link='logit')

In [None]:
from sklearn.feature_selection import SelectFromModel
from numpy import sort
# Refit the model on progressively smaller feature subsets, using each importance value as a threshold
thresholds = sort(xgb_under.feature_importances_)
for thresh in thresholds:
    # select the features whose importance is at least the threshold
    selection = SelectFromModel(xgb_under, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train_under)
    # train model on the selected features
    selection_model = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4, min_child_weight=5, gamma=0.4, subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
    selection_model.fit(select_X_train, y_train_under)
    # evaluate on the test set restricted to the same features
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    auc = roc_auc_score(y_test, y_pred)
    print("Thresh=%.3f, n=%d, AUC: %.2f%%" % (thresh, select_X_train.shape[1], auc * 100.0))