## Project on Ensemble Techniques

                                                         

### About dataset:

The data relates to the direct marketing campaigns of a
Portuguese banking institution. The marketing campaigns
were based on phone calls. Often, more than one contact with
the same client was required in order to assess whether the product
(bank term deposit) would be subscribed ('yes') or not ('no').


### Domain 
Banking

### Context:
Leveraging customer information is paramount for most
businesses. In the case of a bank, attributes of customers like
the ones mentioned below can be crucial in strategizing a
marketing campaign when launching a new product.

### Objective:
The classification goal is to predict whether the client will subscribe
(yes/no) to a term deposit (variable y).

### Learning Outcomes:

* Exploratory Data Analysis

* Preparing the data to train a model

* Training and making predictions using an Ensemble
  Model

* Tuning an Ensemble model

### Import all necessary libraries

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import  KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import confusion_matrix,precision_score,classification_report,f1_score,roc_curve,roc_auc_score,auc,accuracy_score
from sklearn import metrics
import pylab as pl
%matplotlib inline
import warnings 
warnings.simplefilter("ignore")

#### Load dataset

In [2]:
df =pd.read_csv("../input/bankfullcsv/bank-full.csv")


In [None]:
df.head(10)

In [None]:
# Data types of the attributes
df.info()

The output above shows the data types of the attributes.
The dataset contains 7 integer attributes and 10 object attributes.

In [None]:
# Shape of the data
df.shape

In [None]:
#Checking the presence of missing values
df.isnull().sum()

There are no missing values in the dataset

In [None]:
#5 point summary of numerical attributes
df.describe().T

In [None]:
#Checking the presence of outliers
# An outlier is defined as a data point more than 1.5 times the interquartile range (IQR) below Q1 or above Q3.
numerical = ['age','balance','day','duration','campaign','pdays','previous']
Q1 = df[numerical].quantile(0.25)
Q3 = df[numerical].quantile(0.75)
IQR = Q3 - Q1
out = (df[numerical] < (Q1 - 1.5 * IQR)) | (df[numerical] > (Q3 + 1.5 * IQR))
out.sum()

Six columns in the dataset contain outliers.
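
The IQR rule used above can be checked on a small example. The sketch below is illustrative only: the helper name `iqr_outlier_mask` and the toy values are assumptions, not part of the bank dataset.

```python
import pandas as pd

def iqr_outlier_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Mark values lying outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return (frame < (q1 - k * iqr)) | (frame > (q3 + k * iqr))

# Toy data (made up): 'age' has no extreme values, 'balance' has one.
toy = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "balance": [100, 200, 300, 400, 500, 100000],
})

mask = iqr_outlier_mask(toy)
outlier_counts = mask.sum()  # number of flagged values per column
columns_with_outliers = int((outlier_counts > 0).sum())

print(outlier_counts)
print("Columns with at least one outlier:", columns_with_outliers)
```

Counting the columns where the per-column sum is greater than zero is one way to arrive at a statement such as "N columns contain outliers".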

In [None]:
#finding unique data
df.apply(lambda x: len(x.unique()))

#### Data Distribution of features

Distribution of some of the categorical features

In [None]:
fig,ax = plt.subplots(figsize=(12,8))
ax=sns.countplot(x='job',hue='Target',data=df,order=df['job'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
ax.set_title("Job Type Vs Target")
plt.legend()
plt.show()


Most customers belong to the blue-collar job type.
However, more customers in management jobs have subscribed to the term deposit.
There is also an unknown category in the job type, which needs to be handled.

In [None]:
fig,axes = plt.subplots(figsize=(12,8))
ax=sns.countplot(x='marital',hue='Target',data=df,order=df['marital'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
ax.set_title("Marital Status Vs Target")
plt.legend()
plt.show()


More than half of the customers are married. Fewer divorced customers have said yes to the term deposit.

In [None]:
fig,axes = plt.subplots(figsize=(10,8))
ax=sns.countplot(x='education',hue='Target',data=df,order=df['education'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
ax.set_title("Education Vs Target")
plt.legend()
plt.show()


Most of the contacted customers have a secondary level of education.

In [None]:
fig,axes = plt.subplots(figsize=(10,8))
ax=sns.countplot(x='default',hue='Target',data=df,order=df['default'].value_counts().index)
ax.set_title("Credit in default Vs Target")
plt.legend()
plt.show()


The vast majority of customers do not have credit in default.

In [None]:
fig,axes = plt.subplots(figsize=(10,8))
ax=sns.countplot(x='housing',hue='Target',data=df,order=df['housing'].value_counts().index)
ax.set_title("Housing Vs Target")
plt.legend()
plt.show()


More customers have a housing loan than not; however, customers without a housing loan said yes to the term deposit more often.

In [None]:
fig,axes = plt.subplots(figsize=(10,8))
ax=sns.countplot(x='loan',hue='Target',data=df,order=df['loan'].value_counts().index)
ax.set_title("Personal Loan Vs Target")
plt.legend()
plt.show()


The majority of customers do not have a personal loan.

In [None]:
fig,axes = plt.subplots(figsize=(12,6))
ax=sns.countplot(x='contact',hue='Target',data=df,order=df['contact'].value_counts().index)
ax.set_title("Contact communication type Vs Target")
plt.legend()
plt.show()

Most customers were contacted by cellular phone. Many records have an unknown communication type. However, the communication type does not appear to affect the target variable.

In [None]:
fig,axes = plt.subplots(figsize=(12,6))
ax=sns.countplot(x='month',hue='Target',data=df,order=df['month'].value_counts().index)
ax.set_title("Last contact month Vs Target")
plt.legend()
plt.show()

Most customers were contacted in the month of May.

In [None]:
fig,axes = plt.subplots(figsize=(12,6))
ax=sns.countplot(x='poutcome',hue='Target',data=df,order=df['poutcome'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
ax.set_title("Previous campaign Outcome Vs Target")
plt.legend()
plt.show()

More than 75% of the values in this column are unknown, and dropping those rows would cause a huge loss of data.
The graph also shows that customers who subscribed in the previous campaign are more likely to say yes in the current campaign.

In [None]:
sns.lmplot(x='campaign', y='duration', data=df, fit_reg=False, hue='Target',height=8)
plt.xlabel('Number of Calls')
plt.ylabel('Duration of Calls (Seconds)')
plt.title('The Relationship between the Number and Duration of Calls (with Response Result)')
plt.show()

The graph shows that the number of customers saying yes increases with call duration.
However, as the number of calls increases, customers become more likely to say no.

Distribution of numerical features


In [None]:
df[numerical].hist(figsize=(15,10))
plt.show()

* Age - The distribution is slightly right-skewed.
* Balance - The distribution is highly right-skewed, and most customers have a balance of less than 5000 euros.
* Campaign - The distribution is highly right-skewed.
* Day - The distribution shows that most customers were contacted around the middle of the month.
* Duration - The distribution is highly right-skewed.
* pdays - Few customers had been contacted previously by the bank. The distribution is highly skewed.
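
The visual impression of skew can also be quantified with the sample skewness. The following is a minimal sketch on synthetic data (the column names and distributions are assumptions chosen for illustration); on the real data the same call would be `df[numerical].skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one roughly symmetric, one with a long right tail.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=40, scale=10, size=5000),
    "right_skewed": rng.exponential(scale=1000, size=5000),
})

# Skewness near 0 suggests symmetry; clearly positive values indicate a right skew.
skewness = toy.skew()
print(skewness)
```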



#### Distribution of Target Column

In [None]:
df.groupby("Target").agg({'Target': 'count'})

In [None]:
fig,axes = plt.subplots(figsize=(8,7) )
df['Target'].value_counts(sort=True).plot(kind='pie',autopct='%1.1f%%', fontsize= 20,startangle=130)
plt.legend(['Rejected ','Accepted '])
plt.title('Percentage of customers for Accepting / Rejecting the offer')
plt.show()

The graph shows that the dataset is highly imbalanced. Only 11.7% of customers accepted the term deposit.

##### Correlation between columns

In [None]:
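# Plot for visualising the correlation between the numerical variables.
# numeric_only=True skips the object columns, which newer pandas versions no longer drop silently.
fig,ax = plt.subplots( figsize=(16,8) )
sns.heatmap(df.corr(numeric_only=True),annot=True)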
plt.title('Heatmap for Correlation')
plt.show()

### Preparation of data 

#### Get rid of missing or unknown values

In [None]:
print("List of unique values in poutcome \n", df['poutcome'].value_counts())
print("\nList of unique values in education \n", df['education'].value_counts())
print("\nList of unique values in job \n", df['job'].value_counts())

There are 36959 unknown values in the poutcome column, which is about 81.7% of the total. Dropping these rows would cause a significant loss of data, so we keep them and treat unknown as a category of this feature.
There are also 1840 'other' values, which we can merge into the unknown category.

In [None]:
df['poutcome'] = df['poutcome'].replace('other','unknown')
print(df['poutcome'].unique())

There are 288 unknown values in the job column and 1857 unknown values in the education column, which is around 0.6% and 4% of the data respectively.
Hence we can drop those rows.
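
One way to obtain such percentages is to compare every cell with the string 'unknown' and average the result per column. The sketch below uses a made-up four-row frame, so its values are not those of the bank data.

```python
import pandas as pd

# Toy frame (made up) with 'unknown' used as a placeholder category.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "management"],
    "education": ["primary", "secondary", "unknown", "unknown"],
})

# Mean of a boolean column is the fraction of True values; times 100 gives a percentage.
unknown_share = (toy == "unknown").mean() * 100
print(unknown_share)
```

Columns with a small share are candidates for dropping rows, while columns with a large share are better kept with 'unknown' treated as its own category.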

In [None]:
df.drop(df[df['job']=='unknown'].index,inplace=True,axis=0)
df.drop(df[df['education']=='unknown'].index,inplace=True,axis=0)
print("Unique values in job",df['job'].unique())
print("Unique values in education ",df['education'].unique())

#### Converting Categorical features into numerical

In [None]:
categorical_column = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month','poutcome']

In [None]:
df_encoded = pd.get_dummies(df,columns=categorical_column)

In [None]:
df_encoded.columns

* The duration column is dropped from the data because it strongly affects the output, and the call duration is not known before a call is made. This input should
only be included for benchmark purposes and should be
discarded if the intention is to have a realistic predictive
model. 

In [None]:
df_encoded.drop("duration",axis=1,inplace=True)

#### Separating the target column

In [None]:
df_encoded['Target'] = df_encoded['Target'].map({'yes': 1, 'no': 0})

In [None]:
y = df_encoded["Target"]
X = df_encoded.drop("Target",axis=1)

#### Splitting the data set

In [None]:
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size=0.30, random_state=1)

#### Scaling the data

In [None]:
sc= StandardScaler()
scaledX_train = sc.fit_transform(X_train)
scaledX_test = sc.transform(X_test)

#### Applying different Classification Models

* The classification models used are Logistic Regression and Decision Trees.
* KNN is not used because k-NN tends to perform poorly on imbalanced data, and this dataset is highly imbalanced.
* An SVM model is harder for humans to understand and interpret than a Decision Tree, and the focus here is on ensemble models. Several ensemble algorithms are used for training.

##### Logistic Regression

In [None]:
lr = LogisticRegression(C=1)
lr.fit(scaledX_train,y_train)
lr_pred = lr.predict(scaledX_test)
lr_training = lr.score(scaledX_train,y_train)
lr_testing = lr.score(scaledX_test,y_test)
lr_precision = precision_score(y_test,lr_pred)
lr_f1 = f1_score(y_test,lr_pred)
print("Training Accuracy :", lr_training)
print("Testing Accuracy :",lr_testing )
print("Precision :",lr_precision )
print("F1 Score: ",lr_f1 )
print('Confusion Matrix - Logistic Regression :\n\n',confusion_matrix(y_test, lr_pred) )
#tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()

* The class distribution is uneven, so accuracy and precision are not good metrics for comparison.
* The F1 score can be used to compare the models instead.
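
To illustrate why accuracy can mislead on imbalanced data, the sketch below scores a trivial classifier that always predicts the majority class. The 88/12 label split is made up for illustration and only roughly mirrors the imbalance in this dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 100 made-up labels: 88 negatives (class 0) and 12 positives (class 1).
y_true = np.array([0] * 88 + [1] * 12)

# A "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
# zero_division=0 avoids a warning when no positive is ever predicted.
baseline_f1 = f1_score(y_true, y_majority, zero_division=0)

print("Accuracy:", baseline_accuracy)
print("F1 score:", baseline_f1)
```

The accuracy looks respectable even though not a single subscriber is identified, whereas the F1 score for the positive class exposes the problem.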

  Logistic Regression - 
* Training and testing accuracy are good; however, there is a high number of false positives.
* The F1 score of this model is 0.28.

In [None]:
print('F1 Score\n :',classification_report(y_test, lr_pred))

#### Decision Tree

In [None]:
# Creating a decision tree model with max_depth = 1

dt = DecisionTreeClassifier(criterion='entropy',max_depth=1)
dt.fit(scaledX_train,y_train)
dt_pred = dt.predict(scaledX_test)
dt_training = dt.score(scaledX_train,y_train)
dt_testing = dt.score(scaledX_test,y_test)
dt_precision = precision_score(y_test,dt_pred)
dt_f1 = f1_score(y_test,dt_pred)
print("Training Accuracy :", dt_training)
print("Testing Accuracy :",dt_testing )
print("F1 Score: ",dt_f1 )
print('Decision Tree Confusion matrix :\n\n',confusion_matrix(y_test, dt_pred) )

Decision Trees 
* Good testing and training accuracy.
* False positives are reduced compared with the model above (still high).
* F1 score is 0.25

In [None]:
print('F1 Score\n :',classification_report(y_test, dt_pred))

### Models on Ensemble 

#### Random Forest

In [None]:
rf = RandomForestClassifier(criterion='entropy',max_depth=50,n_estimators=50)
rf.fit(scaledX_train,y_train)
rf_pred = rf.predict(scaledX_test)
rf_training = rf.score(scaledX_train,y_train)
rf_testing = rf.score(scaledX_test,y_test)
rf_precision = precision_score(y_test,rf_pred)
rf_f1 = f1_score(y_test,rf_pred)
print("Training Accuracy :", rf_training)
print("Testing Accuracy :",rf_testing )
print("F1 Score: ",rf_f1 )
print('Random Forest Confusion matrix :\n\n',confusion_matrix(y_test, rf_pred) )

Random Forest
* Testing and training accuracy are good.
* False positives are comparatively lower than for the single classification models.
* F1 score is 0.29

In [None]:
print('F1 Score\n :',classification_report(y_test, rf_pred))

#### Bagging Classifier

In [None]:
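# The base estimator is passed positionally because its keyword is base_estimator in older scikit-learn releases and estimator in newer ones
bg = BaggingClassifier(DecisionTreeClassifier(),n_estimators=500,bootstrap=True,max_samples=100)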
bg.fit(scaledX_train,y_train)
bg_pred = bg.predict(scaledX_test)
bg_training = bg.score(scaledX_train,y_train)
bg_testing = bg.score(scaledX_test,y_test)
bg_precision = precision_score(y_test,bg_pred)
bg_f1 = f1_score(y_test,bg_pred)
print("Training Accuracy :", bg_training)
print("Testing Accuracy :",bg_testing )
print("F1 Score: ",bg_f1 )
print('Bagging Classifier Confusion matrix :\n\n',confusion_matrix(y_test, bg_pred) )

Bagging Classifier
* Testing and training accuracy are good.
* False positives are comparatively lower than for the single classification models.
* F1 score is 0.23

In [None]:
print('F1 Score\n :',classification_report(y_test, bg_pred))

#### AdaBoost Classifier

In [None]:
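# The base estimator is passed positionally because its keyword is base_estimator in older scikit-learn releases and estimator in newer ones
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),n_estimators=10,learning_rate=0.5)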
ada.fit(scaledX_train,y_train)
ada_pred = ada.predict(scaledX_test)
ada_training = ada.score(scaledX_train,y_train)
ada_testing = ada.score(scaledX_test,y_test)
ada_f1 = f1_score(y_test,ada_pred)
ada_precision = precision_score(y_test,ada_pred)
print("Training Accuracy :", ada_training)
print("Testing Accuracy :",ada_testing )
print("F1 Score: ",ada_f1 )
print('AdaBoost Classifier Confusion matrix :\n\n',confusion_matrix(y_test, ada_pred) )


AdaBoost Classifier
* Testing and training accuracy are good.
* High number of False positives (more than bagging classifier)
* F1 score is 0.23

In [None]:
print('F1 Score\n :',classification_report(y_test, ada_pred))

#### GradientBoost Classifier


In [None]:
gbc = GradientBoostingClassifier(learning_rate=0.02,n_estimators=65)
gbc.fit(scaledX_train,y_train)
gbc_pred = gbc.predict(scaledX_test)
gbc_training = gbc.score(scaledX_train,y_train)
gbc_testing = gbc.score(scaledX_test,y_test)
gbc_f1 = f1_score(y_test,gbc_pred)
gbc_precision = precision_score(y_test,gbc_pred)
print("Training Accuracy :", gbc_training)
print("Testing Accuracy :",gbc_testing )
print("F1 Score: ",gbc_f1 )
print('GradientBoost Classifier Confusion matrix :\n\n',confusion_matrix(y_test, gbc_pred) )


GradientBoost Classifier
* Testing and training accuracy are good.
* False positives are comparatively low (lowest among the models).
* F1 score is 0.20 

In [None]:
print('F1 Score\n :',classification_report(y_test, gbc_pred))

### ROC for different models

### ROC for ensemble models

In [None]:
# The models were trained on scaled features, so probabilities are predicted on scaledX_test

# ROC for Random Forest
rf_prob = rf.predict_proba(scaledX_test)
fpr,tpr,thresh=roc_curve(y_test,rf_prob[:,1])
auc1 = auc(fpr,tpr)
print("Area under the curve for Random Forest ", auc1)

# ROC for Bagging
bg_prob = bg.predict_proba(scaledX_test)
fpr1,tpr1,thresh1=roc_curve(y_test,bg_prob[:,1])
auc2 = auc(fpr1,tpr1)
print("Area under the curve for Bagging ", auc2)

# ROC for AdaBoost
ada_prob = ada.predict_proba(scaledX_test)
fpr2,tpr2,thresh2=roc_curve(y_test,ada_prob[:,1])
auc3 = auc(fpr2,tpr2)
print("Area under the curve for AdaBoost ", auc3)

# ROC for GradientBoost
gbc_prob = gbc.predict_proba(scaledX_test)
fpr3,tpr3,thresh3=roc_curve(y_test,gbc_prob[:,1])
auc4 = auc(fpr3,tpr3)
print("Area under the curve for GradientBoost ", auc4)

# ROC for Logistic Regression
lr_prob = lr.predict_proba(scaledX_test)
lr_fpr,lr_tpr,lr_thresh=roc_curve(y_test,lr_prob[:,1])
lr_auc = auc(lr_fpr,lr_tpr)
print("Area under the curve for Logistic Regression ", lr_auc)

# ROC for Decision Tree
dt_prob = dt.predict_proba(scaledX_test)
dt_fpr,dt_tpr,dt_thresh=roc_curve(y_test,dt_prob[:,1])
dt_auc = auc(dt_fpr,dt_tpr)
print("Area under the curve for Decision Trees ", dt_auc)


In [None]:
#Plot the ROC curve 
plt.clf()
fig, ax= plt.subplots(nrows = 2, ncols = 2, figsize = (12,10))
ax[0,0].plot(fpr, tpr, label='AUC area = %0.2f' % auc1)
ax[0,0].plot([0, 1], [0, 1], 'k--')
ax[0,0].set_xlabel('False Positive Rate')
ax[0,0].set_ylabel('True Positive Rate')
ax[0,0].set_title('ROC for Random Forest')
ax[0,0].legend(loc="lower right")

ax[0,1].plot(fpr1, tpr1, label='AUC = %0.2f' % auc2)
ax[0,1].plot([0, 1], [0, 1], 'k--')
ax[0,1].set_xlabel('False Positive Rate')
ax[0,1].set_ylabel('True Positive Rate')
ax[0,1].set_title('ROC for Bagging')
ax[0,1].legend(loc="lower right")

ax[1,0].plot(fpr2, tpr2, label='AUC = %0.2f' % auc3)
ax[1,0].plot([0, 1], [0, 1], 'k--')
ax[1,0].set_xlabel('False Positive Rate')
ax[1,0].set_ylabel('True Positive Rate')
ax[1,0].set_title('ROC for AdaBoost')
ax[1,0].legend(loc="lower right")

ax[1,1].plot(fpr3, tpr3, label='AUC = %0.2f' % auc4)
ax[1,1].plot([0, 1], [0, 1], 'k--')
ax[1,1].set_xlabel('False Positive Rate')
ax[1,1].set_ylabel('True Positive Rate')
ax[1,1].set_title('ROC for GradientBoost')
ax[1,1].legend(loc="lower right")


plt.show()



#### Comparing different models

In [None]:
df_compare = pd.DataFrame([[lr_training,lr_testing,lr_precision,lr_f1,lr_auc],[dt_training,dt_testing,dt_precision,dt_f1,dt_auc],
                          [rf_training,rf_testing,rf_precision,rf_f1,auc1],[bg_training,bg_testing,bg_precision,bg_f1,auc2],
                          [ada_training,ada_testing,ada_precision,ada_f1,auc3],[gbc_training,gbc_testing,gbc_precision,gbc_f1,auc4]],
    columns=['Training Accuracy','Testing Accuracy','Precision Score','F1Score','AUC'],
                       index=['Logistic Regression','DecisionTrees',
                              'RandomForest','BaggingClassifier','AdaBoost','GradientBoosting'])
df_compare

#### Key Observations

* As the dataset is highly imbalanced, most of the algorithms have a training accuracy in the range of 88-89.9%.
* The models are not trained equally on both classes (the majority is class 0, not accepting the term deposit). So the models perform well for class 0 and poorly on the positive class, class 1. This is the main reason for the inferior precision and F1 scores.
* The training accuracy of the models is somewhat larger than the testing accuracy.

#### Some more observations

* We cannot compare the models on the basis of accuracy, precision or recall alone, as the dataset is highly imbalanced.
* Other metrics that can be used for comparison are the F1 score and AUC.
* Given the context and objective of the project, reducing the number of false positives should be given more importance, so it can also be one of the factors for comparison.
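
To compare models on false positives, the individual counts can be read out of the confusion matrix. The sketch below uses made-up labels; for a binary problem scikit-learn orders the flattened matrix as TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for eight customers.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision:", precision, "Recall:", recall)
```

Applied to the notebook, the same unpacking could be run on `confusion_matrix(y_test, gbc_pred)` and the other prediction arrays to rank models by `fp`.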


In [None]:
# Top 3 models with Highest testing accuracy are - 
df_compare.sort_values(ascending=False,by=['Testing Accuracy'])['Testing Accuracy'].head(3)

In [None]:
# Top 3 models with highest F1 Score are 
df_compare.sort_values(ascending=False,by=['F1Score'])['F1Score'].head(3)

In [None]:
# Top 3 Models with highest AUC score
df_compare.sort_values(ascending=False,by=['AUC'])['AUC'].head(3)

In [None]:
# Confusion matrices of the models, sorted by number of false positives in ascending order

print("Gradient Boost Classifier\n",confusion_matrix(y_test, gbc_pred))
print("Bagging Classifier \n",confusion_matrix(y_test, bg_pred))
print("AdaBoost Classifier\n",confusion_matrix(y_test, ada_pred))
print("Decision Tree Classifier \n",confusion_matrix(y_test, dt_pred))
print("Random Forest Classifier \n",confusion_matrix(y_test, rf_pred))
print("Logistic Regression \n",confusion_matrix(y_test, lr_pred))


#### Conclusion

* We cannot single out one model as performing better than all the others; it depends on the metric. By F1 score, Random Forest has the highest score, while by AUC, GradientBoost and AdaBoost have the highest values. GradientBoost also has fewer false positives than the other models.
* We can conclude that the ensemble models perform better than the single classification models.
* We may be able to improve the performance of the models by balancing the dataset with sampling techniques such as upsampling and downsampling.
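
As a minimal sketch of upsampling, the minority class can be resampled with replacement until it matches the majority class. The toy frame and its column names are assumptions for illustration; in practice this would be applied to the training split only, so that the test set keeps its original class distribution.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training frame (made up): 8 rows of class 0 and 2 rows of class 1.
train = pd.DataFrame({
    "feature": range(10),
    "Target": [0] * 8 + [1] * 2,
})

majority = train[train["Target"] == 0]
minority = train[train["Target"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)

balanced = pd.concat([majority, minority_upsampled])
class_counts = balanced["Target"].value_counts()
print(class_counts)
```

Downsampling works the same way in reverse: the majority class is resampled without replacement down to the size of the minority class.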