<a href="https://colab.research.google.com/github/sananda2005/Bank-Marketing-Effectiveness-Prediction/blob/main/Bank_Marketing_Effectiveness_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Predicting the effectiveness of bank marketing campaigns </u></b>

## <b> Problem Description </b>

### The data is related to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). The classification goal is to predict whether the client will subscribe to a term deposit (variable y).


## <b> Data Description </b>

## <b>Input variables: </b>
### <b> Bank Client data: </b>

* ### age (numeric)
* ### job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
* ### marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
* ### education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
* ### default: has credit in default? (categorical: 'no','yes','unknown')
* ### housing: has housing loan? (categorical: 'no','yes','unknown')
* ### loan: has personal loan? (categorical: 'no','yes','unknown')

### <b> Related with the last contact of the current campaign:</b>
* ### contact: contact communication type (categorical: 'cellular','telephone')
* ### month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
* ### day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
* ### duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

### <b>Other attributes: </b>
* ### campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
* ### pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
* ### previous: number of contacts performed before this campaign and for this client (numeric)
* ### poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')


### <b>Output variable (desired target):</b>
* ### y - has the client subscribed a term deposit? (binary: 'yes','no')

##**Importing The Libraries**

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import IsolationForest


#**Importing and loading our dataset**

In [2]:
# mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# reading csv file
data= pd.read_csv ("/content/drive/MyDrive/Bank Marketing Effectiveness Prediction/Copy of bank-full.csv",sep=';')
df=data.copy()

In [None]:
df.head()

In [None]:
df.tail()

#**Understanding of Dataset**

In [None]:
df.shape

In [None]:
df.columns

In [None]:
# detailed information about the features
df.info()

#**Checking null values**

In [None]:
# check null values
df.isnull().sum()

####**There are no null values in the dataset**
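`isnull()` only counts true missing values. The data description above lists 'unknown' as a category in several columns, so counting that placeholder is a useful complement. A minimal sketch on a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "student", "unknown"],
    "education": ["unknown", "basic.9y", "high.school", "basic.4y"],
    "age": [30, 41, 22, 57],
})

# True nulls: none in this toy frame
null_counts = toy.isnull().sum()

# Occurrences of the 'unknown' placeholder per column
unknown_counts = (toy == "unknown").sum()
print(unknown_counts)
```

Columns with many 'unknown' entries may deserve the same attention as columns with real nulls.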

#**Checking unique and duplicate values**

In [None]:
# checking unique values
df.nunique()

In [None]:
#Checking duplicate values
df.duplicated().sum()

####**There are no duplicate values present in the dataset**

In [None]:
# statistical summary of our data
df.describe(include='all')

#**DESCRIPTIVE ANALYSIS**
###There are two types of variables in our data
**1**.**Numerical** 

**2**.**Categorical**

###**List of Numerical features**

In [None]:
# list of numerical features
numerical_feature = list(df.select_dtypes(exclude=['object']))
numerical_feature

###**List of Categorical features**

In [None]:
# list of categorical features
categorical_feature = list(df.select_dtypes(include=['object']))
categorical_feature

#**EXPLORATORY DATA ANALYSIS (EDA)**

#**Target Variable**

####**Target Variable : y - has the client subscribed a term deposit (binary: 'yes', 'no')**

In [None]:
df.y.value_counts()

In [None]:
# Visualising the target variable
y_df = sns.countplot(x=df['y'])

####**As we can see, the data is highly imbalanced: the majority of the data points belong to the 'no' class.**

In [None]:
# pie chart of the percentage of subscribers and non-subscribers for term deposit (target variable)
labels = 'Not Subscribed', 'Subscribed'
sizes = df.y.value_counts()
colors = ['black','orange']
explode = (0.1,0.0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors, 
        autopct='%1.1f%%',shadow=True,startangle=200)
plt.axis('equal')
plt.title("Proportion of Subscribed & Not Subscribed term Deposit",fontsize=15)
plt.plot()
fig=plt.gcf()
fig.set_size_inches(8,7)
plt.show()


####**The plot above shows that the dataset is imbalanced: the Not-Subscribed class is close to 8 times the size of the Subscribed class.**
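The class shares and the imbalance ratio can also be read off numerically. A small sketch, assuming a made-up label series `y_toy` with an 8:1 split (not the real counts):

```python
import pandas as pd

# Hypothetical labels with an 8:1 imbalance
y_toy = pd.Series(["no"] * 8 + ["yes"] * 1)

counts = y_toy.value_counts()                 # absolute counts per class
shares = y_toy.value_counts(normalize=True)   # fraction per class
ratio = counts["no"] / counts["yes"]          # majority-to-minority ratio

print(shares.round(3))
print(f"'no' to 'yes' ratio: {ratio:.1f}")
```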

#**Univariate Analysis**

###**Let's begin performing EDA on the remaining columns of datapoints.**

##**Explore the Categorical Features**

In [None]:
# Each variable is represented by a bar graph.

#Countplot of categorical features
for i in categorical_feature:
  print('Column name : ' , i)
  print(data[i].value_counts())
  plt.figure(figsize=(10,8))
  sns.countplot(x = data[i])
  plt.xlabel(i)
  plt.title(format(i))
  plt.xticks(rotation=40)
  plt.show()

###**Categorical variable's graph representation related to the target variable**

In [None]:
#Countplot of categorical features
for i in categorical_feature:
  plt.figure(figsize=(12,8))
  sns.countplot(x=data[i] , hue=data['y'])
  plt.xlabel(i)
  plt.title(format(i))
  plt.xticks(rotation=40)
  plt.show()

#**From the above plots we can observe that:**

####**Most clients work in 'blue-collar', 'management', or 'technician' jobs.**
####**Retired clients show high interest in deposits.**
####**In March, September, October, and December, clients show high interest in deposits.**
####**May has the most records, but the share of interested clients is very low.**
####**The success rate is highest for students.**
####**Clients whose previous outcome is non-existent account for more subscriptions than any other previous-outcome group.**
####**Very few of the contacted clients have credit in default.**
####**Married clients have subscribed to deposits more than clients with any other marital status.**
####**Clients with a housing loan seem less interested in deposits.**
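Count plots show absolute numbers, while statements such as "the success rate is highest for students" are about rates. A per-category subscription rate makes that distinction explicit. The sketch below uses a tiny invented frame (`toy` and its rows are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical rows: few students, many blue-collar clients
toy = pd.DataFrame({
    "job": ["student", "student", "blue-collar", "blue-collar",
            "blue-collar", "blue-collar"],
    "y":   ["yes", "no", "yes", "no", "no", "no"],
})

# Share of 'yes' per job, alongside the group size
rate_by_job = (
    toy.assign(subscribed=toy["y"].eq("yes"))
       .groupby("job")["subscribed"]
       .agg(rate="mean", clients="size")
       .sort_values("rate", ascending=False)
)
print(rate_by_job)
```

A group can have the most clients and still a low rate, which is the pattern described for May above.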


##**Explore the numerical_feature**

In [None]:
#boxplot to show target distribution with respect numerical features
plt.figure(figsize=(25,30),facecolor='white')
plotnumber=1
for i in numerical_feature:
    ax = plt.subplot(12,3,plotnumber)
    sns.boxplot(x="y", y= df[i],data=df)
    plt.title(format(i))
    plt.xlabel(i)
    plotnumber+=1
plt.show()

####**The age distributions of the two target classes overlap heavily, so age alone does not separate them. Most of the clients called are in their 30s and 40s (the 25th to 75th percentiles span ages 33 to 48). Age will therefore have little impact on the prediction.**
####**There are many outliers in both the 'no' and the 'yes' class, but because the data is imbalanced we keep them at this stage.**


In [None]:
#Distribution plot of continuous feature
plt.figure(figsize=(25,60))
plotnumber =1
for i in numerical_feature:
    ax = plt.subplot(12,3,plotnumber)
    sns.histplot(data[i], kde=True, color='blue')  # distplot is deprecated in recent seaborn
    plt.title(format(i))
    plt.xlabel(i)
    plotnumber+=1
plt.show()

#**Take-away:**

####**Age and day appear to be roughly normally distributed.**

####**Balance, duration, campaign, pdays, and previous are all strongly right-skewed (long tail towards large values) and appear to contain some outliers.**

####**The majority of the customers, as shown in the distribution above, are between the ages of 30 and 40.**
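The direction of skew can be checked numerically instead of by eye: `DataFrame.skew()` is positive for a long right tail and close to zero for a symmetric distribution. A sketch on synthetic columns (the column names and distributions are assumptions, chosen only to show the two cases):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns: one symmetric, one with a long right tail
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=40, scale=10, size=5000),
    "long_right_tail": rng.exponential(scale=200, size=5000),
})

skewness = toy.skew()   # > 0: right-skewed, < 0: left-skewed
print(skewness.round(2))
```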

##**Correlation Matrix of the numerical features**

In [None]:
df.corr(numeric_only=True)

In [None]:
## Checking for correlation
cor_mat = df.corr(numeric_only=True)  # text columns are excluded from the matrix
fig = plt.figure(figsize=(12,6))
sns.heatmap(cor_mat,annot=True, cmap =plt.cm.Reds)

####**There is no variable highly correlated to y (Target variable).**
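Because `y` is still a text column at this stage, it does not take part in a numeric correlation matrix. One way to measure correlation with the target is to encode it first and use `corrwith`. A sketch on invented values (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical rows
toy = pd.DataFrame({
    "duration": [50, 300, 80, 600, 120, 700],
    "campaign": [3, 1, 4, 1, 2, 2],
    "y": ["no", "yes", "no", "yes", "no", "yes"],
})

# Encode the target as 0/1, then correlate each numeric feature with it
target = toy["y"].eq("yes").astype(int)
corr_with_target = toy[["duration", "campaign"]].corrwith(target)
print(corr_with_target.round(2))
```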

#**Data Preprocessing**

In [None]:
df.shape

In [None]:
df.head()

####**Some binary columns (default, housing, loan) are of object type and need to be converted to numeric values.**

####**There are also categorical columns with only a few distinct values: job, marital, education, contact, month, and poutcome. These must be transformed into a numerical format as well.**

####**The model can only be fed data once all feature columns have been converted to numeric values.**

####**For the default column, we can convert the 'yes' values to 1 and the 'no' values to 0.**

##**Creating one-hot encoding for non-numeric MARITAL column**



In [None]:
marital_dummies = pd.get_dummies(df['marital'], prefix= 'marital')
marital_dummies.head()

In [None]:
# combine the marital column and marital_dummies
pd.concat([df['marital'], marital_dummies], axis = 1).head()

####**As we can see, each row has exactly one value of 1, in the dummy column that corresponds to its value in the marital column.**

####**There are three dummy columns; if two of them are 0 for a given row, the third must be 1. Such redundancy and correlation between features should be eliminated, because it makes it harder to determine which feature matters most for minimising the overall error.**

####**So let's eliminate the marital_divorced column.**
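The same "drop one dummy" idea is available directly in `pd.get_dummies` through `drop_first=True`, which removes the first category in sorted order. A small sketch on made-up values (the `toy` frame is an assumption):

```python
import pandas as pd

# Hypothetical marital values
toy = pd.DataFrame({"marital": ["married", "single", "divorced", "married"]})

# Categories sort as divorced, married, single; the first one is dropped
dummies = pd.get_dummies(toy["marital"], prefix="marital", drop_first=True)
print(dummies.columns.tolist())
```

A row with zeros in every remaining dummy column then stands for the dropped category.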

In [None]:
# Eliminating marital_divorced column
marital_dummies.drop('marital_divorced', axis =1, inplace = True)
marital_dummies.head()

In [None]:
# merging marital_dummies into main dataframe
df = pd.concat([df, marital_dummies], axis = 1)
df.head()

##**Creating one hot encoding for JOB column**

In [None]:
job_dummies = pd.get_dummies(df['job'], prefix= 'job')
job_dummies.head()

In [None]:
# Eliminating job_admin. column
job_dummies.drop('job_admin.', axis=1, inplace=True)

In [None]:
# Merging job_dummies into main dataframe
df = pd.concat([df, job_dummies], axis=1)
df.head()

##**Creating one hot encoding for EDUCATION column**

In [None]:
education_dummies = pd.get_dummies(df['education'], prefix = 'education')
education_dummies.head()

In [None]:
# Eliminating education_primary column
education_dummies.drop('education_primary', axis=1, inplace=True)

In [None]:
# Merging education_dummies into main dataframe
df = pd.concat([df, education_dummies], axis=1)
df.head()

##**Creating one hot encoding for CONTACT column**

In [None]:
contact_dummies = pd.get_dummies(df['contact'], prefix = 'contact')
contact_dummies.head()

In [None]:
# Eliminating contact_cellular column
contact_dummies.drop('contact_cellular', axis=1, inplace=True)

In [None]:
# Merging contact_dummies into main dataframe
df = pd.concat([df, contact_dummies], axis=1)
df.head()

##**Creating one hot encoding for POUTCOME column**

In [None]:
poutcome_dummies = pd.get_dummies(df['poutcome'], prefix = 'poutcome')
poutcome_dummies.head()

In [None]:
# Eliminating poutcome_failure column
poutcome_dummies.drop('poutcome_failure', axis=1, inplace=True)

In [None]:
#Merging poutcome_dummies into main dataframe
df = pd.concat([df, poutcome_dummies], axis=1)
df.head()

####**The binary object columns (default, housing, and loan), the month column, and the target still need to be converted into numeric values before the data can be fed into the model.**

##**Converting month column into numeric value**

In [None]:
months = {'jan':1, 'feb':2, 'mar':3, 'apr':4, 'may':5, 'jun':6, 'jul':7, 'aug':8, 'sep':9, 'oct':10, 'nov':11, 'dec': 12}
df['month'] = df['month'].map(months)
df['month'].head(5)

####**For the default column, we can change the 'yes' values to 1 and the 'no' values to 0, using a lambda function.**

##**Converting default column into numeric value**

In [None]:
df['new_default'] = df['default'].apply(lambda row: 1 if row == 'yes' else 0 )
df[['default', 'new_default']].head()

In [None]:
df[df['pdays'] == -1]['pdays'].count()

In [None]:
df['was_contacted'] = df['pdays'].apply(lambda row: 0 if row == -1 else 1)
df[['pdays','was_contacted']].head()

##**Converting loan column into numeric value**

In [None]:
df['new_loan'] = df['loan'].apply(lambda row: 1 if row == 'yes' else 0)
df[['loan', 'new_loan']].head()

##**Converting housing column into numeric value**

In [None]:
df['new_housing'] = df['housing'].apply(lambda row : 1 if row == 'yes' else 0)
df[['housing', 'new_housing']].head()

##**Converting target column ‘y’ into numeric value**

In [None]:
df['y_target'] = df['y'].apply(lambda row: 1 if row == 'yes' else 0)
df[['y', 'y_target']].head()
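The yes/no conversions above can also be written with `Series.map`, which avoids a Python-level lambda per row. The sketch below uses an invented frame (`toy` is an assumption). One difference to note: `map` leaves any value other than 'yes' or 'no' as NaN, whereas the lambda form silently codes it as 0.

```python
import pandas as pd

# Hypothetical binary columns
toy = pd.DataFrame({
    "default": ["no", "yes", "no"],
    "housing": ["yes", "yes", "no"],
    "loan": ["no", "no", "yes"],
})

binary_cols = ["default", "housing", "loan"]
for col in binary_cols:
    # 'yes' -> 1, 'no' -> 0; anything else would become NaN
    toy[f"new_{col}"] = toy[col].map({"yes": 1, "no": 0})

print(toy)
```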

In [None]:
df.head()

####**Eliminating the columns for age, job, marital, education, default, housing, loan, day, contact, month, duration, poutcome, and y.**

In [None]:
df.drop(['job', 'education', 'marital', 'default', 'housing', 'loan', 'contact', 'poutcome', 'y','month','duration','age','day'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.shape

##**Removing outliers**

In [None]:
# removing the outlier using IsolationForest Technique

features = df.drop(['y_target'],axis=1)

anomaly_filter = IsolationForest(contamination=0.1,n_jobs=-1)
anomalies = pd.Series(anomaly_filter.fit_predict(features), index=features.index)  # keep labels aligned with df rows
df['new_anomaly'] = anomalies
df = df[df['new_anomaly']==1].drop(['new_anomaly'],axis=1)
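It helps to see what `contamination=0.1` implies: `fit_predict` returns 1 for rows that are kept and -1 for flagged rows, and roughly 10% of the rows are flagged by construction, whether or not they are true outliers. A sketch on synthetic points (the data and the `random_state` value are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical data: 200 ordinary points plus two extreme ones
inliers = rng.normal(0, 1, size=(200, 2))
extreme = np.array([[8.0, 8.0], [-9.0, 7.5]])
points = np.vstack([inliers, extreme])

iso = IsolationForest(contamination=0.1, random_state=0)
labels = iso.fit_predict(points)     # 1 = kept, -1 = flagged as anomaly

kept = points[labels == 1]
print(f"flagged {np.sum(labels == -1)} of {len(points)} rows")
```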

In [None]:
df.shape

In [None]:
# Giving values to independent variables
X = df.drop('y_target', axis = 1)
X.head().T

In [None]:
# Giving the values of dependent variables
y = df['y_target']
y.head()

#**Oversampling using SMOTE**

##**SMOTE-**

####**The 'Synthetic Minority Oversampling Technique' (SMOTE) is a statistical technique for increasing the number of minority-class cases in a dataset in a balanced way. It works by generating new instances from the existing minority cases supplied as input.**
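The core idea can be sketched in a few lines: pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The function below is a simplified illustration written for this notebook (`smote_like_samples` and its parameters are hypothetical names), not the imblearn implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Simplified SMOTE-style interpolation (illustrative only)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, neighbour_idx = nn.kneighbors(minority)

    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        base = rng.integers(len(minority))                # random minority point
        neighbour = rng.choice(neighbour_idx[base, 1:])   # one of its k neighbours
        gap = rng.random()                                # position on the segment
        synthetic[row] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# Hypothetical minority class: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority, n_new=6)
print(new_points.round(2))
```

Every synthetic point lies between two existing minority points, so no value falls outside the range of the observed minority data.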

In [None]:
# Over sampling the data using SMOTE
import imblearn
from imblearn.over_sampling import SMOTE
sampler = SMOTE()
X,y = sampler.fit_resample(X.values, y.values)

In [None]:
X.shape

In [None]:
y.shape

In [None]:
# countplot of dependent column y

plt.figure(figsize = (10,8))
sns.countplot(x = y)
plt.xlabel('Y')
plt.ylabel('Count')
plt.title('Distribution of Y')
plt.show()

#**Model Building**

###**Logistic Regression**
###**Random Forest Classifier**
###**Decision Tree Classifier**
###**K-Nearest Neighbors (KNN)**
###**XGBoost Classifier**

#**Splitting data in Train and Test**

In [None]:
# Scale the data using Standard Scaler
ss = StandardScaler()
x = ss.fit_transform(X)

In [None]:
# splitting the dataset into the training set and test set
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 42)

#shape of training dataset.
print(f'shape of x_train set: {x_train.shape}')

#shape of testing dataset.
print(f'shape of x_test set: {x_test.shape}')
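In the cells above, oversampling and scaling are fitted on the full data before the split, so information from the test rows can influence training. A commonly recommended alternative order is to split first, fit the scaler on the training part only, and oversample only the training part. The sketch below shows that order on synthetic data with scikit-learn alone (all names and values are assumptions; the oversampling step is only indicated in a comment):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 180 negatives, 20 positives
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 180 + [1] * 20)

# 1) Split first; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# 2) Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# 3) Any oversampling (e.g. SMOTE) would be applied to X_tr_scaled / y_tr only,
#    leaving the test part with its original class balance.
print(X_tr_scaled.shape, X_te_scaled.shape, int(y_te.sum()))
```

With this order the test scores describe performance on data the model and its preprocessing have never seen.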

#**Implementing Various Machine learning Models**

#**1.Logistic Regression**

In [None]:
# Data fitting in Logistic Regression
log_reg = LogisticRegression(fit_intercept = True, max_iter = 10000)
log_reg.fit(x_train, y_train)

#prediction of test data
logistic_prediction = log_reg.predict(x_test)

# Get the accuracy scores
logistic_accuracy = accuracy_score(y_test,logistic_prediction)

#Checking the training accuracy
print("Training accuracy Score : ",log_reg.score(x_train, y_train))
#Checking the testing accuracy
print("Testing accuracy Score : ",logistic_accuracy )

In [None]:
# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, logistic_prediction))  # order: y_true, y_pred

In [None]:
#confusion matrix
conf_matrix = confusion_matrix(y_test,logistic_prediction)
f,ax = plt.subplots(figsize=(8,6))
sns.heatmap(conf_matrix, annot=True,fmt="d", linewidths=.5, ax=ax )
plt.title("Confusion Matrix", fontsize=15)
ax.set_yticks(np.arange(conf_matrix.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(['Refused T. Deposits', 'Accepted T. Deposits'])
ax.set_yticklabels(['Refused T. Deposits', 'Accepted T. Deposits'], fontsize=10, rotation=360)
plt.show()
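The four cells of the confusion matrix are enough to derive the metrics used in this notebook. The sketch below uses invented labels (`y_true` and `y_pred` are assumptions) and shows that precision, recall (the true positive rate), and the false positive rate are three different quantities. scikit-learn expects the true labels first and the predictions second.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # true positive rate
fpr = fp / (fp + tn)         # false positive rate (x-axis of a ROC curve)

print(f"precision={precision:.2f} recall={recall:.2f} fpr={fpr:.2f}")
```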

##**ROC AUC Curve for Logistic Regression**

In [None]:
from sklearn.metrics import roc_curve,roc_auc_score
from sklearn.metrics import auc

# getting the roc_score
log_reg_probability = log_reg.predict_proba(x_test)[:,1]
roc_score = roc_auc_score(y_test, log_reg_probability)
print(f'roc_score: {roc_score}')

In [None]:
# plot the roc curve for the model
from sklearn.metrics import roc_curve
logistic_FPR, logistic_TPR, _ = roc_curve(y_test, log_reg_probability)

plt.title('ROC curve of Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.plot(logistic_FPR,logistic_TPR)
plt.plot((0,1),ls='dashed',color='green')
plt.show()

#**2) Random Forest Classifier**

In [None]:
# Data fitting in Random Forest model
rf_clf = RandomForestClassifier()
rf_clf.fit(x_train, y_train)

#prediction of test data
rf_prediction = rf_clf.predict(x_test)

# Get the accuracy scores
rf_accuracy = accuracy_score(y_test,rf_prediction)

#Checking the training accuracy
print("Training accuracy Score : ",rf_clf.score(x_train, y_train))
#Checking the testing accuracy
print("Testing accuracy Score : ",rf_accuracy )

In [None]:
# Classification Report
print(classification_report(y_test, rf_prediction))  # order: y_true, y_pred

##**Confusion Matrix for Random Forest Classifier**

In [None]:
#confusion matrix
conf_matrix = confusion_matrix(y_test,rf_prediction)
f,ax = plt.subplots(figsize=(8,6))
sns.heatmap(conf_matrix, annot=True,fmt="d", linewidths=.5, ax=ax )
plt.title("Confusion Matrix", fontsize=15)
ax.set_yticks(np.arange(conf_matrix.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(['Refused T. Deposits', 'Accepted T. Deposits'])
ax.set_yticklabels(['Refused T. Deposits', 'Accepted T. Deposits'], fontsize=10, rotation=360)
plt.show()


##**ROC AUC Curve for Random Forest Classifier**

In [None]:
# getting the roc_score
rf_clf_probability = rf_clf.predict_proba(x_test)[:,1]
roc_score = roc_auc_score(y_test, rf_clf_probability)
print(f'roc_score: {roc_score}')

In [None]:
# plot the roc curve for the model

random_forest_FPR, random_forest_TPR,_ = roc_curve(y_test, rf_clf_probability)

plt.title('Random Forest Classifier ROC curve')
plt.xlabel('FPR')
plt.ylabel('TPR (Recall)')

plt.plot(random_forest_FPR,random_forest_TPR)
plt.plot((0,1), ls='dashed',color='green')
plt.show()

##**Important Feature for Random Forest Classifier**

In [None]:
rf_clf.feature_importances_

In [None]:
features = df.drop('y_target', axis=1).columns  # feature names only, without the target
importances = rf_clf.feature_importances_
indices = np.argsort(importances)

In [None]:
plt.figure(figsize=(20,15))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='purple', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

#**Hyperparameter Tuning**

In [None]:
## Hyperparameter tuning using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
param_dict = {
     "n_estimators":[50,100,200,250],
    "max_depth":[5,10,15],
    "min_samples_split":[50,100,150,200],
    "min_samples_leaf":[40,50,60]}

#Creating an instance of the RandomForestClassifier
rf_clf = RandomForestClassifier()

#random search
random_rf = RandomizedSearchCV(estimator=rf_clf,param_distributions=param_dict,cv=5,verbose=2,scoring='roc_auc',n_iter=5,random_state=0)
random_rf.fit(x_train, y_train)

In [None]:
#Best estimator for random forest
random_rf.best_estimator_

In [None]:
random_rf.best_params_
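Besides `best_params_`, a fitted search exposes the mean cross-validated score of the best candidate (`best_score_`) and the full table of tried candidates (`cv_results_`). A compact sketch on synthetic data, with a deliberately tiny search space so it runs quickly (the data and parameter values are assumptions, not the settings used above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical small classification problem
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, 5]},
    n_iter=3,            # number of sampled parameter combinations
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))          # mean CV ROC AUC of the best candidate
print(len(search.cv_results_["params"]))     # how many candidates were tried
```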

In [None]:
# Making predictions on test data
y_pred = random_rf.predict(x_test)

# Calculating accuracy on train and test
print(f'Training accuracy Score: {accuracy_score(y_train,random_rf.predict(x_train))}')
print(f'Testing accuracy Score: {accuracy_score(y_test,y_pred)}')

In [None]:
# Classification Report
print(classification_report(y_test, y_pred))  # order: y_true, y_pred

In [None]:
#confusion matrix
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test,y_pred)
f,ax = plt.subplots(figsize=(8,6))
sns.heatmap(conf_matrix, annot=True,fmt="d", linewidths=.5, ax=ax )
plt.title("Confusion Matrix", fontsize=15)
ax.set_yticks(np.arange(conf_matrix.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(['Refused T. Deposits', 'Accepted T. Deposits'])
ax.set_yticklabels(['Refused T. Deposits', 'Accepted T. Deposits'], fontsize=16, rotation=360)
plt.show()

##**ROC AUC Curve for Random Forest Classifier After Hyperparameter Tuning**

In [None]:
# getting the roc_score after hyperparameter tuning
random_rf_probability = random_rf.predict_proba(x_test)[:,1]
roc_score = roc_auc_score(y_test, random_rf_probability)
print(f'roc_score: {roc_score}')

In [None]:
# plot the roc curve for the model
random_forest_FPR, random_forest_TPR,_ =  roc_curve(y_test, random_rf_probability)
plt.title('Random Forest Classifier ROC curve After Hyperparameter Tuning')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate  (Recall)')
plt.plot(random_forest_FPR,random_forest_TPR)
plt.plot((0,1), ls='dashed',color='green')
plt.show()

#**3) Decision Tree**

In [None]:
# Data fitting in Decision Tree model
dec_tree_model = DecisionTreeClassifier()
dec_tree_model.fit(x_train, y_train)

#prediction of test data
Decision_prediction = dec_tree_model.predict(x_test)
# Get the accuracy scores
decision_accuracy = accuracy_score(y_test,Decision_prediction)

#Checking the training accuracy
print(f'Training accuracy Score : {dec_tree_model.score(x_train, y_train)}')
# checking the testing accuracy
print(f'Testing accuracy score : {decision_accuracy}')

In [None]:
# Classification report
print(classification_report(y_test, Decision_prediction))  # order: y_true, y_pred

In [None]:
#confusion matrix
conf_matrix = confusion_matrix(y_test,Decision_prediction)
f,ax = plt.subplots(figsize=(8,6))
sns.heatmap(conf_matrix, annot=True,fmt="d", linewidths=.5,ax=ax)
plt.title('Confusion Matrix', fontsize=15)
ax.set_yticks(np.arange(conf_matrix.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(['Refused T. Deposits', 'Accepted T. Deposits'])
ax.set_yticklabels(["Refused T. Deposits", "Accepted T. Deposits"],fontsize=10, rotation=360)
plt.show()

##**ROC AUC Curve for Decision Tree**

In [None]:
# getting the roc_score
dec_tree_probability = dec_tree_model.predict_proba(x_test)[:,1]
roc_score = roc_auc_score(y_test, dec_tree_probability)
print(f'roc_score: {roc_score}')

In [None]:
# plot the roc curve for the model
dec_tree_FPR, dec_tree_TPR,_ =  roc_curve(y_test, dec_tree_probability)
plt.title('ROC curve of Decision Tree Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate  (Recall)')
plt.plot(dec_tree_FPR,dec_tree_TPR)
plt.plot((0,1), ls='dashed',color='green')
plt.show()

##**Important Feature for Decision Tree**

In [None]:
dec_tree_model.feature_importances_

In [None]:
features = df.drop('y_target', axis=1).columns  # feature names only, without the target
importances = dec_tree_model.feature_importances_
indices = np.argsort(importances)

In [None]:
plt.figure(figsize=(20,15))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='purple', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

#**4) K-Nearest Neighbors (KNN)**

In [None]:
# Data fitting in KNN
K_model = KNeighborsClassifier()
K_model.fit(x_train, y_train)

#prediction of test data
K_model_prediction = K_model.predict(x_test)
# get the accuracy scores
k_model_accuracy = accuracy_score(y_test, K_model_prediction)

#Checking the training accuracy
print(f'Training accuracy score : {K_model.score(x_train, y_train)}')
# checking the testing accuracy
print(f'Testing accuracy score : {k_model_accuracy}')

In [None]:
# Classification report
print(classification_report(y_test, K_model_prediction))  # order: y_true, y_pred

In [None]:
#confusion matrix
conf_matrix = confusion_matrix(y_test,K_model_prediction)
f,ax = plt.subplots(figsize=(8,6))
sns.heatmap(conf_matrix, annot=True,fmt="d", linewidths=.5,ax=ax)
plt.title('Confusion Matrix', fontsize=15)
ax.set_yticks(np.arange(conf_matrix.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(['Refused T. Deposits', 'Accepted T. Deposits'])
ax.set_yticklabels(["Refused T. Deposits", "Accepted T. Deposits"],fontsize=10, rotation=360)
plt.show()

##**ROC AUC Curve for K Neighbors**

In [None]:
# getting the roc_score
K_model_probability = K_model.predict_proba(x_test)[:,1]
roc_score = roc_auc_score(y_test, K_model_probability)
roc_score

In [None]:
# plotting the roc curve for the model
KNN_FPR, KNN_TPR,_ = roc_curve(y_test, K_model_probability)
plt.title('ROC curve of K Neighbors Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate  (Recall)')
plt.plot(KNN_FPR, KNN_TPR)
plt.plot((0,1), ls='dashed',color='green')
plt.show()

#**5) XG Boost**

In [None]:
import xgboost as xgb
# Data fitting in xgboost model
XGB_model = xgb.XGBClassifier()
XGB_model.fit(x_train, y_train)

#prediction of test data
XGB_model_prediction = XGB_model.predict(x_test)
# get the accuracy scores
XGB_model_accuracy = accuracy_score(y_test, XGB_model_prediction)

#Checking the training accuracy
print(f'Training accuracy score : {XGB_model.score(x_train, y_train)}')
# checking the testing accuracy
print(f'Testing accuracy score : {XGB_model_accuracy}')

In [None]:
# Classification report
print(classification_report(y_test, XGB_model_prediction))  # order: y_true, y_pred

In [None]:
#confusion matrix
conf_matrix = confusion_matrix(y_test,XGB_model_prediction)
f,ax = plt.subplots(figsize=(8,6))
sns.heatmap(conf_matrix, annot=True,fmt="d", linewidths=.5,ax=ax)
plt.title('Confusion Matrix', fontsize=15)
ax.set_yticks(np.arange(conf_matrix.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(['Refused T. Deposits', 'Accepted T. Deposits'])
ax.set_yticklabels(["Refused T. Deposits", "Accepted T. Deposits"],fontsize=10, rotation=360)
plt.show()

##**ROC AUC Curve for XGBoost Classifier**

In [None]:
# getting the roc_score

Xgb_probability = XGB_model.predict_proba(x_test)[:,1]
roc_score = roc_auc_score(y_test, Xgb_probability)
roc_score

In [None]:
# plotting the roc curve for the XGB model
XGB_FPR, XGB_TPR,_ = roc_curve(y_test, Xgb_probability)  # use the XGBoost probabilities, not the KNN ones
plt.title('XG Boost Classifier ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate  (Recall)')
plt.plot(XGB_FPR, XGB_TPR)
plt.plot((0,1), ls='dashed',color='green')
plt.show()

##**Important Feature for XG Boost Classifier**

In [None]:
XGB_model.feature_importances_

In [None]:
features = df.drop('y_target', axis=1).columns  # feature names only, without the target
importances = XGB_model.feature_importances_
indices = np.argsort(importances)

In [None]:
plt.figure(figsize=(20,15))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='purple', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

###**roc_auc_score for different classifiers**

In [None]:
print(f'Logistic_Regression score: {roc_auc_score(y_test, log_reg_probability)}')
print(f'Random Forest Classifier Score: {roc_auc_score(y_test, random_rf_probability)}')
print(f'Decision Tree Score: {roc_auc_score(y_test, dec_tree_probability)}')
print(f'XGB Classifier score: {roc_auc_score(y_test, Xgb_probability)}')
print(f'KNN Score: {roc_auc_score(y_test, K_model_probability)}')

In [None]:
# plotting the roc curve of models
def graph_roc_curve_multiple(logistic_FPR, logistic_TPR,random_forest_FPR,random_forest_TPR,dec_tree_FPR,dec_tree_TPR, XGB_FPR,XGB_TPR,KNN_FPR,KNN_TPR):
    plt.figure(figsize=(7,5))
    plt.title('comparing the models on the basis of ROC Curve', fontsize=14)
    plt.plot(logistic_FPR, logistic_TPR, label='Logistic Regression (Score = 93.22%)')
    plt.plot(random_forest_FPR, random_forest_TPR, label='Random Forest (Score = 92.63%)')
    plt.plot(dec_tree_FPR, dec_tree_TPR, label='Decision Tree (Score = 88.95%)')
    plt.plot(KNN_FPR, KNN_TPR, label='KNN (Score = 93.28%)')
    plt.plot(XGB_FPR, XGB_TPR, label='XGB Classifier (Score = 93.29%)')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.legend()
    
graph_roc_curve_multiple(logistic_FPR, logistic_TPR,random_forest_FPR, random_forest_TPR, dec_tree_FPR,dec_tree_TPR, XGB_FPR,XGB_TPR, KNN_FPR,KNN_TPR)
plt.show()
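The legend labels above contain hard-coded scores, which can drift from the actual results when the notebook is rerun. One option is to build each label from the curve itself with `sklearn.metrics.auc`. The sketch below does this for two invented score vectors (`model_a` and `model_b` are hypothetical names, not models from this notebook):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and predicted probabilities for two models
y_true = np.array([0, 0, 1, 1])
scores = {
    "model_a": np.array([0.1, 0.4, 0.35, 0.8]),
    "model_b": np.array([0.2, 0.3, 0.6, 0.9]),
}

labels = {}
for name, prob in scores.items():
    fpr, tpr, _ = roc_curve(y_true, prob)
    # The area under this exact curve goes into the legend text
    labels[name] = f"{name} (AUC = {auc(fpr, tpr):.2%})"

print(labels)
```

Each label string can then be passed as the `label` argument of the corresponding `plt.plot` call.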

#**Conclusion**


* Clients in blue-collar, management, and technician jobs showed the most interest in subscribing.

* Divorced clients showed the least interest in term deposits.

* The majority of the customers are between the ages of 30 and 40.

* The model can help identify customers who are likely to subscribe to a term deposit.

* Most people have home loans, but only a small percentage of them chose term deposits.

* The outcome of the campaign is significantly influenced by the customer's account balance. We can therefore prioritise customers based on their account balance.

* Instead of wasting time on the wrong customer, the model helps to target the right one.


* After implementing all the ML models, we get the maximum accuracy and ROC-AUC score with XGBoost, so we conclude that it is the best model for this task.
