# DATA SET: bank-full.csv 

Data Description:

The data relates to the direct marketing campaigns of a Portuguese
banking institution. The campaigns were based on phone calls. Often,
more than one contact with the same client was required in order to
assess whether the product (a bank term deposit) would be subscribed
('yes') or not ('no').

Domain: Banking

Context:

Leveraging customer information is paramount for most businesses. In
the case of a bank, attributes of customers like the ones mentioned
below can be crucial in strategizing a marketing campaign when
launching a new product.

# 1. Import the necessary libraries

In [None]:
# To enable plotting graphs in Jupyter notebook
%matplotlib inline

# Importing libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Importing plotting libraries
import matplotlib.pyplot as plt

# Importing seaborn for statistical plots
import seaborn as sns

# Split the X and y data into training and test sets using
# scikit-learn's random data-splitting function

from sklearn.model_selection import train_test_split

import numpy as np
#import os,sys
from scipy import stats

# calculate accuracy measures and confusion matrix
from sklearn import metrics

# 2. Read the data as a data frame 

In [None]:
datapath = '../input'
my_data = pd.read_csv(datapath+'/bank-full.csv')

# 3. Basic EDA

In [None]:
my_data.head(10)

a. There are 7 numeric independent variables:

    1. Age (numeric)
    2. Balance: average yearly balance, in euros (numeric)
    3. Day: last contact day of the month (numeric, 1-31)
    4. Duration: last contact duration, in seconds (numeric)
    5. Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
    6. pdays: number of days that passed after the client was last contacted from a previous campaign (numeric; -1 means the client was not previously contacted)
    7. previous: number of contacts performed before this campaign and for this client (numeric)

b. There are 9 categorical variables (most are nominal rather than ordinal):

    1. Job: type of job
    2. Marital: marital status
    3. Education
    4. Default: has credit in default? (binary: 'no','yes')
    5. Housing: has housing loan? (binary: 'no','yes')
    6. Loan: has personal loan? (binary: 'no','yes')
    7. Contact: contact communication type (categorical: 'unknown','cellular','telephone')
    8. Month: last contact month of the year
    9. poutcome: outcome of the previous marketing campaign (categorical: 'unknown','failure','other','success')

c. The target variable is a binary categorical variable (desired target):

    Target: has the client subscribed to a term deposit? (binary: 'yes', 'no')


# 3.a. Shape of the data 

In [None]:
my_data.shape

There are 45211 clients.

In [None]:
my_data.columns

# 3.b. Data type of each attribute 

In [None]:
my_data.dtypes

### Some attributes have the object (string) data type and the rest have an integer data type.

### scikit-learn's decision tree accepts only numeric input, so the object (string) columns must be encoded as numbers before modelling.

# 3.c. Checking the presence of missing values 

In [None]:
# Count the missing cells once, then drop incomplete rows only if any exist
n_missing = my_data.isnull().sum().sum()

if n_missing > 0:
    print("Missing values present : ", n_missing)
    my_data = my_data.dropna()
else:
    print("No missing values present")

## Check for the null values 

In [None]:
#null values
my_data.isnull().values.any()

# 3.d. 5 point summary of numerical attributes 

In [None]:
my_data.describe().T

In [None]:
my_data.info()

## Finding unique data 

In [None]:
my_data.apply(lambda x: len(x.unique()))

In [None]:
print('Jobs:\n',my_data['job'].unique())
print('Marital:\n',my_data['marital'].unique())
print('Default:\n',my_data['default'].unique())
print('Education:\n',my_data['education'].unique())
print('Housing:\n',my_data['housing'].unique())
print('Loan:\n',my_data['loan'].unique())
print('Contact:\n',my_data['contact'].unique())
print('Month:\n',my_data['month'].unique())
print('Day:\n',my_data['day'].unique())
print('Campaign:\n',my_data['campaign'].unique())

In [None]:
#Find Mean
my_data.mean(numeric_only=True)  # skip the string columns

In [None]:
#Find Median
my_data.median(numeric_only=True)  # skip the string columns

In [None]:
#Find Standard Deviation
my_data.std(numeric_only=True)  # skip the string columns

## Measure of skewness  

In [None]:
my_data.skew(axis = 0, skipna = True, numeric_only = True)

# Plotting histograms to check whether the numeric columns are approximately normal

In [None]:
my_data.hist(figsize=(10,10),color="blueviolet",grid=False)
plt.show()

# PairPlot 

In [None]:
sns.pairplot(my_data.iloc[:,1:])

## Only the numeric columns appear in these plots ('job' and 'month' are categorical). Of the numeric columns, 'age' and 'day' look closest to a symmetric distribution.

# 3.e. Checking the presence of outliers 

## AGE

In [None]:
print('Min age: ', my_data['age'].min())
print('Max age: ', my_data['age'].max())

In [None]:
plt.figure(figsize = (30,12))
sns.countplot(x = 'age',  palette="rocket", data = my_data)
plt.xlabel("Age", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Age Distribution', fontsize=15)

In [None]:
sns.boxplot(y = 'age', data = my_data)  # y= draws a vertical box plot
plt.ylabel("Age", fontsize=15)
plt.title('Age Distribution', fontsize=15)

In [None]:
sns.histplot(my_data['age'], kde=True)  # histplot replaces the deprecated distplot
plt.xlabel("Age", fontsize=15)
plt.ylabel('Occurrence', fontsize=15)
plt.title('Age x Occurrence', fontsize=15)

## Calculate the outliers of Age Attribute: 

In [None]:
# Quartiles
print('1º Quartile: ', my_data['age'].quantile(q = 0.25))
print('2º Quartile: ', my_data['age'].quantile(q = 0.50))
print('3º Quartile: ', my_data['age'].quantile(q = 0.75))
print('4º Quartile: ', my_data['age'].quantile(q = 1.00))

In [None]:
  # Interquartile range, IQR = Q3 - Q1
  # lower 1.5*IQR whisker = Q1 - 1.5 * IQR 
  # Upper 1.5*IQR whisker = Q3 + 1.5 * IQR
    
print('Ages above: ', my_data['age'].quantile(q = 0.75) + 
                      1.5*(my_data['age'].quantile(q = 0.75) - my_data['age'].quantile(q = 0.25)), 'are outliers')

In [None]:
# 70.5 is the upper whisker printed by the previous cell
print('Number of outliers: ', my_data[my_data['age'] > 70.5]['age'].count())
print('Number of clients: ', len(my_data))
#Outliers in %
print('Outliers are:', round(my_data[my_data['age'] > 70.5]['age'].count()*100/len(my_data),2), '%')

## From the graphs alone we cannot conclude whether age has a strong effect on the Target variable.
## The percentage of outliers is small, so we can fit the model with and without them.
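
The threshold 70.5 used above is typed in by hand from the printed upper whisker. A minimal sketch of computing both fences directly, assuming a hypothetical helper `iqr_fences` and a toy series (neither is part of the original notebook):

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return the (lower, upper) Tukey fences of a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy ages for illustration only, not taken from bank-full.csv
ages = pd.Series([25, 30, 35, 40, 45, 50, 95])
lower, upper = iqr_fences(ages)

# Values outside the fences are flagged as outliers
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper, list(outliers))
```

In the notebook the same helper could be applied to `my_data['age']`, `my_data['duration']` or `my_data['campaign']` instead of hard-coding each threshold.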


## Job

In [None]:
plt.figure(figsize = (30,12))
sns.countplot(x = 'job',data = my_data)
plt.xlabel("job", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Job Distribution', fontsize=20)

### The count of 'blue-collar' is higher than that of any other job category. The count for 'management' is also notable.

##  Marital

In [None]:
#plt.figure(figsize = (30,12))
sns.countplot(x = 'marital',data = my_data)
plt.xlabel("Marital", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Marital Distribution', fontsize=15)

In [None]:
sns.boxplot(x='marital',y='age',hue='Target',data=my_data)

## Married clients appear to subscribe to term deposits most often, but the 'yes' and 'no' boxes overlap heavily, so from these graphs a client's chance of subscribing still looks close to 50 percent.

## Married clients are clearly the largest group.

## Education

In [None]:
#plt.figure(figsize = (30,12))
sns.countplot(x = 'education',data = my_data)
plt.xlabel("Education", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Education Distribution', fontsize=15)

## Clients with secondary education are the most numerous, and clients with unknown education are the fewest.

In [None]:
sns.boxplot(x='education',y='age',hue='Target',data=my_data)

## Outliers are present in every education category. Clients with primary education appear to include more term-deposit subscribers.

## Default 

In [None]:
#plt.figure(figsize = (30,12))
sns.countplot(x = 'default',data = my_data)
plt.xlabel("Default", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Default Distribution', fontsize=15)

In [None]:
sns.boxplot(x='default',y='age',hue='Target',data=my_data)

In [None]:
print('Default:\n No credit in default:'     , my_data[my_data['default'] == 'no']     ['age'].count(),
              '\n Yes to credit in default:' , my_data[my_data['default'] == 'yes']    ['age'].count())

## Fewer clients have credit in default than do not.

## Housing

In [None]:
#plt.figure(figsize = (30,12))
sns.countplot(x = 'housing',data = my_data)
plt.xlabel("Housing", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Housing Distribution', fontsize=15)

In [None]:
print('Housing:\n No Housing:'     , my_data[my_data['housing'] == 'no']     ['age'].count(),
              '\n Yes Housing:' , my_data[my_data['housing'] == 'yes']    ['age'].count())

## About 5,000 more clients have a housing loan than do not.

In [None]:
sns.boxplot(x='housing',y='age',hue='Target',data=my_data)

## Clients without a housing loan appear to account for more than 50% of the term-deposit subscriptions.
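
A box plot of age does not show subscription rates directly. One way to check a statement like the one above with numbers is a row-normalised cross-tabulation. The sketch below uses a small made-up frame (`toy` is an assumption for illustration, not the bank data):

```python
import pandas as pd

# Toy rows for illustration only; in the notebook my_data would be used instead
toy = pd.DataFrame({
    'housing': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no'],
    'Target':  ['no',  'no',  'no',  'yes', 'no', 'no', 'yes', 'yes'],
})

# normalize='index' makes each row sum to 1, so the 'yes' column is the
# share of subscribers within each housing group
rates = pd.crosstab(toy['housing'], toy['Target'], normalize='index')
print(rates)
```

The same call with `normalize='columns'` would instead give the share of each housing group among subscribers and non-subscribers.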

## Loan 

In [None]:
#plt.figure(figsize = (30,12))
sns.countplot(x = 'loan',data = my_data)
plt.xlabel("Loan", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Loan Distribution', fontsize=15)

In [None]:
print('Loan:\n No Personal loan:'     , my_data[my_data['loan'] == 'no']     ['age'].count(),
              '\n Yes Personal Loan:' , my_data[my_data['loan'] == 'yes']    ['age'].count())

## Clients with a personal loan are far fewer than clients without one. The difference is almost 30,000.

In [None]:
sns.boxplot(x='loan',y='age',hue='Target',data=my_data)

## Contact 

In [None]:
#plt.figure(figsize = (30,12))
sns.countplot(x = 'contact',data = my_data)
plt.xlabel("Contact", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Contact Distribution', fontsize=15)

In [None]:
print('Contact:\n Unknown Contact:'     , my_data[my_data['contact'] == 'unknown']     ['age'].count(),
              '\n Cellular Contact:'   , my_data[my_data['contact'] == 'cellular']    ['age'].count(),
              '\n Telephone Contact:'  , my_data[my_data['contact'] == 'telephone']   ['age'].count())

## More clients were contacted by cellular than by any other contact type.

## Month

In [None]:
#plt.figure(figsize = (30,12))
sns.countplot(x = 'month',data = my_data)
plt.xlabel("In which Month was a person contacted", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Monthly Distribution', fontsize=15)

## May has the highest number of contacts of any month. This cannot be tied to a particular year because the year is not recorded in the dataset.

## Day 

In [None]:
sns.boxplot(x=my_data["day"])

## Most contacts were made between the 8th and the 21st day of the month, and no outliers are present.

## Duration of a call 

In [None]:
sns.boxplot(x=my_data["duration"])

In [None]:
sns.histplot(my_data['duration'], kde=True)  # histplot replaces the deprecated distplot
plt.xlabel("duration", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Duration distribution', fontsize=15)

### Calculate the outliers of Duration of last contact:

In [None]:
# Quartiles
print('1º Quartile: ', my_data['duration'].quantile(q = 0.25))
print('2º Quartile: ', my_data['duration'].quantile(q = 0.50))
print('3º Quartile: ', my_data['duration'].quantile(q = 0.75))
print('4º Quartile: ', my_data['duration'].quantile(q = 1.00))

In [None]:
  # Interquartile range, IQR = Q3 - Q1
  # lower 1.5*IQR whisker = Q1 - 1.5 * IQR 
  # Upper 1.5*IQR whisker = Q3 + 1.5 * IQR
    
print('Duration above: ', my_data['duration'].quantile(q = 0.75) + 
                      1.5*(my_data['duration'].quantile(q = 0.75) - my_data['duration'].quantile(q = 0.25)), 'are outliers')

In [None]:
# 643.0 is the upper whisker printed by the previous cell
print('Number of outliers: ', my_data[my_data['duration'] > 643.0]['duration'].count())
print('Number of clients: ', len(my_data))
#Outliers in %
print('Outliers are:', round(my_data[my_data['duration'] > 643.0]['duration'].count()*100/len(my_data),2), '%')

## From the graphs alone we cannot conclude whether duration has a strong effect on the Target variable.
## The percentage of outliers (durations above 643 seconds) is small, but their absolute count is still considerable.


In [None]:
# If the call duration is equal to 0, the client obviously did not subscribe.
# These rows should be deleted later.
my_data[(my_data['duration'] == 0)]

In [None]:
my_data[my_data['duration'] == 0]['duration'].count()

##  Campaign

In [None]:
plt.figure(figsize = (30,12))
sns.countplot(x = 'campaign', data = my_data)
plt.xlabel("Campaign", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Campaign Distribution', fontsize=15)

In [None]:
sns.boxplot(y = 'campaign', data = my_data)  # y= draws a vertical box plot
plt.ylabel("Campaign", fontsize=15)
plt.title('Campaign Distribution', fontsize=15)

In [None]:
sns.histplot(my_data['campaign'], kde=True)  # histplot replaces the deprecated distplot
plt.xlabel("Campaign", fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Campaign distribution', fontsize=15)

### Calculate the outliers for Campaign attribute: 

In [None]:
# Quartiles
print('1º Quartile: ', my_data['campaign'].quantile(q = 0.25))
print('2º Quartile: ', my_data['campaign'].quantile(q = 0.50))
print('3º Quartile: ', my_data['campaign'].quantile(q = 0.75))
print('4º Quartile: ', my_data['campaign'].quantile(q = 1.00))

In [None]:
  # Interquartile range, IQR = Q3 - Q1
  # lower 1.5*IQR whisker = Q1 - 1.5 * IQR 
  # Upper 1.5*IQR whisker = Q3 + 1.5 * IQR
    
print('Campaign above: ', my_data['campaign'].quantile(q = 0.75) + 
                      1.5*(my_data['campaign'].quantile(q = 0.75) - my_data['campaign'].quantile(q = 0.25)), 'are outliers')

In [None]:
# 6.0 is the upper whisker printed by the previous cell
print('Number of outliers: ', my_data[my_data['campaign'] > 6.0]['campaign'].count())
print('Number of clients: ', len(my_data))
#Outliers in %
print('Outliers are:', round(my_data[my_data['campaign'] > 6.0]['campaign'].count()*100/len(my_data),2), '%')

## The percentage of outliers in 'campaign' is small, so we can fit the model with or without this attribute.

In [None]:
sns.boxplot(x='campaign',y='age',hue='Target',data=my_data)

## pdays

In [None]:
sns.boxplot(y = 'pdays', data = my_data)  # y= draws a vertical box plot
plt.ylabel("pdays", fontsize=15)
plt.title('pdays Distribution', fontsize=15)

## Previous 

In [None]:
sns.boxplot(y = 'previous', data = my_data)  # y= draws a vertical box plot
plt.ylabel("Previous", fontsize=15)
plt.title('Previous', fontsize=15)

## poutcome: 

In [None]:
sns.countplot(x = 'poutcome', data = my_data, orient = 'v')
plt.ylabel("Poutcome", fontsize=15)
plt.title('Poutcome distribution', fontsize=15)

In [None]:
print('poutcome:\n Unknown poutcome:'     , my_data[my_data['poutcome'] == 'unknown']   ['age'].count(),
              '\n Failure in  poutcome:'  , my_data[my_data['poutcome'] == 'failure']   ['age'].count(),
              '\n Other poutcome:'        , my_data[my_data['poutcome'] == 'other']     ['age'].count(),
              '\n Success in poutcome:'   , my_data[my_data['poutcome'] == 'success']   ['age'].count())

## Successes of the previous marketing campaign are barely visible in the graph, but this is inconclusive because so many outcomes are 'unknown'.

In [None]:
sns.boxplot(x='poutcome',y='age',hue='Target',data=my_data)

## Target column 

In [None]:
my_data.boxplot(by = 'Target',  layout=(4,4), figsize=(20, 20))

In [None]:
sns.countplot(x = 'Target', data = my_data, orient = 'v')
plt.ylabel("Target", fontsize=15)
plt.title('Target distribution', fontsize=15)

In [None]:
#Let us look at the target column which is "Target"(yes/no).
my_data.groupby(["Target"]).count()
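
Because accuracy is used later to compare models, it may help to look at the class shares first: a classifier that always predicts the majority class already reaches that share as its accuracy. A small sketch with toy labels (assumed values, not the real counts):

```python
import pandas as pd

# Toy labels for illustration only: 9 'no' and 1 'yes'
target = pd.Series(['no'] * 9 + ['yes'] * 1, name='Target')

# Share of each class in the column
shares = target.value_counts(normalize=True)

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy = shares.max()
print(shares)
print('Majority-class baseline accuracy:', baseline_accuracy)
```

In the notebook, `my_data['Target'].value_counts(normalize=True)` would give the corresponding shares for the bank data.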

## Calculate the correlation matrix 

In [None]:
cor=my_data.corr(numeric_only=True)  # correlation is defined only for numeric columns
cor

### Heatmap 

In [None]:
plt.subplots(figsize=(10,8))
sns.heatmap(cor,annot=True)

# 11.Conclusion from EDA: 

### 1. Age does not appear to be very informative, and relating it to the other variables gives no clear insight. From the graphs alone we cannot conclude whether age has a strong effect on the Target variable.
### 2. The percentage of outliers in 'age' is small, so we can fit the model with and without them.
### 3. For the job attribute, the count of 'blue-collar' is higher than that of any other category, and the count for 'management' is also notable.
### 4. Married clients are clearly the largest group.
### 5. Clients with secondary education are the most numerous, and clients with unknown education are the fewest.
### 6. Fewer clients have credit in default than do not.
### 7. About 5,000 more clients have a housing loan than do not.
### 8. Clients with a personal loan are far fewer than clients without one. The difference is almost 30,000.
### 9. More clients were contacted by cellular than by any other contact type.
### 10. May has the highest number of contacts of any month. This cannot be tied to a particular year because the year is not recorded in the dataset.
### 11. Most contacts were made between the 8th and the 21st day of the month, and no outliers are present.
### 12. From the graphs alone we cannot conclude whether duration has a strong effect on the Target variable. The percentage of outliers (durations above 643 seconds) is small, but their absolute count is still considerable.
### 13. The percentage of outliers in 'campaign' is small, so we can fit the model with or without this attribute.
### 14. Successes of the previous marketing campaign are barely visible in the graph, but this is inconclusive because so many outcomes are 'unknown'.
### 15. For job, marital status and education, the most useful analysis is simply the count of each category; relating them to the other variables is not conclusive.
### 16. Married clients appear to subscribe to term deposits most often, but from the graphs a client's chance of subscribing still looks close to 50 percent.
### 17. Outliers are present in every education category. Clients with primary education appear to include more term-deposit subscribers.
### 18. Clients without a housing loan appear to account for more than 50% of the term-deposit subscriptions.

# 4. Prepare the data to train a model – check if data types are appropriate, get rid of the missing values etc 

### Converting categorical attributes to numeric codes, because feature scaling will be applied later.

In [None]:
# Label encoder order in alphabetical
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
my_data['job']      = labelencoder_X.fit_transform(my_data['job']) 
my_data['marital']  = labelencoder_X.fit_transform(my_data['marital']) 
my_data['education']= labelencoder_X.fit_transform(my_data['education']) 
my_data['default']  = labelencoder_X.fit_transform(my_data['default']) 
my_data['housing']  = labelencoder_X.fit_transform(my_data['housing']) 
my_data['loan']     = labelencoder_X.fit_transform(my_data['loan']) 

my_data['contact']     = labelencoder_X.fit_transform(my_data['contact']) 
my_data['month']       = labelencoder_X.fit_transform(my_data['month']) 
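
LabelEncoder assigns integer codes in alphabetical order, which implies an ordering that nominal columns such as 'job' or 'marital' do not have. A possible alternative, sketched here on a toy frame (an assumption for illustration, not the approach used in this notebook), is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    'marital': ['married', 'single', 'divorced', 'single'],
    'age': [40, 25, 51, 33],
})

# One indicator column per category; drop_first removes the first category
# ('divorced' here) so the indicators are not perfectly collinear
encoded = pd.get_dummies(toy, columns=['marital'], drop_first=True, dtype=int)
print(encoded)
```

Tree-based models usually cope with integer codes, while distance-based and linear models such as KNN and logistic regression tend to be more sensitive to the artificial ordering.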

In [None]:
# Function to group ages into bins; this helps because 'age' has 78 different values
def age(dataframe):
    dataframe.loc[dataframe['age'] <= 32, 'age'] = 1
    dataframe.loc[(dataframe['age'] > 32) & (dataframe['age'] <= 47), 'age'] = 2
    dataframe.loc[(dataframe['age'] > 47) & (dataframe['age'] <= 70), 'age'] = 3
    dataframe.loc[(dataframe['age'] > 70) & (dataframe['age'] <= 98), 'age'] = 4
           
    return dataframe

age(my_data);
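
The same age bins could also be expressed with `pd.cut`, which leaves the original column untouched. This is a sketch on toy values; the lower edge 0 is an assumption, since the function above puts no lower bound on the first bin.

```python
import pandas as pd

# Toy ages chosen to sit on both sides of every bin edge
ages = pd.Series([18, 32, 33, 47, 48, 70, 71, 95])

# Bins are right-inclusive by default: (0, 32], (32, 47], (47, 70], (70, 98]
age_group = pd.cut(ages, bins=[0, 32, 47, 70, 98], labels=[1, 2, 3, 4]).astype(int)
print(age_group.tolist())
```

Writing the result to a new column (for example a hypothetical `age_group`) would keep the raw ages available for later checks.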

In [None]:
my_data.head()

In [None]:
print(my_data.shape)
my_data.head()

In [None]:
def duration(data):

    data.loc[data['duration'] <= 102, 'duration'] = 1
    data.loc[(data['duration'] > 102) & (data['duration'] <= 180)  , 'duration']    = 2
    data.loc[(data['duration'] > 180) & (data['duration'] <= 319)  , 'duration']   = 3
    data.loc[(data['duration'] > 319) & (data['duration'] <= 644.5), 'duration'] = 4
    data.loc[data['duration']  > 644.5, 'duration'] = 5

    return data
duration(my_data);

In [None]:
my_data.head()

In [None]:
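# bank-full.csv marks "not previously contacted" with -1 (999 is used in other
# versions of this dataset), so both values are treated as that category here.
# Masks are built from the original values so recoded rows are not recoded again.
pdays = my_data['pdays'].copy()
my_data.loc[pdays.isin([-1, 999]), 'pdays'] = 1
my_data.loc[(pdays > 0) & (pdays <= 10), 'pdays'] = 2
my_data.loc[(pdays > 10) & (pdays <= 20), 'pdays'] = 3
my_data.loc[(pdays > 20) & (pdays != 999), 'pdays'] = 4
my_data.head()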

In [None]:
# Assign the result back instead of using inplace=True on a column selection
my_data['poutcome'] = my_data['poutcome'].map({'unknown': 1, 'failure': 2, 'other': 3, 'success': 4})

In [None]:
print(my_data.shape)
my_data.head()

In [None]:
Final_data=my_data
print(Final_data.shape)
Final_data.head()

# 5. Train a few standard classification algorithms, note and comment on their performances along different metrics. 

In [None]:
Final_data.head()

# 5.A.Applying  the NB model and print the accuracy of NB model. 

In [None]:
#from sklearn.preprocessing import Imputer
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [None]:
X = Final_data.values[:,0:15]  ## Features: columns 0-14 (this slice leaves out column 15, the last feature)
Y = Final_data.values[:,16]  ## Target

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 7)

In [None]:
clf = GaussianNB()
clf.fit(X_train, Y_train)

In [None]:
Y_pred = clf.predict(X_test)

In [None]:
NB=accuracy_score(Y_test, Y_pred, normalize = True) #Accuracy of Naive Bayes' Model
print('Accuracy_score:',NB)

In [None]:
print('Confusion_matrix of NB:')
print(metrics.confusion_matrix(Y_test,Y_pred))
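
Accuracy alone can hide poor performance on the 'yes' class when the classes are imbalanced. As a sketch (toy labels, not results from this notebook), precision and recall for the positive class can be read alongside the confusion matrix:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only: 6 true 'no' and 4 true 'yes'
y_true = ['no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes']
y_hat  = ['no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no']

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=['no', 'yes'])

# Precision: share of predicted 'yes' that are truly 'yes'
precision = precision_score(y_true, y_hat, pos_label='yes')
# Recall: share of true 'yes' that the model found
recall = recall_score(y_true, y_hat, pos_label='yes')

print(cm)
print('precision:', precision, 'recall:', recall)
```

In this toy case the accuracy is 0.6 even though only one of the four 'yes' clients is found, which is the kind of gap these two metrics expose.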

# 5.B.Applying  the KNN model and print the accuracy of KNN model. 

In [None]:
final_data = Final_data[['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
                     'contact', 'month', 'day', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']]
final_data.shape

In [None]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [None]:
X_std = pd.DataFrame(StandardScaler().fit_transform(final_data))
X_std.columns = final_data.columns

In [None]:
#split the dataset into training and test datasets
import numpy as np
from sklearn.model_selection import train_test_split

# Transform data into features and target
# Note: the slice 1:16 starts at column 1, so column 0 ('age') is not used as a feature here
X = np.array(my_data.iloc[:,1:16])
y = np.array(my_data['Target'])

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

In [None]:
print(X_train.shape)
print(y_train.shape)

In [None]:
# loading library
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

#Neighbors
neighbors = np.arange(0,25)

for k in neighbors:
    k_value = k+1
    knn = KNeighborsClassifier(n_neighbors = k_value)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    


In [None]:
myList = list(range(1,30))

# subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))

In [None]:
ac_scores = []

# compute the accuracy for the odd values of k: 1, 3, 5, ..., 29
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # predict the response
    y_pred = knn.predict(X_test)
    # evaluate accuracy
    scores = accuracy_score(y_test, y_pred)
    ac_scores.append(scores)

# changing to misclassification error
MSE = [1 - x for x in ac_scores]

# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
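
KNN is distance based, so features on larger scales can dominate the neighbour search, and choosing k on the test set tends to give an optimistic score. A possible variant, sketched on synthetic data from `make_classification` (an assumption standing in for the bank features), scales inside a pipeline and selects k by cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the bank dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

cv_scores = {}
for k in range(1, 16, 2):  # odd values of k only
    # The scaler is refit on the training part of every fold,
    # so no information leaks from the validation part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# k with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print('Best k by 5-fold cross-validation:', best_k)
```

The test set can then be kept for a single final evaluation of the chosen k.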

In [None]:
#Plot misclassification error vs k (with k value on X-axis) using matplotlib.
import matplotlib.pyplot as plt
# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

In [None]:
#Use k=23 as the final model for prediction
knn = KNeighborsClassifier(n_neighbors = 23)

# fitting the model
knn.fit(X_train, y_train)

# predict the response
y_pred = knn.predict(X_test)

# evaluate accuracy
KNN=accuracy_score(y_test, y_pred)   #Accuracy of KNN model
print('Accuracy_score:',KNN)    

In [None]:
print('Confusion_matrix:')
print(metrics.confusion_matrix(y_test, y_pred))

# 5.C.Applying Logistic Regression Model and Print accuracy and confusion matrix of Logistic Regression. 

In [None]:
array = my_data.values
X = array[:,0:16] # select all rows and first 16 columns which are the attributes
Y = array[:,16]   # select all rows and the 17th column which is the classification "yes", "no"
test_size = 0.30 # taking 70:30 training and test set
seed = 15  # Random number seed for repeatability of the code
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed) # To set the random state
type(X_train)

In [None]:
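# Fit the model on the 70% training split and evaluate it on the 30% test split
# max_iter is raised because the default of 100 iterations may not converge on unscaled features
model = LogisticRegression(max_iter=1000)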
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
LR = model.score(X_test, y_test)
print('Accuracy:',LR)
print('confusion_matrix:')
print(metrics.confusion_matrix(y_test, y_predict))
A=LR  # Accuracy of Logistic regression model

# 6. Build the ensemble models and compare the results with the base models. 

# A.Decision Tree

In [None]:
# scikit-learn's decision tree accepts only numeric input. It cannot take string / object columns.
# The following code loops through each column and checks if the column type is object then converts those columns
# into categorical with each distinct value becoming a category or code.

for feature in my_data.columns: # Loop through all columns in the dataframe
    if my_data[feature].dtype == 'object': # Only apply for columns with categorical strings
        my_data[feature] = pd.Categorical(my_data[feature]).codes # Replace strings with an integer

In [None]:
my_data.info()

In [None]:
train_char_label = ['No', 'Yes']

In [None]:
from sklearn.tree import DecisionTreeClassifier

# splitting data into training and test set for independent attributes
from sklearn.model_selection import train_test_split

X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=.30, random_state=1)


In [None]:
# splitting data into training and test set in the ratio of 70:30 by row order
# (first 70% of the rows for training, last 30% for testing; no shuffling)
n=my_data['Target'].count()
train_set = my_data.head(int(round(n*0.7))).copy() # First 70% of the rows
test_set = my_data.tail(int(round(n*0.3))).copy() # Last 30% of the rows

# capture the target column ("Target") into separate vectors for training set and test set
train_labels = train_set.pop("Target")
test_labels = test_set.pop("Target")
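
The split above takes the first 70% of the rows for training and the last 30% for testing, so the class mix of the two parts can differ if the rows are ordered. An alternative sketch, on toy data (assumed for illustration), is a shuffled split that is stratified on the target:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame for illustration only: 4 'yes' and 16 'no'
toy = pd.DataFrame({
    'feature': range(20),
    'Target': ['yes'] * 4 + ['no'] * 16,
})
X_toy = toy.drop(columns='Target')
y_toy = toy['Target']

# stratify keeps the yes/no proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Whether a shuffled or an ordered split is more appropriate depends on whether the rows carry a time order that the evaluation should respect.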

In [None]:
# invoking the decision tree classifier function. Using 'entropy' method of finding the split columns. Other option 
# could be gini index.  Restricting the depth of the tree to 5 (no particular reason for selecting this)

#dt_model = DecisionTreeClassifier(criterion = 'entropy' , max_depth = 5, random_state = 100)
                                  
dt_model = DecisionTreeClassifier(criterion = 'entropy' )

In [None]:
dt_model.fit(train_set, train_labels)

In [None]:
# Print the accuracy of the model on the test set and store the predictions
print(dt_model.score(test_set , test_labels))
test_pred = dt_model.predict(test_set)

In [None]:
print (pd.DataFrame(dt_model.feature_importances_, columns = ["Imp"], index = train_set.columns))#Print the feature importance of the decision model

In [None]:
y_predict = dt_model.predict(test_set)

In [None]:
print(dt_model.score(train_set , train_labels))
print(dt_model.score(test_set , test_labels))

In [None]:
print(metrics.confusion_matrix(test_labels, y_predict))

# The unrestricted tree appears to overfit the training data.

#  Regularising the Decision Tree

In [None]:
reg_dt_model = DecisionTreeClassifier(criterion = 'entropy', max_depth = 7)
reg_dt_model.fit(train_set, train_labels)

In [None]:
# Feature importances of the regularised tree (not the unrestricted dt_model)
print (pd.DataFrame(reg_dt_model.feature_importances_, columns = ["Imp"], index = train_set.columns))


In [None]:
y_predict = reg_dt_model.predict(test_set)

In [None]:
DTC=reg_dt_model.score(test_set , test_labels)
print(DTC)

In [None]:
print(metrics.confusion_matrix(test_labels, y_predict))
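
One way to see the effect of limiting `max_depth` is to compare training and test accuracy for several depths. The sketch below uses synthetic, deliberately noisy data from `make_classification` as a stand-in (an assumption; the numbers will differ on the bank data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% label noise, not the bank dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

gaps = {}
for depth in [2, 5, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    test_acc = tree.score(X_te, y_te)
    gaps[depth] = (train_acc, test_acc)
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

A large gap between the two accuracies for the unrestricted tree, which shrinks as the depth is limited, is the usual sign of overfitting.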

# B.Apply Bagging Classifier Algorithm and print the accuracy. 

In [None]:
from sklearn.ensemble import BaggingClassifier

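# The tree is passed positionally because the keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases
bgcl = BaggingClassifier(dt_model, n_estimators=50)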

#bgcl = BaggingClassifier(n_estimators=50)
bgcl = bgcl.fit(train_set, train_labels)


In [None]:
y_predict = bgcl.predict(test_set)

BGC=bgcl.score(test_set , test_labels)
print(BGC)

print(metrics.confusion_matrix(test_labels, y_predict))

# C. Apply Adaboost Ensemble Algorithm for the same data and print the accuracy. 

In [None]:
from sklearn.ensemble import AdaBoostClassifier
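# The tree is passed positionally because the keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases
abcl = AdaBoostClassifier(dt_model, n_estimators=10)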
#abcl = AdaBoostClassifier( n_estimators=50)
abcl = abcl.fit(train_set, train_labels)


In [None]:
y_predict = abcl.predict(test_set)

ADE=abcl.score(test_set , test_labels)
print(ADE)

print(metrics.confusion_matrix(test_labels, y_predict))

# D.Apply GradientBoost Classifier Algorithm for the same data and print the accuracy.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50)
gbcl = gbcl.fit(train_set, train_labels)

In [None]:
y_predict = gbcl.predict(test_set)
GBC=gbcl.score(test_set , test_labels)
print(GBC)
print(metrics.confusion_matrix(test_labels, y_predict))

# E. Apply the Random forest model and print the accuracy of Random forest Model

## Note: A random forest is, by construction, an ensemble of decision trees.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50)
rfcl = rfcl.fit(train_set, train_labels)

In [None]:
y_predict = rfcl.predict(test_set)
RFC=rfcl.score(test_set , test_labels)
print(RFC)
print(metrics.confusion_matrix(test_labels, y_predict))

# 7. Compare performances of all the models

In [None]:
models = pd.DataFrame({
                'Models': [ 'Gaussian NB','K-Nearest Neighbors','Logistic Model', 'Decision Tree Classifier',
                            'Bagging Classifier', 'AdaBoost Ensemble','GradientBoost Classifier', 'Random Forest Classifier'],
                'Score':  [NB, KNN, LR, DTC, BGC, ADE, GBC, RFC]})

models.sort_values(by='Score', ascending=False)
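
The scores in this table come from different train/test splits, so small differences between models may partly reflect the split rather than the model. A hedged sketch of a more like-for-like comparison, using cross-validation on synthetic stand-in data (an assumption, and only three of the models for brevity):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data, not the bank dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    'Gaussian NB': GaussianNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
}

rows = []
for name, est in candidates.items():
    # Every model is scored on the same 5 folds of the same data
    scores = cross_val_score(est, X_demo, y_demo, cv=5, scoring='accuracy')
    rows.append({'Model': name, 'Mean accuracy': scores.mean(), 'Std': scores.std()})

summary = pd.DataFrame(rows).sort_values(by='Mean accuracy', ascending=False)
print(summary)
```

The standard deviation across folds gives a rough sense of how stable each model's accuracy is.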

# Conclusions : 

## 1. The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (the Target variable).
## 2. A bank wants to know whether clients will subscribe to a term deposit, so it needs information about how the variables in the dataset relate to that outcome.
## 3. Eight classification models were studied here.
## 4. From the accuracy scores, the Logistic Regression algorithm seems to have the highest accuracy and stability.
## 5. KNN can also be used, as it too shows good accuracy and stability compared with the other models.