**Thanks for Dropping by.**

**Please consider upvoting** if you like the notebook.

If you have any suggestions or improvements, please drop them in the comments and I will definitely go over them. 

**Thanks in Advance**

**Note:** I am still working on this notebook. Please consider dropping by again to see more updates.

**Summary:** 

The data is related to direct marketing campaigns of a Portuguese banking institution, based on phone calls (Moro, Cortez, and Rita 2014). The goal of the campaigns was to get the clients to subscribe to a term deposit. 
There are 20 input variables and 1 binary output variable (y) that indicates whether the client subscribed to a term deposit with values ‘yes’,‘no’. 
The input variables can be divided into four categories: 
1. bank client data 
2. data related to last contact of current campaign
3. social and economic context attributes
4. other attributes. 

Bank client data contains variables containing information about the client. It includes variables indicating age, job, marital status, education, whether they have credit in default, whether they have a housing loan, whether they have a personal loan. 

Data related to the last contact of the current campaign contain variables indicating the mode of communication, month of last communication, day of week when the last contact was made and the last call duration. 

Social and economic context attributes contain variables with the quarterly employment variation rate, monthly consumer price index, monthly consumer confidence index, number of employees and the euribor 3 month rate. 

Other attributes include number of previous contacts with the client during the current campaign, number of days since the last contact for the previous campaign, number of contacts performed before the current campaign for the client and the outcome of the previous marketing campaign. 
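As a quick reference, the four categories above can be written down as a plain dictionary of column names. This is only a sketch: the name `FEATURE_GROUPS` and the group keys are made up for illustration, while the column names are the ones used in the dataset.

```python
# Hypothetical helper: maps each category described above to its columns.
# The dictionary name and its keys are illustrative, not part of the dataset.
FEATURE_GROUPS = {
    "bank_client": ["age", "job", "marital", "education",
                    "default", "housing", "loan"],
    "last_contact": ["contact", "month", "day_of_week", "duration"],
    "social_economic": ["emp.var.rate", "cons.price.idx", "cons.conf.idx",
                        "euribor3m", "nr.employed"],
    "other": ["campaign", "pdays", "previous", "poutcome"],
}

# Flatten the groups to confirm that they cover all 20 input variables
all_features = [col for cols in FEATURE_GROUPS.values() for col in cols]
print(len(all_features))
```

Keeping the groups in one place makes it easy to select a whole category later, for example `df[FEATURE_GROUPS["social_economic"]]`.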

The goal of the project is to classify with high accuracy whether the campaign will be successful or not given a set of input variables.

**Proposed Plan:** 

In this project I will use the above data parameters to predict the outcome of the marketing campaign for the customer. 

I will use Matplotlib and Seaborn for basic visualization and exploratory data analysis, and pandas to wrangle the data. The data wrangling techniques I plan to use include imputation of missing/NA values and converting categorical variables to numeric variables using one-hot encoding. 
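A minimal sketch of the two wrangling steps mentioned above, on a small made-up frame. It assumes that missing values are marked with the string `unknown` (as in the `housing` and `loan` columns) and that filling them with the most frequent value is acceptable; both the toy data and the imputation strategy are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [41, 29, 55, 33],
    "marital": ["married", "single", "unknown", "single"],
})

# Step 1: treat 'unknown' as missing and impute with the most frequent value
toy["marital"] = toy["marital"].replace("unknown", np.nan)
most_frequent = toy["marital"].mode()[0]
toy["marital"] = toy["marital"].fillna(most_frequent)

# Step 2: one-hot encode the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=["marital"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per category.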

For classification, I am planning on using:
1. Logistic Regression 
2. Random Forests 
3. K-Nearest Neighbours 
4. Support Vector Machines
5. Neural Networks. 

The preliminary challenges I expect are in data wrangling and feature selection. Since there are 20 variables available to fit the models and predict the outcome of the campaign, it will be a challenge to select only those features that have a significant impact on the response variable. I also plan to carry out feature engineering to create new features based on pre-existing ones.

**Preliminary Results:**

After performing exploratory data analysis, it was found that social and economic context attributes affect the outcome the most. The bank client data and data related to the last campaign had little to no effect on the outcome. We will use this as the starting point for building our machine learning models.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from imblearn.under_sampling import NearMiss
from scipy import stats

%matplotlib inline

In [None]:
df = pd.read_csv("../input/bank-marketing/bank-additional-full.csv", delimiter=';')

In [None]:
# Converting the target column into 0/1 with a boolean comparison
# 'yes' becomes 1 and 'no' becomes 0
Y = (df['y'] == 'yes')*1

In [None]:
# Looking at column types and non-null counts of our data
df.info()

In [None]:
# Looking at all the columns in the dataset.
df.columns

In [None]:
# Dropping y from the original dataset as we have stored it separately
df.drop('y', axis = 1, inplace = True)

In [None]:
# First five rows of the dataset after dropping y from the dataset
print(df.head())

## **1. Exploratory Data Analysis**

I will perform some exploratory data analysis to see how different features are distributed in the dataset. 

1.1 Visualizing how age is distributed in the dataset

In [None]:
# Visualizing how age is distributed in the dataset
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df['age'], kde = True, color = "#07247D", edgecolor = 'black')

1.2 Visualizing how Marital Status and Education are distributed in the dataset.

In [None]:
# Visualizing how Marital Status and Education are distributed in the dataset. 
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))

# First plot for marital status
sns.countplot(x = "marital", data = df, ax = ax1)
ax1.set_title("marital status distribution", fontsize = 13)
ax1.set_xlabel("Marital Status", fontsize = 12)
ax1.set_ylabel("Count", fontsize = 12)

# Second plot for Education distribution
sns.countplot(x = "education", data = df, ax = ax2)
ax2.set_title("Education distribution", fontsize = 13)
ax2.set_xlabel("Education level", fontsize = 12)
ax2.set_ylabel("Count", fontsize = 12)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation = 70)

1.3 Visualizing how Jobs are distributed

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(15,5)
sns.countplot(x = "job", data = df)
ax.set_xlabel('Job', fontsize = 12)
ax.set_ylabel('Count', fontsize = 12)
ax.set_title("Job Count Distribution", fontsize = 13)

1.4 Housing and Loan Distribution

Visualizing how: 

1. Housing Loans are distributed. 
2. Personal Loans are distributed. 

In [None]:
# Housing loan data distribution
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (15, 5))
sns.countplot(x = "housing", data = df, ax = ax1, order = ['yes', 'no', 'unknown'])
ax1.set_title("Housing Loan distribution")
ax1.set_xlabel("Housing Loan")
ax1.set_ylabel("Count")

# Personal loan data distribution
sns.countplot(x = "loan", data = df, ax = ax2, order = ['yes', 'no', 'unknown'])
ax2.set_title("Personal Loan Distribution")
ax2.set_xlabel("Personal Loan")
ax2.set_ylabel("Count")

Getting total count for: 

1. Credit Defaulters 
2. People with Housing loan 
3. People with Personal loan

*Credit Defaulter*

In [None]:
print("Number of people with credit default: ", df[df['default'] == 'yes']['default'].count())
print("Number of people with no credit default: ", df[df['default'] == 'no']['default'].count())
print("Number of people whose credit default is unknown: ", df[df['default'] == 'unknown']['default'].count())

*Housing Loan*

In [None]:
print("Number of people with Housing loan: ", df[df['housing'] == 'yes']['housing'].count())
print("Number of people with no Housing loan: ", df[df['housing'] == 'no']['housing'].count())
print("Number of people whose Housing loan is unknown: ", df[df['housing'] == 'unknown']['housing'].count())

*Personal Loan*

In [None]:
print("Number of people with Personal loan: ", df[df['loan'] == 'yes']['loan'].count())
print("Number of people with no Personal loan: ", df[df['loan'] == 'no']['loan'].count())
print("Number of people whose Personal loan is unknown: ", df[df['loan'] == 'unknown']['loan'].count())

1.5 Visualisation related to "Last Contact of the Current Campaign" 

<i> Visualisation related to Duration </i>

A boxplot alone makes it difficult to read off important values such as the average of the distribution, so I am plotting a histogram alongside it to see how the durations are distributed and, if possible, to estimate the mean. 

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (15, 5))

# Passing duration as y draws a vertical box, matching the axis labels below
sns.boxplot(y = "duration", data = df, ax = ax1)
ax1.set_xlabel("Calls")
ax1.set_ylabel("Duration")
ax1.set_title("Call distribution")

# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df['duration'], kde = True, ax = ax2)
ax2.set_xlabel("Call duration")
ax2.set_ylabel("Count")
ax2.set_title("Call Duration vs Count")

Getting the Mean, Standard Deviation, Minimum and Maximum values for duration  

In [None]:
min_duration = df['duration'].min()
max_duration = df['duration'].max()
mean_duration = df['duration'].mean()
standard_dev_duration = df["duration"].std()

print("Min call duration: ", min_duration)
print("Max call duration: ", max_duration)
print("Mean call duration: ", round(mean_duration, 2))
print("Standard deviation in call duration: ", round(standard_dev_duration, 2))

We can see from the box plot that most call durations are concentrated in a narrow range near the centre of the distribution, so finding the interquartile range will help us understand how long a typical call lasts.

In [None]:
first_quartile = df['duration'].quantile(q = 0.25)
second_quartile = df['duration'].quantile(q = 0.50)
third_quartile = df['duration'].quantile(q = 0.75)
fourth_quartile = df['duration'].quantile(q = 1)
# The interquartile range is the spread of the middle 50% of the data: Q3 - Q1
IQR = third_quartile - first_quartile

print("First Quartile: ", first_quartile)
print("Second Quartile (median): ", second_quartile)
print("Third Quartile: ", third_quartile)
print("Interquartile range (range containing the middle 50% of the data): ", IQR)
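The interquartile range (third quartile minus first quartile) can also be used to flag unusually long calls. The sketch below applies the common 1.5 × IQR rule to a small made-up series; the numbers and the choice of the 1.5 multiplier are assumptions for illustration, not something this notebook relies on.

```python
import pandas as pd

# Toy call durations in seconds, invented for illustration only
durations = pd.Series([30, 60, 90, 120, 180, 240, 300, 2000])

q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1  # spread of the middle 50% of the data

# Fences of the 1.5 * IQR rule; values outside them are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]

print("IQR:", iqr)
print("Outliers:", outliers.tolist())
```

On the real data the same steps would be applied to `df['duration']`.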

 <i> Visualisation related to "Contact, Month and Day of the week" </i>

In [None]:
# For contact and Days of the week
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (15, 5))

sns.countplot(x = 'contact', data = df, ax = ax1)
ax1.set_xlabel("Contact Method")
ax1.set_ylabel("Count")
ax1.set_title("Count of Contact Methods")

sns.countplot(x = 'day_of_week', data = df, ax = ax2)
ax2.set_xlabel("Days of the week")
ax2.set_ylabel("Count")
ax2.set_title("Count of Calls made on Days of the week")


In [None]:
# For Months
fig, ax = plt.subplots(figsize = (15, 5))
sns.countplot(x = 'month', data = df, order = ['mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'])
ax.set_xlabel("Months")
ax.set_ylabel("Count")
ax.set_title("Count of contacts made in each month")

*Checking if there exists a relation between Duration of call and Jobs*

In [None]:
fig, ax = plt.subplots(figsize = (15, 5))
sns.boxplot(x = "job", y = "duration", data = df, orient = 'v')
ax.set_xlabel("Jobs")
ax.set_ylabel("Duration")
ax.set_yscale("log")
ax.set_title("log(Duration) vs Jobs")

*Checking if there is a relation between average duration of call and education*

In [None]:
fig, ax = plt.subplots(figsize = (15, 5))
sns.boxplot(x = "education", y = "duration", data = df, orient = 'v')
ax.set_xlabel("Education")
ax.set_ylabel("Duration")
ax.set_yscale("log")
ax.set_title("log(Duration) vs Education")

From the above graph we can observe that calls with illiterate clients tend to be shorter than calls with the other education groups. 
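A boxplot shows medians rather than means, so a reading like the one above can be double-checked numerically with a grouped summary. The sketch below uses a made-up frame; the education labels follow the dataset's naming, but the durations are invented.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "education": ["university.degree", "university.degree", "university.degree",
                  "illiterate", "illiterate", "basic.4y", "basic.4y"],
    "duration": [100, 200, 900, 150, 250, 120, 300],
})

# Count, mean and median call duration for each education level
summary = toy.groupby("education")["duration"].agg(["count", "mean", "median"])
print(summary)
```

Including the count is useful because a very small group can show a low or high average purely by chance.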

## 2. Categorical Treatment

*The different categorical features and their values in the dataset are:*

In [None]:
print("Jobs: \n", df["job"].unique(),'\n')
print("Marital Status: \n", df['marital'].unique(),'\n')
print("Education: \n", df['education'].unique(),'\n')
print("Default on Credit: \n", df['default'].unique(),'\n')
print("Housing loan: \n", df['housing'].unique(),'\n')
print("Personal loan: \n", df['loan'].unique(),'\n')
print("Contact type: \n", df['contact'].unique(),'\n')
print("Months: \n", df['month'].unique(),'\n')
print("day_of_week: \n", df['day_of_week'].unique(),'\n')
print("Poutcome: \n",df["poutcome"].unique(),'\n')

#### Creating label encoders to treat all categorical variables

In [None]:
labelencoder_X = LabelEncoder()

In [None]:
df["job"] = labelencoder_X.fit_transform(df["job"])
df["marital"] = labelencoder_X.fit_transform(df["marital"])
df["education"] = labelencoder_X.fit_transform(df["education"])
df["default"] = labelencoder_X.fit_transform(df["default"])
df["housing"] = labelencoder_X.fit_transform(df["housing"])
df["loan"] = labelencoder_X.fit_transform(df["loan"])
df["contact"] = labelencoder_X.fit_transform(df["contact"])
df["month"] = labelencoder_X.fit_transform(df["month"])
df["day_of_week"] = labelencoder_X.fit_transform(df["day_of_week"])
df["poutcome"] = labelencoder_X.fit_transform(df["poutcome"])
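Because a single encoder is refitted for every column, the mapping between labels and integer codes is lost after each step. A possible variation, sketched below on made-up data, keeps one `LabelEncoder` per column so the codes can be translated back later. The `encoders` dictionary is a hypothetical name introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "poutcome": ["nonexistent", "failure", "success", "nonexistent"],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
})

# Hypothetical helper: one fitted encoder per column, so mappings are kept
encoders = {}
for col in ["poutcome", "contact"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder assigns codes in sorted (alphabetical) order of the labels
mapping = {str(label): code
           for code, label in enumerate(encoders["poutcome"].classes_)}
print(mapping)

# Translating a code back to its original label
decoded = encoders["poutcome"].inverse_transform([2])
print(decoded)
```

Knowing the mapping makes it possible to interpret encoded values such as `poutcome == 2` when reading later outputs.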

In [None]:
# For dataframes to display all the columns in the output
pd.set_option('display.max_columns', None)
df.head()

In [None]:
df['y'] = Y

## 3. Undersampling and Feature Engineering

##### 3.1 Undersampling

In [None]:
df[df['y'] == 1].shape

From the above output we can see that out of 41k odd entries, we have only 4640 positive instances. Our data is imbalanced.

#### Dealing with imbalanced dataset

We can see that the data is imbalanced, so a model trained on it will be biased towards the majority class. Two ways to move forward are:
1. Undersample the data
2. Oversample the data
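To make the two options concrete, here is a minimal sketch of random undersampling and random oversampling using only pandas. This is a simpler technique than NearMiss and is shown purely for illustration, on made-up data.

```python
import pandas as pd

# Toy imbalanced data: 2 positive rows and 8 negative rows (invented)
toy = pd.DataFrame({"x": range(10),
                    "y": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]})

minority = toy[toy["y"] == 1]
majority = toy[toy["y"] == 0]

# Undersampling: keep only as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=42)
undersampled = pd.concat([minority, majority_down]).reset_index(drop=True)

# Oversampling: draw minority rows with replacement to match the majority
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
oversampled = pd.concat([majority, minority_up]).reset_index(drop=True)

print(undersampled["y"].value_counts())
print(oversampled["y"].value_counts())
```

Undersampling discards data, while oversampling repeats rows, so each choice trades information loss against the risk of overfitting to duplicated examples.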

In [None]:
# NearMiss comes in three versions. I will be using version 3 for undersampling
undersample = NearMiss(version=3)

In [None]:
# Preparing the dataframe to be fed into undersample
df_x = df.iloc[:,:-1]
df_y = df['y']

In [None]:
# Getting our new data
X, y = undersample.fit_resample(df_x, df_y)

In [None]:
# Adding output column back to X to perform feature selection
X['y'] = y
X.head()

##### 3.2 Feature Selection

In [None]:
# Generate a correlation matrix heat map on the undersampled data to check which features are most correlated with the output
fig, ax = plt.subplots(figsize = (20, 10))
corr = X.corr()
# Mask the upper triangle, which mirrors the lower one
mask = np.triu(corr)
sns.heatmap(corr, annot=True, fmt='.1f', vmin=-1, vmax=1, center= 0, cmap= 'coolwarm', mask=mask)

The output has a noticeable correlation with duration, pdays, emp.var.rate, euribor3m and nr.employed, so they are likely to be better features than the others.
We should also note that emp.var.rate is highly correlated with nr.employed, euribor3m and cons.price.idx, so we need to take this into account while selecting our features.
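Highly correlated feature pairs can also be listed programmatically instead of being read off the heat map. The sketch below does this on synthetic data; the 0.8 threshold and the column names `a`, `b` and `c` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is almost a copy of 'a', while 'c' is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Absolute correlations, keeping only the strict upper triangle so that
# each pair appears once and the diagonal is excluded
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) pairs and keep those above the threshold
pairs = upper.stack()
high = pairs[pairs > 0.8]
print(high)
```

For each pair that is reported, one option is to keep only the feature that correlates more strongly with the output.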

##### For Categorical features:

In [None]:
# Checking to see if any categorical variables have direct relationship with y

for i in ["job","marital","education","default","housing","loan","contact","month","day_of_week","poutcome"]:
    print("Results for categorical variable {} is:\n".format(i))
    print(X.groupby(i)['y'].mean())

From the above output we can't draw any firm conclusion, but one thing stands out: when poutcome is 2 (which is 'success' under the alphabetical label encoding), 72.5% of the outcomes are positive. 

##### Let's check categorical plots for the categorical features.

In [None]:
# Generate categorical plots for features
for col in ["job","marital","education","default","housing","loan","contact","month","day_of_week","poutcome"]:
    sns.catplot(x=col, y='y', data=X, kind='point', aspect=2)
    plt.ylim(0, 1)

From the above graphs we can see that housing and loan remain almost constant and don't have much of an effect, so we can probably eliminate them. Let's perform statistical tests to check that.

##### Statistical Significance Test

In [None]:
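def describe_cont_feature(feature):
    # Summary statistics per class, followed by a t-test between the two classes
    print('\n*** Results for {} ***'.format(feature))
    print(X.groupby('y')[feature].describe())
    ttest(feature)  # ttest prints its own result, so no outer print is needed

def ttest(feature):
    # Welch's t-test (equal_var=False) comparing subscribers with non-subscribers
    subscribed = X[X['y']==1][feature]
    not_subscribed = X[X['y']==0][feature]
    tstat, pval = stats.ttest_ind(subscribed, not_subscribed, equal_var=False)
    print('t-statistic: {:.1f}, p-value: {:.3}'.format(tstat, pval))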

In [None]:
for feature in ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed']:
    describe_cont_feature(feature)

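From the above output, we can see that age, housing and month have p-values greater than 0.5, well above the usual 0.05 significance level. For these features we fail to reject the null hypothesis that the two classes have the same mean, so we can eliminate them. 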
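A t-test compares means, which suits numeric features. For label-encoded categorical features the integer codes are arbitrary, so a chi-square test of independence is a possible alternative. The sketch below is not the method used in this notebook; it runs `scipy.stats.chi2_contingency` on a made-up contingency table.

```python
import pandas as pd
from scipy import stats

# Toy data, invented for illustration: outcome differs strongly by poutcome
toy = pd.DataFrame({
    "poutcome": ["success"] * 40 + ["failure"] * 40,
    "y": [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30,
})

# Contingency table of category counts against the outcome
table = pd.crosstab(toy["poutcome"], toy["y"])

# Null hypothesis: the category and the outcome are independent
chi2, pval, dof, expected = stats.chi2_contingency(table)
print("chi2: {:.2f}, p-value: {:.3g}, dof: {}".format(chi2, pval, dof))
```

A small p-value suggests that the feature and the outcome are related, while a large one gives no evidence of a relationship.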

## 4. Building Machine Learning Models