### Predicting subscription to term deposits in a bank

# Problem Statement:-
A marketing campaign is a set of initiatives that an institution deploys to promote the purchase or sale of a product or service, including advertising, marketing, and distribution to customers or businesses. Here, a Portuguese banking institution plans to run a marketing campaign and wants to use the telephone call records from its latest initiative to predict whether its customers will subscribe to term deposits. The records of those calls are available as a dataset.

A term deposit is a financial product offered by banking institutions where customers deposit a sum of money for a specific period at a fixed interest rate. Term deposits provide banks with a stable and predictable source of funding.

# Project Objective :-
The objective of the project is to develop a machine learning model that predicts whether a contacted customer will subscribe to a term deposit.

# Dataset :-
The bank's dataset contains 21 variables describing 41,188 customer observations: 20 predictor variables and the target variable, y. The dataframe uses 6.6+ MB of memory.

For more information about the dataset, see https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

# Project Workflow :-

# Data Analysis

Let's go ahead and load the dataset.

### Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# for the yeo-johnson transformation
import scipy.stats as stats
import pylab

# The machine learning models.

# To evaluate the models.

# To separate data into train and test.

## Variable descriptions

The variables are as follows:

# Bank client data:

1 - age (numeric)

2 - job: type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")

3 - marital: marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)

4 - education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")

5 - default: has credit in default? (categorical: "no","yes","unknown")

6 - housing: has housing loan? (categorical: "no","yes","unknown")

7 - loan: has personal loan? (categorical: "no","yes","unknown")

# Related with the last contact of the current campaign:

8 - contact: contact communication type (categorical: "cellular","telephone") 

9 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

10 - day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")

11 - duration: last contact duration, in seconds (numeric). Important note:  this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

# Other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
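The 999 value in pdays is a sentinel rather than a real number of days, so it can distort averages or scaling if treated as numeric. One possible way to handle it is sketched below on a toy frame; the column names `previously_contacted` and `pdays_clean` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data; only the pdays column matters here.
toy = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# Hypothetical flag: 1 if the client was contacted in a previous campaign, else 0.
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)

# Replace the 999 sentinel with NaN so it is excluded from numeric summaries.
toy["pdays_clean"] = toy["pdays"].replace(999, np.nan)

print(toy)
```

Whether to keep the flag, the cleaned column, or both depends on the model used later; tree-based models may tolerate the raw sentinel, while linear models usually will not.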

# Social and economic context attributes

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)     

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)    

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

# Output variable (desired target):

21 - y - has the client subscribed a term deposit? (binary: "yes","no")
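Because y is stored as the strings "yes" and "no", most classifiers need it mapped to a numeric label first. The sketch below shows one way to do this on a toy frame and to check the class proportions; the column name `y_bin` is a hypothetical choice, and the toy values are not taken from the dataset.

```python
import pandas as pd

# Toy target column standing in for the real y variable.
toy = pd.DataFrame({"y": ["no", "no", "yes", "no"]})

# Map the string labels to 0/1 (hypothetical column name y_bin).
toy["y_bin"] = toy["y"].map({"no": 0, "yes": 1})

# Share of each class, useful for spotting class imbalance.
class_share = toy["y"].value_counts(normalize=True)
print(class_share)
```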

### Reading Dataset and Basic Data Exploration

In [4]:
# Load the dataset (semicolon-delimited; every backslash in the Windows path is escaped)
bank_df = pd.read_csv('C:\\Users\\Aboubacar\\OneDrive\\Bureau\\Bank_Marketing\\bank_additional.csv', delimiter=';')
# Drop 'duration': it is only known after the call, so it is unsuitable for a realistic model
bank_df = bank_df.drop('duration', axis=1)
# Display the dataset characteristics
print(bank_df.info())
# Display all columns when printing the dataframe
pd.set_option('display.max_columns', None)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4119 entries, 0 to 4118
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4119 non-null   int64  
 1   job             4119 non-null   object 
 2   marital         4119 non-null   object 
 3   education       4119 non-null   object 
 4   default         4119 non-null   object 
 5   housing         4119 non-null   object 
 6   loan            4119 non-null   object 
 7   contact         4119 non-null   object 
 8   month           4119 non-null   object 
 9   day_of_week     4119 non-null   object 
 10  campaign        4119 non-null   int64  
 11  pdays           4119 non-null   int64  
 12  previous        4119 non-null   int64  
 13  poutcome        4119 non-null   object 
 14  emp.var.rate    4119 non-null   float64
 15  cons.price.idx  4119 non-null   float64
 16  cons.conf.idx   4119 non-null   float64
 17  euribor3m       4119 non-null   f



In [5]:
# Display the first 5 rows
bank_df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | blue-collar | married | basic.9y | no | yes | no | cellular | may | fri | 2 | 999 | 0 | nonexistent | -1.8 | 92.893 | -46.2 | 1.313 | 5099.1 | no |
| 1 | 39 | services | single | high.school | no | no | no | telephone | may | fri | 4 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.855 | 5191.0 | no |
| 2 | 25 | services | married | high.school | no | yes | no | telephone | jun | wed | 1 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.962 | 5228.1 | no |
| 3 | 38 | services | married | basic.9y | no | unknown | unknown | telephone | jun | fri | 3 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.959 | 5228.1 | no |
| 4 | 47 | admin. | married | university.degree | no | yes | no | cellular | nov | mon | 1 | 999 | 0 | nonexistent | -0.1 | 93.2 | -42.0 | 4.191 | 5195.8 | no |


In [6]:
# Display the last 5 rows
bank_df.tail()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4114 | 30 | admin. | married | basic.6y | no | yes | yes | cellular | jul | thu | 1 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.958 | 5228.1 | no |
| 4115 | 39 | admin. | married | high.school | no | yes | no | telephone | jul | fri | 1 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.959 | 5228.1 | no |
| 4116 | 27 | student | single | high.school | no | no | no | cellular | may | mon | 2 | 999 | 1 | failure | -1.8 | 92.893 | -46.2 | 1.354 | 5099.1 | no |
| 4117 | 58 | admin. | married | high.school | no | no | no | cellular | aug | fri | 1 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.966 | 5228.1 | no |
| 4118 | 34 | management | single | high.school | no | yes | no | cellular | nov | wed | 1 | 999 | 0 | nonexistent | -0.1 | 93.2 | -42.0 | 4.12 | 5195.8 | no |


In [7]:
# Display 5 randomly sampled rows
bank_df.sample(5)

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1249 | 56 | self-employed | married | university.degree | no | no | no | cellular | apr | mon | 3 | 999 | 1 | failure | -1.8 | 93.075 | -47.1 | 1.405 | 5099.1 | no |
| 84 | 38 | entrepreneur | divorced | university.degree | no | yes | no | cellular | nov | thu | 1 | 999 | 0 | nonexistent | -0.1 | 93.2 | -42.0 | 4.076 | 5195.8 | no |
| 2016 | 24 | services | single | professional.course | no | yes | no | cellular | jul | wed | 1 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.962 | 5228.1 | no |
| 1922 | 32 | blue-collar | married | professional.course | no | no | no | telephone | may | fri | 2 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.864 | 5191.0 | no |
| 3456 | 36 | technician | married | professional.course | no | yes | no | telephone | may | thu | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.86 | 5191.0 | no |
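The info() output reports no null values, yet several categorical variables use the string "unknown" as a placeholder (row 3 above has it for both housing and loan). A minimal sketch for counting these placeholders per column is shown below on a toy frame; in the notebook the same two lines could presumably be applied to bank_df, and the name `unknown_counts` is a hypothetical choice.

```python
import pandas as pd

# Toy frame standing in for bank_df, with "unknown" used as a placeholder.
toy = pd.DataFrame({
    "housing": ["yes", "no", "yes", "unknown"],
    "loan": ["no", "no", "no", "unknown"],
    "age": [30, 39, 25, 38],
})

# Select the non-numeric (categorical) columns only.
cat_cols = toy.select_dtypes(exclude="number").columns

# Count how many times "unknown" appears in each categorical column.
unknown_counts = (toy[cat_cols] == "unknown").sum()
print(unknown_counts)
```

These counts can help decide later whether to treat "unknown" as its own category or as a missing value to impute.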
