## Project overview
The data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
There are four datasets:
* <strong>bank-additional-full.csv</strong>  with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014].
* <strong>bank-additional.csv</strong>  with 10% of the examples (4119), randomly selected from bank-additional-full.csv, and 20 inputs.
* <strong>bank-full.csv</strong>  with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
* <strong>bank.csv</strong>  with 10% of the examples and 17 inputs, randomly selected from bank-full.csv (older version of this dataset with fewer inputs).

* Dataset link: <a href="https://archive.ics.uci.edu/ml/datasets/Bank+Marketing">here</a>

The smaller datasets are provided for testing more computationally demanding machine learning algorithms (e.g., SVM).
The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).
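
As a minimal sketch of what this target looks like in code, the snippet below maps the yes/no label to 1/0 and reports the share of each class. The `toy` frame and the `y_bin` column name are made up for illustration; they are not rows or columns from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({"y": ["no", "no", "yes", "no", "yes", "no"]})

# Map the yes/no label to 1/0 so it can serve as a binary target.
toy["y_bin"] = toy["y"].map({"no": 0, "yes": 1})

# Share of each class; a quick way to spot class imbalance early.
class_share = toy["y_bin"].value_counts(normalize=True)
print(class_share)
```

The same two lines could be applied to the `y` column of the loaded dataframe once the data has been read.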


## Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
import statsmodels.api as sm

%matplotlib inline

## Load Dataset

In [9]:
df = pd.read_csv("./Dataset/bank-additional/bank-additional-full.csv", sep=';')
df.sample(10)

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13712 | 32 | admin. | married | university.degree | no | yes | no | cellular | jul | thu | ... | 1 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.963 | 5228.1 | no |
| 38863 | 35 | management | married | university.degree | no | no | no | telephone | nov | mon | ... | 1 | 7 | 3 | failure | -3.4 | 92.649 | -30.1 | 0.714 | 5017.5 | no |
| 29365 | 38 | management | divorced | university.degree | no | no | yes | cellular | apr | fri | ... | 2 | 999 | 0 | nonexistent | -1.8 | 93.075 | -47.1 | 1.405 | 5099.1 | no |
| 9475 | 26 | services | single | high.school | no | no | yes | telephone | jun | mon | ... | 1 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.961 | 5228.1 | no |
| 30477 | 25 | technician | unknown | university.degree | no | yes | no | cellular | may | mon | ... | 2 | 999 | 0 | nonexistent | -1.8 | 92.893 | -46.2 | 1.354 | 5099.1 | no |
| 21036 | 35 | technician | single | professional.course | no | yes | yes | cellular | aug | thu | ... | 1 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.964 | 5228.1 | no |
| 41109 | 34 | technician | married | unknown | no | yes | no | cellular | nov | tue | ... | 1 | 1 | 4 | success | -1.1 | 94.767 | -50.8 | 1.046 | 4963.6 | no |
| 29097 | 38 | technician | married | professional.course | unknown | no | no | cellular | apr | fri | ... | 1 | 999 | 2 | failure | -1.8 | 93.075 | -47.1 | 1.405 | 5099.1 | yes |
| 20740 | 31 | technician | married | university.degree | no | yes | no | cellular | aug | wed | ... | 2 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.965 | 5228.1 | no |
| 12168 | 36 | technician | married | high.school | no | no | no | telephone | jul | tue | ... | 1 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.955 | 5228.1 | no |


In [8]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')
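
Several of these column names contain dots (for example `emp.var.rate`), which rules out attribute-style access such as `df.emp.var.rate` and usually requires quoting in formula-style interfaces. The following is an optional sketch, not a step from the original notebook, that swaps the dots for underscores. The one-row `demo` frame is a stand-in built from values in the sample output above.

```python
import pandas as pd

# Column names copied from the df.columns output; values from one sample row.
dotted = ["emp.var.rate", "cons.price.idx", "cons.conf.idx", "nr.employed"]
demo = pd.DataFrame([[1.4, 93.918, -42.7, 5228.1]], columns=dotted)

# Replace dots with underscores so every name is a valid Python identifier.
demo = demo.rename(columns=lambda name: name.replace(".", "_"))
print(list(demo.columns))
```

Whether to rename is a design choice: keeping the original names stays faithful to the dataset documentation, while underscores make later code a little less error-prone.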

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [11]:
df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64
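
Although `isnull()` reports zero missing values, the sample rows shown earlier contain the string `unknown` in `marital`, `education`, and `default`, so missing information appears to be encoded as an ordinary category. A minimal sketch of counting that placeholder per column follows. The `demo` frame is a three-row stand-in adapted from the sample output, and treating `unknown` as a missing-value marker is an assumption.

```python
import pandas as pd

# Three rows adapted from the sample output above (only a few columns kept).
demo = pd.DataFrame({
    "age": [25, 34, 38],
    "marital": ["unknown", "married", "married"],
    "education": ["university.degree", "unknown", "professional.course"],
    "default": ["no", "no", "unknown"],
})

# isnull() finds nothing, because "unknown" is a regular string value.
null_counts = demo.isnull().sum()

# Count the "unknown" placeholder in the non-numeric columns instead.
text_cols = demo.select_dtypes(exclude="number")
unknown_counts = (text_cols == "unknown").sum()
print(unknown_counts)
```

Running the same count on the full dataframe would show how many rows each categorical column is actually missing, which helps when deciding whether to drop, impute, or keep `unknown` as its own level.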