# LISUM01: Data science Group 1

Members:
1. William Ogweli Okomba, Kenya
2. Ece Kurnaz, Turkey
3. Collin Mburugu, Kenya
4. Udbhav Balaji, India

# 1. Defining the Question

## 1.1 Specifying the Data Analysis Question
Build a model that predicts whether a customer will subscribe to a bank term deposit, based on the historical data in the given dataset. Select one or more suitable learning algorithms and a suitable metric for assessing model quality.


## 1.2 Defining the Metric for Success
Since the problem we are tackling is a classification problem, we will use classification reports and confusion matrices as well as accuracy and precision scores to measure the success of the models used.
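As a rough illustration of these metrics, the sketch below computes a confusion matrix, accuracy, precision, and recall by hand for ten made-up labels. The labels and the helper name `binary_confusion` are assumptions for illustration only and are not results from this project.

```python
# Toy labels, invented for illustration; not project results.
y_true = ["no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "no"]
y_pred = ["no", "no", "no", "no", "no", "yes", "yes", "no", "no", "no"]


def binary_confusion(y_true, y_pred, positive="yes"):
    """Count true/false positives and negatives for a binary target."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


tp, fp, fn, tn = binary_confusion(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + fn + tn)        # share of correct predictions
precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct among predicted "yes"
recall = tp / (tp + fn) if (tp + fn) else 0.0     # found among actual "yes"

print("confusion (tp, fp, fn, tn):", (tp, fp, fn, tn))
print("accuracy={:.2f} precision={:.2f} recall={:.2f}".format(accuracy, precision, recall))
```

On these toy labels accuracy is 0.70 while precision is only 0.50, which is why accuracy is read alongside the other scores. In practice, `sklearn.metrics.confusion_matrix` and `sklearn.metrics.classification_report` report the same quantities.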

## 1.3 Problem statement

ABC Bank (a Portuguese banking institution) wants to sell a term deposit product to its clients. We will use customers' past interactions with the bank or other financial institutions to better understand whether a given client is likely to buy this product. A machine learning model is a reasonable way to do this. Our aim is to save ABC Bank time and resources.


## 1.4 Business Understanding

The term deposit is a deposit product that ABC Bank offers to its customers in Portugal.
Potential customers are more likely to buy the product after marketing staff have explained it to them through channels such as telemarketing, SMS, or email.

The approval is based on a variety of information, from basic biographical data to the loan applications that come through daily.

As data scientists, we work with the product team to build an effective predictive model that assesses a customer's chances of buying the product.

## 1.5 Recording the Experimental Design
- load libraries and dataset
- clean dataset:
    - deal with duplicate and/or missing values
    - deal with outliers, where necessary
    - deal with other anomalies in the data, where necessary
- carry out exploratory data analysis
- carry out feature engineering
- carry out modeling
    - tune hyperparameters
    - feature selection
    - alternative models
- summarize and provide recommendations
- challenge the solution


## 1.6 Data Attribute Information:

Input variables:

### bank client data:
1. - age (numeric)
2. - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. - default: has credit in default? (categorical: 'no','yes','unknown')
6. - housing: has housing loan? (categorical: 'no','yes','unknown')
7. - loan: has personal loan? (categorical: 'no','yes','unknown')
###  related with the last contact of the current campaign:
8. - contact: contact communication type (categorical: 'cellular','telephone')
9. - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
### other attributes:
12. - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. - previous: number of contacts performed before this campaign and for this client (numeric)
15. - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
###  social and economic context attributes
16. - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. - cons.price.idx: consumer price index - monthly indicator (numeric)
18. - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19. - euribor3m: euribor 3 month rate - daily indicator (numeric)
20. - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21. - y - has the client subscribed a term deposit? (binary: 'yes','no')
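The note on `duration` and the `999` sentinel in `pdays` suggest two preparation steps, sketched below on a tiny made-up frame. The toy rows and the `previously_contacted` column name are assumptions for illustration; whether to apply either step is a modeling choice that the attribute notes leave open.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up.
toy = pd.DataFrame({
    "duration": [261, 0, 149],
    "pdays": [999, 6, 999],
    "y": ["no", "no", "yes"],
})

# Drop `duration` for a realistic model: it is only known after the call ends.
features = toy.drop(columns=["duration"])

# Make the 999 sentinel explicit: flag prior contact, then blank out the sentinel
# so it is not treated as a real number of days.
features["previously_contacted"] = features["pdays"] != 999
features["pdays"] = features["pdays"].mask(features["pdays"] == 999)

print(features)
```

Keeping the flag alongside the masked column preserves the "never contacted" information that would otherwise be hidden inside a large numeric value.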

# 2.0 Data Preparation

In [3]:
# load required libraries

import pandas as pd # pandas for reading the file
import numpy as np # numpy for computation
import seaborn as sns # seaborn for visualization
import matplotlib.pyplot as plt # matplotlib for visualization


## 2.1 File one (bank_additional_full.csv)

In [4]:
# load the full dataset
bank_df = pd.read_csv('bank_additional_full.csv', sep=";")

# preview the first 5 records
bank_df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [5]:
# check dataset records
print("The dataset has {} records".format(bank_df.shape[0]))

The dataset has 41188 records


In [6]:
# check the number of variables (columns)
print("The dataset has {} columns".format(bank_df.shape[1]))

The dataset has 21 columns


In [7]:
#changing unknown to null values
bank_df.replace("unknown", np.nan, inplace=True)

#confirming the changes
bank_df.head()


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
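An alternative to replacing values after loading, sketched here under the assumption that every "unknown" should be treated as missing, is to let `read_csv` do the conversion at load time through its `na_values` parameter. The inline CSV text below is made up so that the example runs without the data file.

```python
import io

import pandas as pd

# Made-up, semicolon-separated sample in the same layout as the bank files.
csv_text = (
    "age;job;default\n"
    "56;housemaid;no\n"
    "57;services;unknown\n"
)

# `na_values` adds "unknown" to the strings pandas already parses as missing.
toy = pd.read_csv(io.StringIO(csv_text), sep=";", na_values=["unknown"])

print(toy.isna().sum())
```

Both approaches give the same result; the load-time version simply avoids a separate `replace` step.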


In [8]:
#checking for the missing values
bank_df.isnull().sum()

age                  0
job                330
marital             80
education         1731
default           8597
housing            990
loan               990
contact              0
month                0
day_of_week          0
duration             0
campaign             0
pdays                0
previous             0
poutcome             0
emp.var.rate         0
cons.price.idx       0
cons.conf.idx        0
euribor3m            0
nr.employed          0
y                    0
dtype: int64

In [9]:
#checking duplicates
bank_df.duplicated().sum()

12
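The check above only counts duplicated rows. One possible way to remove them, shown on a made-up frame, is `drop_duplicates`; the toy rows are assumptions for illustration, and whether identical rows in the bank data are true duplicates or distinct customers with matching attributes is a judgment the count alone does not settle.

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are made up).
toy = pd.DataFrame({
    "age": [30, 30, 41],
    "job": ["services", "services", "admin."],
})

n_dupes = int(toy.duplicated().sum())  # rows identical to an earlier row

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("duplicates found:", n_dupes)
print("rows after dropping:", len(deduped))
```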

In [10]:
# file size
import os
file_size = os.path.getsize("bank_additional_full.csv")

print("The file size is: {} bytes".format(file_size))

The file size is: 5834924 bytes


In [11]:
# checking customer information
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             40858 non-null  object 
 2   marital         41108 non-null  object 
 3   education       39457 non-null  object 
 4   default         32591 non-null  object 
 5   housing         40198 non-null  object 
 6   loan            40198 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [13]:
#checking the target variable
bank_df["y"].value_counts()

no     36548
yes     4640
Name: y, dtype: int64
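To put these counts in proportion, the short sketch below converts them to class shares and derives the accuracy of a trivial model that always predicts "no". The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"no": 36548, "yes": 4640}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy obtained by always predicting the majority class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print("{}: {:.1%}".format(label, share))
print("majority-class baseline accuracy: {:.1%}".format(majority_baseline))
```

On the real frame, `bank_df["y"].value_counts(normalize=True)` gives the same shares directly. Any model's accuracy should be compared against this baseline rather than against 50%.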

## 2.2 File two (bank_additional.csv)

In [14]:
# load the sample dataset
bank_df2 = pd.read_csv('bank_additional.csv', sep=";")

# preview the first 5 records
bank_df2.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


In [15]:
# check dataset records
print("The dataset has {} records".format(bank_df2.shape[0]))

The dataset has 4119 records


In [16]:
# check the number of variables (columns)
print("The dataset has {} columns".format(bank_df2.shape[1]))

The dataset has 21 columns


In [17]:
#changing unknown to null values
bank_df2.replace("unknown", np.nan, inplace=True)

#confirming the changes
bank_df2.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,,,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


In [18]:
#checking for the missing values
bank_df2.isna().sum()

age                 0
job                39
marital            11
education         167
default           803
housing           105
loan              105
contact             0
month               0
day_of_week         0
duration            0
campaign            0
pdays               0
previous            0
poutcome            0
emp.var.rate        0
cons.price.idx      0
cons.conf.idx       0
euribor3m           0
nr.employed         0
y                   0
dtype: int64

In [19]:
#checking duplicates
bank_df2.duplicated().any()

False

In [20]:
# file size
import os
file_size = os.path.getsize("bank_additional.csv")

print("The file size is: {} bytes".format(file_size))

The file size is: 583898 bytes


In [21]:
#checking info
bank_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4119 entries, 0 to 4118
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4119 non-null   int64  
 1   job             4080 non-null   object 
 2   marital         4108 non-null   object 
 3   education       3952 non-null   object 
 4   default         3316 non-null   object 
 5   housing         4014 non-null   object 
 6   loan            4014 non-null   object 
 7   contact         4119 non-null   object 
 8   month           4119 non-null   object 
 9   day_of_week     4119 non-null   object 
 10  duration        4119 non-null   int64  
 11  campaign        4119 non-null   int64  
 12  pdays           4119 non-null   int64  
 13  previous        4119 non-null   int64  
 14  poutcome        4119 non-null   object 
 15  emp.var.rate    4119 non-null   float64
 16  cons.price.idx  4119 non-null   float64
 17  cons.conf.idx   4119 non-null   f

In [22]:
#checking the target variable
bank_df2["y"].value_counts()

no     3668
yes     451
Name: y, dtype: int64

## General Observations:

* The data cover a period from May 2008 to November 2010.
* There are two datasets; the second (4,119 records) is a sample of the first (41,188 records).
* There are 10 numeric variables (5 integer and 5 float) and 11 categorical variables, including the target.
* Missing values in both datasets are represented by the string "unknown".
* There are missing values in six variables: job, marital status, education, default, housing, and loan.
* There are 12 duplicates in the first dataset and no duplicates in the sample dataset.
* The target variable is imbalanced: the "no" class has far more observations than the "yes" class in both datasets (36,548 vs. 4,640 in the full dataset and 3,668 vs. 451 in the sample).
* Column names are not uniform: for example, "day_of_week" uses underscores while "emp.var.rate" uses dots. These need to be standardized.
* All variables in both datasets have appropriate data types.
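The column naming issue noted in the observations could be handled with a single rename, sketched below on an empty frame that carries a few of the real column names. Replacing dots with underscores is one possible convention chosen for illustration, not a decision recorded in this notebook.

```python
import pandas as pd

# Empty frame carrying a few of the dataset's column names.
toy = pd.DataFrame(columns=["day_of_week", "emp.var.rate", "cons.price.idx", "y"])

# Replace dots with underscores so every column follows one style.
renamed = toy.rename(columns=lambda name: name.replace(".", "_"))

print(list(renamed.columns))
```

Underscore names also allow attribute-style access such as `renamed.emp_var_rate`, which dotted names do not.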
