# Final Project
Dataset: Bank marketing campaigns dataset | Opening Deposit<br>
Prepared by: Roberto Benedict & Gretty Margaretha<br>
Class: JCDSOL-013(B)

## A. Business Problem Understanding

**Context**  
The Bank Marketing Campaigns dataset (Opening a Term Deposit) describes the results of marketing campaigns run by a Portuguese bank. The campaigns were conducted mostly through direct phone calls that offered clients a term deposit.

If, after all marketing efforts, the client agreed to place a deposit, the target variable is marked 'yes'; otherwise it is marked 'no'.

Target `y` (term):

* 0 : no, the client did not agree to place a deposit
* 1 : yes, the client agreed to place a deposit
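
The raw target is stored as the strings 'yes' and 'no', so the 0/1 encoding listed above has to be applied explicitly. The following is a minimal sketch on a toy frame; the column name `y_encoded` and the sample values are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the campaign data; the real column `y` holds 'yes'/'no' strings.
toy = pd.DataFrame({"y": ["no", "yes", "no", "no", "yes"]})

# Map the labels to the 0/1 encoding described above.
# `y_encoded` is a hypothetical column name chosen for this illustration.
toy["y_encoded"] = toy["y"].map({"no": 0, "yes": 1})

# Share of positive responses in the toy sample.
positive_rate = toy["y_encoded"].mean()
print(toy)
print(f"Positive rate: {positive_rate:.2f}")
```

Using `map` with an explicit dictionary turns any unexpected label into `NaN`, which makes label typos easy to spot.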

**Problem Statement :**

Marketing can consume significant time and resources. If the bank targets every prospective client without first screening them or applying targeted marketing, much of that time and those resources are wasted. The bank wants to make its marketing more efficient by identifying which prospective clients are likely to agree to open a term deposit.

**Goals :**
1. Identify the feature patterns of the most likely potential clients
    * Most important features by correlation with the target
    * Seasonality
    * Socio-economic conditions
2. Minimize promotion cost

Based on this problem, the bank wants to be able to predict how likely a client is to agree to open a term deposit. This lets the bank focus its marketing strategy on the clients most likely to be interested, saving cost, time, and resources.

In addition, the bank wants to know which factors make a client willing or unwilling to open a term deposit, so that it can plan a better approach to potential clients.

**Analytic Approach :**

We will analyze the data to find the patterns that distinguish clients who open a term deposit from those who do not.

We will then build a classification model that helps the bank predict the probability that a client will open a term deposit at the bank.
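
One simple way to look for such patterns is to compare the share of 'yes' responses across the categories of a feature, for example by month to check seasonality. The sketch below uses invented toy data, and the helper `subscription_rate` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

# Toy data invented for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "month": ["may", "may", "may", "oct", "oct", "mar"],
    "y": ["no", "no", "yes", "yes", "no", "yes"],
})

def subscription_rate(frame, by):
    """Return the share of 'yes' responses per category of column `by`."""
    is_yes = frame["y"].eq("yes")
    return is_yes.groupby(frame[by]).mean().sort_values(ascending=False)

rates = subscription_rate(toy, "month")
print(rates)
```

The same helper could be pointed at other categorical columns such as `job` or `contact`; with few observations per category the rates are noisy, so group sizes should be checked alongside them.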

**Library**

In [72]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [73]:
dataset = pd.read_csv('bank-additional-full.csv',delimiter=';')

In [74]:
df = dataset.copy()

In [75]:
df.describe()

| | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


In [76]:
pd.set_option('display.max_colwidth', None)

# Summarize each column: dtype, missing-value count and percentage,
# number of unique values, and the unique values themselves
listItem = []
for col in df.columns:
    nullCount = df[col].isna().sum()
    listItem.append([col, df[col].dtype, nullCount, round(nullCount / len(df) * 100, 2),
                     df[col].nunique(), list(df[col].drop_duplicates().values)])

dfDesc = pd.DataFrame(columns=['dataFeatures', 'dataType', 'null', 'nullPct', 'unique', 'uniqueSample'],
                      data=listItem)
dfDesc

| | dataFeatures | dataType | null | nullPct | unique | uniqueSample |
|---|---|---|---|---|---|---|
| 0 | age | int64 | 0 | 0.0 | 78 | [56, 57, 37, 40, 45, 59, 41, 24, 25, 29, 35, 54, 46, 50, 39, 30, 55, 49, 34, 52, 58, 32, 38, 44, 42, 60, 53, 47, 51, 48, 33, 31, 43, 36, 28, 27, 26, 22, 23, 20, 21, 61, 19, 18, 70, 66, 76, 67, 73, 88, 95, 77, 68, 75, 63, 80, 62, 65, 72, 82, 64, 71, 69, 78, 85, 79, 83, 81, 74, 17, 87, 91, 86, 98, 94, 84, 92, 89] |
| 1 | job | object | 0 | 0.0 | 12 | [housemaid, services, admin., blue-collar, technician, retired, management, unemployed, self-employed, unknown, entrepreneur, student] |
| 2 | marital | object | 0 | 0.0 | 4 | [married, single, divorced, unknown] |
| 3 | education | object | 0 | 0.0 | 8 | [basic.4y, high.school, basic.6y, basic.9y, professional.course, unknown, university.degree, illiterate] |
| 4 | default | object | 0 | 0.0 | 3 | [no, unknown, yes] |
| 5 | housing | object | 0 | 0.0 | 3 | [no, yes, unknown] |
| 6 | loan | object | 0 | 0.0 | 3 | [no, yes, unknown] |
| 7 | contact | object | 0 | 0.0 | 2 | [telephone, cellular] |
| 8 | month | object | 0 | 0.0 | 10 | [may, jun, jul, aug, oct, nov, dec, mar, apr, sep] |
| 9 | day_of_week | object | 0 | 0.0 | 5 | [mon, tue, wed, thu, fri] |


### Missing Values

In [77]:
df.isna().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

### Duplicated

In [78]:
print(f'Duplicated : {df.duplicated().sum()}')
print(f'Percent Duplicated : {round(df.duplicated().sum()/len(df)*100,2)} %')

Duplicated : 12
Percent Duplicated : 0.03 %


In [79]:
dupSuspect = df.duplicated(keep=False)

In [80]:
dfCheck = df[dupSuspect].sort_values(by=list(df.columns),axis=0)
display(dfCheck.head(2),dfCheck.tail(2))

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28476 | 24 | services | single | high.school | no | yes | no | cellular | apr | tue | ... | 1 | 999 | 0 | nonexistent | -1.8 | 93.075 | -47.1 | 1.423 | 5099.1 | no |
| 28477 | 24 | services | single | high.school | no | yes | no | cellular | apr | tue | ... | 1 | 999 | 0 | nonexistent | -1.8 | 93.075 | -47.1 | 1.423 | 5099.1 | no |


| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38255 | 71 | retired | single | university.degree | no | no | no | telephone | oct | tue | ... | 1 | 999 | 0 | nonexistent | -3.4 | 92.431 | -26.9 | 0.742 | 5017.5 | no |
| 38281 | 71 | retired | single | university.degree | no | no | no | telephone | oct | tue | ... | 1 | 999 | 0 | nonexistent | -3.4 | 92.431 | -26.9 | 0.742 | 5017.5 | no |


In [81]:
df = df.drop_duplicates()
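
The cells above use `duplicated()` in two ways: the default counts only the repeated copies, while `keep=False` flags every row that belongs to a duplicate group so that the pairs can be inspected side by side. A minimal sketch on an invented toy frame (the values are assumptions for illustration only):

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical, rows 2 and 3 are unique.
toy = pd.DataFrame({
    "age": [24, 24, 71, 30],
    "job": ["services", "services", "retired", "admin."],
})

# Default keep='first': only the repeated copy is flagged.
n_extra = int(toy.duplicated().sum())

# keep=False: every member of a duplicate group is flagged.
n_involved = int(toy.duplicated(keep=False).sum())

# drop_duplicates keeps the first occurrence of each group.
deduped = toy.drop_duplicates()

print(n_extra, n_involved, len(deduped))
```

This is why the count reported by `duplicated().sum()` matches the number of rows removed by `drop_duplicates()`, whereas the `keep=False` mask selects more rows than are actually dropped.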

### Handling Unknown

In [82]:
listItem = []
for col_name in df.columns:
    listItem.append([col_name, f"{len(df[df[col_name]=='unknown'])} of {len(df)}", f"{round(len(df[df[col_name]=='unknown'])/len(df)*100,2)} %"
    ])

dfUnknown = pd.DataFrame(columns=['Column Name', 'Unknown Count', 'Unknown Percentage'], data=listItem)
dfUnknown

| | Column Name | Unknown Count | Unknown Percentage |
|---|---|---|---|
| 0 | age | 0 of 41176 | 0.0 % |
| 1 | job | 330 of 41176 | 0.8 % |
| 2 | marital | 80 of 41176 | 0.19 % |
| 3 | education | 1730 of 41176 | 4.2 % |
| 4 | default | 8596 of 41176 | 20.88 % |
| 5 | housing | 990 of 41176 | 2.4 % |
| 6 | loan | 990 of 41176 | 2.4 % |
| 7 | contact | 0 of 41176 | 0.0 % |
| 8 | month | 0 of 41176 | 0.0 % |
| 9 | day_of_week | 0 of 41176 | 0.0 % |


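1. `job` : rows with `unknown` can be dropped (0.8 %)
2. `marital` : rows with `unknown` can be dropped (0.19 %)
3. `education` : rows with `unknown` can be dropped (4.2 %)
4. `default` : rows are kept; `unknown` (20.88 %) could imply no prior credit history
5. `housing` : rows with `unknown` can be dropped (2.4 %)
6. `loan` : rows with `unknown` can be dropped (2.4 %)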

In [83]:
col_select = ['job','marital','education','housing','loan']
prevLen = len(df)
df = df[~df[col_select].isin(['unknown']).any(axis=1)]
print(f"Removed rows : {prevLen-len(df)}")
print(f"Removed rows Percentage : {round((prevLen-len(df))/prevLen*100,2)} %")

Removed rows : 2942
Removed rows Percentage : 7.14 %
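
An equivalent way to express the same filter is to treat 'unknown' as a missing value in the selected columns and then use `dropna`. The sketch below compares both approaches on an invented toy frame; the data and the variable names `mask_filtered`, `as_nan`, and `dropped` are assumptions for illustration. Note that `default` is left out of the column list, so its 'unknown' values survive.

```python
import pandas as pd

# Toy frame invented for illustration; 'unknown' marks a missing category.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "loan": ["no", "no", "unknown", "yes"],
    "default": ["unknown", "no", "no", "no"],
})
cols = ["job", "loan"]  # columns whose 'unknown' rows are dropped

# Approach used in the notebook: boolean mask over the selected columns.
mask_filtered = toy[~toy[cols].isin(["unknown"]).any(axis=1)]

# Alternative: blank out 'unknown' in the selected columns, then drop those rows.
as_nan = toy.copy()
as_nan[cols] = as_nan[cols].mask(as_nan[cols].eq("unknown"))
dropped = as_nan.dropna(subset=cols)

# Both approaches keep the same rows.
same = mask_filtered.index.equals(dropped.index)
print(dropped)
print(f"Same rows kept: {same}")
```

The `dropna` variant may be convenient if the missing categories are later imputed rather than removed, since the `NaN` markers work directly with pandas' missing-data tools.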
