<a href="https://colab.research.google.com/github/wahyunh10/Banking-Deposit-Target-Prediction/blob/main/Banking_Deposit_Target_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Import Packages**

In [1]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.0.6-cp37-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.0.6


In [2]:
# import libraries
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from imblearn import under_sampling, over_sampling

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import kneighbors_graph
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier

# **Dataset**

In [3]:
df = pd.read_csv('train.csv', sep=';')
df.sample(10)

index,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
7490,47,services,married,secondary,no,795,yes,no,unknown,29,may,42,3,-1,0,unknown,no
34588,35,management,single,unknown,no,121,yes,no,cellular,5,may,245,5,-1,0,unknown,no
32817,36,technician,single,secondary,yes,-487,yes,no,cellular,17,apr,1080,1,-1,0,unknown,yes
14926,51,blue-collar,divorced,primary,no,1927,no,yes,cellular,16,jul,322,4,-1,0,unknown,no
30039,53,entrepreneur,married,secondary,no,230,yes,no,cellular,4,feb,56,2,250,1,other,no
23963,36,self-employed,married,tertiary,no,10,no,no,cellular,29,aug,47,4,-1,0,unknown,no
39861,26,management,single,tertiary,no,3178,no,no,cellular,2,jun,64,1,-1,0,unknown,no
8457,38,services,married,secondary,no,823,yes,no,unknown,3,jun,132,5,-1,0,unknown,no
4131,31,technician,single,secondary,no,87,yes,no,unknown,19,may,259,3,-1,0,unknown,no
28281,37,services,married,secondary,no,1694,yes,yes,cellular,29,jan,404,2,251,6,failure,no


# **Exploratory Data Analysis (EDA)**

**Descriptive Statistics**

In [4]:
# inspect dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [5]:
# check for duplicate rows
df.duplicated().any()

False

In [6]:
# check for null values
df.isnull().any()

age          False
job          False
marital      False
education    False
default      False
balance      False
housing      False
loan         False
contact      False
day          False
month        False
duration     False
campaign     False
pdays        False
previous     False
poutcome     False
y            False
dtype: bool

Observations:
1. The dataset has 17 columns and 45,211 rows.
2. There are 7 numerical columns and 10 categorical columns.
3. No obvious data type issues were found; each column's dtype matches its content.
4. There are no missing values, but several features contain `unknown` as a category value.
5. The dataset contains no duplicate rows.
6. The categorical features must be encoded before modeling (label encoding, one-hot encoding, or feature hashing).
7. The target column is `y`.
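Observation 4 can be quantified by measuring how often the `unknown` placeholder appears in each categorical column. The sketch below is one possible way to do that; the helper name `unknown_share` and the toy rows are hypothetical and only illustrate the idea on data shaped like `df`.

```python
import pandas as pd

def unknown_share(frame: pd.DataFrame, placeholder: str = "unknown") -> pd.Series:
    """Fraction of rows equal to `placeholder` in each non-numeric column."""
    categorical = frame.select_dtypes(exclude="number")
    return categorical.eq(placeholder).mean().sort_values(ascending=False)

# Toy rows for illustration only (not taken from train.csv).
toy = pd.DataFrame({
    "job": ["services", "unknown", "management", "technician"],
    "contact": ["unknown", "unknown", "cellular", "cellular"],
    "age": [47, 35, 36, 51],
})

shares = unknown_share(toy)
print(shares)
```

On the real data the same call would be `unknown_share(df)`. Columns with a high share may be better treated as having their own category than imputed.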


In [7]:
# split the columns into numerical and categorical
num_cols = df.select_dtypes('number').columns.tolist()
cat_cols = df.select_dtypes('object').columns.tolist()
print(num_cols)
print(cat_cols)

['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y']
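Since the categorical columns need encoding before modeling, a minimal sketch of one option follows: mapping the binary target to 0/1 and one-hot encoding the remaining categorical features with `pd.get_dummies`. The toy frame and the choice of one-hot over label encoding are assumptions for illustration, not the notebook's confirmed approach.

```python
import pandas as pd

# Toy rows for illustration only (not taken from train.csv).
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced"],
    "housing": ["yes", "no", "yes"],
    "age": [47, 35, 51],
    "y": ["no", "yes", "no"],
})

# Encode the binary target as 0/1.
target = toy["y"].map({"no": 0, "yes": 1})

# One-hot encode the categorical features; numeric columns pass through unchanged.
features = pd.get_dummies(
    toy.drop(columns="y"),
    columns=["marital", "housing"],
    dtype=int,
)
print(features.columns.tolist())
```

On the real data, the list passed to `columns=` could be `cat_cols` without `'y'`. One-hot encoding widens the table, which matters for high-cardinality columns such as `job`.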


In [11]:
# summary statistics of the numerical columns
df[num_cols].describe().T

feature,count,mean,std,min,25%,50%,75%,max
age,45211.0,40.93621,10.618762,18.0,33.0,39.0,48.0,95.0
balance,45211.0,1362.272058,3044.765829,-8019.0,72.0,448.0,1428.0,102127.0
day,45211.0,15.806419,8.322476,1.0,8.0,16.0,21.0,31.0
duration,45211.0,258.16308,257.527812,0.0,103.0,180.0,319.0,4918.0
campaign,45211.0,2.763841,3.098021,1.0,1.0,2.0,3.0,63.0
pdays,45211.0,40.197828,100.128746,-1.0,-1.0,-1.0,-1.0,871.0
previous,45211.0,0.580323,2.303441,0.0,0.0,0.0,0.0,275.0


In [12]:
# check the relationship between previous and pdays
df[(df['previous'] == 0) & (df['pdays'] == -1)].shape[0]

36954
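The count above only covers rows where both conditions hold. To check whether `pdays == -1` and `previous == 0` always occur together, the mismatching rows can be counted as well. The helper below is a hypothetical sketch run on toy values, not on the actual dataset.

```python
import pandas as pd

def contact_consistency(frame: pd.DataFrame) -> dict:
    """Count rows where pdays == -1 and previous == 0 agree or disagree."""
    never_contacted = frame["pdays"].eq(-1)
    no_previous = frame["previous"].eq(0)
    return {
        "both": int((never_contacted & no_previous).sum()),
        "only_pdays_minus_one": int((never_contacted & ~no_previous).sum()),
        "only_previous_zero": int((~never_contacted & no_previous).sum()),
    }

# Toy rows for illustration only (not taken from train.csv).
toy = pd.DataFrame({
    "pdays": [-1, -1, 250, 251],
    "previous": [0, 0, 1, 6],
})

counts = contact_consistency(toy)
print(counts)
```

If both mismatch counts are zero on the real data, the two columns carry the same "never contacted" signal and one flag would be enough to represent it.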

In [17]:
# count rows with a negative balance
df[df['balance'] < 0].shape[0]

3766

In [18]:
# count rows with a negative balance and y == 'yes'
df[(df.y == 'yes') & (df.balance < 0)].shape[0]

210
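Taken together, the two counts imply that 210 of the 3,766 negative-balance rows, roughly 5.6%, ended in a subscription. A rate is only informative next to the rate for the remaining rows, so the sketch below computes both in one step. The function name and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def subscription_rate_by_balance_sign(frame: pd.DataFrame) -> pd.Series:
    """Share of rows with y == 'yes', split by negative vs. non-negative balance."""
    sign = np.where(frame["balance"] < 0, "negative", "non-negative")
    return frame["y"].eq("yes").groupby(sign).mean()

# Toy rows for illustration only (not taken from train.csv).
toy = pd.DataFrame({
    "balance": [-487, 1927, -20, 10],
    "y": ["yes", "no", "no", "no"],
})

rates = subscription_rate_by_balance_sign(toy)
print(rates)
```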

In [19]:
# inspect rows with a large number of previous contacts (> 30)
df[df['previous'] > 30]

index,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
26668,51,entrepreneur,married,secondary,no,653,yes,no,cellular,20,nov,16,9,112,37,other,no
28498,49,management,single,tertiary,no,145,yes,no,cellular,29,jan,57,2,248,38,failure,no
28886,31,management,single,tertiary,no,358,yes,no,cellular,30,jan,68,3,256,51,failure,no
29182,40,management,married,tertiary,no,543,yes,no,cellular,2,feb,349,2,262,275,other,no
37567,39,management,married,tertiary,no,0,yes,no,cellular,14,may,11,15,261,38,failure,no
38326,46,blue-collar,married,primary,no,1085,yes,yes,cellular,15,may,523,2,353,58,other,yes
39141,44,admin.,married,secondary,no,429,yes,yes,cellular,18,may,35,3,349,32,failure,no
42422,27,student,single,secondary,no,91,no,no,telephone,4,dec,157,6,95,37,other,no
42611,35,technician,single,secondary,no,4645,yes,no,cellular,11,jan,502,3,270,40,other,no
44089,37,technician,married,secondary,no,432,yes,no,cellular,6,jul,386,3,776,55,failure,yes


From the inspection above, a few insights stand out:

* The `age` column has a mean (40.9) close to its median (39), which suggests a roughly symmetric distribution; this alone does not establish normality.
* Most of the other columns (`balance`, `duration`, `campaign`, `pdays`, `previous`) have means well above their medians, which points to right-skewed, non-normal distributions. `day` is the exception, with a mean (15.8) close to its median (16).
* The `pdays` column is the number of days elapsed since the customer was last contacted in a previous campaign; a value of -1 appears to mark customers with no earlier contact.
* The `balance` column is negative in 3,766 rows (about 8% of the data), but the values are still plausible.
* The `previous` column (number of contacts made in previous campaigns) has a very high maximum of 275, which is likely an outlier.
* Rows with `pdays` = -1 are expected to have `previous` = 0; 36,954 rows (about 82% of the data) have this combination, so they dominate the dataset.
* The numerical columns are on very different scales, so they should be rescaled.
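Comparing mean and median is only a rough check of distribution shape. A sample skewness value makes the same point more directly, and a simple min-max transform shows what rescaling does to each column. The sketch below uses pandas alone on toy values; `min_max_rescale` is a hypothetical helper that mirrors the default behaviour of `MinMaxScaler`, not necessarily the scaler used later in this notebook.

```python
import pandas as pd

def min_max_rescale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to the [0, 1] range (constant columns become NaN)."""
    return (frame - frame.min()) / (frame.max() - frame.min())

# Toy columns for illustration only (not taken from train.csv).
toy_num = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [0.0, 0.0, 0.0, 0.0, 10.0],
})

# Skewness near 0 suggests symmetry; a clearly positive value suggests a long right tail.
skewness = toy_num.skew()
scaled = min_max_rescale(toy_num)

print(skewness)
print(scaled.describe().loc[["min", "max"]])
```

On the real data the equivalent calls would be `df[num_cols].skew()` and `min_max_rescale(df[num_cols])`. Min-max scaling is sensitive to extreme values such as `previous` = 275, so outliers are usually handled first.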