# Bank Marketing Dataset

## Description
1. This data relates to the direct marketing campaigns of a banking institution. The campaigns were conducted through phone calls.
2. The data was collected to predict whether a client will subscribe (yes / no) to a term deposit (variable y).

Input variables:

1 - age (numeric)
2 - job: type of job (categorical: 'admin', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
3 - marital : marital status (categorical: 'divorced', 'married', 'single','unknown')
4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')

# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone') 
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). 

# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')

# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric) 
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')


In [None]:
#import Library
%matplotlib inline
import math
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import scipy.stats as stats
import sklearn.linear_model as linear_model
import copy
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from nltk.tokenize import word_tokenize
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC,SVR
from sklearn.linear_model import Ridge,Lasso,ElasticNet
from sklearn.neighbors import KNeighborsClassifier
import datetime
from sklearn.metrics import confusion_matrix
# from fastai.structured import add_datepart  # unused in this notebook, and not available in current fastai releases
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn import tree
from sklearn.svm import LinearSVC
from sklearn.model_selection import RepeatedKFold

# ignore Deprecation Warning
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
df_train = pd.read_csv("../input/ugm1234/Data_training.csv")
df_test = pd.read_csv("../input/ugm1234/Data_testing.csv")

# Exploratory Data Analysis

In [None]:
df_train.head()

In [None]:
df_train.describe()

In [None]:
df_train.info()

- As a first step, we transform the target feature into binary data: 1 means the client subscribes and 0 means the client does not.

In [None]:
# Map the target to binary: 1 = subscribed, 0 = not subscribed
# (assigning the result back avoids an in-place replace on a column selection)
df_train['y'] = df_train['y'].map({'yes': 1, 'no': 0})

In [None]:
numerik = df_train.dtypes[df_train.dtypes != 'object'].index
kategorik = df_train.dtypes[df_train.dtypes == 'object'].index
print("There are", len(numerik), "numeric features")
print("There are", len(kategorik), "categorical features")

The dataset contains 11 numeric features and 10 categorical features, which we analyze separately.

In [None]:
sns.catplot(x="y", y="euribor3m", data=df_train)

In [None]:
sns.catplot(x="y", y="cons.price.idx", data=df_train)

In [None]:
sns.catplot(x="y", y="cons.conf.idx", data=df_train)

In [None]:
sns.catplot(x="y", y="duration", data=df_train)

In [None]:
sns.catplot(x="y", y="nr.employed", data=df_train)

## Numeric feature analysis

- A correlation matrix is a useful exploratory tool for seeing at a glance how strongly the numeric features in the dataset are related to one another.

In [None]:
# Plot the correlation matrix
plt.figure(figsize = (10,7))
corr = df_train[numerik].corr()
sns.heatmap(corr,fmt = '.2f',annot = True)

The correlation matrix above shows that:

     1) emp.var.rate is strongly correlated with cons.price.idx (0.73)
     2) emp.var.rate is strongly correlated with euribor3m (0.97)
     3) emp.var.rate is strongly correlated with nr.employed
     4) euribor3m is strongly correlated with nr.employed
     5) the feature most strongly correlated with the target (y) is duration

The first four points indicate that emp.var.rate and the other features in the social and economic context group are strongly correlated with one another.
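
Reading strong pairs off a heatmap by eye is error-prone, so they can also be listed programmatically. The following is a minimal sketch on made-up toy data; the helper name `strong_pairs` and the 0.7 threshold are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.7):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data (made up): b follows a closely, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = strong_pairs(toy)
print(pairs)
```

On the notebook's data the same helper could be called as `strong_pairs(df_train[numerik])`.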

In [None]:
print(numerik)

Based on the correlation matrix above, the social and economic context features are strongly correlated with one another. Let's explore them!

### Pair plot of the social and economic context features

In [None]:
select  = df_train[['emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']]
sns.pairplot(select)

The plot above shows that:

    1) euribor3m is perfectly separated with respect to emp.var.rate (if euribor3m < 3 then emp.var.rate < 0, and if euribor3m >= 3 then emp.var.rate >= 0)
    2) emp.var.rate is perfectly separated with respect to nr.employed (if emp.var.rate < 1 then nr.employed < 5150, and if emp.var.rate >= 1 then nr.employed > 5150)
    3) euribor3m is perfectly separated with respect to nr.employed (if euribor3m < 3 then nr.employed <= 5100, and if euribor3m >= 3 then nr.employed > 5100)

This suggests that euribor3m already captures the information in emp.var.rate and nr.employed.
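
A threshold rule of this kind can be verified by counting the rows that break it. This is a small sketch on made-up values (the numbers below are illustrative, not taken from the dataset); the same comparison could be run on the corresponding `df_train` columns.

```python
import pandas as pd

# Toy rows (made up) that follow the rule: euribor3m >= 3  <=>  emp.var.rate >= 0
toy = pd.DataFrame({
    "euribor3m":    [0.7, 1.3, 1.4, 4.1, 4.9, 5.0],
    "emp.var.rate": [-3.4, -1.8, -1.1, 1.1, 1.4, 1.4],
})

high_euribor = toy["euribor3m"] >= 3
high_emp = toy["emp.var.rate"] >= 0

# 2x2 contingency table: a perfect separation leaves the off-diagonal cells empty
print(pd.crosstab(high_euribor, high_emp))

# Number of rows where the two conditions disagree
violations = int((high_euribor != high_emp).sum())
print("rows violating the rule:", violations)
```

A count of zero supports the "perfectly separated" reading; any positive count shows how many rows are exceptions.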

## Target Percentage

In [None]:
f,ax = plt.subplots(1,2,figsize=(15,7))
df_train['y'].value_counts().plot.pie(ax=ax[0],labels =["No","Yes"],autopct='%1.1f%%')
# Pass the data as x= ; newer seaborn versions no longer accept it positionally
sns.countplot(x=df_train['y'],ax=ax[1])
ax[1].set_xticklabels(['No','Yes'])

## Age Vs Target



In [None]:
# Bin age into three groups: "muda" (<=20), "dewasa" (21-40), "tua" (>40)
# pd.cut assigns every row in one step instead of chained .iloc assignments
df_train['age_binning'] = pd.cut(df_train['age'], bins=[-np.inf, 20, 40, np.inf],
                                 labels=['muda', 'dewasa', 'tua']).astype(str)
plt.figure(figsize = (10,7))
sns.countplot(x='age_binning',hue='y',data=df_train)

The plot shows that adults (age > 20 and <= 40) are the largest group of participants in the dataset.
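
A minimal sketch for checking where the bin edges fall, using `pd.cut` on a few hand-picked ages (illustrative values, not taken from the dataset). The labels match the notebook's: "muda" (young), "dewasa" (adult), "tua" (old).

```python
import numpy as np
import pandas as pd

# Ages chosen to sit exactly on and next to the bin edges
ages = pd.Series([18, 20, 21, 40, 41])

# Intervals are right-closed by default: (-inf, 20], (20, 40], (40, inf)
binned = pd.cut(ages, bins=[-np.inf, 20, 40, np.inf],
                labels=["muda", "dewasa", "tua"])
print(binned.tolist())
```

Ages 20 and 40 land in the lower bin of each boundary, which matches the `<= 20` and `<= 40` conditions used for the binning.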

## Campaign vs Previous

1) Campaign is the number of contacts made during the current campaign

2) Previous is the number of contacts made before the current campaign

How do the two compare?

In [None]:
plt.figure(figsize = (10,7))
sns.violinplot(data=df_train,x='previous',y='campaign')

The violin plot shows that people who had never been contacted before tend to be contacted more often during the campaign.

## Campaign vs Pdays

1) Campaign is the number of contacts made during the current campaign

2) Pdays is the number of days since the client was last contacted in a previous campaign

How do the two compare?

In [None]:
plt.figure(figsize = (10,7))
sns.violinplot(data=df_train,x='pdays',y='campaign')

The violin plot above shows that clients contacted more than 40 times during the campaign all have pdays = 999, meaning they were not contacted in a previous campaign.

## How long do contacts last?

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df_train['duration'], kde=True)

Most contacts last less than 1000 seconds, or roughly 17 minutes.
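
A claim like this can be backed with numbers rather than read off the plot. The sketch below uses made-up durations purely to show the calculation; on the real data the series would be `df_train['duration']`.

```python
import pandas as pd

# Toy call durations in seconds (made up for illustration)
duration = pd.Series([45, 120, 180, 260, 300, 410, 640, 950, 1500, 95])

# Fraction of calls shorter than 1000 seconds
share_below = (duration < 1000).mean()

print("mean   :", duration.mean())
print("median :", duration.median())
print("share below 1000 s:", share_below)
```

Reporting the mean, the median, and the share below the threshold together makes it clear whether the statement is about the average call or about most calls.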

## Duration Vs Target

In [None]:
plt.figure(figsize = (10,7))
sns.boxplot(data=df_train,x='y',y='duration')

## Categorical feature analysis

In [None]:
print(kategorik)

In [None]:
# Count plots for every categorical feature, three per row
groups = [['job','marital','education'],
          ['default','housing','loan'],
          ['contact','month','poutcome']]
for cols in groups:
    fig, axs = plt.subplots(ncols=3,figsize=(20,6))
    for col, ax in zip(cols, axs):
        # Pass the data as x= ; newer seaborn versions no longer accept it positionally
        sns.countplot(x=df_train[col], ax=ax)
        ax.tick_params(axis='x', labelrotation=90)
    plt.show()
sns.countplot(x=df_train['day_of_week'])
plt.show()

The count plots above show that:

    1) although the earlier summary reported 0 missing values, some features contain 'unknown', which is effectively missing data
    2) most contacts were made by cellular
    3) university.degree is the most common education level in the data
    4) the most common poutcome is nonexistent, meaning there was no previous campaign for the client, i.e. this is the first campaign
    5) the majority of people do not currently have a personal loan
    6) 'yes' values in the default feature are very rare, which could introduce bias into the data
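
Because 'unknown' is an ordinary string, null checks such as `info()` or `isna()` will not report it. A minimal sketch for counting it per column, on a made-up toy frame (the values are illustrative only):

```python
import pandas as pd

# Toy frame (made up): two categorical columns and one numeric column
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "default": ["no", "unknown", "unknown", "no"],
    "age": [30, 41, 25, 52],
})

# Non-numeric columns are treated as categorical here
cat_cols = toy.select_dtypes(exclude="number").columns

# 'unknown' is a regular value, so isna() reports nothing ...
print(toy.isna().sum().sum())

# ... while an explicit comparison reveals the hidden missing values
unknown_counts = (toy[cat_cols] == "unknown").sum()
print(unknown_counts)
```

Applied to `df_train`, the same comparison would show which categorical features carry the most hidden missing data.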

### Duration vs Outcome

In [None]:
plt.figure(figsize = (10,7))
sns.violinplot(y='duration',x='poutcome',data=df_train)

Because the data is imbalanced, with nonexistent dominating, no information can be drawn from this plot.

# Feature Engineering

The EDA above leads to the following decisions:

    1) because euribor3m already captures emp.var.rate and nr.employed, we drop emp.var.rate and nr.employed
    2) the 'default' feature is very likely to only introduce bias, so we drop it
    3) the age feature is replaced by its binned version, age_binning

In [None]:
df_train_drop = df_train.drop(['emp.var.rate','nr.employed','default','age'],axis=1)

## Data Transformation

Categorical data is converted to numeric using dummy (one-hot) encoding: each category becomes its own feature, with value 1 if the row belongs to that category and 0 otherwise.

In [None]:
#Transform data
df_train_transform = pd.get_dummies(df_train_drop)
df_train_transform.head()

In [None]:
df_train_transform.shape

## Standardization

Standardization rescales the predictor features so that each has a mean of 0 and a standard deviation of 1.
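
As a minimal sketch of what this rescaling does, the z-score can be computed by hand with NumPy on a small made-up matrix. The population standard deviation (`ddof=0`) is used, which is the convention `StandardScaler` follows.

```python
import numpy as np

# Toy feature matrix (made up): two columns on very different scales
X = np.array([[20.0, 100.0],
              [30.0, 300.0],
              [40.0, 500.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)   # population standard deviation (ddof=0)

# z = (x - mean) / std, applied column by column
Z = (X - mean) / std

print(Z)
print("column means:", Z.mean(axis=0))
print("column stds :", Z.std(axis=0))
```

After the transformation both columns are on the same scale, which matters for distance-based and regularized models such as KNN and logistic regression.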

In [None]:
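#Standardize data
X_train,y_train=df_train_transform.drop('y',axis=1),df_train_transform['y']
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
# Rebalance the classes with SMOTE oversampling followed by ENN cleaning
from imblearn.combine import SMOTEENN
sm = SMOTEENN(random_state=7)
# fit_resample is the current name of the method formerly called fit_sample
X_train_scaled, y_train = sm.fit_resample(X_train_scaled, y_train)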

In [None]:
model = linear_model.LogisticRegression()
# random_state has no effect without shuffle=True, and newer scikit-learn versions reject the combination
kfold = KFold(n_splits=10)
cvLR  = cross_val_score(model, X_train_scaled,y_train, cv=kfold)
print(cvLR)
print(cvLR.mean())

In [None]:
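def ConfusionMatrixCV(model,X,y):
    """Fit the model on each of 10 folds and collect one confusion matrix per fold."""
    # Unshuffled folds; random_state is only valid together with shuffle=True
    kf = KFold(n_splits=10)
    y = np.asarray(y)  # positional indexing even if y is a pandas Series
    conf_mat = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model.fit(X_train, y_train)
        conf_mat.append(confusion_matrix(y_test, model.predict(X_test)))
    return conf_mat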

In [None]:
#Confusion Matrix Logistic Regression
# Remaining arguments are left at their defaults (L2 penalty, max_iter=100, tol=1e-4)
model = linear_model.LogisticRegression(C=1, solver='liblinear')
conf_mat_logit = ConfusionMatrixCV(model,X_train_scaled,y_train)
len(conf_mat_logit)

In [None]:
#Confusion-Matrix RegLog
print("============================================Confusion-Matrix RegLog===================================================")
# Two rows of five heatmaps, one per cross-validation fold
for row in range(2):
    f,axs = plt.subplots(ncols=5,figsize=(20,3))
    for col, ax in enumerate(axs):
        fold = 5 * row + col
        # fmt='d' prints the cell counts as integers
        sns.heatmap(conf_mat_logit[fold],ax=ax,fmt = 'd',annot = True)
        ax.set_title(f'Cross-Validation-{fold + 1}')
    plt.show()

In [None]:
len(df_train)

# Modelling


In [None]:
#kategori_cek = ['poutcome','contact'] #nonexistent, telephone
drop_kategorik = ['default','month']
drop_numerik = ['euribor3m']

## 10-Fold-Cross Validation

In [None]:
# Baseline models with default hyperparameters, all scored on the same 10 unshuffled folds
# (random_state has no effect without shuffle=True, and newer scikit-learn versions reject the combination)
kfold = KFold(n_splits=10)
cvDT  = cross_val_score(tree.DecisionTreeClassifier(), X_train_scaled,y_train, cv=kfold)
cvRF  = cross_val_score(RandomForestClassifier(), X_train_scaled,y_train, cv=kfold)
cvLR  = cross_val_score(linear_model.LogisticRegression(), X_train_scaled,y_train, cv=kfold)
cvKNN = cross_val_score(KNeighborsClassifier(), X_train_scaled,y_train, cv=kfold)
cvSGD = cross_val_score(SGDClassifier(), X_train_scaled,y_train, cv=kfold)
cvSVC = cross_val_score(LinearSVC(), X_train_scaled,y_train, cv=kfold)

In [None]:
name = ['Decision Tree','Random Forest','Logistic Regression','KNN','SGDClassifier','LinearSVC']
data = pd.DataFrame(columns=name)
data['Decision Tree'] = cvDT
data['Random Forest'] = cvRF
data['Logistic Regression'] = cvLR
data['KNN'] = cvKNN
data['SGDClassifier'] = cvSGD
data['LinearSVC'] = cvSVC
data

In [None]:
data.plot(style='.-',figsize=(10,7))
plt.title("Cross-Validation Plot")
plt.show()

In [None]:
for i in data.columns:
    print(i,": ",data[i].mean())

## Filling 'unknown' values with the mode

In [None]:
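df_train_copy = copy.copy(df_train_drop)
for i in df_train_copy.columns:
    if (df_train_copy[i].dtypes == 'object'):
        # Replace 'unknown' with the most frequent value of the column;
        # assigning the result back avoids an in-place replace on a column selection
        df_train_copy[i] = df_train_copy[i].replace({'unknown':df_train_copy[i].mode()[0]})
        print(df_train_copy[i].unique())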

## Re-Modelling

In [None]:
#Transform data
df_train_transform = pd.get_dummies(df_train_copy)
df_train_transform.head()
#Standardize data
X_train,y_train=df_train_transform.drop('y',axis=1),df_train_transform['y']
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
# Default models, all scored on the same 10 unshuffled folds
# (random_state has no effect without shuffle=True, and newer scikit-learn versions reject the combination)
kfold = KFold(n_splits=10)
models = {
    'Decision Tree': tree.DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Logistic Regression': linear_model.LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'SGDClassifier': SGDClassifier(),
    'LinearSVC': LinearSVC(),
}
# One column of fold scores per model
data = pd.DataFrame({name: cross_val_score(clf, X_train_scaled, y_train, cv=kfold)
                     for name, clf in models.items()})
data

In [None]:
for i in data.columns:
    print(i,": ",data[i].mean())

In [None]:
data.plot(style='.-',figsize=(10,7))
plt.title("Cross-Validation Plot")
plt.show()

## Drop the date features?

In [None]:
df_train_copy = copy.copy(df_train_drop)
df_train_copy.drop(['month','day_of_week'],axis=1,inplace=True)

In [None]:
#Transform data
df_train_transform = pd.get_dummies(df_train_copy)
df_train_transform.head()
#Standardize data
X_train,y_train=df_train_transform.drop('y',axis=1),df_train_transform['y']
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
# Default models, all scored on the same 10 unshuffled folds
# (random_state has no effect without shuffle=True, and newer scikit-learn versions reject the combination)
kfold = KFold(n_splits=10)
models = {
    'Decision Tree': tree.DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Logistic Regression': linear_model.LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'SGDClassifier': SGDClassifier(),
    'LinearSVC': LinearSVC(),
}
# One column of fold scores per model
data = pd.DataFrame({name: cross_val_score(clf, X_train_scaled, y_train, cv=kfold)
                     for name, clf in models.items()})
data

In [None]:
for i in data.columns:
    print(i,": ",data[i].mean())

In [None]:
data.plot(style='.-',figsize=(10,7))
plt.title("Cross-Validation Plot")
plt.show()

So far, with default settings, logistic regression is the best and most stable model, followed by random forest and KNN. SGDClassifier is the least stable of the models. For the next round of model selection, we focus on random forest and logistic regression.
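
"Best" and "most stable" can be made concrete as the mean and the standard deviation of the fold scores. The sketch below uses made-up scores for two models only to show the summary; in the notebook the input would be the `data` frame of cross-validation scores.

```python
import pandas as pd

# Toy fold scores (made up): one steady model, one erratic model
scores = pd.DataFrame({
    "Logistic Regression": [0.90, 0.91, 0.90, 0.91],
    "SGDClassifier":       [0.80, 0.92, 0.70, 0.90],
})

# One row per model: mean accuracy and its spread across folds,
# sorted so the most stable model (smallest std) comes first
summary = scores.agg(["mean", "std"]).T.sort_values("std")
print(summary)
```

A low standard deviation indicates that the model performs similarly on every fold, which is what the line plot shows visually.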

## Tuning logistic regression + dropping the date features with GridSearchCV

In [None]:
# Drop the date features
df_train_copy = copy.copy(df_train_drop)
df_train_copy.drop(['month','day_of_week'],axis=1,inplace=True)
# Re-encode so the design matrix reflects the dropped columns
df_train_transform = pd.get_dummies(df_train_copy)
#Standardize data
X_train,y_train=df_train_transform.drop('y',axis=1),df_train_transform['y']
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
#Modelling
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
# random_state has no effect without shuffle=True, and newer scikit-learn versions reject the combination
kfold = KFold(n_splits=10)
# The L2 penalty is the default for LogisticRegression
model = GridSearchCV(linear_model.LogisticRegression(), param_grid,cv=kfold)
model.fit(X_train_scaled,y_train)
print("best parameter : ",model.best_params_)

In [None]:
# Remaining arguments are left at their defaults (L2 penalty, max_iter=100, tol=1e-4)
model = linear_model.LogisticRegression(C=100, solver='liblinear')
# random_state has no effect without shuffle=True, and newer scikit-learn versions reject the combination
kfold = KFold(n_splits=10)
cvLR  = cross_val_score(model, X_train_scaled,y_train, cv=kfold)
print(cvLR)
print(cvLR.mean())

## Tuning RandomForest + dropping the date features with RandomizedSearchCV

In [None]:
# Drop the date features (the search below is wrapped in a string, so it does not run)
"""
df_train_copy = copy.copy(df_train_drop)
df_train_copy.drop(['month','day_of_week'],axis=1,inplace=True)
#Standardize data
X_train,y_train=df_train_transform.drop('y',axis=1),df_train_transform['y']
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
#Modelling
random_grid = {'bootstrap': [True, False],#Drop fitur date
 'max_depth': [10, 20, 30, 40,None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600]}
kfold = KFold(n_splits=10, random_state=7)
model = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
model.fit(X_train_scaled,y_train)
print("best parameter : ",model.best_params_)
"""

In [None]:
# Drop the date features
df_train_copy = copy.copy(df_train_drop)
df_train_copy.drop(['month','day_of_week'],axis=1,inplace=True)
df_train_transform = pd.get_dummies(df_train_copy)
df_train_transform.head()
#Standardize data
X_train,y_train=df_train_transform.drop('y',axis=1),df_train_transform['y']
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
# max_features='sqrt' is what the former 'auto' setting meant for classifiers
model = RandomForestClassifier(n_estimators=200,min_samples_split=2,min_samples_leaf=2,max_features='sqrt',max_depth=10,bootstrap=True)
# random_state has no effect without shuffle=True, and newer scikit-learn versions reject the combination
kfold = KFold(n_splits=10)
cvRF  = cross_val_score(model, X_train_scaled,y_train, cv=kfold)
print(cvRF)
print(cvRF.mean())

In [None]:
featimp = pd.DataFrame()
model.fit(X_train_scaled,y_train)
featimp['name'] = X_train.columns
featimp['values'] = model.feature_importances_
featimp.sort_values(by='values',ascending=True,inplace=True)

In [None]:
featimp

In [None]:
drop = featimp[:20].name
X_copy = X_train.drop(drop,axis=1)
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_copy)
# random_state has no effect without shuffle=True, and newer scikit-learn versions reject the combination
kfold = KFold(n_splits=10)
cvLR  = cross_val_score(model, X_train_scaled,y_train, cv=kfold)
print(cvLR)
print(cvLR.mean())

In [None]:
#RegLog
# Drop the 15 least important features according to the random forest
drop = featimp[:15].name
X_copy = X_train.drop(drop,axis=1)
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_copy)
# random_state has no effect without shuffle=True, and newer scikit-learn versions reject the combination
kfold = KFold(n_splits=10)
# Remaining arguments are left at their defaults (L2 penalty, max_iter=100, tol=1e-4)
model = linear_model.LogisticRegression(C=100, solver='liblinear')
cvLR  = cross_val_score(model, X_train_scaled,y_train, cv=kfold)
print(cvLR)
print(cvLR.mean())

## Confusion Matrix (Cross-Validation)

In [None]:
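def ConfusionMatrixCV(model,X,y):
    """Fit the model on each of 10 folds and collect one confusion matrix per fold."""
    # Unshuffled folds; random_state is only valid together with shuffle=True
    kf = KFold(n_splits=10)
    y = np.asarray(y)  # positional indexing even if y is a pandas Series
    conf_mat = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model.fit(X_train, y_train)
        conf_mat.append(confusion_matrix(y_test, model.predict(X_test)))
    return conf_mat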

In [None]:
#Confusion Matrix Logistic Regression
# Remaining arguments are left at their defaults (L2 penalty, max_iter=100, tol=1e-4)
model = linear_model.LogisticRegression(C=1, solver='liblinear')
conf_mat_logit = ConfusionMatrixCV(model,X_train_scaled,y_train)
len(conf_mat_logit)

In [None]:
#Confusion Matrix Random Forest
# 500 trees; the remaining arguments are left at their defaults
# ('deprecated' is not a valid value for min_samples_leaf or min_weight_fraction_leaf)
model = RandomForestClassifier(n_estimators=500)
conf_mat_rf = ConfusionMatrixCV(model,X_train_scaled,y_train)
len(conf_mat_rf)

In [None]:
#Confusion-Matrix RegLog
print("============================================Confusion-Matrix RegLog===================================================")
# Two rows of five heatmaps, one per cross-validation fold
for row in range(2):
    f,axs = plt.subplots(ncols=5,figsize=(20,3))
    for col, ax in enumerate(axs):
        fold = 5 * row + col
        # fmt='d' prints the cell counts as integers
        sns.heatmap(conf_mat_logit[fold],ax=ax,fmt = 'd',annot = True)
        ax.set_title(f'Cross-Validation-{fold + 1}')
    plt.show()

In [None]:
print("============================================Confusion-Matrix RandomForest===================================================")
#Confusion-Matrix RandomForest
# Two rows of five heatmaps, one per cross-validation fold
for row in range(2):
    f,axs = plt.subplots(ncols=5,figsize=(20,3))
    for col, ax in enumerate(axs):
        fold = 5 * row + col
        # fmt='d' prints the cell counts as integers
        sns.heatmap(conf_mat_rf[fold],ax=ax,fmt = 'd',annot = True)
        ax.set_title(f'Cross-Validation-{fold + 1}')
    plt.show()

In [None]:
#Logit-ROC Curve
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def ROCAUC(model,X,y):
    # Run classifier with cross-validation and plot ROC curves
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    i = 0
    # Unshuffled folds; random_state is only valid together with shuffle=True
    kf = KFold(n_splits=10)
    y = np.asarray(y)  # positional indexing even if y is a pandas Series
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        probas_ = model.fit(X_train, y_train).predict_proba(X_test)
        # Compute the ROC curve and the area under the curve
        fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
        # np.interp replaces scipy.interp, which newer SciPy releases no longer provide
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        plt.plot(fpr, tpr, lw=1, alpha=0.3,
                 label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

        i += 1
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
             label='Luck', alpha=.8)

    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    plt.plot(mean_fpr, mean_tpr, color='b',
             label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
             lw=2, alpha=.8)

    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                     label=r'$\pm$ 1 std. dev.')

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic logistic-regression')
    plt.legend(loc="lower right")
    plt.show()
# Remaining arguments are left at their defaults (L2 penalty, max_iter=100, tol=1e-4)
model = linear_model.LogisticRegression(C=1, solver='liblinear')
ROCAUC(model,X_train_scaled,y_train)

In [None]:
#random forest-ROC Curve
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def ROCAUC(model,X,y):
    # Run classifier with cross-validation and plot ROC curves
    tprs = []
    mean_fpr = np.linspace(0, 1, 100)
    aucs = []
    i = 0
    # Unshuffled folds; random_state is only valid together with shuffle=True
    kf = KFold(n_splits=10)
    y = np.asarray(y)  # positional indexing even if y is a pandas Series
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        probas_ = model.fit(X_train, y_train).predict_proba(X_test)
        # Compute the ROC curve and the area under the curve
        fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
        # np.interp replaces scipy.interp, which newer SciPy releases no longer provide
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        plt.plot(fpr, tpr, lw=1, alpha=0.3,
                 label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

        i += 1
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
             label='Luck', alpha=.8)

    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    plt.plot(mean_fpr, mean_tpr, color='b',
             label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
             lw=2, alpha=.8)

    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                     label=r'$\pm$ 1 std. dev.')

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic random-forest')
    plt.legend(loc="lower right")
    plt.show()
# 500 trees; the remaining arguments are left at their defaults
# ('deprecated' is not a valid value for min_samples_leaf or min_weight_fraction_leaf)
model = RandomForestClassifier(n_estimators=500)
ROCAUC(model,X_train_scaled,y_train)

# Predicting the Test Data

## Exploring the test data

Before predicting on the test data, we explore its characteristics.

In [None]:
# Explore the value distribution of each test column
for i in df_test.columns:
    print(df_test[i].value_counts())

The exploration above shows that:

    1) contact in the test data contains only telephone
    2) pdays contains only the value 999
    3) previous contains only the value 0
    4) poutcome contains only nonexistent
    5) cons.price.idx contains only the values 93.994 & 94.465
    6) cons.conf.idx contains only the values -36.4 & -41.8
    7) nr.employed contains only the values 5191.0 & 5228.1

Based on this, we keep only the training rows with contact=telephone, pdays=999, previous=0 and poutcome=nonexistent, to avoid noise. We drop cons.price.idx, cons.conf.idx and nr.employed, along with the default, month and day_of_week features.
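
The idea of matching the training rows to the test distribution can be sketched generically: find the columns that are constant in the test set, keep only the training rows that share those values, then drop the now-constant columns. The toy frames below are made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Toy frames (made up): 'contact' and 'pdays' are constant in the test set
train = pd.DataFrame({
    "contact": ["cellular", "telephone", "telephone", "telephone"],
    "pdays":   [999, 999, 3, 999],
    "age":     [25, 38, 47, 52],
    "y":       [0, 1, 0, 0],
})
test = pd.DataFrame({
    "contact": ["telephone", "telephone"],
    "pdays":   [999, 999],
    "age":     [30, 50],
})

# Columns with a single distinct value in the test set, and that value
constant = {c: test[c].iloc[0] for c in test.columns if test[c].nunique() == 1}

# Keep only training rows that match the test set on every constant column
mask = pd.Series(True, index=train.index)
for col, value in constant.items():
    mask &= train[col] == value

# The constant columns carry no information any more, so drop them
train_matched = train[mask].drop(columns=list(constant))
print(constant)
print(train_matched)
```

This trades training-set size for a closer match to the test distribution, so it is worth printing the shape before and after the filter.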

In [None]:
print("before :",df_train_drop.shape)

In [None]:
df_train_drop = df_train_drop[df_train_drop['contact'] == 'telephone']
df_train_drop = df_train_drop[df_train_drop['pdays'] == 999]
df_train_drop = df_train_drop[df_train_drop['previous'] == 0]
df_train_drop = df_train_drop[df_train_drop['poutcome'] == 'nonexistent']
print("after :",df_train_drop.shape)

In [None]:
df_train_drop.head()

In [None]:
for i in df_train_drop.columns:
    if (len(df_train_drop[i].unique()) == 1):
        df_train_drop.drop(i,axis=1,inplace=True)

In [None]:
df_train_drop.head()

In [None]:
df_train_drop.drop(['month','day_of_week'],axis=1,inplace=True)

In [None]:
df_train_drop.head()

In [None]:
# Bin age exactly as in the training data: "muda" (<=20), "dewasa" (21-40), "tua" (>40)
df_test['age_binning'] = pd.cut(df_test['age'], bins=[-np.inf, 20, 40, np.inf],
                                labels=['muda', 'dewasa', 'tua']).astype(str)

In [None]:
#Transform data
df_train_transform = pd.get_dummies(df_train_drop)
df_train_transform.head()
#Standardize data
X_train,y_train=df_train_transform.drop('y',axis=1),df_train_transform['y']
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
# Remaining arguments are left at their defaults (L2 penalty, max_iter=100, tol=1e-4)
model = linear_model.LogisticRegression(C=100, solver='liblinear')
model.fit(X_train_scaled,y_train)

In [None]:
for i in df_test.columns:
    if (i not in df_train_drop.columns):
        df_test.drop(i,axis=1,inplace=True)

In [None]:
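# One-hot encode the test set and align its columns with the training matrix
df_test_transform = pd.get_dummies(df_test)
df_test_transform.head()
X_test = df_test_transform.reindex(columns=X_train.columns, fill_value=0)
# Reuse the scaler fitted on the training data instead of refitting it on the test set
X_test_scaled = sc.transform(X_test)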
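
For the predictions to be meaningful, the test matrix needs the same columns, in the same order, and the same scaling as the matrix the model was trained on. A minimal sketch of that alignment on made-up toy frames (the data and the variable name `scaler` are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames (made up): the test set contains fewer job categories than the training set
train = pd.DataFrame({"age": [30, 40, 50], "job": ["admin.", "services", "student"]})
test = pd.DataFrame({"age": [35, 45], "job": ["services", "services"]})

X_train = pd.get_dummies(train, dtype=float)

# Missing dummy columns are added and filled with 0, and the column order follows X_train
X_test = pd.get_dummies(test, dtype=float).reindex(columns=X_train.columns, fill_value=0.0)

# Fit the scaler on the training data only, then apply the same transformation to the test data
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

print(list(X_test.columns))
print(X_test_scaled)
```

Refitting a scaler on the test set would give each feature a different mean and variance than the model saw during training, so the fitted training scaler is reused here.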

In [None]:
predict = model.predict(X_test_scaled)
predict

In [None]:
cc = pd.DataFrame()
cc['rest'] = predict
cc['rest'].value_counts()

In [None]:
submit = pd.DataFrame()
submit['y'] = predict
submit.to_csv("../working/submit.csv", index=False)