## **Problem Statement**
FinanKu is concerned that late credit card payments will hurt its business. Customers who are likely to pay late therefore need to be identified early, so that an appropriate strategy can be prepared for the coming periods.

## **Objective**

Build a model that can identify at least 60% of the customers who will be late paying their credit card bills (accuracy and recall above 60%).
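
To make the target concrete, the following minimal sketch shows how accuracy and recall are derived from a confusion matrix. The labels are made-up toy values, not FinanKu data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels (hypothetical): 1 = late payer, 0 = on-time payer
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: share of actual late payers that the model catches
recall = tp / (tp + fn)
# Accuracy: share of all customers that are classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

# The manual formulas agree with the scikit-learn helpers
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
print(f"recall={recall:.2f}, accuracy={accuracy:.2f}")
```

In this toy case the model catches 3 of the 5 late payers, so recall sits at 0.60 even though accuracy is 0.70; the two metrics can diverge, which is why the objective names both.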

## **Available Variables**

### The dataset provides the following variables:
1. **Customer ID** : Unique customer ID
2. **Branch** : Branch where the customer is registered
3. **City** : City where the customer is registered
4. **Age** : Customer's age during the observation period
5. **Avg Annual income/Month** : Customer's average income over one year
6. **Balance(Q1-Q4)** : Customer's settled balance at the end of each quarter
7. **Num of Product(Q1-Q4)** : Number of products the customer holds at the end of each quarter
8. **HasCrCard(Q1-Q4)** : Whether the customer holds a credit card at the end of each quarter
9. **Active Member(Q1-Q4)** : Customer's activity status
10. **Unpaid Tagging** : Flag for customers who failed to pay

# **Experiment**

### Review periods:
1. Customers are reviewed over the last year
2. Customers are reviewed over the last 6 months

### Variable Adjustments
1. Balance is summarized as the average over the time horizon, plus the change between the start and the end of the review period
2. Product ownership is summarized by its average, maximum, and minimum over the review period
3. Customer activity status is expressed in months
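
The adjustments listed above can be sketched on a small toy table. The column names follow the dataset used in this notebook, while the two rows of values are hypothetical and only illustrate the arithmetic for the one-year horizon.

```python
import pandas as pd

# Toy data (hypothetical values) with the quarterly columns of the dataset
toy = pd.DataFrame({
    "Balance Q1": [100.0, 0.0],
    "Balance Q2": [200.0, 50.0],
    "Balance Q3": [300.0, 50.0],
    "Balance Q4": [400.0, 20.0],
    "NumOfProducts Q1": [1, 1],
    "NumOfProducts Q2": [2, 1],
    "NumOfProducts Q3": [2, 1],
    "NumOfProducts Q4": [3, 1],
    "ActiveMember Q1": [1, 0],
    "ActiveMember Q2": [1, 1],
    "ActiveMember Q3": [1, 0],
    "ActiveMember Q4": [1, 1],
})

balance_cols = [f"Balance Q{q}" for q in range(1, 5)]
product_cols = [f"NumOfProducts Q{q}" for q in range(1, 5)]
active_cols = [f"ActiveMember Q{q}" for q in range(1, 5)]

# 1. Average balance and change between the start and the end of the horizon
toy["Mean Balance"] = toy[balance_cols].mean(axis=1)
toy["Delta Balance"] = toy["Balance Q4"] - toy["Balance Q1"]

# 2. Average, maximum, and minimum number of products held
toy["Mean Products"] = toy[product_cols].mean(axis=1)
toy["Max Products"] = toy[product_cols].max(axis=1)
toy["Min Products"] = toy[product_cols].min(axis=1)

# 3. Activity in months: each active quarter counts as 3 months
toy["Active Months"] = toy[active_cols].sum(axis=1) * 3

print(toy[["Mean Balance", "Delta Balance", "Max Products", "Active Months"]])
```

For the six-month horizon, the same pattern applies with only the last two quarters in each column list.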


In [None]:
%pip install xgboost
%pip install jcopml
%pip install imbalanced-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report, make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from jcopml.feature_importance import mean_score_decrease

In [None]:
pd.set_option('display.max_columns', None)

## **Data for prediction**

In [None]:
df_all = pd.read_csv('FinanKu Data All.csv')
df_val = pd.read_csv('FinanKu Data Validasi.csv')


In [None]:
df_all.head()

In [None]:
df_val.head()

# **Data Understanding**



### **Customer distribution by location**
1. Overall distribution
2. Distribution of unpaid customers

In [None]:
data1 = pd.DataFrame(
    df_all.groupby(by=["City"])['Customer ID']
    .count()
    .sort_values(ascending=False)
    .reset_index(name='Distribution by City')
)
data1

In [None]:
data2 = pd.DataFrame(\
    df_all[df_all['Unpaid Tagging']==1].groupby(by=["City"])["Customer ID"]\
        .count()\
            .sort_values(ascending=False)\
                .reset_index(name='Distribution By City'))
data2

### **Customer distribution by age**
1. Overall distribution
2. Distribution of unpaid customers

In [None]:
data3 = pd.DataFrame(\
    df_all.groupby(by=["Age"])["Customer ID"]\
        .count()\
            .reset_index(name='Distribution By Age'))

data3.sort_values(by=['Age'], ascending=True, inplace=True)

data3.plot(x='Age', y='Distribution By Age', kind='bar', grid=True, xlabel='Age', ylabel= 'People', figsize=(12,7), rot=0, title='Distribution of Customers by Age',
           table=False, secondary_y=False)

In [None]:
data4 = pd.DataFrame(\
    df_all[df_all['Unpaid Tagging']==1].groupby(by=["Age"])['Customer ID']\
        .count()\
            .reset_index(name='Distribution By Age'))

data4.sort_values(by=['Age'], ascending=True, inplace=True)

data4.plot(x='Age', y='Distribution By Age', kind='bar', grid=True, xlabel='Age', ylabel= 'People', figsize=(12,7), rot=0, title='Unpaid Distribution of Customers by Age',
           table=False, secondary_y=False)

### **Computing the Average Customer Balance**

In [None]:
df_checkbalance = df_all
df_checkbalance['Total Balance'] = df_checkbalance['Balance Q1'] + df_checkbalance['Balance Q2'] +df_checkbalance['Balance Q3'] + df_checkbalance['Balance Q4']
df_checkbalance['Avg Balance'] = df_checkbalance['Total Balance'] / 4

In [None]:
data5 = pd.DataFrame(
    df_checkbalance.groupby(by=["Unpaid Tagging"])["Avg Balance"]
        .mean()
        .reset_index(name='Avg Quarterly Balance'))
data5

In [None]:
data6 = pd.DataFrame(
    df_checkbalance.groupby(by=["Unpaid Tagging"])["Total Balance"]
        .mean()
        .reset_index(name='Avg Annual Balance'))
data6

### **Computing the Average Product Ownership**

In [None]:
df_checkbalance['Avg Product'] = (
	df_checkbalance['NumOfProducts Q1'] +
	df_checkbalance['NumOfProducts Q2'] +
	df_checkbalance['NumOfProducts Q3'] +
	df_checkbalance['NumOfProducts Q4']
) / 4


In [None]:
data7 = pd.DataFrame(\
    df_checkbalance.groupby(by=["Unpaid Tagging"])['Avg Product']\
        .mean()\
            .reset_index(name='Avg Product Ownership'))
data7

In [None]:
df_all = df_all.drop(columns=['Total Balance', 'Avg Balance', 'Avg Product'])

## **Data Preparation**

### **Checking for duplicate and missing data**

In [None]:
df_all.duplicated().sum()

In [None]:
df_all.isnull().sum()

## **Adding Relevant Variables**

### **Average balance and balance change over the observation period**
Customer balances are viewed within the observation scope:
1. Experiment 1: average balance over the last year, and the change in balance from Q1 to Q4
2. Experiment 2: average balance over the last 6 months, and the change in balance from Q2 to Q4

In [None]:
df1 = df_all.copy()
df2 = df_all.copy()

In [None]:
df1['Mean Balance'] = (df1['Balance Q1'] + df1['Balance Q2'] + df1['Balance Q3'] + df1['Balance Q4']) / 4
df1['Delta Balance'] = df1['Balance Q4'] - df1['Balance Q1']
df1.head()

In [None]:
df2['Mean Balance'] = (df2['Balance Q3'] + df2['Balance Q4']) / 2
df2['Delta Balance'] = df2['Balance Q4'] - df2['Balance Q2']
df2.head()

### **Activity Status**

Length of time the customer was active within the observation scope
1. Experiment 1: customer activity (in months) over the last year
2. Experiment 2: customer activity (in months) over the last 6 months

In [None]:
df1['Active Months'] = (df1['ActiveMember Q1'] + df1['ActiveMember Q2'] + df1['ActiveMember Q3'] + df1['ActiveMember Q4'])*3
df1.head()

In [None]:
df2['Active Months'] = (df2['ActiveMember Q3'] + df2['ActiveMember Q4'])*3
df2.head()

### **Increase or Decrease in Product Holding**
Tracks the change in the number of products a customer holds during the observation period

In [None]:
df1['Diff PH'] = df1['NumOfProducts Q4'] - df1['NumOfProducts Q1']
df1.head(20)

In [None]:
df2['Diff PH'] = df2['NumOfProducts Q4'] - df2['NumOfProducts Q2']
df2.head()

### **Credit Card Holding Period Within the Observation Period**

In [None]:
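def assign_cr1(row):
    """Return the months since the first quarter (Q1-Q4) in which the customer held a credit card."""
    if row['HasCrCard Q1'] == 1:
        return 12
    elif row['HasCrCard Q2'] == 1:
        return 9
    elif row['HasCrCard Q3'] == 1:
        return 6
    # Every remaining customer is assigned 3 months, including those without a card in Q4
    return 3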

In [None]:
df1['Vintage_CR'] = df1.apply(assign_cr1, axis=1)
df1.head()

In [None]:
df2['Vintage_CR'] = df2.apply(assign_cr1, axis=1)
df2.head()

In [None]:
df_all['Vintage_CR'] = df_all.apply(assign_cr1, axis=1)
df_all.head()

### **Dropping Variables**
The card ownership flags have been replaced by the credit card holding period

In [None]:
df1 = df1.drop(columns=['HasCrCard Q1', 'HasCrCard Q2', 'HasCrCard Q3', 'HasCrCard Q4'], errors='ignore')
df_all = df_all.drop(columns=['HasCrCard Q1', 'HasCrCard Q2', 'HasCrCard Q3', 'HasCrCard Q4'], errors='ignore')


<p style="font-size:16px;">The quarterly balances have been replaced by the average balance over the observation period and the difference between the balance at the start and at the end of the observation period</p>

In [None]:
df1 = df1.drop(columns = ['Balance Q1', 'Balance Q2', 'Balance Q3', 'Balance Q4'], errors='ignore')
df_all = df_all.drop(columns = ['Balance Q1', 'Balance Q2', 'Balance Q3', 'Balance Q4'], errors='ignore')

<p style="font-size:16px;"> The quarterly product counts have been replaced by the change in product ownership over the observation period </p>

In [None]:
df1 = df1.drop(columns = ['NumOfProducts Q1', 'NumOfProducts Q2', 'NumOfProducts Q3', 'NumOfProducts Q4'], errors='ignore')
df_all = df_all.drop(columns = ['NumOfProducts Q1', 'NumOfProducts Q2', 'NumOfProducts Q3', 'NumOfProducts Q4'], errors='ignore')

<p style="font-size:16px;"> The quarterly active-member flags have been replaced by the number of active months </p>

In [None]:
df1 = df1.drop(columns = ['ActiveMember Q1', 'ActiveMember Q2', 'ActiveMember Q3', 'ActiveMember Q4'], errors='ignore')
df_all = df_all.drop(columns = ['ActiveMember Q1', 'ActiveMember Q2', 'ActiveMember Q3', 'ActiveMember Q4'], errors='ignore')

### **Data Transformation**
Separating the predictor variables

In [None]:
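# .copy() makes the later column assignments act on independent frames (avoids SettingWithCopyWarning)
predictor1 = df1[df1.columns.difference(['Customer ID', 'Unpaid Tagging'])].copy()
predictor2 = df2[df2.columns.difference(['Customer ID', 'Unpaid Tagging'])].copy()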



In [None]:
predictor1.head()

In [None]:
predictor2.head()

In [None]:
print("Columns in df1:", df1.columns.tolist())
print("Columns in df_all:", df_all.columns.tolist())

print("Columns in df_all but not in df1:", set(df_all.columns) - set(df1.columns))
print("Columns in df1 but not in df_all:", set(df1.columns) - set(df_all.columns))


### **Encoding Categorical Data**


### Categorical variables:
1. Branch Code
2. City

Branch Code must be converted to a string so that it is treated as categorical data
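
A minimal sketch of why the string conversion matters, and of one way to keep the encoded columns of a second dataset aligned with the first. The branch codes, the city names, and the frames `train` and `valid` are hypothetical; `reindex` is shown as an optional safeguard, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy frames; "Branch Code" is stored as an integer
train = pd.DataFrame({
    "Branch Code": [1001, 1002, 1003],
    "City": ["Jakarta", "Bandung", "Jakarta"],
    "Age": [30, 45, 52],
})
valid = pd.DataFrame({
    "Branch Code": [1001, 1003],
    "City": ["Jakarta", "Jakarta"],
    "Age": [28, 61],
})

# With an integer Branch Code, get_dummies leaves the column untouched
untouched = pd.get_dummies(train[["Branch Code"]])

# After casting to string, each branch becomes its own indicator column
for frame in (train, valid):
    frame["Branch Code"] = frame["Branch Code"].astype(str)

train_enc = pd.get_dummies(train)

# The second frame lacks branch 1002 and Bandung; reindex adds them as all-zero columns
valid_enc = pd.get_dummies(valid).reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(valid_enc)
```

Aligning on the training columns guards against a `KeyError` later on when a category present in one file is missing from the other.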

In [None]:
predictor1['Branch Code'] = predictor1['Branch Code'].astype(str)
predictor2['Branch Code'] = predictor2['Branch Code'].astype(str)

In [None]:
predictor1.info()

In [None]:
predictor2.info()

In [None]:
predictor1 = pd.get_dummies(predictor1)
predictor2 = pd.get_dummies(predictor2)

In [None]:
predictor1.head()

In [None]:
predictor2.head()

In [None]:
predname = predictor1.columns
predname_num = predictor1.columns[0:7]
predname_cat = predictor1.columns[7:31]

In [None]:
predname

In [None]:
predname_num

In [None]:
predname_cat

In [None]:
X1_num = predictor1[predname_num]
X2_num = predictor2[predname_num]
X1_cat = predictor1[predname_cat]
X2_cat = predictor2[predname_cat]

### **Standardizing Numeric Data**
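
Standardization rescales each numeric column to zero mean and unit variance. The minimal sketch below, on made-up numbers, illustrates one common convention: the scaler's mean and standard deviation are learned from the training data only and then reused unchanged for any other dataset. The arrays `train` and `valid` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0]])
valid = np.array([[2.0], [4.0]])

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train)

train_scaled = scaler.transform(train)
# Reuse the training statistics; the scaler is not refitted here
valid_scaled = scaler.transform(valid)

print(scaler.mean_, scaler.scale_)
print(valid_scaled.ravel())
```

Refitting the scaler on a second dataset would map the same raw value to different scaled values in the two datasets, which the model has no way to account for.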

In [None]:
from sklearn.preprocessing import StandardScaler
pt = StandardScaler()
X1_num = pd.DataFrame(pt.fit_transform(X1_num))
X1_num.head()

In [None]:
X1_num.columns = predname_num
X1_num.head()

In [None]:
X2_num = pd.DataFrame(pt.fit_transform(X2_num))
X2_num.head()

In [None]:
X2_num.columns = predname_num
X2_num.head()

### **Combining the Predictor Datasets**

In [None]:
X1 = pd.concat([X1_cat, X1_num], axis = 1)
X2 = pd.concat([X2_cat, X2_num], axis = 1)

In [None]:
X1.head()

In [None]:
X2.head()

In [None]:
y1 = df1['Unpaid Tagging']
y2 = df2['Unpaid Tagging']

### **Preparing the Validation Dataset**

<p style="font-size:16px;"> Import Data </p>

In [None]:
df1_val = pd.read_csv("FinanKu Data Validasi.csv")
df2_val = pd.read_csv("FinanKu Data Validasi.csv")

In [None]:
df1_val.head()

In [None]:
df2_val.head()

### **Adding Relevant Variables**
Average balance and balance change

In [None]:
df1_val['Mean Balance'] = (df1_val['Balance Q2']+df1_val['Balance Q3']+df1_val['Balance Q4']+df1_val['Balance Q5'])/4
df2_val['Mean Balance'] = (df2_val['Balance Q4']+df2_val['Balance Q5'])/2

In [None]:
df1_val['Delta Balance'] = df1_val['Balance Q5']-df1_val['Balance Q2']
df2_val['Delta Balance'] = df2_val['Balance Q5']-df2_val['Balance Q3']

<p style="font-size:16px; font-weight:bold;"> Activity Status </p>

In [None]:
df1_val['Active Months'] = (df1_val['ActiveMember Q2']+df1_val['ActiveMember Q3']+df1_val['ActiveMember Q4']+df1_val['ActiveMember Q5'])*3
df2_val['Active Months'] = (df2_val['ActiveMember Q4']+df2_val['ActiveMember Q5'])*3

<p style="font-size:16px; font-weight:bold;"> Increase or Decrease in Product Holding </p>

In [None]:
df1_val['Diff PH'] = df1_val['NumOfProducts Q5']-df1_val['NumOfProducts Q2']
df2_val['Diff PH'] = df2_val['NumOfProducts Q5']-df2_val['NumOfProducts Q3']

<p style="font-size:16px; font-weight:bold;"> Credit Card Holding Period </p>

In [None]:
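def assign_cr2(row):
    """Return the months since the first quarter (Q2-Q5) in which the customer held a credit card."""
    if row['HasCrCard Q2'] == 1:
        return 12
    elif row['HasCrCard Q3'] == 1:
        return 9
    elif row['HasCrCard Q4'] == 1:
        return 6
    # Every remaining customer is assigned 3 months, including those without a card in Q5
    return 3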

In [None]:
df1_val['Vintage_CR'] = df1_val.apply(assign_cr2, axis=1)
df2_val['Vintage_CR'] = df2_val.apply(assign_cr2, axis=1)

### **Dropping Variables**

In [None]:
df1_val = df1_val.drop(columns = ['HasCrCard Q5', 'HasCrCard Q2', 'HasCrCard Q3', 'HasCrCard Q4','Balance Q5', 'Balance Q2', 'Balance Q3', 'Balance Q4','NumOfProducts Q5', 'NumOfProducts Q2', 'NumOfProducts Q3', 'NumOfProducts Q4','ActiveMember Q5', 'ActiveMember Q2', 'ActiveMember Q3', 'ActiveMember Q4'])
df2_val = df2_val.drop(columns = ['HasCrCard Q5', 'HasCrCard Q2', 'HasCrCard Q3', 'HasCrCard Q4','Balance Q5', 'Balance Q2', 'Balance Q3', 'Balance Q4','NumOfProducts Q5', 'NumOfProducts Q2', 'NumOfProducts Q3', 'NumOfProducts Q4','ActiveMember Q5', 'ActiveMember Q2', 'ActiveMember Q3', 'ActiveMember Q4'])

In [None]:
df1_val.head(10)

In [None]:
df2_val.head(10)

### **Selecting the Predictor Variables**

In [None]:
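# .copy() makes the later column assignments act on independent frames (avoids SettingWithCopyWarning)
predictor1_val = df1_val[df1_val.columns.difference(['Customer ID', 'Unpaid Tagging'])].copy()
predictor2_val = df2_val[df2_val.columns.difference(['Customer ID', 'Unpaid Tagging'])].copy()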

In [None]:
predictor1_val['Branch Code'] = predictor1_val['Branch Code'].astype(str)
predictor2_val['Branch Code'] = predictor2_val['Branch Code'].astype(str)

In [None]:
predictor1_val = pd.get_dummies(predictor1_val)
predictor2_val = pd.get_dummies(predictor2_val)

In [None]:
predictor1_val.head()

In [None]:
X1_num_val = predictor1_val[predname_num]
X2_num_val = predictor2_val[predname_num]
X1_cat_val = predictor1_val[predname_cat]
X2_cat_val = predictor2_val[predname_cat]

In [None]:
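# Scale the validation features with statistics learned from the training data
# instead of refitting the scaler on the validation set
scaler1 = StandardScaler().fit(predictor1[predname_num])
X1_num_val = pd.DataFrame(scaler1.transform(X1_num_val), columns=predname_num)

scaler2 = StandardScaler().fit(predictor2[predname_num])
X2_num_val = pd.DataFrame(scaler2.transform(X2_num_val), columns=predname_num)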

In [None]:
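# Combine the validation features (the original cell concatenated the training frames by mistake)
X1_val = pd.concat([X1_cat_val, X1_num_val], axis = 1)
X2_val = pd.concat([X2_cat_val, X2_num_val], axis = 1)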

In [None]:
X1_val.head()

In [None]:
y1_val = df1_val['Unpaid Tagging']
y2_val = df2_val['Unpaid Tagging']

### **Correlation Check**
Variables whose absolute pairwise correlation exceeds 0.95 are dropped.
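
A minimal sketch of the upper-triangle filter on synthetic data, using the 0.95 cutoff that appears in the cells of this section. The columns `a`, `a_copy`, and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

toy = pd.DataFrame({
    "a": a,
    # Nearly a rescaled copy of "a", so the two are almost perfectly correlated
    "a_copy": a * 2.0 + 0.001 * rng.normal(size=200),
    "b": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the entries above the diagonal so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

threshold = 0.95
# A column is dropped when it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is removed.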

In [None]:
corrtest1 = X1.corr().abs()
corrtest2 = X2.corr().abs()

In [None]:
corrtest1

In [None]:
corrtest2

In [None]:
upper = corrtest1.where(np.triu(np.ones(corrtest1.shape), k=1).astype(bool))

to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

X1 = X1.drop(to_drop, axis=1)
X1_val = X1_val.drop(to_drop, axis=1)

In [None]:
X1.head()

In [None]:
upper2 = corrtest2.where(np.triu(np.ones(corrtest2.shape), k=1).astype(bool))

to_drop2 = [column for column in upper2.columns if any(upper2[column] > 0.95)]

X2 = X2.drop(to_drop2, axis=1)
X2_val = X2_val.drop(to_drop2, axis=1)

In [None]:
X2.head()

In [None]:
y2.value_counts()

In [None]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE 

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, stratify=y1, random_state=30)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, stratify=y2, random_state=30)

sm = SMOTE(random_state=30)


X1_train, y1_train = sm.fit_resample(X1_train, y1_train)
X2_train, y2_train = sm.fit_resample(X2_train, y2_train)

print("Shape of the X1 training data after SMOTE:", X1_train.shape)
print("Class distribution of y1 after SMOTE (should be balanced):\n", y1_train.value_counts())

### **Modeling**
The models are built with 3 algorithms:
1. Logistic Regression
2. Gradient Boosting
3. Random Forest

### **Logistic Regression**
Defining the parameters

In [None]:
penalty = ['l2']
tol = [0.001, 0.0001, 0.00001]
C = [100.0, 10.0, 1.00, 0.1, 0.01, 0.001]
fit_intercept = [True, False]
intercept_scaling = [1.0, 0.75, 0.5, 0.25]
class_weight = ['balanced', None]
solver = ['newton-cg', 'sag', 'lbfgs', 'saga']
max_iter=[14000]
param_distributions = dict(penalty=penalty, tol=tol, C=C, fit_intercept=fit_intercept, intercept_scaling=intercept_scaling,
                  class_weight=class_weight, solver=solver, max_iter=max_iter)

<p style="font-size:16px; font-weight:bold;"> Searching for the Best Parameters </p>

In [None]:
from sklearn.model_selection import GridSearchCV

### **Experiment 1**

In [None]:
import time

logreg = LogisticRegression()
grid = GridSearchCV(estimator=logreg, param_grid = param_distributions , scoring = 'recall', cv = 5, n_jobs=-1)

start_time = time.time()
grid_result = grid.fit(X1_train, y1_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')

### **Experiment 2**

In [None]:
grid2 = GridSearchCV(estimator=logreg, param_grid = param_distributions , scoring = 'recall', cv = 5, n_jobs=-1)
start_time = time.time()
grid_result2 = grid2.fit(X2_train, y2_train)
print("Best: %f using %s" % (grid_result2.best_score_, grid_result2.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')

### **Gradient Boosting**

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report, make_scorer,accuracy_score,precision_score,recall_score,f1_score
gbparameter = {'max_depth':[5,10,15],'gamma':[0.0,0.1,0.2,0.3], 'n_estimators':[25,50,75,100],'learning_rate':[0.05,0.1,0.2,0.3], 'scale_pos_weight':[1,3]}
score = {'accuracy':make_scorer(accuracy_score), 'precision':make_scorer(precision_score),'recall':make_scorer(recall_score), 'f1':make_scorer(f1_score)}

### Hyperparameters in the Machine Learning Model

* **Gamma**: The minimum *loss reduction* required to split a node further. The larger the gamma value, the more conservative the model becomes, which can lead to *underfitting*.

* **Learning_rate**: The shrinkage applied to the feature weights. Each boosting iteration produces weights for the features, and *learning_rate* shrinks those weights so that the model does not *overfit*.

* **Scale_pos_weight**: The balance of weights between the positive class (*unpaid*) and the negative class (*not unpaid*). This *hyperparameter* is especially useful for *imbalanced datasets*. A commonly used value is the number of *majority class* samples divided by the number of *minority class* samples.
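
As a minimal sketch of the rule of thumb for `scale_pos_weight`, the ratio can be computed directly from the label counts. The label vector below is hypothetical (10% unpaid) and is not taken from the FinanKu data.

```python
import numpy as np

# Hypothetical labels: 90 paid (0) and 10 unpaid (1)
y = np.array([0] * 90 + [1] * 10)

n_negative = int((y == 0).sum())  # majority class
n_positive = int((y == 1).sum())  # minority class

# Rule of thumb: majority count divided by minority count
scale_pos_weight = n_negative / n_positive
print(scale_pos_weight)
```

When the training labels have already been balanced, for example by oversampling, the same formula gives a ratio of 1.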

### **Experiment 1**

In [None]:
GB_Grid = GridSearchCV(XGBClassifier(), gbparameter, cv=5,refit='recall', verbose=0, n_jobs=-1, scoring=score)

In [None]:
start_time = time.time()
GB_result = GB_Grid.fit(X1_train, y1_train)
print("Best: %f using %s" % (GB_result.best_score_, GB_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')

### **Experiment 2**

In [None]:
GB_Grid2 = GridSearchCV(XGBClassifier(), gbparameter, cv=5,refit='recall', verbose=0, n_jobs=-1, scoring=score)

In [None]:
start_time = time.time()
GB2_result = GB_Grid2.fit(X2_train, y2_train)
print("Best: %f using %s" % (GB2_result.best_score_, GB2_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')

### **Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
# 'auto' was removed from max_features in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
parameter = {'max_depth':[5,10,15,20],'max_features':['sqrt','log2'], 'n_estimators':[25,50,75,100,125],'min_samples_split':[2,3,5,7]}
score = {'accuracy':make_scorer(accuracy_score), 'precision':make_scorer(precision_score),'recall':make_scorer(recall_score), 'f1':make_scorer(f1_score)}



### Hyperparameters in the Machine Learning Model

* **Max_depth**: The maximum number of splits, or levels, in a single tree. A larger `max_depth` lets the model separate the classes more precisely on the training data, but it also raises the risk of *overfitting*.

* **Max_features**: The maximum number of features considered when splitting a node (*splitting node*). As with `max_depth`, considering more features per split gives more detailed results but can also make the model *overfit* its training data.

* **N_estimator**: The number of trees to build. Because Random Forest classifies by *majority vote*, more trees generally give more stable accuracy, at the cost of longer computation time.

* **Min_sample_split**: The minimum number of samples required at an *internal node* before it can be split. A large value makes the model more conservative, but a value that is too large can cause *underfitting*.
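
A minimal sketch of how a grid over these four hyperparameters can be searched with several metrics at once while selecting on recall. The synthetic data from `make_classification`, the small grid values, and `cv=3` are assumptions chosen to keep the example fast; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced toy data (about 20% positives)
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Deliberately small grid so the search finishes quickly
param_grid = {
    "max_depth": [3, 5],
    "max_features": ["sqrt", "log2"],
    "n_estimators": [10, 25],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=["accuracy", "recall"],  # track both metrics for every combination
    refit="recall",                  # pick and refit the combination with the best recall
    cv=3,
)
search.fit(X, y)

best = search.best_index_
print(search.best_params_)
print("recall:", search.cv_results_["mean_test_recall"][best])
print("accuracy:", search.cv_results_["mean_test_accuracy"][best])
```

With multi-metric scoring, `best_score_` and `best_params_` refer to the metric named in `refit`, while the other metrics remain available in `cv_results_`.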


### **Experiment 1**

In [None]:
RF_Grid = GridSearchCV(RandomForestClassifier(), parameter, cv=5,refit='recall', verbose=0, n_jobs=-1, scoring=score)
start_time = time.time()
RF_result = RF_Grid.fit(X1_train, y1_train)
print("Best: %f using %s" % (RF_result.best_score_, RF_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')

### **Experiment 2**

In [None]:
RF_Grid2 = GridSearchCV(RandomForestClassifier(), parameter, cv=5,refit='recall', verbose=0, n_jobs=-1, scoring=score)
start_time = time.time()
RF_result2 = RF_Grid2.fit(X2_train, y2_train)
print("Best: %f using %s" % (RF_result2.best_score_, RF_result2.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')

## **Evaluation**

### **Logistic Regression**

<p style="font-size:16px; font-weight:bold;"> Experiment 1 </p>

In [None]:
y1_proba = grid.predict_proba(X1_test)[:, 1]

threshold = 0.3
y1_pred_adjusted = (y1_proba >= threshold).astype(int)

from sklearn import metrics
print("=== LOGISTIC REGRESSION RESULTS EXP 1 TEST ===")
print("Accuracy:", metrics.accuracy_score(y1_test, y1_pred_adjusted))
print("Recall:", metrics.recall_score(y1_test, y1_pred_adjusted))
print("Precision:", metrics.precision_score(y1_test, y1_pred_adjusted))

In [None]:
y1_val_proba = grid.predict_proba(X1_val)[:, 1]

threshold = 0.3
y1_pred_val_adjusted = (y1_val_proba >= threshold).astype(int)

from sklearn import metrics
print("=== LOGISTIC REGRESSION RESULTS EXP 1 VALIDATION ===")
print("Accuracy:", metrics.accuracy_score(y1_val, y1_pred_val_adjusted))
print("Recall:", metrics.recall_score(y1_val, y1_pred_val_adjusted))
print("Precision:", metrics.precision_score(y1_val, y1_pred_val_adjusted))

In [None]:
from jcopml.feature_importance import mean_score_decrease
df_imp1 = mean_score_decrease(X1_train, y1_train, grid, plot= True, topk=20)

### **Experiment 2**

In [None]:
y2_proba = grid2.predict_proba(X2_test)[:, 1]
threshold = 0.3
y2_pred_adjusted = (y2_proba >= threshold).astype(int)

from sklearn import metrics
print("=== LOGISTIC REGRESSION RESULTS EXP 2 TEST ===")
print("Accuracy:", metrics.accuracy_score(y2_test, y2_pred_adjusted))
print("Recall:", metrics.recall_score(y2_test, y2_pred_adjusted))
print("Precision:", metrics.precision_score(y2_test, y2_pred_adjusted))

In [None]:
y2_val_proba = grid2.predict_proba(X2_val)[:, 1] 

threshold = 0.3
y2_pred_val_adjusted = (y2_val_proba >= threshold).astype(int)

from sklearn import metrics
print("=== LOGISTIC REGRESSION RESULTS EXP 2 VALIDATION ===")
print("Accuracy:", metrics.accuracy_score(y2_val, y2_pred_val_adjusted))
print("Recall:", metrics.recall_score(y2_val, y2_pred_val_adjusted))
print("Precision:", metrics.precision_score(y2_val, y2_pred_val_adjusted))

In [None]:
df_imp2 = mean_score_decrease(X2_train, y2_train, grid2, plot=True, topk=20)

### **Gradient Boosting**

Experiment 1

In [None]:
y11_proba = GB_Grid.predict_proba(X1_test)[:, 1]
threshold = 0.3
y11_pred_adjusted = (y11_proba >= threshold).astype(int)

from sklearn import metrics
print("=== GRADIENT BOOSTING EXP 1 TEST===")
print("Accuracy:", metrics.accuracy_score(y1_test, y11_pred_adjusted))
print("Recall:", metrics.recall_score(y1_test, y11_pred_adjusted))
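# completeness_score is a clustering metric; report precision as in the Logistic Regression cells
print("Precision:", metrics.precision_score(y1_test, y11_pred_adjusted))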

In [None]:
y11_val_proba = GB_Grid.predict_proba(X1_val)[:, 1]
threshold = 0.3
y11_pred_val_adjusted = (y11_val_proba >= threshold).astype(int)

from sklearn import metrics
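print("=== GRADIENT BOOSTING EXP 1 VALIDATION ===")
print("Accuracy:", metrics.accuracy_score(y1_val, y11_pred_val_adjusted))
print("Recall:", metrics.recall_score(y1_val, y11_pred_val_adjusted))
# completeness_score is a clustering metric; report precision as in the Logistic Regression cells
print("Precision:", metrics.precision_score(y1_val, y11_pred_val_adjusted))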

In [None]:
df_imp3 = mean_score_decrease(X1_train, y1_train, GB_Grid, plot=True, topk=20)

<p style="font-size:16px; font-weight:bold;"> Experiment 2 </p>

In [None]:
y22_proba = GB_Grid2.predict_proba(X2_test)[:, 1]
threshold = 0.3
y22_pred_adjusted = (y22_proba >= threshold).astype(int)

from sklearn import metrics
print("=== GRADIENT BOOSTING EXP 2 TEST ===")
print("Accuracy:", metrics.accuracy_score(y2_test, y22_pred_adjusted))
print("Recall:", metrics.recall_score(y2_test, y22_pred_adjusted))
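# completeness_score is a clustering metric; report precision as in the Logistic Regression cells
print("Precision:", metrics.precision_score(y2_test, y22_pred_adjusted))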

In [None]:
y22_val_proba = GB_Grid2.predict_proba(X2_val)[:, 1]
threshold = 0.3
y22_pred_val_adjusted = (y22_val_proba >= threshold).astype(int)

from sklearn import metrics
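print("=== GRADIENT BOOSTING EXP 2 VALIDATION ===")
print("Accuracy:", metrics.accuracy_score(y2_val, y22_pred_val_adjusted))
print("Recall:", metrics.recall_score(y2_val, y22_pred_val_adjusted))
# completeness_score is a clustering metric; report precision as in the Logistic Regression cells
print("Precision:", metrics.precision_score(y2_val, y22_pred_val_adjusted))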

In [None]:
df_imp4 = mean_score_decrease(X2_train, y2_train, GB_Grid2, plot=True, topk=20)

### **Random Forest**
Experiment 1

In [None]:
y12_proba = RF_Grid.predict_proba(X1_test)[:, 1]
threshold = 0.3
y12_pred_adjusted = (y12_proba >= threshold).astype(int)

from sklearn import metrics
print("=== RANDOM FOREST EXP 1 TEST ===")
print("Accuracy:", metrics.accuracy_score(y1_test, y12_pred_adjusted))
print("Recall:", metrics.recall_score(y1_test, y12_pred_adjusted))
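# completeness_score is a clustering metric; report precision as in the Logistic Regression cells
print("Precision:", metrics.precision_score(y1_test, y12_pred_adjusted))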

In [None]:
y12_val_proba = RF_Grid.predict_proba(X1_val)[:, 1]
threshold = 0.3
y12_pred_val_adjusted = (y12_val_proba >= threshold).astype(int)

from sklearn import metrics
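print("=== RANDOM FOREST EXP 1 VALIDATION ===")
print("Accuracy:", metrics.accuracy_score(y1_val, y12_pred_val_adjusted))
print("Recall:", metrics.recall_score(y1_val, y12_pred_val_adjusted))
# completeness_score is a clustering metric; report precision as in the Logistic Regression cells
print("Precision:", metrics.precision_score(y1_val, y12_pred_val_adjusted))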

In [None]:
df_imp5 = mean_score_decrease(X1_train, y1_train, RF_Grid, plot=True, topk=20)

Experiment 2

In [None]:
y21_proba = RF_Grid2.predict_proba(X2_test)[:, 1]
threshold = 0.3
y21_pred_adjusted = (y21_proba >= threshold).astype(int)

from sklearn import metrics
print("=== RANDOM FOREST EXP 2 TEST ===")
print("Accuracy:", metrics.accuracy_score(y2_test, y21_pred_adjusted))
print("Recall:", metrics.recall_score(y2_test, y21_pred_adjusted))
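# completeness_score is a clustering metric; report precision as in the Logistic Regression cells
print("Precision:", metrics.precision_score(y2_test, y21_pred_adjusted))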

In [None]:
y21_val_proba = RF_Grid2.predict_proba(X2_val)[:, 1]
threshold = 0.3
y21_pred_val_adjusted = (y21_val_proba >= threshold).astype(int)

from sklearn import metrics
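print("=== RANDOM FOREST EXP 2 VALIDATION ===")
print("Accuracy:", metrics.accuracy_score(y2_val, y21_pred_val_adjusted))
print("Recall:", metrics.recall_score(y2_val, y21_pred_val_adjusted))
# completeness_score is a clustering metric; report precision as in the Logistic Regression cells
print("Precision:", metrics.precision_score(y2_val, y21_pred_val_adjusted))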

In [None]:
df_imp6 = mean_score_decrease(X2_train, y2_train, RF_Grid2, plot=True, topk=20)





### Conclusion

Across all models, accuracy is on average **above 60%**, but recall is **below 40%**. In other words, many customers who are actually likely to default are still predicted as not defaulting. In this modeling iteration, therefore, **the stated objective has not yet been achieved.**
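
The gap between accuracy and recall depends partly on the probability threshold used to turn scores into labels (0.3 in the evaluation cells). The following is a minimal sketch, on synthetic data rather than the FinanKu data, of how the two metrics move as the threshold changes. The dataset from `make_classification` and the plain `LogisticRegression` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (about 15% positives)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

results = []
for threshold in (0.2, 0.3, 0.4, 0.5):
    # A customer is flagged when the predicted probability reaches the threshold
    pred = (proba >= threshold).astype(int)
    recall = recall_score(y_te, pred)
    accuracy = accuracy_score(y_te, pred)
    results.append((threshold, recall, accuracy))
    print(f"threshold={threshold:.1f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Lowering the threshold can only keep or increase recall, since every customer flagged at a higher threshold is still flagged at a lower one; what it costs in accuracy and precision has to be checked on the data at hand.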

### Future Improvements

Possible next steps include:
* Enlarging the sample (number of customers), assuming the current dataset is not the full customer population.
* Trying other resampling strategies for the minority class (unpaid), in addition to the SMOTE *oversampling* applied to the training data here, so that the model is less biased toward the majority class.
* Extending the time horizon.
* Trying other variable combinations (adding new variables, or removing variables with low *importance* in the latest results).
* Widening the *hyperparameter* combinations explored during model building.
* Trying other *supervised machine learning* algorithms.
