# QUIZ 2: PREDICTION MODEL USING NAIVE BAYES

## Import

In this step, I import the modules, the sklearn components, and the dataset that will be used. The purpose of each module and component is noted in the code comments below.

In [1]:
# Import modules
import numpy as np # numerical arrays and linear algebra
import pandas as pd # data processing

In [2]:
# Import sklearn components
from sklearn import model_selection # Model selection utilities (splitting, cross-validation)
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes classifier
from sklearn.model_selection import KFold # K-fold cross-validation splitter
from sklearn.model_selection import cross_val_score # Cross-validated scores for a single metric
from sklearn.model_selection import cross_validate # Cross-validated scores for several metrics
from sklearn.model_selection import train_test_split # Hold-out split into training and testing data
from sklearn.metrics import classification_report # Per-class precision, recall, and F1 report

In [3]:
# Import dataset
df = pd.read_csv('loan_data.csv')

## Data preprocessing

In this step, I fill in the missing values and convert the data types.

In [4]:
dff = df.copy()

In [5]:
dff.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [6]:
dff.isna().sum() # count the missing values in each column

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [7]:
# Find the fill values for the missing data: the mode for categorical and discrete columns, the mean for the continuous LoanAmount column.

dff["Gender"].mode()

0    Male
dtype: object

In [8]:
dff["Married"].mode()

0    Yes
dtype: object

In [9]:
dff["Dependents"].mode()

0    0
dtype: object

In [10]:
dff["Self_Employed"].mode()

0    No
dtype: object

In [11]:
dff["LoanAmount"].mean()

146.41216216216216

In [12]:
dff["Loan_Amount_Term"].mode()

0    360.0
dtype: float64

In [13]:
dff["Credit_History"].mode()

0    1.0
dtype: float64

In [14]:
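# Fill the missing values in each column.
# Assign the result back instead of using a chained inplace=True, which may not update dff in recent pandas versions.
# Numeric columns are filled with numbers (not strings) so their dtype stays numeric.

dff['Gender'] = dff['Gender'].fillna('Male')
dff['Married'] = dff['Married'].fillna('Yes')
dff['Dependents'] = dff['Dependents'].fillna('0')
dff['Self_Employed'] = dff['Self_Employed'].fillna('No')
dff['LoanAmount'] = dff['LoanAmount'].fillna(146.4)
dff['Loan_Amount_Term'] = dff['Loan_Amount_Term'].fillna(360)
dff['Credit_History'] = dff['Credit_History'].fillna(1)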
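
The fill values above are typed in by hand after reading the mode and mean outputs. As an alternative, they can be computed and applied in one pass, which keeps numeric columns numeric. A minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not rows from loan_data.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are invented for illustration).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [100.0, np.nan, 200.0, 150.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# Categorical or discrete columns: fill with the most frequent value.
for col in ["Gender", "Credit_History"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Continuous column: fill with the mean, kept as a float rather than a string.
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].mean())

print(toy.dtypes)
print(toy)
```

The same pattern could be applied to `dff` by listing its mode-filled columns in the loop; whether mode or mean is the better choice for each column remains a modelling decision.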

In [15]:
# Re-check the number of missing values in each column

dff.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [16]:
# Map the two-valued categorical columns (Yes/No, Graduate/Not Graduate, Y/N) to 1/0.

dff.Married = dff.Married.map({ 'Yes' : 1, 'No' : 0})
dff.Education = dff.Education.map({ 'Graduate' : 1, 'Not Graduate' : 0})
dff.Self_Employed = dff.Self_Employed.map({ 'Yes' : 1, 'No' : 0})
dff.Loan_Status = dff.Loan_Status.map({ 'Y' : 1, 'N' : 0})
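
One caveat with `Series.map` and a dictionary is that any value missing from the dictionary becomes NaN without raising an error. A small sketch of a sanity check, using an invented series with a deliberately mis-cased entry (the data here is an assumption, not taken from the dataset):

```python
import pandas as pd

# Invented example: "yes" is not a key in the mapping.
toy = pd.Series(["Yes", "No", "yes", "Yes"])
mapped = toy.map({"Yes": 1, "No": 0})

# Values that were present before mapping but turned into NaN afterwards.
unmapped = toy[mapped.isna()].unique()
print(unmapped)
```

Running a check like this before the mapping cell would reveal unexpected labels early, before they surface as conversion errors in a later cell.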

In [17]:
# Split the remaining categorical columns into dummy (one-hot) columns.

dff = dff.join(pd.get_dummies(dff['Gender']))
dff = dff.join(pd.get_dummies(dff['Dependents']))
dff = dff.join(pd.get_dummies(dff['Property_Area']))
dff = dff.drop(columns = 'Gender')
dff = dff.drop(columns = 'Dependents')
dff = dff.drop(columns = 'Property_Area')
dff

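| | Loan_ID | Married | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Loan_Status | Female | Male | 0 | 1 | 2 | 3+ | Rural | Semiurban | Urban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | 0 | 1 | 0 | 5849 | 0.0 | 146.4 | 360 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | LP001003 | 1 | 1 | 0 | 4583 | 1508.0 | 128 | 360 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | LP001005 | 1 | 1 | 1 | 3000 | 0.0 | 66 | 360 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | LP001006 | 1 | 0 | 0 | 2583 | 2358.0 | 120 | 360 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | LP001008 | 0 | 1 | 0 | 6000 | 0.0 | 141 | 360 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | LP002978 | 0 | 1 | 0 | 2900 | 0.0 | 71 | 360 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 610 | LP002979 | 1 | 1 | 0 | 4106 | 0.0 | 40 | 180 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 611 | LP002983 | 1 | 1 | 0 | 8072 | 240.0 | 253 | 360 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 612 | LP002984 | 1 | 1 | 0 | 7583 | 0.0 | 187 | 360 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |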
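
The join-then-drop pattern in the cell above produces dummy columns with bare names such as `0` or `Rural`. A possible alternative is to pass the column list directly to `pd.get_dummies`, which drops the originals and prefixes each new column with its source name. The sketch below uses an invented three-row frame; note that the resulting names (for example `Gender_Male`) differ from the ones used in the rest of this notebook, so the feature list would need adjusting if this variant were adopted.

```python
import pandas as pd

# Invented rows that mimic the three categorical columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Dependents": ["0", "3+", "1"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

# One call encodes all listed columns and removes the originals.
encoded = pd.get_dummies(
    toy, columns=["Gender", "Dependents", "Property_Area"], dtype=int
)
print(encoded.columns.tolist())
```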


In [18]:
# Cast the remaining columns to numeric types

dff['Married'] = dff['Married'].astype(int)
dff['Education'] = dff['Education'].astype(int)
dff['Self_Employed'] = dff['Self_Employed'].astype(float)
dff['Loan_Status'] = dff['Loan_Status'].astype(int)
dff['LoanAmount'] = dff['LoanAmount'].astype(float)
dff['Loan_Amount_Term'] = dff['Loan_Amount_Term'].astype(int)
dff['Credit_History'] = dff['Credit_History'].astype(int)

In [19]:
# Check that the data types are now as expected

dff.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Married            614 non-null    int32  
 2   Education          614 non-null    int32  
 3   Self_Employed      614 non-null    float64
 4   ApplicantIncome    614 non-null    int64  
 5   CoapplicantIncome  614 non-null    float64
 6   LoanAmount         614 non-null    float64
 7   Loan_Amount_Term   614 non-null    int32  
 8   Credit_History     614 non-null    int32  
 9   Loan_Status        614 non-null    int32  
 10  Female             614 non-null    uint8  
 11  Male               614 non-null    uint8  
 12  0                  614 non-null    uint8  
 13  1                  614 non-null    uint8  
 14  2                  614 non-null    uint8  
 15  3+                 614 non-null    uint8  
 16  Rural              614 non

## Splitting the training and testing data

In this step, the data is split into a training set, which is later used to fit the model, and a testing set.
The training set makes up 70% of the data.

In [20]:
fitur = ["Married", "Education", "Self_Employed", "ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term", "Credit_History", "Female", "Male", "0","1","2","3+","Rural","Semiurban","Urban"]

x = dff[fitur]
y = dff["Loan_Status"]

In [21]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, shuffle=True, random_state=42)
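
For 614 rows, a 70% training size gives 429 training rows and 185 testing rows. The sketch below confirms those sizes on synthetic data and additionally passes `stratify`, an option the notebook does not use, to keep the class ratio similar in both parts. The arrays `X_toy` and `y_toy` and their class counts are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(614, 3))
y_toy = np.array([1] * 400 + [0] * 214)  # made-up class counts

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, shuffle=True, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```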

## Building the Prediction Model

In this step, the Naive Bayes prediction model is set up, together with the KFold splitter and the list of scoring metrics.

In [22]:
naivebayes = GaussianNB()
kfold = model_selection.KFold()
scoring = ['accuracy', 'f1']

## Training the Prediction Model

In this step, the naivebayes model created above is trained on the training data from the earlier split.

In [23]:
naivebayes.fit(x_train, y_train) # fit the model on the training data

GaussianNB()

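## Evaluating Model Performance

In this step, model performance is evaluated using accuracy and F1 score. KFold defaults to five splits, so each metric is computed five times, once per fold. Note that cross_val_score fits a fresh copy of the model on each fold of the full dataset (x, y), so these scores do not depend on the fit from the previous step.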

In [24]:
acc_scores = cross_val_score(naivebayes, x, y, cv=kfold, scoring="accuracy")
print(acc_scores)         # accuracy per fold
print(acc_scores.mean())  # mean accuracy across the folds

[0.78861789 0.7398374  0.78861789 0.82113821 0.79508197]
0.7866586698653871


In [25]:
f1_scores = cross_val_score(naivebayes, x, y, cv=kfold, scoring="f1")
print(f1_scores)         # F1 score per fold
print(f1_scores.mean())  # mean F1 score across the folds

[0.85869565 0.82795699 0.86458333 0.87912088 0.8603352 ]
0.8581384098812327
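
The `cross_validate` function imported at the top can compute several metrics in a single pass over the folds, which avoids running the cross-validation once per metric. A minimal sketch on synthetic data from `make_classification` (the data and the names `X_toy`, `y_toy`, and `results` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data standing in for x and y.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# One call evaluates both metrics on the same five folds.
results = cross_validate(
    GaussianNB(), X_toy, y_toy, cv=KFold(n_splits=5), scoring=["accuracy", "f1"]
)

print(results["test_accuracy"].mean())
print(results["test_f1"].mean())
```

In the notebook itself, the equivalent call would take `naivebayes`, `x`, `y`, `kfold`, and the `scoring` list defined earlier.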


In [26]:
# Print the classification report to inspect per-class precision, recall, and F1-score on the test data.

score_test = naivebayes.score(x_test, y_test)
predict = naivebayes.predict(x_test)
print(classification_report(y_test, predict))

              precision    recall  f1-score   support

           0       0.86      0.46      0.60        65
           1       0.77      0.96      0.85       120

    accuracy                           0.78       185
   macro avg       0.81      0.71      0.73       185
weighted avg       0.80      0.78      0.76       185
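
The report does not show the confusion matrix itself, but the counts can be worked back from the rounded recall values and supports: 30 of 65 class-0 rows and 115 of 120 class-1 rows predicted correctly. These counts are an inference from the report, not notebook output. The sketch below rebuilds label arrays with those counts and checks that they reproduce the reported accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Label arrays reconstructed from the inferred counts (an assumption):
# class 0: 30 correct, 35 predicted as 1; class 1: 115 correct, 5 predicted as 0.
y_true = np.array([0] * 65 + [1] * 120)
y_pred = np.array([0] * 30 + [1] * 35 + [1] * 115 + [0] * 5)

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
print(cm)
print(round(accuracy_score(y_true, y_pred), 2))
```

In the notebook, calling `confusion_matrix(y_test, predict)` would give the actual matrix directly.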



In [27]:
# Compare test accuracy across several training sizes to find the highest one.

train_size = [0.9, 0.8, 0.6, 0.65, 2/3]
for t in train_size :
    naivebayes = GaussianNB()
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=t, random_state=42)
    naivebayes.fit(x_train, y_train)
    score = naivebayes.score(x_test, y_test)
    print("Accuracy Naive Bayes with train size %.2f : %.3f%%" % (t, score*100))

Accuracy Naive Bayes with train size 0.90 : 75.806%
Accuracy Naive Bayes with train size 0.80 : 78.049%
Accuracy Naive Bayes with train size 0.60 : 79.675%
Accuracy Naive Bayes with train size 0.65 : 79.535%
Accuracy Naive Bayes with train size 0.67 : 79.512%
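
Each accuracy above comes from a single random split, so part of the difference between training sizes may be due to which rows happened to land in the test set. One way to gauge this is a repeated hold-out with `ShuffleSplit`, which averages over many random splits per training size. This is a sketch on synthetic data, not a rerun of the notebook; `X_toy`, `y_toy`, and `summary` are names introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data standing in for x and y.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

summary = {}
for train_size in [0.6, 0.7, 0.8, 0.9]:
    # 20 random splits per training size; the rest of the data is the test part.
    splitter = ShuffleSplit(n_splits=20, train_size=train_size, random_state=42)
    scores = cross_val_score(
        GaussianNB(), X_toy, y_toy, cv=splitter, scoring="accuracy"
    )
    summary[train_size] = (scores.mean(), scores.std())
    print("train size %.2f: mean %.3f, std %.3f"
          % (train_size, scores.mean(), scores.std()))
```

If the standard deviation is comparable to the gap between training sizes, the ranking of those sizes should be read with caution.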


## Conclusion

In conclusion, the Naive Bayes model performs reasonably well at predicting loan status from the features listed above. Five-fold cross-validation on the full dataset gave a mean accuracy of 0.7867 and a mean F1-score of 0.8581, and the model trained on 70% of the data reached an accuracy of 0.78 on the remaining test data, although its recall for class 0 on that split was only 0.46. Among the training sizes tried, the highest accuracy, 79.675%, was obtained with a training size of 60% (0.60).
