# Hacktiv8 Talent Fair Vol. 5 Challenge


Lion Parcel Loan Analysis

For Talent Fair Vol. 5, I analyzed loan data from Lion Parcel.

The problem addressed is profiling the customers who were granted loans by Lion Parcel.

The following dashboard, which I built in Looker, makes this customer profiling easier. [Link](https://lookerstudio.google.com/reporting/cf02e8b1-71d4-43c3-a247-9446caa1fa9f)

# 1. Import Library

In [3]:
# Import Library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

import pickle
import joblib

# 2. Data Loading

In [4]:
# Import Data

data = pd.read_csv('C:/Users/waskito/Downloads/lion-loan-train.csv')

In [5]:
# Show Data

data

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | LP002978 | Female | No | 0 | Graduate | No | 2900 | 0.0 | 71.0 | 360.0 | 1.0 | Rural | Y |
| 610 | LP002979 | Male | Yes | 3+ | Graduate | No | 4106 | 0.0 | 40.0 | 180.0 | 1.0 | Rural | Y |
| 611 | LP002983 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | Y |
| 612 | LP002984 | Male | Yes | 2 | Graduate | No | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | Urban | Y |


In [6]:
# Data Shape Check

data.shape

(614, 13)

The dataset consists of 614 rows and 13 columns (features).

# 3. Data Preprocessing

## 3.1 Data Info

In [7]:
# Data info

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [8]:
# Check the unique values of the Dependents feature

data['Dependents'].unique()

array(['0', '1', '2', '3+', nan], dtype=object)

In [9]:
# Replace the value 3+ with 3 in the Dependents feature

data['Dependents'] = data['Dependents'].replace('3+', '3')

In [10]:
# Check the unique values of the Dependents feature after replacing 3+ with 3

data['Dependents'].unique()

array(['0', '1', '2', '3', nan], dtype=object)
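
The replacement above leaves the column as strings. As a minimal sketch on a hypothetical toy Series (not the real loan data), the same cleanup can be followed by `pd.to_numeric` to obtain numeric values while keeping missing entries as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column mimicking the unique values of Dependents
dependents = pd.Series(["0", "1", "2", "3+", np.nan], name="Dependents")

# Exact-value replacement: only the literal string "3+" is changed
cleaned = dependents.replace("3+", "3")

# Convert the numeric strings to floats; NaN stays NaN
numeric = pd.to_numeric(cleaned)

print(cleaned.tolist())
print(numeric.dtype)
```

Note that mapping `3+` to `3` treats "three or more" as exactly three, which is a simplification.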

## 3.2 Data Duplicated

In [11]:
# Duplicated Check

print('Number of duplicated rows =', data.duplicated().sum())

Number of duplicated rows = 0


The data contains no duplicate rows.
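
`duplicated()` with no arguments only flags rows that match on every column. A minimal sketch, using a hypothetical toy frame with made-up IDs, of how a check restricted to the identifier column can differ from the full-row check:

```python
import pandas as pd

# Hypothetical toy data: two rows share an ID but differ in LoanAmount
toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP2"],
    "LoanAmount": [100.0, 120.0, 130.0],
})

# Full-row duplicates: every column must match an earlier row
full_row_dupes = int(toy.duplicated().sum())

# ID-only duplicates: only Loan_ID must match an earlier row
id_dupes = int(toy.duplicated(subset=["Loan_ID"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Loan_ID duplicates:", id_dupes)
```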

## 3.3 Missing Value

### 3.3.1 Missing Value Check

In [12]:
# Missing Value Check

data.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [13]:
# Missing Value Percentage

data.isnull().mean()*100

Loan_ID              0.000000
Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        5.211726
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

The output above shows the percentage of missing values in each column. According to the Complete Dissertation guide by Statistics Solutions [Link](https://www.statisticssolutions.com/dissertation-resources/missing-values-in-data/#:~:text=In%20statistical%20language%2C%20if%20the%20number%20of%20the,cases%20%28rather%20than%20do%20imputation%29%20and%20replace%20them.), dropping cases with missing values is tolerable only when fewer than 5% of them are missing. Self_Employed (5.21%) and Credit_History (8.14%) exceed that limit, so the missing values are handled by imputation instead.
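
The 5% rule of thumb can be checked column by column. The following is a minimal sketch on a hypothetical toy frame; the name `THRESHOLD` and the toy values are assumptions for illustration, only the column names mirror the loan data:

```python
import numpy as np
import pandas as pd

n = 40
# Hypothetical toy data with known missing rates
toy = pd.DataFrame({
    "Married": ["Yes"] * (n - 1) + [np.nan],            # 1 of 40 missing (2.5%)
    "Credit_History": [1.0] * (n - 4) + [np.nan] * 4,   # 4 of 40 missing (10%)
    "ApplicantIncome": list(range(n)),                  # complete
})

THRESHOLD = 5.0  # percent, the rule-of-thumb limit for dropping

missing_pct = toy.isna().mean() * 100

# Columns with some missing values but under the limit
below = missing_pct[(missing_pct > 0) & (missing_pct < THRESHOLD)].index.tolist()
# Columns at or over the limit, where dropping is harder to justify
above = missing_pct[missing_pct >= THRESHOLD].index.tolist()

print("Under threshold:", below)
print("At or over threshold:", above)
```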

### 3.3.2 Data Manipulation - Missing Value Handling

In this project I use SimpleImputer to impute the missing values in the data.

In [14]:
# One imputer per strategy: mode, median and mean
simple_impute_mf = SimpleImputer(strategy='most_frequent')
simple_impute_median = SimpleImputer(strategy='median')
simple_impute_mean = SimpleImputer(strategy='mean')

# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it back into a column
data['Gender'] = simple_impute_mf.fit_transform(data['Gender'].values.reshape(-1,1)).ravel()
data['Married'] = simple_impute_mf.fit_transform(data['Married'].values.reshape(-1,1)).ravel()
# Dependents holds numeric strings ('0'..'3'); the median imputer returns them as floats
data['Dependents'] = simple_impute_median.fit_transform(data['Dependents'].values.reshape(-1,1)).ravel()
data['Self_Employed'] = simple_impute_mf.fit_transform(data['Self_Employed'].values.reshape(-1,1)).ravel()
data['LoanAmount'] = simple_impute_mean.fit_transform(data['LoanAmount'].values.reshape(-1,1)).ravel()
data['Loan_Amount_Term'] = simple_impute_mf.fit_transform(data['Loan_Amount_Term'].values.reshape(-1,1)).ravel()
data['Credit_History'] = simple_impute_mf.fit_transform(data['Credit_History'].values.reshape(-1,1)).ravel()

In [15]:
data['Dependents'].unique()

array([0., 1., 2., 3.])

In [16]:
# Missing values per column after imputation

data.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [17]:
# Missing value percentage after imputation

data.isna().mean()*100

Loan_ID              0.0
Gender               0.0
Married              0.0
Dependents           0.0
Education            0.0
Self_Employed        0.0
ApplicantIncome      0.0
CoapplicantIncome    0.0
LoanAmount           0.0
Loan_Amount_Term     0.0
Credit_History       0.0
Property_Area        0.0
Loan_Status          0.0
dtype: float64

In [18]:
data.shape

(614, 13)

The missing values were handled with SimpleImputer rather than by dropping rows, because the dataset is very small.
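
As a minimal sketch of the same idea on a hypothetical toy frame (not the real loan data), columns that share a strategy can be imputed together by passing a multi-column selection to `SimpleImputer`. The grouping into `cat_cols` and `num_cols` is an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "Married": ["Yes", np.nan, "Yes", "No"],
    "LoanAmount": [100.0, np.nan, 200.0, 300.0],
})

cat_cols = ["Gender", "Married"]  # filled with the most frequent value
num_cols = ["LoanAmount"]         # filled with the column mean

# Each imputer learns its fill value per column, then fills the gaps
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])
toy[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])

print(toy)
```

If the data were later used to train a model, fitting the imputers on the training split only would avoid leaking information from the test split.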

## 3.4 Data Manipulation - Change Data Type

In [19]:
# Showing data info

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             614 non-null    object 
 2   Married            614 non-null    object 
 3   Dependents         614 non-null    float64
 4   Education          614 non-null    object 
 5   Self_Employed      614 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         614 non-null    float64
 9   Loan_Amount_Term   614 non-null    float64
 10  Credit_History     614 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(5), int64(1), object(7)
memory usage: 62.5+ KB


In [20]:
# Change data type to int64 and assign the result back to the column

data['Dependents'] = data['Dependents'].astype('int64')
data['Dependents']

0      0
1      1
2      0
3      0
4      0
      ..
609    0
610    3
611    1
612    2
613    0
Name: Dependents, Length: 614, dtype: int64

This converts the "Dependents" feature, originally a string/object column (and float64 after imputation), to an integer type.
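
Note that `astype` returns a new Series rather than modifying the column in place, so the converted values only persist when they are assigned back. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy column holding whole numbers stored as floats
toy = pd.DataFrame({"Dependents": [0.0, 1.0, 3.0]})

# Result is discarded here, so the column keeps its original dtype
toy["Dependents"].astype("int64")
dtype_without_assignment = toy["Dependents"].dtype

# Result is stored here, so the column dtype actually changes
toy["Dependents"] = toy["Dependents"].astype("int64")
dtype_with_assignment = toy["Dependents"].dtype

print(dtype_without_assignment, dtype_with_assignment)
```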

## 3.5 Saving Data

In [21]:
# Saving data to csv

data.to_csv('C:/Users/waskito/Desktop/Porto/Talent-Fair/dataclean.csv', index=False)

After handling the missing values and changing the data types, the data is saved and later used for dashboarding in Looker.
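
Before loading the file into a dashboard, it can be worth reading it back to confirm the columns survived the round trip. A minimal sketch, assuming a hypothetical toy frame and a temporary directory instead of the real output path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the cleaned loan table
toy = pd.DataFrame({"Loan_ID": ["A1", "A2"], "Dependents": [0, 3]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataclean.csv")
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded.shape)
```

Writing with `index=False` is what prevents an extra `Unnamed: 0` column from appearing when the file is read again.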