# Home Credit Scorecard Model 
  

## Introduction

This notebook builds a credit scoring model for Home Credit Indonesia as part of the Data Scientist Project-Based Internship with Rakamin.

Seven datasets are provided for this modelling task:
1. __application__

   The main dataset, containing information about each credit application.
2. __bureau__

   Credit that the customer has taken out at other financial institutions.
3. __bureau_balance__

   Payment information for the credit held at other financial institutions.
4. __credit_card_balance__

   Monthly balances of the credit cards the customer holds with Home Credit.
5. __installments_payments__

   Repayment history for the credit issued by Home Credit.
6. __pos_cash_balance__

   Monthly balances of the POS loans and cash loans the customer holds with Home Credit.
7. __previous_application__

   The customer's history of earlier credit applications with Home Credit.

## Objective

The goal of this notebook is to build a credit scoring model from the data provided. The model will be used to predict whether a credit application will be approved or not.

The main data source is application_train.csv.


In [3]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

import pandas as pd
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('ggplot')

In [4]:
conn = sqlite3.connect('hci_application.db')
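
Before querying, it can help to confirm which tables the SQLite file actually contains. The sketch below is one way to do that through `sqlite_master`. It runs against an in-memory stand-in database with two empty tables, and the helper name `list_tables` is hypothetical; with the real data you would pass the notebook's `conn` instead.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database so the example runs on its own.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE application (SK_ID_CURR INTEGER PRIMARY KEY)")
demo_conn.execute(
    "CREATE TABLE bureau (SK_ID_BUREAU INTEGER PRIMARY KEY, SK_ID_CURR INTEGER)"
)

tables = list_tables(demo_conn)
print(tables)
```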

## Exploratory Data Analysis

In [27]:
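def fetch_random_n_rows(n, table_name, primary_key) -> pd.DataFrame:
    """Return n randomly chosen rows of table_name as a DataFrame."""
    # Identifiers cannot be bound as SQL parameters, so table_name and
    # primary_key are interpolated directly; pass trusted names only.
    query = f"""SELECT *
                FROM {table_name}
                WHERE {primary_key}
                IN (
                    SELECT {primary_key}
                    FROM {table_name}
                    ORDER BY RANDOM()
                    LIMIT ?
                )"""
    # The row count is bound as a parameter rather than formatted into the SQL.
    return pd.read_sql(query, conn, params=(n,))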
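
To show how this sampling query behaves, here is a small self-contained sketch. It uses an in-memory database and a made-up `loans` table, and the variant `sample_rows` (a hypothetical name) takes the connection as an argument so it does not depend on the notebook's global `conn`.

```python
import sqlite3

import pandas as pd


def sample_rows(connection, n, table_name, primary_key):
    """Same sampling logic as fetch_random_n_rows, with the connection passed in."""
    query = f"""SELECT * FROM {table_name}
                WHERE {primary_key} IN (
                    SELECT {primary_key} FROM {table_name}
                    ORDER BY RANDOM() LIMIT ?
                )"""
    return pd.read_sql(query, connection, params=(n,))


# Made-up table with 20 rows and a unique integer key.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE loans (loan_id INTEGER PRIMARY KEY, amount REAL)")
demo_conn.executemany(
    "INSERT INTO loans VALUES (?, ?)", [(i, 1000.0 * i) for i in range(1, 21)]
)

sample = sample_rows(demo_conn, 5, "loans", "loan_id")
print(sample.shape)
```

The query returns exactly `n` rows only when the key column is unique. If a non-unique column is passed as `primary_key`, the outer `IN` filter returns every row that shares one of the sampled key values, so the result can be larger than `n`.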


In [14]:
df = pd.read_sql('SELECT * FROM application', conn)
print(df.shape)
df.head()

(307511, 122)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
 #    Column                        Dtype  
---   ------                        -----  
 0    SK_ID_CURR                    int64  
 1    TARGET                        int64  
 2    NAME_CONTRACT_TYPE            object 
 3    CODE_GENDER                   object 
 4    FLAG_OWN_CAR                  object 
 5    FLAG_OWN_REALTY               object 
 6    CNT_CHILDREN                  int64  
 7    AMT_INCOME_TOTAL              float64
 8    AMT_CREDIT                    float64
 9    AMT_ANNUITY                   float64
 10   AMT_GOODS_PRICE               float64
 11   NAME_TYPE_SUITE               object 
 12   NAME_INCOME_TYPE              object 
 13   NAME_EDUCATION_TYPE           object 
 14   NAME_FAMILY_STATUS            object 
 15   NAME_HOUSING_TYPE             object 
 16   REGION_POPULATION_RELATIVE    float64
 17   DAYS_BIRTH                    int64  
 18   DA
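
The `df.info` listing above shows dtypes but no non-null counts, which pandas can omit by default for a frame this wide. Passing `show_counts=True` to `df.info` asks for them explicitly. Another option, sketched below on a toy frame with illustrative values, is to compute the fraction of missing values per column directly; the helper name `missing_fraction` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_fraction(frame):
    """Fraction of missing values per column, largest first, complete columns dropped."""
    frac = frame.isna().mean()
    return frac[frac > 0].sort_values(ascending=False)


# Toy frame: the values are illustrative, not taken from the real table.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "AMT_ANNUITY": [24700.5, 35698.5, np.nan, 29686.5],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [1.0, 0.0, np.nan, np.nan],
})

summary = missing_fraction(toy)
print(summary)
```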

In [20]:
df.describe()

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,307511.0,307511.0,307511.0,307511.0,307511.0,307499.0,307233.0,307511.0,307511.0,307511.0,...,307511.0,307511.0,307511.0,307511.0,265992.0,265992.0,265992.0,265992.0,265992.0,265992.0
mean,278180.518577,0.080729,0.417052,168797.9,599026.0,27108.573909,538396.2,0.020868,-16036.995067,63815.045904,...,0.00813,0.000595,0.000507,0.000335,0.006402,0.007,0.034362,0.267395,0.265474,1.899974
std,102790.175348,0.272419,0.722121,237123.1,402490.8,14493.737315,369446.5,0.013831,4363.988632,141275.766519,...,0.089798,0.024387,0.022518,0.018299,0.083849,0.110757,0.204685,0.916002,0.794056,1.869295
min,100002.0,0.0,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,-25229.0,-17912.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,189145.5,0.0,0.0,112500.0,270000.0,16524.0,238500.0,0.010006,-19682.0,-2760.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,278202.0,0.0,0.0,147150.0,513531.0,24903.0,450000.0,0.01885,-15750.0,-1213.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,367142.5,0.0,1.0,202500.0,808650.0,34596.0,679500.0,0.028663,-12413.0,-289.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,456255.0,1.0,19.0,117000000.0,4050000.0,258025.5,4050000.0,0.072508,-7489.0,365243.0,...,1.0,1.0,1.0,1.0,4.0,9.0,8.0,27.0,261.0,25.0
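
Two details in this summary are worth a closer look. The mean of `TARGET` is 0.080729, so roughly 8% of rows carry the label 1. `DAYS_EMPLOYED` has a maximum of 365243 (about 1,000 years) while its quartiles are all negative, which suggests that value is a placeholder rather than a real duration. The sketch below shows one possible way to handle both on a toy frame with made-up values. Treating 365243 as a sentinel, and reading the negative `DAYS_BIRTH` values as days counted back from the application, are assumptions and not something the notebook states.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values that mimic the patterns in the summary above.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 0],
    "DAYS_BIRTH": [-9461, -16765, -19046, -19005, -19932],
    "DAYS_EMPLOYED": [-637, -1188, 365243, -3039, 365243],
})

# For a 0/1 column, the mean is the share of rows labeled 1.
positive_rate = toy["TARGET"].mean()

# Assumption: 365243 is a placeholder, so replace it with NaN.
SENTINEL_DAYS = 365243
toy["DAYS_EMPLOYED_CLEAN"] = toy["DAYS_EMPLOYED"].replace(SENTINEL_DAYS, np.nan)

# Assumption: DAYS_BIRTH counts days back from the application date.
toy["AGE_YEARS"] = -toy["DAYS_BIRTH"] / 365.25

print(positive_rate)
print(toy["DAYS_EMPLOYED_CLEAN"].isna().sum())
```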


In [28]:
# Fetch a random sample of 100 rows from the previous_application table
prev_df = fetch_random_n_rows(100, 'previous_application', 'SK_ID_PREV') 
print(prev_df.shape)
prev_df.head()

(100, 37)


Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,1733715,295943,Cash loans,,0.0,0.0,,,MONDAY,16,...,XNA,,XNA,Cash,,,,,,
1,2035648,390340,Cash loans,22860.0,450000.0,450000.0,,450000.0,TUESDAY,10,...,Consumer electronics,24.0,low_action,Cash X-Sell: low,365243.0,-180.0,510.0,365243.0,365243.0,0.0
2,2252317,346502,Revolving loans,,0.0,0.0,,,WEDNESDAY,11,...,XNA,,XNA,Card Street,,,,,,
3,2638235,438764,Consumer loans,5834.385,52362.0,57892.5,0.0,52362.0,THURSDAY,10,...,Consumer electronics,12.0,middle,POS household with interest,365243.0,-474.0,-144.0,-144.0,-138.0,0.0
4,1004156,206141,Consumer loans,11677.05,43200.0,44910.0,0.0,43200.0,WEDNESDAY,12,...,Clothing,4.0,low_action,POS industry with interest,365243.0,-444.0,-354.0,-354.0,-352.0,0.0


In [23]:
prev_df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 37 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   SK_ID_PREV                   100 non-null    int64  
 1   SK_ID_CURR                   100 non-null    int64  
 2   NAME_CONTRACT_TYPE           100 non-null    object 
 3   AMT_ANNUITY                  81 non-null     float64
 4   AMT_APPLICATION              100 non-null    float64
 5   AMT_CREDIT                   100 non-null    float64
 6   AMT_DOWN_PAYMENT             54 non-null     float64
 7   AMT_GOODS_PRICE              79 non-null     float64
 8   WEEKDAY_APPR_PROCESS_START   100 non-null    object 
 9   HOUR_APPR_PROCESS_START      100 non-null    int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  100 non-null    object 
 11  NFLAG_LAST_APPL_IN_DAY       100 non-null    int64  
 12  RATE_DOWN_PAYMENT            54 non-null     float64
 13  RATE_INTEREST_PRIMARY

In [29]:
bureau_df = fetch_random_n_rows(100, 'bureau', 'SK_ID_BUREAU')
print(bureau_df.shape)
bureau_df.head()

(100, 17)


Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,118035,5196758,Closed,currency 1,-321,0,-101.0,-138.0,,0,21510.0,0.0,,0.0,Consumer credit,-125,
1,439410,5800521,Closed,currency 1,-1226,0,-1039.0,-1039.0,0.0,0,149769.0,,,0.0,Consumer credit,-1038,
2,421040,5810431,Closed,currency 1,-2717,0,-2167.0,-2211.0,,0,225000.0,,,0.0,Consumer credit,-2207,
3,219232,5240303,Active,currency 1,-64,0,,,,0,135000.0,212625.0,,0.0,Consumer credit,-17,
4,217777,5003602,Closed,currency 1,-968,0,,-509.0,0.0,0,355500.0,0.0,0.0,0.0,Credit card,-427,


In [30]:
bureau_df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   SK_ID_CURR              100 non-null    int64  
 1   SK_ID_BUREAU            100 non-null    int64  
 2   CREDIT_ACTIVE           100 non-null    object 
 3   CREDIT_CURRENCY         100 non-null    object 
 4   DAYS_CREDIT             100 non-null    int64  
 5   CREDIT_DAY_OVERDUE      100 non-null    int64  
 6   DAYS_CREDIT_ENDDATE     92 non-null     float64
 7   DAYS_ENDDATE_FACT       63 non-null     float64
 8   AMT_CREDIT_MAX_OVERDUE  38 non-null     float64
 9   CNT_CREDIT_PROLONG      100 non-null    int64  
 10  AMT_CREDIT_SUM          100 non-null    float64
 11  AMT_CREDIT_SUM_DEBT     86 non-null     float64
 12  AMT_CREDIT_SUM_LIMIT    67 non-null     float64
 13  AMT_CREDIT_SUM_OVERDUE  100 non-null    float64
 14  CREDIT_TYPE             100 non-null    obj

In [31]:
bureau_df.describe()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
count,100.0,100.0,100.0,100.0,92.0,63.0,38.0,100.0,100.0,86.0,67.0,100.0,100.0,28.0
mean,280102.6,5879922.0,-1137.42,0.0,438.782609,-1133.952381,700.400132,0.0,393375.2,169609.3,4697.314925,0.0,-588.35,13961.690357
std,102799.432453,547649.4,836.194037,0.0,4687.470638,768.239891,2554.273431,0.0,997727.5,874747.3,27388.84428,0.0,675.506355,50114.958589
min,100530.0,5003602.0,-2917.0,0.0,-2667.0,-2667.0,0.0,0.0,0.0,0.0,0.0,0.0,-2661.0,0.0
25%,201436.25,5370380.0,-1571.75,0.0,-1231.75,-1694.5,0.0,0.0,52652.81,0.0,0.0,0.0,-845.0,0.0
50%,295281.5,5878544.0,-970.0,0.0,-289.0,-966.0,0.0,0.0,122533.0,0.0,0.0,0.0,-316.0,3271.5
75%,365386.75,6332310.0,-398.75,0.0,732.75,-498.0,0.0,0.0,230625.0,39710.25,0.0,0.0,-36.75,7212.375
max,444613.0,6837554.0,-48.0,0.0,30921.0,-44.0,11835.0,0.0,9000000.0,7928312.0,218250.0,0.0,-2.0,266836.5
