# Home Credit Analysis

This notebook aims to answer the following questions:

- What factors affect how much an individual is approved for?
- How does a previous application affect the loan amount?
- Which loan purposes increase or decrease the amount difference?
- Which application start times increase or decrease the amount difference?

Data: https://www.kaggle.com/c/home-credit-default-risk/data

# Importing Data

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from matplotlib import pyplot as pyplot

from google.colab import drive
drive.mount('/content/drive')
drive_dir = '/content/drive/Shared drives/Project 4 (MATH 3439)/Data/'

Mounted at /content/drive


# Loading the Data

## Train / Test

The train and test application tables are not used in this notebook; the loading code is kept, commented out, for comparison.

In [None]:
#df1_train = pd.read_csv(drive_dir + 'application_train.csv.zip')
#df2_test = pd.read_csv(drive_dir + 'application_test.csv.zip')

## Loading a sample from each table (6 total)

### installments_payments

In [None]:
sample_df_instalpay = pd.read_csv(drive_dir + 'installments_payments.csv.zip', nrows=100000).sample(80000)
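# Load ten more 100,000-row chunks, sampling 80,000 rows from each (880,000 rows in total).
# Note: skiprows=range(1, n) skips file lines 1..n-1, so for i = 0 data row 100,000 is read a
# second time; range(1, (i+1)*100000 + 1) would avoid that one-row overlap.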
for i in range(10):
    new_sample_df_instalpay = pd.read_csv(drive_dir + 'installments_payments.csv.zip', nrows=100000, skiprows=range(1,(i+1)*100000)).sample(80000)

    sample_df_instalpay = pd.concat([sample_df_instalpay, new_sample_df_instalpay])
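
The chunk-and-sample loop above could also be written with the `chunksize` option of `pd.read_csv`, which leaves the row offsets to pandas. This is a minimal sketch, assuming the same 100,000-row chunks and 80,000-row samples; the helper name `sample_csv_in_chunks` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import pandas as pd

def sample_csv_in_chunks(path, chunk_rows=100_000, sample_rows=80_000,
                         max_chunks=11, seed=0):
    """Sample up to `sample_rows` rows from each of the first `max_chunks` chunks."""
    pieces = []
    # With chunksize, read_csv yields consecutive, non-overlapping DataFrames
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunk_rows)):
        if i >= max_chunks:
            break
        n = min(sample_rows, len(chunk))  # the final chunk may be shorter
        pieces.append(chunk.sample(n, random_state=seed + i))
    return pd.concat(pieces)
```

A call such as `sample_csv_in_chunks(drive_dir + 'installments_payments.csv.zip')` would then replace the initial read plus the loop; fixing `seed` makes the sample reproducible between runs.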

In [None]:
#Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = sample_df_instalpay[sample_df_instalpay.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

Duplicate Rows except first occurrence based on all columns are :
   SK_ID_PREV  SK_ID_CURR  ...  AMT_INSTALMENT  AMT_PAYMENT
0     2149338      111259  ...        28730.25     28730.25

[1 rows x 8 columns]
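
A single duplicate at index 0 is consistent with an off-by-one in `skiprows`: `range(1, n)` skips file lines 1 to n-1, so data row n is read by both the first chunk and the following one. The sketch below reproduces the effect on a 10-row in-memory CSV (toy data, not the Home Credit files) and shows `drop_duplicates` as one way to remove the repeated row.

```python
import io
import pandas as pd

csv_text = "id\n" + "\n".join(str(k) for k in range(1, 11))  # data rows 1..10

first = pd.read_csv(io.StringIO(csv_text), nrows=5)                          # rows 1..5
overlap = pd.read_csv(io.StringIO(csv_text), nrows=5, skiprows=range(1, 5))  # rows 5..9
fixed = pd.read_csv(io.StringIO(csv_text), nrows=5, skiprows=range(1, 6))    # rows 6..10

both = pd.concat([first, overlap])
n_dupes = int(both.duplicated().sum())  # counts the second copy of row 5
deduped = both.drop_duplicates()        # keeps the first occurrence only
```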


In [None]:
sample_df_instalpay.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
85438,2599877,174840,1.0,7,-2296.0,-2292.0,10716.21,10716.21
87430,1102197,138433,0.0,85,-1396.0,-1396.0,2250.0,2250.0
16596,2012087,121332,1.0,7,-240.0,-242.0,25584.525,25584.525
86535,2792257,127624,1.0,4,-1502.0,-1510.0,3758.67,3758.67
81800,1560899,168735,2.0,10,-187.0,-180.0,32495.985,32495.985


In [None]:
sample_df_instalpay['SK_ID_PREV'].nunique()

236558

In [None]:
sample_df_instalpay.shape

(880000, 8)

In [None]:
sample_df_instalpay.to_csv(drive_dir + 'installments_payments_mini.csv', index = False )

### POS_CASH_balance

In [None]:
sample_df_poscashb = pd.read_csv(drive_dir + 'POS_CASH_balance.csv.zip', nrows=100000).sample(80000)
# Load ten more 100,000-row chunks, sampling 80,000 rows from each
for i in range(10):
    new_sample_df_poscashb = pd.read_csv(drive_dir + 'POS_CASH_balance.csv.zip', nrows=100000, skiprows=range(1,(i+1)*100000)).sample(80000)

    sample_df_poscashb = pd.concat([sample_df_poscashb, new_sample_df_poscashb])

In [None]:
#Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = sample_df_poscashb[sample_df_poscashb.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

Duplicate Rows except first occurrence based on all columns are :
   SK_ID_PREV  SK_ID_CURR  ...  SK_DPD  SK_DPD_DEF
0     1671450      186011  ...       0           0

[1 rows x 8 columns]


In [None]:
sample_df_poscashb.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
19765,1239342,216180,-18,24.0,19.0,Active,0,0
76957,2179568,397416,-56,7.0,4.0,Active,0,0
67581,2501225,257924,-4,12.0,10.0,Active,0,0
82658,2023169,170080,-11,18.0,11.0,Active,0,0
84403,1374439,148961,-21,12.0,5.0,Active,0,0


In [None]:
sample_df_poscashb['SK_ID_PREV'].nunique()

477172

In [None]:
sample_df_poscashb.shape

(880000, 8)

In [None]:
sample_df_poscashb.to_csv(drive_dir + 'POS_CASH_balance_mini.csv', index = False )

### bureau

In [None]:
sample_df_bureau = pd.read_csv(drive_dir + 'bureau.csv.zip', nrows=100000).sample(80000)
# Load ten more 100,000-row chunks, sampling 80,000 rows from each
for i in range(10):
    new_sample_df_bureau = pd.read_csv(drive_dir + 'bureau.csv.zip', nrows=100000, skiprows=range(1,(i+1)*100000)).sample(80000)

    sample_df_bureau = pd.concat([sample_df_bureau, new_sample_df_bureau])

In [None]:
#Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = sample_df_bureau[sample_df_bureau.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

Duplicate Rows except first occurrence based on all columns are :
   SK_ID_CURR  SK_ID_BUREAU  ... DAYS_CREDIT_UPDATE AMT_ANNUITY
0      240757       6542162  ...                 -6     10125.0

[1 rows x 17 columns]


In [None]:
sample_df_bureau.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
22634,406090,5197457,Closed,currency 1,-603,0,-453.0,-421.0,,0,118656.0,,,0.0,Consumer credit,-414,0.0
39025,171262,5811086,Active,currency 1,-291,0,-261.0,,,0,202500.0,181287.0,0.0,0.0,Credit card,-233,
2308,167780,5717336,Active,currency 1,-205,0,344.0,,0.0,0,297000.0,212571.0,0.0,0.0,Consumer credit,-15,6120.0
42492,249764,5212718,Active,currency 1,-1223,0,,,0.0,0,135000.0,115348.5,19650.51,0.0,Credit card,-18,13500.0
75796,164484,5253659,Closed,currency 1,-827,0,-274.0,-273.0,6064.29,0,98685.0,0.0,0.0,0.0,Consumer credit,-271,8657.325


In [None]:
sample_df_bureau.shape

(880000, 17)

In [None]:
sample_df_bureau.to_csv(drive_dir + 'bureau_mini.csv', index = False )

### bureau_balance

In [None]:
sample_df_bureaubal = pd.read_csv(drive_dir + 'bureau_balance.csv.zip', nrows=100000).sample(80000)
# Load ten more 100,000-row chunks, sampling 80,000 rows from each
for i in range(10):
    new_sample_df_bureaubal = pd.read_csv(drive_dir + 'bureau_balance.csv.zip', nrows=100000, skiprows=range(1,(i+1)*100000)).sample(80000)

    sample_df_bureaubal = pd.concat([sample_df_bureaubal, new_sample_df_bureaubal])

In [None]:
#Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = sample_df_bureaubal[sample_df_bureaubal.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

Duplicate Rows except first occurrence based on all columns are :
   SK_ID_BUREAU  MONTHS_BALANCE STATUS
0       5241763             -27      X


In [None]:
sample_df_bureaubal.head()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
44147,5262488,-28,X
55297,6021653,-87,0
49897,6020603,-44,C
74326,5726348,-5,0
68578,5004181,-23,0


In [None]:
sample_df_bureaubal['SK_ID_BUREAU'].nunique()

32432

In [None]:
sample_df_bureaubal.shape

(880000, 3)

In [None]:
sample_df_bureaubal.to_csv(drive_dir + 'bureau_balance_mini.csv', index = False )

### credit_card_balance

In [None]:
sample_df_creditcb = pd.read_csv(drive_dir + 'credit_card_balance.csv.zip', nrows=100000).sample(80000)
# Load ten more 100,000-row chunks, sampling 80,000 rows from each
for i in range(10):
    new_sample_df_creditcb = pd.read_csv(drive_dir + 'credit_card_balance.csv.zip', nrows=100000, skiprows=range(1,(i+1)*100000)).sample(80000)

    sample_df_creditcb = pd.concat([sample_df_creditcb, new_sample_df_creditcb])

In [None]:
sample_df_creditcb.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
94319,1730221,401839,-25,0.0,225000,,0.0,,,0.0,,0.0,0.0,0.0,0.0,,0,,,0.0,Active,0,0
67327,2657934,366032,-33,135972.405,135000,0.0,0.0,0.0,0.0,6750.0,6750.0,6750.0,131812.695,135972.405,135972.405,0.0,0,0.0,0.0,44.0,Active,0,0
33661,2411238,257442,-4,181374.75,180000,0.0,0.0,0.0,0.0,9096.48,9900.0,9900.0,172581.48,180045.855,180045.855,0.0,0,0.0,0.0,17.0,Active,0,0
25298,1698374,353507,-4,468008.64,450000,24750.0,24750.0,0.0,0.0,22285.305,22500.0,22500.0,447956.46,463213.89,463213.89,1.0,1,0.0,0.0,31.0,Active,0,0
39766,1214346,322324,-87,72486.135,90000,0.0,0.0,0.0,0.0,4500.0,7200.0,7200.0,69947.055,72486.135,72486.135,0.0,0,0.0,0.0,14.0,Active,0,0


In [None]:
#Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = sample_df_creditcb[sample_df_creditcb.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

Duplicate Rows except first occurrence based on all columns are :
   SK_ID_PREV  SK_ID_CURR  ...  SK_DPD  SK_DPD_DEF
0     2710147      381634  ...       0           0

[1 rows x 23 columns]


In [None]:
sample_df_creditcb['SK_ID_PREV'].nunique()

98320

In [None]:
df_credprev_nunique = sample_df_creditcb[sample_df_creditcb['SK_ID_PREV']==2657934]

In [None]:
df_credprev_nunique.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
67327,2657934,366032,-33,135972.405,135000,0.0,0.0,0.0,0.0,6750.0,6750.0,6750.0,131812.695,135972.405,135972.405,0.0,0,0.0,0.0,44.0,Active,0,0
14838,2657934,366032,-62,135343.395,135000,0.0,0.0,0.0,0.0,6750.0,6750.0,6750.0,131107.995,135343.395,135343.395,0.0,0,0.0,0.0,17.0,Active,0,0
2297,2657934,366032,-7,145839.735,135000,900.0,900.0,0.0,0.0,6750.0,6975.0,6975.0,134798.805,145839.735,145839.735,1.0,1,0.0,0.0,70.0,Active,1,1
59009,2657934,366032,-16,148062.42,135000,0.0,0.0,0.0,0.0,6750.0,0.0,0.0,134993.97,148062.42,148062.42,0.0,0,0.0,0.0,61.0,Active,1,1
21033,2657934,366032,-49,133779.6,135000,0.0,0.0,0.0,0.0,6750.0,9000.0,9000.0,129613.68,133779.6,133779.6,0.0,0,0.0,0.0,30.0,Active,0,0


In [None]:
sample_df_creditcb['SK_ID_CURR'].nunique()

97692

In [None]:
sample_df_creditcb.shape

(880000, 23)

In [None]:
sample_df_creditcb.to_csv(drive_dir + 'credit_card_balance_mini.csv', index = False )

### previous_application

In [None]:
sample_df_app = pd.read_csv(drive_dir + 'previous_application.csv.zip', nrows=100000).sample(80000)
# Load ten more 100,000-row chunks, sampling 80,000 rows from each
for i in range(10):
    new_sample_df_app = pd.read_csv(drive_dir + 'previous_application.csv.zip', nrows=100000, skiprows=range(1,(i+1)*100000)).sample(80000)

    sample_df_app = pd.concat([sample_df_app, new_sample_df_app])

In [None]:
sample_df_app.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,FLAG_LAST_APPL_PER_CONTRACT,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS,DAYS_DECISION,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,SELLERPLACE_AREA,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
80082,2138331,249458,Consumer loans,18324.045,184275.0,200088.0,0.0,184275.0,WEDNESDAY,14,Y,1,0.0,,,XAP,Approved,-418,Cash through the bank,XAP,Unaccompanied,Repeater,Audio/Video,POS,XNA,Country-wide,2500,Consumer electronics,12.0,low_action,POS household without interest,365243.0,-387.0,-57.0,-57.0,-50.0,0.0
42593,1989262,446664,Consumer loans,6990.615,52155.0,49684.5,10431.0,52155.0,WEDNESDAY,12,Y,1,0.188975,,,XAP,Approved,-1769,Cash through the bank,XAP,Family,New,Consumer Electronics,POS,XNA,Regional / Local,260,Consumer electronics,8.0,middle,POS household with interest,365243.0,-1738.0,-1528.0,-1558.0,-1555.0,0.0
54142,2007043,170000,Consumer loans,5825.205,25192.44,29763.0,1.44,25192.44,WEDNESDAY,12,Y,1,5.3e-05,,,XAP,Approved,-849,Cash through the bank,XAP,Family,Repeater,Photo / Cinema Equipment,POS,XNA,Regional / Local,135,Consumer electronics,6.0,high,POS household with interest,365243.0,-818.0,-668.0,-668.0,-656.0,0.0
39009,2392160,335249,Cash loans,,0.0,0.0,,,MONDAY,13,Y,1,,,,XNA,Canceled,-66,XNA,XAP,,Repeater,XNA,XNA,XNA,Credit and cash offices,0,XNA,,XNA,Cash,,,,,,
45060,2808823,193643,Cash loans,19698.12,216000.0,237384.0,,216000.0,THURSDAY,8,Y,1,,,,XNA,Refused,-607,Cash through the bank,HC,,Repeater,XNA,Cash,x-sell,Country-wide,58,Connectivity,18.0,high,Cash X-Sell: high,,,,,,


In [None]:
#Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = sample_df_app[sample_df_app.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

Duplicate Rows except first occurrence based on all columns are :
   SK_ID_PREV  SK_ID_CURR  ... DAYS_TERMINATION  NFLAG_INSURED_ON_APPROVAL
0     1792520      267933  ...           -356.0                        1.0

[1 rows x 37 columns]


In [None]:
sample_df_app['SK_ID_PREV'].nunique()

879999

In [None]:
sample_df_app.shape

(880000, 37)

In [None]:
df_app_nunique = sample_df_app[sample_df_app['SK_ID_CURR']==388788]

In [None]:
df_app_nunique.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,FLAG_LAST_APPL_PER_CONTRACT,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS,DAYS_DECISION,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,SELLERPLACE_AREA,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
70959,1238555,388788,Consumer loans,29337.615,219982.5,152482.5,67500.0,219982.5,SUNDAY,14,Y,1,0.334179,,,XAP,Approved,-2521,XNA,XAP,Family,New,Computers,POS,XNA,Stone,1593,Consumer electronics,6.0,high,POS household with interest,365243.0,-2487.0,-2337.0,-2337.0,-2332.0,0.0
88546,1845466,388788,Consumer loans,2413.71,20281.5,20808.0,1350.0,20281.5,FRIDAY,18,Y,1,0.066354,,,XAP,Approved,-1984,Cash through the bank,XAP,Unaccompanied,Repeater,Mobile,POS,XNA,Country-wide,50,Connectivity,12.0,high,POS mobile with interest,365243.0,-1953.0,-1623.0,-1623.0,-1618.0,0.0


In [None]:
sample_df_app['NAME_CONTRACT_STATUS'].unique()

array(['Approved', 'Canceled', 'Refused', 'Unused offer'], dtype=object)

In [None]:
sample_df_app.to_csv(drive_dir + 'previous_application_mini.csv', index = False )

# Merging

In [None]:
result = pd.merge(sample_df_app, sample_df_poscashb, how='left', on=['SK_ID_PREV'])
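
`POS_CASH_balance` holds one row per month for each previous loan, so a left merge on `SK_ID_PREV` repeats an application row once per matching month. Both tables also contain `SK_ID_CURR` and `NAME_CONTRACT_STATUS`, which pandas renames with `_x` and `_y`. The sketch below uses made-up toy frames (`app` and `pos` are hypothetical stand-ins for `sample_df_app` and `sample_df_poscashb`) to show explicit suffixes and the `indicator` option, which records whether each row found a match.

```python
import pandas as pd

# Toy stand-ins for the two sampled tables; all values are invented
app = pd.DataFrame({"SK_ID_PREV": [1, 2, 3],
                    "SK_ID_CURR": [10, 20, 30],
                    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved"]})
pos = pd.DataFrame({"SK_ID_PREV": [1, 1, 3],
                    "SK_ID_CURR": [10, 10, 30],
                    "MONTHS_BALANCE": [-2, -1, -5],
                    "NAME_CONTRACT_STATUS": ["Active", "Active", "Completed"]})

merged = pd.merge(app, pos, how="left", on="SK_ID_PREV",
                  suffixes=("_APP", "_POS"),  # readable names instead of _x / _y
                  indicator=True)             # adds a _merge column

# "both" = application matched a balance row, "left_only" = no match
match_counts = merged["_merge"].value_counts()
```

Because both tables here are random samples, many applications can be expected to land in `left_only`, which is what the all-empty `POS_CASH_balance` columns in the merged output indicate.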

In [None]:
result.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR_x,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,FLAG_LAST_APPL_PER_CONTRACT,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS_x,DAYS_DECISION,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,SELLERPLACE_AREA,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL,SK_ID_CURR_y,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS_y,SK_DPD,SK_DPD_DEF
0,2138331,249458,Consumer loans,18324.045,184275.0,200088.0,0.0,184275.0,WEDNESDAY,14,Y,1,0.0,,,XAP,Approved,-418,Cash through the bank,XAP,Unaccompanied,Repeater,Audio/Video,POS,XNA,Country-wide,2500,Consumer electronics,12.0,low_action,POS household without interest,365243.0,-387.0,-57.0,-57.0,-50.0,0.0,,,,,,,
1,1989262,446664,Consumer loans,6990.615,52155.0,49684.5,10431.0,52155.0,WEDNESDAY,12,Y,1,0.188975,,,XAP,Approved,-1769,Cash through the bank,XAP,Family,New,Consumer Electronics,POS,XNA,Regional / Local,260,Consumer electronics,8.0,middle,POS household with interest,365243.0,-1738.0,-1528.0,-1558.0,-1555.0,0.0,,,,,,,
2,2007043,170000,Consumer loans,5825.205,25192.44,29763.0,1.44,25192.44,WEDNESDAY,12,Y,1,5.3e-05,,,XAP,Approved,-849,Cash through the bank,XAP,Family,Repeater,Photo / Cinema Equipment,POS,XNA,Regional / Local,135,Consumer electronics,6.0,high,POS household with interest,365243.0,-818.0,-668.0,-668.0,-656.0,0.0,,,,,,,
3,2392160,335249,Cash loans,,0.0,0.0,,,MONDAY,13,Y,1,,,,XNA,Canceled,-66,XNA,XAP,,Repeater,XNA,XNA,XNA,Credit and cash offices,0,XNA,,XNA,Cash,,,,,,,,,,,,,
4,2808823,193643,Cash loans,19698.12,216000.0,237384.0,,216000.0,THURSDAY,8,Y,1,,,,XNA,Refused,-607,Cash through the bank,HC,,Repeater,XNA,Cash,x-sell,Country-wide,58,Connectivity,18.0,high,Cash X-Sell: high,,,,,,,,,,,,,


In [None]:
result = result.dropna(how='any')

In [None]:
result.shape

(4, 44)
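
Only 4 rows survive because `dropna(how='any')` removes every row that has a missing value in any column, and `RATE_INTEREST_PRIMARY` and `RATE_INTEREST_PRIVILEGED` are missing in almost all rows of `previous_application`. The sketch below, on toy data, shows the `subset` argument as one way to restrict the check to chosen columns; which columns actually matter for the analysis is an assumption here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "AMT_APPLICATION": [1000.0, 2000.0, 3000.0, 4000.0],
    "AMT_CREDIT": [900.0, np.nan, 3100.0, 4000.0],
    "RATE_INTEREST_PRIMARY": [np.nan, np.nan, np.nan, 0.14],  # mostly missing
})

strict = toy.dropna(how="any")                                   # keeps 1 row
targeted = toy.dropna(subset=["AMT_APPLICATION", "AMT_CREDIT"])  # keeps 3 rows
```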

In [None]:
result.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR_x,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,FLAG_LAST_APPL_PER_CONTRACT,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS_x,DAYS_DECISION,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,SELLERPLACE_AREA,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL,SK_ID_CURR_y,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS_y,SK_DPD,SK_DPD_DEF
213566,1468627,394419,Consumer loans,7650.405,82917.0,74623.5,8293.5,82917.0,THURSDAY,14,Y,1,0.108933,0.141774,0.56871,XAP,Approved,-824,XNA,XAP,Unaccompanied,New,Construction Materials,POS,XNA,Stone,90,Construction,12.0,middle,POS industry with interest,365243.0,-789.0,-459.0,-639.0,-632.0,0.0,394419.0,-23.0,12.0,8.0,Active,0.0,0.0
1016540,1897803,259229,Consumer loans,24029.82,260437.5,234391.5,26046.0,260437.5,SATURDAY,16,Y,1,0.108918,0.14176,0.56871,XAP,Approved,-768,XNA,XAP,Unaccompanied,Refreshed,Audio/Video,POS,XNA,Stone,67,Consumer electronics,12.0,middle,POS household with interest,365243.0,-728.0,-398.0,-578.0,-571.0,0.0,259229.0,-21.0,12.0,8.0,Active,0.0,0.0
1016541,1897803,259229,Consumer loans,24029.82,260437.5,234391.5,26046.0,260437.5,SATURDAY,16,Y,1,0.108918,0.14176,0.56871,XAP,Approved,-768,XNA,XAP,Unaccompanied,Refreshed,Audio/Video,POS,XNA,Stone,67,Consumer electronics,12.0,middle,POS household with interest,365243.0,-728.0,-398.0,-578.0,-571.0,0.0,259229.0,-22.0,12.0,9.0,Active,0.0,0.0
1071324,2677591,159789,Consumer loans,4235.085,45900.0,41310.0,4590.0,45900.0,TUESDAY,13,Y,1,0.108909,0.141859,0.56871,XAP,Approved,-718,XNA,XAP,Family,New,Furniture,POS,XNA,Stone,70,Furniture,12.0,middle,POS industry with interest,365243.0,-679.0,-349.0,-589.0,-579.0,0.0,159789.0,-20.0,12.0,9.0,Active,0.0,0.0


# Cleaning the Data

In [None]:
sample_df_app.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,FLAG_LAST_APPL_PER_CONTRACT,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS,DAYS_DECISION,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,SELLERPLACE_AREA,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
44971,1923284,317363,Cash loans,49305.825,499500.0,519282.0,,499500.0,FRIDAY,10,Y,1,,,,XNA,Approved,-266,Cash through the bank,XAP,,Repeater,XNA,Cash,x-sell,Credit and cash offices,-1,XNA,12.0,low_normal,Cash X-Sell: low,365243.0,-236.0,94.0,-146.0,-138.0,1.0
763,1923756,198658,Consumer loans,13215.105,106941.15,116350.65,0.0,106941.15,SUNDAY,17,Y,1,0.0,,,XAP,Approved,-343,Cash through the bank,XAP,,Repeater,Mobile,POS,XNA,Country-wide,50,Connectivity,10.0,low_normal,POS mobile without interest,365243.0,-308.0,-38.0,-38.0,-35.0,0.0
23379,2084613,201578,Cash loans,22617.855,207000.0,232168.5,,207000.0,SATURDAY,13,Y,1,,,,XNA,Refused,-361,Cash through the bank,HC,Unaccompanied,Repeater,XNA,Cash,x-sell,Credit and cash offices,-1,XNA,12.0,low_normal,Cash X-Sell: low,,,,,,
93314,1843857,227309,Revolving loans,2250.0,45000.0,45000.0,,45000.0,SATURDAY,15,Y,1,,,,XAP,Approved,-257,XNA,XAP,Unaccompanied,Repeater,XNA,Cards,walk-in,AP+ (Cash loan),3,XNA,0.0,XNA,Card Street,-250.0,-211.0,365243.0,365243.0,365243.0,0.0
15754,1930776,214211,Cash loans,8446.725,67500.0,81157.5,,67500.0,TUESDAY,14,Y,1,,,,XNA,Approved,-557,Cash through the bank,XAP,,Repeater,XNA,Cash,x-sell,Credit and cash offices,0,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-527.0,-197.0,-197.0,-195.0,1.0


The `installments_payments` sample has no missing values: the check below prints nothing.

In [None]:
for c in sample_df_instalpay.columns:
  num_missing = sample_df_instalpay[c].isna().sum()
  if num_missing > 0:
    print(f'{c}: {num_missing} ({100*num_missing / sample_df_instalpay.shape[0]:.2f}%) missing values')
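
The same missing-value loop is repeated for every table in this section. It could be wrapped in a small helper that returns the counts as a table; `missing_report` is a hypothetical name, and the toy frame only illustrates the output shape.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    report = pd.DataFrame({"n_missing": counts,
                           "pct_missing": 100 * counts / len(df)})
    # Keep only columns with at least one missing value, worst first
    return report[report["n_missing"] > 0].sort_values("pct_missing", ascending=False)

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                    "b": [1, 2, 3, 4],
                    "c": [np.nan, 1.0, 2.0, 3.0]})
report = missing_report(toy)
```

An empty result from `missing_report(sample_df_instalpay)` would correspond to the loop above printing nothing.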

In [None]:
sample_df_poscashb.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
99822,1292450,416728,-8,36.0,9.0,Active,0,0
93824,2755361,339018,-17,12.0,11.0,Active,0,0
47831,2748646,312916,-40,6.0,6.0,Active,0,0
96530,1481716,326683,-9,24.0,4.0,Active,0,0
77297,2525737,409921,-13,12.0,10.0,Active,0,0


In [None]:
for c in sample_df_poscashb.columns:
  num_missing = sample_df_poscashb[c].isna().sum()
  if num_missing > 0:
    print(f'{c}: {num_missing} ({100*num_missing / sample_df_poscashb.shape[0]:.2f}%) missing values')

CNT_INSTALMENT: 403 (0.18%) missing values
CNT_INSTALMENT_FUTURE: 402 (0.18%) missing values


- `CNT_INSTALMENT`: missing values are filled with 0, on the assumption that a missing value means no credit history.
- `CNT_INSTALMENT_FUTURE`: missing values are filled with 0, on the assumption that a missing value means no previous credit.

In [None]:
sample_df_poscashb = sample_df_poscashb.fillna(0)

In [None]:
sample_df_bureau.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
71290,419493,5248091,Closed,currency 1,-1680,0,-948.0,-947.0,,0,180000.0,0.0,0.0,0.0,Consumer credit,-941,
96267,206103,6537713,Closed,currency 1,-341,0,-287.0,-287.0,,0,0.0,0.0,0.0,0.0,Credit card,-283,
55983,214218,5233626,Closed,currency 1,-966,0,-600.0,-686.0,,0,108540.0,0.0,0.0,0.0,Consumer credit,-685,
83669,133011,5728008,Active,currency 1,-90,0,1023.0,,,0,157500.0,157284.0,0.0,0.0,Credit card,-7,
41358,204862,5211336,Active,currency 1,-978,0,849.0,,,0,2700000.0,1666521.0,0.0,0.0,Consumer credit,-57,


Checking for missing values

In [None]:
for c in sample_df_bureau.columns:
  num_missing = sample_df_bureau[c].isna().sum()
  if num_missing > 0:
    print(f'{c}: {num_missing} ({100*num_missing / sample_df_bureau.shape[0]:.2f}%) missing values')

DAYS_CREDIT_ENDDATE: 13672 (6.21%) missing values
DAYS_ENDDATE_FACT: 81909 (37.23%) missing values
AMT_CREDIT_MAX_OVERDUE: 145387 (66.08%) missing values
AMT_CREDIT_SUM: 1 (0.00%) missing values
AMT_CREDIT_SUM_DEBT: 32803 (14.91%) missing values
AMT_CREDIT_SUM_LIMIT: 76921 (34.96%) missing values
AMT_ANNUITY: 152724 (69.42%) missing values


- `DAYS_CREDIT_ENDDATE`: Remaining duration of CB credit (in days) at the time of application in Home Credit
- `DAYS_ENDDATE_FACT`: Days since CB credit ended at the time of application in Home Credit (only for closed credit)
- `AMT_CREDIT_MAX_OVERDUE`: Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample)
- `AMT_CREDIT_SUM_DEBT`: Current debt on Credit Bureau credit
- `AMT_CREDIT_SUM_LIMIT`: Current credit limit of credit card reported in Credit Bureau
- `AMT_ANNUITY`: Annuity of the Credit Bureau credit

In [None]:
sample_df_bureau = sample_df_bureau.fillna(0)
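
Filling every column with 0 makes a missing `DAYS_CREDIT_ENDDATE` indistinguishable from a credit that ends on the application date, and a missing amount indistinguishable from a true zero. One possible refinement, sketched on toy data, is to record an indicator column before filling. The column selection and the `_MISSING` suffix are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"DAYS_CREDIT_ENDDATE": [-453.0, np.nan, 344.0],
                    "AMT_ANNUITY": [0.0, np.nan, 6120.0]})

for col in ["DAYS_CREDIT_ENDDATE", "AMT_ANNUITY"]:
    # 1 where the original value was absent, 0 otherwise
    toy[col + "_MISSING"] = toy[col].isna().astype(int)

# A dict restricts the fill to the named columns
filled = toy.fillna({"DAYS_CREDIT_ENDDATE": 0, "AMT_ANNUITY": 0})
```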

In [None]:
sample_df_bureaubal.head()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
39134,5258568,-95,X
74328,5726348,-7,0
47448,6020189,-29,C
92417,5235796,-10,C
30346,5230761,0,0


In [None]:
for c in sample_df_bureaubal.columns:
  num_missing = sample_df_bureaubal[c].isna().sum()
  if num_missing > 0:
    print(f'{c}: {num_missing} ({100*num_missing / sample_df_bureaubal.shape[0]:.2f}%) missing values')

In [None]:
sample_df_creditcb.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
90214,1239151,196844,-36,0.0,0,0.0,0.0,0.0,0.0,0.0,45.675,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,16.0,Active,0,0
34375,1308598,237350,-43,72880.38,67500,900.0,900.0,0.0,0.0,3375.0,3375.0,3375.0,66195.81,72880.38,72880.38,1.0,1,0.0,0.0,27.0,Active,1,1
56217,1460002,209577,-3,0.0,135000,,0.0,,,0.0,,0.0,0.0,0.0,0.0,,0,,,0.0,Active,0,0
18431,2164166,427657,-10,0.0,90000,,0.0,,,0.0,,0.0,0.0,0.0,0.0,,0,,,0.0,Active,0,0
12663,2417748,291479,-20,0.0,495000,,0.0,,,0.0,,0.0,0.0,0.0,0.0,,0,,,0.0,Active,0,0


In [None]:
for c in sample_df_creditcb.columns:
  num_missing = sample_df_creditcb[c].isna().sum()
  if num_missing > 0:
    print(f'{c}: {num_missing} ({100*num_missing / sample_df_creditcb.shape[0]:.2f}%) missing values')

AMT_DRAWINGS_ATM_CURRENT: 50132 (22.79%) missing values
AMT_DRAWINGS_OTHER_CURRENT: 50132 (22.79%) missing values
AMT_DRAWINGS_POS_CURRENT: 50132 (22.79%) missing values
AMT_INST_MIN_REGULARITY: 13467 (6.12%) missing values
AMT_PAYMENT_CURRENT: 50762 (23.07%) missing values
CNT_DRAWINGS_ATM_CURRENT: 50132 (22.79%) missing values
CNT_DRAWINGS_OTHER_CURRENT: 50132 (22.79%) missing values
CNT_DRAWINGS_POS_CURRENT: 50132 (22.79%) missing values
CNT_INSTALMENT_MATURE_CUM: 13467 (6.12%) missing values


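These values may be missing because the client never had credit; this has not been verified. For now, missing values are filled with the placeholder string `'unknown'`.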

In [None]:
sample_df_creditcb = sample_df_creditcb.fillna('unknown')
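
One side effect of `fillna('unknown')` is that every numeric column containing a missing value ends up mixing numbers with a string, so it is no longer numeric and cannot be used directly in arithmetic. The sketch below, on toy data, shows the effect and one alternative: fill only the non-numeric columns and leave numeric gaps as `NaN`. Whether that suits the later modelling steps is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"AMT_PAYMENT_CURRENT": [6750.0, np.nan, 9900.0],
                    "NAME_CONTRACT_STATUS": ["Active", None, "Active"]})

# Filling everything: the amount column now mixes floats with a string
all_filled = toy.fillna("unknown")

# Filling only non-numeric columns: amounts keep their float dtype and NaN
text_cols = toy.select_dtypes(exclude="number").columns
typed = toy.copy()
typed[text_cols] = typed[text_cols].fillna("unknown")
```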

In [None]:
sample_df_app.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,FLAG_LAST_APPL_PER_CONTRACT,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS,DAYS_DECISION,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,SELLERPLACE_AREA,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
44971,1923284,317363,Cash loans,49305.825,499500.0,519282.0,,499500.0,FRIDAY,10,Y,1,,,,XNA,Approved,-266,Cash through the bank,XAP,,Repeater,XNA,Cash,x-sell,Credit and cash offices,-1,XNA,12.0,low_normal,Cash X-Sell: low,365243.0,-236.0,94.0,-146.0,-138.0,1.0
763,1923756,198658,Consumer loans,13215.105,106941.15,116350.65,0.0,106941.15,SUNDAY,17,Y,1,0.0,,,XAP,Approved,-343,Cash through the bank,XAP,,Repeater,Mobile,POS,XNA,Country-wide,50,Connectivity,10.0,low_normal,POS mobile without interest,365243.0,-308.0,-38.0,-38.0,-35.0,0.0
23379,2084613,201578,Cash loans,22617.855,207000.0,232168.5,,207000.0,SATURDAY,13,Y,1,,,,XNA,Refused,-361,Cash through the bank,HC,Unaccompanied,Repeater,XNA,Cash,x-sell,Credit and cash offices,-1,XNA,12.0,low_normal,Cash X-Sell: low,,,,,,
93314,1843857,227309,Revolving loans,2250.0,45000.0,45000.0,,45000.0,SATURDAY,15,Y,1,,,,XAP,Approved,-257,XNA,XAP,Unaccompanied,Repeater,XNA,Cards,walk-in,AP+ (Cash loan),3,XNA,0.0,XNA,Card Street,-250.0,-211.0,365243.0,365243.0,365243.0,0.0
15754,1930776,214211,Cash loans,8446.725,67500.0,81157.5,,67500.0,TUESDAY,14,Y,1,,,,XNA,Approved,-557,Cash through the bank,XAP,,Repeater,XNA,Cash,x-sell,Credit and cash offices,0,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-527.0,-197.0,-197.0,-195.0,1.0


In [None]:
for c in sample_df_app.columns:
  num_missing = sample_df_app[c].isna().sum()
  if num_missing > 0:
    print(f'{c}: {num_missing} ({100*num_missing / sample_df_app.shape[0]:.2f}%) missing values')

AMT_ANNUITY: 46850 (21.30%) missing values
AMT_DOWN_PAYMENT: 112482 (51.13%) missing values
AMT_GOODS_PRICE: 48174 (21.90%) missing values
RATE_DOWN_PAYMENT: 112482 (51.13%) missing values
RATE_INTEREST_PRIMARY: 219221 (99.65%) missing values
RATE_INTEREST_PRIVILEGED: 219221 (99.65%) missing values
NAME_TYPE_SUITE: 107027 (48.65%) missing values
CNT_PAYMENT: 46850 (21.30%) missing values
PRODUCT_COMBINATION: 35 (0.02%) missing values
DAYS_FIRST_DRAWING: 84769 (38.53%) missing values
DAYS_FIRST_DUE: 84769 (38.53%) missing values
DAYS_LAST_DUE_1ST_VERSION: 84769 (38.53%) missing values
DAYS_LAST_DUE: 84769 (38.53%) missing values
DAYS_TERMINATION: 84769 (38.53%) missing values
NFLAG_INSURED_ON_APPROVAL: 84769 (38.53%) missing values
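
With `RATE_INTEREST_PRIMARY` and `RATE_INTEREST_PRIVILEGED` missing in 99.65% of rows, one option is to drop columns whose missing fraction exceeds a threshold before any further cleaning. This is a sketch on toy data; `drop_sparse_columns` is a hypothetical helper and the 0.95 threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_frac=0.95):
    """Drop columns whose fraction of missing values exceeds `max_missing_frac`."""
    missing_frac = df.isna().mean()  # per-column share of missing values
    keep = missing_frac[missing_frac <= max_missing_frac].index
    return df[keep]

toy = pd.DataFrame({"AMT_CREDIT": [1.0, 2.0, 3.0, 4.0],
                    "AMT_DOWN_PAYMENT": [np.nan, 0.0, np.nan, 10.0],
                    "RATE_INTEREST_PRIMARY": [np.nan] * 4})
trimmed = drop_sparse_columns(toy)
```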
