---
# Capstone Project: Home Credit Default Risk


author: Yue Gao
date: 2022/07/01
---

# Introduction to the Home Credit Default Risk Project

## Business Background
- What does Home Credit Group do?
 Home Credit is an international non-bank financial services company that provides credit and loan products to consumers, especially to people with little or no credit history.
- How does customer data help Home Credit Group provide its products and services?
 Home Credit uses a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities.
- Why are predictions of customers' repayment ability important for Home Credit?
 As mentioned above, most Home Credit customers have little or no credit history or credit score, which means they would not qualify for a loan or credit at a traditional bank. In other words, a traditional bank would likely label this group of customers as high-risk. Therefore, if Home Credit can predict customers' repayment ability at application time, it can make better decisions that balance business profit against the risk of customer default.



## Business Model of Home Credit
Product purchase scenario:

<img src="https://www.homecredit.net/~/media/Images/H/Home-Credit-Group/content-images/right-content-images/hcg-product-flow.png?la=en"
     alt="Business model of Home Credit"
     style="float: center; margin-right: 10px;" />

1. Consumers submit their application to the shop, which acts on behalf of Home Credit.
2. The shop receives the application and sends it to Home Credit.
3. Home Credit receives the application, makes a decision, and sends the result to the shop.
4. If the application is approved, the shop receives payment for the item. Otherwise, the shop receives no payment.
5. Once the application is approved, the shop releases the item to the consumer. Otherwise, the shop does not release the item.
6. After the item is released, the consumer pays monthly installments to Home Credit.

## Define the Business Problem

Home Credit relies mainly on predictions of customers' default risk to decide whether to approve an application. The prediction is binary: in logical form it is True or False, and in numeric form it is 1 or 0. In the competition data (the `TARGET` column), 1 (True) means the customer has payment difficulties and is treated as a default, and 0 (False) means the customer does not.

## Goals of Home Credit Default Risk Project

While Home Credit already uses various statistical and machine learning methods to make these predictions, it offered 70,000 dollars in prizes on the Kaggle competition platform to help unlock the full potential of its data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower clients to be successful. At the same time, Home Credit wants to ensure that customers who are not capable of repayment are not given loans. Statistically speaking, Home Credit is looking for a machine learning classification model that keeps both false positives and false negatives low. However, a financial business that is particularly sensitive to bad debt may prefer a model that reduces false negatives first, that is, defaulting customers who are predicted to repay (taking default as the positive class).
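
To make the distinction between the two error types concrete, here is a minimal sketch that counts them for a binary default classifier. It assumes the positive class (1) means the customer defaults, as in the Kaggle `TARGET` column. The labels and predictions are made-up toy values, and `confusion_counts` is a hypothetical helper, not part of the project code.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN where 1 = default and 0 = repaid (assumed encoding)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # defaults correctly flagged
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # good customers wrongly rejected
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # defaults that were missed
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # good customers correctly approved
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

# Toy labels and predictions, invented for illustration only
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
# Recall: the share of actual defaults that the model caught
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, round(recall, 3))
```

A lender that is sensitive to bad debt would watch the `fn` count (and therefore recall) most closely, while a lender worried about turning away good customers would watch `fp`.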

## Added Value of Home Credit Default Risk Project
From Home Credit's perspective, a good classification model helps identify customers with adequate repayment ability, so that loans are issued at lower risk.
From the perspective of financial services users, a good classification model helps them access financial services with a lower credit-history threshold.

In [1]:
# Import the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import csv

# Data Acquisition, Quality & Completeness

### Data Acquisition

The whole data set is available at [Home Credit Default Risk Data Set](https://www.kaggle.com/c/home-credit-default-risk/data).
The data sets were officially released by Home Credit Group to collect solutions that could make its financial services more efficient.
However, some features are withheld from the public to protect customer privacy; it is almost impossible to obtain a full version of such data sets from the financial industry. Moreover, the data sets were released four years ago, so they may not reflect the current situation.

### Datasets Description

#### application_{train|test}.csv

- This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
- Static data for all applications. One row represents one loan in our data sample.

#### bureau.csv

- All client's previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample).
- For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.

#### bureau_balance.csv

- Monthly balances of previous credits in Credit Bureau.
- This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.

#### POS_CASH_balance.csv

- Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.

#### credit_card_balance.csv

- Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.

#### previous_application.csv

- All previous applications for Home Credit loans of clients who have loans in our sample.
- There is one row for each previous application related to loans in our data sample.

#### installments_payments.csv

- Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
- There is a) one row for every payment that was made plus b) one row each for missed payment.
- One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.

#### HomeCredit_columns_description.csv

- This file contains descriptions for the columns in the various data files.

## How the Tables Are Connected
As the image below shows, the tables are connected to each other through shared key columns.
For example, `bureau.csv` and `bureau_balance.csv` are connected by the column `SK_ID_BUREAU`, while `installments_payments.csv` and `previous_application.csv` are connected by the column `SK_ID_PREV`.

<img src="https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png"
     alt="Schema of the Home Credit data tables"
     style="float: left; margin-right: 10px;" />
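
The key relationship between `bureau.csv` and `bureau_balance.csv` can be sketched with two small toy frames. The values are invented for illustration; only the column names (`SK_ID_CURR`, `SK_ID_BUREAU`, `CREDIT_ACTIVE`, `MONTHS_BALANCE`, `STATUS`) follow the data set.

```python
import pandas as pd

# Toy stand-ins for bureau.csv and bureau_balance.csv (values are invented)
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "SK_ID_BUREAU": [5001, 5002, 5003],
    "CREDIT_ACTIVE": ["Closed", "Active", "Active"],
})
bureau_balance_toy = pd.DataFrame({
    "SK_ID_BUREAU": [5001, 5001, 5002, 5003],
    "MONTHS_BALANCE": [-1, 0, 0, 0],
    "STATUS": ["C", "C", "0", "1"],
})

# Each monthly balance row picks up its applicant ID through SK_ID_BUREAU
linked = bureau_balance_toy.merge(bureau_toy, on="SK_ID_BUREAU", how="left")
print(linked[["SK_ID_CURR", "SK_ID_BUREAU", "MONTHS_BALANCE", "STATUS"]])
```

The same pattern would apply to `installments_payments.csv` and `previous_application.csv`, with `SK_ID_PREV` as the join key.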

## Data Selection
To make the EDA and model construction more efficient, we apply the following data selection criteria:
1. Avoid tables that contain duplicated information.
`bureau.csv` and `bureau_balance.csv` may contain duplicated information about the credit history of the same client. Therefore, we focus on `bureau.csv`, which connects directly to the main table `application_train.csv`.
2. Keep the tables most closely related to the main table.
For example, `bureau.csv` and `previous_application.csv` are connected to the main table by the column `SK_ID_CURR`, which is the primary key of the main table. Therefore, we use `bureau.csv` and `previous_application.csv` as supplementary information alongside the main table for exploratory data analysis and model construction.

This selection makes the EDA and model construction simpler and more efficient. It also makes the data easier to interpret and visualize for readers.
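
Because `bureau.csv` can hold several rows per applicant while the main table holds one, the supplementary table has to be collapsed to one row per `SK_ID_CURR` before it can be combined with the main table. The sketch below shows one possible way to do this on toy data. The aggregated feature names (`bureau_credit_count`, `bureau_credit_total`) are hypothetical, and the choice of aggregations is an assumption rather than the method used later in this project.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv (values are invented)
application_toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "TARGET": [1, 0, 0],
})
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "AMT_CREDIT_SUM": [50000.0, 20000.0, 80000.0],
})

# Collapse the bureau rows to one row per applicant
bureau_agg = (
    bureau_toy.groupby("SK_ID_CURR")
    .agg(
        bureau_credit_count=("AMT_CREDIT_SUM", "count"),
        bureau_credit_total=("AMT_CREDIT_SUM", "sum"),
    )
    .reset_index()
)

# A left join keeps applicants with no bureau history (their features are NaN)
features = application_toy.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(features)
```

The left join matters here: an inner join would silently drop applicants who have no record at the Credit Bureau.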

In [2]:
# Load the descriptive data of tables and columns
rows = []
with open("HomeCredit_columns_description.csv", 'r', errors='ignore') as file:
    reader = csv.reader(file)
    for row in reader:
        # Fields 1 to 3 hold the table name, the column name, and the description
        rows.append({'table_name': row[1], 'col_name': row[2], 'description': row[3]})
# Build the dataframe once instead of concatenating one small frame per row
df_description = pd.DataFrame(rows)

In [3]:
# Create a function that can check descriptions of columns in a table
def view_description_columns(table, data=df_description, column=None):
    """
    View the descriptions of the columns in a table.
    :param table: The target table name.
    :param data: The dataframe holding the column descriptions.
    :param column: The target column to describe. If None, all columns of the table are returned.
    :return: A dataframe with the column names and descriptions.
    """
    if column is not None:
        # Filter on `data` (not the global df_description) so a custom dataframe also works
        return data[(data['table_name'] == table) & (data['col_name'] == column)][['col_name', 'description']]
    else:
        return data[data['table_name'] == table][['col_name', 'description']]

### Previous Application Columns Description

In [4]:
# Read Previous Application table
previous_application = pd.read_csv('previous_application.csv')

In [5]:
# Use the function of view_description_columns to extract the description of columns in the table of previous application
previous_application_description = view_description_columns(table='previous_application.csv',data=df_description)

In [6]:
# View the first 10 descriptions of columns in the table of previous application
previous_application_description.head(10)

| | col_name | description |
|---|---|---|
| 174 | SK_ID_PREV | ID of previous credit in Home credit related t... |
| 175 | SK_ID_CURR | ID of loan in our sample |
| 176 | NAME_CONTRACT_TYPE | Contract product type (Cash loan, consumer loa... |
| 177 | AMT_ANNUITY | Annuity of previous application |
| 178 | AMT_APPLICATION | For how much credit did client ask on the prev... |
| 179 | AMT_CREDIT | Final credit amount on the previous applicatio... |
| 180 | AMT_DOWN_PAYMENT | Down payment on the previous application |
| 181 | AMT_GOODS_PRICE | Goods price of good that client asked for (if ... |
| 182 | WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply ... |
| 183 | HOUR_APPR_PROCESS_START | Approximately at what day hour did the client ... |


### Bureau Columns description

In [7]:
# Read Bureau table
bureau = pd.read_csv('bureau.csv')

In [8]:
# Use the function of view_description_columns to extract the description of columns in the table of bureau
bureau_description = view_description_columns('bureau.csv',data=df_description)

In [9]:
# View the first 10 descriptions of columns in the table of bureau
bureau_description.head(10)

| | col_name | description |
|---|---|---|
| 123 | SK_ID_CURR | ID of loan in our sample - one loan in our sam... |
| 124 | SK_BUREAU_ID | Recoded ID of previous Credit Bureau credit re... |
| 125 | CREDIT_ACTIVE | Status of the Credit Bureau (CB) reported credits |
| 126 | CREDIT_CURRENCY | Recoded currency of the Credit Bureau credit |
| 127 | DAYS_CREDIT | How many days before current application did c... |
| 128 | CREDIT_DAY_OVERDUE | Number of days past due on CB credit at the ti... |
| 129 | DAYS_CREDIT_ENDDATE | Remaining duration of CB credit (in days) at t... |
| 130 | DAYS_ENDDATE_FACT | Days since CB credit ended at the time of appl... |
| 131 | AMT_CREDIT_MAX_OVERDUE | Maximal amount overdue on the Credit Bureau cr... |
| 132 | CNT_CREDIT_PROLONG | How many times was the Credit Bureau credit pr... |


### Application Columns Description

In [10]:
# Read Application table
application_train = pd.read_csv('application_train.csv')

In [11]:
# Read the application test table
application_test = pd.read_csv('application_test.csv')

In [12]:
# View the first five rows of the application_train table
application_train.head()

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
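
The missing `AMT_REQ_CREDIT_BUREAU_*` values in row 3 of the output above show that the main table is not complete. A minimal sketch of how the share of missing values per column could be summarized is given here. It uses a toy frame with a few values borrowed from the output above rather than the real `application_train`, and `missing_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy frame mimicking a few application_train columns
toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004, 100006],
    "AMT_ANNUITY": [24700.5, 35698.5, 6750.0, 29686.5],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [1.0, 0.0, 0.0, np.nan],
})

share = missing_share(toy)
print(share)
```

On the real data the same helper could be called as `missing_share(application_train).head(10)` to list the ten least complete columns.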
