# <center> Credit Risk Modeling </center> #
<center> Hieu Nguyen </center>

<div class="alert alert-block alert-info">
This project focuses on the probability of loan default in the banking industry. It is my practice on a probability-prediction problem using several machine learning models. In theory, credit risk is modeled through three main components: exposure at default, probability of default, and loss given default. In my analysis, I will focus solely on the probability of default. <br> <br> The goal is to build and review models that lenders can use to make better financial decisions when dealing with high-risk borrowers. In particular, I will try to predict the probability that somebody will experience financial distress in the next two years. <br><br> Later, I will compare these models with non-ML models (hopefully I will have the time and energy).
</div>

Content: <br>
* Libraries
* Data description
* Exploratory data analysis
* Oversampling
* Data evaluation (correlation matrix)
* Modeling
    * K-Means
    * Logistic regression
    * Random forest
    * Gradient boosting
* Model performance
    * Confusion matrix
    * Accuracy
    * Precision, recall, and F-measure
    * Receiver Operating Characteristics Curve (ROC), Precision-Recall Curve, and AUC
* Post analysis
    * Feature analysis
    * Model analysis and discussion


## Libraries

In [8]:
import pandas as pd
from pathlib import Path
import os

In [None]:
# set seeds
seed = 3001
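
The cell above only stores the seed in a variable; it takes effect once it is handed to whatever draws the random numbers. Below is a minimal sketch of that idea using only the standard library. `make_rng` is a hypothetical helper of mine, not part of the notebook. For scikit-learn estimators, the usual equivalent is passing the same value as `random_state`.

```python
import random

seed = 3001  # same value as in the cell above


def make_rng(seed_value):
    """Return an independent random generator initialised with seed_value."""
    return random.Random(seed_value)


# Two generators built from the same seed produce the same sequence,
# which is what makes a run reproducible.
rng_a = make_rng(seed)
rng_b = make_rng(seed)
draws_a = [rng_a.random() for _ in range(3)]
draws_b = [rng_b.random() for _ in range(3)]

# A generator built from a different seed is expected to diverge.
rng_c = make_rng(seed + 1)
draws_c = [rng_c.random() for _ in range(3)]
```

Keeping the seed in one variable and passing it explicitly makes it easier to see which steps are reproducible than relying on hidden global state.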

## Data description ##

The four files used here come from: https://www.kaggle.com/competitions/GiveMeSomeCredit/data. <br>
* Data dictionary (xls)
* Credit score training data (csv)
* Credit score test data (csv)
* Sample entry (?) (csv) 


I downloaded all of these files and uploaded them to my GitHub.

In [18]:
base_dir = Path(os.getcwd())

# Data dictionary: since it is an Excel file, I downloaded it and read it locally on my machine
dict_path = base_dir / "data" / "Data Dictionary.xls"
dic = pd.read_excel(dict_path, skiprows=1)
dic


Unnamed: 0,Variable Name,Description,Type
0,SeriousDlqin2yrs,Person experienced 90 days past due delinquenc...,Y/N
1,RevolvingUtilizationOfUnsecuredLines,Total balance on credit cards and personal lin...,percentage
2,age,Age of borrower in years,integer
3,NumberOfTime30-59DaysPastDueNotWorse,Number of times borrower has been 30-59 days p...,integer
4,DebtRatio,"Monthly debt payments, alimony,living costs di...",percentage
5,MonthlyIncome,Monthly income,real
6,NumberOfOpenCreditLinesAndLoans,Number of Open loans (installment like car loa...,integer
7,NumberOfTimes90DaysLate,Number of times borrower has been 90 days or m...,integer
8,NumberRealEstateLoansOrLines,Number of mortgage and real estate loans inclu...,integer
9,NumberOfTime60-89DaysPastDueNotWorse,Number of times borrower has been 60-89 days p...,integer


In [15]:
# import the credit data directly from my GitHub
train_url = "https://raw.githubusercontent.com/quanghieu31/credit-risk-modeling/main/data/cs-training.csv?token=GHSAT0AAAAAABVGOACGNSFTT45LDTUSXEFGYZF3QEQ"
train_data = pd.read_csv(train_url)
test_url = "https://raw.githubusercontent.com/quanghieu31/credit-risk-modeling/main/data/cs-test.csv?token=GHSAT0AAAAAABVGOACH6TI5MZFFII5BFQWEYZF3AXA"
test_data = pd.read_csv(test_url)

We can see that our label is the variable *SeriousDlqin2yrs* (person experienced 90 days past due delinquency), with binary values (Yes = 1, No = 0). Because the label is binary, logistic regression is a natural first choice, and I will also try other models on the same problem. But first, let's explore and clean the data.
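
For reference, here is why logistic regression suits a 0/1 label: it passes a weighted sum of the features through the logistic (sigmoid) function, so the output always lies between 0 and 1 and can be read as a probability of default. The sketch below is plain Python. The function names, the two features, the weights, and the intercept are all made up for illustration and are not fitted to this dataset.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_default_probability(features, weights, intercept):
    """Probability of the positive class under a logistic model."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical borrower with two illustrative features and made-up coefficients.
p = predict_default_probability([0.5, 2.0], weights=[1.2, 0.8], intercept=-3.0)
```

A fitted model would learn the weights and intercept from the training data; only the shape of the mapping is shown here.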

## Exploratory data analysis ##

In [29]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   ID                                    150000 non-null  int64  
 1   SeriousDlqin2yrs                      150000 non-null  int64  
 2   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 3   age                                   150000 non-null  int64  
 4   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 5   DebtRatio                             150000 non-null  float64
 6   MonthlyIncome                         120269 non-null  float64
 7   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 8   NumberOfTimes90DaysLate               150000 non-null  int64  
 9   NumberRealEstateLoansOrLines          150000 non-null  int64  
 10  NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 11  NumberOfDependents                    146076 non-null  float64

Observation: *NumberOfDependents* is stored as float64 even though it is a count (shouldn't it be an integer?). *MonthlyIncome* (120,269 non-null) and *NumberOfDependents* (146,076 non-null) are the only columns with missing values. The first column's name, *Unnamed: 0*, is not very pleasing to my eyes, so I will rename it to *ID* (the ID of each recorded borrower). Apart from the binary label, there are no categorical columns; all the other columns are numeric.
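
A likely explanation for the float64 type is the missing values: a plain int64 column cannot hold NaN, so pandas falls back to float64. The sketch below uses a small made-up frame (not the real data) to count missing values per column and then cast the dependents column to pandas' nullable `Int64` type, which keeps the gaps as missing. Whether to cast, impute, or drop is a separate decision that this sketch does not make.

```python
import numpy as np
import pandas as pd

# Toy frame with the same two column names; the values are invented.
toy = pd.DataFrame({
    "MonthlyIncome": [5400.0, np.nan, 3400.0, 8249.0],
    "NumberOfDependents": [0.0, 2.0, np.nan, 1.0],
})

# Number and share of missing values in each column.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()

# Nullable integer dtype: whole-number floats become integers, NaN becomes <NA>.
toy["NumberOfDependents"] = toy["NumberOfDependents"].astype("Int64")
```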

In [25]:
train_data = train_data.rename(columns={'Unnamed: 0': 'ID'})
train_data.describe()

Unnamed: 0,ID,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
count,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,120269.0,150000.0,150000.0,150000.0,150000.0,146076.0
mean,75000.5,0.06684,6.048438,52.295207,0.421033,353.005076,6670.221,8.45276,0.265973,1.01824,0.240387,0.757222
std,43301.414527,0.249746,249.755371,14.771866,4.192781,2037.818523,14384.67,5.145951,4.169304,1.129771,4.155179,1.115086
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,37500.75,0.0,0.029867,41.0,0.0,0.175074,3400.0,5.0,0.0,0.0,0.0,0.0
50%,75000.5,0.0,0.154181,52.0,0.0,0.366508,5400.0,8.0,0.0,1.0,0.0,0.0
75%,112500.25,0.0,0.559046,63.0,0.0,0.868254,8249.0,11.0,0.0,2.0,0.0,1.0
max,150000.0,1.0,50708.0,109.0,98.0,329664.0,3008750.0,58.0,98.0,54.0,98.0,20.0
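
Because *SeriousDlqin2yrs* only takes the values 0 and 1, its mean in the table above is the share of borrowers with a serious delinquency. Here is a quick arithmetic sketch using the `count` and `mean` figures from that table; the variable names are mine.

```python
n_rows = 150_000      # count of SeriousDlqin2yrs in the summary table
label_mean = 0.06684  # mean of SeriousDlqin2yrs in the summary table

# For a 0/1 column, mean * count gives the number of positive rows.
n_positive = round(n_rows * label_mean)
n_negative = n_rows - n_positive

# How many non-delinquent borrowers there are per delinquent one.
imbalance_ratio = n_negative / n_positive
```

So roughly 6.7% of the rows are positives, about 14 negatives for every positive, which is presumably the motivation for the oversampling step listed in the outline.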


Observation: ... (to be continued)