# Using Machine Learning for Credit Scoring
#### Rafael Buck

## 1. Introduction

Banks and fintechs play a crucial role in modern economies and in empowering people. For markets and society to accelerate their activities, individuals and companies need access to credit.

A credit score is a numerical expression, based on an analysis of a person's credit records, that represents the creditworthiness of an individual. Traditionally, a credit score was based primarily on credit report information, typically sourced from credit bureaus. However, with the proliferation of data science, institutions of any size can develop their own credit scoring systems and tailor them to their target markets.

The goal of this analysis is to build a model that borrowers can use to help make the best financial decisions.

## 2. Business Understanding

When calculating loan risk, banks have to take several variables into account: the probability of default (PD), which is the likelihood that the borrower will not honor its debt; the bank's exposure at default (EAD); and the loss given default (LGD). In this analysis, we will use machine learning to predict the probability of default (PD).
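To make the relationship between these three quantities concrete, the standard expected-loss identity (EL = PD × EAD × LGD) can be sketched as below. The function name and the loan figures are hypothetical and chosen only for illustration; they do not come from the Lending Club data.

```python
def expected_loss(prob_default: float, exposure: float, loss_given_default: float) -> float:
    """Expected loss as the product PD x EAD x LGD.

    prob_default and loss_given_default are fractions in [0, 1];
    exposure is a currency amount.
    """
    if not 0.0 <= prob_default <= 1.0:
        raise ValueError("prob_default must be between 0 and 1")
    if not 0.0 <= loss_given_default <= 1.0:
        raise ValueError("loss_given_default must be between 0 and 1")
    return prob_default * exposure * loss_given_default


# Hypothetical loan: 5% chance of default, $10,000 exposed, 60% lost on default
loss = expected_loss(0.05, 10_000.0, 0.60)
print(f"Expected loss: {loss:.2f}")
```

A model that estimates PD therefore supplies one of the three factors; EAD and LGD would have to come from elsewhere.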

So, the question to be answered is: "*Given a loan application, will it be paid or not?*"

## 3. Data Understanding

We will use a real dataset from __[Lending Club](https://www.lendingclub.com)__. Lending Club is a US peer-to-peer lending company headquartered in San Francisco, California, and the world's largest peer-to-peer lending platform. The company states that $33 billion in loans had been originated through its platform up to 31 December 2017 (https://www.lendingclub.com/info/statistics.action).

In [24]:
# Import libraries
import pandas as pd 
import numpy as np
import datetime

In [25]:
# Load the dataset, skipping the first row (a comment line, not the header)
loans = pd.read_csv("LoanStats_2017Q4.csv", skiprows=1, low_memory=False)
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118650 entries, 0 to 118649
Columns: 145 entries, id to settlement_term
dtypes: float64(107), object(38)
memory usage: 131.3+ MB


In [26]:
half_count = len(loans) // 2  # thresh expects an integer count of non-null values
loans = loans.dropna(thresh=half_count, axis=1)  # Drop any column with more than 50% missing values
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118650 entries, 0 to 118649
Columns: 102 entries, loan_amnt to debt_settlement_flag
dtypes: float64(77), object(25)
memory usage: 92.3+ MB
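The `thresh` argument of `dropna` is the minimum number of non-null values a column must have to be kept, so a column with exactly 50% missing values survives the filter. A minimal sketch on a toy frame (the column names are hypothetical) shows the behaviour:

```python
import numpy as np
import pandas as pd

# Toy frame with four rows; column names are illustrative only
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],        # 3 of 4 values present
    "half_present": [1.0, np.nan, np.nan, 4.0],       # 2 of 4 values present
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],  # 1 of 4 values present
})

# Keep columns that have at least half of their values present
min_non_null = len(toy) // 2
kept = toy.dropna(thresh=min_non_null, axis=1)

# Share of missing values per column, useful for checking the cut-off
missing_share = toy.isna().mean()

print(list(kept.columns))
print(missing_share)
```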


In [28]:
loans.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,disbursement_method,debt_settlement_flag
0,35000.0,35000.0,35000.0,60 months,11.99%,778.38,B,B5,Project Manager,< 1 year,...,50.0,0.0,0.0,73825.0,51125.0,33000.0,35182.0,N,Cash,N
1,6000.0,6000.0,6000.0,36 months,7.35%,186.23,A,A4,Business Development,1 year,...,50.0,0.0,0.0,42988.0,6100.0,8300.0,18388.0,N,Cash,N
2,40000.0,40000.0,40000.0,36 months,6.08%,1218.33,A,A2,Editor/Writer,< 1 year,...,0.0,0.0,0.0,596402.0,53711.0,58000.0,20902.0,N,Cash,N
3,10000.0,10000.0,10000.0,60 months,23.88%,286.99,E,E2,EMT,10+ years,...,100.0,0.0,0.0,42900.0,34122.0,27100.0,0.0,N,Cash,N
4,27000.0,27000.0,27000.0,60 months,9.93%,572.75,B,B2,Data Scientist,3 years,...,0.0,0.0,0.0,166381.0,99838.0,48500.0,113281.0,N,Cash,N
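The preview shows that `term`, `int_rate` and `emp_length` are stored as strings, which most models cannot use directly. One possible way to convert them to numbers is sketched below on a few values copied from the preview. The new column names are hypothetical, and mapping `< 1 year` to 0 is an assumption, not something the dataset prescribes.

```python
import pandas as pd

# A few values copied from the preview above
sample = pd.DataFrame({
    "term": ["60 months", "36 months", "36 months"],
    "int_rate": ["11.99%", "7.35%", "6.08%"],
    "emp_length": ["< 1 year", "1 year", "10+ years"],
})

# Pull the first run of digits out of emp_length ("10+ years" -> 10)
emp_years = sample["emp_length"].str.extract(r"(\d+)", expand=False).astype(float)
# Assumption: treat "< 1 year" as 0 years of employment
emp_years = emp_years.mask(sample["emp_length"].str.startswith("<", na=False), 0.0)

clean = pd.DataFrame({
    # "60 months" -> 60
    "term_months": sample["term"].str.extract(r"(\d+)", expand=False).astype(int),
    # "11.99%" -> 11.99 (still a percentage, not a fraction)
    "int_rate_pct": sample["int_rate"].str.strip().str.rstrip("%").astype(float),
    "emp_length_years": emp_years,
})

print(clean)
```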


In [34]:
loans['loan_status'].unique()

array(['Current', 'Fully Paid', 'Late (31-120 days)', 'Late (16-30 days)',
       'In Grace Period', 'Charged Off', nan], dtype=object)
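To answer a paid-or-not question, these statuses need to be reduced to a binary target. The sketch below assumes that only closed loans have a known outcome, so `Fully Paid` becomes 0, `Charged Off` becomes 1, and every other status (including missing values) is left unlabelled. This mapping is an assumption made for illustration; treating late loans as defaults would be an equally defensible choice.

```python
import numpy as np
import pandas as pd

# The status values listed above
statuses = pd.Series(
    ["Current", "Fully Paid", "Late (31-120 days)", "Late (16-30 days)",
     "In Grace Period", "Charged Off", np.nan],
    name="loan_status",
)

# Assumption: only closed loans carry a definite outcome
status_to_target = {"Fully Paid": 0, "Charged Off": 1}

# Statuses missing from the mapping become NaN
target = statuses.map(status_to_target)

# Keep only the rows with a known outcome
labelled = target.dropna().astype(int)

print(labelled)
```

Because most loans issued in the last quarter of 2017 may still be open, this filter can discard a large share of the rows, so it is worth checking how many labelled examples remain.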

## 4. Data Exploration

In [None]:
# TODO: find relationships between features, analyze outliers, and create new features
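As a starting point for this step, the sketch below shows one way to look at a relationship (default rate per grade) and to flag outliers with the 1.5 × IQR rule. The data is a made-up toy frame, and the `defaulted` and `int_rate_pct` columns are hypothetical names that assume the target and the interest rate have already been converted to numbers.

```python
import pandas as pd

# Made-up toy data for illustration; not taken from the Lending Club file
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "E", "E", "E"],
    "int_rate_pct": [6.08, 7.35, 9.93, 11.99, 10.5, 23.88, 24.5, 99.0],
    "defaulted": [0, 0, 0, 1, 0, 1, 1, 0],
})

# Relationship: share of defaulted loans within each grade
default_rate_by_grade = toy.groupby("grade")["defaulted"].mean()

# Outliers: values more than 1.5 * IQR outside the quartiles
q1, q3 = toy["int_rate_pct"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["int_rate_pct"] < q1 - 1.5 * iqr) | (toy["int_rate_pct"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print(default_rate_by_grade)
print(outliers)
```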

## 5. Model

In [None]:
# TODO: try feature selection, scikit-learn models
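One possible shape for this step is a scikit-learn pipeline that scales the features, keeps the highest-scoring ones, and fits a logistic regression whose predicted probability serves as the PD estimate. The sketch runs on synthetic data from `make_classification`, since the real features have not been prepared yet. The choice of `SelectKBest`, `k=5` and logistic regression is an assumption for illustration, not the method this analysis has settled on.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (about 15% positives)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale, keep the 5 features with the highest ANOVA F-score, then classify
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=5),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of the positive class
pd_estimates = model.predict_proba(X_test)[:, 1]
print(pd_estimates[:5])
```

Putting the feature selection inside the pipeline means it is refitted on each training split, which avoids leaking information from held-out data.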

## 6. Validation

In [None]:
# TODO: cross validation, model tuning
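Cross-validation and tuning can be combined with `GridSearchCV`. The sketch below again uses synthetic data and assumes a logistic regression whose regularisation strength `C` is tuned over a small, arbitrary grid, scored by ROC AUC with stratified 5-fold splits. All of these choices are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the default rate similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Arbitrary grid over the inverse regularisation strength
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid=param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```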

## 7. References

> Lending Club Statistics: https://www.lendingclub.com/info/download-data.action