# Loan Default Prediction
Using scikit-learn and historical performance data to predict whether a loan will default.

## Financial product overview

**Asset Backed Securities**

First, let's learn about a very popular way to fund loans when the total size of the portfolio is in the neighborhood of $1B. One widely used structure is called Asset Backed Securities (ABS). Let's take a look at its general structure:
![Securitization](img\securitization.png)

As we mentioned in the previous lecture, there are multiple approaches to assessing a loan portfolio, and the choice usually depends on the granularity of the data available.
* Aggregated historical performance data can be used to create a high-level assessment of different parts of the portfolio, e.g. loans with more than 30 months maturity and more than 95% debt-to-income ratio are risky and we estimate 4% of them to default.
* Loan-level historical performance data can be used to build a predictive model that makes a prediction on each loan individually.
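
The aggregated approach in the first bullet can be sketched with a boolean mask in pandas. The toy data frame below is invented for illustration; only the column names `term`, `dti` and `defaulted` mirror the dataset used later, and the cut-offs of 30 months and 20% are arbitrary assumptions.

```python
import pandas as pd

# Toy loan-level data, invented for illustration only.
loans = pd.DataFrame({
    'term':      [36, 36, 60, 60, 60, 36, 60, 60],
    'dti':       [10.0, 25.0, 30.0, 22.0, 5.0, 8.0, 27.0, 12.0],
    'defaulted': [0, 0, 1, 0, 0, 0, 1, 0],
})

# Segment of interest: maturity above 30 months and debt-to-income above 20%.
risky = (loans['term'] > 30) & (loans['dti'] > 20)

# The observed default rate of a segment is the mean of its 0/1 default flag.
rate_risky = loans.loc[risky, 'defaulted'].mean()
rate_other = loans.loc[~risky, 'defaulted'].mean()

print('Default rate in risky segment: {:.2f}'.format(rate_risky))
print('Default rate elsewhere: {:.2f}'.format(rate_other))
```

The loan-level approach in the second bullet replaces these hand-picked segments with a model that produces a prediction per loan.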

*Just to note, data for Asset Backed Securities are usually private and confidential. They contain information that cannot be shared outside the group doing the analysis. For this reason, the dataset we are using is not actual ABS data but rather the LendingClub public personal loan dataset. For this short lecture it does not matter whether the portfolio is securitized or not, as we are working with the raw data.*

## Load already cleaned data from previous lecture

In [30]:
import pandas as pd
data = pd.read_pickle('Lendmark_clean.pkl')

In [31]:
data.head()

Unnamed: 0,loan_amnt,term,int_rate,funded_amnt,grade,annual_inc,dti,delinq_2yrs,emp_length,home_ownership,tax_liens,defaulted
0,5000.0,36,10.65,5000.0,2,24000.0,27.65,0.0,10,RENT,0.0,0
1,2500.0,60,15.27,2500.0,3,30000.0,1.0,0.0,0,RENT,0.0,1
2,2400.0,36,15.96,2400.0,3,12252.0,8.72,0.0,10,RENT,0.0,0
3,10000.0,36,13.49,10000.0,3,49200.0,20.0,0.0,10,RENT,0.0,0
4,3000.0,60,12.69,3000.0,2,80000.0,17.94,0.0,1,RENT,0.0,0


## (Optional for regression) Convert categorical variables to individual binaries
Many models can't handle categorical data and only accept numerical variables, so we need to encode the information they contain numerically.

In [32]:
data.home_ownership.unique()

array(['RENT', 'OWN', 'MORTGAGE', 'OTHER', 'NONE'], dtype=object)

**Something to think about:** Why would it be a bad idea to encode this variable with integer values 1-5?

The usual best practice is to encode a categorical variable with 5 unique values as 5 distinct binary columns (with values of 1 or 0). The resulting variables are sometimes called "dummy variables".

There is a built-in pandas function that can help us achieve this. Let's see how it works:

In [33]:
pd.get_dummies(data.home_ownership, prefix='home_ownership', prefix_sep='_').head()

Unnamed: 0,home_ownership_MORTGAGE,home_ownership_NONE,home_ownership_OTHER,home_ownership_OWN,home_ownership_RENT
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,0,0,0,0,1


Let's create a version of the data which contains the dummy variables but not the original column.

In [34]:
cols = data.columns.tolist()
cols.remove('home_ownership')

data_dummy = pd.get_dummies(data.home_ownership, prefix='home_ownership', prefix_sep='_')
data_dummy = data[cols].join(data_dummy)
data_dummy.head()

Unnamed: 0,loan_amnt,term,int_rate,funded_amnt,grade,annual_inc,dti,delinq_2yrs,emp_length,tax_liens,defaulted,home_ownership_MORTGAGE,home_ownership_NONE,home_ownership_OTHER,home_ownership_OWN,home_ownership_RENT
0,5000.0,36,10.65,5000.0,2,24000.0,27.65,0.0,10,0.0,0,0,0,0,0,1
1,2500.0,60,15.27,2500.0,3,30000.0,1.0,0.0,0,0.0,1,0,0,0,0,1
2,2400.0,36,15.96,2400.0,3,12252.0,8.72,0.0,10,0.0,0,0,0,0,0,1
3,10000.0,36,13.49,10000.0,3,49200.0,20.0,0.0,10,0.0,0,0,0,0,0,1
4,3000.0,60,12.69,3000.0,2,80000.0,17.94,0.0,1,0.0,0,0,0,0,0,1
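
As a possible shortcut, `pd.get_dummies` can also take the whole data frame together with a `columns=` argument, which encodes the listed columns and drops the originals in one call. The sketch below uses a small stand-in frame named `toy` (a hypothetical name with made-up values) instead of the real data.

```python
import pandas as pd

# Small stand-in frame; the values are illustrative, not taken from the real file.
toy = pd.DataFrame({
    'loan_amnt': [5000.0, 2500.0, 12000.0],
    'home_ownership': ['RENT', 'OWN', 'MORTGAGE'],
    'defaulted': [0, 1, 0],
})

# Encode only the listed column; the original 'home_ownership' column is removed.
toy_dummy = pd.get_dummies(toy, columns=['home_ownership'], prefix_sep='_')

print(toy_dummy.columns.tolist())
```

Only categories that actually occur in the column get a dummy column, so the toy frame yields three dummies rather than the five seen above.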


## (Optional) ?Variable selection? - using some method

## Train-test split

In [36]:
from sklearn.model_selection import train_test_split

In [38]:
x_cols = data.columns.tolist()
x_cols.remove('defaulted')
y_col = 'defaulted'

X = data[x_cols].values
y = data[y_col].values

In [45]:
print('Shape of X: {}'.format(X.shape))
print('Shape of y: {}'.format(y.shape))

Shape of X: (39747L, 11L)
Shape of y: (39747L,)


In [54]:
print('Features of row 1: {}'.format(X[0,:]))
print('Response variable of row 1: {}'.format(y[0]))

Features of row 1: [5000.0 36L 10.65 5000.0 2L 24000.0 27.65 0.0 10L 'RENT' 0.0]
Response variable of row 1: 0


In [59]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [61]:
print('Shape of X_train: {}'.format(X_train.shape))
print('Shape of X_test: {}'.format(X_test.shape))
print('Shape of y_train: {}'.format(y_train.shape))
print('Shape of y_test: {}'.format(y_test.shape))

Shape of X_train: (31797L, 11L)
Shape of X_test: (7950L, 11L)
Shape of y_train: (31797L,)
Shape of y_test: (7950L,)
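
The split above is drawn at random, so the rows that end up in each set change from run to run. A minimal sketch of a reproducible variant follows, assuming we also want the share of defaults to be the same in both sets. It runs on synthetic stand-in arrays, and the seed value 42 and the 15% positive rate are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 3 numeric features, roughly 15% positives.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.uniform(size=1000) < 0.15).astype(int)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print('Train shape: {}, positive rate: {:.3f}'.format(X_tr.shape, y_tr.mean()))
print('Test shape: {}, positive rate: {:.3f}'.format(X_te.shape, y_te.mean()))
```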


In [62]:
print('Features of row 1: {}'.format(X_train[0,:]))
print('Response variable of row 1: {}'.format(y_train[0]))

Features of row 1: [15000.0 60L 20.62 15000.0 6L 70000.0 15.46 0.0 4L 'RENT' 0.0]
Response variable of row 1: 0


In [64]:
print('Features of row 1: {}'.format(X_test[0,:]))
print('Response variable of row 1: {}'.format(y_test[0]))

Features of row 1: [3000.0 36L 13.48 3000.0 3L 25000.0 22.94 0.0 0L 'RENT' 0.0]
Response variable of row 1: 0


## Predictions with different models: Logistic Regression, Decision Tree, Random Forest, Naive Bayes
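A minimal sketch of how the four model families named in this heading might be fitted with scikit-learn. It uses synthetic data from `make_classification` rather than the loan data, and every hyperparameter (`max_iter`, `max_depth`, `n_estimators`) is an arbitrary assumption, not a tuned choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the loan data.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'Naive Bayes': GaussianNB(),
}

# Fit each model on the training part and predict P(class 1) on the held-out part.
predicted_proba = {}
for name, model in models.items():
    model.fit(X_a, y_a)
    predicted_proba[name] = model.predict_proba(X_b)[:, 1]
    print('{}: first prediction {:.3f}'.format(name, predicted_proba[name][0]))
```

All four estimators share the same `fit` / `predict_proba` interface, which is what makes looping over them this simple.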

## Visualize Decision Tree

## Check feature importance output of Random Forest
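As a rough sketch, a fitted `RandomForestClassifier` exposes impurity-based scores through its `feature_importances_` attribute. The example below runs on synthetic data, and the feature names `f0` to `f4` are hypothetical placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 2 informative features and 3 pure-noise features.
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']
X_fi, y_fi = make_classification(n_samples=400, n_features=5, n_informative=2,
                                 n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_fi, y_fi)

# One score per feature; the scores are normalized to sum to 1.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print('{}: {:.3f}'.format(name, score))
```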

## Check the performance of models on the test set

## Combine the train and test datasets and run k-fold cross-validation with some of the models
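One way this step could look is sketched below with `cross_val_score`. The data is synthetic, and the choice of 5 folds, ROC AUC as the metric, and the two models compared are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the full (train + test) dataset.
X_all, y_all = make_classification(n_samples=300, n_features=6, random_state=1)

# 5 stratified folds: each fold keeps roughly the same class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = [
    ('Logistic Regression', LogisticRegression(max_iter=1000)),
    ('Random Forest', RandomForestClassifier(n_estimators=30, random_state=1)),
]

# cross_val_score refits the model on each training fold and scores the held-out fold.
cv_scores = {}
for name, model in candidates:
    cv_scores[name] = cross_val_score(model, X_all, y_all, cv=cv, scoring='roc_auc')
    print('{}: mean AUC {:.3f}'.format(name, cv_scores[name].mean()))
```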

## Check which models performed best on this dataset. Which one would we use in production?