# Credit Modelling

Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace [here](https://www.lendingclub.com/public/how-peer-lending-works.action).

Each borrower fills out a comprehensive application, and Lending Club evaluates each borrower's credit score using past historical data to assign an interest rate to the borrower. A higher interest rate means that the borrower is riskier and more unlikely to pay back the loan, and vice versa. 

If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition the requested amount. Many loans aren't completely paid off on time, however, and some borrowers default on the loan.

There is a [data dictionary](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit) which contains information on the different column names towards the bottom of the page. The LoanStats sheet describes the approved loans datasets and the RejectStats describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, we'll be focusing on data on approved loans only.

The approved loans datasets contain information on current loans, completed loans, and defaulted loans. The goal of this project is to build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not. 

In [51]:
import pandas as pd
import numpy as np

In [22]:
loans_2007 = pd.read_csv('loans_2007.csv')
print('{} columns in the dataset'.format(len(loans_2007.columns)))

52 columns in the dataset


In [23]:
loans_2007.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


## Data cleaning

After analyzing the first 18 columns, we can conclude that the following features need to be removed:

- **id**: randomly generated field by Lending Club for unique identification purposes only
- **member_id**: also a randomly generated field by Lending Club for unique identification purposes only
- **funded_amnt**: leaks data from the future (after the loan is already started to be funded)
- **funded_amnt_inv**: also leaks data from the future (after the loan is already started to be funded)
- **grade**: contains redundant information as the interest rate column (int_rate)
- **sub_grade**: also contains redundant information as the interest rate column (int_rate)
- **emp_title**: requires other data and a lot of processing to potentially be useful
- **issue_d**: leaks data from the future (after the loan is already completed funded)

In [24]:
cols = ['id', 'member_id', 'funded_amnt', 'funded_amnt_inv', 'grade', 'sub_grade', 'emp_title', 'issue_d']
loans_2007 = loans_2007.drop(cols, axis=1)

From the next set of 18 columns, we need to drop the following columns:

- **zip_code**: redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible (which only can be used to identify the state the borrower lives in)
- **out_prncp**: leaks data from the future, (after the loan already started to be paid off)
- **out_prncp_inv**: also leaks data from the future, (after the loan already started to be paid off)
- **total_pymnt**: also leaks data from the future, (after the loan already started to be paid off)
- **total_pymnt_inv**: also leaks data from the future, (after the loan already started to be paid off)
- **total_rec_prncp**: also leaks data from the future, (after the loan already started to be paid off)

In [25]:
cols = ['zip_code', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp']
loans_2007 = loans_2007.drop(cols, axis=1)

In the last group of columns, we need to drop the following columns:

- **total_rec_int**: leaks data from the future, (after the loan already started to be paid off),
- **total_rec_late_fee**: also leaks data from the future, (after the loan already started to be paid off),
- **recoveries**: also leaks data from the future, (after the loan already started to be paid off),
- **collection_recovery_fee**: also leaks data from the future, (after the loan already started to be paid off),
- **last_pymnt_d**: also leaks data from the future, (after the loan already started to be paid off),
- **last_pymnt_amnt**: also leaks data from the future, (after the loan already started to be paid off).

In [26]:
cols = ['total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'last_pymnt_d']
loans_2007 = loans_2007.drop(cols, axis=1)
loans_2007.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,Charged Off,n,...,f,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,Fully Paid,n,...,f,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,3000.0,60 months,12.69%,67.79,1 year,RENT,80000.0,Source Verified,Current,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


Just by becoming familiar with the columns in the dataset, we were able to reduce the number of columns from 52 to 32 columns. 

We should use the loan_status column, since it's the only column that directly describes if a loan was paid off on time, had delayed payments, or was defaulted on the borrower. Currently, this column contains text values and we need to convert it to a numerical one for training a model.

In [27]:
loans_2007['loan_status'].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

There are 8 different possible values for the loan_status column. Explanations about most of the different loan statuses are available on the [Lending Club website](https://help.lendingclub.com/hc/en-us/articles/215488038-What-do-the-different-Note-statuses-mean-).

From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be. Only the Fully Paid and Charged Off values describe the final outcome of the loan. The other values describe loans that are still on going and where the jury is still out on if the borrower will pay back the loan on time or not.

Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a binary classification one. Let's remove all the loans that don't contain either **Fully Paid** and **Charged Off** as the loan's status and then transform the **Fully Paid** values to 1 for the positive case and the Charged Off values to 0 for the negative case. 

Lastly, one thing we need to keep in mind is the class imbalance between the positive and negative cases. While there are 33,136 loans that have been fully paid off, there are only 5,634 that were charged off. This class imbalance is a common problem in binary classification and during training, the model ends up having a strong bias towards predicting the class with more observations in the training set and will rarely predict the class with less observations. The stronger the imbalance, the more biased the model becomes.

In [28]:
loans_2007 = loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]

mapping_dict = {
    'loan_status': {
        'Fully Paid': 1,
        'Charged Off': 0
    }
}

loans_2007 = loans_2007.replace(mapping_dict)

Finally, we will look for columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application. In addition, removing these columns will reduce the number of columns we'll need to explore further.

In [29]:
drop_columns = []
for col in loans_2007.columns:
    col_series = loans_2007[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
        
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print('The following columns have been removed: \n', drop_columns)

The following columns have been removed: 
 ['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']


## Feature preparation

Here we'll prepare the data for machine learning by focusing on handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter throughout this process.

In [31]:
loans = loans_2007.copy()
null_counts = loans.isnull().sum().sort_values(ascending=False)
print(null_counts)

emp_length              1036
pub_rec_bankruptcies     697
revol_util                50
title                     11
last_credit_pull_d         2
purpose                    0
term                       0
int_rate                   0
installment                0
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
total_acc                  0
loan_amnt                  0
dtype: int64


Domain knowledge tells us that employment length is frequently used in assessing how risky a potential borrower is, so we'll keep this column despite its relatively large amount of missing values.



In [33]:
loans['pub_rec_bankruptcies'].value_counts(normalize=True, dropna=False)

0.0    0.939438
1.0    0.042456
NaN    0.017978
2.0    0.000129
Name: pub_rec_bankruptcies, dtype: float64

We see that this column offers very little variability, nearly 94% of values are in the same category. It probably won't have much predictive value. let's drop it. In addition, we'll remove the remaining rows containing null values.

In [34]:
loans = loans.drop('pub_rec_bankruptcies', axis=1)
loans = loans.dropna()
loans.dtypes.value_counts()

object     11
float64    10
int64       1
dtype: int64

While the numerical columns can be used natively with scikit-learn, the object columns that contain text need to be converted to numerical data types.

In [35]:
object_columns_df = loans.select_dtypes(include=['object'])
object_columns_df.head()

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016
1,60 months,15.27%,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999,9.4%,Sep-2013
2,36 months,15.96%,10+ years,RENT,Not Verified,small_business,real estate business,IL,Nov-2001,98.5%,Jun-2016
3,36 months,13.49%,10+ years,RENT,Source Verified,other,personel,CA,Feb-1996,21%,Apr-2016
5,36 months,7.90%,3 years,RENT,Source Verified,wedding,My wedding loan I promise to pay back,AZ,Nov-2004,28.3%,Jan-2016


Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

- **home_ownership**: home ownership status, can only be 1 of 4 categorical values according to the data dictionary,
- **verification_status**: indicates if income was verified by Lending Club,
- **emp_length**: number of years the borrower was employed upon time of application,
- **term**: number of payments on the loan, either 36 or 60,
- **addr_state**: borrower's state of residence,
- **purpose**: a category provided by the borrower for the loan request,
- **title**: loan title provided the borrower,

There are also some columns that represent numeric values, that need to be converted:

- **int_rate**: interest rate of the loan in %,
- **revol_util**: revolving line utilization rate or the amount of credit the borrower is using relative to all available credit, read more here.

Lastly, some of the columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

- **earliest_cr_line**: The month the borrower's earliest reported credit line was opened,
- **last_credit_pull_d**: The most recent month Lending Club pulled credit for this loan.

Since these date features require some feature engineering for modeling purposes, let's remove these date columns from the Dataframe.

In [40]:
loans['home_ownership'].value_counts()

RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64

In [41]:
loans['emp_length'].value_counts()

10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64

In [42]:
loans['term'].value_counts()

 36 months    28234
 60 months     9441
Name: term, dtype: int64

In [44]:
## first 5 values
loans['addr_state'].value_counts().head()

CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
Name: addr_state, dtype: int64

In [45]:
loans['purpose'].value_counts()

debt_consolidation    17751
credit_card            4911
other                  3711
home_improvement       2808
major_purchase         2083
small_business         1719
car                    1459
wedding                 916
medical                 655
moving                  552
house                   356
vacation                348
educational             312
renewable_energy         94
Name: purpose, dtype: int64

The **home_ownership**, **verification_status**, **emp_length**, and **term** columns each contain a few discrete categorical values. We should encode these columns as dummy variables and keep them.

It seems like the **purpose** and **title** columns do contain overlapping information but we'll keep the **purpose** column since it contains a few discrete values. In addition, the **title** column has data quality issues since many of the values are repeated with slight modifications

In [46]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

loans = loans.drop(['last_credit_pull_d', 'addr_state', 'title', 'earliest_cr_line'], axis=1)
loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype('float')
loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype('float')
loans = loans.replace(mapping_dict)

In [47]:
cols = ['home_ownership', 'verification_status', 'purpose', 'term']
dummy_df = pd.get_dummies(loans[cols])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cols, axis=1)

## Making predictions

We established that this is a binary classification problem. Before continuing with the predictions, we need to pick an error metric. We should optimize for:

- high recall (true positive rate)
- low fall-out (false positive rate)

\begin{equation}
FPR=\dfrac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}
\end{equation}

\begin{equation}
TPR=\dfrac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
\end{equation}


In [48]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37675 entries, 0 to 39785
Data columns (total 38 columns):
loan_amnt                              37675 non-null float64
int_rate                               37675 non-null float64
installment                            37675 non-null float64
emp_length                             37675 non-null object
annual_inc                             37675 non-null float64
loan_status                            37675 non-null int64
dti                                    37675 non-null float64
delinq_2yrs                            37675 non-null float64
inq_last_6mths                         37675 non-null float64
open_acc                               37675 non-null float64
pub_rec                                37675 non-null float64
revol_bal                              37675 non-null float64
revol_util                             37675 non-null float64
total_acc                              37675 non-null float64
home_ownership_MORTGAGE   

In [63]:
# warnings filter
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
simplefilter(action='ignore', category=UserWarning)

### Logistic regression

A good first algorithm to apply to binary classification problems is logistic regression, for the following reasons:

- it's quick to train and we can iterate more quickly,
- it's less prone to overfitting than more complex models like decision trees,
- it's easy to interpret.

In [68]:
def check_accuracy(predictions):
    tn = len(loans[(predictions == 0) & (loans['loan_status'] == 0)])
    tp = len(loans[(predictions == 1) & (loans['loan_status'] == 1)])
    fn = len(loans[(predictions == 0) & (loans['loan_status'] == 1)])
    fp = len(loans[(predictions == 1) & (loans['loan_status'] == 0)])
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    print('False Positive Rate: {0} \nTrue Positive Rate: {1}'.format(fpr, tpr))
    
    return fpr, tpr

In [69]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

lr = LogisticRegression()
features = loans.drop('loan_status', axis=1)
target = loans['loan_status']

predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

fpr, tpr = check_accuracy(predictions)

False Positive Rate: 0.9986179664363277 
True Positive Rate: 0.9984268484530676


### Class imbalance

Unfortunately, even through we're not using accuracy as an error metric, the classifier is, and it isn't accounting for the imbalance in the classes. There are a few ways to get a classifier to correct for imbalanced classes. The two main ways are:

1. Use oversampling and undersampling to ensure that the classifier gets input that has a balanced number of each class.
2. Tell the classifier to penalize misclassifications of the less prevalent class more than the other class.

We'll look into oversampling and undersampling first. They involve taking a sample that contains equal numbers of rows where loan_status is 0, and where loan_status is 1. This way, the classifier is forced to make actual predictions, since predicting all 1s or all 0s will only result in 50% accuracy at most.

The downside of this technique is that since it has to preserve an equal ratio, you have to either:

- Throw out many rows of data. If we wanted equal numbers of rows where loan_status is 0 and where loan_status is 1, one way we could do that is to delete rows where loan_status is 1.
- Copy rows multiple times. One way to equalize the 0s and 1s is to copy rows where loan_status is 0.
- Generate fake data. One way to equalize the 0s and 1s is to generate new rows where loan_status is 0.

Unfortunately, none of these techniques are especially easy. The second method we mentioned earlier, telling the classifier to penalize certain rows more, is actually much easier to implement using scikit-learn. We can do this by setting the class_weight parameter to balanced when creating the LogisticRegression instance. This tells scikit-learn to penalize the misclassification of the minority class during the training process

In [70]:
lr = LogisticRegression(class_weight='balanced')
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

fpr, tpr = check_accuracy(predictions)

False Positive Rate: 0.622309970384995 
True Positive Rate: 0.6303749344520189


We significantly improved false positive rate in the last screen by balancing the classes, which reduced true positive rate. Our true positive rate is now around 66%, and our false positive rate is around 39%. From a conservative investor's standpoint, it's reassuring that the false positive rate is lower because it means that we'll be able to do a better job at avoiding bad loans than if we funded everything. However, we'd only ever decide to fund 66% of the total loans (true positive rate), so we'd immediately reject a good amount of loans.

We can try to lower the false positive rate further by assigning a harsher penalty for misclassifying the negative class. While setting class_weight to balanced will automatically set a penalty based on the number of 1s and 0s in the column, we can also set a manual penalty. 

In [71]:
penalty = {
    0: 10,
    1: 1
}

lr = LogisticRegression(class_weight=penalty)
predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

fpr, tpr = check_accuracy(predictions)

False Positive Rate: 0.2246791707798618 
True Positive Rate: 0.2270582066072365


Assigning manual penalties lowered the false positive rate to 9%, and thus lowered our risk. Note that this comes at the expense of true positive rate. While we have fewer false positives, we're also missing opportunities to fund more loans and potentially make more money. Given that we're approaching this as a conservative investor, this strategy makes sense, but it's worth keeping in mind the tradeoffs.

### Random forest

Random forests are able to work with nonlinear data, and learn complex conditionals. Logistic regressions are only able to work with linear data. Training a random forest algorithm may enable us to get more accuracy due to columns that correlate nonlinearly with **loan_status**.

In [72]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(class_weight='balanced', random_state=1)
predictions = cross_val_predict(rf, features, target, cv=3)
predictions = pd.Series(predictions)

fpr, tpr = check_accuracy(predictions)

False Positive Rate: 0.9636722606120435 
True Positive Rate: 0.9630637126376508


Unfortunately, using a random forest classifier didn't improve our false positive rate. The model is likely weighting too heavily on the 1 class, and still mostly predicting 1s. We could fix this by applying a harsher penalty for misclassifications of 0s.

Ultimately, our best model had a false positive rate of nearly 9%, and a true positive rate of nearly 24%. For a conservative investor, this means that they make money as long as the interest rate is high enough to offset the losses from 9% of borrowers defaulting, and that the pool of 24% of borrowers is large enough to make enough interest money to offset the losses.

If we had randomly picked loans to fund, borrowers would have defaulted on 14.5% of them, and our model is better than that, although we're excluding more loans than a random strategy would. 

## Next steps...

Here are some potential next steps:

- tweak the penalties further.
- try models other than a random forest and logistic regression.
- use some of the columns we discarded to generate better features.
- ensemble multiple models to get more accurate predictions.
- tune the parameters of the algorithm to achieve higher performance.