In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier

Importing the cleaned and prepared data from the last script, Preparing the Features.  
While preparing the data, we removed columns that had data leakage issues, contained redundant information, or required additional processing to turn into useful features. We also cleaned features that had formatting issues and converted categorical columns to dummy variables.  

In [2]:
loans= pd.read_csv("data/cleaned_loans_2007.csv")
loans.info()
# loan_status == 0 means the loan was not paid off; 1 means it was paid off

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38649 entries, 0 to 38648
Data columns (total 40 columns):
Unnamed: 0                             38649 non-null int64
Unnamed: 0.1                           38649 non-null int64
loan_amnt                              38649 non-null float64
int_rate                               38649 non-null float64
installment                            38649 non-null float64
emp_length                             38649 non-null int64
annual_inc                             38649 non-null float64
loan_status                            38649 non-null int64
dti                                    38649 non-null float64
delinq_2yrs                            38649 non-null float64
inq_last_6mths                         38649 non-null float64
open_acc                               38649 non-null float64
pub_rec                                38649 non-null float64
revol_bal                              38649 non-null float64
revol_util                     

We are a conservative investor who is only interested in borrowers who pay back on time. Our main objective is to make money, so we want to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. Our error metric will help us determine whether our algorithm will make or lose us money.  
In this case, we're primarily concerned with false positives (predicting that a loan will be paid off when it isn't) and false negatives (predicting that a loan won't be paid off when it is). We want to minimize risk and avoid false positives as much as possible. We're more okay with missing out on opportunities (false negatives) than with funding a risky loan (false positives). We'll measure these with the false positive rate, FPR = FP / (FP + TN), and the true positive rate, TPR = TP / (TP + FN).
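
The two rates can be wrapped in a small helper so they don't have to be recomputed by hand for every model. This is only a sketch: the function name `fpr_tpr` and the toy labels are made up for illustration and are not part of the original notebook.

```python
def fpr_tpr(predictions, actual):
    """Return (false positive rate, true positive rate) for 0/1 labels."""
    pairs = list(zip(predictions, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # funded, paid off
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # rejected, not paid off
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # funded, not paid off
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # rejected, paid off
    fpr = fp / (fp + tn)  # share of bad loans we would have funded
    tpr = tp / (tp + fn)  # share of good loans we would have funded
    return fpr, tpr


# Hypothetical toy labels, not taken from the loans data
actual      = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]

fpr, tpr = fpr_tpr(predictions, actual)
print(fpr, tpr)
```

A conservative investor wants the first number as low as possible, even if the second one drops as well.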

## Logistic Regression

We know that the classes are imbalanced, which would badly affect the model's results. There are two common ways to overcome this: 
- Use oversampling and undersampling to ensure that the classifier gets input that has a balanced number of each class.
- Tell the classifier to penalize misclassifications of the less prevalent class more than the other class.
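
For reference, the resampling option might look roughly like the sketch below, which randomly undersamples the majority class with pandas. The toy DataFrame is hypothetical and only reuses two column names from the loans data. In a real workflow the resampling should be applied to the training folds only, never to the rows used for evaluation.

```python
import pandas as pd

# Hypothetical toy frame: 6 paid-off loans (1) and 2 unpaid loans (0)
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 1500, 3000, 2500, 1200, 4000, 5000],
    "loan_status": [1, 1, 1, 1, 1, 1, 0, 0],
})

minority = toy[toy["loan_status"] == 0]
majority = toy[toy["loan_status"] == 1]

# Randomly keep only as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=1)

# Recombine and shuffle so the two classes are interleaved
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=1)

print(balanced["loan_status"].value_counts())
```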


The first option is more involved to implement, so we'll go with the second option, which scikit-learn supports directly.  

We can do this by setting the class_weight parameter to balanced when creating the LogisticRegression instance. This tells scikit-learn to penalize the misclassification of the minority class during the training process. The penalty means that the logistic regression classifier pays more attention to correctly classifying rows where loan_status is 0. This lowers accuracy when loan_status is 1, but raises accuracy when loan_status is 0.
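
To make the balanced setting concrete, the sketch below reproduces the heuristic that the scikit-learn documentation gives for it, n_samples / (n_classes * np.bincount(y)), using NumPy only. The label counts are hypothetical and chosen for easy arithmetic; they are not the class counts of the loans data.

```python
import numpy as np

# Hypothetical labels: 100 unpaid loans (0) and 600 paid-off loans (1)
y = np.array([0] * 100 + [1] * 600)

counts = np.bincount(y)              # rows per class: [100, 600]
n_samples = len(y)
n_classes = len(counts)

# Each class is weighted inversely to how often it occurs
weights = n_samples / (n_classes * counts)
balanced = {cls: float(w) for cls, w in enumerate(weights)}

print(balanced)
print(balanced[0] / balanced[1])     # how much harder a mistake on class 0 is penalized
```

With a 6:1 class ratio, errors on the minority class end up weighted six times as heavily as errors on the majority class.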

In [3]:
cols = loans.columns
train_cols = cols.drop("loan_status")
features = loans[train_cols]
target = loans["loan_status"]

lr = LogisticRegression(class_weight="balanced")
predictions= cross_val_predict(lr, features, target, cv=3)

predictions= pd.Series(predictions)

tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

fpr= fp/(fp+tn)
tpr= tp/(tp+fn)

print(fpr, tpr)

0.44311266826479806 0.6878950219707458


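We can see the TPR is higher than the FPR, which is good and what we want. But this also means that as a conservative investor we'd fund only 69% of the loans that would be paid off, turning away a good number of them, while still funding 44% of the loans that would not be paid off.  
To lower the FPR further we can **_manually adjust the penalty_** to make it harsher. When class_weight is set to balanced, scikit-learn weights each class inversely to its frequency: n_samples / (n_classes * number of rows in that class). Here we instead penalize misclassifying a 0 ten times more heavily than misclassifying a 1.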

In [4]:
penalty = {
    0: 10,
    1: 1
}

lr = LogisticRegression(class_weight=penalty)
predictions= cross_val_predict(lr, features, target, cv=3)

predictions= pd.Series(predictions)

tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

fpr= fp/(fp+tn)
tpr= tp/(tp+fn)

print(fpr, tpr)

0.20302415637101234 0.3507494131102149


Introducing the manual penalty lowered the FPR from 44% to 20%, at the cost of lowering the TPR from 69% to 35%.  
There is more scope to experiment with the penalty values, but let's try a different model now: a random forest.  
Logistic regression can only learn a linear decision boundary in the features, whereas a random forest can also capture features that are non-linearly related to the target.
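
A quick way to see the difference is an XOR-style toy problem, where the label depends on a non-linear combination of two features. The data below is entirely hypothetical and unrelated to the loans data, and the accuracies are measured on the training rows purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Hypothetical XOR-style data: the label is 1 only when exactly one feature is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 50)
y = np.array([0, 1, 1, 0] * 50)

lr = LogisticRegression().fit(X, y)
rfc = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Training accuracy only: enough to show what each model is able to represent
lr_acc = lr.score(X, y)
rfc_acc = rfc.score(X, y)
print(lr_acc, rfc_acc)
```

No straight line separates the four XOR corners, so the logistic regression cannot classify more than three of them correctly, while the trees in the forest can split on one feature and then the other.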

## Random Forests

In [5]:
rfc= RandomForestClassifier(random_state=1, class_weight="balanced") 
# random_state is set to 1 so that the predictions don't vary due to random chance
predictions= cross_val_predict(rfc, features, target, cv=3)

predictions= pd.Series(predictions)

tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

fpr= fp/(fp+tn)
tpr= tp/(tp+fn)

print(fpr, tpr)

0.6123916651300019 0.6205080358755192


**_Introducing a harsher penalty_**

In [6]:
penalty = {
    0: 10,
    1: 1
}

rfc= RandomForestClassifier(random_state=1, class_weight=penalty) 
# random_state is set to 1 so that the predictions don't vary due to random chance
predictions= cross_val_predict(rfc, features, target, cv=3)

predictions= pd.Series(predictions)

tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

fpr= fp/(fp+tn)
tpr= tp/(tp+fn)

print(fpr, tpr)

0.6369168356997972 0.6471738999578643


## Conclusion
Unfortunately, using a random forest classifier (even with a harsher penalty) didn't improve our false positive rate, which stayed above 61% in both runs. The model is likely still weighting the 1 class too heavily and mostly predicting 1s.  
Our best model was the logistic regression with the manual penalty, at a 20% FPR and a 35% TPR. For a conservative investor, this means they make money as long as the interest rate is high enough to offset the losses from the 20% of unpaid loans that still get funded, and as long as the 35% of paid-off loans that get funded form a pool large enough to earn the interest needed to cover those losses.

## Future Scope
- We can tweak the penalties further.
- We can try models other than a random forest and logistic regression.
- We can use some of the columns we discarded to generate better features.
- We can ensemble multiple models to get more accurate predictions.
- We can tune the parameters of the algorithm to achieve higher performance.