INTRO

In this project, we'll be working with financial lending data from Lending Club. Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return.

Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low, medium, and high interest loans. In this course, we'll focus on the mindset of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time. To do that, we'll first need to understand the features in the dataset and then experiment with building machine learning models that reliably predict whether a loan will be paid off or not.

Before diving into the datasets themselves, let's get familiar with the data dictionary. The LoanStats sheet describes the approved loans datasets and the RejectStats sheet describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, we'll focus on data for approved loans only.

In this mission, we'll focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off.


DATA CLEANING

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier

#from sklearn.linear_model import LinearRegression
#from sklearn.tree import DecisionTreeRegressor

#from sklearn.ensemble import RandomForestRegressor
#from sklearn.datasets import make_regression


In [2]:
df = pd.read_csv("loans_2007.csv").reset_index(drop=True)
df.head()



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [3]:
#drop columns that leak info from the future or are not useful
df = df.drop(columns = ['id',
'member_id',
'funded_amnt',
'funded_amnt_inv',
'grade',
'sub_grade',
'emp_title',
'issue_d',
'zip_code',
'out_prncp',
'out_prncp_inv',
'total_pymnt',
'total_pymnt_inv',
'total_rec_prncp',
'total_rec_int',
'total_rec_late_fee',
'recoveries',
'collection_recovery_fee',
'last_pymnt_d',
'last_pymnt_amnt'])

In [4]:
df.head()
#number of columns dropped from 52 to 32!

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,Charged Off,n,...,f,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,Fully Paid,n,...,f,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,3000.0,60 months,12.69%,67.79,1 year,RENT,80000.0,Source Verified,Current,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


We should use the loan_status column as the target variable, since it's the only column that directly describes whether a loan was paid off on time, had delayed payments, or was defaulted on by the borrower. Currently, this column contains text values, and we need to convert it to a numerical one for training a model.

In [5]:
df['loan_status'].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be. Only the Fully Paid and Charged Off values describe the final outcome of the loan. The other values describe loans that are still ongoing, where the jury is still out on whether the borrower will pay back the loan on time. While the Default status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while defaulted ones have a small chance.

Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a binary classification one. Let's remove all the loans whose status is neither Fully Paid nor Charged Off, and then transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case.

Lastly, one thing we need to keep in mind is the class imbalance between the positive and negative cases. While there are 33,136 loans that have been fully paid off, there are only 5,634 that were charged off. This class imbalance is a common problem in binary classification: during training, the model ends up with a strong bias towards predicting the class with more observations in the training set and will rarely predict the class with fewer observations. The stronger the imbalance, the more biased the model becomes. There are a few different ways to tackle this class imbalance, which we'll explore later.
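
To make the imbalance concrete, the short sketch below takes the two counts from the value_counts() output above and computes the ratio between the classes, along with the accuracy of a trivial model that always predicts Fully Paid. The variable names are illustrative and are not part of the original notebook.

```python
# Counts taken from the loan_status value_counts() output above.
fully_paid = 33136
charged_off = 5634

# Ratio of positive (Fully Paid) to negative (Charged Off) cases.
imbalance_ratio = fully_paid / charged_off

# Accuracy of a trivial "model" that predicts Fully Paid for every loan.
baseline_accuracy = fully_paid / (fully_paid + charged_off)

print(f"Fully Paid loans per Charged Off loan: {imbalance_ratio:.2f}")
print(f"Accuracy of always predicting Fully Paid: {baseline_accuracy:.1%}")
```

A model can therefore reach an accuracy of roughly 85% without learning anything about the borrowers, which is why accuracy alone is a poor guide here.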

In [6]:
#keep only the rows where loan_status is Fully Paid or Charged Off,
#then use the DataFrame replace method to map loan_status to 1 or 0

df2 = df.loc[(df['loan_status'] == 'Fully Paid') | (df['loan_status'] == 'Charged Off')]

mapping_dict = {
    "loan_status": {
        "Fully Paid": 1,
        "Charged Off": 0
    }
}
df2 = df2.replace(mapping_dict).reset_index(drop=True)


In [7]:
#remove columns with only 1 unique value (ignoring nulls)
drop_columns = []

for col in df2:
    unique_values = df2[col].dropna().unique()
    if len(unique_values) == 1:
        drop_columns.append(col)

df2 = df2.drop(columns = drop_columns)

#It looks like we were able to remove 9 more columns since they only contained 1 unique value.

In [8]:
#count null values in each column
null_counts = df2.isnull().sum()
null_counts

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

PREPARING THE FEATURES


Domain knowledge tells us that employment length is frequently used in assessing how risky a potential borrower is, so we'll keep this column despite its relatively large number of missing values.

Let's inspect the values of the pub_rec_bankruptcies column.

In [9]:
print(df2.pub_rec_bankruptcies.value_counts(normalize=True, dropna=False))

0.0    0.939438
1.0    0.042456
NaN    0.017978
2.0    0.000129
Name: pub_rec_bankruptcies, dtype: float64


We see that this column offers very little variability: nearly 94% of its values fall in the same category. It probably won't have much predictive value, so let's drop the entire column.

For emp_length, title, revol_util, and last_credit_pull_d, we'll instead drop only the rows that contain NaN values.

In [10]:
df2 = df2.drop(columns = ['pub_rec_bankruptcies'])
df3 = df2.dropna(axis =0).reset_index(drop=True)

df3.dtypes.value_counts()
#22 columns left!

object     11
float64    10
int64       1
dtype: int64

While the numerical columns can be used natively with scikit-learn, the object columns that contain text need to be converted to numerical data types.

In [11]:
object_df = df3.select_dtypes(include=['object'])
object_df.head(2)

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016
1,60 months,15.27%,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999,9.4%,Sep-2013


In [12]:
#Let's explore the unique value counts of the columns that seem like they contain categorical values
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state', 'title', 'purpose']

for col in cols:
    print(col)
    print(df3[col].value_counts())
    print('\n')

home_ownership
RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64


verification_status
Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64


emp_length
10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64


term
 36 months    28234
 60 months     9441
Name: term, dtype: int64


addr_state
CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     807
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
LA     420
AL     420
KY     311
OK     285
UT     249
KS     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162


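The home_ownership, verification_status, emp_length, and term columns each contain a few discrete categorical values, so we should keep them. We'll encode home_ownership, verification_status, and term as dummy variables and map emp_length to numbers, since its categories have a natural order.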

It seems like the purpose and title columns do contain overlapping information but we'll keep the purpose column since it contains a few discrete values. In addition, the title column has data quality issues since many of the values are repeated with slight modifications (e.g. Debt Consolidation and Debt Consolidation Loan and debt consolidation).

Next, some of the columns contain date values that would require a good amount of feature engineering to be potentially useful:
- earliest_cr_line: The month the borrower's earliest reported credit line was opened,
- last_credit_pull_d: The most recent month Lending Club pulled credit for this loan.

Since these date features require some feature engineering for modeling purposes, let's remove these date columns from the Dataframe.
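
For readers curious about what that feature engineering could look like, here is a minimal sketch on toy values in the same "Mon-YYYY" format. The credit_history_years feature is a hypothetical example and is not used anywhere in this notebook; whether it would help the model is untested. The format string assumes English month abbreviations.

```python
import pandas as pd

# Toy values in the same "Mon-YYYY" format as the notebook's date columns.
dates = pd.DataFrame({
    "earliest_cr_line": ["Jan-1985", "Apr-1999"],
    "last_credit_pull_d": ["Jun-2016", "Sep-2013"],
})

# Parse the strings into real timestamps.
earliest = pd.to_datetime(dates["earliest_cr_line"], format="%b-%Y")
last_pull = pd.to_datetime(dates["last_credit_pull_d"], format="%b-%Y")

# Hypothetical feature: calendar years between the first credit line
# and the most recent credit pull.
dates["credit_history_years"] = last_pull.dt.year - earliest.dt.year
print(dates)
```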

Lastly, the addr_state column contains many discrete values, and we'd need to add 49 dummy variable columns to use it for classification. This would make our DataFrame much wider and could slow down how quickly the code runs. Let's remove this column from consideration.

In [13]:
df3 = df3.drop(columns = ['last_credit_pull_d', 'addr_state', 'title', 'earliest_cr_line'])
df3['int_rate'] = df3['int_rate'].str.strip('%').astype('float')
df3['revol_util'] = df3['revol_util'].str.replace('%','').astype('float')

In [14]:
df3[['int_rate','revol_util']]

Unnamed: 0,int_rate,revol_util
0,10.65,83.7
1,15.27,9.4
2,15.96,98.5
3,13.49,21.0
4,7.90,28.3
...,...,...
37670,8.07,13.1
37671,10.28,26.9
37672,8.07,19.4
37673,7.43,0.7


In [15]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

df3 = df3.replace(mapping_dict)

In [16]:
print(df3.dtypes)
#float conversion successful

loan_amnt              float64
term                    object
int_rate               float64
installment            float64
emp_length              object
home_ownership          object
annual_inc             float64
verification_status     object
loan_status              int64
purpose                 object
dti                    float64
delinq_2yrs            float64
inq_last_6mths         float64
open_acc               float64
pub_rec                float64
revol_bal              float64
revol_util             float64
total_acc              float64
dtype: object


Let's now encode the home_ownership, verification_status, purpose, and term columns as dummy variables so we can use them in our model. We first need to use the pandas get_dummies function to return a new DataFrame containing a new column for each dummy variable:

In [17]:
# Returns a new Dataframe containing 1 column for each dummy variable.
dummy_df = pd.get_dummies(df3[['home_ownership', 'verification_status', 'purpose', "term"]])
#add back to dataframe
df4 = pd.concat([df3, dummy_df], axis=1)
#drop the original non-dummy columns
df4 = df4.drop(['home_ownership', 'verification_status', 'purpose', "term"], axis=1)

In [18]:
for col in ['home_ownership_MORTGAGE',
'home_ownership_NONE',
'home_ownership_OTHER',
'home_ownership_OWN',
'home_ownership_RENT',
'verification_status_Not Verified',
'verification_status_Source Verified',
'verification_status_Verified',
'purpose_car',
'purpose_credit_card',
'purpose_debt_consolidation',
'purpose_educational',
'purpose_home_improvement',
'purpose_house',
'purpose_major_purchase',
'purpose_medical',
'purpose_moving',
'purpose_other',
'purpose_renewable_energy',
'purpose_small_business',
'purpose_vacation',
'purpose_wedding',
'term_ 36 months',
'term_ 60 months' ]:
    df4[col] = df4[col].astype(int)
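
As a compact alternative to the two cells above, get_dummies can encode selected columns in place and return integer columns directly through its columns and dtype parameters. The sketch below runs on a toy DataFrame (the toy and encoded names are illustrative), so it is a suggestion rather than the code that produced the outputs in this notebook.

```python
import pandas as pd

# Toy frame with two of the categorical columns from the notebook.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "home_ownership": ["RENT", "RENT", "MORTGAGE"],
    "term": [" 36 months", " 60 months", " 36 months"],
})

# columns= encodes only the listed columns and drops the originals;
# dtype=int yields 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["home_ownership", "term"], dtype=int)
print(encoded.columns.tolist())
```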

In [19]:
df4.head(2)
#the data is now ready for modelling!!!

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0,0,0,0,0,0,0,0,1,0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,...,0,0,0,0,0,0,0,0,0,1


MACHINE LEARNING

We established that this is a binary classification problem in the first mission of this course, and we converted the loan_status column to 0s and 1s as a result. Before diving in and selecting an algorithm to apply to the data, we should select an error metric.

In this case, we're primarily concerned with false positives and false negatives. Both of these are different types of misclassifications. With a false positive, we predict that a loan will be paid off on time, but it actually isn't. This costs us money, since we fund loans that lose us money. With a false negative, we predict that a loan won't be paid off on time, but it actually would be paid off on time. This loses us potential money, since we didn't fund a loan that actually would have been paid off.

Since we're viewing this problem from the standpoint of a conservative investor, we need to treat false positives differently than false negatives. A conservative investor would want to minimize risk, and avoid false positives as much as possible. They'd be more okay with missing out on opportunities (false negatives) than they would be with funding a risky loan (false positives).
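
One common way to quantify these two kinds of mistakes is with the true positive rate and the false positive rate, which the cells below compute by hand. The helper here is a minimal sketch of the same calculation on toy labels; the function name tpr_fpr is illustrative, and it assumes both classes are present so that neither denominator is zero.

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # funded and paid off
    fp = np.sum((y_pred == 1) & (y_true == 0))  # funded but charged off
    fn = np.sum((y_pred == 0) & (y_true == 1))  # rejected but would have paid
    tn = np.sum((y_pred == 0) & (y_true == 0))  # rejected and charged off
    return tp / (tp + fn), fp / (fp + tn)

# Toy labels: 4 loans that were paid off (1) and 2 that were charged off (0).
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 1, 0]
toy_tpr, toy_fpr = tpr_fpr(actual, predicted)
print(toy_tpr, toy_fpr)
```

A conservative investor wants the second number (the share of bad loans that get funded) to be as low as possible, even if the first number drops as well.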

Lastly, there's a class imbalance in our target column, loan_status. There are about 6 times as many loans that were paid off on time (positive case, label of 1) as loans that weren't (negative case, label of 0). Imbalances can cause issues with many machine learning algorithms, where they appear to have high accuracy but actually aren't learning from the training data. Because of its potential to cause issues, we need to keep the class imbalance in mind as we build machine learning models.

A good first algorithm to apply to binary classification problems is logistic regression, for the following reasons:

- it's quick to train and we can iterate more quickly,
- it's less prone to overfitting than more complex models like decision trees,
- it's easy to interpret.

In [20]:
df4['emp_length'] = df4['emp_length'].astype(int)
df4.info()
df4=df4.reset_index(drop=True)
#no null obj, so we are ready to go!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37675 entries, 0 to 37674
Data columns (total 38 columns):
loan_amnt                              37675 non-null float64
int_rate                               37675 non-null float64
installment                            37675 non-null float64
emp_length                             37675 non-null int64
annual_inc                             37675 non-null float64
loan_status                            37675 non-null int64
dti                                    37675 non-null float64
delinq_2yrs                            37675 non-null float64
inq_last_6mths                         37675 non-null float64
open_acc                               37675 non-null float64
pub_rec                                37675 non-null float64
revol_bal                              37675 non-null float64
revol_util                             37675 non-null float64
total_acc                              37675 non-null float64
home_ownership_MORTGAGE    

In [21]:
df4.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0,0,0,0,0,0,0,0,1,0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,...,0,0,0,0,0,0,0,0,0,1
2,2400.0,15.96,84.33,10,12252.0,1,8.72,0.0,2.0,2.0,...,0,0,0,0,0,1,0,0,1,0
3,10000.0,13.49,339.31,10,49200.0,1,20.0,0.0,1.0,10.0,...,0,0,0,1,0,0,0,0,1,0
4,5000.0,7.9,156.46,3,36000.0,1,11.2,0.0,3.0,9.0,...,0,0,0,0,0,0,0,1,1,0


In [48]:
#preparing feature and target columns
features = df4.drop(columns =['loan_status'])
target = df4['loan_status']

# Instantiate model object.
lr = LogisticRegression(max_iter=2000)
# Make predictions using 2-fold cross-validation (cv=2).
predictions = cross_val_predict(lr, features, target, cv=2)
predictions = pd.Series(predictions)

#compute true positive rates & false positive rates
tp_filter = (predictions == 1) & (df4["loan_status"] == 1)
tp = len(predictions[tp_filter])

fp_filter = (predictions == 1) & (df4["loan_status"] == 0)
fp = len(predictions[fp_filter])

tn_filter = (predictions ==0) & (df4['loan_status']==0)
tn = len(predictions[tn_filter])

fn_filter = (predictions==0) & (df4['loan_status']==1)
fn = len(predictions[fn_filter])

# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)

print(tpr)
print(fpr)

0.9975840921761754
0.995360920393394


In [30]:
predictions.head(10)

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

As expected, even though we're not using accuracy as an error metric, the classifier is, and it isn't accounting for the imbalance in the classes. There are a few ways to get a classifier to correct for imbalanced classes. The two main ways are:

- Use oversampling and undersampling to ensure that the classifier gets input that has a balanced number of each class.
- Tell the classifier to penalize misclassifications of the less prevalent class more than the other class.

Oversampling and undersampling involve taking a sample that contains equal numbers of rows where loan_status is 0 and where loan_status is 1. This way, the classifier is forced to make actual predictions, since predicting all 1s or all 0s will only result in 50% accuracy at most.

The downside of this technique is that since it has to preserve an equal ratio, you have to either:

- Throw out many rows of data. If we wanted equal numbers of rows where loan_status is 0 and where loan_status is 1, one way we could do that is to delete rows where loan_status is 1.
- Copy rows multiple times. One way to equalize the 0s and 1s is to copy rows where loan_status is 0.
- Generate fake data. One way to equalize the 0s and 1s is to generate new rows where loan_status is 0.
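
The first two options can be sketched with pandas alone. The example below uses a small toy DataFrame with made-up values rather than the notebook's data, and the variable names are illustrative.

```python
import pandas as pd

# Toy data with the same kind of imbalance: six 1s and two 0s.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000, 3000, 7000, 1200, 8000],
    "loan_status": [1, 1, 1, 1, 1, 1, 0, 0],
})

minority = toy[toy["loan_status"] == 0]
majority = toy[toy["loan_status"] == 1]

# Undersampling: keep only as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=1)
undersampled = pd.concat([majority_down, minority]).reset_index(drop=True)

# Oversampling by copying: draw minority rows with replacement.
minority_up = minority.sample(n=len(majority), replace=True, random_state=1)
oversampled = pd.concat([majority, minority_up]).reset_index(drop=True)

print(undersampled["loan_status"].value_counts())
print(oversampled["loan_status"].value_counts())
```

If resampling is combined with cross-validation, it should be applied to the training folds only, so that copied rows do not end up in both the training and the test data.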

Unfortunately, none of these techniques are especially easy. The second method we mentioned earlier, telling the classifier to penalize certain rows more, is actually much easier to implement using scikit-learn.

We can do this by setting the class_weight parameter to balanced when creating the LogisticRegression instance. This tells scikit-learn to penalize the misclassification of the minority class during the training process. The penalty means that the logistic regression classifier pays more attention to correctly classifying rows where loan_status is 0. This lowers accuracy when loan_status is 1, but raises accuracy when loan_status is 0. This would mean that for the classifier, correctly classifying a row where loan_status is 0 is 6 times more important than correctly classifying a row where loan_status is 1.
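
As a rough check of what "balanced" means in practice, the sketch below applies the formula given in the scikit-learn documentation, n_samples / (n_classes * count of each class), to the class counts shown earlier. Those counts were taken before rows with missing values were dropped, so the weights are approximate, and the variable names are illustrative.

```python
import numpy as np

# Class counts from the loan_status value_counts() output earlier in the
# notebook (before rows with missing values were dropped, so approximate).
counts = np.array([5634, 33136])  # index 0: Charged Off, index 1: Fully Paid

# Formula documented by scikit-learn for class_weight="balanced":
# n_samples / (n_classes * count of each class)
n_samples = counts.sum()
n_classes = len(counts)
balanced_weights = n_samples / (n_classes * counts)

# How much more a misclassified 0 costs than a misclassified 1.
relative_penalty = balanced_weights[0] / balanced_weights[1]

print(balanced_weights)
print(relative_penalty)
```

The relative penalty equals the ratio of the class counts, which is where the "about 6 times" figure comes from.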

In [52]:
#logistic regression model again but with class weight balance

# Instantiate model object.
lr = LogisticRegression(class_weight = "balanced", max_iter=1000)
# Make predictions using 2-fold cross-validation (cv=2).
predictions = cross_val_predict(lr, features, target, cv=2)
predictions = pd.Series(predictions)

#compute true positive rates & false positive rates
fp_filter = (predictions == 1) & (df4["loan_status"] == 0)
fp = len(predictions[fp_filter])

tp_filter = (predictions == 1) & (df4["loan_status"] == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions==0) & (df4['loan_status']==1)
fn = len(predictions[fn_filter])

tn_filter = (predictions ==0) & (df4['loan_status']==0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)

print(tpr)
print(fpr)

0.582357678250635
0.37149749489701245


We can try to lower the false positive rate further by assigning a harsher penalty for misclassifying the negative class. While setting class_weight to balanced will automatically set a penalty based on the number of 1s and 0s in the column, we can also set a manual penalty. In the previous step, the penalty scikit-learn imposed for misclassifying a 0 would have been around 5.89 times the penalty for misclassifying a 1 (since there are about 5.89 times as many 1s as 0s). We can manually define the class weights instead.

In [58]:
penalty = {
    0: 10,
    1: 1
}

# Instantiate model object.
lr = LogisticRegression(class_weight = penalty, max_iter=2000)
# Make predictions using 2-fold cross-validation (cv=2).
predictions = cross_val_predict(lr, features, target, cv=2)
predictions = pd.Series(predictions)

#compute true positive rates & false positive rates
fp_filter = (predictions == 1) & (df4["loan_status"] == 0)
fp = len(predictions[fp_filter])

tp_filter = (predictions == 1) & (df4["loan_status"] == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions==0) & (df4['loan_status']==1)
fn = len(predictions[fn_filter])

tn_filter = (predictions ==0) & (df4['loan_status']==0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)

print(tpr)
print(fpr)

0.16824629870532118
0.08146223789200223


It looks like assigning manual penalties lowered the false positive rate to about 8%, and thus lowered our risk. Note that this comes at the expense of the true positive rate, which fell to about 17%. While we have fewer false positives, we're also missing opportunities to fund more loans and potentially make more money. Given that we're approaching this as a conservative investor, this strategy makes sense, but it's worth keeping the tradeoffs in mind.

In [54]:
#random forest model

# Instantiate model object.
rf = RandomForestClassifier(class_weight = "balanced", random_state =1) #Set random_state to 1, so the predictions don't vary due to random chance.
# Make predictions using 3-fold cross-validation.
predictions = cross_val_predict(rf, features, target, cv=3)
predictions = pd.Series(predictions)

#compute true positive rates & false positive rates
fp_filter = (predictions == 1) & (df4["loan_status"] == 0)
fp = len(predictions[fp_filter])

tp_filter = (predictions == 1) & (df4["loan_status"] == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions==0) & (df4['loan_status']==1)
fn = len(predictions[fn_filter])

tn_filter = (predictions ==0) & (df4['loan_status']==0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)

print(tpr)
print(fpr)

0.9977079848850895
0.9918352198923733


Unfortunately, using a random forest classifier didn't improve our false positive rate. The model is likely still weighting the 1 class too heavily and mostly predicting 1s. We could try to fix this by applying a harsher penalty for misclassifications of 0s.

Ultimately, our best model had a false positive rate of about 8% and a true positive rate of nearly 17%. In other words, the model would still fund about 8% of the loans that end up charged off, while funding only about 17% of the loans that are paid off. A conservative investor following it makes money as long as the interest earned on the funded loans that are paid off is enough to offset the losses on the funded loans that are charged off.
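
The two rates are measured against different groups of loans, so it can help to translate them into the make-up of the funded portfolio. The sketch below does this under a stated assumption: it reuses the class counts from the earlier value_counts() output, which were taken before rows with missing values were dropped, so the result is only an approximation.

```python
# Rates printed for the manually weighted logistic regression above.
tpr = 0.16824629870532118
fpr = 0.08146223789200223

# Assumption: class counts from the earlier value_counts() output, taken
# before rows with missing values were dropped, so the result is approximate.
n_paid = 33136
n_charged_off = 5634

funded_good = tpr * n_paid        # loans funded that were paid off
funded_bad = fpr * n_charged_off  # loans funded that were charged off

# Share of the funded portfolio that ends up charged off.
share_bad = funded_bad / (funded_good + funded_bad)
print(f"Estimated share of funded loans that are charged off: {share_bad:.1%}")
```

Whether that share is acceptable depends on the interest rates of the funded loans and on how much principal is recovered from the ones that are charged off, neither of which is modeled here.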