# Predicting Loan Default Risk

The approved loans datasets contain information on current loans, completed loans, and defaulted loans. Let's now define the problem statement for this machine learning project:

Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?
Before we can start doing machine learning, we need to define what features we want to use and which column represents the target column we want to predict. Let's start by reading in the dataset and exploring it.

## Data Exploration

In [1]:
# reading in the dataset
import pandas as pd
loans_2007 = pd.read_csv('loans_2007.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [2]:
loans_2007

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42533,72176,70868.0,2525.0,2525.0,225.0,36 months,9.33%,80.69,B,B3,...,82.03,May-2007,,1.0,INDIVIDUAL,,,,,
42534,71623,70735.0,6500.0,6500.0,0.0,36 months,8.38%,204.84,A,A5,...,205.32,Aug-2007,,1.0,INDIVIDUAL,,,,,
42535,70686,70681.0,5000.0,5000.0,0.0,36 months,7.75%,156.11,A,A3,...,156.39,Feb-2015,,1.0,INDIVIDUAL,,,,,
42536,Total amount funded in policy code 1: 471701350,,,,,,,,,,...,,,,,,,,,,


In [3]:
#deleting the last two lines. It just has some summary data which we don't need for our analysis
loans_2007 = loans_2007.drop(index=[42536,42537],axis=0)
loans_2007

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42531,73582,73096.0,3500.0,3500.0,225.0,36 months,10.28%,113.39,C,C1,...,0.00,Feb-2013,,1.0,INDIVIDUAL,,,,,
42532,72998,72992.0,1000.0,1000.0,0.0,36 months,9.64%,32.11,B,B4,...,32.41,Sep-2014,,1.0,INDIVIDUAL,,,,,
42533,72176,70868.0,2525.0,2525.0,225.0,36 months,9.33%,80.69,B,B3,...,82.03,May-2007,,1.0,INDIVIDUAL,,,,,
42534,71623,70735.0,6500.0,6500.0,0.0,36 months,8.38%,204.84,A,A5,...,205.32,Aug-2007,,1.0,INDIVIDUAL,,,,,


## Data Cleaning

The next step in our analysis is to get a better understanding of each column so that we can decide what features will be useful for our analysis and which features can be discarded. Since there are 52 columns, we will break them into groups of 3 and review ~18 columns at a time with the help of the [data dictionary](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit)

After manually reviewing the first 18 columns, we can conclude that the following features need to be removed:

- `id`: randomly generated field by Lending Club for unique identification purposes only
- `member_id`: also a randomly generated field by Lending Club for unique identification purposes only
- `funded_amnt` : leaks data from the future (after the loan is already started to be funded)
- `funded_amnt_inv` : also leaks data from the future (after the loan is already started to be funded)
- `grade` : contains redundant information as the interest rate column (`int_rate`)
- `sub_grade` : also contains redundant information as the interest rate column (`int_rate`)
- `emp_title` : requires other data and a lot of processing to potentially be useful
- `issue_d`: leaks data from the future (after the loan is already completely funded)

Let's now drop these columns from the Dataframe before moving onto the next group of columns.

In [4]:
#dropping columns identified above
loans_2007.drop(['id','member_id','funded_amnt','funded_amnt_inv','grade','sub_grade','emp_title','issue_d'],axis=1,inplace=True)

Let's review the next set of 18 columns.

Within this group of columns, we need to drop the following columns:

- `zip_code`: redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible (which only can be used to identify the state the borrower lives in)
- `out_prncp`: leaks data from the future, (after the loan already started to be paid off)
- `out_prncp_inv`: also leaks data from the future, (after the loan already started to be paid off)
- `total_pymnt`: also leaks data from the future, (after the loan already started to be paid off)
- `total_pymnt_inv`: also leaks data from the future, (after the loan already started to be paid off)
- `total_rec_prncp`: also leaks data from the future, (after the loan already started to be paid off)

Let's now drop these columns from the Dataframe before moving onto the next group of columns

In [5]:
# dropping columns
loans_2007.drop(['zip_code','out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp'],axis=1,inplace=True)

In the last group of columns, we need to drop the following columns:

- `total_rec_int`: leaks data from the future, (after the loan already started to be paid off),
- `total_rec_late_fee`: also leaks data from the future, (after the loan already started to be paid off),
- `recoveries`: also leaks data from the future, (after the loan already started to be paid off),
- `collection_recovery_fee`: also leaks data from the future, (after the loan already started to be paid off),
- `last_pymnt_d`: also leaks data from the future, (after the loan already started to be paid off),
- `last_pymnt_amnt`: also leaks data from the future, (after the loan already started to be paid off).

All these columns leak data from the future since they're describing aspects of the loan after it's already been fully funded and started to be paid off by the borrower. We won't have these data points available in the dataset that we are trying to make predictions on. Thus they need to be removed.

In [8]:
# dropping the columns
loans_2007.drop(['total_rec_int','total_rec_late_fee','recoveries','collection_recovery_fee','last_pymnt_d','last_pymnt_amnt'],axis=1,inplace=True)

print("now of cols:", loans_2007.shape[1])
loans_2007

now of cols: 32


Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,Charged Off,n,...,f,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,Fully Paid,n,...,f,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,3000.0,60 months,12.69%,67.79,1 year,RENT,80000.0,Source Verified,Current,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42531,3500.0,36 months,10.28%,113.39,< 1 year,RENT,180000.0,Not Verified,Does not meet the credit policy. Status:Fully ...,n,...,f,Feb-2013,,1.0,INDIVIDUAL,,,,,
42532,1000.0,36 months,9.64%,32.11,< 1 year,RENT,12000.0,Not Verified,Does not meet the credit policy. Status:Fully ...,n,...,f,Sep-2014,,1.0,INDIVIDUAL,,,,,
42533,2525.0,36 months,9.33%,80.69,< 1 year,RENT,110000.0,Not Verified,Does not meet the credit policy. Status:Fully ...,n,...,f,May-2007,,1.0,INDIVIDUAL,,,,,
42534,6500.0,36 months,8.38%,204.84,< 1 year,NONE,,Not Verified,Does not meet the credit policy. Status:Fully ...,n,...,f,Aug-2007,,1.0,INDIVIDUAL,,,,,


Let's set out target columns. Since we are tyring to predict wether the buyer will default or not, we need to find a column that gives the status of past loans. the `loan_status` column contains this information. Let's see what values are there in the field. 

In [9]:
print(loans_2007['loan_status'].value_counts())

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64


From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be. By looking at the data dictionary we can see that only the `Fully Paid` and `Charged Off` values describe the final outcome of the loan. The other values describe loans that are still ongoing and where the jury is still out on if the borrower will pay back the loan on time or not. While the `Default` status resembles the `Charged Off` status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance.

In [10]:
# removing all rows that are not 'Fully Paid' or 'Charged Off'
mapping_dict = {'Fully Paid' : 1,'Charged Off': 0}

drop_index=loans_2007[(loans_2007['loan_status']!='Fully Paid') & (loans_2007['loan_status']!='Charged Off') ].index
loans_2007=loans_2007.drop(index=drop_index)
loans_2007['loan_status'].replace(mapping_dict,inplace=True)
loans_2007


Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,1,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,0,n,...,f,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,1,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,1,n,...,f,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
5,5000.0,36 months,7.90%,156.46,3 years,RENT,36000.0,Source Verified,1,n,...,f,Jan-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39781,2500.0,36 months,8.07%,78.42,4 years,MORTGAGE,110000.0,Not Verified,1,n,...,f,Jun-2010,,1.0,INDIVIDUAL,0.0,,0.0,,
39782,8500.0,36 months,10.28%,275.38,3 years,RENT,18000.0,Not Verified,1,n,...,f,Jul-2010,,1.0,INDIVIDUAL,0.0,,0.0,,
39783,5000.0,36 months,8.07%,156.84,< 1 year,MORTGAGE,100000.0,Not Verified,1,n,...,f,Jun-2007,,1.0,INDIVIDUAL,0.0,,0.0,,
39784,5000.0,36 months,7.43%,155.38,< 1 year,MORTGAGE,200000.0,Not Verified,1,n,...,f,Jun-2007,,1.0,INDIVIDUAL,0.0,,0.0,,


Let's not try to remove columns that only contain one value. Such columns don't have any variance and don't help with predictions. 

In [11]:
drop_columns = []

for i in loans_2007.columns:
    temp = loans_2007[i].dropna()
    if len(temp.unique()) == 1:
        drop_columns.append(i)

        
loans_2007.drop(drop_columns,axis=1,inplace=True)

# print columns we have dropped
print(drop_columns)

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']


## Feature Engineering

As the first step of our feature engineering process, let's ensure there are no missing values in the dataset

In [12]:
loans_2007.isnull().sum()

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

While most of the columns have no missing values, two columns have fifty or less rows with missing values, and two columns, `emp_length` and `pub_rec_bankruptcies`, contain a relatively high amount of missing values.

Domain knowledge tells us that employment length is frequently used in assessing how risky a potential borrower is, so we'll keep this column despite its relatively large amount of missing values.

Let's inspect the values of the column `pub_rec_bankruptcies`.

In [14]:
print(loans_2007.pub_rec_bankruptcies.value_counts(normalize=True, dropna=False))

0.0    0.939438
1.0    0.042456
NaN    0.017978
2.0    0.000129
Name: pub_rec_bankruptcies, dtype: float64


We see that this column offers very little variability, nearly 94% of values are in the same category. It probably won't have much predictive value. Let's drop it.

In [16]:
# dropping 'pub_rec_bankruptcies' column
loans_2007.drop('pub_rec_bankruptcies',axis=1,inplace=True)

Next, let's drop all rows containing null values.

In [18]:
loans_2007=loans_2007.dropna()

Next, Let's see the the number of fields in each datatype

In [19]:
print(loans_2007.dtypes.value_counts())

object     11
float64    10
int64       1
dtype: int64


We can see some object datatypes. Let's look at them more closely and see what we can do about them.

In [21]:
object_columns_df = loans_2007.select_dtypes(include=['object'])

print(object_columns_df.iloc[0])

term                     36 months
int_rate                    10.65%
emp_length               10+ years
home_ownership                RENT
verification_status       Verified
purpose                credit_card
title                     Computer
addr_state                      AZ
earliest_cr_line          Jan-1985
revol_util                   83.7%
last_credit_pull_d        Jun-2016
Name: 0, dtype: object


Based on the first row's values for purpose and title, it seems like these columns could reflect the same information. Let's explore the unique value counts separately to confirm if this is true.

Lastly, some of the columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

- `earliest_cr_line`: The month the borrower's earliest reported credit line was opened,
- `last_credit_pull_d`: The most recent month Lending Club pulled credit for this loan.

Since these date features require some feature engineering for modeling purposes, let's remove these date columns from the Dataframe.

In [25]:
print(loans_2007['title'].value_counts())

print(loans_2007['purpose'].value_counts())

Debt Consolidation                         2068
Debt Consolidation Loan                    1599
Personal Loan                               624
Consolidation                               488
debt consolidation                          466
                                           ... 
My Student Loan                               1
Medical, Consolidation, & Kauai, Oh My!       1
Getting it done                               1
POFF                                          1
Karilyn                                       1
Name: title, Length: 18881, dtype: int64
debt_consolidation    17751
credit_card            4911
other                  3711
home_improvement       2808
major_purchase         2083
small_business         1719
car                    1459
wedding                 916
medical                 655
moving                  552
house                   356
vacation                348
educational             312
renewable_energy         94
Name: purpose, dtype: int64


It seems like the `purpose` and `title` columns do contain overlapping information but we'll keep the `purpose` column since it contains a few discrete values. In addition, the `title` column has data quality issues since many of the values are repeated with slight modifications (e.g. Debt Consolidation and Debt Consolidation Loan and debt consolidation).

There are also some columns that represent numeric values, that need to be converted:

- `int_rate`: interest rate of the loan in %,
- `revol_util`: revolving line utilization rate or the amount of credit the borrower is using relative to all available credit,

`home_ownership`, `verification_status`, `emp_length`, `term`, `addr_state` appear to contain categorical values. Let's explore them further

In [23]:
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']


for i in cols:
    print(loans_2007[i].value_counts(),"\n")

RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64 

Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64 

10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64 

 36 months    28234
 60 months     9441
Name: term, dtype: int64 

CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     807
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
AL     420
LA     420
KY     311
OK     285
UT     249
KS     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162
NH     157
DE     110
MT      77
WY      76
AK      76
SD     

The `home_ownership`, `verification_status`, `emp_length`, `term`, and `addr_state` columns all contain multiple discrete values. We should clean the `emp_length` column and treat it as a numerical one since the values have ordering. `addr_state` has multiple values and converting them to dummy variables will add a lot of columns to our dataset. Let's drop this for now. 

In [27]:
mapping_dict = {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }

loans_2007.drop(['last_credit_pull_d','addr_state','title','earliest_cr_line'],axis=1,inplace=True)


loans_2007['int_rate']=loans_2007['int_rate'].str.rstrip('%').astype('float')
loans_2007["revol_util"] = loans_2007["revol_util"].str.rstrip("%").astype("float")

loans_2007['emp_length']=loans_2007['emp_length'].map(mapping_dict)

  

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [28]:
loans_2007

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc
0,5000.0,36 months,10.65,162.87,10,RENT,24000.0,Verified,1,credit_card,27.65,0.0,1.0,3.0,0.0,13648.0,83.7,9.0
1,2500.0,60 months,15.27,59.83,0,RENT,30000.0,Source Verified,0,car,1.00,0.0,5.0,3.0,0.0,1687.0,9.4,4.0
2,2400.0,36 months,15.96,84.33,10,RENT,12252.0,Not Verified,1,small_business,8.72,0.0,2.0,2.0,0.0,2956.0,98.5,10.0
3,10000.0,36 months,13.49,339.31,10,RENT,49200.0,Source Verified,1,other,20.00,0.0,1.0,10.0,0.0,5598.0,21.0,37.0
5,5000.0,36 months,7.90,156.46,3,RENT,36000.0,Source Verified,1,wedding,11.20,0.0,3.0,9.0,0.0,7963.0,28.3,12.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39781,2500.0,36 months,8.07,78.42,4,MORTGAGE,110000.0,Not Verified,1,home_improvement,11.33,0.0,0.0,13.0,0.0,7274.0,13.1,40.0
39782,8500.0,36 months,10.28,275.38,3,RENT,18000.0,Not Verified,1,credit_card,6.40,1.0,1.0,6.0,0.0,8847.0,26.9,9.0
39783,5000.0,36 months,8.07,156.84,0,MORTGAGE,100000.0,Not Verified,1,debt_consolidation,2.30,0.0,0.0,11.0,0.0,9698.0,19.4,20.0
39784,5000.0,36 months,7.43,155.38,0,MORTGAGE,200000.0,Not Verified,1,other,3.72,0.0,0.0,17.0,0.0,85607.0,0.7,26.0


Let's now encode the home_ownership, verification_status, purpose, and term columns as dummy variables so we can use them in our model.

In [29]:
cols = ['home_ownership','verification_status','purpose','term']

dummy_df = pd.get_dummies(loans_2007[cols])

loans_2007 = pd.concat([loans_2007,dummy_df],axis=1)

loans_2007.drop(cols,axis=1,inplace=True)

In [30]:
loans_2007

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0,0,0,0,0,0,0,0,1,0
1,2500.0,15.27,59.83,0,30000.0,0,1.00,0.0,5.0,3.0,...,0,0,0,0,0,0,0,0,0,1
2,2400.0,15.96,84.33,10,12252.0,1,8.72,0.0,2.0,2.0,...,0,0,0,0,0,1,0,0,1,0
3,10000.0,13.49,339.31,10,49200.0,1,20.00,0.0,1.0,10.0,...,0,0,0,1,0,0,0,0,1,0
5,5000.0,7.90,156.46,3,36000.0,1,11.20,0.0,3.0,9.0,...,0,0,0,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39781,2500.0,8.07,78.42,4,110000.0,1,11.33,0.0,0.0,13.0,...,0,0,0,0,0,0,0,0,1,0
39782,8500.0,10.28,275.38,3,18000.0,1,6.40,1.0,1.0,6.0,...,0,0,0,0,0,0,0,0,1,0
39783,5000.0,8.07,156.84,0,100000.0,1,2.30,0.0,0.0,11.0,...,0,0,0,0,0,0,0,0,1,0
39784,5000.0,7.43,155.38,0,200000.0,1,3.72,0.0,0.0,17.0,...,0,0,0,1,0,0,0,0,1,0


## Treating Class Imbalance & Picking an Error Metric

For the purpose of this project, let's assume we are doing prediction for a conservative investor. A conservative investor will be very particular about not investing in buyers that have a high chance of defaulting. 

If we see the loan_status column, we can see that there is a class imbalance. There are 6 times as many loans that were paid off on time (1), than loans that weren't paid off on time (0). This causes a major issue when we use accuracy as a metric. This is because due to the class imbalance, a classifier can predict 1 for every row, and still have high accuracy. Instead we will use `False Positve Rate` and `False Negative Rate` as our error metric. We should train our model in a way that optimizes for **high** `True Postive Rate`, and **low** `False Positve Rate`


Conservative investors would want to minimize risk, and avoid false positives as much as possible. They'd be more okay with missing out on opportunities (false negatives) than they would be with funding a risky loan (false positives)


As for the class imbalance, let's take care of it by assigning the **class_weight** parameter as **balanced**. This will penalize the model for  misclassifications of the less prevalent class more than the other class.


Let't start by using a Logistic Regression model.

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression(class_weight='balanced')

features=(loans_2007.drop('loan_status',axis=1)).columns
target = 'loan_status'
predictions=cross_val_predict(lr,loans_2007[features],loans_2007[target],cv=3)

predictions = pd.Series(predictions)


# calculating false positives

fp_filter = (predictions==1) & (loans_2007['loan_status']==0)
fp = len(predictions[fp_filter])

# calculating false negatives

fn_filter = (predictions==0) & (loans_2007['loan_status']==1)
fn = len(predictions[fn_filter])

# calculating true positives

tp_filter = (predictions==1) & (loans_2007['loan_status']==1)
tp = len(predictions[tp_filter])

# calculating true negative

tn_filter = (predictions==0) & (loans_2007['loan_status']==0)
tn = len(predictions[tn_filter])



fpr = fp/(fp+tn)
tpr = tp/(tp+fn)

print(fpr)
print(tpr)





0.6159921026653504
0.6288345568956476


Our false positve rate in about 61% , which is still high. Let's try to reduce this further by applying harsher penalties. We will impose a penalty of 10 for misclassifying a 0, and a penalty of 1 for misclassifying a 1.

In [50]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

penalty = {0:10,1:1}

lr = LogisticRegression(class_weight=penalty)

predictions=cross_val_predict(lr,loans_2007[features],loans_2007[target],cv=3)

predictions = pd.Series(predictions)


# calculating false positives

fp_filter = (predictions==1) & (loans_2007['loan_status']==0)
fp = len(predictions[fp_filter])

# calculating false negatives

fn_filter = (predictions==0) & (loans_2007['loan_status']==1)
fn = len(predictions[fn_filter])

# calculating true positives

tp_filter = (predictions==1) & (loans_2007['loan_status']==1)
tp = len(predictions[tp_filter])

# calculating true negative

tn_filter = (predictions==0) & (loans_2007['loan_status']==0)
tn = len(predictions[tn_filter])



fpr = fp/(fp+tn)
tpr = tp/(tp+fn)

print(fpr)
print(tpr)

predictions_lr = predictions




0.22724580454096743
0.2305977975878343


It looks like assigning manual penalties lowered the false positive rate to 22.7%, and thus lowered our risk. But comes at the expense of true positive rate. While we have fewer false positives, we're also missing opportunities to fund more loans and potentially make more money. Since we are approaching this as a conservative investor, this strategy makes sense. 

Next, let's try to use a Random Forest model to see if we get a better performance. 

In [48]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

penalty = {0:10,1:1}
rf = RandomForestClassifier(random_state=1,class_weight=penalty)

predictions=cross_val_predict(rf,loans_2007[features],loans_2007[target],cv=3)

predictions = pd.Series(predictions)


# calculating false positives

fp_filter = (predictions==1) & (loans_2007['loan_status']==0)
fp = len(predictions[fp_filter])

# calculating false negatives

fn_filter = (predictions==0) & (loans_2007['loan_status']==1)
fn = len(predictions[fn_filter])

# calculating true positives

tp_filter = (predictions==1) & (loans_2007['loan_status']==1)
tp = len(predictions[tp_filter])

# calculating true negative

tn_filter = (predictions==0) & (loans_2007['loan_status']==0)
tn = len(predictions[tn_filter])



fpr = fp/(fp+tn)
tpr = tp/(tp+fn)

print(fpr)
print(tpr)





0.9660414610069101
0.9646368641845832


That's not good at all! Let's do some hyperparameter optimization and increase the penalty and see if the model performs better

In [51]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

penalty = {0:15,1:1}
rf = RandomForestClassifier(max_depth=5,random_state=1,class_weight=penalty)

predictions=cross_val_predict(rf,loans_2007[features],loans_2007[target],cv=3)

predictions = pd.Series(predictions)


# calculating false positives

fp_filter = (predictions==1) & (loans_2007['loan_status']==0)
fp = len(predictions[fp_filter])

# calculating false negatives

fn_filter = (predictions==0) & (loans_2007['loan_status']==1)
fn = len(predictions[fn_filter])

# calculating true positives

tp_filter = (predictions==1) & (loans_2007['loan_status']==1)
tp = len(predictions[tp_filter])

# calculating true negative

tn_filter = (predictions==0) & (loans_2007['loan_status']==0)
tn = len(predictions[tn_filter])



fpr = fp/(fp+tn)
tpr = tp/(tp+fn)

print(fpr)
print(tpr)

predictions_rf = predictions




0.15202369200394866
0.15443104352385947




Nice! We managed to reduce the false positive rate to 15% by using Random Forest. As mentioned above, our true positve rate has also come down. It may work for a conservative investor since the predictions lower the risk of risky investments but it is not ideal as we are missing opportunities to make more money. 

## Conclusion

We managed to bring down the False Positve rate to 15% by using Random Forest making it the best performing model for this project. For a conservative investor, this means that they make money as long as the interest rate is high enough to offset the losses from 15.2% of borrowers defaulting, and that the pool of 15.4% of borrowers is large enough to make enough interest money to offset the losses.

## Next Steps

I will be working towards improving this model further by including the features I had dropped earlier or selecting better features and see if it improves the performance of the model. I will also try ensembling the prediction or using SVM or Neural Network to see if I get better results. 