### Instructions
The lecture uses random forest to predict the state of the loan with data taken from Lending Club (2015). With minimal feature engineering, they were able to get an accuracy of 98% with cross validation.  However, the accuracies had a lot of variance, ranging from 98% to 86%, indicating there are lots of useless features.  

I am tasked with 1) removing as many features as possible without dropping the average below 90% accuracy in a 10 fold cross validation and 2) if the first task is possible without using anything related to payment amount or outstanding principal. 

### 1 - Import Data
In this dataset, there are 420k+ rows and 110 features and the target variable (loan status).  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings('ignore')

  _nan_object_mask = _nan_object_array != _nan_object_array


In [2]:
df = pd.read_csv('LoanStats3d.csv', skipinitialspace=True, header=1)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421097 entries, 0 to 421096
Columns: 111 entries, id to total_il_high_credit_limit
dtypes: float64(85), object(26)
memory usage: 356.6+ MB


The the last two rows of the dataset holds no data, so these rows will be deleted.

In [4]:
df.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
421092,36271333,38982739.0,13000.0,13000.0,13000.0,60 months,15.99%,316.07,D,D2,...,0.0,3.0,100.0,50.0,1.0,0.0,51239.0,34178.0,10600.0,33239.0
421093,36490806,39222577.0,12000.0,12000.0,12000.0,60 months,19.99%,317.86,E,E3,...,1.0,2.0,95.0,66.7,0.0,0.0,96919.0,58418.0,9700.0,69919.0
421094,36271262,38982659.0,20000.0,20000.0,20000.0,36 months,11.99%,664.2,B,B5,...,0.0,1.0,100.0,50.0,0.0,1.0,43740.0,33307.0,41700.0,0.0
421095,Total amount funded in policy code 1: 6417608175,,,,,,,,,,...,,,,,,,,,,
421096,Total amount funded in policy code 2: 1944088810,,,,,,,,,,...,,,,,,,,,,


In [5]:
df = df[:-2]

### 2 - Removing Features
In the lecture, they removed any columns with missing values.  I'm not sure this is the best method, as there could be valuable information in the missing values.  Instead, the method I employ is to identify the categorical features.  If there are less than 30 unique values, then I create dummy variables out of them.  If there are more than 30 unique values, I use panda's ability to map each unique value to a numeric value, allowing me to retain all columns and rows.

In [6]:
cat_col = [col for col in df.columns if df[col].dtype == 'object']
num_col = [col for col in df.columns if df[col].dtype != 'object']
cat_col.remove('loan_status')

In [7]:
dummy_df = pd.DataFrame()
for col in cat_col:
    if df[col].nunique() < 30:
        dummy_df = pd.concat([dummy_df, pd.get_dummies(df[col], prefix = col, drop_first=True)], axis = 1)
        cat_col.remove(col)

For whatever reason, the id and interest rates are labeled as 'objects'.  The following is to convert them into numeric features.

In [8]:
df['id'] = pd.to_numeric(df['id'], errors='coerce')
df['int_rate'] = pd.to_numeric(df['int_rate'].str.strip('%'), errors='coerce')

cat_col.remove('id')
cat_col.remove('int_rate')

Using Panda's codes function is as simple as converting the objects into categorical dtypes (instead of objects).  Then add one to the codes as null values are given a value of -1, which random forest will not take. 

In [9]:
for col in cat_col + ['loan_status']:
    df[col] = df[col].astype('category')
    df[col] = df[col].cat.codes+1

In [10]:
df_combined = pd.concat([df[cat_col+num_col], df['loan_status'], dummy_df], axis = 1)

In [11]:
combined_cols_lst = list(df_combined.columns)
combined_cols_lst.remove('loan_status')

At this point, I have 136 features.  How do we remove the features that do not help predict the loan status?  One way is to find the features that are highly correlated with the loan status.  Below I've found 9 features that have a correlation of at least 0.15.  

In [12]:
print('There are {} features.'.format(len(combined_cols_lst)))

There are 136 features.


In [13]:
important_cols = [col for col in combined_cols_lst if df_combined[[col, 'loan_status']].corr().abs()['loan_status'][0] > 0.15]

In [14]:
important_cols

['last_pymnt_d',
 'out_prncp',
 'out_prncp_inv',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_prncp',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_amnt']

### 3 - Random Forest Classifier
I'm finally ready to apply the data to a random forest classifier.  I will be using a 10 fold cross validation, the same as the lecture for comparison.  Recall that in the lecture, the average accuracy was ~97%, but it had a range of ~11%.  **On the other hand, this model with only 9 features has an accuracy of ~97%, but a range of ~2.5%. **

In [15]:
rfc = ensemble.RandomForestClassifier()
X = df_combined[important_cols]
Y = df_combined['loan_status']

cv = cross_val_score(rfc, X, Y, cv = 10)

In [16]:
print('The cross validation score has a range of {:0.3f} and mean of {:0.3f}'.format(cv.max() - cv.min(), cv.mean()))

The cross validation score has a range of 0.025 and mean of 0.972


#### 3.1  - Removing Payment Amount and Outstanding Principal
The second question to answer is if is is possible to have an accuracy above 90% without using features related to payment amounts or outstanding principals.  Looking at the features deemed 'important', there are only three that are not related to payment amount or principals.  Of these three features, two of them have very low correlations.  My guess is it will be pretty difficult to achieve 90% accuracy. 

In [28]:
for col in important_cols:
    print(col, df_combined[[col, 'loan_status']].corr().abs()['loan_status'][0])

last_pymnt_d 0.317289555141
out_prncp 0.218553748244
out_prncp_inv 0.218595214349
total_pymnt 0.346931943687
total_pymnt_inv 0.346914103008
total_rec_prncp 0.411788669787
recoveries 0.162988104241
collection_recovery_fee 0.163651718919
last_pymnt_amnt 0.492987539961


In [17]:
important_cols_2 = ['total_rec_prncp',
 'recoveries',
 'collection_recovery_fee']

As expected, the average accuracy is ~86% and is not able to meet the target accuracy.  

In [18]:
rfc2 = ensemble.RandomForestClassifier()
X2 = df_combined[important_cols_2]
Y2 = df_combined['loan_status']

cv2 = cross_val_score(rfc2, X2, Y2, cv = 10)

In [19]:
print('The cross validation score has a range of {:0.3f} and mean of {:0.3f}'.format(cv2.max() - cv2.min(), cv2.mean()))

The cross validation score has a range of 0.068 and mean of 0.864
