# Random Forests 

A random forest is a supervised learning algorithm built from many decision trees, each trained on a randomly selected sample of the data. For classification, the algorithm collects a prediction from every tree, and the class with the most votes becomes the final prediction.

__Ensemble learning__ (or "ensembling") is simply the process of combining several models to solve a prediction problem, with the goal of producing a combined model that is more accurate than any individual model. For __classification__ problems, the combination is often done by majority vote. For __regression__ problems, the combination is often done by taking an average of the predictions. 
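As a rough illustration of the two combination rules, here is a minimal sketch. The prediction arrays are made up for the example, and `majority_vote` is a hypothetical helper rather than part of any library.

```python
import numpy as np

# Hypothetical predictions from three classifiers for five samples
# (rows = models, columns = samples); the labels are invented.
class_preds = np.array([
    [0, 1, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [1, 1, 1, 0, 0],
])

def majority_vote(preds):
    """Return the most common label per sample (ties go to the smallest label)."""
    return np.array([np.bincount(column).argmax() for column in preds.T])

# Hypothetical predictions from three regressors for the same five samples.
reg_preds = np.array([
    [2.0, 3.5, 1.0, 4.0, 0.5],
    [2.5, 3.0, 1.5, 4.0, 1.0],
    [1.5, 4.0, 0.5, 4.0, 1.5],
])

voted = majority_vote(class_preds)   # classification: majority vote
averaged = reg_preds.mean(axis=0)    # regression: average of the predictions
print(voted)
print(averaged)
```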

One popular method is __bootstrap aggregation__ (__bagging__), where we draw several random samples of the data (with replacement) and train a model on each sample. The models then vote on the outcome in parallel. This increases predictive accuracy by reducing variance, similar to how cross-validation reduces the variance associated with the test set approach (for estimating out-of-sample error) by splitting many times and averaging the results.
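A hand-rolled sketch of bagging with decision trees may make the idea concrete. It uses synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, and `trees` are chosen only for this example. A real random forest also samples a random subset of features at each split, which this sketch leaves out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Every tree votes; with labels 0/1 the majority is a mean of at least 0.5.
votes = np.stack([tree.predict(X_demo) for tree in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the bagged vote:", (bagged_pred == y_demo).mean())
```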

Rather than training models independently, __boosting__ trains them in sequence: the output of one model feeds into the next, so each new model can correct the errors of the previous ones, forming a serial (daisy-chained) process.
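One common form of boosting fits each new model to the residuals of the ensemble so far. The sketch below assumes a squared-error regression problem on synthetic data. The learning rate, tree depth, and number of rounds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.full_like(y_demo, y_demo.mean())   # start from the mean
errors = [np.mean((y_demo - prediction) ** 2)]

for _ in range(20):
    # Each new tree is trained on what the current ensemble still gets wrong.
    residual = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    prediction += learning_rate * tree.predict(X_demo)
    errors.append(np.mean((y_demo - prediction) ** 2))

print("training MSE before and after boosting:", errors[0], errors[-1])
```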

The last category is __stacking__. In the first phase, multiple models are trained in parallel. Then the predictions of those models are used as inputs to a final model, which gives the overall prediction.
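scikit-learn provides a `StackingClassifier` that follows this two-phase pattern. The sketch below runs on synthetic data. The choice of base models (a random forest and k-nearest neighbors) and of logistic regression as the final model is an assumption made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

stack = StackingClassifier(
    # Phase 1: base models trained independently of each other.
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    # Phase 2: a final model learns from the base models' predictions.
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_accuracy = stack.score(X_te, y_te)
print("held-out accuracy:", stack_accuracy)
```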

### Advantages
* Random forests are considered highly accurate and robust because many decision trees participate in each prediction.
* They are less prone to overfitting than a single decision tree, mainly because averaging the predictions of many trees reduces variance.
* The algorithm can be used for both classification and regression problems.
* Some implementations can also handle missing values, for example by replacing missing continuous values with the median or by computing a proximity-weighted average of the missing values.

### Disadvantages 
* Random forests can be slow to generate predictions because every tree in the forest has to be evaluated.
* The model is difficult to interpret compared to a decision tree, where you can easily make a decision by following the path in the tree.

### Finding important features 
Random forests also offer good feature selection indicators by showing the relative importance of each feature in a prediction. The Gini index is used to describe the explanatory power of each variable: if impurity decreases substantially after a binary split on a variable, then that variable is significant.
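To make "decrease of impurity" concrete, here is a small sketch that computes the Gini impurity of a node and the decrease produced by a binary split. The helpers `gini` and `impurity_decrease` and the toy labels are hypothetical, written only for this illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_children

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the classes perfectly removes all impurity.
good_split = impurity_decrease(parent, parent[:4], parent[4:])
# A split that leaves both children half 0s and half 1s removes none.
poor_split = impurity_decrease(parent, parent[::2], parent[1::2])
print(good_split, poor_split)
```

A feature's importance in the forest is built up from decreases like `good_split`, accumulated over every split that uses the feature.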

## Model Example 
We will build a model on the [Lending Club](https://www.lendingclub.com/info/download-data.action) 2015 dataset to predict the status of a loan.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
%matplotlib inline

In [2]:
# Import data
# skipfooter is only supported by the Python parsing engine
yr2015 = pd.read_csv('LoanStats3d_securev1.csv',
                    skipinitialspace=True,
                    header=1,
                    skipfooter=2,
                    engine='python')

In [3]:
yr2015.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
421090,36371250,,10000,10000,10000.0,36 months,11.99%,332.1,B,B5,...,,,,N,,,,,,
421091,36441262,,24000,24000,24000.0,36 months,11.99%,797.03,B,B5,...,,,,N,,,,,,
421092,36271333,,13000,13000,13000.0,60 months,15.99%,316.07,D,D2,...,,,,N,,,,,,
421093,36490806,,12000,12000,12000.0,60 months,19.99%,317.86,E,E3,...,,,,N,,,,,,
421094,36271262,,20000,20000,20000.0,36 months,11.99%,664.2,B,B5,...,,,,N,,,,,,


It looks like many columns have missing data. We will drop those columns before fitting, since the scikit-learn version used here does not accept missing values in a random forest.

In [4]:
yr2015.dtypes

id                         int64
member_id                float64
loan_amnt                  int64
funded_amnt                int64
funded_amnt_inv          float64
                          ...   
settlement_status         object
settlement_date           object
settlement_amount        float64
settlement_percentage    float64
settlement_term          float64
Length: 150, dtype: object

Since there are 150 attributes in this dataset, let's start to determine our model features by exploring the categorical data first. 

## Data Cleaning

When encoding categorical variables for our model, we will use the `get_dummies` function, which is memory intensive if there are many distinct values. To reduce the complexity of our model, we will look at all our categorical variables and either convert those with over 30 distinct values to numeric values or drop them.

In [5]:
categorical = yr2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

term
2
int_rate
111
grade
7
sub_grade
35
emp_title
120812
emp_length
11
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
2
url
421095
desc
34
purpose
14
title
27
zip_code
914
addr_state
49
earliest_cr_line
668
revol_util
1211
initial_list_status
2
last_pymnt_d
55
next_pymnt_d
5
last_credit_pull_d
56
application_type
2
verification_status_joint
1
hardship_flag
2
hardship_type
1
hardship_reason
9
hardship_status
3
hardship_start_date
30
hardship_end_date
31
payment_plan_start_date
31
hardship_loan_status
4
debt_settlement_flag
2
debt_settlement_flag_date
48
settlement_status
3
settlement_date
51


A few columns, such as emp_title and revol_util, have more than a thousand distinct values. Let's drop the columns with over 30 unique values, converting to numeric where it makes sense. Checking whether a numeric conversion makes sense takes a fair amount of exploratory code. It's a manual process that we'll skip here, and just include the resulting conversion.

In [6]:
# Convert ID and interest rate to numeric.
yr2015['id'] = pd.to_numeric(yr2015['id'], errors='coerce')
yr2015['int_rate'] = pd.to_numeric(yr2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values
# (axis is passed by keyword; the positional form is not accepted by recent pandas)
yr2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc','last_pymnt_d','last_credit_pull_d',
            'hardship_end_date','payment_plan_start_date','debt_settlement_flag_date'],
            axis=1, inplace=True)
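
The manual check mentioned above might look something like this sketch. The helper `numeric_fraction` and the toy Series are hypothetical, not code from the original analysis. The idea is simply to see what share of a column's values survive `pd.to_numeric`.

```python
import pandas as pd

def numeric_fraction(series, strip_chars=None):
    """Fraction of non-null values that parse as numbers after optional stripping."""
    values = series.dropna().astype(str)
    if strip_chars:
        values = values.str.strip(strip_chars)
    parsed = pd.to_numeric(values, errors="coerce")   # unparseable values become NaN
    return parsed.notna().mean()

# Toy columns standing in for int_rate and emp_title.
rates = pd.Series(["11.99%", "15.99%", None, "19.99%"])
titles = pd.Series(["Teacher", "Nurse", "Driver", "42"])

print(numeric_fraction(rates, "%"))   # every value parses once '%' is stripped
print(numeric_fraction(titles))       # mostly free text, so conversion makes little sense
```

A column that converts almost entirely is a reasonable candidate for `pd.to_numeric`, while one that mostly fails is better dropped or encoded another way.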

In [7]:
pd.get_dummies(yr2015)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,...,settlement_date_Nov-2017,settlement_date_Nov-2018,settlement_date_Oct-2015,settlement_date_Oct-2016,settlement_date_Oct-2017,settlement_date_Oct-2018,settlement_date_Sep-2015,settlement_date_Sep-2016,settlement_date_Sep-2017,settlement_date_Sep-2018
0,68355089,,24700,24700,24700.0,11.99,820.28,65000.0,16.06,1,...,0,0,0,0,0,0,0,0,0,0
1,66310712,,35000,35000,35000.0,14.85,829.90,110000.0,17.06,0,...,0,0,0,0,0,0,0,0,0,0
2,68407277,,3600,3600,3600.0,13.99,123.03,55000.0,5.91,0,...,0,0,0,0,0,0,0,0,0,0
3,68476807,,10400,10400,10400.0,22.45,289.91,104433.0,25.37,1,...,0,0,0,0,0,0,0,0,0,0
4,68341763,,20000,20000,20000.0,10.78,432.66,63000.0,10.78,0,...,0,0,0,0,0,0,0,0,0,0
5,68426831,,11950,11950,11950.0,13.44,405.18,34000.0,10.20,0,...,0,0,0,0,0,0,0,0,0,0
6,68426545,,16000,16000,16000.0,12.88,363.07,70000.0,26.40,0,...,0,0,0,0,0,0,0,0,0,0
7,68476668,,20000,20000,20000.0,9.17,637.58,180000.0,14.67,0,...,0,0,0,0,0,0,0,0,0,0
8,68338832,,1400,1400,1400.0,12.88,47.10,64000.0,34.95,0,...,0,0,0,0,0,0,0,0,0,0
9,66624733,,18000,18000,18000.0,19.48,471.70,150000.0,9.39,0,...,0,0,0,0,0,0,0,0,0,0
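
The width of the encoded frame follows directly from how one-hot encoding works: every distinct value of a categorical column becomes its own column. A toy sketch (the `toy` frame is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months", "60 months"],  # 2 distinct values
    "emp_title": ["Teacher", "Nurse", "Driver", "Engineer"],       # 4 distinct values
    "loan_amnt": [10000, 24000, 13000, 12000],                     # numeric, left as-is
})

encoded = pd.get_dummies(toy)
print(encoded.shape)           # 1 numeric column + 2 + 4 indicator columns
print(list(encoded.columns))
```

This is why a column such as emp_title, with over a hundred thousand distinct values, is dropped rather than encoded.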


## Iteration 1

We will run the random forest classifier with all numeric variables and some categorical variables that have fewer than 30 distinct values.

In [8]:
# Instantiate the model
rfc = ensemble.RandomForestClassifier()

X = yr2015.drop('loan_status', axis=1)
X = pd.get_dummies(X)
# Drop columns that contain NA instead of imputing, because the data is probably rich enough
X = X.dropna(axis=1)
Y = yr2015['loan_status']

cross_val_score(rfc, X, Y, cv=10)



array([0.99024078, 0.9896234 , 0.99043005, 0.99076229, 0.99121328,
       0.99111829, 0.9914266 , 0.99116536, 0.9915216 , 0.9914266 ])

The score that cross-validation reports for each fold is the accuracy of the forest on that fold. Here we're about 99% accurate.
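
To compare iterations, it helps to reduce the ten fold scores to a mean and a spread. A small sketch, with the scores copied from the output above:

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
scores = np.array([0.99024078, 0.9896234, 0.99043005, 0.99076229, 0.99121328,
                   0.99111829, 0.9914266, 0.99116536, 0.9915216, 0.9914266])

print(f"mean accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```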

However, we have not refined the model, so there may be a few potential problems. Let's try to trim down as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross-validation.

## Iteration 2

Let's try to identify the features with the highest Gini importance and use those as the model's features.

In any feature selection procedure, it is good practice to select the features by examining only the training set. This keeps information from the test set out of the selection step, which helps avoid overfitting and an overly optimistic accuracy estimate.

In [10]:
from sklearn.model_selection import train_test_split

# Split the data into 20% test and 80% training
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

#### Train a Random Forest Classifier

Here I will do the model fitting and the feature selection in two short steps.
* First, I fit the random forest instance created earlier, which uses the default number of trees.
* Then I use the `SelectFromModel` object from sklearn to automatically select the features.


In [11]:
rfc.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

#### Identify and Select Most Important Features

In [12]:
rfc1_fi = rfc.feature_importances_
indices = np.argsort(rfc1_fi)  # feature positions, sorted by ascending importance
feat_names = X.columns

In [31]:
# Function to print the name and gini importance of each feature
def feat_importance(feat_names, model):
    for feature in zip(feat_names, model.feature_importances_):
        print(feature)

In [32]:
feat_importance(feat_names, rfc)

('id', 0.0015280826259228648)
('loan_amnt', 0.012873405134428046)
('funded_amnt', 0.015997063452009724)
('funded_amnt_inv', 0.005191760142525622)
('int_rate', 0.0016693503056198701)
('installment', 0.013637818565100721)
('annual_inc', 0.001478887941333025)
('delinq_2yrs', 0.0003453288213054866)
('fico_range_low', 0.0009214420020791519)
('fico_range_high', 0.0008319111279550576)
('inq_last_6mths', 0.0004229116944076966)
('open_acc', 0.0009095573808778446)
('pub_rec', 0.0002538342520389341)
('revol_bal', 0.001774133006955125)
('total_acc', 0.0011152326812746342)
('out_prncp', 0.12241392310488304)
('out_prncp_inv', 0.04300599800035037)
('total_pymnt', 0.017547902797376655)
('total_pymnt_inv', 0.02656794278232938)
('total_rec_prncp', 0.08384471466987911)
('total_rec_int', 0.021224309765299147)
('total_rec_late_fee', 0.002039659726045958)
('recoveries', 0.11400193787572148)
('collection_recovery_fee', 0.08628146154170543)
('last_pymnt_amnt', 0.06768318392930682)
('last_fico_range_high', 0.1
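
An unsorted listing like the one above is hard to scan. One option is to put the importances in a pandas Series and sort it. The sketch below uses a handful of values copied from the listing. With a fitted model, the Series could instead be built from `model.feature_importances_` and `feat_names`.

```python
import pandas as pd

# A few (name, importance) pairs copied from the listing above.
importances = pd.Series({
    "loan_amnt": 0.012873405134428046,
    "int_rate": 0.0016693503056198701,
    "out_prncp": 0.12241392310488304,
    "total_rec_prncp": 0.08384471466987911,
    "recoveries": 0.11400193787572148,
    "collection_recovery_fee": 0.08628146154170543,
})

# Rank from most to least important and keep the top three.
top3 = importances.sort_values(ascending=False).head(3)
print(top3)
```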

In [33]:
from sklearn.feature_selection import SelectFromModel
# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.01
sfm = SelectFromModel(rfc, threshold=0.01)

# Train the selector
sfm.fit(X_train, y_train)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
        max_features=None, norm_order=1, prefit=False, threshold=0.01)

In [35]:
# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(feat_names[feature_list_index])

funded_amnt
installment
out_prncp
out_prncp_inv
total_pymnt
total_pymnt_inv
total_rec_prncp
total_rec_int
recoveries
collection_recovery_fee
last_pymnt_amnt
last_fico_range_high
last_fico_range_low
term_ 36 months
term_ 60 months
next_pymnt_d_Aug-2019
next_pymnt_d_Jul-2019
debt_settlement_flag_Y


Let's use these selected features in the next model.

### Create A Data Subset With Only The Most Important Features

In [36]:
# Transform the data to create a new dataset containing only the most important features
# Note: We have to apply the transform to both the training X and test X data.
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)
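
An alternative that keeps feature selection inside each training fold is to wrap the selector and the classifier in a scikit-learn `Pipeline`, so that cross-validation refits the selector on every fold. The sketch below uses synthetic data. The `"mean"` threshold and the 50-tree forests are assumptions for the example, not settings from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, only a few of them informative.
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=4, random_state=0)

pipe = Pipeline([
    # Step 1: keep features whose importance is at least the mean importance.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0), threshold="mean")),
    # Step 2: fit the final forest on the selected features only.
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Both steps are refit on the training part of every fold.
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(pipe_scores)
```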

### Train A New Random Forest Classifier Using Only Most Important Features

In [38]:
# Cross-validate a random forest on the most important features only.
# cross_val_score fits a fresh clone of rfc on each fold.
cross_val_score(rfc, X_important_train, y_train, cv=10)

array([0.99397447, 0.99436018, 0.99441954, 0.99391475, 0.99388506,
       0.99355854, 0.99367727, 0.99459716, 0.99320172, 0.99358765])

There wasn't much of a change from our first iteration. Let's try using only the first five of the selected columns.

In [40]:
# First 5 of the selected features (in column order, not ranked by importance)
feature_cols = yr2015.loc[:,['funded_amnt','installment','out_prncp','out_prncp_inv',
                             'total_pymnt']]

In [41]:
x1 = pd.get_dummies(feature_cols)
x1 = x1.dropna(axis=1)
y1 = Y

rfc1 = ensemble.RandomForestClassifier()

cross_val_score(rfc1, x1, y1, cv=10)



array([0.96155673, 0.9747115 , 0.97202631, 0.97337924, 0.97143129,
       0.97045762, 0.9686038 , 0.96770133, 0.96763009, 0.96677512])

Those scores are still relatively high. Let's try to combine some features with PCA. 

In [43]:
from sklearn.decomposition import PCA
import bisect

In [44]:
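def train_pca(df, expl_var=.95):
    """Fit PCA on standardized data and keep the leading components
    whose cumulative explained variance reaches expl_var."""
    pca = PCA()
    df = df.copy()
    # Standardize each column (zero mean, unit variance)
    df = (df-df.mean())/df.std(ddof=0)
    pca.fit(df)
    varexp = pca.explained_variance_ratio_.cumsum()
    # Index of the first component at which cumulative variance exceeds expl_var
    cutoff = bisect.bisect(varexp, expl_var)
    newcols = pd.DataFrame(pca.transform(df)[:, :cutoff+1],
                           columns=['PC'+str(i+1) for i in range(cutoff+1)],
                           index=df.index)
    return pca, newcols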

In [45]:
pca, new_df = train_pca(X_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

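The standardization step is the most likely culprit: a column with zero variance in the training set gives 0/0 = NaN when divided by its standard deviation. Rather than fix that now, let's go back to dropping more columns.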
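
For reference, one way around the error would be to drop zero-variance columns before standardizing. The sketch below assumes that constant columns are what produced the NaN values, which was not verified on the loan data. `train_pca_safe` and the `demo` frame are hypothetical names, and the function lets `PCA` pick the number of components by passing a variance fraction as `n_components`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def train_pca_safe(df, expl_var=0.95):
    """Standardize, drop zero-variance columns, and keep enough components
    to explain at least `expl_var` of the variance."""
    df = df.astype(float)
    df = df.loc[:, df.std(ddof=0) > 0]           # constant columns would become NaN
    scaled = (df - df.mean()) / df.std(ddof=0)
    # A float n_components asks PCA for that fraction of explained variance.
    pca = PCA(n_components=expl_var, svd_solver="full")
    comps = pca.fit_transform(scaled)
    cols = [f"PC{i + 1}" for i in range(comps.shape[1])]
    return pca, pd.DataFrame(comps, columns=cols, index=df.index)

# Toy data with one nearly redundant column and one constant column.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
demo["e"] = demo["a"] * 2 + 0.01 * rng.normal(size=100)
demo["const"] = 1.0

pca_demo, pcs = train_pca_safe(demo)
print(pcs.shape)
```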

In [47]:
# First 3 of the selected features (in column order, not ranked by importance)
feature_cols2 = yr2015.loc[:,['funded_amnt','installment','out_prncp']]

In [48]:
x2 = pd.get_dummies(feature_cols2)
x2 = x2.dropna(axis=1)
y1 = Y

rfc1 = ensemble.RandomForestClassifier()

cross_val_score(rfc1, x2, y1, cv=10)



array([0.75654177, 0.79942537, 0.7981525 , 0.79857516, 0.800019  ,
       0.79716925, 0.79430973, 0.79255231, 0.79744461, 0.76811457])

These scores fall below our 90% target, so three features are too few. We'll revisit this later.