# Introduction to the Random Forest model

Decision trees are great in that they are easy to build and easy to interpret. But a _serious_ flaw of theirs is that they have high variance: a small change in the training data can produce a very different tree, so a single tree can easily overfit. A way to combat this is to use more than one decision tree. We can make as many trees as we like, a whole "forest" even. Each tree is then given a vote, and the votes are aggregated into the final prediction. This __Random Forest__ can be used as a classifier or for regression. As a classifier, the aggregate is the most popular outcome among the trees. In a regression, the aggregate is typically the mean of the trees' predictions.
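As a rough illustration of the two aggregation rules, the sketch below uses made-up per-tree predictions for a single observation. The function names and the example values are hypothetical and not part of any library.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the trees' votes."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs from five trees for one observation.
class_votes = ["default", "paid", "paid", "default", "paid"]
value_predictions = [10.0, 12.0, 11.0, 13.0, 14.0]

majority = aggregate_classification(class_votes)
average = aggregate_regression(value_predictions)
print(majority, average)
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes, so the majority vote here is the textbook version of the idea.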

## Params
When building a random forest, we can set the parameters of the trees and of the forest itself. All trees share the same parameters: max features, random state, max depth of branches, and the split criterion, which can be information gain based on entropy or some other measure like [Gini impurity](https://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria).<br>
You can also control the number of estimators, or decision trees, in the forest. Deciding this is a balancing game between predictive performance and computational cost. As we add more trees to the forest, the gain from each additional tree shrinks and performance levels off. That's because there's only so much information to be learned from the data. Once the accuracy stops changing by more than an acceptable amount, we can use that point as the cutoff for the number of trees.
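To make the two split criteria mentioned above more concrete, here is a minimal sketch that computes Gini impurity and entropy for the class labels in a node. The helper names and the example labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of the node's observations that belong to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    return -sum(p * log2(p) for p in class_proportions(labels))


# Hypothetical nodes: one pure, one split evenly between two classes.
pure_node = ["paid"] * 4
mixed_node = ["paid", "paid", "default", "default"]

gini_pure, gini_mixed = gini_impurity(pure_node), gini_impurity(mixed_node)
entropy_pure, entropy_mixed = entropy(pure_node), entropy(mixed_node)
print(gini_pure, gini_mixed, entropy_pure, entropy_mixed)
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why either one can be used to score candidate splits.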

## Bagging and Random subspace
Random forests don't just create trees from the same data over and over again. That would result in highly correlated estimators, because the same highly predictive features would dominate every tree. This leads to many trees carrying the same or similar information, which in turn leads to very similar, potentially biased predictions. <br>
We fix this by applying __bagging__ (bootstrap aggregating). Each tree builds its own training set by drawing observations from the data with replacement. This means the same observation can show up more than once in one tree's training set, and in the training sets of multiple trees. This is only really a problem if the number of observations is very low. Random forests also use only a random subset of the features when deciding each split/rule. This means that for each rule we make, we only consider the __random subspace__ spanned by a random subset of _some_ of the features as candidates for the rule. This helps to avoid the correlation problem of having the same features used for every split. As a general rule for a dataset with $x$ features, classifiers use $\sqrt{x}$ features per split and regressors use $x/3$.
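The two sources of randomness described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function names, the row count, and the feature names `f0` to `f8` are all hypothetical.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one tree's training sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(feature_names, task, rng):
    """Pick the candidate features for one split using the rule of thumb above."""
    n_features = len(feature_names)
    if task == "classification":
        k = max(1, round(math.sqrt(n_features)))  # sqrt(x) features
    else:
        k = max(1, n_features // 3)  # x / 3 features
    return rng.sample(feature_names, k)


rng = random.Random(0)

# Bagging: some rows are drawn several times, others not at all.
rows = bootstrap_indices(1000, rng)
unique_fraction = len(set(rows)) / len(rows)

# Random subspace: each split only sees a few of the nine features.
features = [f"f{i}" for i in range(9)]
clf_subset = random_feature_subset(features, "classification", rng)
reg_subset = random_feature_subset(features, "regression", rng)
print(unique_fraction, clf_subset, reg_subset)
```

With a bootstrap sample the same size as the data, roughly 63% of the distinct rows end up in any one tree's training set, so each tree sees a noticeably different slice of the data.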

## Advantages and Disadvantages
### Pros
* Very flexible and require little input preparation
* They can handle most data types (in scikit-learn, categorical features first need to be encoded numerically)
* Quick to train, with low variance and high accuracy (good for use as a benchmark model)

### Cons
* Showcases the biggest weakness of supervised learning: it can only predict what it has already seen.
* Can get very large, computationally expensive, and complex if the trees grow too deep
* Very much a __black box__ model, in that an output is given with very little insight as to how we got there
* You don't get much insight into the prediction process beyond aggregate feature importances, so you can't easily visualize it or learn about the underlying process

Overall, a random forest works well as a first model on a new problem, and its results can be validated later on, for example with cross-validation.

## Intro to Ensemble Models
At its core, an ensemble is a model that is made up of other models. The component models are usually simple ones, but you can technically combine most models together to form an ensemble.

### Ways to assemble an ensemble model
* __Bagging__: You take random samples of the data and train a model on each one. The models then vote together to give an outcome, based either on the majority (classifiers) or the mean of the results (regressions). Ex: Random Forest
* __Boosting__: Models are trained in series, with each new model using the output of the previous one (typically its errors) as input. The chain continues until a stop condition is reached.
* __Stacking__: A two-phase process. 1. Multiple models are trained in parallel. 2. The output from those models is used as input for a final model that gives the prediction. Essentially, it combines the parallel approach of bagging with the serialization of boosting, forming a hybrid model.
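To contrast the parallel and serial styles listed above, here is a small sketch of bagging and boosting built on a one-split regression "stump". It is a toy under stated assumptions, not how any particular library implements these methods: the helper names, the toy data `y = x * x`, and the learning rate of 0.5 are all made up for illustration, and stacking is not shown.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump: predict the mean of y on each side."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not right:
            continue  # a split must leave points on both sides
        left_mean, right_mean = mean(left), mean(right)
        error = sum((y - (left_mean if x <= threshold else right_mean)) ** 2
                    for x, y in zip(xs, ys))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x identical: fall back to a constant prediction
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagged(xs, ys, n_models, rng):
    """Bagging: train each stump on its own bootstrap sample, then average."""
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


def fit_boosted(xs, ys, n_rounds, learning_rate=0.5):
    """Boosting: each stump is fit to the residuals left by the previous ones."""
    models = []
    residuals = list(ys)
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(learning_rate * m(x) for m in models)


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))


xs = list(range(20))
ys = [x * x for x in xs]

bagged = fit_bagged(xs, ys, n_models=25, rng=random.Random(0))
boosted_1 = fit_boosted(xs, ys, n_rounds=1)
boosted_20 = fit_boosted(xs, ys, n_rounds=20)

print("bagged MSE:           ", mse(bagged, xs, ys))
print("boosted MSE, 1 round: ", mse(boosted_1, xs, ys))
print("boosted MSE, 20 rounds:", mse(boosted_20, xs, ys))
```

The bagged stumps could be trained in any order or in parallel, since none depends on another. The boosted stumps cannot, because each one needs the residuals produced by the stumps before it.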

Ensemble models are usually used for their high accuracy and low variance, which come from combining many internal models. However, techniques such as boosting are prone to overfitting. You also lose a lot of the transparency of single models, as in the random forest example: decision trees are easy to extract information from, but turn them into a random forest and we lose a lot of the internal information about the model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
%matplotlib inline

In [2]:
# Replace the path with the correct path for your data.
y2015 = pd.read_csv(
    'https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1',
    skipinitialspace=True,
    header=1
)
# Note the warning about dtypes.



## Data cleaning

In [3]:
categorical = y2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

id
421097
term
2
int_rate
110
grade
7
sub_grade
35
emp_title
120812
emp_length
11
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
1
url
421095
desc
34
purpose
14
title
27
zip_code
914
addr_state
49
earliest_cr_line
668
revol_util
1211
initial_list_status
2
last_pymnt_d
25
next_pymnt_d
4
last_credit_pull_d
26
application_type
2
verification_status_joint
3


In [4]:
# Convert ID and Interest Rate to numeric.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

In [6]:
# Variables related to time, which tend to contribute little information
# and would create too many dummy variables.
# errors='ignore' keeps this cell safe to re-run once the columns are already gone.
y2015.drop(
    ['last_pymnt_d', 'next_pymnt_d', 'title', 'issue_d', 'last_pymnt_amnt',
     'last_credit_pull_d', 'member_id', 'id', 'emp_length'],
    axis=1, inplace=True, errors='ignore')

In [7]:
print(len(y2015.columns))
y2015.columns

94


Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'grade', 'home_ownership', 'annual_inc',
       'verification_status', 'loan_status', 'pymnt_plan', 'purpose', 'dti',
       'delinq_2yrs', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv',
       'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint',
       'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt',
       'tot_cur_bal', 'open_acc_6m', 'open_il_6m', 'open_il_12m',
       'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util',
       'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util',
       'total_rev_

In [8]:
# Drop the last two rows, which appear to be summary footer lines rather than loan records.
y2015 = y2015[:-2]

## Fitting model

In [9]:
rfc = ensemble.RandomForestClassifier()
X = y2015.drop(['loan_status'], axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)
# There are 400,000+ rows, so it's OK not to impute missing values. We can just drop
# the columns that contain them and still have rich enough information.
X = X.dropna(axis=1)

cross_val_score(rfc, X, Y, cv=10)

array([0.95089169, 0.96426112, 0.96255135, 0.96115029, 0.96053194,
       0.96060318, 0.960531  , 0.96048351, 0.96038663, 0.9602432 ])

In [10]:
rfc.fit(X, Y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [30]:
feat_labels = X.columns
clean_col = []
# Keep the features whose importance in the fitted forest exceeds 0.009.
for feature in zip(feat_labels, rfc.feature_importances_):
    if (feature[1]>.009):
        print(feature)
        clean_col.append(feature[0])
# Exclude the loan-amount columns from the reduced feature set.
remove_loan = ['loan_amnt', 'funded_amnt_inv', 'funded_amnt']
clean_col = [x for x in clean_col if x not in remove_loan]

('loan_amnt', 0.009550317979231799)
('funded_amnt', 0.055085643109397356)
('open_acc', 0.243768274900885)
('pub_rec', 0.306878827921759)
('revol_bal', 0.07031539328243734)
('total_acc', 0.049392207749760136)
('out_prncp', 0.07419500147437742)
('out_prncp_inv', 0.03119600501887909)
('total_pymnt_inv', 0.024704989832801626)
('total_rec_prncp', 0.015222540801252857)


In [28]:
# No 'loan_amnt', 'funded_amnt_inv', 'funded_amnt'
print(clean_col)
# Chain dropna instead of calling it in place on a slice, which can raise SettingWithCopyWarning.
X_new = y2015.loc[:, clean_col].dropna(axis=1)

['open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt_inv', 'total_rec_prncp']


In [29]:
cross_val_score(rfc, X_new, Y, cv=10)

array([0.89985989, 0.96606587, 0.9625751 , 0.95846691, 0.95746853,
       0.95732605, 0.95523522, 0.9560664 , 0.95658679, 0.95537453])

In [38]:
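# Caveat: cross_val_score fits clones of rfc, so rfc still holds the importances from
# the earlier fit on X. Without an rfc.fit(X_new, Y) first, zip() pairs those older
# values with X_new's column names, so the names printed here do not match the values.
feat_labels = X_new.columns
v3_col = []
for feature in zip(feat_labels, rfc.feature_importances_):
    if (feature[1]>.005):
        print(feature)
        v3_col.append(feature[0])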


('open_acc', 0.009550317979231799)
('pub_rec', 0.055085643109397356)
('revol_bal', 0.005412974573836585)
('total_acc', 0.005674197414755255)
('total_rec_prncp', 0.007994133185355605)
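One thing to watch when reading `feature_importances_` after a call to `cross_val_score`: the function fits clones of the estimator and leaves the object you passed in untouched. The sketch below shows this on a small synthetic dataset. The data from `make_classification` and the names `rfc_demo`, `X_demo`, and `X_small` are assumptions made for illustration and are unrelated to the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 8 features, then a reduced view with only the first 4.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_small = X_demo[:, :4]

rfc_demo = RandomForestClassifier(n_estimators=20, random_state=0)
rfc_demo.fit(X_demo, y_demo)

# Cross-validating on the reduced data does not refit rfc_demo itself.
cross_val_score(rfc_demo, X_small, y_demo, cv=3)
n_before = len(rfc_demo.feature_importances_)  # still one value per original feature

# An explicit fit on the reduced data is needed to get importances for it.
rfc_demo.fit(X_small, y_demo)
n_after = len(rfc_demo.feature_importances_)

print(n_before, n_after)
```

Applied to the cells here, that would mean calling `rfc.fit(X_new, Y)` before zipping `X_new.columns` with `rfc.feature_importances_`.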


In [44]:
# Chain dropna instead of calling it in place on a slice, which can raise SettingWithCopyWarning.
X_v3 = y2015.loc[:, v3_col].dropna(axis=1)

In [46]:
len(y2015)

421095

In [48]:
X_v3 = pd.concat([X_v3, y2015['out_prncp']], axis=1 )

In [49]:
cross_val_score(rfc, X_v3, Y, cv=10)

array([0.93381777, 0.95207903, 0.95055924, 0.94894446, 0.94735217,
       0.94533365, 0.94438244, 0.94393123, 0.94380982, 0.93817983])