# Random Forests 

A random forest is a supervised learning algorithm composed of decision trees, each trained on a randomly selected sample of the data. The algorithm gets a prediction from each tree and combines them by voting: the prediction with the most votes becomes the final prediction.

__Ensemble learning__ (or "ensembling") is simply the process of combining several models to solve a prediction problem, with the goal of producing a combined model that is more accurate than any individual model. For __classification__ problems, the combination is often done by majority vote. For __regression__ problems, the combination is often done by taking an average of the predictions. 
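
As a minimal sketch of the two combination rules, the snippet below applies a majority vote and an average to hand-written predictions from three hypothetical models. The helper names `majority_vote` and `average_prediction` and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def majority_vote(predictions):
    """Return the most common label per sample from a (n_models, n_samples) array."""
    predictions = np.asarray(predictions)
    voted = []
    for sample_preds in predictions.T:  # one column per sample
        labels, counts = np.unique(sample_preds, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

def average_prediction(predictions):
    """Return the mean prediction per sample from a (n_models, n_samples) array."""
    return np.asarray(predictions, dtype=float).mean(axis=0)

# Three models, three samples (classification)
class_preds = [[0, 1, 1],
               [0, 1, 0],
               [1, 1, 0]]
# Three models, two samples (regression)
reg_preds = [[2.0, 4.0],
             [3.0, 5.0],
             [4.0, 6.0]]

vote_result = majority_vote(class_preds)
avg_result = average_prediction(reg_preds)
print(vote_result)  # [0 1 0]
print(avg_result)   # [3. 5.]
```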

One popular method is __bootstrap aggregation__ (__bagging__), where we draw many random samples of the data (with replacement) and train a model on each sample. The models then vote on the outcome in parallel. This increases predictive accuracy by reducing variance, similar to how cross-validation reduces the variance associated with the test set approach (for estimating out-of-sample error) by splitting many times and averaging the results.
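
The idea can be sketched by hand with a few decision trees. This is a rough illustration on synthetic data, not the exact procedure a library uses: the data set, the number of trees (25), and all variable names are assumptions. A random forest goes one step further and also considers only a random subset of features at each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 and 1)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Each tree votes; the majority label wins
all_preds = np.array([tree.predict(X_demo) for tree in trees])
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
bagged_accuracy = (bagged_pred == y_demo).mean()
print('training accuracy of the bagged trees:', bagged_accuracy)
```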

Rather than building multiple independent models, __boosting__ builds models in sequence: the output of one model feeds into the next, forming a serial (daisy-chained) process.
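
One common flavor of boosting fits each new model to the errors left by the models before it. The sketch below does this for a toy regression problem with shallow trees. The data, the learning rate of 0.5, and the 20 rounds are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.5
# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
mse_start = np.mean((y_demo - prediction) ** 2)

for _ in range(20):
    # Each new tree is trained on what the current ensemble still gets wrong
    residual = y_demo - prediction
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    prediction += learning_rate * small_tree.predict(X_demo)

mse_end = np.mean((y_demo - prediction) ** 2)
print('training MSE before: %.4f, after: %.4f' % (mse_start, mse_end))
```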

The last category is __stacking__, which combines the parallel and serial approaches. In the first phase, multiple models are trained in parallel. Then, the predictions of those models are used as inputs to a final model that gives the prediction.
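
scikit-learn provides a `StackingClassifier` that follows this two-phase pattern. The example below is a small sketch on synthetic data. The choice of base models (a random forest and k-nearest neighbors) and of logistic regression as the final model is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

stack = StackingClassifier(
    # Phase 1: base models trained side by side
    estimators=[('rf', RandomForestClassifier(n_estimators=25, random_state=0)),
                ('knn', KNeighborsClassifier())],
    # Phase 2: final model trained on the base models' predictions
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_accuracy = stack.score(X_te, y_te)
print('held-out accuracy of the stacked model:', stack_accuracy)
```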

### Advantages
* Random forests are considered highly accurate and robust because many decision trees participate in the prediction.
* They are less prone to overfitting than a single decision tree. The main reason is that averaging the predictions of many trees reduces variance, so the errors of individual trees tend to cancel out.
* The algorithm can be used for both classification and regression problems.
* Some random forest implementations can also handle missing values, for example by replacing missing continuous values with the median or with a proximity-weighted average. Whether this is built in depends on the library.

### Disadvantages 
* Random forests can be slow to generate predictions because every tree in the forest has to be evaluated.
* The model is difficult to interpret compared to a decision tree, where you can easily make a decision by following the path in the tree.

### Finding important features 
Random forests also offer good feature selection indicators by showing the relative importance of each feature in a prediction. Importance is measured with the Gini index, which describes the explanatory power of a variable: if the decrease in impurity after a binary split on the variable is large, the variable is significant.
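
To make the impurity decrease concrete, here is a small worked sketch. The helper names `gini` and `impurity_decrease` and the toy labels are assumptions for illustration. A library implementation accumulates such decreases over every split and every tree.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]
# A split that separates the classes perfectly
good_split = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# A split that leaves both children as mixed as the parent
poor_split = impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])
print(good_split, poor_split)  # 0.5 0.0
```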

## Model Example 
We will be building a model on the [Lending Club](https://www.lendingclub.com/info/download-data.action) 2015 dataset to predict the status of a given loan.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
%matplotlib inline

In [8]:
!ls -lha

total 824736
drwxr-xr-x   9 tsawaengsri  staff   288B Aug 22 19:43 .
drwxr-xr-x  16 tsawaengsri  staff   512B Aug 20 14:34 ..
-rw-r--r--@  1 tsawaengsri  staff   6.0K Aug 22 18:18 .DS_Store
drwxr-xr-x   5 tsawaengsri  staff   160B Aug 21 16:57 .ipynb_checkpoints
-rw-r--r--   1 tsawaengsri  staff   208K Aug 20 16:10 1.Decision trees.ipynb
-rw-r--r--   1 tsawaengsri  staff   3.2K Aug 21 00:21 2.The id3 algorithm.ipynb
-rw-r--r--   1 tsawaengsri  staff    16K Aug 22 19:43 3.Guided example.ipynb
-rw-r--r--   1 tsawaengsri  staff   357M Aug 21 17:58 LoanStats3d_securev1.csv
-rw-r--r--   1 tsawaengsri  staff    45M Aug 22 19:45 my_beautiful_compressed_file.csv.xz


In [6]:
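# Import data from the compressed copy of the CSV
yr2015 = pd.read_csv('my_beautiful_compressed_file.csv.xz')
# Alternative: read the original, uncompressed file
# yr2015 = pd.read_csv('LoanStats3d_securev1.csv',
#                      skipinitialspace=True,
#                      header=1,
#                      skipfooter=2)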


In [7]:
# index=False keeps the row index from being saved as an extra unnamed column
yr2015.to_csv('my_beautiful_compressed_file.csv.xz', index=False)

In [None]:
yr2015.tail()

Looks like there are many rows with missing data. That is OK for now, since random forests can be made to work with missing values and we will drop the affected columns before fitting.
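
A quick way to see how much is missing per column is to average the null mask. The sketch below uses a tiny hand-made frame rather than the loan data, so the column names and values are placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data
demo = pd.DataFrame({
    'loan_amnt': [1000, 2000, 3000, 4000],
    'desc': [np.nan, np.nan, np.nan, 'car repair'],
    'annual_inc': [50000, np.nan, 62000, 48000],
})

# Share of missing values per column, largest first
missing_share = demo.isnull().mean().sort_values(ascending=False)
print(missing_share)
```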

In [None]:
yr2015.dtypes

Since there are 150 attributes in this dataset, let's start to determine our model features by exploring the categorical data first. 

## Data Cleaning

When selecting categorical variables for our model, we will use the pandas get_dummies function, which is memory intensive if there are many distinct values. To reduce the complexity of our model, we will take a look at all our categorical variables and either drop those with over 30 distinct values or convert them to numeric values.
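
The reason is that get_dummies creates one new column per distinct value. The toy frame below (made-up values, not the loan data) shows how a column with many distinct values widens the result.

```python
import pandas as pd

# Toy frame: 'term' has 2 distinct values, 'emp_title' has 4
demo = pd.DataFrame({
    'term': ['36 months', '60 months', '36 months', '60 months'],
    'emp_title': ['teacher', 'nurse', 'driver', 'engineer'],
})

wide = pd.get_dummies(demo)
print(wide.shape)  # (4, 6): 2 columns from 'term' plus 4 from 'emp_title'
```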

In [None]:
categorical = yr2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

There are a couple of columns, such as emp_title and revol_util, that have more than a thousand distinct values. Let's drop the ones with over 30 unique values, converting to numeric where it makes sense. Checking whether a numeric conversion makes sense takes a lot of exploratory code. It's a manual process that we'll skip here, including only the resulting conversions.

In [None]:
# Convert ID and Interest Rate to numeric.
yr2015['id'] = pd.to_numeric(yr2015['id'], errors='coerce')
yr2015['int_rate'] = pd.to_numeric(yr2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values
# (axis is passed by keyword because recent pandas versions no longer accept it positionally)
yr2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc','last_pymnt_d','last_credit_pull_d',
            'hardship_end_date','payment_plan_start_date','debt_settlement_flag_date'],
            axis=1, inplace=True)

In [None]:
pd.get_dummies(yr2015)

## Iteration 1

We will run the random forest classifier with all numeric variables and the categorical variables that have fewer than 30 distinct values.

In [None]:
# Instantiating the model

rfc = ensemble.RandomForestClassifier()

X = yr2015.drop('loan_status', axis=1)
X = pd.get_dummies(X)
# Dropping columns that contain NA instead of imputing because data is probably rich enough
X = X.dropna(axis=1)
Y = yr2015['loan_status']

cross_val_score(rfc, X, Y, cv=10)

The score that cross-validation reports is the accuracy of the forest on each fold. Here we're about 99% accurate.
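
Because cross_val_score returns one score per fold, it helps to summarize the folds with a mean and a spread. The sketch below does so on synthetic data, so the numbers it prints say nothing about the loan model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# One accuracy value per fold
scores = cross_val_score(RandomForestClassifier(n_estimators=25, random_state=0),
                         X_demo, y_demo, cv=10)
print('mean accuracy: %.3f, std: %.3f' % (scores.mean(), scores.std()))
```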

However, we did not refine the model, so there may be a few potential problems. Let's try to trim down as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross-validation.

## Iteration 2

Let's try to identify the features with the highest Gini importance and use those variables as features.


In all feature selection procedures, it is good practice to select the features by examining only the training set. This keeps information from the test set out of the model and helps avoid overfitting.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into 20% test and 80% training
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

#### Train a Random Forest Classifier

Here I will do the model fitting and feature selection in two steps.
* First, I fit the random forest instance created earlier, which uses the default number of trees.
* Then I use the SelectFromModel object from sklearn to automatically select the features.


In [None]:
rfc.fit(X_train, y_train)

#### Identify and Select Most Important Features

In [None]:
rfc1_fi = rfc.feature_importances_
# Feature positions sorted from least to most important
indices = np.argsort(rfc1_fi)
feat_names = X.columns

In [None]:
# Function to print the name and gini importance of each feature
def feat_importance(feat_names, model):
    for feature in zip(feat_names, model.feature_importances_):
        print(feature)

In [None]:
feat_importance(feat_names, rfc)
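
With many features, a sorted view is easier to read than the raw printout. One possible approach, sketched here on synthetic data with placeholder column names (`f0` to `f5`), is to put the importances in a pandas Series and sort it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y_demo = make_classification(n_samples=300, n_features=6,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['f%d' % i for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Importances labeled by column name, most important first
ranked = pd.Series(forest.feature_importances_,
                   index=X_demo.columns).sort_values(ascending=False)
print(ranked.head(3))
```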

In [None]:
from sklearn.feature_selection import SelectFromModel
# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.01
sfm = SelectFromModel(rfc, threshold=0.01)

# Train the selector
sfm.fit(X_train, y_train)

In [None]:
# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(feat_names[feature_list_index])

Let's use these features in the next model.

### Create A Data Subset With Only The Most Important Features

In [None]:
# Transform the data to create a new dataset containing only the most important features
# Note: We have to apply the transform to both the training X and test X data.
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)
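
Once both splits are reduced to the same columns, the held-out test set can serve as a final check. The following is a sketch of that flow on synthetic data. The names `selector`, `forest`, and the `threshold='mean'` setting are assumptions and differ from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Choose features using the training split only
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0),
                           threshold='mean')
selector.fit(X_tr, y_tr)
X_tr_small = selector.transform(X_tr)
X_te_small = selector.transform(X_te)

# Fit on the reduced training data, then score on the reduced test data
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr_small, y_tr)
test_accuracy = accuracy_score(y_te, forest.predict(X_te_small))
print('features kept:', X_tr_small.shape[1], 'test accuracy:', test_accuracy)
```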

### Train A New Random Forest Classifier Using Only Most Important Features

In [None]:
# Cross-validate a random forest on the most important features only
# (cross_val_score fits a fresh clone of rfc on each fold)

cross_val_score(rfc, X_important_train, y_train, cv=10)

There wasn't much of a change from our first iteration. Let's try using only the top 5 columns. 

In [None]:
# From the top 5 features 
feature_cols = yr2015.loc[:,['funded_amnt','installment','out_prncp','out_prncp_inv',
'total_pymnt']]

In [None]:
x1 = pd.get_dummies(feature_cols)
x1 = x1.dropna(axis=1)
y1 = Y

rfc1 = ensemble.RandomForestClassifier()

cross_val_score(rfc1, x1, y1, cv=10)

Those scores are still relatively high. Let's try to combine some features with PCA. 

In [None]:
from sklearn.decomposition import PCA
import bisect

In [None]:
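def train_pca(df, expl_var=.95):
    # Standardize the data, fit PCA, and keep enough leading components
    # to explain more than expl_var of the variance
    df = df.astype(float)
    # Drop constant columns, which would become NaN when standardized
    df = df.loc[:, df.std(ddof=0) > 0]
    df = (df-df.mean())/df.std(ddof=0)
    pca = PCA()
    pca.fit(df)
    varexp = pca.explained_variance_ratio_.cumsum()
    # Number of components needed to pass the cutoff (capped at the total)
    n_keep = min(bisect.bisect(varexp, expl_var) + 1, len(varexp))
    newcols = pd.DataFrame(pca.transform(df)[:, :n_keep],
                           columns=['PC'+str(i+1) for i in range(n_keep)],
                           index=df.index)
    return pca, newcols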

In [None]:
pca, new_df = train_pca(X_train)
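
As an alternative to picking the cutoff by hand, scikit-learn's PCA accepts a fraction between 0 and 1 as n_components and keeps just enough components to exceed that share of explained variance. The sketch below shows this on synthetic data, so the shapes it prints are unrelated to the loan data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Ten columns built as noisy mixtures of three underlying signals
X_demo = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Standardize first so that no column dominates because of its scale
X_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components asks for enough components to exceed that variance share
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_scaled)
X_reduced = pca_95.transform(X_scaled)
explained = pca_95.explained_variance_ratio_.sum()
print(X_reduced.shape, explained)
```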

Ok, maybe let's go back to dropping more columns. 

In [None]:
# From the top 3 features 
feature_cols2 = yr2015.loc[:,['funded_amnt','installment','out_prncp']]

In [None]:
x2 = pd.get_dummies(feature_cols2)
x2 = x2.dropna(axis=1)
y1 = Y

rfc1 = ensemble.RandomForestClassifier()

cross_val_score(rfc1, x2, y1, cv=10)

These scores are too low. We'll revisit later. 