# Random Forest

We have learned about decision trees and how they predict the class or value of a response variable by learning simple decision rules inferred from training data.

The random forest is an algorithm that consists of many decision trees. It builds a set of **random** individual trees and combines them into an uncorrelated **forest**.

It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees. As a result, the forest's predictions are usually more accurate than those of any individual tree.

Two key concepts that give it the name **random**:<br/>
(1) A random sampling of training data set when building trees.<br/>
(2) Random subsets of features considered when splitting nodes.<br/>

Random forest keeps the behavior of each individual tree from being too correlated with that of any other tree in the model by using the following two methods:<br/>
**Bagging** (bootstrap aggregation): each tree is trained on a random sample drawn from the data set with replacement, so different trees see different data. Suppose there are N rows in the training data set. Each tree is still built from N rows, but some rows are repeated and others are left out.<br/>

**Random feature selection**: only a random subset of the features is considered as split candidates, so different trees end up relying on different features.<br/>

This forces even more variation amongst the trees in the model and ultimately results in low correlation across trees and more diversification.
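
As a rough illustration of bagging, the sketch below draws one bootstrap sample of row indices using only Python's standard library. The sample size `n_rows = 1000` and the seed are arbitrary choices for illustration, not values from this lesson.

```python
import random

n_rows = 1000              # hypothetical size of the training set
rng = random.Random(0)     # fixed seed so the sketch is repeatable

# Bagging: draw n_rows row indices *with replacement*.
bootstrap_idx = rng.choices(range(n_rows), k=n_rows)

# Some rows repeat, so only part of the original rows appear in the sample.
unique_fraction = len(set(bootstrap_idx)) / n_rows
print(len(bootstrap_idx), round(unique_fraction, 3))
```

For large N, roughly 63% of the distinct rows (1 - 1/e) end up in each bootstrap sample; the remaining rows are left out of that tree's training data.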

Final prediction is done by **ensemble** approach using individual tree prediction results. 

**Classification Ensemble:** Each individual tree in the random forest results in a class prediction and the class with the most votes becomes Random Forest model’s prediction.

**Regression Ensemble:** Each individual tree in the random forest results in a value prediction and the average value of all predictions becomes the Random Forest model's prediction.

Combining uncorrelated trees can produce ensemble predictions that are more accurate than any of the individual predictions. This explains why **Random Forest** models tend to produce more accurate results than **Decision Tree** models.
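
The two ensemble rules described above can be sketched with the standard library alone. The five tree predictions below are made up for illustration. Note also that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is a conceptual sketch only.

```python
from collections import Counter
from statistics import mean

# Hypothetical predictions from five individual trees for one observation.
class_votes = ['Yes', 'No', 'Yes', 'Yes', 'No']
value_predictions = [21.0, 24.5, 22.0, 23.5, 24.0]

# Classification ensemble: the class with the most votes wins.
forest_class = Counter(class_votes).most_common(1)[0][0]

# Regression ensemble: the average of the individual predictions.
forest_value = mean(value_predictions)

print(forest_class)   # Yes
print(forest_value)   # 23.0
```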

Let's see how Decision Tree and Random Forest differ based on a given small data set.

**Decision Tree** model will use the data for training and create predictions based on a single tree structure.

In [1]:
import pandas as pd
df = pd.read_csv('data/play.csv')
df

Unnamed: 0,Day,Weather,Temperature,Humidity,Wind,Play
0,1,Sunny,Hot,90,10,No
1,2,Cloudy,Hot,95,5,Yes
2,3,Sunny,Mild,70,30,Yes
3,4,Cloudy,Mild,89,25,Yes
4,5,Rainy,Mild,85,25,No
5,6,Rainy,Cool,60,30,No
6,7,Rainy,Mild,92,20,Yes
7,8,Sunny,Hot,95,20,No
8,9,Cloudy,Hot,65,12,Yes
9,10,Rainy,Mild,100,25,No


Bagging (random sample from the dataset with replacement) and Random Feature Selection<br/>
**Data for Tree-1:**
![image.png](attachment:image.png)

**Data for Tree-2**
![image.png](attachment:image.png)

**Data for Tree-3**
![image.png](attachment:image.png)

**Data for Tree-4**
![image.png](attachment:image.png)
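
Per-tree data sets in the spirit of the ones illustrated above can be generated with pandas. This is only a sketch: the helper `make_tree_data`, the choice of two features per tree, and the seeds are assumptions for illustration. It also picks the feature subset once per tree to keep things simple, whereas scikit-learn draws a fresh random subset of features at every split.

```python
import pandas as pd

# The play data shown above, typed in by hand so the sketch is self-contained.
play = pd.DataFrame({
    'Weather': ['Sunny', 'Cloudy', 'Sunny', 'Cloudy', 'Rainy',
                'Rainy', 'Rainy', 'Sunny', 'Cloudy', 'Rainy'],
    'Temperature': ['Hot', 'Hot', 'Mild', 'Mild', 'Mild',
                    'Cool', 'Mild', 'Hot', 'Hot', 'Mild'],
    'Humidity': [90, 95, 70, 89, 85, 60, 92, 95, 65, 100],
    'Wind': [10, 5, 30, 25, 25, 30, 20, 20, 12, 25],
    'Play': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No'],
})
feature_cols = ['Weather', 'Temperature', 'Humidity', 'Wind']

def make_tree_data(df, n_features, seed):
    """Return one bootstrap sample restricted to a random feature subset."""
    # Bagging: sample as many rows as the original, with replacement.
    rows = df.sample(n=len(df), replace=True, random_state=seed)
    # Feature randomness: keep only n_features randomly chosen feature columns.
    cols = pd.Series(feature_cols).sample(n=n_features, random_state=seed).tolist()
    return rows[cols + ['Play']]

tree_1 = make_tree_data(play, n_features=2, seed=1)
tree_2 = make_tree_data(play, n_features=2, seed=2)
print(tree_1)
print(tree_2)
```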

Suppose we repeat this data creation and tree building process 9 times. The following visual shows how Random Forest produces the final prediction. Since this is a classification problem, the final result is the mode of the class predictions. For a regression problem, the final prediction would be the average of the individual predictions. This is how the **ensemble** technique works.

![image.png](attachment:image.png)

## Random Forest Classifier Building in Scikit-learn

# Example: Adult.csv

## Step-1: Read Data

In [2]:
#Read the data
import pandas as pd
file = 'data/adult.csv'
# The data set uses '?' to mark missing values.
df = pd.read_csv(file, na_values='?')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  31978 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [4]:
df.shape # 15 columns: 14 features plus the response variable 'salary'

(32561, 15)

In [5]:
# Check for missing values. 'workclass', 'occupation' and 'native-country' have many missing values.
# We will deal with those values in the next step.
df.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
salary               0
dtype: int64

We will remove any rows that have missing values.

In [6]:
df_clean = df.dropna()

In [7]:
df_clean.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
salary            0
dtype: int64

In [8]:
df_clean.shape # we lost about 2,400 rows (32561 - 30162 = 2399), which may be OK because we still have a lot of rows.

(30162, 15)
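
The effect of `na_values='?'` followed by `dropna()` can be checked on a tiny in-memory CSV. The three rows below are made up for illustration and only mimic the layout of adult.csv.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; '?' marks a missing value, as in adult.csv.
csv_text = "age,workclass,salary\n39,State-gov,<=50K\n50,?,<=50K\n38,Private,>50K\n"

toy = pd.read_csv(io.StringIO(csv_text), na_values='?')
print(toy.isna().sum())      # workclass has one missing value

toy_clean = toy.dropna()     # drop every row that contains a missing value
print(toy_clean.shape)       # (2, 3)
```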

## Step-2: Identify Features and Response Variable

In [9]:
features = df_clean.drop(columns=['salary']) # columns= already selects the axis, so axis=1 is not needed
response = df_clean[['salary']]

In [10]:
features.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


In [11]:
response.head()

Unnamed: 0,salary
0,<=50K
1,<=50K
2,<=50K
3,<=50K
4,<=50K


In [12]:
response.drop_duplicates() #response variable has only two outcomes.

Unnamed: 0,salary
0,<=50K
7,>50K


In [16]:
response.salary.unique()

array(['<=50K', '>50K'], dtype=object)

## Step-3: Preprocess Data
Rows with missing values have already been removed, so no imputation is needed here. Let's determine the numerical and categorical columns. As a next step, we will encode the categorical columns.

In [17]:
# select columns with numerical data types
num_cols = features.select_dtypes(include=['int64', 'float64']).columns
# select columns with categorical data types
cat_cols = features.select_dtypes(include=['object', 'bool']).columns

In [18]:
num_cols = num_cols.tolist()
num_cols

['age',
 'fnlwgt',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week']

In [19]:
cat_cols = cat_cols.tolist()
cat_cols

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']

Now, let's encode the categorical columns.

In [20]:
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict

features[cat_cols] = features[cat_cols].astype(str) #let's make sure all categorical variables are of type str
d = defaultdict(LabelEncoder) # keep one LabelEncoder per column in a dictionary
# After fitting, d will look like this:
# defaultdict(sklearn.preprocessing._label.LabelEncoder,
#             {'workclass': LabelEncoder(),
#              'education': LabelEncoder(),
#              'marital-status': LabelEncoder(),
#              'occupation': LabelEncoder(),
#              'relationship': LabelEncoder(),
#              'race': LabelEncoder(),
#              'sex': LabelEncoder(),
#              'native-country': LabelEncoder()})

# Now let's use the apply() function to convert all categorical variables into encoded values.
features[cat_cols] = features[cat_cols].apply(lambda x: d[x.name].fit_transform(x))

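The code below, from ***Week9_EvaluatingMultipleModels-Copy1.ipynb***, should produce the same encoded values more simply and without a `lambda` function. The trade-off is that it reuses a single `LabelEncoder`, so the fitted encoder for each column is not kept for later decoding.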

In [1]:
# # Make sure all categorical variables are of type str or category
# features[cat_cols] = features[cat_cols].astype('category') 
# # Now let's use the apply() function to convert all categorical variables into encoded values.
# features[cat_cols] = features[cat_cols].apply(LabelEncoder().fit_transform)
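
One benefit of keeping the fitted encoders in the dictionary `d`, as in the `defaultdict` version above, is that the integer codes can be mapped back to the original labels. The sketch below shows this on a small made-up frame; `toy` is a stand-in for the categorical columns of `features`, not the real data.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the categorical columns of `features`.
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                    'race': ['White', 'Black', 'White']})

d = defaultdict(LabelEncoder)   # one LabelEncoder per column, created on first access
encoded = toy.apply(lambda x: d[x.name].fit_transform(x))
print(encoded['sex'].tolist())   # [1, 0, 1]

# Because each fitted encoder is kept in `d`, the codes can be mapped back to labels.
decoded = encoded.apply(lambda x: d[x.name].inverse_transform(x))
print(decoded['sex'].tolist())   # ['Male', 'Female', 'Male']
```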

In [21]:
features.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,5,77516,9,13,4,0,1,4,1,2174,0,40,38
1,50,4,83311,9,13,2,3,0,4,1,0,0,13,38
2,38,2,215646,11,9,0,5,1,4,1,0,0,40,38
3,53,2,234721,1,7,2,5,0,2,1,0,0,40,38
4,28,2,338409,9,13,2,9,5,2,0,0,0,40,4


## Step-4: Split Data

In [22]:
from sklearn.model_selection import train_test_split
my_result_list = train_test_split(features, response, test_size=0.20, random_state=0)
features_train, features_test, response_train, response_test = my_result_list

In [23]:
features.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
dtype: int64

## Step-5: Train Random Forest Model

In [24]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state = 0)
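# .values.ravel() passes the response as a 1-D array, which avoids a DataConversionWarning
classifier.fit(features_train, response_train.values.ravel())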


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

## Step-6: Generate Predictions and Evaluate Model

In [25]:
response_pred = classifier.predict(features_test)

In [26]:
from sklearn.metrics import accuracy_score
print('Accuracy Score on test data: ', accuracy_score(y_true=response_test, y_pred=response_pred))

Accuracy Score on test data:  0.8518150174042765
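
Accuracy is simply the share of test rows whose predicted class matches the true class. The hand computation below uses six made-up labels to show the arithmetic behind `accuracy_score`.

```python
# Hypothetical true and predicted labels for six test rows.
y_true = ['<=50K', '<=50K', '>50K', '<=50K', '>50K', '>50K']
y_pred = ['<=50K', '>50K', '>50K', '<=50K', '<=50K', '>50K']

# Accuracy = number of correct predictions / total number of predictions.
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = n_correct / len(y_true)
print(n_correct, round(accuracy, 4))   # 4 0.6667
```

By the same arithmetic, an accuracy of about 0.852 means that roughly 85 of every 100 test rows were classified correctly.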


## Step-7: Comparison

Let's train a **Decision Tree** model on the same data to compare the results.

In [27]:
from sklearn.tree import DecisionTreeClassifier
classifier_dt = DecisionTreeClassifier(random_state = 0)
classifier_dt.fit(features_train, response_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

In [28]:
response_pred = classifier_dt.predict(features_test)

In [29]:
from sklearn.metrics import accuracy_score
print('Accuracy Score on test data: ', accuracy_score(y_true=response_test, y_pred=response_pred))

Accuracy Score on test data:  0.8068954085861098


On this test split, the Random Forest model is more accurate than the single Decision Tree (about 0.852 vs. 0.807).

## Evaluating Multiple Models for Classification

Let's compare **Random Forest** and **Decision Tree** models by training both on the same training data and evaluating their prediction accuracy on the same test data.

In [30]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models_list = [RandomForestClassifier(), DecisionTreeClassifier()] # model instances in a list
model_names = ['Random Forest', 'Decision Tree'] # matching model names in a list
accuracy_list = []

for classifier in models_list:
    # .values.ravel() passes the response as a 1-D array, which avoids a DataConversionWarning
    classifier.fit(features_train, response_train.values.ravel())
    response_pred = classifier.predict(features_test)
    accuracy_list.append(accuracy_score(y_true=response_test, y_pred=response_pred))

result_dict = {'Model Name': model_names, 'Accuracy': accuracy_list}


In [31]:
results_df = pd.DataFrame(result_dict)
results_df

Unnamed: 0,Model Name,Accuracy
0,Random Forest,0.854467
1,Decision Tree,0.805901
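
The Random Forest accuracy in this table (0.854467) differs slightly from the 0.8518 obtained in Step-6. A likely reason is that the models in the list are created without `random_state`, so each run grows a different forest. The sketch below illustrates the point on synthetic data from `make_classification` (an assumption for illustration, not the adult data): two forests built with the same `random_state` give identical predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

def fit_predict(seed):
    """Train a small forest with the given seed and predict the test rows."""
    model = RandomForestClassifier(n_estimators=25, random_state=seed)
    return model.fit(X_train, y_train).predict(X_test)

pred_a = fit_predict(seed=0)
pred_b = fit_predict(seed=0)

# Same seed, same data: the two forests agree on every test row.
print((pred_a == pred_b).all())
```

Passing a fixed `random_state` to each model in a comparison list makes the reported scores repeatable from run to run.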


## Evaluating Multiple Models for Regression

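We will evaluate **Random Forest**, **Decision Tree**, and **Multiple Linear Regression** models using the Boston House Prices data. Note that `load_boston` was removed in scikit-learn 1.2, so the cell below requires an older scikit-learn version.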

In [32]:
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
df  = pd.DataFrame(boston.data, columns = boston.feature_names)
df['MEDV'] = boston.target
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [33]:
from sklearn.model_selection import train_test_split
features = df.iloc[:,0:13] # the first 13 columns of the dataframe are the features
features.head() #this is a dataframe

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [34]:
response = df[['MEDV']] #create a dataframe for response variable
response.head()

Unnamed: 0,MEDV
0,24.0
1,21.6
2,34.7
3,33.4
4,36.2


In [35]:
my_result_list = train_test_split(features, response, test_size=0.2, random_state=0)

features_train = my_result_list[0]
features_test = my_result_list[1]
response_train = my_result_list[2]
response_test = my_result_list[3]

Now our data set is ready for model training. We will create a list of regressor models and fit them one by one.

In [40]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

models_list = [RandomForestRegressor(), DecisionTreeRegressor(), LinearRegression()] # model instances in a list
model_names = ['Random Forest', 'Decision Tree', 'Multiple Linear Regression'] # matching model names in a list
mae_list = []
mse_list = []
rmse_list = []

for regressor in models_list:
    # .values.ravel() passes the response as a 1-D array, which avoids a DataConversionWarning
    regressor.fit(features_train, response_train.values.ravel())
    response_pred = regressor.predict(features_test)
    # scikit-learn metrics take the true values first, then the predictions
    mse = metrics.mean_squared_error(response_test, response_pred)
    mae_list.append(metrics.mean_absolute_error(response_test, response_pred))
    mse_list.append(mse)
    rmse_list.append(np.sqrt(mse))

result_dict = {'Model Name': model_names,
               'Mean Absolute Error': mae_list,
               'Mean Squared Error': mse_list,
               'Root Mean Squared Error': rmse_list}


In [41]:
results_df = pd.DataFrame(result_dict)
results_df

Unnamed: 0,Model Name,Mean Absolute Error,Mean Squared Error,Root Mean Squared Error
0,Random Forest,2.698637,20.51732,4.529605
1,Decision Tree,3.5,32.731961,5.721185
2,Multiple Linear Regression,3.842909,33.44898,5.783509


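Running the same cells again gives slightly different Random Forest and Decision Tree errors because those models are created without `random_state`; the Multiple Linear Regression row is unchanged because that model has no randomness.

In [39]: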
results_df = pd.DataFrame(result_dict)
results_df

Unnamed: 0,Model Name,Mean Absolute Error,Mean Squared Error,Root Mean Squared Error
0,Random Forest,2.668206,19.570572,4.423864
1,Decision Tree,3.630392,34.297157,5.856377
2,Multiple Linear Regression,3.842909,33.44898,5.783509
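
The three error measures in these tables are related by simple arithmetic. The sketch below computes them by hand for three made-up observations, using only the standard library.

```python
import math

# Hypothetical actual and predicted values for three observations.
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]

errors = [p - a for a, p in zip(actual, predicted)]   # -1.0, 0.0, 2.0
mae = sum(abs(e) for e in errors) / len(errors)       # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)       # mean squared error
rmse = math.sqrt(mse)                                 # root mean squared error

print(mae, round(mse, 4), round(rmse, 4))   # 1.0 1.6667 1.291
```

The same relation holds in the tables above: for the Random Forest row with a Mean Squared Error of 20.51732, the square root is about 4.5296, which matches the reported Root Mean Squared Error.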
