# Decision Trees and Random Forests for Regression

Decision trees are used for both classification and regression, hence the name CART (Classification and Regression Trees).

A single decision tree used on its own almost always overfits, which makes it a poor model.
Random forests are used to address this overfitting.

Random forests are (surprise, surprise!) random groups of trees. A random forest consists of n decision trees, each built on k randomly selected features and m random datapoints from the dataset.
For classification, a majority vote across the trees gives the decision.
For regression, the predicted values of the trees are averaged to produce the final prediction.
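
As a minimal sketch of those two aggregation rules (the function names and the toy per-tree outputs below are made up for illustration, they are not part of the forest built later):

```python
from collections import Counter
from statistics import mean

def aggregate_classification(votes):
    """Return the label predicted by the most trees (majority vote)."""
    return Counter(votes).most_common(1)[0][0]

def aggregate_regression(values):
    """Return the average of the per-tree predictions."""
    return mean(values)

# Hypothetical outputs of three trees for one datapoint
tree_labels = ["cat", "dog", "cat"]
tree_prices = [200000.0, 210000.0, 190000.0]

print(aggregate_classification(tree_labels))
print(aggregate_regression(tree_prices))
```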

### Deciding which feature to split on - The Standard Deviation

Just as entropy is used in classification problems, we use the standard deviation to decide which feature to split the dataset on.
We calculate the standard deviation of the target values in the parent branch, then the weighted average of the standard deviations of the sub-branches (weighted by the fraction of datapoints in each), and take the difference. The result is the total decrease in standard deviation. The higher this value, the better the feature is for splitting.
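
Here is a small worked sketch of that calculation. The toy data and the helper name `sd_reduction` are assumptions made for illustration; note that pandas' `std()` computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Toy data, made up for illustration: one categorical feature, one numeric target
toy = pd.DataFrame({
    "quality": ["good", "good", "good", "bad", "bad", "bad"],
    "price":   [300, 310, 290, 100, 110, 90],
})

def sd_reduction(frame, feature, target):
    """Parent std minus the size-weighted average of the sub-branch stds."""
    parent_sd = frame[target].std()
    weighted_child_sd = sum(
        group[target].std() * len(group) / len(frame)
        for _, group in frame.groupby(feature)
    )
    return parent_sd - weighted_child_sd

reduction = sd_reduction(toy, "quality", "price")
print(reduction)
```

Splitting on `quality` separates the expensive houses from the cheap ones, so each sub-branch has a much smaller standard deviation than the parent and the reduction is large.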

A standard deviation of 0 means the branch is homogeneous. To avoid overfitting, we stop splitting early by setting a tolerance of about 5% of the original standard deviation. As an extra safety measure, we also stop splitting once a branch holds 10 or fewer datapoints.

<img src='http://www.saedsayad.com/images/Decision_tree_r3.png'>

For more information, check out [this](https://www.youtube.com/watch?v=nSaOuPCNvlk) video by Noureddin Sadawi.

#### Let's make our own Random Forest

# Importing Data

In [4]:
# importing dependencies
import pandas as pd
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0, so import nan under that name

In [6]:
# This data is taken from the Kaggle dataset on house prices
data = pd.read_csv('train.csv')

In [7]:
# Given the various features of a house, we need to predict its sale price
data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [8]:
Y = data.SalePrice 
data.drop(axis=1,labels='SalePrice',inplace=True)

# Building A Tree

In [9]:
#Defining a node of a tree
class Node:
    def __init__(self,best_feature=0 , best_value=None , children=None , result=None):
        self.best_feature = best_feature
        self.best_value = best_value
        self.result = result
        self.children = children

In [10]:
#Calculate the standard deviation of the data
def sdev(data):
    return data.std()

In [11]:
#Splits into sub branches
def split(data , Y, feature , value):
    children = []
    target=[]
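    if value is None:  #Discrete data
        
        for childname in data[feature].unique():
            if not (childname == childname):  #NaN is the only value that is not equal to itself
                children.append(data[data[feature].isnull()])  #Sub branch with the missing values
                target.append(Y[data[data[feature].isnull()].index])  #Target for the missing values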
                
            else:
                children.append(data[data[feature] == childname])  #Sub branch data
                target.append(Y[data[data[feature] == childname].index])  #Sub branch target
    else:  #Continuous data
        
        children.append(data[data[feature] <= value])  #Left branch data
        target.append(Y[data[data[feature] <= value].index])  # Left branch target
        children.append(data[data[feature] > value])  # Right branch data
        target.append(Y[data[data[feature] > value].index])  # Right branch target
        
    return children,target
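
The `childname == childname` test in `split` relies on a property of NaN that is easy to miss. A short sketch of it, using made-up variable names:

```python
import numpy as np
import pandas as pd

nan = float("nan")

# NaN is not equal to itself, so `x == x` is False exactly when x is NaN
is_nan_by_self_compare = not (nan == nan)

# For the same reason, comparing a value to NaN with == never matches
matches_with_equals = (nan == np.nan)

# pd.isnull is a more explicit way to test for a missing value
is_nan_by_pandas = pd.isnull(nan)

# Selecting the rows of a Series whose value is missing
s = pd.Series(["a", np.nan, "b"])
missing_rows = s[s.isnull()]

print(is_nan_by_self_compare, matches_with_equals, is_nan_by_pandas, len(missing_rows))
```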
        

In [19]:
lis=[]   #Global list of the discrete features already used for splitting (it is never reset, so it persists across calls to build_tree)

# Main builder function
def build_tree(data,Y,tol=5):
    global lis
    #data is a pandas DataFrame 
    #Y is a pandas Series holding the target
    if len(Y)<=10:  #Stop splitting when a branch has 10 or fewer datapoints
        return Node(result=Y.mean()) #return leaf node
    
    else:
        #Initializing node parameters
        sdev_target = sdev(Y)  # standard deviation of the target
        dev_loss = 0  #the standard deviation loss
        best_feature=None #best feature to split upon 
        best_value = None #value of the feature to split upon(in case of continuous data)
        best_children = None #best sub branches
        best_target=None #the target values for the best sub branches
    
    #dev_loss is still 0 at this point, so (for tol below 100) this returns a leaf only when the target's standard deviation is 0, i.e. the branch is homogeneous
    if dev_loss >= (sdev_target*(1-(tol/100))):  
        result = Y.mean()   
        return Node(result = result)
    
    else:  #branch further
        cols=data.columns.tolist()
        for feature in cols:

            f_sdev=0 
            
            if feature in lis:
                continue
                
            if not ((data[feature].dtype == int) or data[feature].dtype == float):  #For discrete data
                children, children_target = split(data,Y,feature,value=None)  #Split into  sub branches
                
                for child in children:
                    f_sdev += (sdev(Y[child.index]) * len(child)/len(data))  #Calculate total std. deviation
                
                dev_decr = sdev_target - f_sdev  #The total decrease in std. deviation
                
                #Updating the node parameters 
                if dev_decr > dev_loss:
                    dev_loss = dev_decr
                    best_feature = feature
                    best_value = data[feature].unique()
                    best_children = children
                    best_target = children_target
                    
            
            else:  #For continuous data
                for uvalue in data[feature].unique(): #iterate for each unique value
                    if pd.isnull(uvalue):  #skip missing values (a comparison with == NaN is always False)
                        continue
                    f_sdev = 0
                    children,children_target = split(data,Y,feature,value = uvalue) #Split into sub branches
                    
                    for child in children:
                        f_sdev += (sdev(Y[child.index]) * len(child)/len(data))  #Calculate total std. deviation
                    
                    dev_decr = sdev_target - f_sdev #The total decrease in std. deviation
                
                    #updating the node parameters
                    if dev_decr > dev_loss:
                        dev_loss = dev_decr
                        best_feature = feature
                        best_value = uvalue
                        best_children = children
                        best_target = children_target
        
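        if best_feature is None:  #No split reduced the standard deviation, so return a leaf
            return Node(result=Y.mean())
        
        if not ((data[best_feature].dtype == int) or data[best_feature].dtype ==float):
            lis.append(best_feature)  #A discrete feature is used for splitting only once
            print (lis)
        #Build the sub branches recursively and return the node 
        children = [build_tree(best_children[i],best_target[i],tol) for i in range(len(best_target))]
        return Node(best_feature=best_feature, best_value=best_value, children=children)    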
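
To see the continuous-feature search in isolation, here is a minimal sketch on toy data. The helper `best_threshold` and the data are made up for illustration. Unlike `build_tree`, it explicitly skips thresholds that leave fewer than two points on a side, because `std()` of such a branch is NaN.

```python
import pandas as pd

# Toy data, made up for illustration only
x = pd.Series([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = pd.Series([100.0, 110.0, 90.0, 300.0, 310.0, 290.0])

def best_threshold(x, y):
    """Try every unique value of x as a threshold and keep the one
    with the largest decrease in standard deviation."""
    parent_sd = y.std()
    best_value, best_gain = None, 0.0
    for value in x.unique():
        left, right = y[x <= value], y[x > value]
        if len(left) < 2 or len(right) < 2:
            continue  # std() of fewer than two points is NaN
        weighted = (left.std() * len(left) + right.std() * len(right)) / len(y)
        gain = parent_sd - weighted
        if gain > best_gain:
            best_value, best_gain = value, gain
    return best_value, best_gain

value, gain = best_threshold(x, y)
print(value, gain)
```

The search settles on the threshold that separates the three cheap houses from the three expensive ones.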
        

In [20]:
#Returns the result for a testing datapoint X
def predict(model, X):
    if model.result is not None:  #a leaf node
        return model.result
    
    elif (((model.best_value).dtype == int) or ((model.best_value).dtype == float)):  #if best_feature is numeric
        if not(X[(model.best_feature)]==X[(model.best_feature)]):
            return predict(model.children[0], X)
        
        if X[(model.best_feature)] <= model.best_value: #Select the left branch
            return predict(model.children[0], X)  
        
        else:  #Select the right branch
            return predict(model.children[1], X)
        
    else:  #The best feature has discrete values
        if not(X[model.best_feature] == X[model.best_feature]):
            return predict((model.children[0]),X)
        else:
            idx = ((model.best_value).tolist()).index(X[model.best_feature]) #index of the particular label
            return predict((model.children)[idx],X)  #Go to the label branch         
        
        

In [21]:
#Defining a model
model = build_tree(data,Y)

['Neighborhood']
['Neighborhood', 'KitchenQual']
['Neighborhood', 'KitchenQual', 'LotConfig']
['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition']
['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition', 'LandContour']
['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition', 'LandContour', 'BsmtFinType1']
['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition', 'LandContour', 'BsmtFinType1', 'Exterior2nd']
['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition', 'LandContour', 'BsmtFinType1', 'Exterior2nd', 'BsmtExposure']
['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition', 'LandContour', 'BsmtFinType1', 'Exterior2nd', 'BsmtExposure', 'BsmtQual']
['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition', 'LandContour', 'BsmtFinType1', 'Exterior2nd', 'BsmtExposure', 'BsmtQual', 'GarageFinish']
['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition', 'LandContour', 'BsmtFinType1', 'Exterior2nd', 'BsmtExposure', 'BsmtQual', 'GarageFinish', 'H

In [18]:
data.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
                  ...   
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object


In [22]:
print(predict(model,data.iloc[99,:]))

118713.88888888889


In [312]:
for i in range(len(Y)):
    print(i, predict(model,data.iloc[i,:]))

0 210500.0
1 195500.0
2 231083.33333333334
3 141365.22222222222
4 276400.0
5 143700.0
6 338608.3333333333
7 171090.0
8 132280.0
9 128011.11111111111
10 141687.5
11 342472.375
12 114193.55555555556
13 259649.5
14 124166.66666666667
15 132633.33333333334
16 109334.22222222222
17 141687.5
18 153625.0
19 140750.0
20 330490.5
21 132450.0
22 231980.0
23 135575.0
24 114193.55555555556
25 255867.5
26 143925.0
27 342472.375
28 197483.33333333334
29 77475.0
30 87680.0
31 114193.55555555556
32 185005.2
33 158000.0
34 268500.0
35 330490.5
36 139200.0
37 138175.0
38 112314.28571428571
39 118785.71428571429
40 161218.75
41 173333.33333333334
42 202250.0
43 142362.5
44 140750.0
45 290531.75
46 233337.2
47 248992.66666666666
48 100200.0
49 131483.33333333334
50 308750.0
51 110416.66666666667
52 115250.0
53 306125.0
54 143350.0
55 176280.0
56 181900.0
57 187033.33333333334
58 511120.3333333333
59 124300.0
60 153625.0
61 102272.0
62 204833.0
63 132280.0
64 231980.0
65 330490.5
66 197483.33333333334
67 2

ValueError: 'Fa' is not in list
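
This message is what `list.index` raises when the value is not in the list, so the error most likely comes from the label lookup in `predict`: the datapoint carries a label that the node did not see when it was built. One possible way to handle it, sketched here with a hypothetical helper `branch_index` (not part of the notebook), is to fall back to a default branch, much as `predict` already sends missing values to the first child:

```python
def branch_index(known_labels, label, default=0):
    """Return the position of `label` among the labels seen while building
    the node, or `default` when the label was never seen (assumed fallback)."""
    try:
        return list(known_labels).index(label)
    except ValueError:
        return default

seen = ["Gd", "TA", "Ex"]  # made-up labels for illustration
print(branch_index(seen, "TA"))
print(branch_index(seen, "Fa"))
```

Whether the first branch is a sensible default depends on the data; returning the mean of the node's own targets would be another option.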

In [23]:
#Predicting data
predictions =[]
for i in range(len(Y)):
    predictions.append(predict(model,data.iloc[i,:]))
print(predictions)

[210500.0, 195500.0, 231083.33333333334, 141365.22222222222, 276400.0, 143700.0, 338608.3333333333, 171090.0, 132280.0, 128011.11111111111, 141687.5, 342472.375, 114193.55555555556, 259649.5, 145225.0, 132633.33333333334, 132757.14285714287, 141687.5, 153625.0, 138207.14285714287, 330490.5, 132450.0, 231980.0, 135575.0, 114193.55555555556, 255867.5, 127425.0, 342472.375, 239833.33333333334, 77475.0, 87680.0, 114193.55555555556, 185005.2, 152800.0, 268500.0, 330490.5, 139200.0, 155450.0, 118713.88888888889, 114937.5, 161218.75, 173333.33333333334, 202250.0, 142362.5, 145225.0, 290531.75, 233337.2, 248992.66666666666, 100200.0, 131483.33333333334, 308750.0, 110416.66666666667, 115250.0, 306125.0, 118713.88888888889, 176280.0, 181900.0, 187033.33333333334, 511120.3333333333, 124300.0, 153625.0, 102272.0, 204833.0, 132280.0, 231980.0, 330490.5, 156685.7142857143, 217877.77777777778, 90063.0, 232687.5, 230333.33333333334, 127937.5, 187166.66666666666, 161218.75, 100200.0, 88000.0, 119350.0,

In [24]:
#Actual data
Y

0       208500
1       181500
2       223500
3       140000
4       250000
5       143000
6       307000
7       200000
8       129900
9       118000
10      129500
11      345000
12      144000
13      279500
14      157000
15      132000
16      149000
17       90000
18      159000
19      139000
20      325300
21      139400
22      230000
23      129900
24      154000
25      256300
26      134800
27      306000
28      207500
29       68500
         ...  
1430    192140
1431    143750
1432     64500
1433    186500
1434    160000
1435    174000
1436    120500
1437    394617
1438    149700
1439    197000
1440    191000
1441    149300
1442    310000
1443    121000
1444    179600
1445    129000
1446    157900
1447    240000
1448    112000
1449     92000
1450    136000
1451    287090
1452    145000
1453     84500
1454    185000
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

In [27]:
abs(Y - predictions ).mean()

17893.14267829963
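
The expression above is the mean absolute error: the average size of the gap between the actual and the predicted price. A tiny sketch with made-up numbers shows what it computes:

```python
import pandas as pd

# Made-up prices for illustration only
actual = pd.Series([200000, 150000, 100000])
predicted = [210000.0, 140000.0, 100000.0]

# Element-wise absolute difference, then the mean
mae = (actual - predicted).abs().mean()
print(mae)
```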

# Random Forest

We will create a class for the random forest.
It will have the following attributes:
   * fraction of rows to use per tree (not used in this forest, since its trees are not binary trees)
   * fraction of features to use per tree
   * number of trees
   * tolerance for the standard deviation (as a percentage of the original standard deviation)
   * trees
    
And two methods:
   * fit  (fits the forest to the data)
   * predictor  (predicts values for the given data)
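
The two sampling fractions can both be expressed with `DataFrame.sample`. The sketch below uses a made-up frame; `random_state` is set only to make the example repeatable.

```python
import pandas as pd

# Toy frame and target, made up for illustration only
frame = pd.DataFrame({
    "a": range(10), "b": range(10, 20), "c": range(20, 30),
    "d": range(30, 40), "e": range(40, 50),
})
target = pd.Series(range(100, 110))

# Sample 60% of the features (columns); every row is kept
cols_sample = frame.sample(frac=0.6, axis=1, random_state=0)

# Sample 60% of the rows, then pick the matching targets by index label
rows_sample = frame.sample(frac=0.6, random_state=0)
rows_target = target.loc[rows_sample.index]

print(cols_sample.shape, rows_sample.shape, len(rows_target))
```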

In [45]:
class RandomForest:
    def __init__(self, frac_value_rows=0.6, frac_value_cols=0.6, n_trees=10,tol=5):  #Defining the hyperparameters
        self.frac_value_rows = frac_value_rows  #the fraction of rows we want per tree
        self.frac_value_cols = frac_value_cols  #the fraction of features we want per tree
        self.n_trees = n_trees  #number of trees in the forest
        self.tol = tol  #tolerance for the standard deviation (in percentage of actual deviation)
        
    #Module to fit the data
    def fit(self,data,target):
        self.trees=[]
        for i in range(self.n_trees):
            rand_data = data.sample(frac = self.frac_value_cols, axis=1) #Randomly sample a fraction of the features (columns) for this tree; all rows are kept
            rand_Y = target[rand_data.index]  
            (self.trees).append(build_tree(rand_data,rand_Y,self.tol)) #add trees to the list
        return self.trees
    
    def predictor(self,X):
        predictions=[]
        for i in range(len(X)):  #For each data point in the dataset
            point = X.iloc[i,:]  
            value=0
            
            for j in self.trees: 
                value += predict(j,point)  #Predict the value per tree per datapoint
            value = value/self.n_trees
            predictions.append(value)
        
        return predictions
                       

In [46]:
rf = RandomForest()

In [47]:
rf.fit(data,Y)

['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition', 'LandContour', 'BsmtFinType1', 'Exterior2nd', 'BsmtExposure', 'BsmtQual', 'GarageFinish', 'HeatingQC', 'Condition1', 'LotShape', 'Alley', 'GarageType', 'FireplaceQu', 'Functional', 'ExterCond', 'BsmtCond', 'Electrical', 'BldgType', 'Exterior1st', 'RoofStyle', 'PavedDrive', 'ExterQual', 'HouseStyle', 'Fence', 'CentralAir', 'MSZoning', 'MasVnrType', 'BsmtFinType2', 'Foundation', 'GarageQual', 'SaleType', 'GarageCond', 'MiscFeature', 'LandSlope']
['Neighborhood', 'KitchenQual', 'LotConfig', 'SaleCondition', 'LandContour', 'BsmtFinType1', 'Exterior2nd', 'BsmtExposure', 'BsmtQual', 'GarageFinish', 'HeatingQC', 'Condition1', 'LotShape', 'Alley', 'GarageType', 'FireplaceQu', 'Functional', 'ExterCond', 'BsmtCond', 'Electrical', 'BldgType', 'Exterior1st', 'RoofStyle', 'PavedDrive', 'ExterQual', 'HouseStyle', 'Fence', 'CentralAir', 'MSZoning', 'MasVnrType', 'BsmtFinType2', 'Foundation', 'GarageQual', 'SaleType', 'GarageCond', 'MiscFea

[<__main__.Node at 0x7fb8268e6cf8>,
 <__main__.Node at 0x7fb8266f2978>,
 <__main__.Node at 0x7fb8264d8898>,
 <__main__.Node at 0x7fb8267c32e8>,
 <__main__.Node at 0x7fb82689f320>,
 <__main__.Node at 0x7fb82663b588>,
 <__main__.Node at 0x7fb8265a5a58>,
 <__main__.Node at 0x7fb82658cb70>,
 <__main__.Node at 0x7fb826910588>,
 <__main__.Node at 0x7fb826638828>]

In [50]:
res = rf.predictor(data)
print(res)

[209845.0, 177588.94047619047, 222697.5111111111, 136903.97317460316, 269693.3153571428, 148629.6738095238, 294702.0071428571, 191358.12380952382, 147717.02222222224, 119443.91666666666, 130090.99999999997, 339616.7616666666, 133300.57142857142, 261826.47190476194, 145640.71444444446, 141630.8486111111, 141215.6349206349, 94376.67694444445, 155493.9761904762, 142379.14880952382, 324779.86805555556, 133819.44444444444, 225069.69833333333, 133655.56746031746, 141229.5158730159, 260986.5732936508, 135745.82142857142, 308283.4841666667, 198183.43198412698, 67623.48888888888, 77727.63194444445, 142661.52777777778, 193276.5119047619, 172112.45238095237, 282146.70019841276, 300580.31888888887, 146421.64444444445, 154738.7033333333, 122217.44444444445, 102529.8321031746, 162150.93809523812, 169591.7619047619, 132597.70714285714, 131325.1408730159, 138984.10416666666, 308926.8821428571, 244123.59472222222, 239757.44222222222, 120578.90198412698, 127382.00198412698, 168731.76845238096, 112452.5,

In [52]:
abs(Y - res).mean()

8250.98123486084

As you can see, using a random forest reduced the mean absolute error on the training data substantially (from about 17,900 to about 8,250).
This random forest contains only 10 trees. For more accuracy, we can increase the number of trees, but at the cost of computation time. This tradeoff is crucial in deciding which model to use.

Another method to improve accuracy is fine-tuning the hyperparameters.

Yet another method is to use binary trees and, at the same time, randomly select the datapoints for each tree. However, this is computationally more expensive.