# Application of Bootstrap samples in Random Forest

In [1]:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")

 <li>Load the Boston housing dataset.</li>

In [2]:
boston = load_boston()
x=boston.data #independent variables
y=boston.target #target variable

In [3]:
x.shape

(506, 13)

In [4]:
y = y.reshape(506,1)
x=np.append(x, y,axis=1)
x.shape

(506, 14)

In [5]:
y[1]

array([21.6])

In [6]:
x[1][13]

21.6

### Task: 1

<font color='red'><b>Step 1 Creating samples: </b></font> Randomly create 30 samples from the full set of Boston data points.
<ol>
<li>Creating each sample: select 303 random data points (60% of 506) from the whole dataset, then replicate 203 points drawn from those sampled points, so that each sample again has 506 rows.</li>
<li>Ex: suppose we have 10 data points [1,2,3,4,5,6,7,8,9,10]. First we select 6 data points at random, say [4, 5, 7, 8, 9, 3]. Next we replicate 4 points from [4, 5, 7, 8, 9, 3], say [5, 8, 3, 7], so the final sample is [4, 5, 7, 8, 9, 3, 5, 8, 3, 7].</li>
<li>Create 30 samples this way.</li>
<li>Note that, as part of bagging, each random sample should also use a different set of columns.</li>
<li>Ex: with 10 columns, the first sample might use [3, 4, 5, 9, 1, 2], the second [7, 9, 1, 4, 5, 6, 2], and so on.</li>
<li>Make sure each sample has at least 3 features/columns/attributes.</li>
</ol>
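The recipe above can also be expressed in terms of row and column indices rather than copied rows. The following is a minimal sketch, not the notebook's own implementation: the function name `sample_rows_and_columns` and its parameters are hypothetical, and it allows all columns to be selected, whereas the notebook's cell caps the feature count below the total.

```python
import numpy as np

def sample_rows_and_columns(n_rows, n_unique, n_cols, rng, min_cols=3):
    """Return (row_idx, col_idx) for one bag (hypothetical helper)."""
    # Step 1: pick n_unique distinct rows, without replacement.
    base = rng.choice(n_rows, size=n_unique, replace=False)
    # Step 2: replicate rows drawn from that subset until the bag has n_rows rows.
    extra = rng.choice(base, size=n_rows - n_unique, replace=True)
    row_idx = np.concatenate([base, extra])
    # Step 3: pick a random subset of at least min_cols feature columns.
    k = rng.integers(min_cols, n_cols + 1)  # upper bound is exclusive
    col_idx = rng.choice(n_cols, size=k, replace=False)
    return row_idx, col_idx

# Toy version of the 10-point example: 6 distinct rows plus 4 replicated ones.
rng = np.random.default_rng(0)
rows, cols = sample_rows_and_columns(n_rows=10, n_unique=6, n_cols=10, rng=rng)
print(rows, cols)
```

For the Boston data the call would use `n_rows=506, n_unique=303, n_cols=13`. Keeping the row indices, instead of only the row values, makes it easy to tell later which points each bag left out.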

In [7]:
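def create_sam(x):
    # 303 distinct rows (60% of 506), drawn without replacement
    x1 = x[np.random.choice(x.shape[0], 303, replace=False), :]
    # 203 further rows replicated from those 303, drawn with replacement
    x1_ = x1[np.random.choice(x1.shape[0], 203, replace=True), :]
    x1 = np.concatenate((x1, x1_))
    # random feature subset of 3 to 12 columns (randint's upper bound is exclusive)
    f = np.random.choice(x.shape[1]-1, np.random.randint(3,x.shape[1]-1,1), replace=False)
    # keep the chosen features plus the target, which is column 13
    x1 = x1[:,np.append(f,np.array([13]))]
    return x1,f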

In [8]:
boston_bags = list()
bag_fea = list()
for i in range(30):
    g,h = create_sam(x)
    boston_bags.append(g)
    bag_fea.append(h)

<font color='red'><b>Step 2 Building High Variance Models on each of the sample and finding train MSE value:</b></font> Build a DecisionTreeRegressor on each sample.
<ol><li>Build a regression tree on each of the 30 samples.</li>
<li>Compute the predicted value of each data point (506 data points) in your corpus.</li>
<li>Predicted house price of the $i^{th}$ data point: $y^{i}_{pred} =  \frac{1}{30}\sum_{k=1}^{30}(\text{predicted value of } x^{i} \text{ with } k^{th} \text{ model})$.</li>
<li>Now calculate the $MSE =  \frac{1}{506}\sum_{i=1}^{506}(y^{i} - y^{i}_{pred})^{2}$.</li>
</ol>
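The averaging and MSE formulas above can be written in vectorized form. The sketch below is illustrative only and rests on several assumptions: it uses synthetic data instead of the Boston data, plain bootstrap rows instead of the 303 + 203 scheme from Step 1, and the names `models`, `feature_sets` and `all_preds` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the housing data (an assumption, for illustration only).
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

n_models = 30
models, feature_sets = [], []
for _ in range(n_models):
    # Plain bootstrap rows here, not the notebook's 60% + replication scheme.
    rows = rng.choice(len(X), size=len(X), replace=True)
    n_feat = rng.integers(3, X.shape[1] + 1)  # at least 3 features
    cols = rng.choice(X.shape[1], size=n_feat, replace=False)
    tree = DecisionTreeRegressor(random_state=0).fit(X[rows][:, cols], y[rows])
    models.append(tree)
    feature_sets.append(cols)

# Rows = models, columns = data points; each tree sees only its own feature subset.
all_preds = np.vstack([m.predict(X[:, c]) for m, c in zip(models, feature_sets)])
y_pred = all_preds.mean(axis=0)   # (1/30) * sum over the 30 models
mse = np.mean((y - y_pred) ** 2)  # (1/n) * sum over the data points
print(all_preds.shape, mse)
```

Stacking the per-tree predictions into one matrix replaces the nested accumulation loops with a single `mean(axis=0)`.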

In [9]:
from sklearn.tree import DecisionTreeRegressor
trees = list() #training the 30 DTs
for i in range(30):
    clf = DecisionTreeRegressor()
    b = boston_bags[i]
    clf.fit(b[:,0:(len(b[0])-1)],b[:,len(b[0])-1])
    trees.append(clf)

In [10]:
train_val = list()
for i in range(30):
    b = boston_bags[i]
    train_val.append(trees[i].predict(b[:,0:(len(b[0])-1)]))

In [11]:
mse_dtrain = list()
for i in range(30):
    b = boston_bags[i]
    mse_dtrain.append(mean_squared_error(b[:,len(b[0])-1],train_val[i]))


In [12]:
for i in range(30):
    print('The trained MSE of the bagged sample {} is {}'.format(i+1,mse_dtrain[i]))

The trained MSE of the bagged sample 1 is 4.677040940045525e-31
The trained MSE of the bagged sample 2 is 1.3656959544932932e-30
The trained MSE of the bagged sample 3 is 2.930945655761862e-31
The trained MSE of the bagged sample 4 is 6.360775678461913e-31
The trained MSE of the bagged sample 5 is 6.547857316063734e-31
The trained MSE of the bagged sample 6 is 4.349648074242338e-31
The trained MSE of the bagged sample 7 is 4.72381134944598e-31
The trained MSE of the bagged sample 8 is 0.04940711462450596
The trained MSE of the bagged sample 9 is 4.1625664366405165e-31
The trained MSE of the bagged sample 10 is 6.547857316063734e-31
The trained MSE of the bagged sample 11 is 9.825995529832486
The trained MSE of the bagged sample 12 is 8.044510416878303e-31
The trained MSE of the bagged sample 13 is 5.238285852850988e-31
The trained MSE of the bagged sample 14 is 1.309571463212747e-31
The trained MSE of the bagged sample 15 is 8.605755329683765e-31
The trained MSE of the bagged sample 16

In [13]:
x_c = boston.data
y_c = boston.target
pred_cor = list()
for i in range(30):
    pred_cor.append(trees[i].predict(x_c[:,bag_fea[i]])) #predicting via every DT trained

pred_y = np.zeros(x_c.shape[0])
for i in range(x_c.shape[0]):
    for j in range(30):
        pred_y[i] += pred_cor[j][i]
    pred_y[i] /= 30

mse_pop = mean_squared_error(y_c,pred_y)

print(mse_pop)

2.4581824334029694


<font color='red'><b>Step 3 Calculating the OOB score :</b></font>
<ol>
<li>Compute the predicted value of each data point (506 data points) in your corpus.</li>
<li>Predicted house price of the $i^{th}$ data point: $y^{i}_{pred} =  \frac{1}{|K_{i}|}\sum_{k \in K_{i}}(\text{predicted value of } x^{i} \text{ with } k^{th} \text{ model})$, where $K_{i}$ is the set of models built on samples that did not include $x^{i}$.</li>
<li>Now calculate the $OOB\ Score =  \frac{1}{506}\sum_{i=1}^{506}(y^{i} - y^{i}_{pred})^{2}$.</li>
</ol>
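If the row indices of each bag are kept (an assumption; the notebook's cells keep only the row values and match on them), the out-of-bag average can be computed with a boolean mask. The helper `oob_predictions` below is a hypothetical sketch of that idea, checked on a tiny hand-made example.

```python
import numpy as np

def oob_predictions(all_preds, bag_rows):
    """Average, for each point, only the models whose bag excluded that point.

    all_preds: array (n_models, n_points), every model's prediction for every point.
    bag_rows:  list of integer index arrays, the rows each model was trained on.
    """
    n_models, n_points = all_preds.shape
    in_bag = np.zeros((n_models, n_points), dtype=bool)
    for k, rows in enumerate(bag_rows):
        in_bag[k, rows] = True
    oob = ~in_bag
    counts = oob.sum(axis=0)  # number of out-of-bag models per point
    sums = np.where(oob, all_preds, 0.0).sum(axis=0)
    # A point that appears in every bag has no OOB prediction; mark it as NaN.
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

# Toy check: 3 models, 4 points.
all_preds = np.array([[1.0, 2.0, 3.0, 4.0],
                      [2.0, 4.0, 6.0, 8.0],
                      [3.0, 6.0, 9.0, 12.0]])
bag_rows = [np.array([0, 1]), np.array([1, 2]), np.array([0, 1, 2, 3])]
y_oob = oob_predictions(all_preds, bag_rows)
print(y_oob)  # point 1 is in every bag, so its entry is NaN
```

With 303 distinct rows out of 506 per bag and 30 bags, a point being in every bag is unlikely, but the NaN guard avoids a division by zero if it happens.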

In [14]:
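pred_y_oob = np.zeros(x_c.shape[0])
for i in (range(x_c.shape[0])):
    count_b = 0 # number of trees for which point i is out-of-bag
    for j in range(30):
        count = 0
        # point i is treated as in-bag for tree j if its feature values match any row of bag j
        for k in range(boston_bags[j].shape[0]):
            if np.all(np.equal(x_c[i,bag_fea[j]],np.array(boston_bags[j][k,0:len(boston_bags[j][0])-1]))):
                count += 1
                break
        if count != 0:
            continue # in-bag: skip this tree's prediction
        pred_y_oob[i] += pred_cor[j][i]
        count_b += 1
    # assumes every point is out-of-bag for at least one tree (count_b > 0)
    pred_y_oob[i] /= count_b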

mse_oob_pop = mean_squared_error(y_c,pred_y_oob)
print(mse_oob_pop)

14.251832307035144


### Task: 2
<pre>
<font color='red'><b>Computing CI of OOB Score and Train MSE</b></font>
<ol>
<li> Repeat Task 1 35 times, and for each iteration store the Train MSE and OOB score </li>
<li> After this we will have 35 Train MSE values and 35 OOB scores </li>
<li> Using these 35 values (treat them as a sample), find the confidence intervals of MSE and OOB Score </li>
<li> You need to report the CI of MSE and the CI of OOB Score </li>
<li> Note: Refer to Central_Limit_theorem.ipynb to check how to find the confidence interval</li>
</ol>
</pre>
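The interval asked for here is the sample mean plus or minus a multiple of its standard error. A minimal sketch follows, assuming a multiplier of 2 (roughly a 95% interval under a normal approximation); the helper `mean_confidence_interval` and the `scores` values are hypothetical, and it uses `ddof=1`, which differs from NumPy's default `ddof=0`.

```python
import numpy as np

def mean_confidence_interval(values, z=2.0):
    """Return (low, high) = mean -/+ z * standard error of the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # ddof=1 gives the unbiased sample standard deviation (an assumption of this sketch).
    std_err = values.std(ddof=1) / np.sqrt(len(values))
    return mean - z * std_err, mean + z * std_err

# Made-up scores, for illustration only.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
low, high = mean_confidence_interval(scores)
print(low, high)
```

In Task 2 the same function could be applied once to the 35 Train MSE values and once to the 35 OOB scores.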

In [15]:
def ci_cal(x, boston):
    
    def create_sam(x):
        x1 = x[np.random.choice(x.shape[0], 303, replace=False), :]
        x1_ = x1[np.random.choice(x1.shape[0], 203, replace=True), :]
        x1 = np.concatenate((x1, x1_))
        f = np.random.choice(x.shape[1]-1, np.random.randint(3,x.shape[1]-1,1), replace=False)
        x1 = x1[:,np.append(f,np.array([13]))]
        return x1,f
    
    mse_corpus = list()
    mse_oob = list()
    for _ in (range(35)):
        
        boston_bags = list()
        bag_fea = list()
        for i in range(30):
            g,h = create_sam(x)
            boston_bags.append(g)
            bag_fea.append(h)
        
        trees = list() #training the 30 DTs
        for i in range(30):
            clf = DecisionTreeRegressor()
            b = boston_bags[i]
            clf.fit(b[:,0:(len(b[0])-1)],b[:,len(b[0])-1])
            trees.append(clf)
            
        x_c = boston.data
        pred_cor = list()
        for i in range(30):
            pred_cor.append(trees[i].predict(x_c[:,bag_fea[i]]))
        
        y_c = boston.target
        
        pred_y = np.zeros(x_c.shape[0])
        for i in range(x_c.shape[0]):
            for j in range(30):
                pred_y[i] += pred_cor[j][i]
            pred_y[i] /= 30
            
        m = mean_squared_error(y_c,pred_y)
        
        mse_corpus.append(m)
        
        pred_y_oob = np.zeros(x_c.shape[0])
        for i in (range(x_c.shape[0])):
            count_b = 0
            for j in range(30):
                count = 0
                for k in range(x_c.shape[0]):
                    if np.all(np.equal(x_c[i,bag_fea[j]],np.array(boston_bags[j][k,0:len(boston_bags[j][0])-1]))):
                        count += 1
                        break
                if count != 0:
                    continue
                pred_y_oob[i] += pred_cor[j][i]
                count_b += 1
            pred_y_oob[i] /= count_b
            
        ms = mean_squared_error(y_c,pred_y_oob)
        
        mse_oob.append(ms)
    return mse_corpus, mse_oob

In [16]:
mse_corpus, mse_oob = ci_cal(x,boston)

In [17]:
m = np.asarray(mse_corpus)
mo = np.asarray(mse_oob)

In [22]:
from prettytable import PrettyTable
import random

x = PrettyTable()
x = PrettyTable(["#samples", "Sample Size", "Sample mean", "Population Mean","Sample Std","Left C.I","Right C.I", "Catch"])
for i in range(10):
    sample=m[random.sample(range(0, m.shape[0]), 20)]
    sample_mean = sample.mean()
    sample_std =  sample.std()
    sample_size = len(sample)
    # the sample standard deviation stands in for the unknown population standard deviation;
    # mean +/- 2 standard errors gives a roughly 95% confidence interval
    left_limit  = np.round(sample_mean - 2*(sample_std/np.sqrt(sample_size)), 3)
    right_limit = np.round(sample_mean + 2*(sample_std/np.sqrt(sample_size)), 3)
    row = []
    row.append(i+1)
    row.append(sample_size)
    row.append(sample_mean)
    row.append(mse_pop)
    # note: the "Sample Std" column actually stores the standard error of the mean (std / sqrt(n))
    row.append(np.round(sample_std/np.sqrt(sample_size),3))
    row.append(left_limit)
    row.append(right_limit)
    row.append((mse_pop <= right_limit) and (mse_pop >= left_limit))
    x.add_row(row)
print(x)
print("C.I of mse_scores")

+----------+-------------+--------------------+--------------------+------------+----------+-----------+-------+
| #samples | Sample Size |    Sample mean     |  Population Mean   | Sample Std | Left C.I | Right C.I | Catch |
+----------+-------------+--------------------+--------------------+------------+----------+-----------+-------+
|    1     |      20     | 2.4627279474175747 | 2.4581824334029694 |   0.062    |  2.339   |   2.586   |  True |
|    2     |      20     | 2.5253588468271744 | 2.4581824334029694 |   0.063    |  2.399   |   2.651   |  True |
|    3     |      20     | 2.4443461026272546 | 2.4581824334029694 |   0.053    |  2.338   |    2.55   |  True |
|    4     |      20     | 2.430984935219385  | 2.4581824334029694 |   0.068    |  2.295   |   2.567   |  True |
|    5     |      20     | 2.553854592022279  | 2.4581824334029694 |   0.063    |  2.427   |    2.68   |  True |
|    6     |      20     | 2.531178269637236  | 2.4581824334029694 |    0.07    |  2.392   |   2

In [23]:
x = PrettyTable()
x = PrettyTable(["#samples", "Sample Size", "Sample mean", "Population Mean","Sample Std","Left C.I","Right C.I", "Catch"])
for i in range(10):
    sample=mo[random.sample(range(0, mo.shape[0]), 20)]
    sample_mean = sample.mean()
    sample_std =  sample.std()
    sample_size = len(sample)
    # the sample standard deviation stands in for the unknown population standard deviation;
    # mean +/- 2 standard errors gives a roughly 95% confidence interval
    left_limit  = np.round(sample_mean - 2*(sample_std/np.sqrt(sample_size)), 3)
    right_limit = np.round(sample_mean + 2*(sample_std/np.sqrt(sample_size)), 3)
    row = []
    row.append(i+1)
    row.append(sample_size)
    row.append(sample_mean)
    row.append(mse_oob_pop)
    # note: the "Sample Std" column actually stores the standard error of the mean (std / sqrt(n))
    row.append(np.round(sample_std/np.sqrt(sample_size),3))
    row.append(left_limit)
    row.append(right_limit)
    row.append((mse_oob_pop <= right_limit) and (mse_oob_pop >= left_limit))
    x.add_row(row)
print(x)
print("C.I of oob mse_scores")

+----------+-------------+--------------------+--------------------+------------+----------+-----------+-------+
| #samples | Sample Size |    Sample mean     |  Population Mean   | Sample Std | Left C.I | Right C.I | Catch |
+----------+-------------+--------------------+--------------------+------------+----------+-----------+-------+
|    1     |      20     | 14.070618847453996 | 14.251832307035144 |   0.263    |  13.545  |   14.597  |  True |
|    2     |      20     | 14.18649399839945  | 14.251832307035144 |   0.245    |  13.696  |   14.677  |  True |
|    3     |      20     | 14.303925748788576 | 14.251832307035144 |   0.205    |  13.894  |   14.714  |  True |
|    4     |      20     | 14.39395403482023  | 14.251832307035144 |   0.242    |  13.91   |   14.878  |  True |
|    5     |      20     | 14.457514027766871 | 14.251832307035144 |   0.178    |  14.102  |   14.813  |  True |
|    6     |      20     | 13.98172905104899  | 14.251832307035144 |   0.244    |  13.494  |   1

### Task: 3
<pre>
<font color='red'><b>Given a single query point, predict the house price.</b></font>

<li>Consider xq= [0.18,20.0,5.00,0.0,0.421,5.60,72.2,7.95,7.0,30.0,19.1,372.13,18.60]. Predict the house price for this point as described in Step 2 of Task 1. </li>
</pre>

In [20]:
xq = [0.18,20.0,5.00,0.0,0.421,5.60,72.2,7.95,7.0,30.0,19.1,372.13,18.60]
xq = np.asarray(xq)
pred_cor = list()
for i in range(30):
    bh = xq[bag_fea[i]].reshape(1,-1)
    pred_cor.append(trees[i].predict(bh))

In [21]:
pred_y = 0
for j in range(30):
    pred_y += pred_cor[j]
pred_y /= 30

print("Predicted value of xq is {}".format(pred_y))

Predicted value of xq is [19.86333333]
