# OOB

The phrase "out of bag" comes up mostly in the context of random forest models in machine learning. In a random forest, each decision tree in the ensemble is typically trained on a bootstrap sample: a sample of the same size as the training set, drawn at random with replacement. Because rows can be drawn more than once, some data points are never included in the training of a particular tree, and these omitted data points are referred to as that tree's "out of bag" (OOB) samples.

The fraction of out of bag samples for a given tree is approximately 36.8% of the training set. For a dataset of n rows, each of the n draws misses a particular row with probability 1 - 1/n, so the probability that the row is never selected in a bootstrap sample is (1 - 1/n)^n. As n grows this approaches 1/e ≈ 0.368, where "e" is approximately 2.71828. The complementary fraction, 1 - 1/e ≈ 63.2%, is the share of distinct rows that do appear in the bootstrap sample.

So, for a reasonably large dataset, you can expect that roughly 36.8% of the data points will be OOB samples for each tree, and each tree has a different OOB set. Every data point can therefore be scored by the trees that never saw it, which means the OOB samples can be used to estimate the performance of the model without the need for a separate validation dataset.
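The 36.8% figure can be checked with a quick simulation. This is a minimal sketch, not part of the original notebook: `n_samples`, `n_trees` and the seed are arbitrary choices, and each loop iteration stands in for one hypothetical tree's bootstrap sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n_samples = 1000                # assumed dataset size
n_trees = 200                   # assumed number of bootstrap samples

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap sample: n_samples row indices drawn with replacement
    drawn = rng.integers(0, n_samples, size=n_samples)
    # Rows that were never drawn are out of bag for this sample
    n_oob = n_samples - np.unique(drawn).size
    oob_fractions.append(n_oob / n_samples)

mean_oob = float(np.mean(oob_fractions))
exact = (1 - 1 / n_samples) ** n_samples
print(f"simulated: {mean_oob:.3f}  (1 - 1/n)^n: {exact:.3f}  1/e: {np.exp(-1):.3f}")
```

The simulated mean, the finite-n formula and 1/e should all agree to about two decimal places.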

# Using the Heart Dataset

Kaggle Link: https://www.kaggle.com/datasets/zhaoyingzhu/heartcsv

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("/Users/rajatchauhan/Desktop/Machine Learning Notes/Datasets/Heart.csv")
df

Unnamed: 0.1,Unnamed: 0,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
0,1,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed,No
1,2,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal,Yes
2,3,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable,Yes
3,4,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal,No
4,5,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,299,45,1,typical,110,264,0,0,132,0,1.2,2,0.0,reversable,Yes
299,300,68,1,asymptomatic,144,193,1,0,141,0,3.4,2,2.0,reversable,Yes
300,301,57,1,asymptomatic,130,131,0,0,115,1,1.2,2,1.0,reversable,Yes
301,302,57,0,nontypical,130,236,0,2,174,0,0.0,2,1.0,normal,Yes


# Preprocessing

# 1. Handling Null values

In [3]:
df.isna().sum()

Unnamed: 0    0
Age           0
Sex           0
ChestPain     0
RestBP        0
Chol          0
Fbs           0
RestECG       0
MaxHR         0
ExAng         0
Oldpeak       0
Slope         0
Ca            4
Thal          2
AHD           0
dtype: int64

Columns Ca and Thal have some null values (4 and 2 respectively).

Let us replace these values with the mode or the mean, depending on the kind of column.

In [4]:
df["Ca"].unique()

array([ 0.,  3.,  2.,  1., nan])

In [5]:
df["Ca"].value_counts()

0.0    176
1.0     65
2.0     38
3.0     20
Name: Ca, dtype: int64

Ca takes only the values 0, 1, 2 and 3, so we treat it as a categorical column and replace the NaN values with the mode.

In [6]:
Ca_mode = float(df["Ca"].mode()[0]) # mode() returns a Series, so take its first entry
Ca_mode

0.0

In [7]:
df["Ca"] = df["Ca"].fillna(Ca_mode) # assign back instead of calling fillna(inplace=True) on a column
df["Ca"].unique()

array([0., 3., 2., 1.])

Now the same for column Thal, which holds string categories:

In [8]:
df["Thal"].unique()

array(['fixed', 'normal', 'reversable', nan], dtype=object)

In [9]:
df["Thal"].value_counts()

normal        166
reversable    117
fixed          18
Name: Thal, dtype: int64

In [10]:
Thal_mode = df["Thal"].mode()[0] # Get the first mode as a string
Thal_mode

'normal'

In [11]:
df["Thal"] = df["Thal"].fillna(Thal_mode) # assign back instead of calling fillna(inplace=True) on a column
df["Thal"].unique()

array(['fixed', 'normal', 'reversable'], dtype=object)

In [12]:
df.isna().sum()

Unnamed: 0    0
Age           0
Sex           0
ChestPain     0
RestBP        0
Chol          0
Fbs           0
RestECG       0
MaxHR         0
ExAng         0
Oldpeak       0
Slope         0
Ca            0
Thal          0
AHD           0
dtype: int64

There are no missing values left.
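The column-by-column mode imputation above can also be written in one step. The sketch below uses a small made-up frame (`toy` is hypothetical, not the heart data) and assumes every column with missing values is categorical, so filling with the mode is appropriate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric-coded and one string categorical column
toy = pd.DataFrame({
    "Ca": [0.0, 0.0, 1.0, np.nan, 2.0],
    "Thal": ["normal", None, "fixed", "normal", "reversable"],
})

# mode() returns one row per mode; iloc[0] picks the first mode of each column
modes = toy.mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name
filled = toy.fillna(modes)
print(filled)
```

If a column has several equally frequent values, `mode()` lists them in sorted order, so `iloc[0]` picks the smallest one.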

# 2. Removing unnecessary columns

In [13]:
df

Unnamed: 0.1,Unnamed: 0,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
0,1,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed,No
1,2,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal,Yes
2,3,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable,Yes
3,4,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal,No
4,5,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,299,45,1,typical,110,264,0,0,132,0,1.2,2,0.0,reversable,Yes
299,300,68,1,asymptomatic,144,193,1,0,141,0,3.4,2,2.0,reversable,Yes
300,301,57,1,asymptomatic,130,131,0,0,115,1,1.2,2,1.0,reversable,Yes
301,302,57,0,nontypical,130,236,0,2,174,0,0.0,2,1.0,normal,Yes


The first column is a serial number and is not required for our ML model, so we keep only the feature columns.

In [14]:
X = df.iloc[:, 1:14].copy() # copy so that encoding columns of X later does not act on a slice of df
X

Unnamed: 0,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal
0,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed
1,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal
2,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable
3,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal
4,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,typical,110,264,0,0,132,0,1.2,2,0.0,reversable
299,68,1,asymptomatic,144,193,1,0,141,0,3.4,2,2.0,reversable
300,57,1,asymptomatic,130,131,0,0,115,1,1.2,2,1.0,reversable
301,57,0,nontypical,130,236,0,2,174,0,0.0,2,1.0,normal


# 3. Encoding

Some columns are non-numeric and need encoding.

In [15]:
X["ChestPain"].unique()

array(['typical', 'asymptomatic', 'nonanginal', 'nontypical'],
      dtype=object)

In [16]:
X["Thal"].unique()

array(['fixed', 'normal', 'reversable'], dtype=object)

In [17]:
from sklearn.preprocessing import LabelEncoder

Create a LabelEncoder object:

In [18]:
LabEnc = LabelEncoder()

In [19]:
X["ChestPain"] = LabEnc.fit_transform(X["ChestPain"])
X

Unnamed: 0,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal
0,63,1,3,145,233,1,2,150,0,2.3,3,0.0,fixed
1,67,1,0,160,286,0,2,108,1,1.5,2,3.0,normal
2,67,1,0,120,229,0,2,129,1,2.6,2,2.0,reversable
3,37,1,1,130,250,0,0,187,0,3.5,3,0.0,normal
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,3,110,264,0,0,132,0,1.2,2,0.0,reversable
299,68,1,0,144,193,1,0,141,0,3.4,2,2.0,reversable
300,57,1,0,130,131,0,0,115,1,1.2,2,1.0,reversable
301,57,0,2,130,236,0,2,174,0,0.0,2,1.0,normal


In [20]:
X["Thal"] = LabEnc.fit_transform(X["Thal"])
X

Unnamed: 0,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal
0,63,1,3,145,233,1,2,150,0,2.3,3,0.0,0
1,67,1,0,160,286,0,2,108,1,1.5,2,3.0,1
2,67,1,0,120,229,0,2,129,1,2.6,2,2.0,2
3,37,1,1,130,250,0,0,187,0,3.5,3,0.0,1
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,3,110,264,0,0,132,0,1.2,2,0.0,2
299,68,1,0,144,193,1,0,141,0,3.4,2,2.0,2
300,57,1,0,130,131,0,0,115,1,1.2,2,1.0,2
301,57,0,2,130,236,0,2,174,0,0.0,2,1.0,1
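Because the same `LabEnc` object is refit for each column, it only remembers the classes of the last column it saw. The sketch below, on a hypothetical five-value series, shows how the label-to-code mapping can be read from `classes_` right after fitting; `LabelEncoder` assigns codes in sorted order of the labels, which matches the ChestPain codes in the tables above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of ChestPain values, not the full column
chest_pain = pd.Series(
    ["typical", "asymptomatic", "nonanginal", "nontypical", "asymptomatic"]
)

encoder = LabelEncoder()
codes = encoder.fit_transform(chest_pain)

# classes_ is sorted, and each label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes)
```

Keeping one encoder per column (or saving each mapping) makes it possible to decode the values later.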


The last column, AHD, is the target column.

In [21]:
y = df["AHD"]
y

0       No
1      Yes
2      Yes
3       No
4       No
      ... 
298    Yes
299    Yes
300    Yes
301    Yes
302     No
Name: AHD, Length: 303, dtype: object

This column also needs encoding.

In [22]:
y = LabEnc.fit_transform(df["AHD"])
y

array([0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,

# Doing Train Test Split

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)

# Making Random Forest Model with OOB score

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [26]:
rf = RandomForestClassifier(oob_score = True)

In [27]:
rf.fit(X_train, y_train)

RandomForestClassifier(oob_score=True)

Each tree was trained on its own bootstrap sample, so approximately 36.8% of the training rows were left out of any given tree because of random sampling with replacement.
These rows are that tree's OOB samples, and we can use them to test and validate our model.

Let us look at the OOB score, which shows how the model performs on these left-out samples.

In [28]:
rf.oob_score_

0.7975206611570248
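To make the OOB score less of a black box, it can be reproduced by hand from `oob_decision_function_`, which scikit-learn fills in when `oob_score=True`. This is a sketch on synthetic data from `make_classification`; the names `X_demo`, `y_demo` and `forest` are placeholders, not the heart data or the `rf` model above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (assumption: any binary classification set would do)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

# Row i holds class probabilities for training row i, averaged over only
# the trees whose bootstrap sample did not contain row i
oob_votes = forest.oob_decision_function_

# Pick the most probable class per row and score against the true labels
oob_pred = forest.classes_[np.argmax(oob_votes, axis=1)]
manual_oob_score = accuracy_score(y_demo, oob_pred)
print(manual_oob_score, forest.oob_score_)
```

With very few trees a row may never be out of bag, in which case its entry is undefined; 100 trees makes that extremely unlikely.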

Let us compare this OOB accuracy with the model's accuracy on the held-out test set.

In [29]:
from sklearn.metrics import accuracy_score

In [30]:
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.8688524590163934

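The test accuracy (about 0.87) is higher than the OOB score (about 0.80) here, but the test set has only 61 rows, so its accuracy estimate is noisy. The OOB score still gives a reasonable estimate of how the model performs, and it needs no separate validation set.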
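How much the two estimates move from run to run can be seen by repeating the split and fit with different seeds. The sketch below uses synthetic data of the same shape as the feature matrix (303 rows, 13 features) as a stand-in; the data, the seeds and the `n_informative` setting are all assumptions for illustration, so the numbers will not match the heart data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the heart features, not the real data
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

results = []
for seed in range(5):
    # A new split and a new forest for every seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=seed)
    model.fit(X_tr, y_tr)
    # Store (OOB accuracy on training rows, accuracy on the held-out test rows)
    results.append((model.oob_score_, model.score(X_te, y_te)))

for oob, test in results:
    print(f"OOB: {oob:.3f}  test: {test:.3f}")
```

Comparing the spread of the two columns gives a rough feel for how much of a gap between OOB and test accuracy is just sampling noise.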