# Random Forest

<p> A random forest is a tree-based ensemble method. It fits many regression trees, each on a bootstrap sample of the training data, and considers only a random subset of the predictors at each split. The final prediction is the average of the individual trees' predictions (James et al., 2021). </p>
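
The idea of bootstrapping, sampling predictors, and averaging can be sketched by hand. This is only an illustration, not how `randomForest` is implemented: it assumes the `rpart` package and the built-in `mtcars` data, and because `rpart` does not sample predictors at each split, the sketch draws one random subset of predictors per tree instead. The names `n_trees`, `m`, and `ensemble_pred` are made up for this example.

```r
library(rpart)  # recommended package that ships with standard R installations

set.seed(1)
n_trees <- 50
predictors <- setdiff(names(mtcars), "mpg")
m <- floor(length(predictors) / 3)  # number of predictors sampled for each tree

# Fit each tree on a bootstrap sample, using a random subset of the predictors
trees <- lapply(seq_len(n_trees), function(b) {
  rows <- sample(nrow(mtcars), replace = TRUE)  # bootstrap sample of row indices
  vars <- sample(predictors, m)                 # simplification: per tree, not per split
  rpart(reformulate(vars, response = "mpg"),
        data = mtcars[rows, ],
        control = rpart.control(minsplit = 5))
})

# Average the per-tree predictions to obtain the ensemble prediction
tree_preds <- sapply(trees, predict, newdata = mtcars)  # one column per tree
ensemble_pred <- rowMeans(tree_preds)
head(ensemble_pred)
```

Averaging over many trees grown on different resamples is what reduces the variance of a single deep tree.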

In [1]:
library(randomForest)

randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.


#### Let's load the dataset

In [2]:
data <- read.csv('C:/Users/saisr/Documents/Wayne/Winter 2022/STA 5820/Bike-Sharing-Dataset/hour.csv')
# Work on a copy so the original data can be referred back to if needed
df <- data
str(df)

'data.frame':	17379 obs. of  17 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
 $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
 $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
 $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...


In [3]:
df$instant <- NULL
df$registered <- NULL
df$casual <- NULL

df$season <- as.factor(df$season)
df$yr <- as.factor(df$yr)
df$mnth <- as.factor(df$mnth)
df$hr <- as.factor(df$hr)
df$holiday <- as.factor(df$holiday)
df$workingday <- as.factor(df$workingday)
df$weathersit <- as.factor(df$weathersit)
df$weekday <- as.factor(df$weekday)


# atemp is the normalized "feels-like" temperature and temp is the normalized air temperature,
# both derived from readings in Celsius. The two variables carry nearly the same information,
# so atemp is dropped as redundant.
df$atemp <- NULL
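
One way to back up a redundancy argument like this is to look at the correlation between the two variables. The sketch below uses synthetic stand-ins (`temp_demo` and `atemp_demo` are invented for illustration, not taken from the bike-sharing data); with the real data the analogous check would be `cor(data$temp, data$atemp)`.

```r
# Synthetic stand-ins for two closely related temperature measurements
set.seed(42)
temp_demo  <- runif(200)                                # values in [0, 1], like a normalized variable
atemp_demo <- 0.9 * temp_demo + rnorm(200, sd = 0.02)   # nearly a linear function of temp_demo

# A correlation close to 1 suggests the second variable adds little new information
r <- cor(temp_demo, atemp_demo)
r
```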
 

#### Splitting the dataset

In [4]:
# Because this is time-series data and the models need to be compared on the same hold-out period, the dataset is
# not split at random. Instead, the first 13904 observations (about 80% of the dataset) form the training set and
# the remaining 3475 observations form the test set.

t <- c(1:13904)
Train <- df[t,]
Test <- df[-t,]

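# Select columns by name rather than by position: dteday is still column 1 here, so the positional
# indices 1:11 and 12 would pick up dteday as a predictor and windspeed as the response.
predictors <- setdiff(names(Train), c("dteday", "cnt"))
train.X <- Train[, predictors]
train.y <- Train$cnt
test.x <- Test[, predictors]
test.y <- Test$cnt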
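
The split sizes quoted above follow from the row count reported by `str(df)`. As a small sketch, the chronological split index can be derived from the proportion instead of being hard-coded; `n_train`, `train_idx`, and `test_idx` are names introduced here for illustration.

```r
n <- 17379                    # number of rows reported by str(df)
n_train <- ceiling(0.8 * n)   # first 80% of the rows, rounded up
train_idx <- seq_len(n_train)
test_idx  <- setdiff(seq_len(n), train_idx)

n_train            # 13904
length(test_idx)   # 3475

# Every training row precedes every test row, so no future data leaks into training
max(train_idx) < min(test_idx)
```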


In [5]:
# We are interested in how the conditions on a given day affect the number of bike rentals,
# so dteday (the date on which each observation was recorded) can be removed.
Train$dteday <- NULL
Test$dteday <- NULL
head(Train)

season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,hum,windspeed,cnt
1,0,1,0,0,6,0,1,0.24,0.81,0.0,16
1,0,1,1,0,6,0,1,0.22,0.8,0.0,40
1,0,1,2,0,6,0,1,0.22,0.8,0.0,32
1,0,1,3,0,6,0,1,0.24,0.75,0.0,13
1,0,1,4,0,6,0,1,0.24,0.75,0.0,1
1,0,1,5,0,6,0,2,0.24,0.75,0.0896,1


#### Training the model

In [6]:
RF1 <- randomForest(cnt ~ ., data=Train, importance = T)
RF1


Call:
 randomForest(formula = cnt ~ ., data = Train, importance = T) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 2213.827
                    % Var explained: 92.06
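
Two numbers in this summary can be checked by hand. For regression, `randomForest` defaults `mtry` to `max(floor(p / 3), 1)`, where `p` is the number of predictors, and the "Mean of squared residuals" is an out-of-bag MSE whose square root is on the same scale as `cnt`. The variable names below are introduced only for this illustration.

```r
p <- 11                                # predictors left in Train after dropping dteday
mtry_default <- max(floor(p / 3), 1)   # default number of variables tried at each split
mtry_default

oob_mse  <- 2213.827                   # "Mean of squared residuals" from the output above
oob_rmse <- sqrt(oob_mse)              # out-of-bag RMSE in units of rental counts
round(oob_rmse, 2)
```

The out-of-bag RMSE of roughly 47 gives a useful reference point when judging the RMSE obtained on the test set.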

#### Using the model on the test dataset

In [7]:
pred.test <- predict(RF1, newdata = Test)
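sqrt(mean((Test$cnt - pred.test)^2))

#### The test RMSE reported for the original run is ~245.72.

That run took `test.y` from column 12 of `Test` while `dteday` was still present, which is `windspeed` rather than `cnt`, so the figure does not measure error in rental counts. The cell above compares the predictions against `Test$cnt`; rerunning it gives the test RMSE.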
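
For comparing several models on the same test set, the RMSE calculation can be wrapped in a small helper. This is a minimal sketch; the function name `rmse` and the toy vectors are assumptions made for illustration, not part of the original analysis.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy check: errors of -2, 2 and -3 give sqrt((4 + 4 + 9) / 3)
toy_rmse <- rmse(c(10, 20, 30), c(12, 18, 33))
toy_rmse
```

With the objects from this notebook, the call would be `rmse(Test$cnt, pred.test)`.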