# KNN [K-Nearest Neighbors]

<p>KNN regression predicts the response for a test data point by averaging the responses of the k training observations closest to it. In this case, the optimal k is found to be 3, which means the model uses the 3 nearest training points for each prediction (James et al., 2021).</p>
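
To make the averaging step concrete, here is a minimal base-R sketch of KNN regression on made-up toy data. The helper `knn_predict` is hypothetical and written only for illustration; it is not how `caret::knnreg` is implemented, and it assumes all predictors are numeric.

```r
# Minimal KNN regression for numeric predictors (illustrative helper, not caret's code).
knn_predict <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  apply(new_x, 1, function(point) {
    # Euclidean distance from this point to every training row
    dists <- sqrt(rowSums(sweep(train_x, 2, point)^2))
    # Average the responses of the k closest training rows
    mean(train_y[order(dists)[seq_len(k)]])
  })
}

# Toy data: one predictor, five training points (values are made up)
toy_x <- data.frame(x = c(1, 2, 3, 10, 11))
toy_y <- c(10, 20, 30, 100, 110)

# With k = 3, a point at x = 2 is closest to x = 1, 2 and 3,
# so the prediction is the mean of 10, 20 and 30.
pred <- knn_predict(toy_x, toy_y, data.frame(x = 2), k = 3)
print(pred)
```

A larger k averages over more neighbors, which smooths the predictions at the cost of responding less to local structure.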

In [1]:
library(class)
library(caret)

Loading required package: lattice
Loading required package: ggplot2
"package 'ggplot2' was built under R version 3.6.3"

#### Let's load the dataset

In [2]:
data1 <- read.csv('C:/Users/saisr/Documents/Wayne/Winter 2022/STA 5820/Bike-Sharing-Dataset/hour.csv')

# Making a copy so the original can be referred back to if needed
df1 <- data1

str(df1)

'data.frame':	17379 obs. of  17 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
 $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
 $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
 $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...


In [3]:
df1$instant <- NULL
df1$registered <- NULL
df1$casual <- NULL
df1$season <- as.factor(df1$season)
df1$yr <- as.factor(df1$yr)
df1$mnth <- as.factor(df1$mnth)
df1$hr <- as.factor(df1$hr)
df1$holiday <- as.factor(df1$holiday)
df1$workingday <- as.factor(df1$workingday)
df1$weathersit <- as.factor(df1$weathersit)
df1$weekday <- as.factor(df1$weekday)

df1$atemp <- NULL # atemp is the normalized "feels-like" temperature, while temp is the
# normalized temperature (both derived from Celsius readings). The two variables carry nearly
# the same information, so atemp is removed as redundant.

df1$dteday <- NULL # Since we are interested in how the conditions of the day affect the number of bikes rented,
# we can remove dteday, which records the date on which the data was collected.
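
KNN compares observations by distance, so the way factor columns are turned into numbers matters. The notebook does not show how the factors are encoded before the distance is computed. One common option, sketched here on made-up data and not taken from the original analysis, is to expand each factor into 0/1 indicator columns with `model.matrix`.

```r
# Toy data frame with one factor and one numeric predictor (values are made up).
toy <- data.frame(
  weathersit = factor(c(1, 2, 3, 1)),
  temp = c(0.24, 0.22, 0.80, 0.76)
)

# model.matrix expands each factor into 0/1 indicator columns;
# "- 1" drops the intercept so the first factor keeps all of its levels.
toy_encoded <- model.matrix(~ . - 1, data = toy)
print(toy_encoded)
```

With this encoding, two rows that differ in `weathersit` are the same distance apart whichever levels are involved, instead of depending on the numeric codes of the levels.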

#### Splitting the dataset

In [4]:
# Since this is time-series data and the models need to be compared on the same split, the dataset is not
# split randomly into train and test sets. Instead, the first 13904 observations [80% of the dataset] form
# the training set, and the remaining 3475 observations form the test set.

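# Named train_idx rather than t to avoid masking base R's transpose function t()
train_idx <- 1:13904
Train <- df1[train_idx, ]
Test <- df1[-train_idx, ]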

train.X <- data.frame(Train[,1:11])
train.y <- c(Train[, 12])
test.x <- data.frame(Test[, 1:11])
test.y <- c(Test[,12])
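
The hard-coded 13904 can also be derived from the row count. The sketch below assumes the 17379 rows reported by `str(df1)` and rounds the 80% cut-off up, which reproduces the split sizes stated above; the variable names are illustrative.

```r
# Number of rows reported by str(df1) in this notebook
n_obs <- 17379

# First 80% of rows (rounded up) go to training, the rest to testing,
# so the chronological order of the observations is preserved.
n_train <- ceiling(0.8 * n_obs)
train_idx <- seq_len(n_train)
test_idx <- setdiff(seq_len(n_obs), train_idx)

cat("train rows:", length(train_idx), "\n")
cat("test rows:", length(test_idx), "\n")
```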

In [5]:
head(Train)
head(Test)

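| season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | hum | windspeed | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.81 | 0.0 | 16 |
| 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.8 | 0.0 | 40 |
| 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.8 | 0.0 | 32 |
| 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.75 | 0.0 | 13 |
| 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.75 | 0.0 | 1 |
| 1 | 0 | 1 | 5 | 0 | 6 | 0 | 2 | 0.24 | 0.75 | 0.0896 | 1 |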


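|  | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | hum | windspeed | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13905 | 3 | 1 | 8 | 13 | 0 | 2 | 1 | 2 | 0.8 | 0.52 | 0.194 | 253 |
| 13906 | 3 | 1 | 8 | 14 | 0 | 2 | 1 | 2 | 0.82 | 0.46 | 0.0 | 261 |
| 13907 | 3 | 1 | 8 | 15 | 0 | 2 | 1 | 1 | 0.8 | 0.52 | 0.0 | 306 |
| 13908 | 3 | 1 | 8 | 16 | 0 | 2 | 1 | 3 | 0.76 | 0.66 | 0.2836 | 445 |
| 13909 | 3 | 1 | 8 | 17 | 0 | 2 | 1 | 2 | 0.78 | 0.62 | 0.1343 | 868 |
| 13910 | 3 | 1 | 8 | 18 | 0 | 2 | 1 | 2 | 0.76 | 0.62 | 0.1642 | 814 |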


#### Training the model

In [6]:
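# mse_val[k] holds the test MSE for that k; k = 1 is not evaluated and stays NA
mse_val <- rep(NA_real_, 50)

for (i in 2:50) {
  KNN_1 <- knnreg(train.X, train.y, k = i)
  pred_y <- predict(KNN_1, test.x)
  mse_val[i] <- mean((test.y - pred_y)^2)
}

In [11]:
# which.min skips NA entries, so the returned index is the k with the lowest test MSE
which.min(mse_val)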

#### The optimal value of k was found using the test MSE.
#### Among the k values tried (2 to 50), the one with the lowest test MSE was chosen for the model. The optimal k is 3.
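
Because k is selected on the same test set that is later used to report the final error, that error may be somewhat optimistic. One alternative, sketched below on simulated data, is to hold out the last part of the training period as a validation set and choose k there. The data, the helper `knn_predict`, and the range of k are all assumptions made for illustration; this is not the procedure used in this notebook.

```r
set.seed(1)

# Simulated stand-in for the training data (values are made up)
n <- 200
x <- matrix(runif(n * 2), ncol = 2)
y <- 50 * x[, 1] + 10 * sin(6 * x[, 2]) + rnorm(n)

# Keep the time order: first 80% of rows to fit, last 20% to validate
n_fit <- floor(0.8 * n)
fit_idx <- seq_len(n_fit)
val_idx <- (n_fit + 1):n

# Illustrative KNN regression helper for numeric predictors
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(point) {
    # Squared Euclidean distance gives the same neighbor ranking as the distance itself
    dists <- rowSums(sweep(train_x, 2, point)^2)
    mean(train_y[order(dists)[seq_len(k)]])
  })
}

# Validation MSE for each candidate k
k_values <- 2:20
val_mse <- sapply(k_values, function(k) {
  pred <- knn_predict(x[fit_idx, , drop = FALSE], y[fit_idx],
                      x[val_idx, , drop = FALSE], k)
  mean((y[val_idx] - pred)^2)
})

# Map the position of the minimum back to the actual k value
best_k <- k_values[which.min(val_mse)]
print(best_k)
```

The test set is then touched only once, after k has been fixed, so the reported test error is not influenced by the tuning step.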

In [12]:
KNN_1 <- knnreg(train.X, train.y, k = 3)

#### Using the model on the test dataset

In [13]:
predtst_y <- predict(KNN_1, newdata = test.x)

In [14]:
sqrt(mean((test.y - predtst_y)^2))

#### The test RMSE is approximately 120.78.
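
RMSE is the square root of the mean squared prediction error, so it is expressed in the same units as `cnt` (bikes rented per hour). A small sketch with a hypothetical `rmse` helper and made-up numbers shows the calculation:

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up observed and predicted counts, for illustration only
actual <- c(16, 40, 32, 13)
predicted <- c(20, 35, 30, 13)

# Errors are -4, 5, 2, 0; squared errors average to 11.25
print(rmse(actual, predicted))
```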