# Regression Models in R (tips)

In [18]:
if(!exists("Table1", mode="function")) source("mechkar.R")

In [19]:

library(readr)
library(dplyr)
library(ggplot2)


In [20]:
df <- read.csv("train.csv")
head(df)
dim(df)

Unnamed: 0_level_0,id,season,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,1,1,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,985
2,2,1,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,801
3,3,1,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,1349
4,4,1,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,1562
5,5,1,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,1600
6,6,1,1,0,4,1,1,0.204348,0.233209,0.518261,0.0895652,1606


Data Set Information:

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.


Attribute Information:

Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv

- instant: record index
- season : season (1:winter, 2:spring, 3:summer, 4:fall)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : weather day is holiday or not (extracted from [Web Link])
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
+ weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered

# EDA

In [4]:
df$holiday <- as.factor(df$holiday)
df$season <- as.factor(df$season)
df$mnth <- as.factor(df$mnth)
df$workingday <- as.factor(df$workingday)
df$weathersit <- as.factor(df$weathersit)
df$weekday <- as.factor(df$weekday)
summary(df)

       id      season      mnth     holiday weekday workingday weathersit
 Min.   :  1   1:90   1      : 31   0:355   0:52    0:115      1:226     
 1st Qu.: 92   2:92   3      : 31   1: 10   1:52    1:250      2:124     
 Median :183   3:94   5      : 31           2:52               3: 15     
 Mean   :183   4:89   7      : 31           3:52                         
 3rd Qu.:274          8      : 31           4:52                         
 Max.   :365          10     : 31           5:52                         
                      (Other):179           6:53                         
      temp             atemp              hum           windspeed      
 Min.   :0.05913   Min.   :0.07907   Min.   :0.0000   Min.   :0.02239  
 1st Qu.:0.32500   1st Qu.:0.32195   1st Qu.:0.5383   1st Qu.:0.13558  
 Median :0.47917   Median :0.47285   Median :0.6475   Median :0.18690  
 Mean   :0.48666   Mean   :0.46684   Mean   :0.6437   Mean   :0.19140  
 3rd Qu.:0.65667   3rd Qu.:0.61238   3rd Qu.:0.7

# DATASET PARTITION

In [21]:
tab1 <- train_test(data=df, train_name="train", test_name="test", prop=0.7, seed=5, tableone=TRUE)
tab1

Dataset partitioned into:

 + Train dataset: train

 + Test dataset: test

"The following variables have unique values and will not be included in the analysis: "




 

You got a perfectly balanced training and test datasets

 



V1,V2,Pop,1,2,pval
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Individuals,n,365,255,110,
id,Mean (SD),183.0 (105.5),183.7 (105.0),181.4 (107.2),
id,Median (IQR),183.0 (92.0-274.0),181.0 (92.5-272.5),184.5 (88.0-277.0),0.854
season,Mean (SD),2.5 (1.1),2.5 (1.1),2.5 (1.1),
season,Median (IQR),3.0 (2.0-3.0),2.0 (2.0-3.0),3.0 (1.2-3.8),0.826
mnth,Mean (SD),6.5 (3.5),6.5 (3.4),6.5 (3.5),
mnth,Median (IQR),7.0 (4.0-10.0),6.0 (4.0-9.0),7.0 (3.2-10.0),0.873
holiday,Mean (SD),0.0 (0.2),0.0 (0.1),0.0 (0.2),
holiday,Median (IQR),0.0 (0.0-0.0),0.0 (0.0-0.0),0.0 (0.0-0.0),0.238
weekday,Mean (SD),3.0 (2.0),3.1 (2.0),2.9 (2.0),


In [22]:
### Table of resulting errors
### Name, Model, RMSE, RMSLE
err_res <- NULL

In [23]:
### The error we will use is the RMSE and RMSLE
rmse <- function(y,y_hat) {
    err <- sqrt(sum((y_hat-y)^2,na.rm=T)/length(y))
    return(err)
}

rmsle <- function(y,y_hat) {
    err <- sqrt(sum((log(y_hat+1)-log(y+1))^2,na.rm=T)/length(y))
    return(err)
}


## kNN

In [24]:
library(class)

In [25]:
min_max <- function(x) { (x -min(x))/(max(x)-min(x))   }

In [26]:
X_train <- sapply(data.frame(as.matrix(train)),min_max)

In [27]:
X_test <- sapply(data.frame(as.matrix(test)),min_max)

In [28]:
mod8 <- knn(train,test,cl=train$cnt)

In [29]:
mod8

In [31]:
pred8 <- as.numeric(as.character(mod8))

rmse(test$cnt,pred8)
rmsle(test$cnt,pred8)

In [51]:
real_test <- read.csv("test.csv")
head(real_test)
dim(real_test)

Unnamed: 0_level_0,id,season,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1,366,1,1,0,0,0,1,0.37,0.375621,0.6925,0.192167
2,367,1,1,1,1,0,1,0.273043,0.252304,0.381304,0.329665
3,368,1,1,0,2,1,1,0.15,0.126275,0.44125,0.365671
4,369,1,1,0,3,1,2,0.1075,0.119337,0.414583,0.1847
5,370,1,1,0,4,1,1,0.265833,0.278412,0.524167,0.129987
6,371,1,1,0,5,1,1,0.334167,0.340267,0.542083,0.167908


In [52]:
real_test <- sapply(data.frame(as.matrix(real_test)),min_max)

In [53]:
pred_cnt <- as.numeric(as.character(mod8))


In [56]:
RMSE=rmse(test$cnt,pred_cnt)
RMSLE=rmsle(test$cnt,pred_cnt)

In [58]:
err_res <- rbind(err_res, data.frame(Name="kNN", Model="mod8", 
                                     RMSE=rmse(test$cnt,pred_cnt), 
                                     RMSLE=rmsle(test$cnt,pred_cnt)))
err_res


Name,Model,RMSE,RMSLE
<chr>,<chr>,<dbl>,<dbl>
kNN,mod8,,
kNN,mod8,33.79699,0.02525251
