# Generalized Least Squares

<p>Generalized least squares (GLS) is a regression method for data whose error terms are correlated. It accounts for that correlation when estimating the coefficients of the predictors and their standard errors. Our multiple linear regression model violated the IID assumption, specifically because its error terms are correlated, so we will fit a GLS model instead.</p>
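
As a sketch of what "accounting for the correlation" means, assume the linear model $y = X\beta + \varepsilon$ with $\operatorname{Cov}(\varepsilon) = \sigma^2 \Omega$. This notation is introduced here for illustration and does not appear in the original analysis. The GLS estimator weights the data by the inverse of the error correlation matrix, and under a first-order autoregressive, AR(1), structure the correlation between two errors decays with the distance between their positions:

$$
\hat{\beta}_{\mathrm{GLS}} = \left(X^{\top}\Omega^{-1}X\right)^{-1}X^{\top}\Omega^{-1}y,
\qquad
\Omega_{ij} = \phi^{|i-j|}
$$

With $\phi = 0$ the matrix $\Omega$ is the identity and the estimator reduces to ordinary least squares.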

In [1]:
library(nlme)

#### Let's load the dataset

In [9]:
data1 <- read.csv('C:/Users/saisr/Documents/Wayne/Winter 2022/STA 5820/Bike-Sharing-Dataset/hour.csv')

# Keep a copy so the original can be referred back to if needed
df1 <- data1

str(df1)

'data.frame':	17379 obs. of  17 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
 $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
 $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
 $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...


In [10]:
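# Drop columns that are not used as predictors.
# casual and registered sum to cnt, so keeping them would leak the target.
df1$instant <- NULL
df1$dteday <- NULL
df1$atemp <- NULL
df1$registered <- NULL
df1$casual <- NULL

# Treat the integer-coded calendar and weather variables as categorical
df1$season <- as.factor(df1$season)
df1$yr <- as.factor(df1$yr)
df1$mnth <- as.factor(df1$mnth)
df1$hr <- as.factor(df1$hr)
df1$holiday <- as.factor(df1$holiday)
df1$weekday <- as.factor(df1$weekday)
df1$workingday <- as.factor(df1$workingday)
df1$weathersit <- as.factor(df1$weathersit)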

#### Splitting the dataset

In [11]:
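# Split by row order: the first 13904 rows (about 80% of 17379) are used for training.
# The index is named train_idx so it does not mask base R's t() function.
train_idx <- 1:13904
Train <- df1[train_idx, ]
Test <- df1[-train_idx, ]

# Columns 1-11 are the predictors; column 12 is the response cnt
train.X <- Train[, 1:11]
train.y <- Train[, 12]
test.x <- Test[, 1:11]
test.y <- Test[, 12]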

### Training the model

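<p>Because GLS models the correlation between error terms (here with an AR(1) structure), its standard errors for the coefficients are more reliable than those from ordinary least squares, which assumes independent errors.</p>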
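
To make the point about standard errors concrete, here is a minimal sketch on simulated data rather than the bike-sharing data. The data frame `sim`, the true slope of 2, and the AR(1) parameter of 0.7 are all assumptions chosen for illustration. It fits the same regression with `lm()` and with `gls()` using `corAR1()`, then prints the estimated correlation parameter and the two standard errors side by side.

```r
library(nlme)

set.seed(1)
n <- 500

# Simulated predictor and errors, both following an AR(1) process with phi = 0.7
sim <- data.frame(x = as.numeric(arima.sim(model = list(ar = 0.7), n = n)))
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))
sim$y <- 1 + 2 * sim$x + e

# OLS assumes independent errors; GLS estimates the AR(1) correlation as well
ols_fit <- lm(y ~ x, data = sim)
gls_fit <- gls(y ~ x, data = sim, correlation = corAR1(form = ~ 1))

# Estimated AR(1) parameter on its natural (constrained) scale
phi_hat <- coef(gls_fit$modelStruct$corStruct, unconstrained = FALSE)

# Standard error of the slope under each fit
se_ols <- summary(ols_fit)$coefficients["x", "Std. Error"]
se_gls <- summary(gls_fit)$tTable["x", "Std.Error"]

print(phi_hat)
print(round(c(OLS = se_ols, GLS = se_gls), 4))
```

When the estimated `Phi` is far from zero, the independence assumption behind the `lm()` standard errors does not hold, and the `gls()` standard errors are the ones to rely on.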

In [12]:
# Fit GLS with an AR(1) correlation structure on the errors, taken in row order
GLS <- gls(cnt ~ ., data = Train, correlation = corAR1(), control = list(singular.ok = TRUE))

In [13]:
summary(GLS)

Generalized least squares fit by REML
  Model: cnt ~ . 
  Data: Train 
       AIC    BIC    logLik
  165388.5 165803 -82639.27

Correlation Structure: AR(1)
 Formula: ~1 
 Parameter estimate(s):
Phi 
  0 

Coefficients:
               Value Std.Error   t-value p-value
(Intercept) -66.7645   6.56181 -10.17472  0.0000
season2      29.7344   4.67543   6.35972  0.0000
season3      16.6683   6.02215   2.76783  0.0057
season4      40.8565   5.98117   6.83085  0.0000
yr1          84.4629   1.84559  45.76462  0.0000
mnth2         5.6001   3.60173   1.55484  0.1200
mnth3        21.2251   4.13909   5.12796  0.0000
mnth4        20.3019   6.29189   3.22667  0.0013
mnth5        34.9050   6.82089   5.11738  0.0000
mnth6        25.4681   7.19885   3.53780  0.0004
mnth7        11.3409   8.34921   1.35832  0.1744
mnth8        22.4856   8.45886   2.65824  0.0079
mnth9        43.1972   7.79543   5.54135  0.0000
mnth10       38.6978   7.74402   4.99712  0.0000
mnth11       25.4501   7.59972   3.34883  0.0

### Using the model on the test dataset

In [14]:
pred_test1 <- predict(GLS, newdata = Test)

In [15]:
# Root mean squared error (RMSE) on the test set
sqrt(mean((test.y - pred_test1)^2))

#### The test RMSE is ~134.02.
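
For reference, the RMSE is the square root of the mean squared difference between observed and predicted values. A minimal helper, with the hypothetical name `rmse`, might look like the sketch below; the toy numbers are made up for illustration.

```r
# Hypothetical helper: root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy check with made-up numbers: errors of 3 and 4 give sqrt((9 + 16) / 2)
rmse(c(0, 0), c(3, 4))
```

Calling `rmse(test.y, pred_test1)` would compute the same quantity as the cell above.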