In [1]:
# Set seed to replicate exact answers
set.seed(123)
# Read data
df <- read.csv("canadianProtestData.csv")
head(df)

| | X | year | month | prov | pop | protests |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` |
| 1 | 1 | 2023 | November | Alberta | 4756408 | 20 |
| 2 | 2 | 2023 | November | British Columbia | 5581127 | 27 |
| 3 | 3 | 2023 | November | Manitoba | 1465440 | 10 |
| 4 | 4 | 2023 | November | New Brunswick | 842725 | 5 |
| 5 | 5 | 2023 | November | Newfoundland and Labrador | 540418 | 7 |
| 6 | 6 | 2023 | November | Northwest Territories | 44760 | 2 |


**Pre-Processing**

-------------------------------
First, the id column `X` is removed because it only duplicates the row index. Variables that represent categories can be converted with the function

> as.factor()

and variables that should be stored as doubles rather than integers can be converted with the function

> as.numeric()
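
As a minimal sketch of what these two conversions do, the toy data frame below (hypothetical values, not taken from `canadianProtestData.csv`) converts a character column to a factor and an integer column to a double. The names `toy` and `count` are made up for illustration.

```r
# Toy data frame with hypothetical values, used only to show the conversions
toy <- data.frame(
  prov  = c("Alberta", "Manitoba", "Alberta"),
  count = c(3L, 5L, 7L),
  stringsAsFactors = FALSE
)

toy$prov  <- as.factor(toy$prov)    # character -> factor (one level per distinct value)
toy$count <- as.numeric(toy$count)  # integer   -> double

levels(toy$prov)   # the distinct categories
class(toy$count)   # storage class after conversion
str(toy)           # compact overview of the converted columns
```

A factor stores each distinct value once as a level, which is what `glm()` needs in order to build one indicator column per category.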

In [2]:
# Remove the id column (logical indexing is safe even if "X" is absent)
df <- df[, names(df) != "X"]

# Treat prov and month as categorical factors rather than character strings
df$prov  <- as.factor( df$prov  )
df$month <- as.factor( df$month )

# Treat year as a categorical factor rather than a large integer
df$year <- as.factor( df$year )

# Store pop and protests as doubles instead of integers
df$pop <- as.numeric( df$pop )
df$protests <- as.numeric( df$protests )

# Population is orders of magnitude larger than protests, so put it on a log scale
df$pop <- log(df$pop)

# Look at data after pre-processing
head(df)

| | year | month | prov | pop | protests |
|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` |
| 1 | 2023 | November | Alberta | 15.375 | 20 |
| 2 | 2023 | November | British Columbia | 15.5349 | 27 |
| 3 | 2023 | November | Manitoba | 14.19767 | 10 |
| 4 | 2023 | November | New Brunswick | 13.6444 | 5 |
| 5 | 2023 | November | Newfoundland and Labrador | 13.2001 | 7 |
| 6 | 2023 | November | Northwest Territories | 10.70907 | 2 |


-----------------------------------
Population takes values several orders of magnitude larger than the protest counts. Taking the log of population brings the numeric variables onto comparable scales, which makes them easier to compare and interpret and keeps the model fitting numerically well behaved. The log transform also pairs naturally with the log link: the expected count becomes proportional to a power of the population, so as the population approaches 0 the expected number of protests also approaches 0, provided the population coefficient is positive.
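
The effect of using `log()` on population under a log link can be sketched numerically. The coefficients `b0` and `b1` below are hypothetical values chosen only for illustration; they are not the fitted estimates from this notebook.

```r
# Hypothetical coefficients, chosen only for illustration (not fitted values)
b0 <- -10
b1 <- 0.9

# Expected count under a log link with log(population) as the predictor
mu <- function(pop) exp(b0 + b1 * log(pop))

pop <- c(0, 1e3, 1e5, 1e7)
mu(pop)  # the first entry is exactly 0 because log(0) = -Inf and exp(-Inf) = 0

# The same quantity written as a power law in the raw population
all.equal(mu(pop), exp(b0) * pop^b1)
```

Because `exp(b0 + b1 * log(pop))` equals `exp(b0) * pop^b1`, a positive `b1` forces the expected count to 0 when the population is 0 and lets it grow with population.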

We also want to look at how seasons, rather than individual months, affect protests. Grouping by season gives a general picture of yearly trends, but it discards month-level detail and is only approximate, since seasons vary in length. When we leave months out of the model, we end up with a counterintuitive result: the fitted relationship between population and protests is negative, i.e., as population increases, the predicted number of protests decreases. Keeping months in the model is therefore important for preserving that information.
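
The month-to-season grouping described above can also be sketched as a named lookup vector, which handles a whole column at once. The names `season_of` and `months` are hypothetical, and the input is a toy vector rather than the notebook's `df$month`.

```r
# Named lookup vector: month name -> season
season_of <- c(
  December = "Winter", January = "Winter", February = "Winter",
  March    = "Spring", April   = "Spring", May      = "Spring",
  June     = "Summer", July    = "Summer", August   = "Summer",
  September = "Fall",  October = "Fall",   November = "Fall"
)

# Toy input; as.character() is needed so a factor is indexed by label, not by level code
months <- factor(c("November", "January", "July", "April"))
season <- unname(season_of[as.character(months)])
season
```

One design difference from an `if`/`else` chain with a catch-all branch: a misspelled or missing month yields `NA` here instead of silently falling into "Fall", which makes data-entry problems easier to spot.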

In [3]:
# Function to change month to season
seasons <- function(month) {
  if (month %in% c("December", "January", "February")) {
    return("Winter")
  } else if (month %in% c("March", "April", "May")) {
    return("Spring")
  } else if (month %in% c("June", "July", "August")) {
    return("Summer")
  } else {
    return("Fall")
  }
}

df$seasons <- sapply(df$month, seasons)
# Define seasons as category
df$seasons <- as.factor( df$seasons )
# Classic Poisson Regression with a log-link function
md.1 <- glm( protests~year+seasons+prov+pop, data=df, family=poisson(link = "log") )
summary(md.1)


Call:
glm(formula = protests ~ year + seasons + prov + pop, family = poisson(link = "log"), 
    data = df)

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   108.36033   44.06429   2.459 0.013927 *  
year2023                        0.22900    0.08222   2.785 0.005351 ** 
seasonsSpring                  -0.16426    0.06210  -2.645 0.008168 ** 
seasonsSummer                  -0.55482    0.05491 -10.103  < 2e-16 ***
seasonsWinter                  -0.30459    0.06443  -4.728 2.27e-06 ***
provBritish Columbia            1.74739    0.48382   3.612 0.000304 ***
provManitoba                   -8.39432    3.35008  -2.506 0.012221 *  
provNew Brunswick             -12.69997    4.95171  -2.565 0.010325 *  
provNewfoundland and Labrador -15.97063    6.17935  -2.585 0.009752 ** 
provNorthwest Territories     -35.10391   13.31026  -2.637 0.008355 ** 
provNova Scotia               -10.94779    4.27275  -2.562 0.010400 *  
provNunavut 

--------------------------------------------
The next step checks the data for missing values that could affect the model. We also look at a summary of each variable to get a sense of its range, quartiles, mean, and category frequencies.
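
As a small illustration of the missing-value check, the sketch below runs the same calls on a toy data frame with hypothetical values and a few deliberately inserted `NA`s. The names `toy`, `na_counts`, and `complete` are made up for this example.

```r
# Toy data frame with hypothetical values and deliberately missing entries
toy <- data.frame(
  pop      = c(13.2, NA, 15.4, 10.7),
  protests = c(7, 10, NA, NA)
)

dim(toy)                           # number of rows and columns

na_counts <- colSums(is.na(toy))   # missing values per column
na_counts

# Rows with no missing values in any column
complete <- toy[complete.cases(toy), ]
nrow(complete)
```

`is.na()` returns a logical matrix of the same shape as the data frame, so `colSums()` counts the `TRUE` entries column by column.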

In [4]:
print("-----------------Dimensions/Shape----------------------")
dim(df)             # Dimensions
print("----------------Missing Value Count--------------------")
colSums(is.na(df))  # Missing values
print("---------------------Summary---------------------------")
summary(df)         # Summary

[1] "-----------------Dimensions/Shape----------------------"


[1] "----------------Missing Value Count--------------------"


[1] "---------------------Summary---------------------------"


   year          month                            prov          pop       
 2022:156   April   : 26   Alberta                  : 23   Min.   :10.60  
 2023:143   August  : 26   British Columbia         : 23   1st Qu.:12.02  
            February: 26   Manitoba                 : 23   Median :13.85  
            January : 26   New Brunswick            : 23   Mean   :13.56  
            July    : 26   Newfoundland and Labrador: 23   3rd Qu.:15.35  
            June    : 26   Northwest Territories    : 23   Max.   :16.58  
            (Other) :143   (Other)                  :161                  
    protests       seasons  
 Min.   : 0.00   Fall  :78  
 1st Qu.: 2.00   Spring:78  
 Median : 6.00   Summer:78  
 Mean   :12.02   Winter:65  
 3rd Qu.:16.50              
 Max.   :91.00              
                            