# STAT301 Final Project: Individual Assignment 1

**Group:** 26

**Group Member:** Sam Thorne 83910448

## Vancouver Crime Report Dataset:

This data was sourced from [Kaggle](https://www.kaggle.com/datasets/agilesifaka/vancouver-crime-report/data).

The data covers crimes from the years 2013-2019 where each row represents an individual crime. Note that this does not include *every* crime throughout this time span but only crimes that have been reported. Additionally, there are crimes that are excluded from this data set due to ongoing investigations and privacy purposes.

The data collection method is unspecified. All information provided in the data can also be found within the City of Vancouver Open Data Catalogue and was originally acquired from the Vancouver Police Department's crime records.

Due to the file size limitations on GitHub, we cut the data down to include only the years 2014 and 2019. The filtered version of the data that we are using for the assignment can be accessed [here](https://raw.githubusercontent.com/samnthorne/FinalProject/main/data/crime_filtered.csv). We chose these two years because comparing them shows how crime has changed over a five-year span. Additionally, we agreed that including every year was not necessary for our questions, as none of us were interested in analyzing year-to-year change in crime in Vancouver.
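
The filtering step described above could look roughly like the following sketch. The data frame `crime_full` is a small made-up stand-in for the full Kaggle file (its local path is not given here), and the commented-out file name is only assumed from the URL above.

```r
library(dplyr)

# Toy stand-in for the full 2013-2019 dataset (values are made up for illustration)
crime_full <- data.frame(
  TYPE = c("Mischief", "Other Theft", "Theft of Bicycle", "Mischief"),
  YEAR = c(2013, 2014, 2017, 2019)
)

# Keep only the two years used in this assignment
crime_filtered <- crime_full %>%
  filter(YEAR %in% c(2014, 2019))

# Writing the result out would then be a single call (commented out here):
# readr::write_csv(crime_filtered, "crime_filtered.csv")
```

Using `%in%` rather than a range comparison keeps exactly the two years of interest and drops everything in between.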


### Sample of data from the filtered dataset:

In [56]:
library(tidyverse)

In [57]:
data <- read_csv('https://raw.githubusercontent.com/samnthorne/FinalProject/main/data/crime_filtered.csv')
head(data)

Rows: 68545 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): TYPE, HUNDRED_BLOCK, NEIGHBOURHOOD
dbl (7): YEAR, MONTH, DAY, HOUR, MINUTE, X, Y

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| TYPE | YEAR | MONTH | DAY | HOUR | MINUTE | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y |
| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| Break and Enter Commercial | 2019 | 3 | 7 | 2 | 6 | 10XX SITKA SQ | Fairview | 490613.0 | 5457110 |
| Break and Enter Commercial | 2019 | 8 | 27 | 4 | 12 | 10XX ALBERNI ST | West End | 491007.8 | 5459174 |
| Break and Enter Commercial | 2014 | 8 | 8 | 5 | 13 | 10XX ALBERNI ST | West End | 491015.9 | 5459166 |
| Break and Enter Commercial | 2014 | 4 | 17 | 5 | 50 | 10XX ALBERNI ST | West End | 491032.3 | 5459150 |
| Break and Enter Commercial | 2014 | 9 | 1 | 14 | 20 | 10XX ALBERNI ST | West End | 491032.3 | 5459150 |
| Break and Enter Commercial | 2014 | 11 | 22 | 23 | 28 | 10XX ALBERNI ST | West End | 491059.5 | 5459122 |


---

## Descriptive summary

The **Vancouver Crime Dataset** has 10 attributes describing the 68,545 crimes it contains. Each row represents one reported crime in Vancouver, and each column records a detail of that crime. Of these 10 attributes, 7 are stored as numeric (`dbl`) values (5 of which represent temporal data), while the remaining 3 are categorical attributes.

**Vancouver Crime Data Attribute Summary**:

| Name of attribute | Type of data | Semantics | Notes |
| ---| ---| ---| ---|
| `TYPE`| Categorical `chr`| The type of crime. | Has 11 unique values: `['Break and Enter Commercial', 'Break and Enter Residential/Other', 'Homicide', 'Mischief', 'Offence Against a Person', 'Other Theft', 'Theft from Vehicle', 'Theft of Bicycle', 'Theft of Vehicle', 'Vehicle Collision or Pedestrian Struck (with Fatality)', 'Vehicle Collision or Pedestrian Struck (with Injury)']` |
| `YEAR`| Categorical `dbl`| The year of the crime (2 options) | Either `2014` or `2019`, because those are the only years included in our data analysis. |
| `MONTH`| Temporal Quantitative `dbl`| The month of the crime | Integers `1`-`12` |
| `DAY`| Temporal Quantitative `dbl`| The day of the month of the crime | Integers `1`-`31` |
| `HOUR`| Temporal Quantitative `dbl`| The hour within the day of the crime | Integers `0`-`23` |
| `MINUTE`| Temporal Quantitative `dbl`| The minute within the hour of the crime | Integers `0`-`59` |
| `HUNDRED_BLOCK`| Categorical `chr`| The block (specific location) on which the crime occurred | 11275 unique values |
| `NEIGHBOURHOOD`| Categorical `chr`| The neighbourhood the crime occurred in | There is a high proportion of null values in this column (10%). Has 25 unique values, counting `NA`: `['Fairview', 'West End', 'Central Business District', 'Grandview-Woodland', 'Mount Pleasant', 'Strathcona', 'Sunset', 'Kensington-Cedar Cottage', 'Stanley Park', 'Shaughnessy', 'Marpole', 'West Point Grey', 'Hastings-Sunrise', 'Victoria-Fraserview', 'Kitsilano', 'Kerrisdale', 'Riley Park', 'Oakridge', 'Arbutus Ridge', 'Renfrew-Collingwood', 'Killarney', 'Dunbar-Southlands', 'South Cambie', 'Musqueam', NA]` |
| `X`| Quantitative `dbl`| East-west location coordinate of the crime | Values such as `490613.0` are projected map coordinates (eastings), not degrees of longitude. Has 3 NA values, and has 0's which also represent missing values. |
| `Y`| Quantitative `dbl`| North-south location coordinate of the crime | Values such as `5457110` are projected map coordinates (northings), not degrees of latitude. Has 3 NA values, and has 0's which also represent missing values. |
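
Because `X` and `Y` use both `NA` and `0` to mark missing locations, it may help to recode the zeros before any analysis. The following is a minimal sketch on made-up rows; the data frame `coords` and its values are illustrative only and are not taken from the dataset.

```r
library(dplyr)

# Toy rows (made up) mimicking the X/Y columns, including 0 and NA placeholders
coords <- data.frame(
  X = c(490613.0, 0, NA, 491007.8),
  Y = c(5457110, 0, NA, 5459174)
)

# Recode 0 as NA so both kinds of missing value are handled the same way
coords_clean <- coords %>%
  mutate(across(c(X, Y), ~ na_if(.x, 0)))

# Count missing values per column after recoding
colSums(is.na(coords_clean))
```

The same `mutate()` call could be applied to `data`, assuming that a coordinate of exactly `0` never denotes a real location.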

### Collecting information for descriptive summary:

Code run to collect the information provided in the descriptive summary above:

In [53]:
summary(data)

     TYPE                YEAR          MONTH             DAY       
 Length:68545       Min.   :2014   Min.   : 1.000   Min.   : 1.00  
 Class :character   1st Qu.:2014   1st Qu.: 4.000   1st Qu.: 8.00  
 Mode  :character   Median :2019   Median : 6.000   Median :15.00  
                    Mean   :2017   Mean   : 6.318   Mean   :15.17  
                    3rd Qu.:2019   3rd Qu.: 9.000   3rd Qu.:22.00  
                    Max.   :2019   Max.   :12.000   Max.   :31.00  
                                                                   
      HOUR           MINUTE      HUNDRED_BLOCK      NEIGHBOURHOOD     
 Min.   : 0.00   Min.   : 0.00   Length:68545       Length:68545      
 1st Qu.: 7.00   1st Qu.: 0.00   Class :character   Class :character  
 Median :14.00   Median : 7.00   Mode  :character   Mode  :character  
 Mean   :12.38   Mean   :16.31                                        
 3rd Qu.:19.00   3rd Qu.:30.00                                        
 Max.   :23.00   Max.   :59.00

In [46]:
length(unique(data$TYPE))
unique(data$TYPE)

In [51]:
length(unique(data$HUNDRED_BLOCK))
# unique(data$HUNDRED_BLOCK)

In [49]:
length(unique(data$NEIGHBOURHOOD))
unique(data$NEIGHBOURHOOD)
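
The per-column checks above can also be gathered in one pass. This is only a sketch on a made-up data frame `toy`; swapping in `data` should give the unique-value counts and the share of missing values for every column at once.

```r
library(dplyr)

# Toy data (made up) with one missing neighbourhood
toy <- data.frame(
  TYPE = c("Mischief", "Mischief", "Other Theft", "Homicide"),
  NEIGHBOURHOOD = c("Fairview", NA, "West End", "Fairview")
)

# Number of distinct values per column (NA counts as its own value, as with unique())
distinct_counts <- toy %>%
  summarise(across(everything(), n_distinct))

# Share of missing values per column
na_share <- toy %>%
  summarise(across(everything(), ~ mean(is.na(.x))))

distinct_counts
na_share
```

Note that `n_distinct()` includes `NA` in its count by default, which matches how `length(unique(...))` behaves.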

---

## Question:

**Primary Research Question:**

> Based on the neighbourhood and month the crime is reported in, what type of crime is most likely to occur?

The dataset provides the neighbourhood of each crime for the years 2014 and 2019. Using the neighbourhood and month of each crime, I am interested in predicting the type of crime that is most likely to occur. This question is mainly focused on prediction of future crime types. I am interested in seeing both how the types of crime vary between areas and how they change across the seasons of the year. Is there a crime type that only happens during winter months? Is there variation in the predicted type of crime in each neighbourhood?

Response Variable: `TYPE`

Explanatory Variables: `MONTH`, `NEIGHBOURHOOD`
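
As a purely descriptive starting point (not a predictive model, and not necessarily the method used later in the project), the most frequent crime type for each neighbourhood and month can be tabulated. The sketch below uses a made-up data frame `toy`; the column name `n_crimes` is introduced here for illustration.

```r
library(dplyr)

# Toy data (made up): two neighbourhood-month combinations
toy <- data.frame(
  TYPE = c("Mischief", "Mischief", "Other Theft",
           "Other Theft", "Other Theft", "Mischief"),
  MONTH = c(1, 1, 1, 7, 7, 7),
  NEIGHBOURHOOD = c("Fairview", "Fairview", "Fairview",
                    "West End", "West End", "West End")
)

# Count crimes per neighbourhood, month, and type,
# then keep the most frequent type within each neighbourhood-month group
most_common <- toy %>%
  count(NEIGHBOURHOOD, MONTH, TYPE, name = "n_crimes") %>%
  group_by(NEIGHBOURHOOD, MONTH) %>%
  slice_max(n_crimes, n = 1, with_ties = FALSE) %>%
  ungroup()

most_common
```

Setting `with_ties = FALSE` returns a single type per group even when two types are equally frequent, which is an arbitrary choice that should be revisited for the real data.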

After completing a summary of the variables included in this dataset, I feel that `HUNDRED_BLOCK` should be removed because it is largely redundant with the explanatory variable `NEIGHBOURHOOD`. Every value within `HUNDRED_BLOCK` is going to appear in one of the neighbourhoods, so it carries the neighbourhood information at a finer level and would overstate what neighbourhood alone can predict. Similarly, the location coordinates (`X` and `Y`) are going to be neighbourhood-specific, so they should also be removed when answering this question. Choosing not to remove these variables could result in inflated confidence in the predictions of crime types.
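
The claim that each block sits inside a single neighbourhood can be checked before dropping the columns. The following sketch uses made-up rows in a data frame `toy`; the names `blocks_per_neighbourhood` and `model_data` are hypothetical, and the real data may contain blocks that span more than one neighbourhood.

```r
library(dplyr)

# Toy data (made up) with the same columns relevant to this question
toy <- data.frame(
  TYPE = c("Mischief", "Other Theft", "Mischief"),
  MONTH = c(3, 8, 8),
  HUNDRED_BLOCK = c("10XX SITKA SQ", "10XX ALBERNI ST", "10XX ALBERNI ST"),
  NEIGHBOURHOOD = c("Fairview", "West End", "West End"),
  X = c(490613.0, 491007.8, 491015.9),
  Y = c(5457110, 5459174, 5459166)
)

# For each block, count how many distinct neighbourhoods it appears in;
# a value of 1 everywhere means blocks are nested within neighbourhoods
blocks_per_neighbourhood <- toy %>%
  group_by(HUNDRED_BLOCK) %>%
  summarise(n_neighbourhoods = n_distinct(NEIGHBOURHOOD), .groups = "drop")

# Keep only the response and the two explanatory variables
model_data <- toy %>%
  select(TYPE, MONTH, NEIGHBOURHOOD)
```

If any block in the real data shows `n_neighbourhoods` greater than 1, the nesting assumption would need to be reconsidered for those blocks.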