## STAT 201 GROUP PROJECT PROPOSAL: GROUP 12
**Group Members:** Rashi Selarka, Medha Singh, Alice Zhang
**A STATISTICAL INFERENCE ABOUT TIME OF OCCURRENCE OF COMPARABLE CRIMES**

**INTRODUCTION:**

Large-scale studies of crime data from the FBI demonstrate that violent crimes often happen at night (Bannister, 2019). This seems reasonable, since at night victims are easier targets and witnesses are less likely. We want to test this statement and take it a step further by examining how different types of crime relate to the time at which they take place, and whether we can predict when a crime will take place by looking at (seemingly) similar crimes and testing whether they occur at similar times.

We use the Vancouver crime data published on the Vancouver Police Department's website for our analysis. We plan to use this data to estimate the average time at which comparable crimes in the dataset occur, and to construct a hypothesis test of whether these crimes are actually comparable in their time of occurrence. Through this investigation, we hope to gain not just a better understanding of hypothesis testing, but also of the time of day at which we are more vulnerable to certain crimes (Brands, 2013).

The large dataset that we are using was obtained from the Vancouver Police Department: https://geodash.vpd.ca/opendata/ and consists of various attributes recorded when each crime was documented. It contains crime data on a year-by-year basis beginning in 2003, and is designed to provide individuals with a general overview of incidents falling into several crime categories. We will use these attributes to construct our model, and we also aim to find out which of them merely seem comparable and which are truly comparable.


**Preliminary Results**

* Demonstrate that the dataset can be read from the web into R.
Note: The data had to be locally stored and loaded into the notebook since the website doesn't have an option to scrape it. It must be downloaded onto the server home to be accessed.

In [None]:
library(tidyverse)
library(repr)
library(datateachr)
library(digest)
library(infer)
library(gridExtra)
library(cowplot)
library(caret)

In [None]:
crime_data <- read_csv("/home/jupyter/crimedata_csv_all_years.csv")

In [None]:
head(crime_data)

* **TYPE:** The type of crime activities
    * BNE Commercial: (Commercial Break and Enter) Breaking and entering into a commercial property with intent to commit an offence
    * BNE Residential/Other: (Residential Break and Enter) Breaking and entering into a dwelling/house/apartment/garage with intent to commit an offence
    * Vehicle Collision or Pedestrian Struck (with Fatality): Includes primarily pedestrian or cyclist struck and killed by a vehicle. It also includes vehicle to vehicle fatal accidents, however these incidents are fewer in number when compared to the overall data set. Note: There is no neighbourhood information.
    * Vehicle Collision or Pedestrian Struck (with Injury): Includes all categories of vehicle involved accidents with injuries. This includes pedestrian and cyclist involved incidents with injuries. Note: There is no neighbourhood information.
    * Homicide: A person, directly or indirectly, by any means, causes the death of another person.
    * Mischief: A person commits mischief that willfully causes malicious destruction, damage, or defacement of property. This also includes any public mischief towards another person.
    * Offence Against a Person: An attack on a person causing harm that may include usage of a weapon.
    * Other Theft: Theft of property that includes personal items (purse, wallet, cellphone, laptop, etc.), bicycle, etc.
    * Theft from Vehicle: Theft of property from a vehicle
    * Theft of Vehicle: Theft of a vehicle, motorcycle, or any motor vehicle
    * Theft of Bicycle: Theft of a bicycle. Note: There is no neighbourhood information.
* **YEAR:** A four-digit field that indicates the year when the reported crime activity occurred
* **MONTH:** A numeric field that indicates the month when the reported crime activity occurred
* **DAY:** A two-digit field that indicates the day of the month when the reported crime activity occurred
* **HOUR:** A two-digit field that indicates the hour time (in 24 hours format) when the reported crime activity occurred Note: This information is based on the findings of the police investigation. No time information will be provided for Offences Against a Person crime type
* **MINUTE:** A two-digit field that indicates the minute when the reported crime activity occurred Note: This information is based on the findings of the police investigation. No time information will be provided for Offences Against a Person crime type
* **HUNDRED_BLOCK:** Generalized location of the reported crime activity
* **NEIGHBOURHOOD:** The Vancouver Police Department uses the Statistics Canada definition of neighbourhoods within municipalities. Neighbourhoods within the City of Vancouver are based on the census tract (CT) concept within census metropolitan. 
* **X:** Coordinate values are projected in UTM Zone 10. All data must be considered offset and users should not interpret any locations as related to a specific person or specific property.
* **Y:** Coordinate values are projected in UTM Zone 10. All data must be considered offset and users should not interpret any locations as related to a specific person or specific property.


* **Here we clean and wrangle our data**

Since we are focusing on the time of occurrence of crimes, we select the TYPE and HOUR columns. For our analysis we have selected two pairs of similar, comparable crime types: "Vehicle Collision or Pedestrian Struck (with Injury)" and "Vehicle Collision or Pedestrian Struck (with Fatality)"; and "Theft of Vehicle" and "Theft from Vehicle".

We also look at the frequency of each crime type to see whether the classes are balanced; if they are not, we plan to address that later in the project.

In [None]:
# Keep only the crime type and the hour of occurrence
crime_data_grouped <- crime_data %>%
            select(TYPE, HOUR)

# Keep the four crime types of interest, sorted by hour
crime_data_sort <- crime_data_grouped %>%
                filter(TYPE %in% c("Vehicle Collision or Pedestrian Struck (with Injury)",
                                   "Vehicle Collision or Pedestrian Struck (with Fatality)",
                                   "Theft of Vehicle",
                                   "Theft from Vehicle")) %>%
                arrange(HOUR)
head(crime_data_sort)
tail(crime_data_sort)
cd_count <- crime_data_sort %>%
            group_by(TYPE) %>%
            summarize(count=n())
cd_count
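
The balance check above can be taken one step further by expressing each count as a share of all rows. The following is a minimal sketch on a small made-up tibble: `toy_crimes` and `toy_balance` are hypothetical names and the values are not real VPD records. With the real data one would presumably substitute `crime_data_sort`.

```r
library(dplyr)

# Hypothetical toy data standing in for crime_data_sort (not real VPD records)
toy_crimes <- tibble(
    TYPE = c(rep("Theft from Vehicle", 6),
             rep("Theft of Vehicle", 3),
             "Vehicle Collision or Pedestrian Struck (with Injury)"),
    HOUR = c(1, 3, 14, 18, 22, 23, 0, 2, 21, 17)
)

# Count each crime type and express it as a proportion of all rows
toy_balance <- toy_crimes %>%
    count(TYPE, name = "count") %>%
    mutate(proportion = count / sum(count))

toy_balance
```

Proportions that differ widely between types would suggest the classes are unbalanced.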


* **Here we visualize our data in a boxplot to get an idea about the center and the spread of the sample population**

In [None]:
#scale the plot 
options(repr.plot.width = 12, repr.plot.height = 12)
cd_plot <- crime_data_sort %>%  
    ggplot(aes(x = TYPE, y = HOUR)) + 
    geom_boxplot() + 
    ylab("hour of the day for the crime") +
    ggtitle("Boxplots of hours of day for different crimes") +
    theme(axis.text.x = element_text(size=15, angle = 45), 
          axis.text.y = element_text( size = 12, angle = 45), 
          text = element_text(face = "bold", size = 15)) 
cd_plot

* **Plot the relevant raw data, tailoring your plot in a way that addresses your question.**

Now we look at the bootstrapped sampling distribution for our hypothesis testing. 

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

theft_of_vehicle_bootstrap <- crime_data_sort %>%
                    filter(TYPE=="Theft of Vehicle" ) %>%
                    rep_sample_n(size = 30, reps=10000, replace = TRUE)

sampling_dist_theftof <- theft_of_vehicle_bootstrap %>%
                    group_by(replicate)%>%
                    summarize(mean = mean(HOUR))
sampling_dist_plot <- sampling_dist_theftof %>%
                ggplot() + 
                geom_histogram(aes(mean), bins = 15, color="white") +
                xlab("mean time of theft of vehicle") + 
                theme(text = element_text(size=15)) + 
                ggtitle("Bootstrapped sampling dist.", subtitle = "mean time of theft of vehicle") 


head(theft_of_vehicle_bootstrap)
head(sampling_dist_theftof)
sampling_dist_plot

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

vehical_col_inj_bootstrap <- crime_data_sort %>%
                    filter(TYPE=="Vehicle Collision or Pedestrian Struck (with Injury)" ) %>%
                    rep_sample_n(size = 30, reps=10000, replace = TRUE)

sampling_dist_vc_inj <- vehical_col_inj_bootstrap %>%
                    group_by(replicate)%>%
                    summarize(mean = mean(HOUR))
sampling_dist_plot_vc <- sampling_dist_vc_inj %>%
                ggplot() + 
                geom_histogram(aes(mean), bins = 15, color="white") +
                xlab("mean time at which vehicle collision occurred") + 
                theme(text = element_text(size=15)) + 
                ggtitle("Bootstrapped sampling dist.", subtitle = "mean time at which vehicle collision occurred") 


head(vehical_col_inj_bootstrap)
head(sampling_dist_vc_inj)
sampling_dist_plot_vc

These bootstrapped samples will help us estimate whether the average time of occurrence of Theft of Vehicle is the same as that of Theft from Vehicle, and whether the average time of occurrence of Vehicle Collision or Pedestrian Struck (with Fatality) is later than that of Vehicle Collision or Pedestrian Struck (with Injury).
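
As a sketch of how a bootstrap distribution of the mean hour could be turned into an interval estimate, the example below uses the `infer` workflow on simulated hours. The data are simulated and the names `toy_sample`, `toy_boot` and `toy_ci` are hypothetical. Note one assumption that differs from the cells above: here each resample has the same size as the original sample, which is the usual bootstrap convention, rather than a fixed size of 30.

```r
library(dplyr)
library(infer)

set.seed(201)

# Hypothetical sample of 200 crime hours (0-23); simulated, not VPD data
toy_sample <- tibble(HOUR = sample(0:23, size = 200, replace = TRUE))

# Resample the whole sample with replacement and record each resample's mean
toy_boot <- toy_sample %>%
    specify(response = HOUR) %>%
    generate(reps = 1000, type = "bootstrap") %>%
    calculate(stat = "mean")

# 95% percentile confidence interval for the mean hour
toy_ci <- toy_boot %>%
    get_confidence_interval(level = 0.95, type = "percentile")

toy_ci
```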

**METHODS:**  

The dataset was obtained from "https://geodash.vpd.ca" and first has to be tidied, sorted, and filtered, after which the attributes of interest are selected.

Afterwards, basic summary statistics about the dataset (frequency/relative abundance of predictor-type and result-type variables) are generated to make sure the data is balanced and complete. A boxplot of the sample is then used to visualize the data distribution; we call it a sample because in practice we do not have access to the entire population.

We then plan to collect bootstrap samples and visualize the sampling distributions of the means for our targeted attributes. We will conduct a one-tailed hypothesis test of whether vehicle collisions or pedestrian strikes with fatality occur later than those with injury, and a two-tailed test of whether theft of vehicle and theft from vehicle occur at similar times. We will use the mean and standard deviation as descriptive statistics, and confidence intervals to draw conclusions about whether differences exist and how large they are.

We expect our results to corroborate the alternative hypotheses: that car accidents involving death happen later than those involving only injury, since drivers tend to go at higher speeds at night, visibility is lower, and there are more drunk and reckless drivers on the road. We also expect that theft of vehicle and theft from vehicle do not happen at around the same time of day on average, since theft from vehicle can take place throughout the day depending on the neighbourhood, whereas theft of vehicle is likely to happen when the area is more isolated (perhaps later at night).

Most crucially, this analysis could provide insight into how law enforcement might go about predicting the time during which certain types of crimes are likely to occur. Many questions arise for the future. Other than time of day, what other predictors could we look at; perhaps certain months when crime spikes? Are the month and the time of day related? For example, crime might occur at earlier hours in the winter months because it gets dark sooner.
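
One possible way to carry out the two-tailed test described above is a permutation test on the difference in mean hour, sketched here with `infer`. This is an illustration under stated assumptions, not necessarily the procedure the final report will use: the hours are simulated, and `toy_thefts`, `obs_diff`, `null_dist` and `p_two_sided` are hypothetical names.

```r
library(dplyr)
library(infer)

set.seed(201)

# Hypothetical hours for the two theft types; simulated, not VPD data
toy_thefts <- tibble(
    TYPE = rep(c("Theft of Vehicle", "Theft from Vehicle"), each = 100),
    HOUR = c(sample(0:23, 100, replace = TRUE),
             sample(0:23, 100, replace = TRUE))
)

type_order <- c("Theft of Vehicle", "Theft from Vehicle")

# Observed difference in mean hour (Theft of Vehicle minus Theft from Vehicle)
obs_diff <- toy_thefts %>%
    specify(HOUR ~ TYPE) %>%
    calculate(stat = "diff in means", order = type_order)

# Null distribution: shuffle the TYPE labels so that type and hour are unrelated
null_dist <- toy_thefts %>%
    specify(HOUR ~ TYPE) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 1000, type = "permute") %>%
    calculate(stat = "diff in means", order = type_order)

# Two-sided p-value for the observed difference
p_two_sided <- null_dist %>%
    get_p_value(obs_stat = obs_diff, direction = "two-sided")

p_two_sided
```

For the one-tailed collision test, the same pipeline would apply with the two collision types in `order` and `direction = "greater"` in `get_p_value()`.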

**REFERENCES**

Crimes that Happen While You Sleep. (2020, November 04). Retrieved from https://www.thesleepjudge.com/crimes-that-happen-while-you-sleep/

Bannister, J., O’Sullivan, A., & Bates, E. (2019). Place and time in the Criminology of Place. Theoretical Criminology, 23(3), 315-332. doi:10.1177/1362480617733726

Brands, J., Schwanen, T., & Aalst, I. V. (2013). Fear of crime and affective ambiguities in the night-time economy. Urban Studies, 52(3), 439-455. doi:10.1177/0042098013505652

Vancouver Police Department: Crime Data. Retrieved from https://geodash.vpd.ca/opendata/

Violent Crimes Most Likely to Occur At Night. (2019, June 14). Retrieved from https://www.securitymagazine.com/articles/90384-murder-robbery-and-driving-while-impaired-happen-at-night

Violent Crime. (2018, September 10). Retrieved from https://ucr.fbi.gov/crime-in-the-u.s/2017/crime-in-the-u.s.-2017/topic-pages/violent-crime
