In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [15]:
data = pd.read_csv('crime.csv')

We drop these columns because we won't use them in the analysis, and we fill missing NEIGHBOURHOOD values with 'N/A'.

In [16]:
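# Drop the columns that are not used in the analysis
data = data.drop(columns=['MINUTE', 'HUNDRED_BLOCK', 'X', 'Y'])
# Assign back instead of using inplace=True on a selected column, which newer pandas versions warn about
data['NEIGHBOURHOOD'] = data['NEIGHBOURHOOD'].fillna('N/A')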

Creating a DATE variable and using it as the index

In [17]:
# Build a datetime column from the YEAR, MONTH and DAY components
data['DATE'] = pd.to_datetime({'year': data['YEAR'], 'month': data['MONTH'], 'day': data['DAY']})
data = data.sort_values(['DATE'])
# dayofweek is an integer: Monday = 0, Sunday = 6
data['Day of Week'] = data['DATE'].dt.dayofweek
data.index = pd.DatetimeIndex(data['DATE'])
data.head(10)

DATE,TYPE,YEAR,MONTH,DAY,HOUR,NEIGHBOURHOOD,Latitude,Longitude,DATE,Day of Week
2003-01-01,Offence Against a Person,2003,1,1,,,0.0,0.0,2003-01-01,2
2003-01-01,Offence Against a Person,2003,1,1,,,0.0,0.0,2003-01-01,2
2003-01-01,Mischief,2003,1,1,18.0,Central Business District,49.283857,-123.106363,2003-01-01,2
2003-01-01,Other Theft,2003,1,1,18.0,Central Business District,49.281898,-123.120738,2003-01-01,2
2003-01-01,Theft of Vehicle,2003,1,1,23.0,Kensington-Cedar Cottage,49.237564,-123.071217,2003-01-01,2
2003-01-01,Theft from Vehicle,2003,1,1,19.0,Grandview-Woodland,49.263683,-123.069706,2003-01-01,2
2003-01-01,Theft from Vehicle,2003,1,1,21.0,Central Business District,49.279159,-123.1131,2003-01-01,2
2003-01-01,Theft of Vehicle,2003,1,1,5.0,Central Business District,49.276548,-123.119005,2003-01-01,2
2003-01-01,Break and Enter Commercial,2003,1,1,12.0,Renfrew-Collingwood,49.258843,-123.031937,2003-01-01,2
2003-01-01,Theft from Vehicle,2003,1,1,12.0,Central Business District,49.283198,-123.102055,2003-01-01,2
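
The date construction above can be tried on a small frame first. This is a minimal sketch with made-up rows (the `toy` frame and the `Day Name` column are illustrative and not part of the notebook); it shows how `pd.to_datetime` assembles dates from component columns and how the integer `dayofweek` codes map to weekday names.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    'YEAR': [2003, 2017, 2017],
    'MONTH': [1, 6, 7],
    'DAY': [1, 30, 2],
})

# to_datetime accepts a mapping with 'year', 'month' and 'day' keys.
toy['DATE'] = pd.to_datetime(
    {'year': toy['YEAR'], 'month': toy['MONTH'], 'day': toy['DAY']}
)

# dayofweek is an integer with Monday = 0 and Sunday = 6.
toy['Day of Week'] = toy['DATE'].dt.dayofweek

# day_name() gives a readable label for the same information.
toy['Day Name'] = toy['DATE'].dt.day_name()

print(toy[['DATE', 'Day of Week', 'Day Name']])
```

A value of 2 for 2003-01-01 therefore means Wednesday, which matches the output table above.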


The dataset does not cover all of July 2017, so we drop every observation from 2017-07-01 onward. We display the last 5 rows to confirm that no July 2017 dates remain. We can also drop the DATE column at this point because the index already holds the same information.

In [18]:
# Keep only observations before July 2017
data = data[data.DATE < '2017-07-01']
# The index already holds the date, so the DATE column is redundant
data = data.drop(columns=['DATE'])
data.iloc[-5:]

DATE,TYPE,YEAR,MONTH,DAY,HOUR,NEIGHBOURHOOD,Latitude,Longitude,Day of Week
2017-06-30,Theft from Vehicle,2017,6,30,21.0,South Cambie,49.2535,-123.123261,4
2017-06-30,Mischief,2017,6,30,19.0,West Point Grey,49.271403,-123.211865,4
2017-06-30,Break and Enter Commercial,2017,6,30,15.0,Fairview,49.26619,-123.141418,4
2017-06-30,Theft of Bicycle,2017,6,30,17.0,Kitsilano,49.266525,-123.157181,4
2017-06-30,Other Theft,2017,6,30,21.0,Renfrew-Collingwood,49.258164,-123.03695,4
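
Looking at the last rows works, but the same check can be made programmatically. The following is a sketch on hypothetical dates (the names `toy`, `cutoff` and `trimmed` are illustrative); it filters on the DatetimeIndex directly and then compares the latest remaining date against the cutoff.

```python
import pandas as pd

# Hypothetical dates standing in for the real index.
dates = pd.to_datetime(['2017-06-29', '2017-06-30', '2017-07-01', '2017-07-13'])
toy = pd.DataFrame(
    {'TYPE': ['Mischief', 'Other Theft', 'Mischief', 'Other Theft']},
    index=pd.DatetimeIndex(dates, name='DATE'),
)

cutoff = pd.Timestamp('2017-07-01')

# A boolean mask on the index keeps everything strictly before the cutoff.
trimmed = toy[toy.index < cutoff]

# The latest remaining date must fall before the cutoff.
latest = trimmed.index.max()
print(latest, latest < cutoff)
```

Because the filter here uses the index rather than a column, it keeps working after the DATE column has been dropped.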


Let's also add a "CATEGORY" column that groups the crime types into broader categories, so that plots of general trends are easier to read.

In [20]:
def category(crime_type):
    # Checks run in order, so any type containing 'Theft' is matched first
    if 'Theft' in crime_type:
        return 'Theft'
    elif 'Break' in crime_type:
        return 'Break and Enter'
    elif 'Collision' in crime_type:
        return 'Vehicle Collision'
    else:
        return 'Others'

data['CATEGORY'] = data['TYPE'].apply(category)
data.head()


DATE,TYPE,YEAR,MONTH,DAY,HOUR,NEIGHBOURHOOD,Latitude,Longitude,Day of Week,CATEGORY
2003-01-01,Offence Against a Person,2003,1,1,,,0.0,0.0,2,Others
2003-01-01,Offence Against a Person,2003,1,1,,,0.0,0.0,2,Others
2003-01-01,Mischief,2003,1,1,18.0,Central Business District,49.283857,-123.106363,2,Others
2003-01-01,Other Theft,2003,1,1,18.0,Central Business District,49.281898,-123.120738,2,Theft
2003-01-01,Theft of Vehicle,2003,1,1,23.0,Kensington-Cedar Cottage,49.237564,-123.071217,2,Theft
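
To see how the substring rules group the types, it can help to run the function on a short list of labels and count the results. This is only a sketch: most labels are taken from the output shown in this notebook, while 'Example Collision Type' is a made-up label used to exercise the 'Collision' branch, and the `check` and `counts` names are illustrative.

```python
import pandas as pd

def category(crime_type):
    # Same rules as the notebook cell above, checked in order.
    if 'Theft' in crime_type:
        return 'Theft'
    elif 'Break' in crime_type:
        return 'Break and Enter'
    elif 'Collision' in crime_type:
        return 'Vehicle Collision'
    else:
        return 'Others'

types = pd.Series([
    'Other Theft', 'Theft of Vehicle', 'Theft from Vehicle',
    'Break and Enter Commercial', 'Mischief', 'Offence Against a Person',
    'Example Collision Type',  # hypothetical label, not from the dataset
])

# Pair each type with its category, then count rows per category.
check = pd.DataFrame({'TYPE': types, 'CATEGORY': types.apply(category)})
counts = check['CATEGORY'].value_counts()

print(check)
print(counts)
```

Note that 'Theft of Vehicle' lands in 'Theft' because that check comes first, and anything that matches none of the substrings falls into 'Others'.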


Our dataset is now prepared and ready for analysis. The missing values we see come from violent crimes (the Offence Against a Person rows above have no HOUR or NEIGHBOURHOOD). Because these records lack so much detail, we won't use them in the more in-depth analyses of neighbourhoods and time, but we keep them for analyses of total crime.
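
One way to keep the incomplete records for totals while leaving them out of the time and neighbourhood analyses is a boolean mask. The sketch below uses a small hypothetical frame; the names `toy`, `has_detail` and `detailed` are illustrative, and it assumes missing neighbourhoods were filled with 'N/A' as in the earlier cell.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the prepared dataset.
toy = pd.DataFrame({
    'TYPE': ['Offence Against a Person', 'Mischief', 'Other Theft'],
    'HOUR': [np.nan, 18.0, 18.0],
    'NEIGHBOURHOOD': ['N/A', 'Central Business District', 'Central Business District'],
})

# Total-crime counts use every row, including the incomplete ones.
total_crimes = len(toy)

# Time and neighbourhood analyses use only rows with a recorded hour
# and a real neighbourhood.
has_detail = toy['HOUR'].notna() & (toy['NEIGHBOURHOOD'] != 'N/A')
detailed = toy[has_detail]

print(total_crimes, len(detailed))
```

Keeping the full frame and deriving the filtered view when needed avoids maintaining two separately cleaned copies of the data.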