# Ken Woon's Notebook

## Dataset Overview

In [24]:
#Import required library
import pandas as pd

#Read CSV file into the variable
df = pd.read_csv('../data/raw/crimedata_csv_all_years.csv')

#Display the dataset
display(df)

Unnamed: 0,TYPE,YEAR,MONTH,DAY,HOUR,MINUTE,HUNDRED_BLOCK,NEIGHBOURHOOD,X,Y
0,Theft from Vehicle,2006,3,4,20,30,DAVIE ST / HOWE ST,Central Business District,490748.5904,5.458346e+06
1,Theft from Vehicle,2006,3,5,11,30,DAVIE ST / HOWE ST,Central Business District,490748.5904,5.458346e+06
2,Theft from Vehicle,2006,4,16,0,1,DAVIE ST / HOWE ST,Central Business District,490748.5904,5.458346e+06
3,Theft from Vehicle,2006,6,11,17,45,DAVIE ST / HOWE ST,Central Business District,490748.5904,5.458346e+06
4,Theft from Vehicle,2006,8,5,20,0,DAVIE ST / HOWE ST,Central Business District,490748.5904,5.458346e+06
...,...,...,...,...,...,...,...,...,...,...
793911,Theft from Vehicle,2005,3,9,21,30,DAVIE ST / HOWE ST,Central Business District,490748.5904,5.458346e+06
793912,Theft from Vehicle,2005,6,5,23,0,DAVIE ST / HOWE ST,Central Business District,490748.5904,5.458346e+06
793913,Theft from Vehicle,2005,8,1,22,0,DAVIE ST / HOWE ST,Central Business District,490748.5904,5.458346e+06
793914,Theft from Vehicle,2005,12,14,0,0,DAVIE ST / HOWE ST,Central Business District,490748.5904,5.458346e+06


## Exploratory Data Analysis

### 1. Understanding the Variables

The dataset for this project is historical crime data for Vancouver. Before the file can be opened and analysed, the required libraries must be imported, which is done below. The CSV file is then read with `pandas` into a variable called `df`.

In [1]:
#Import all of the libraries needed for the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Read CSV file into the variable
df = pd.read_csv('../data/raw/crimedata_csv_all_years.csv')

In [10]:
#Display first 5 rows of the dataset
display(df.head())

Unnamed: 0,TYPE,YEAR,MONTH,DAY,HOUR,MINUTE,HUNDRED_BLOCK,NEIGHBOURHOOD,X,Y
0,Theft from Vehicle,2006,3,4,20,30,DAVIE ST / HOWE ST,Central Business District,490748.5904,5458346.0
1,Theft from Vehicle,2006,3,5,11,30,DAVIE ST / HOWE ST,Central Business District,490748.5904,5458346.0
2,Theft from Vehicle,2006,4,16,0,1,DAVIE ST / HOWE ST,Central Business District,490748.5904,5458346.0
3,Theft from Vehicle,2006,6,11,17,45,DAVIE ST / HOWE ST,Central Business District,490748.5904,5458346.0
4,Theft from Vehicle,2006,8,5,20,0,DAVIE ST / HOWE ST,Central Business District,490748.5904,5458346.0


`.head()` returns the first 5 rows of the dataset. The table shows 10 columns: the type of crime, year, month, day, hour, minute, hundred block, neighbourhood, and the x and y coordinates. Whether these are all of the existing columns is not yet certain, so this is confirmed right after. The rows shown all record `Theft from Vehicle` in the same hundred block, neighbourhood, and x and y coordinates, but at different times. It is unlikely that the entire dataset contains only this type of crime at this location, so this will be checked in the analysis section.
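
One quick way to check whether the first rows are representative is to count the crime types and to look at a random sample instead of the head. The snippet below is a minimal sketch on a small hypothetical frame named `toy`, whose values are illustrative only and are not taken from the real file; the same calls should apply to `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for `df`; the values are illustrative only.
toy = pd.DataFrame({
    "TYPE": ["Theft from Vehicle", "Theft from Vehicle", "Mischief", "Theft of Bicycle"],
    "NEIGHBOURHOOD": ["Central Business District", "Central Business District",
                      "West End", "Kitsilano"],
})

# Count how often each crime type appears. More than one entry means
# the first rows of the table were not representative of the whole.
type_counts = toy["TYPE"].value_counts()
print(type_counts)

# Draw a reproducible random sample rather than the first rows.
sample_rows = toy.sample(n=2, random_state=0)
print(sample_rows)
```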

In [9]:
#Display number of rows and columns
print(df.shape)
print(f"The number of rows is {df.shape[0]} and the number of columns is {df.shape[1]}")

(793916, 10)
The number of rows is 793916 and the number of columns is 10


`.shape` returns a tuple with the number of rows followed by the number of columns of the dataset. The output was `(793916, 10)`, meaning that there are 793,916 rows and 10 columns in total.

In [11]:
#Display the name of the columns
print(df.columns)

Index(['TYPE', 'YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'HUNDRED_BLOCK',
       'NEIGHBOURHOOD', 'X', 'Y'],
      dtype='object')


`.columns` returns the names of all the columns in the dataset. This confirms that the table displayed earlier shows all 10 existing columns, with the same names as the ones in this list.

The next step is to better understand the values of each variable. `.nunique()` will be used to return the number of distinct values. With `axis = 0`, as in this case, the method works down each column and returns the number of unique values per column (with `axis = 1` it would instead count the unique values within each row).

In [15]:
#Display the number of unique values in each column
print(df.nunique(axis = 0))

TYPE                 11
YEAR                 19
MONTH                12
DAY                  31
HOUR                 24
MINUTE               60
HUNDRED_BLOCK     22939
NEIGHBOURHOOD        24
X                139461
Y                139296
dtype: int64


The dataset does contain crimes other than `Theft from Vehicle`, as there are 11 types in total. The data covers 19 distinct years, although the starting and ending years are not yet known. A wide range of areas also appears to be covered, judging by the number of hundred blocks and neighbourhoods present in the data.
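
To make the effect of the `axis` argument concrete, the sketch below runs `.nunique()` both ways on a small hypothetical frame named `toy` (illustrative values, not the real data). It also shows how the first and last year could be read off with `.min()` and `.max()`.

```python
import pandas as pd

# Hypothetical toy data standing in for `df`; the values are illustrative only.
toy = pd.DataFrame({
    "YEAR": [2006, 2006, 2005],
    "MONTH": [3, 3, 4],
    "DAY": [3, 5, 16],
})

# axis=0 (the default): one count per column.
per_column = toy.nunique(axis=0)
print(per_column)

# axis=1: one count per row, comparing the values across that row's columns.
per_row = toy.nunique(axis=1)
print(per_row)

# The first and last year covered can be read off directly.
first_year, last_year = toy["YEAR"].min(), toy["YEAR"].max()
print(first_year, last_year)
```

On the real data, only the column-wise count (`axis=0`) is meaningful, since the values in one row belong to different variables.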

In [23]:
#Display summary statistics, formatted as fixed-point numbers instead of scientific notation
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

Unnamed: 0,YEAR,MONTH,DAY,HOUR,MINUTE,X,Y
count,793916.0,793916.0,793916.0,793916.0,793916.0,793843.0,793843.0
mean,2011.1895,6.4941,15.394095,12.397745,15.711429,450191.835386,4991158.124679
std,5.528776,3.417018,8.757672,7.445387,18.302254,137534.630035,1524555.068773
min,2003.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,2006.0,4.0,8.0,7.0,0.0,490179.6107,5454241.9588
50%,2011.0,7.0,15.0,14.0,5.0,491556.8225,5457160.3973
75%,2016.0,9.0,23.0,19.0,30.0,493389.9946,5458717.6124
max,2021.0,12.0,31.0,23.0,59.0,511303.0,5512579.0


`.describe()` displays a summary of the count, mean, standard deviation, minimum, 25th, 50th, and 75th percentiles, and maximum of the values for each numeric variable. From this output, it can be deduced that the dataset spans the years 2003 to 2021, with a mean hour of about 12.4, that is, shortly after noon. The other values, despite giving a rough indication of what to expect for each variable, do not provide much useful information and will be explored further later. The table also covers numerical variables only, so no information about the string variables is provided.
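
Two details in the table above may be worth following up: the count for `X` and `Y` (793843) is lower than the number of rows (793916), which suggests missing coordinates, and their minimum of 0 could be a placeholder rather than a real location. The sketch below shows, on a small hypothetical frame named `toy` (illustrative values only), how the non-numeric columns could be summarised and how missing or zero coordinates could be counted. Treating 0 as a placeholder is an assumption, not something the source confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `df`; the values are illustrative only.
toy = pd.DataFrame({
    "TYPE": ["Theft from Vehicle", "Theft from Vehicle", "Mischief"],
    "NEIGHBOURHOOD": ["Central Business District", "West End", "West End"],
    "HOUR": [20, 11, 0],
    "X": [490748.5904, np.nan, 0.0],
})

# Summarise only the non-numeric columns: count, unique, top, and freq.
text_summary = toy.describe(exclude=[np.number])
print(text_summary)

# Count the missing values in each column.
missing = toy.isna().sum()
print(missing)

# Count coordinates equal to 0 (assumed here to be placeholders).
zero_coords = (toy["X"] == 0).sum()
print(zero_coords)
```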