### Capstone Project: Predicting Crime in San Francisco

by Elton Yeo, DSI13

#### Problem Statement and Context

Police departments have limited frontline resources, and need to prioritise areas where their officers patrol. Police patrols are important because they project presence, which can deter criminals and increase the sense of safety for residents. 

This project aims to predict the number of crime incidents that will take place in a particular zip code, given a range of variables such as the day of the week and time of the day. The data will be run through 3 regression models: linear regression, random forest, and XGBoost. 
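As a rough illustration of how incident-level records could be turned into the count target described above, here is a minimal sketch on toy data. The column names (`zip_code`, `day_of_week`, `hour`, `n_incidents`) are hypothetical stand-ins and not the dataset's actual column names.

```python
import pandas as pd

# Toy incident-level records: one row per reported incident.
# Column names are hypothetical, chosen only for this illustration.
toy = pd.DataFrame({
    "zip_code": ["94103", "94103", "94103", "94110", "94110"],
    "day_of_week": ["Monday", "Monday", "Tuesday", "Monday", "Monday"],
    "hour": [1, 1, 16, 1, 2],
})

# Collapse to one row per (zip code, day of week, hour) combination,
# with the number of incidents in that combination as the regression target.
counts = (
    toy.groupby(["zip_code", "day_of_week", "hour"])
       .size()
       .reset_index(name="n_incidents")
)
print(counts)
```

Each row of `counts` would then be one training example, with `n_incidents` as the value to predict.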

The models will be trained on 2018 data, tested on 2019 data, and evaluated by their R-squared scores. Success is defined as an R-squared score of 0.8 or above, which means that the model explains at least 80% of the variance in the target variable.
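The evaluation scheme above can be sketched as follows, assuming a table of counts with a year column. The toy data, the column names, and the straight-line fit via `np.polyfit` are illustrative assumptions only; they stand in for the real features and models.

```python
import numpy as np
import pandas as pd

def r_squared(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy aggregated data; column names are hypothetical.
toy = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019, 2019],
    "hour": [1, 2, 3, 1, 2, 3],
    "n_incidents": [2.0, 4.0, 6.0, 2.0, 4.0, 7.0],
})

# Split by year rather than at random: train on 2018, test on 2019.
train = toy[toy["year"] == 2018]
test = toy[toy["year"] == 2019]

# Fit a straight line on the training year only (placeholder for a real model).
slope, intercept = np.polyfit(train["hour"], train["n_incidents"], deg=1)
pred = slope * test["hour"] + intercept

score = r_squared(test["n_incidents"], pred)
meets_target = score >= 0.8  # success threshold from the problem statement
print(round(score, 3), meets_target)
```

Splitting by year keeps all 2019 information out of training, which is closer to how the model would be used in practice than a random split over both years.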

(Question: should I do a train-test split on the entire dataset, or split by year as above?)

#### Risks and Assumptions of Data

The data may have been recorded in a manner suited to frontline officers or operators rather than to analysis, which could impede data cleaning and our understanding of the data.

We assume that the data was recorded/provided accurately by the frontline officers or the citizens who had reported the crimes. 

Data source: https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
reports=pd.read_csv('../data/reports.csv')

In [5]:
reports.head()

Unnamed: 0,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Row ID,Incident ID,Incident Number,CAD Number,...,SF Find Neighborhoods,Current Police Districts,Current Supervisor Districts,Analysis Neighborhoods,HSOC Zones as of 2018-06-05,OWED Public Spaces,Central Market/Tenderloin Boundary Polygon - Updated,Parks Alliance CPSI (27+TL sites),ESNCAG - Boundary File,"Areas of Vulnerability, 2016"
0,2019/05/01 01:00:00 AM,2019/05/01,01:00,2019,Wednesday,2019/06/12 08:27:00 PM,81097515200,810975,190424067,191634131.0,...,39.0,10.0,7.0,35.0,,,,,,1.0
1,2019/06/22 07:45:00 AM,2019/06/22,07:45,2019,Saturday,2019/06/22 08:05:00 AM,81465564020,814655,190450880,191730737.0,...,32.0,1.0,10.0,34.0,1.0,,1.0,,,2.0
2,2019/06/03 04:16:00 PM,2019/06/03,16:16,2019,Monday,2019/06/03 04:16:00 PM,80769875000,807698,190397016,191533509.0,...,88.0,2.0,9.0,1.0,,,,,,2.0
3,2018/11/16 04:34:00 PM,2018/11/16,16:34,2018,Friday,2018/11/16 04:34:00 PM,73857915041,738579,180870806,183202539.0,...,104.0,6.0,3.0,6.0,,18.0,,,,2.0
4,2019/05/27 02:25:00 AM,2019/05/27,02:25,2019,Monday,2019/05/27 02:55:00 AM,80509204134,805092,190378555,191470256.0,...,15.0,4.0,6.0,13.0,,,,,,1.0
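The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV that was saved together with its index is read back without declaring that index. A minimal sketch on toy data (not the actual `reports.csv`) shows the effect and one way to avoid it, assuming the first column is indeed a saved index:

```python
import io
import pandas as pd

# Toy frame standing in for the real data.
original = pd.read_csv(io.StringIO("a,b\n1,2\n3,4\n"))

# Saving with the default settings writes the index as an unnamed first column.
buf = io.StringIO()
original.to_csv(buf)

# Reading it back naively yields an extra 'Unnamed: 0' column.
buf.seek(0)
reloaded = pd.read_csv(buf)
print(list(reloaded.columns))

# Declaring the first column as the index avoids the extra column.
buf.seek(0)
clean = pd.read_csv(buf, index_col=0)
print(list(clean.columns))
```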


In [6]:
reports.isnull().sum()

Incident Datetime                                            0
Incident Date                                                0
Incident Time                                                0
Incident Year                                                0
Incident Day of Week                                         0
Report Datetime                                              0
Row ID                                                       0
Incident ID                                                  0
Incident Number                                              0
CAD Number                                               76640
Report Type Code                                             0
Report Type Description                                      0
Filed Online                                            258059
Incident Code                                                0
Incident Category                                           31
Incident Subcategory                                   

In [11]:
reports['HSOC Zones as of 2018-06-05'].value_counts()

1.0    32334
3.0    29266
5.0     8872
4.0     1574
2.0     1273
Name: HSOC Zones as of 2018-06-05, dtype: int64

HSOC refers to the "Healthy Streets Operation Center", a San Francisco inter-agency effort (SF Police Dept, Dept of Public Health, SF Public Works, etc.) to coordinate the City's response to homelessness.

Source: http://hsh.sfgov.org/wp-content/uploads/HSOC-Presentation-for-LHCB-FINAL.pdf

In [9]:
reports['OWED Public Spaces'].value_counts()

35.0    10133
50.0     1660
39.0      961
15.0      940
16.0      562
18.0      416
70.0      399
3.0       248
31.0      203
80.0      190
29.0      174
7.0       173
23.0       95
6.0        90
14.0       83
61.0       80
58.0       73
17.0       58
75.0       55
54.0       25
8.0        25
30.0       20
42.0        7
71.0        6
43.0        5
46.0        4
79.0        1
Name: OWED Public Spaces, dtype: int64

This column refers to a list of public spaces being considered for management and activation by the Office of Economic & Workforce Development (OEWD).

Source: https://data.sfgov.org/Geographic-Locations-and-Boundaries/OWED-Public-Spaces/gkqa-s74m

In [10]:
reports['Central Market/Tenderloin Boundary Polygon - Updated'].value_counts()

1.0    43701
Name: Central Market/Tenderloin Boundary Polygon - Updated, dtype: int64

This marks whether or not the crime took place within the Central Market/Tenderloin district.
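Because `value_counts()` drops missing values by default, a column like this shows only the value `1.0`. The sketch below, on toy data with a hypothetical column name, shows how the missing rows can be made visible and how the column could be turned into a 0/1 indicator. Treating a missing value as "outside the boundary" is an assumption that should be checked against the data dictionary.

```python
import numpy as np
import pandas as pd

# Toy column: 1.0 where the incident falls inside the boundary, NaN otherwise.
# 'in_boundary' is a hypothetical name used only for this illustration.
toy = pd.DataFrame({"in_boundary": [1.0, np.nan, 1.0, np.nan, np.nan]})

# Include NaN in the tally so the "not inside" rows are visible.
with_missing = toy["in_boundary"].value_counts(dropna=False)
print(with_missing)

# Assumption: missing means "not inside the boundary", so encode as 0/1.
toy["in_boundary_flag"] = toy["in_boundary"].notna().astype(int)
print(toy["in_boundary_flag"].tolist())
```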

Source: https://data.sfgov.org/Geographic-Locations-and-Boundaries/Central-Market-Tenderloin-Boundary-Polygon/ywcr-44b8

In [12]:
reports['Parks Alliance CPSI (27+TL sites)'].value_counts()

24.0    1660
23.0    1246
31.0     961
1.0      111
6.0       47
5.0       47
3.0       10
Name: Parks Alliance CPSI (27+TL sites), dtype: int64

This column refers to the boundaries of 27 open space sites that are part of the OEWD Citywide Public Space Initiative, plus UN Plaza, Boeddeker Park, TL Recreation Center, and Civic Center Plaza.

In this dataset, it marks which of those public spaces the crime had taken place in.

Source: https://data.sfgov.org/Geographic-Locations-and-Boundaries/Parks-Alliance-CPSI-27-TL-sites-/qjyb-yy3m

In [13]:
reports['ESNCAG - Boundary File'].value_counts()

1.0    3641
Name: ESNCAG - Boundary File, dtype: int64

This marks whether or not the crime took place within the ESNCAG boundary, which relates to the Embarcadero SAFE Navigation Center.

Source: https://data.sfgov.org/dataset/ESNCAG-Boundary-File/8cs3-kxq7

https://sfmayor.org/article/embarcadero-safe-navigation-center-slated-open-end-year

In [14]:
reports['Areas of Vulnerability, 2016'].value_counts()

2.0    172017
1.0    139658
Name: Areas of Vulnerability, 2016, dtype: int64

These geographic designations were created to define geographic areas within San Francisco that have a higher density of vulnerable populations. These geographic designations will be used for the Health Care Services Master Plan and DPH's Community Health Needs Assessment.

Source: https://data.sfgov.org/Geographic-Locations-and-Boundaries/Areas-of-Vulnerability-2016/kc4r-y88d