# COGS 108 - Final Project 
## Surveillance Cameras: An Effective Deterrent or A Waste of Public Resources?

# Overview

Our project sought to discern any quantifiable effect of installing surveillance cameras in various neighborhoods of Chicago on the rate of change in violent crime. We did this by comparing violent crime rates in high-crime neighborhoods before and after the cameras were installed. Overall, we found no evidence to suggest that surveillance cameras are an effective way to curtail violent crime.

# Names

- Chaitanya Patel
- Kristine Marie Baluyot
- Linh Le
- Namit Mishra
- Tiffany Zhang
- Robert Eaton

# Group Members IDs

- A15346478 
- A13447798 
- A14201350
- A92112718
- A13161270
- A14190293

# Research Question

### Was there a significant change to the year-on-year crime rate for violent crimes in neighborhoods of Chicago in a three year period before and after activation of the surveillance cameras in 2003?

## Background and Prior Work

*Fill in your background and prior work here* 

References (include links):
- 1)
- 2)

# Hypothesis


*Fill in your hypotheses here*

# Dataset(s)

#### First dataset
- Dataset Name: Crimes - 2001 to present
- Link to the dataset: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
- Number of observations: The number of crimes in this dataset was around 100,000 even after cleaning to our standards.

This dataset from the City of Chicago contains all crime data from 2001 to the present. 
It contains a wealth of information, including many columns that our analysis does not need. 
We focused on the following columns:
- Date - used to sort each crime into the 2001-03 block or the 2004-06 block; crimes from other years were ignored.
- IUCR - crime code, keeping only violent crimes as defined in Illinois 
- Primary Type - to sort crimes more specifically
- Location description - stating what type of place the crime occurred in (such as inside or on the street)
- Latitude, Longitude, Location - pinpoints the location of the crime. The ‘Location’ column is simply the Latitude and Longitude pair
- Community Area - which district of Chicago


#### Second dataset
- Dataset Name: Blue Light Cameras in the city of Chicago
- Link to the dataset: https://redshiftzero.github.io/policesurveillance/
- Number of observations: 715 camera locations

The camera locations were indicated on a map. The data we needed, namely the latitude and longitude of each camera, was therefore embedded in the page's HTML and JavaScript. We used Beautiful Soup along with Python regular expressions to extract the data from the source code of the page linked above. The extraction script is shown below. The latitude and longitude values were saved in a CSV file named 'lat_long.csv'. 

In [None]:
import pandas as pd
import requests
import bs4
from bs4 import BeautifulSoup
import re
import csv
import matplotlib 

cameras = 'https://redshiftzero.github.io/policesurveillance/pod.html'
data = requests.get(cameras)
soup = BeautifulSoup(data.content, 'html.parser')

right_table = soup.find_all('script')
what_we_need = right_table[6]

# Regular expression to find patterns of data we need
pattern = re.compile(r".*var marker_.* = L.marker\(\[.*\n.*\n")
all_patterns = pattern.findall(what_we_need.string)

# Using string functions to convert data to needed form
with open('lat_long.csv', mode='w') as camera_file:
    # Keep the file handle and the CSV writer under separate names
    writer = csv.writer(camera_file, delimiter=',', lineterminator='\n', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["Latitude", "Longitude"])
    for each in all_patterns:
        # Collapse whitespace, then keep the text between the first '[' and ']'
        each = ' '.join(each.split())
        each = each[each.find('[') + 1:each.find(']')]
        lat, long = (part.strip() for part in each.split(','))
        writer.writerow([lat, long])
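
To illustrate the extraction step without fetching the page, the sketch below applies a similar regular expression to a small, made-up JavaScript snippet. The sample string, its coordinates, and the helper name `extract_coordinates` are assumptions for illustration; the real page source may be formatted differently.

```python
import re

# Hypothetical sample resembling Leaflet marker definitions; the coordinates are made up.
SAMPLE_JS = """
var marker_a1 = L.marker(
    [41.8781, -87.6298],
    {icon: new L.Icon.Default()}
);
var marker_b2 = L.marker(
    [41.9000, -87.6500],
    {icon: new L.Icon.Default()}
);
"""

# Capture the two numbers inside the square brackets that follow each marker call.
MARKER_RE = re.compile(
    r"L\.marker\(\s*\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]"
)


def extract_coordinates(js_source):
    """Return a list of (latitude, longitude) float pairs found in js_source."""
    return [(float(lat), float(lon)) for lat, lon in MARKER_RE.findall(js_source)]


coords = extract_coordinates(SAMPLE_JS)
```

Capturing the numbers directly in the pattern avoids the later string slicing, at the cost of a stricter assumption about how the coordinates are written.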

# Setup

In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pylab
import pandas as pd

from bs4 import BeautifulSoup
import requests
import gps_to_neighborhood

import statsmodels.api as sm
import patsy
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

#### Importing both the datasets into pandas dataframes. 

In [None]:
df = pd.read_csv('Crimes_-_2001_to_present.csv')
df_camera = pd.read_csv('lat_long.csv')

# Data Cleaning

Because we extracted the camera dataset ourselves, it was already cleaned to the required standards. 
#### Below are the steps to clean the Chicago crime dataset. 

First, let us keep only the necessary columns: ID, Date, IUCR, Primary Type, Description, Location Description, Arrest, Latitude, and Longitude. Let us also drop any data point with a NaN value for the location, as the analysis depends on knowing the neighborhood where the crime occurred.

In [4]:
df.drop(['Block','Case Number','Community Area','Domestic','Beat','District','Ward','FBI Code','X Coordinate','Y Coordinate',
         'Year','Updated On','Historical Wards 2003-2015','Zip Codes','Census Tracts','Boundaries - ZIP Codes', 'Location', 
         'Wards','Police Districts','Police Beats'], axis=1, inplace=True)
df.dropna(subset=['Latitude', 'Longitude'], axis='rows', inplace=True)

The following are all the crime types in our dataset.

In [None]:
df["Primary Type"].unique()

As evidenced, there are multiple types of crimes. However, only certain ones are deemed 'violent' under Illinois and federal laws. The next step is to remove rows with non-violent crimes. The source for the definition of violent crimes is http://gis.chicagopolice.org/CLEARMap_crime_sums/crime_types.html.

In [None]:
crimes = df[(df["Primary Type"] == 'BATTERY') & (df["IUCR"] != "0440") & (df["IUCR"] != "0486") & (df["IUCR"] != "0460") & 
            (df["IUCR"] != '0484') & (df['IUCR'] != '0454') & (df['IUCR'] != '0487') & (df['IUCR'] != '0475')|
            (df["Primary Type"] == 'ROBBERY') | 
            (df["Primary Type"] == 'ASSAULT') & (df["IUCR"] != '0560') & (df['IUCR'] != '0554') & (df['IUCR'] != '0545')|
            (df["Primary Type"] == 'CRIM SEXUAL ASSAULT') | (df["Primary Type"] == 'HOMICIDE') | 
            (df["IUCR"] =="1753") | (df["IUCR"] == "1754") | (df['IUCR']=="0510") | 
            (df["Primary Type"] == "RITUALISM") & (df["IUCR"] != "0494")]
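
The filter above relies on `&` binding more tightly than `|` in Python, so each crime type is combined with its own IUCR exclusions before the alternatives are joined. Below is a minimal sketch of the same pattern on toy data, written with `isin` for readability. The rows and the shortened exclusion lists are made up for illustration and do not reproduce the full definition used above.

```python
import pandas as pd

# Toy rows; IUCR codes are kept as strings so leading zeros survive.
toy = pd.DataFrame({
    "Primary Type": ["BATTERY", "BATTERY", "ROBBERY", "THEFT", "ASSAULT"],
    "IUCR": ["0440", "041A", "0320", "0820", "0560"],
})

# Shortened, illustrative exclusion lists (a subset of the codes used above).
excluded_battery = ["0440", "0486", "0460"]
excluded_assault = ["0560", "0554", "0545"]

# Build one named mask per crime type, then combine them with OR.
is_battery = (toy["Primary Type"] == "BATTERY") & ~toy["IUCR"].isin(excluded_battery)
is_assault = (toy["Primary Type"] == "ASSAULT") & ~toy["IUCR"].isin(excluded_assault)
is_robbery = toy["Primary Type"] == "ROBBERY"

violent = toy[is_battery | is_assault | is_robbery]
```

Naming each mask makes the grouping explicit, so the result no longer depends on the reader remembering operator precedence.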

Next, since such cameras are only in public locations, it would make sense to remove crimes which occur inside. Obviously, there is some discretion with what place categories are selected. Since the CPD cameras in the dataset are only on streets (not inside housing, restaurants, or in transit stations), crimes which did not explicitly occur in a public location were removed. Looking at the different types of locations:

In [None]:
pd.options.mode.chained_assignment = None
crimes = crimes.dropna(subset=['Location Description'])
# Location descriptions treated as public, outdoor places
public_locations = [
    "PARKING LOT/GARAGE(NON.RESID.)", "STREET", "ALLEY", "SIDEWALK",
    "RESIDENCE PORCH/HALLWAY", "CHA PARKING LOT/GROUNDS", "GAS STATION",
    "POLICE FACILITY/VEH PARKING LOT", "VACANT LOT/LAND", "PARK PROPERTY",
    "CTA GARAGE / OTHER PROPERTY", "DRIVEWAY - RESIDENTIAL", "PARKING LOT",
    "PORCH", "YARD", "RESIDENTIAL YARD (FRONT/BACK)", "HIGHWAY/EXPRESSWAY",
    "CHA PARKING LOT", "BRIDGE", "LAKEFRONT/WATERFRONT/RIVERBANK", "DRIVEWAY",
]
crimes = crimes[crimes["Location Description"].isin(public_locations)]

Next, let us look at the Date column. Our analysis requires a comparison between 2001-2003 crime and 2004-2006 crime. Hence, we separate the crime data into the two time periods. 

In [None]:
crimesBefore = crimes[(crimes["Date"].str[6:10] == '2001') | (crimes["Date"].str[6:10] == '2002') | 
                (crimes["Date"].str[6:10] == '2003')]
crimesAfter = crimes[(crimes["Date"].str[6:10] == '2004') | (crimes["Date"].str[6:10] == '2005') | 
                (crimes["Date"].str[6:10] == '2006')]
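
The string slice above assumes every `Date` value has the form `MM/DD/YYYY ...`. Below is a sketch of an alternative that parses the year explicitly, using made-up rows. The format string is an assumption about how the export writes timestamps.

```python
import pandas as pd

# Made-up timestamps in the assumed export format.
toy = pd.DataFrame({
    "Date": [
        "03/15/2002 10:30:00 PM",
        "07/04/2005 01:00:00 AM",
        "01/01/2010 12:00:00 PM",
    ],
})

# Parse with an explicit format (assumed: month/day/year, 12-hour clock).
years = pd.to_datetime(toy["Date"], format="%m/%d/%Y %I:%M:%S %p").dt.year

# between() is inclusive on both ends by default.
before = toy[years.between(2001, 2003)]
after = toy[years.between(2004, 2006)]
```

Parsing fails loudly on a malformed date, whereas slicing would silently place such a row in neither period.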

Because the neighborhood boundaries defined by the City of Chicago change every few years, they were not consistent throughout the dataset. For consistency, we used a script available on GitHub to determine the neighborhood of each crime. The City of Chicago uses the term "community areas" for its neighborhoods, so from this point on "neighborhood" and "community area" are used synonymously in this report. The script is available at: https://github.com/jkgiesler/parse-chicago-neighborhoods 
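
Mapping a coordinate to a community area amounts to a point-in-polygon test against each area's boundary. The sketch below shows the standard ray-casting version of that test on a made-up square. It illustrates the idea only and is not the code of the script linked above; the function name and the polygon are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Consider only edges that straddle the point's latitude.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses the point's latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical square "neighborhood" given as (longitude, latitude) vertices.
square = [(-87.70, 41.80), (-87.60, 41.80), (-87.60, 41.90), (-87.70, 41.90)]

in_square = point_in_polygon(-87.65, 41.85, square)
out_square = point_in_polygon(-87.50, 41.85, square)
```

An odd number of crossings means the point lies inside the polygon. Running this test once per crime against every boundary is one plausible reason the lookup is slow on a large dataset.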

Now, let us update the dataframes to add the neighborhoods of each crime. Starting with the camera dataset, let's add another column indicating community area of each camera.

In [None]:
# all_neighborhoods = gps_to_neighborhood.get_all_neighborhoods()
# list_name = []
# list_area = []
# for index, row in df_camera.iterrows():
#     neighborhood = gps_to_neighborhood.find_neighborhood(row['Longitude'],row['Latitude'],all_neighborhoods)
#     if neighborhood is not None:
#         list_name.append(neighborhood[0])
#         list_area.append(neighborhood[1])
#     else:
#         list_name.append('None')
#         list_area.append('None')
# df_camera['Community Areas'] = list_name
# df_camera['Area(unit)'] = list_area

# df_camera = df_camera[df_camera['Community Areas'] != 'None']
# df_camera.to_csv('lat_long.csv', index=False, header=True)
 
df_camera = pd.read_csv('lat_long.csv')

Let us do the same for the crimes dataframe. Because this process takes a long time for large datasets (approximately 20 minutes), we saved the dataframes to CSV files so that we do not have to repeat it. 

In [None]:
# list_name = []
# list_area = []
# for index, row in crimesBefore.iterrows():
#     neighborhood = gps_to_neighborhood.find_neighborhood(row['Longitude'],row['Latitude'],all_neighborhoods)
#     if neighborhood is not None:
#         list_name.append(neighborhood[0])
#         list_area.append(neighborhood[1])
#     else:
#         list_name.append('None')
#         list_area.append('None')
# crimesBefore['Community Areas'] = list_name
# crimesBefore['Area(unit)'] = list_area
# crimesBefore = crimesBefore[crimesBefore['Community Areas'] != 'None']

# crimesBefore.to_csv('CrimesBefore.csv', index=False, header=True)
crimesBefore = pd.read_csv('CrimesBefore.csv')

Doing the same for crimes that occurred in the years 2004-06. 

In [None]:
# list_name = []
# list_area = []
# for index, row in crimesAfter.iterrows():
#     neighborhood = gps_to_neighborhood.find_neighborhood(row['Longitude'],row['Latitude'],all_neighborhoods)
#     if neighborhood is not None:
#         list_name.append(neighborhood[0])
#         list_area.append(neighborhood[1])
#     else:
#         list_name.append('None')
#         list_area.append('None')
# crimesAfter['Community Areas'] = list_name
# crimesAfter['Area(unit)'] = list_area
# crimesAfter = crimesAfter[crimesAfter['Community Areas'] != 'None']

# crimesAfter.to_csv('CrimesAfter.csv', index=False, header=True)
crimesAfter = pd.read_csv('CrimesAfter.csv')

Now, let us group the crime data by neighborhoods. This way we will be able to analyse the effectiveness of cameras by neighborhoods. 

In [None]:
# Grouping by community areas before and after the cameras were installed.
sumCrimesBefore = crimesBefore['Community Areas'].value_counts().rename_axis('Community Areas').reset_index(name='2001-03')
sumCrimesAfter = crimesAfter['Community Areas'].value_counts().rename_axis('Community Areas').reset_index(name='2004-06')

# We will use left merge to create a singular dataframe with columns 
# community areas, crimes in 2001-03, crimes in 2004-06 and the difference between the two. 
totalCrimes = sumCrimesBefore.merge(sumCrimesAfter, on='Community Areas', how='left')
totalCrimes['Difference'] = totalCrimes['2004-06'] - totalCrimes['2001-03']
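
One detail of the left merge is worth keeping in mind: a community area that appears only in the 2001-03 counts gets `NaN` for 2004-06, so its difference is also `NaN`. Below is a small sketch with made-up counts. Filling missing counts with 0 is an assumption about the desired behavior, not something the cell above does.

```python
import pandas as pd

# Made-up counts; area "C" has no crimes recorded in the second period.
before = pd.DataFrame({"Community Areas": ["A", "B", "C"], "2001-03": [120, 80, 5]})
after = pd.DataFrame({"Community Areas": ["A", "B"], "2004-06": [100, 95]})

total = before.merge(after, on="Community Areas", how="left")

# Assumption: treat a missing area as zero recorded crimes in that period.
total["2004-06"] = total["2004-06"].fillna(0).astype(int)
total["Difference"] = total["2004-06"] - total["2001-03"]
```

Whether zero is the right fill value depends on why the area is missing; dropping such rows is an equally defensible choice.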

Lastly, let's do the same wrangling for the camera dataset. We will not analyse neighborhoods with fewer than 10 cameras, because fewer than 10 data points will not give us a clear picture. The threshold of 10 was chosen by manually trying different values and observing the bias.

In [None]:
above_10 = df_camera.groupby('Community Areas').filter(lambda x : len(x)>9)
df_camera_above_10 = above_10['Community Areas'].value_counts().rename_axis('Community Areas').reset_index(name='Counts')
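
For reference, the same threshold can be expressed by counting first and filtering afterwards. Below is a sketch with made-up camera rows and a lowered threshold of 2 so the toy data stays small; `min_cameras` is a hypothetical name.

```python
import pandas as pd

# Made-up camera rows: three in area "A", one in "B", two in "C".
toy_cameras = pd.DataFrame({"Community Areas": ["A", "A", "A", "B", "C", "C"]})

min_cameras = 2  # the report uses 10; lowered here for the toy data

# Count cameras per area, then keep only areas at or above the threshold.
counts = toy_cameras["Community Areas"].value_counts()
kept = counts[counts >= min_cameras]

summary = kept.rename_axis("Community Areas").reset_index(name="Counts")
```

Keeping the threshold in a named variable makes it easier to rerun the analysis with different cut-offs.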

# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [5]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*