### Introduction:

This project analyzes 911 emergency call data from the San Francisco metropolitan area and conducts a hotspot analysis to identify areas with a high volume of calls about violent crimes.

The project is designed with its potential users in mind: those who can act to reduce crime, such as the San Francisco Police Department or related authorities. It will help these authorities understand which crimes have occurred over time.

The dataset is published by the City and County of San Francisco and is available at https://data.sfgov.org/. It consists of calls for service regarding criminal activity (unverified) from the SFPD. The data covers the period from 03/31/2016 to the present.


#### 1. Data Wrangling:
In this notebook I will clean and unify a messy, complex dataset for easier access and analysis. This includes manually converting data from one format to another so that it is more convenient to consume and organize.

In [1]:
# import requisite module
import pandas as pd 
import json
import seaborn as sns
import numpy as np

In [2]:
# read data from json 
json_data = pd.read_json('https://data.sfgov.org/resource/fjjd-jecq.json?$limit=3000000')

Notably, this dataset has over two million rows covering more than two years of call data. I initially tried reading the dataset through the API but got only 1000 rows.

After some research, I learned that the SODA API currently returns at most 1000 rows per query by default. To retrieve more than 1000 rows, I added the '$limit=' parameter to the JSON URL, which sets the maximum number of rows a query returns.
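
As an alternative to one very large `$limit`, the same data could be requested in pages. The sketch below only builds the page URLs, using SODA's `$limit` and `$offset` parameters together with `$order=:id` for a stable row order. The helper name `soda_page_urls`, the page size, and the row total are assumptions chosen for illustration and are not part of the original notebook.

```python
def soda_page_urls(base_url, total_rows, page_size=50_000):
    """Build one request URL per page using SODA's $limit and $offset parameters."""
    return [
        f"{base_url}?$order=:id&$limit={page_size}&$offset={offset}"
        for offset in range(0, total_rows, page_size)
    ]


# total_rows and page_size are illustrative values, not taken from the notebook.
urls = soda_page_urls(
    "https://data.sfgov.org/resource/fjjd-jecq.json",
    total_rows=120_000,
)
for url in urls:
    print(url)

# Each page could then be read and combined, for example:
#   pd.concat((pd.read_json(u) for u in urls), ignore_index=True)
```

Paging keeps each response small, which may help if a single multi-million-row request times out.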

In [3]:
# save data in pandas dataframe
calls_for_service=pd.DataFrame(json_data)

In [4]:
calls_for_service.shape

(2277316, 14)

In [5]:
calls_for_service.head()

Unnamed: 0,address,address_type,agency_id,call_date,call_dttm,call_time,city,common_location,crime_id,disposition,offense_date,original_crimetype_name,report_date,state
0,1500 Block Of Pine St,Premise Address,1,2016-09-20T00:00:00.000,2016-09-20T11:50:00.000,2019-01-23 11:50:00,San Francisco,,162641608,REP,2016-09-20T00:00:00.000,Complaint Unkn,2016-09-20T00:00:00.000,CA
1,100 Block Of Erie St,Premise Address,1,2016-09-20T00:00:00.000,2016-09-20T12:36:00.000,2019-01-23 12:36:00,San Francisco,,162641785,UTL,2016-09-20T00:00:00.000,909,2016-09-20T00:00:00.000,CA
2,900 Block Of Market St,Premise Address,1,2016-09-20T00:00:00.000,2016-09-20T14:01:00.000,2019-01-23 14:01:00,San Francisco,,162642180,HAN,2016-09-20T00:00:00.000,Burglary,2016-09-20T00:00:00.000,CA
3,1900 Block Of Palou Av,Premise Address,1,2016-09-20T00:00:00.000,2016-09-20T14:30:00.000,2019-01-23 14:30:00,San Francisco,,162642293,REP,2016-09-20T00:00:00.000,Burglary,2016-09-20T00:00:00.000,CA
4,Florida St/division St,Intersection,1,2016-09-20T00:00:00.000,2016-09-20T14:49:00.000,2019-01-23 14:49:00,San Francisco,,162642379,HAN,2016-09-20T00:00:00.000,Encampment,2016-09-20T00:00:00.000,CA


The dataset has about 2.28 million rows and 14 columns.

In [7]:
calls_for_service.info(null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2277316 entries, 0 to 2277315
Data columns (total 14 columns):
address                    2277316 non-null object
address_type               2277316 non-null object
agency_id                  2277316 non-null int64
call_date                  2277316 non-null object
call_dttm                  2277316 non-null object
call_time                  2277316 non-null datetime64[ns]
city                       2215867 non-null object
common_location            231429 non-null object
crime_id                   2277316 non-null int64
disposition                2277316 non-null object
offense_date               2277316 non-null object
original_crimetype_name    2277316 non-null object
report_date                2277316 non-null object
state                      2277316 non-null object
dtypes: datetime64[ns](1), int64(2), object(11)
memory usage: 243.2+ MB


In [8]:
#check for the null entries 
calls_for_service.isnull().sum()

address                          0
address_type                     0
agency_id                        0
call_date                        0
call_dttm                        0
call_time                        0
city                         61449
common_location            2045887
crime_id                         0
disposition                      0
offense_date                     0
original_crimetype_name          0
report_date                      0
state                            0
dtype: int64
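
The raw counts above are easier to judge as fractions: in this output common_location is missing in roughly 90% of rows, while city is missing in under 3%. A minimal sketch of that calculation on a toy frame follows; the toy values are made up for illustration.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "city": ["San Francisco", None, "Daly City", "San Francisco"],
    "common_location": [None, None, None, "Pier 39"],
})

# Fraction of missing values per column, highest first.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```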

In [9]:
# drop column common_location as it has many null entries
# drop unwanted columns call_date, call_time and agency_id
calls_for_service = calls_for_service.drop(['common_location','call_time','call_date','agency_id'], axis=1)

In [10]:
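# treat empty city strings as missing, then drop rows with null city values
# (assigning the result back avoids in-place edits on a column selection)
calls_for_service['city'] = calls_for_service['city'].replace('', np.nan)
calls_for_service = calls_for_service.dropna(subset=['city'])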
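
To see what this step does, here is a small sketch on a toy frame (the values are invented): empty strings are first converted to NaN so that `dropna` treats them the same way as genuinely missing entries.

```python
import numpy as np
import pandas as pd

# Toy frame with one empty string and one missing value in 'city'.
toy = pd.DataFrame({
    "city": ["San Francisco", "", None, "Daly City"],
    "crime_id": [1, 2, 3, 4],
})

# Treat empty strings as missing, then drop rows that have no city.
toy["city"] = toy["city"].replace("", np.nan)
cleaned = toy.dropna(subset=["city"])
print(cleaned)
```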

In [11]:
# check the null values again after deleting the rows from city column
calls_for_service.isnull().sum()

address                    0
address_type               0
call_dttm                  0
city                       0
crime_id                   0
disposition                0
offense_date               0
original_crimetype_name    0
report_date                0
state                      0
dtype: int64

Observe that there are no null values left in the city column.

Having taken care of the null values, I now want to clean the data by removing the white space and special characters that are largely found in the original_crimetype_name column.
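
One detail worth keeping in mind for this kind of cleanup: `str.lstrip` and `str.rstrip` take a set of characters and remove any run of them from the respective end, rather than removing a fixed prefix or suffix. A minimal sketch with invented sample values:

```python
import pandas as pd

# Invented examples of messy crime type names.
raw = pd.Series(["  **Burglary**", '"Traffic Stop.', "--Vandalism  "])

# Any leading run of space, *, " or - is removed; likewise any trailing run of space, *, . or ".
cleaned = raw.str.lstrip(' *"-').str.rstrip(' *."')
print(cleaned.tolist())
```

Because the argument is a character set, including a letter in it would also strip that letter from the start of otherwise valid names, so the set is best kept to punctuation and white space.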

The original_crimetype_name column in the dataframe above contains crime type names and some radio codes. Radio codes are brevity codes used in voice communication by law enforcement. Later in this notebook, I will merge a radio codes Excel file with the calls_for_service dataframe to replace the radio codes with their more understandable meanings.

In [12]:
# change the datatype of call_dttm column
calls_for_service['call_dttm'] = pd.to_datetime(calls_for_service['call_dttm'])
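
As a quick check of what the conversion yields, the sketch below parses timestamps in the same ISO format shown in the preview above. The `errors="coerce"` argument is an addition for illustration and is not used in the notebook: it turns unparseable strings into NaT instead of raising an error.

```python
import pandas as pd

# Two timestamps in the dataset's format plus one deliberately invalid entry.
stamps = pd.Series([
    "2016-09-20T11:50:00.000",
    "2016-09-20T12:36:00.000",
    "not a date",
])

parsed = pd.to_datetime(stamps, errors="coerce")
print(parsed)

# Once parsed, components such as the hour are available through the .dt accessor.
print(parsed.dt.hour)
```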


In [13]:
# Remove white space and special characters from the original crime type values
# (the result of str.strip() must be assigned back, otherwise it is discarded)
calls_for_service['original_crimetype_name'] = calls_for_service['original_crimetype_name'].str.strip()
calls_for_service['original_crimetype_name'] = calls_for_service['original_crimetype_name'].map(lambda x: x.lstrip('e`&[***~ +-".,//\\0').rstrip(' ***."'))

# Replace misspelled original crime types with correct values
calls_for_service['original_crimetype_name'] = calls_for_service['original_crimetype_name'].replace(to_replace=['Yelinng','Yelling Male','Yeller','Yelling Man','Yelling Off Balcony','Yelling/aggressive'], value='Yelling', regex=True)
calls_for_service['original_crimetype_name'] = calls_for_service['original_crimetype_name'].replace(to_replace=['Suspicious person'], value='Suspicious Person', regex=True)
calls_for_service['original_crimetype_name'] = calls_for_service['original_crimetype_name'].replace(to_replace=['Yel Zone','Yel Zn','Yel Grn Zn','Yz','Yello Zone','Yellow Zne','Yz,gz,rz','Yzone','Yellow Zone,gz,rz','Yellow Zoneone'], value='Yellow Zone', regex=True)
calls_for_service['original_crimetype_name'] = calls_for_service['original_crimetype_name'].replace(to_replace=['211cj Veh','211 Poss','Poss 211','211 Cold','211 Susp','211 Attempt','211veh Poss'], value='Robbery', regex=True)

calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('152'), 'original_crimetype_name'] = '152'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('1030'), 'original_crimetype_name'] = '10-30'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('207'), 'original_crimetype_name'] = '207'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('212'), 'original_crimetype_name'] = '212'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('7,2,46'), 'original_crimetype_name'] = '7.2.46'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('Wz'), 'original_crimetype_name'] = '7.2.27'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('22500e'), 'original_crimetype_name'] = '22500e'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('459'), 'original_crimetype_name'] = '459'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('240'), 'original_crimetype_name'] = '240'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('221'), 'original_crimetype_name'] = '221'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('261'), 'original_crimetype_name'] = '261'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('245'), 'original_crimetype_name'] = '245'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('Arson'), 'original_crimetype_name'] = 'Arson'
calls_for_service.loc[calls_for_service['original_crimetype_name'].str.contains('Vandalism'),'original_crimetype_name'] = 'Vandalism'
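
The repeated `str.contains` assignments above all follow one pattern, so they could also be driven from a dictionary. This is only a sketch on toy data: the `canonical` mapping is a small illustrative subset of the rules, and the sample names are invented.

```python
import pandas as pd

# Invented sample values standing in for original_crimetype_name.
names = pd.Series(["459 Alarm", "Poss 245", "Arson Fire", "Traffic Stop"])

# Substring -> canonical label; an illustrative subset of the rules above.
canonical = {"459": "459", "245": "245", "Arson": "Arson"}

for pattern, label in canonical.items():
    # regex=False treats the pattern as a literal substring, not a regular expression.
    mask = names.str.contains(pattern, regex=False)
    names.loc[mask] = label

print(names.tolist())
```

Note that substring matching is broad: any value containing the digits of a code is relabeled, so the order of the rules and the choice of patterns both matter.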

In [14]:
# Count the occurrence of each crime type
data_by_city_crime=calls_for_service.groupby(['original_crimetype_name']).size().reset_index(name='count')
#data_by_city_crime

In [15]:
# Replace empty original crime type strings with NaN, then drop the rows that contain NaN
calls_for_service['original_crimetype_name'] = calls_for_service['original_crimetype_name'].replace('', np.nan)
calls_for_service = calls_for_service.dropna(subset=['original_crimetype_name'])
calls_for_service.shape

(2215819, 10)

In [16]:
# read a second file, an Excel workbook that holds the radio codes
radio_code_xl = pd.read_excel("Radio_Codes_2016.xlsx")
radio_code = pd.DataFrame(radio_code_xl)
radio_code.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 2 columns):
original_crimetype_name    193 non-null object
Meaning                    193 non-null object
dtypes: object(2)
memory usage: 3.1+ KB


In [17]:
# replace radio codes mentioned in original_crimetype_name column with radio codes meaning from radio code data frame
calls_for_service['original_crimetype_name'] = calls_for_service['original_crimetype_name'].astype(str)
radio_code['original_crimetype_name']= radio_code['original_crimetype_name'].astype(str)
radio_code['Meaning']= radio_code['Meaning'].astype(str)
calls_for_service['original_crimetype_name'] = calls_for_service['original_crimetype_name'].map(radio_code.set_index('original_crimetype_name')['Meaning']).fillna(calls_for_service['original_crimetype_name'])
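
The `map(...).fillna(...)` idiom used here can be checked on a toy example: values found in the lookup are replaced by their meaning, and values that are not radio codes fall back to the original text. The 909 entry matches the replacement seen in this notebook; the 459 entry is an assumed example and may not match the actual spreadsheet.

```python
import pandas as pd

calls = pd.DataFrame({"original_crimetype_name": ["909", "Burglary", "459"]})

# Toy lookup table; the 459 row is an assumption for illustration.
codes = pd.DataFrame({
    "original_crimetype_name": ["909", "459"],
    "Meaning": ["Interview a citizen", "Burglary"],
})

# Series indexed by code, used as a lookup table.
lookup = codes.set_index("original_crimetype_name")["Meaning"]

col = calls["original_crimetype_name"]
# map() yields NaN for names missing from the lookup; fillna() restores the original name.
calls["original_crimetype_name"] = col.map(lookup).fillna(col)
print(calls)
```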

In [18]:
calls_for_service.head()
# store cleaned dataframe to use in next step of the analysis
%store calls_for_service

Stored 'calls_for_service' (DataFrame)


Observe that original_crimetype_name in the second row was 909 and has now been replaced by 'Interview a citizen' from the radio code dataframe. This will help to segregate the crimes by city.

In [19]:
# count the crimes per city
data_by_city=calls_for_service.groupby(['city']).size().reset_index(name='count')
data_by_city
# store dataframe to use in next step of the analysis
%store data_by_city

Stored 'data_by_city' (DataFrame)


Out of personal interest, I want to know the top 25 crimes for which service was called.

In [20]:
# display the top 25 crimes 
data_by_city_crime.sort_values('count',ascending=False).head(25)

Unnamed: 0,original_crimetype_name,count
11611,Passing Call,266356
15791,Traffic Stop,228445
15394,Suspicious Person,118402
9129,Homeless Complaint,110137
385,22500e,89489
10799,Muni Inspection,71511
5538,Audible Alarm,65857
17164,Well Being Check,55504
15396,Suspicious Vehicle,54078
15891,Trespasser,48328
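
A similar ranking can be produced with `value_counts`, which counts and sorts in one step. The sketch below uses invented values and is only an alternative to the `groupby(...).size()` approach used in this notebook.

```python
import pandas as pd

# Invented call types for illustration.
names = pd.Series([
    "Traffic Stop", "Passing Call", "Traffic Stop",
    "Trespasser", "Passing Call", "Traffic Stop",
])

# value_counts() sorts by frequency in descending order; head(2) keeps the top two.
top = (
    names.value_counts()
    .head(2)
    .rename_axis("original_crimetype_name")
    .reset_index(name="count")
)
print(top)
```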


Although I am interested only in violent and property crime data, it is good to know the top 25 crimes for which the SFPD receives calls.
It can be observed that no violent crimes appear in the top 25 list (which is a good sign).

In [22]:
calls_for_violent_crimes=calls_for_service.loc[calls_for_service['original_crimetype_name']\
                                                    .isin(['Homicide','Robbery','Strongarm Robbery','Aggravated Assault/ADW',\
                                                          'Rape/Sexual Assault','Sexual Assault Adult','Sexual Assault Juve',\
                                                          'Kidnapping','Stabbing'])]
%store calls_for_violent_crimes

Stored 'calls_for_violent_crimes' (DataFrame)


In [23]:
calls_for_violent_crimes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10750 entries, 10 to 2277287
Data columns (total 10 columns):
address                    10750 non-null object
address_type               10750 non-null object
call_dttm                  10750 non-null datetime64[ns]
city                       10750 non-null object
crime_id                   10750 non-null int64
disposition                10750 non-null object
offense_date               10750 non-null object
original_crimetype_name    10750 non-null object
report_date                10750 non-null object
state                      10750 non-null object
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 923.8+ KB


In [25]:
data_by_violent_crime =calls_for_violent_crimes.groupby(['original_crimetype_name']).size().reset_index(name='count')
data_by_violent_crime.sort_values('count',ascending=False).head(50)

Unnamed: 0,original_crimetype_name,count
8,Strongarm Robbery,4737
4,Robbery,2011
5,Sexual Assault Adult,1764
7,Stabbing,920
6,Sexual Assault Juve,696
2,Kidnapping,366
0,Aggravated Assault/ADW,148
3,Rape/Sexual Assault,101
1,Homicide,7


In [26]:
# Extract the calls made to report violent crime
# Note: unlike the selection in In [22], this list omits 'Strongarm Robbery' and 'Stabbing'
calls_for_violent_crimes = calls_for_service.loc[calls_for_service['original_crimetype_name']\
                                                    .isin(['Homicide','Robbery','Aggravated Assault/ADW',\
                                                          'Rape/Sexual Assault','Sexual Assault Adult','Sexual Assault Juve',\
                                                          'Kidnapping'])]


In [27]:
calls_for_property_crimes= calls_for_service.loc[calls_for_service['original_crimetype_name']\
                                                    .isin(['Petty Theft','Grand Theft','Burglary','Embezzlement',\
                                                          'Strongarm Robbery','Person breaking in',\
                                                           'Stolen vehicle','Broken Window','Vandalism','Extortion',\
                                                           'Auto Boost / Strip','Shoplifting'])]
%store calls_for_property_crimes

Stored 'calls_for_property_crimes' (DataFrame)


In [28]:
#calls_for_property_crimes.info()
data_by_prop_crime =calls_for_property_crimes.groupby(['original_crimetype_name']).size().reset_index(name='count')
data_by_prop_crime.sort_values('count',ascending=False).head(50)

Unnamed: 0,original_crimetype_name,count
0,Auto Boost / Strip,37807
7,Petty Theft,30947
2,Burglary,17054
11,Vandalism,16874
10,Strongarm Robbery,4737
5,Grand Theft,3539
9,Stolen vehicle,176
1,Broken Window,160
6,Person breaking in,35
3,Embezzlement,6


After cleaning the dataset, the next step is to map and plot the crime data to gain better insight.