<a href="https://colab.research.google.com/github/tingyiwu714/san-diego-crime-analysis/blob/master/Spark_SD_Crime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# San Diego Crime Analysis

Analysis of crimes in San Diego County from 2007 to 2017 using Apache Spark.

## Contents

1. Data Exploration
2. Data Visualization
3. Conclusion



## 0: Setup and Load Data

### 0.1 Set up Google Drive environment

In [None]:
# Install Spark, Java and findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-bin-hadoop3.2.tgz
!pip install -q findspark

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

# Initilize pyspark
import findspark
findspark.init()

# Start spark session
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [None]:
# Install geocoding library 
!pip install geopy

# Install folium
!pip install folium

In [331]:
# Import packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pyspark.sql.types import *
from pyspark.sql.functions import *
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql import Row

from geopy.geocoders import Nominatim

import folium
from folium.plugins import HeatMap, MarkerCluster

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Connect Google Colab with Google Drive
from google.colab import drive
drive.mount('/drive')

### 0.2 Load datasets into Spark DataFrames

The data was adapted from the San Diego Regional Data Library. It includes all valid crimes reported to the San Diego County Police Departments from 2007 to 2017. 

The data is separated in years and in different format.
*   Datasets of 2007 to 2011 are in csv format ([link](https://data.sandiegodata.org/dataset/raw-san-diego-county-crime-incidents-2007-2013/))
*   Datasets of 2012 to 2017 are in xlsx format  ([link](https://data.sandiegodata.org/dataset/raw-san-diego-county-crime-incidents-2012-2017/))

The data will be loaded in Spark DataFrames from Google Drive.

In [332]:
# Load crime data between 2007 to 2011
# file_id = ['1GynC_phtJr_ycck_FwP6U_1G4-wGjjrb']
file_id = ['16GBta4t4zAWO0yr7w7dLDyo7Ji34zzWf']
# file_id = ['16GBta4t4zAWO0yr7w7dLDyo7Ji34zzWf',
#            '1EblPEoGj4x8hvxLzjsDZh-IKeYhELE18',
#            '1tbURDD5QDeAgaHgGvZxIBIrkmQIxFPeh',
#            '1WuFB52qSr-dnB5--u5jYwog5Uno12tEK',
#            '1QdadT4p1O-FJdjfFlSkSI2Y9dFN-cKdq']
mySchema = StructType([StructField("activityType", StringType(), True),
                       StructField("AGENCY", StringType(), True),
                       StructField("activityDate", StringType(), True),
                       StructField("LEGEND", StringType(), True),
                       StructField("Charge_Description", StringType(), True),
                       StructField("BLOCK_ADDRESS", StringType(), True),
                       StructField("City_Name", StringType(), True),
                       StructField("ZipCode", StringType(), True)])
sdfs = []
for id in file_id:
  link = 'https://drive.google.com/uc?export=download&id={FILE_ID}'
  url = link.format(FILE_ID=id)
  pdf = pd.read_csv(url, dtype=str)
  spd = spark.createDataFrame(pdf, schema=mySchema)
  sdfs.append(spd)
df_07to11 = reduce(DataFrame.unionAll, sdfs)

In [333]:
# Load crime data between 2012 to 2017
# file_id = ['12VYPF8HeJH1CpDfsM3fbn2S77KBhD6pv']
file_id = ['1TpeuhgB7IHa7IDupjkSoY_CdLHQCdIhd']
# file_id = ['1TpeuhgB7IHa7IDupjkSoY_CdLHQCdIhd',
#            '1WgMUozlgojzrc3RPREimLrcToY1yoRaA',
#            '1d05-6gtJYEanI6W-9NhteVEzL6mdpXe_',
#            '1VdAikAd1LRkh7Z81bVv61R_A4RUespTf',
#            '159AY4OxMvX-XF00FqFQW4KPy1aJOWjMJ',
#            '18A3yryRU2q873W149H653jl_UgxWsM_-']
mySchema = StructType([StructField("reportingYear", StringType(), True),
                       StructField("reportingMonth", StringType(), True),
                       StructField("agency", StringType(), True),
                       StructField("activityStatus", StringType(), True),
                       StructField("activitydate", StringType(), True),
                       StructField("numberActualReported", StringType(), True),
                       StructField("BLOCK_ADDRESS", StringType(), True),
                       StructField("city", StringType(), True),
                       StructField("zipCode", StringType(), True),
                       StructField("censusTract", StringType(), True),
                       StructField("censusBlock", StringType(), True),
                       StructField("CrimeCategory", StringType(), True),
                       StructField("CrimeDescription", StringType(), True)])
sdfs = []
for id in file_id:
  link = 'https://drive.google.com/uc?export=download&id={FILE_ID}'
  url = link.format(FILE_ID=id)
  pdf = pd.read_excel(url, dtype=str)
  spd = spark.createDataFrame(pdf, schema=mySchema)
  sdfs.append(spd)
df_12to17 = reduce(DataFrame.unionAll, sdfs)

## 1: Data Exploration

These crime incident records are not cleaned, processed or geocoded, and they are inconsistent in many ways. In this part, I will do data cleaning and processing including:


*   **Merge two datasets** (2007-2011 and 2012-2017): Two datasets have different column name, and the datasets of 2007-2011 have less columns then 2012-2017. 
*   **Handle missing values**: Delete rows contain null and NaN
*   **Parsing dates**: Convert data to timestamp. There are 3 different time formats in the data
*   **Handle inconsistent data**: Crime categories are organized differently between 2007-2011 and 2012-2017 data.
*   **Geocoding**: Fix typo of the address and convert to geographic coordinates



### 1.1 Understand Raw Dataset

Total of 2M rows of individual crime incidents. It includes the details of each incident.

In [334]:
print("Number of rows: ", df_07to11.count())
print("Number of cols: ", len(df_07to11.columns))
df_07to11.show(5)

Number of rows:  188669
Number of cols:  8
+------------+--------------------+--------------------+-------------+--------------------+--------------------+-----------+-------+
|activityType|              AGENCY|        activityDate|       LEGEND|  Charge_Description|       BLOCK_ADDRESS|  City_Name|ZipCode|
+------------+--------------------+--------------------+-------------+--------------------+--------------------+-----------+-------+
|  CRIME CASE| Carlsbad Police, CA|Jan 1, 2007 12:00...|THEFT/LARCENY|GRAND THEFT:MONEY...|7100  BLOCK AVIAR...|   CARLSBAD|  92009|
|  CRIME CASE|Chula Vista Polic...|Jan 1, 2007 12:00...|        FRAUD|               FRAUD|300  BLOCK SANDST...|CHULA VISTA|  91911|
|  CRIME CASE|Chula Vista Polic...|Jan 1, 2007 12:00...|        FRAUD|               FRAUD|900  BLOCK PAPPAS...|CHULA VISTA|  91911|
|  CRIME CASE|Chula Vista Polic...|Jan 1, 2007 12:00...|THEFT/LARCENY|GRAND THEFT:MONEY...|1300  BLOCK MESA ...|CHULA VISTA|  91910|
|  CRIME CASE|Escondido Po

In [335]:
print("Number of rows: ", df_12to17.count())
print("Number of cols: ", len(df_12to17.columns))
df_12to17.show(5)

Number of rows:  193552
Number of cols:  13
+-------------+--------------+--------+---------------+-------------------+--------------------+--------------------+----------+-------+-----------+-----------+---------------+--------------------+
|reportingYear|reportingMonth|  agency| activityStatus|       activitydate|numberActualReported|       BLOCK_ADDRESS|      city|zipCode|censusTract|censusBlock|  CrimeCategory|    CrimeDescription|
+-------------+--------------+--------+---------------+-------------------+--------------------+--------------------+----------+-------+-----------+-----------+---------------+--------------------+
|         2012|             1|CARLSBAD|OPEN - WORKABLE|Aug 26 2011 11:00AM|                   1|0  BLOCK UNKNOWN ...|  CARLSBAD|    NaN|          0|          0|  Part II Crime|               FRAUD|
|         2012|             1|CARLSBAD|OPEN - WORKABLE|Dec  1 2011  8:00AM|                   1|3100  BLOCK EL CA...|  CARLSBAD|  92010|      19803|       1024|Larc

### 1.2 Data Cleaning and Processing

#### 1.2.1 Merge two DataFrames

Here, I will only keep the columns contain useful information including occurred date, description and location.

In [336]:
# Rename columns
df_07to11 = df_07to11.withColumnRenamed('activityDate', 'date')\
                     .withColumnRenamed('LEGEND', 'category')\
                     .withColumnRenamed('Charge_Description', 'description')\
                     .withColumnRenamed('City_Name', 'city')
for col in df_07to11.columns:
    df_07to11 = df_07to11.withColumnRenamed(col, col.lower())

df_12to17 = df_12to17.withColumnRenamed('activityDate', 'date')\
                     .withColumnRenamed('CrimeCategory', 'category')\
                     .withColumnRenamed('CrimeDescription', 'description')
for col in df_12to17.columns:
    df_12to17 = df_12to17.withColumnRenamed(col, col.lower())

In [337]:
# Union two dataframes and remove duplicates
cols = ["date", "category", "description", "block_address", "city", "zipcode"]
df1 = df_07to11.select(cols)
df2 = df_12to17.select(cols)
df = df1.union(df2).distinct()

#### 1.2.2 Missing value

In [338]:
# Count missing values of each columns
missing = df.select([count(when(isnan(c), c)).alias(c) for c in df.columns])
print("Number of missing data per column:")
missing.show()

Number of missing data per column:
+----+--------+-----------+-------------+----+-------+
|date|category|description|block_address|city|zipcode|
+----+--------+-----------+-------------+----+-------+
|   0|       0|          0|         1638|3832|  22962|
+----+--------+-----------+-------------+----+-------+



In [339]:
# Drop the rows with all NaN values
# Drop the rows with all location columns are NaN
df = df.filter(df.date != 'NaN')\
       .filter((df.block_address != 'NaN') & (df.city != 'NaN') & (df.zipcode != 'NaN'))

#### 1.2.3 Parsing dates

In [340]:
# Remove extra space
df = df.withColumn('date', regexp_replace('date', '  ', ' '))

In [341]:
# Convert "date" column to datetime
from pyspark.sql.functions import coalesce, col, to_date, to_timestamp
def my_to_date(col, frmts=("MMM d, y H:m:s a", "M/d/y H:m", "MMM d y H:ma")):
  return coalesce(*[to_timestamp(col, i) for i in frmts])

df = df.withColumn("date", my_to_date(df.date))

In [342]:
# Extract year and month from date
df = df.withColumn('year', year(df.date))\
       .withColumn('month', month(df.date))

#### 1.2.4 Inconsistent data

Based on Uniform Crime Reporting (UCR), crimes are divided into two major groups: Part I crimes and Part II crimes. Part I crimes are broken into two categories: violent and property crimes. Part II crimes are all other crimes outside of Part I crimes.



*   Part I crime
    * Violent crime: homicide, rape, robbery, aggravated assault
    * Property crime: burglary, larceny-thef, motor vehicle theft, arson

*   Part I crime: simple assault, drug, fraud, sex offense, DUI, etc.

Here, I'll categorize the crime category follow the UCR guideline.

In [343]:
diff = df.select("category").distinct()
diff.show()

+--------------------+
|            category|
+--------------------+
|               FRAUD|
|       Vehicle Theft|
|             WEAPONS|
|      Simple Assault|
|DRUGS/ALCOHOL VIO...|
|     Larceny >= $400|
|       THEFT/LARCENY|
|               ARSON|
|                Rape|
|               Arson|
|          SEX CRIMES|
|             ASSAULT|
|                 DUI|
| MOTOR VEHICLE THEFT|
|    Non Res Burglary|
|VEHICLE BREAK-IN/...|
|       Part II Crime|
|        Res Burglary|
|             ROBBERY|
|            HOMICIDE|
+--------------------+
only showing top 20 rows



In [344]:
df = df.withColumn('category', regexp_replace('category', 'FRAUD', 'Part II Crime'))\
        .withColumn('category', regexp_replace('category', 'Vehicle Theft', 'Motor Vehicle Theft'))\
        .withColumn('category', regexp_replace('category', 'WEAPONS', 'Part II Crime'))\
        .withColumn('category', regexp_replace('category', 'Simple Assault', 'Part II Crime'))\
        .withColumn('category', regexp_replace('category', 'DRUGS/ALCOHOL VIOLATIONS', 'Part II Crime'))\
        .withColumn('category', regexp_replace('category', 'Larceny >= $400', 'Larceny-theft'))\
        .withColumn('category', regexp_replace('category', 'THEFT/LARCENY', 'Larceny-theft'))\
        .withColumn('category', regexp_replace('category', 'SEX CRIMES', 'Part II Crime'))\
        .withColumn('category', regexp_replace('category', 'ASSAULT', 'Aggravated Assault'))\
        .withColumn('category', regexp_replace('category', 'DUI', 'Part II Crime'))\
        .withColumn('category', regexp_replace('category', 'Non Res Burglary', 'Burglary'))\
        .withColumn('category', regexp_replace('category', 'VEHICLE BREAK-IN/THEFT', 'Part II Crime'))\
        .withColumn('category', regexp_replace('category', 'Res Burglary', 'Burglary'))\
        .withColumn('category', regexp_replace('category', 'HOMICIDE', 'Homicide'))\
        .withColumn('category', regexp_replace('category', 'Strong ArmRobbery', 'Robbery'))\
        .withColumn('category', regexp_replace('category', 'Murder', 'Homicide'))\
        .withColumn('category', regexp_replace('category', 'Armed Robbery', 'Robbery'))\
        .withColumn('category', regexp_replace('category', 'VANDALISM', 'Part II Crime'))\
        .withColumn('category', regexp_replace('category', 'Larceny < $400', 'Larceny-theft'))\
        .withColumn('category', initcap('category'))\
        .withColumn('category', regexp_replace('category', 'Part Ii Crime', 'Part II Crime'))

In [345]:
diff = df.select("category").distinct()
diff.show()

+-------------------+
|           category|
+-------------------+
|            Robbery|
|    Larceny >= $400|
|      Larceny-theft|
|               Rape|
|              Arson|
|           Homicide|
|           Burglary|
|      Part II Crime|
|Motor Vehicle Theft|
|     Larceny < $400|
| Aggravated Assault|
+-------------------+



After processing, crime category only has the 8 serious crimes of Part I crime and all other crimes as Part II crime.

#### 1.2.5 Geocoding

The data only contains physical address of each record. I'll perform geocoding to convert the address into geographic coordinates. Before geocoding, I'll parse the address so that it can be geocoded successfully.

Since geocoding take long time, I'll save the processed data to csv files after partial data being geocoded in case of interruption for any reason.

In [346]:
# Combine block address, city, and zipcode to get full address
# Remove "BLOCK", extra whitespaces, and fix typo
# Remove "0" from "01ST", "02ND", "03RD", "04TH", ...
df = df.withColumn('full_address', concat(df.block_address, lit(", "), df.city, lit(", CA "), df.zipcode))\
       .withColumn('full_address', regexp_replace('full_address', ' BLOCK ', ' '))\
       .withColumn('full_address', regexp_replace('full_address', '  ', ' '))\
       .withColumn('full_address', trim('full_address'))\
       .withColumn('full_address', regexp_replace('full_address', ' CAM ', ' CAMINO '))\
       .withColumn('full_address', regexp_replace('full_address', ' CAMTO ', ' CAMINITO '))\
       .withColumn('full_address', regexp_replace('full_address', ' AVNDA ', ' AVENIDA '))\
       .withColumn('full_address', regexp_replace('full_address', ' CVENIDA ', ' AVENIDA '))\
       .withColumn('full_address', regexp_replace('full_address', ' TRZA ', ' TERRAZA '))\
       .withColumn('full_address', regexp_replace('full_address', ' CR DRIVE', ' CIRCLE'))\
       .withColumn('full_address', regexp_replace('full_address', ' MC ', ' MC'))\
       .withColumn('full_address', regexp_replace('full_address', '(\s)(0)(\d(ST|ND|RD|TH)\s)', '$1$3'))

In [347]:
# Convert Spark DataFrame to Pandas DataFrame
df_pd = df.toPandas()
df = df_pd.dropna(axis=0, subset=['date'])

In [348]:
def geocode_my_address(df):
  try:
    x = geolocator.geocode(df.full_address)
    df['longitude'] = x.longitude
    df['latitude'] = x.latitude
    return df
  except:
    # print("problem with address:", addr)
    df['longitude'] = "NaN"
    df['latitude'] = "NaN"
    return df

In [None]:
# Geocode the data by year and month
# Save the data to Google Drive after processing a month of data 
df_geo = pd.DataFrame() 
geolocator = Nominatim(timeout=10, user_agent = "dlab.berkeley.edu-workshop")
for yr in range(2007, 2018):
  link = '/drive/My Drive/Colab Notebooks/data{}.csv'.format(yr)
  df_cur_yr = pd.DataFrame() 
  for mo in range(1, 13):
    df_cur_mo = df_pd.loc[(df_pd['year'] == yr) & (df_pd['month'] == mo)]
    df_cur_mo = df_cur_mo.apply(geocode_my_address, axis=1)
    df_cur_yr = pd.concat([df_cur_yr, df_cur_mo], ignore_index=True)
    df_cur_yr.to_csv(link)
  df_geo = pd.concat([df_geo, df_cur_yr], ignore_index=True)


In [None]:
df_geo

In [None]:
df_geo.to_csv('/drive/My Drive/Colab Notebooks/data.csv')

The data is cleaned, processed, and geocoded. It's ready for analysis.

### 1.3 Data Visualization

Geospatial Visualization

In [None]:
# Create a map
m = folium.Map(location=[32.817316, -117.043098], tiles='cartodbpositron', zoom_start=10)

heat_df = df_pd[['latitude', 'longitude']]
heat_df = heat_df.dropna(axis=0, subset=['latitude','longitude'])

HeatMap(data=heat_df[['latitude', 'longitude']], radius=10).add_to(m)

In [None]:
        m = folium.Map(location=[32.817316, -117.043098], tiles='cartodbpositron', zoom_start=10)
        HeatMap(data=df_pd[['latitude', 'longitude']], radius=10).add_to(m)
        
        df_pd['latitude'] = df_pd['latitude'].astype(float)
        df_pd['longitude'] = df_pd['longitude'].astype(float)
        
        # Filter the DF for rows, then columns, then remove NaNs
        heat_df = df_pd[['latitude', 'longitude']]
        heat_df = heat_df.dropna(axis=0, subset=['latitude','longitude'])
        
        # List comprehension to make out list of lists
        heat_data = [[row['latitude'],row['longitude']] for index, row in heat_df.iterrows()]
        
        # Plot it on the map
        HeatMap(heat_data).add_to(m)

In [None]:
# Create a map
m = folium.Map(location=[32.817316, -117.043098], tiles='cartodbpositron', zoom_start=10)

# HeatMap(data=df_pd[['latitude', 'longitude']], radius=10).add_to(m)

df_pd['latitude'] = df_pd['latitude'].astype(float)
df_pd['longitude'] = df_pd['longitude'].astype(float)

# Filter the DF for rows, then columns, then remove NaNs
heat_df = df_pd[['latitude', 'longitude']]
heat_df = heat_df.dropna(axis=0, subset=['latitude','longitude'])

# List comprehension to make out list of lists
heat_data = [[row['latitude'],row['longitude']] for index, row in heat_df.iterrows()]

# Plot it on the map
HeatMap(heat_data).add_to(m)

# Display the map
m
