# Crime Analysis in New York City Example

__Learning Objective:__

* Cleaning and Transforming Unstructured Data with Spark's RDD
* Learn Spark RDD's operations like filter, map, reduce, countByValue, etc.

### Load the data and get a quick sense of it

__Read the compressed file content using Python's lzma library__

<font color="red">TODO: configure correct "filePath"</font>

In [1]:
filePath="D:\\Tirthal-LABs\\xLocal-Git-Repo\\Learning-BigData\\gs-spark\\gs-spark-2x\\src\\main\\python\\02\\NYPD_7_Major_Felony_Incidents.xz"
## For Linux "/Users/tirthalp/something/gs-spark-2x/src/main/python/02/NYPD_7_Major_Felony_Incidents.xz" (a plain path; lzma.open does not accept file:// URIs)
## For Windows "C:\\Users\\tirthalp\\something\\gs-spark-2x\\src\\main\\python\\02\\NYPD_7_Major_Felony_Incidents.xz"

import lzma
with lzma.open(filePath, 'rt') as f:
    file_content = list(f)

print("Type of file_content variable = ", type(file_content))
print(file_content[0])
print(file_content[1])

Type of file_content variable =  <class 'list'>
OBJECTID,Identifier,Occurrence Date,Day of Week,Occurrence Month,Occurrence Day,Occurrence Year,Occurrence Hour,CompStat Month,CompStat Day,CompStat Year,Offense,Offense Classification,Sector,Precinct,Borough,Jurisdiction,XCoordinate,YCoordinate,Location 1

1,f070032d,09/06/1940 07:30:00 PM,Friday,Sep,6,1940,19,9,7,2010,BURGLARY,FELONY,D,66,BROOKLYN,N.Y. POLICE DEPT,987478,166141,"(40.6227027620001, -73.9883732929999)"
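
The same read pattern can be tried on a tiny file first. This is a minimal sketch: the two sample rows and the temporary path are assumptions, not the NYPD data. It shows that `lzma.open(..., 'rt')` yields text lines that keep their trailing newline, which is why the cleaning step later removes `\n`.

```python
import lzma
import os
import tempfile

# Hypothetical two-line sample standing in for the real data set.
sample_text = "OBJECTID,Offense\n1,BURGLARY\n"

# Write the sample to a temporary .xz file, then read it back in text mode.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.xz")
with lzma.open(sample_path, "wt") as f:
    f.write(sample_text)

with lzma.open(sample_path, "rt") as f:
    sample_lines = list(f)  # one list element per line, newline included

print(sample_lines)
```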



__Convert Python list into Spark's RDD type__

In [2]:
data = sc.parallelize(file_content)
type(data)

pyspark.rdd.RDD

If the next cell raises an error, wait a few seconds and run it again.

In [3]:
# Let's get a quick sense of the data using the RDD's take operation
data.take(5)

['OBJECTID,Identifier,Occurrence Date,Day of Week,Occurrence Month,Occurrence Day,Occurrence Year,Occurrence Hour,CompStat Month,CompStat Day,CompStat Year,Offense,Offense Classification,Sector,Precinct,Borough,Jurisdiction,XCoordinate,YCoordinate,Location 1\n',
 '1,f070032d,09/06/1940 07:30:00 PM,Friday,Sep,6,1940,19,9,7,2010,BURGLARY,FELONY,D,66,BROOKLYN,N.Y. POLICE DEPT,987478,166141,"(40.6227027620001, -73.9883732929999)"\n',
 '2,c6245d4d,12/14/1968 12:20:00 AM,Saturday,Dec,14,1968,0,12,14,2008,GRAND LARCENY,FELONY,G,28,MANHATTAN,N.Y. POLICE DEPT,996470,232106,"(40.8037530600001, -73.955861904)"\n',
 '3,716dbc6f,10/30/1970 03:30:00 PM,Friday,Oct,30,1970,15,10,31,2008,BURGLARY,FELONY,H,84,BROOKLYN,N.Y. POLICE DEPT,986508,190249,"(40.688874254, -73.9918594329999)"\n',
 '4,638cd7b7,07/18/1972 11:00:00 PM,Tuesday,Jul,18,1972,23,7,19,2012,GRAND LARCENY OF MOTOR VEHICLE,FELONY,F,73,BROOKLYN,N.Y. POLICE DEPT,1005876,182440,"(40.6674141890001, -73.9220463899999)"\n']

### Clean the data (e.g. remove '\n' from the end of each line and filter out the header row)

In [4]:
# Clean '\n' at the end of each line
data = data.map(lambda x:x.replace("\n",""))

In [5]:
header = data.first()
print(header)

OBJECTID,Identifier,Occurrence Date,Day of Week,Occurrence Month,Occurrence Day,Occurrence Year,Occurrence Hour,CompStat Month,CompStat Day,CompStat Year,Offense,Offense Classification,Sector,Precinct,Borough,Jurisdiction,XCoordinate,YCoordinate,Location 1


In [6]:
# Filter the header row
dataWoHeader = data.filter(lambda x: x!=header)

In [7]:
dataWoHeader.first()

'1,f070032d,09/06/1940 07:30:00 PM,Friday,Sep,6,1940,19,9,7,2010,BURGLARY,FELONY,D,66,BROOKLYN,N.Y. POLICE DEPT,987478,166141,"(40.6227027620001, -73.9883732929999)"'
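
The two cleaning steps can also be tried without a Spark context, since Python's built-in `map` and `filter` accept the same kind of lambda as the RDD methods. This is only an analogy sketch on made-up rows (`raw_lines` is hypothetical); it does not run on a cluster.

```python
# Hypothetical raw lines, shaped like the output of data.take(...).
raw_lines = ["OBJECTID,Offense\n", "1,BURGLARY\n", "2,ROBBERY\n"]

# Step 1: drop the trailing newline from every line (mirrors data.map).
cleaned = list(map(lambda x: x.replace("\n", ""), raw_lines))

# Step 2: keep every line that differs from the header (mirrors data.filter).
sample_header = cleaned[0]
without_header = list(filter(lambda x: x != sample_header, cleaned))

print(without_header)
```

Note that comparing against the header text removes every line equal to the header, not just the first one.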

### Transform data from semi-structured strings to structured Crime objects (extract fields from records)

In [8]:
# How to transform an RDD of record strings into an RDD of comma-separated field lists?
dataWoHeader.map(lambda x:x.split(",")).first()

['1',
 'f070032d',
 '09/06/1940 07:30:00 PM',
 'Friday',
 'Sep',
 '6',
 '1940',
 '19',
 '9',
 '7',
 '2010',
 'BURGLARY',
 'FELONY',
 'D',
 '66',
 'BROOKLYN',
 'N.Y. POLICE DEPT',
 '987478',
 '166141',
 '"(40.6227027620001',
 ' -73.9883732929999)"']
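
The plain `split(",")` yields 21 pieces for 20 columns, because the quoted `Location 1` value itself contains a comma. The sketch below takes the first record shown earlier and compares the naive split with `csv.reader`, which respects the quotes.

```python
import csv
from io import StringIO

# First data record, copied from the output above.
record = ('1,f070032d,09/06/1940 07:30:00 PM,Friday,Sep,6,1940,19,9,7,2010,'
          'BURGLARY,FELONY,D,66,BROOKLYN,N.Y. POLICE DEPT,987478,166141,'
          '"(40.6227027620001, -73.9883732929999)"')

naive_fields = record.split(",")                  # splits inside the quotes too
csv_fields = next(csv.reader(StringIO(record)))   # keeps the quoted value whole

print(len(naive_fields), len(csv_fields))
print(csv_fields[-1])
```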

__How to transform string records into named tuples (parse the rows to extract fields)?__

In [9]:
import csv
from io import StringIO
from collections import namedtuple

In [10]:
fields = header.replace(" ","_").replace("/","_").split(",")
print(fields)

['OBJECTID', 'Identifier', 'Occurrence_Date', 'Day_of_Week', 'Occurrence_Month', 'Occurrence_Day', 'Occurrence_Year', 'Occurrence_Hour', 'CompStat_Month', 'CompStat_Day', 'CompStat_Year', 'Offense', 'Offense_Classification', 'Sector', 'Precinct', 'Borough', 'Jurisdiction', 'XCoordinate', 'YCoordinate', 'Location_1']
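
The replacements matter because `namedtuple` accepts only valid Python identifiers as field names. A small check, assuming a shortened, hypothetical header (`short_header`) for brevity:

```python
from collections import namedtuple

short_header = "OBJECTID,Occurrence Date,Location 1"  # hypothetical, abbreviated header

raw_names = short_header.split(",")
clean_names = short_header.replace(" ", "_").replace("/", "_").split(",")

# Names containing spaces are not valid identifiers; the cleaned ones are.
print([name.isidentifier() for name in raw_names])
print([name.isidentifier() for name in clean_names])

Row = namedtuple("Row", clean_names)  # hypothetical tuple type for this check
print(Row._fields)
```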


In [11]:
# The 'verbose' argument was removed in Python 3.7, so it is omitted here
Crime = namedtuple('Crime', fields)

In [12]:
def parse(row):
    reader = csv.reader(StringIO(row))
    row = next(reader)
    return Crime(*row)

In [13]:
# Transform String to Crime object
crimes=dataWoHeader.map(parse)

In [14]:
crimes.first()

Crime(OBJECTID='1', Identifier='f070032d', Occurrence_Date='09/06/1940 07:30:00 PM', Day_of_Week='Friday', Occurrence_Month='Sep', Occurrence_Day='6', Occurrence_Year='1940', Occurrence_Hour='19', CompStat_Month='9', CompStat_Day='7', CompStat_Year='2010', Offense='BURGLARY', Offense_Classification='FELONY', Sector='D', Precinct='66', Borough='BROOKLYN', Jurisdiction='N.Y. POLICE DEPT', XCoordinate='987478', YCoordinate='166141', Location_1='(40.6227027620001, -73.9883732929999)')

In [15]:
# How to get value of a field (e.g. offense field)?
crimes.first().Offense

'BURGLARY'

### Identify and remove records with missing values

In [16]:
# Review total count by offense 
crimes.map(lambda x:x.Offense).countByValue()

defaultdict(int,
            {'BURGLARY': 191369,
             'GRAND LARCENY': 428993,
             'GRAND LARCENY OF MOTOR VEHICLE': 101963,
             'RAPE': 13779,
             'ROBBERY': 198744,
             'FELONY ASSAULT': 184042,
             'MURDER & NON-NEGL. MANSLAUGHTE': 4574,
             'NA': 1})

__Problem__: one record has the offense type 'NA', which looks like an anomaly.
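
`countByValue` returns a dictionary that maps each distinct value to its number of occurrences. For intuition, `collections.Counter` does the same on a local list; the offense values below are made up for illustration.

```python
from collections import Counter

# Hypothetical offense values, not taken from the data set.
sample_offenses = ["BURGLARY", "ROBBERY", "BURGLARY", "NA", "RAPE", "BURGLARY"]
offense_counts = Counter(sample_offenses)
print(offense_counts)

# Values that occur only once are candidates for a closer look.
rare = [value for value, count in offense_counts.items() if count == 1]
print(rare)
```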

In [17]:
# Review total crimes by each year to find missing values
crimes.map(lambda x:x.Occurrence_Year).countByValue()

defaultdict(int,
            {'1940': 1,
             '1968': 1,
             '1970': 2,
             '1972': 2,
             '1987': 6,
             '1990': 17,
             '1992': 12,
             '1994': 19,
             '1995': 27,
             '1996': 34,
             '1998': 74,
             '1999': 124,
             '2000': 282,
             '2001': 343,
             '2002': 368,
             '2003': 490,
             '2004': 692,
             '2005': 3272,
             '2006': 127887,
             '1910': 3,
             '1913': 4,
             '1945': 2,
             '1981': 1,
             '1985': 8,
             '1988': 6,
             '1991': 12,
             '1905': 2,
             '1971': 1,
             '1997': 40,
             '1914': 2,
             '1956': 1,
             '1989': 12,
             '1993': 23,
             '2015': 102657,
             '1954': 1,
             '1982': 5,
             '1950': 1,
             '1959': 1,
             '1966': 7,
             ... (remaining output truncated)

__Problem 1__: 244 crimes have no associated year. <br/> 
__Problem 2__: the number of crimes jumps sharply after 2005, which suggests that data for the years before 2006 is largely missing.

In [18]:
# Filter to exclude records with anomalies and missing values
crimesFiltered=crimes.filter(lambda x : not (x.Offense=="NA" or x.Occurrence_Year==''))\
                .filter(lambda x: int(x.Occurrence_Year)>=2006)

In [19]:
crimesFiltered.map(lambda x:x.Occurrence_Year).countByValue()

defaultdict(int,
            {'2006': 127887,
             '2015': 102657,
             '2007': 120554,
             '2008': 117375,
             '2009': 106018,
             '2010': 105643,
             '2011': 107206,
             '2012': 111798,
             '2013': 111286,
             '2014': 106849})
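
The order of the two filters matters: `int('')` raises `ValueError`, so records with an empty year must be dropped before the numeric comparison. Below is a sketch of the same logic as one named predicate, using a cut-down, hypothetical `MiniCrime` tuple and made-up records.

```python
from collections import namedtuple

MiniCrime = namedtuple("MiniCrime", ["Offense", "Occurrence_Year"])  # hypothetical

def is_usable(crime, first_year=2006):
    """Return True for records with a known offense and a year >= first_year."""
    if crime.Offense == "NA" or crime.Occurrence_Year == "":
        return False  # checked first so int() never sees an empty string
    return int(crime.Occurrence_Year) >= first_year

sample = [
    MiniCrime("BURGLARY", "2010"),
    MiniCrime("NA", "2012"),
    MiniCrime("ROBBERY", ""),
    MiniCrime("RAPE", "1998"),
]
kept = [c for c in sample if is_usable(c)]
print(kept)
```

A function like this could be passed to the RDD's `filter` in place of the two chained lambdas, which also makes it easy to test on its own.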

### Identify and remove records with anomalies

In [20]:
# A function to extract (latitude, longitude) from a given location 
def extractCoords(location):
    location_lat = float(location[1:location.index(",")])
    location_lon = float(location[location.index(",")+1:-1])
    return (location_lat,location_lon)
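
A quick check of the function on the `Location_1` value of the first record shown earlier. The function is repeated here so that the snippet runs on its own; note that `float` tolerates the leading space before the longitude.

```python
def extractCoords(location):
    # location looks like "(lat, lon)": slice between "(" and ",", then "," and ")"
    location_lat = float(location[1:location.index(",")])
    location_lon = float(location[location.index(",")+1:-1])
    return (location_lat, location_lon)

coords = extractCoords("(40.6227027620001, -73.9883732929999)")
print(coords)
```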

Get the minimum and maximum (latitude, longitude) values, then check on Google Maps whether they fall within the boundaries of New York City.

In [21]:
crimesFiltered.map(lambda x:extractCoords(x.Location_1))\
            .reduce(lambda x,y:(min(x[0],y[0]),min(x[1],y[1])))

(40.112709974, -77.519206334)

In [22]:
crimesFiltered.map(lambda x:extractCoords(x.Location_1))\
            .reduce(lambda x,y:(max(x[0],y[0]),max(x[1],y[1])))

(59.5805088160001, -73.700716685)

__Problem__: On Google Maps, the results above point to locations in Pennsylvania and northern Canada, which clearly indicates anomalies in the location field.
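
Each of the two `reduce` calls scans the data once. As a sketch, both extremes can be found in a single pass by carrying a `(min_lat, min_lon, max_lat, max_lon)` tuple. Here `functools.reduce` stands in for the RDD's `reduce`, and `points` is a hypothetical list built from coordinate values seen in the outputs above.

```python
from functools import reduce

# Hypothetical points; the values are borrowed from the outputs above.
points = [
    (40.6227027620001, -73.9883732929999),
    (40.112709974, -77.519206334),
    (59.5805088160001, -73.700716685),
]

# Turn each point into a degenerate box, then merge boxes pairwise.
boxes = [(lat, lon, lat, lon) for lat, lon in points]
bbox = reduce(
    lambda a, b: (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])),
    boxes,
)
print(bbox)
```

The merge is associative and commutative, which is what a distributed `reduce` needs in order to combine partial results in any order.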

In [23]:
# Use the bounding-box coordinates of New York City to filter out records with anomalous locations
def isInNYC(location):
    lat, lon = extractCoords(location)  # parse each location only once
    return 40.477399 <= lat <= 40.917577 and -74.25909 <= lon <= -73.700009

crimesFinal = crimesFiltered.filter(lambda x: isInNYC(x.Location_1))

### Draw insights from the New York City crime data (now cleaned and structured)

In [24]:
# What is the crime trend by year in New York City? Overall, crime falls from 2006 to 2015, with a temporary rise in 2011 and 2012
crimesFinal.map(lambda x:x.Occurrence_Year).countByValue()

defaultdict(int,
            {'2006': 127887,
             '2015': 102657,
             '2007': 120491,
             '2008': 117375,
             '2009': 106018,
             '2010': 105639,
             '2011': 107203,
             '2012': 111798,
             '2013': 111286,
             '2014': 106849})

In [25]:
# What is the crime trend by year for a given offense type? e.g. burglary counts fall overall from 2006 to 2015, despite small rises in 2011 and 2012
crimesFinal.filter(lambda x : x.Offense=='BURGLARY')\
            .map(lambda x:x.Occurrence_Year)\
            .countByValue()

defaultdict(int,
            {'2006': 23069,
             '2007': 21715,
             '2008': 20732,
             '2009': 19441,
             '2010': 18700,
             '2011': 18860,
             '2012': 19309,
             '2013': 17419,
             '2014': 16832,
             '2015': 14967})
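
To check the trend against the numbers, the sketch below computes year-over-year changes from the burglary counts printed above. `burglary_by_year` is copied from that output; the variable names are illustrative.

```python
burglary_by_year = {
    '2006': 23069, '2007': 21715, '2008': 20732, '2009': 19441, '2010': 18700,
    '2011': 18860, '2012': 19309, '2013': 17419, '2014': 16832, '2015': 14967,
}

years = sorted(burglary_by_year)

# Years whose count is higher than the previous year's.
increases = [
    year for prev, year in zip(years, years[1:])
    if burglary_by_year[year] > burglary_by_year[prev]
]

# Difference between the last and the first year in the range.
overall_change = burglary_by_year[years[-1]] - burglary_by_year[years[0]]

print(increases)
print(overall_change)
```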

In [26]:
crimesFinal.persist()

PythonRDD[14] at RDD at PythonRDD.scala:49

### How to visualize result of crimes for given offense type in given year?

<font color="red">TODO: Go to Python console in Anaconda and run - </font> `pip install gmplot` 

In [35]:
# Python's gmplot module plots points on a Google Map using latitudes and longitudes
import gmplot
# Chaining .from_geocode("New York City") raised "IndexError: list index out of range" here:
# the name lookup returned no results (possibly because no API key was supplied).
# Passing the map center (latitude, longitude) and zoom level directly avoids the lookup.
gmap = gmplot.GoogleMapPlotter(40.730610, -73.935242, 15)

In [31]:
b_lats = crimesFinal.filter(lambda x : x.Offense=='BURGLARY' and x.Occurrence_Year=="2015")\
                    .map(lambda x:extractCoords(x.Location_1)[0])\
                    .collect()

In [32]:
b_lons = crimesFinal.filter(lambda x : x.Offense=='BURGLARY' and x.Occurrence_Year=="2015")\
                    .map(lambda x:extractCoords(x.Location_1)[1])\
                    .collect()
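
The two cells above filter and parse the same records twice. One possible alternative is to collect `(lat, lon)` pairs once and split them locally with `zip(*pairs)`. The `pairs` list below is a hypothetical stand-in for such a collected result.

```python
# Hypothetical result of collecting (latitude, longitude) pairs in one pass.
pairs = [(40.70, -73.95), (40.75, -73.99), (40.68, -73.92)]

# zip(*pairs) transposes the list of pairs into two tuples;
# the fallback covers an empty result, where unpacking would otherwise fail.
lats, lons = zip(*pairs) if pairs else ((), ())

print(lats)
print(lons)
```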

In [33]:
gmap.scatter(b_lats, b_lons, '#DE1515', size=40, marker=False)

In [35]:
gmap.draw("mymap.html")

Open the "mymap.html" file in a browser.

### Disclaimer

This is example code from the [Pluralsight course](https://app.pluralsight.com/library/courses/apache-spark-beginning-data-exploration-analysis), which I upgraded to work with Python 3.6, along with a few other changes, while learning Spark.