# Lab 9: Geo-spatial aggregation

In this lab we will further explore and analyze the crimes dataset for data quality issues, and then use geo-spatial analysis to determine the neighborhood associated with each crime event, based on its longitude/latitude coordinates. We then use Folium to plot the data on an interactive map.

First, set up the SparkContext and create a HiveContext that uses the "demo" database:

In [1]:
# Set up Spark Context
from pyspark import SparkContext, SparkConf

SparkContext.setSystemProperty('spark.executor.memory', '2g')
conf = SparkConf()
conf.set('spark.executor.instances', 15)
sc = SparkContext('yarn-client', 'Spark-lab9', conf=conf)

from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql("use demo")

DataFrame[result: string]

It's always good to inspect data for quality. We would like to do this for the longitude/latitude data in our dataset.

1. Load the crimes dataset as a Spark DataFrame
2. Use describe() to inspect the properties of the columns 'longitude' and 'latitude'

describe() computes summary statistics (count, mean, standard deviation, min, and max) for each numeric column in the DataFrame.
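To make those statistics concrete, here is a minimal plain-Python sketch of the same five summaries for one column. The function name `describe_column` and the sample values are hypothetical, and the sketch uses the sample standard deviation (dividing by n - 1); whether Spark reports the sample or the population value may depend on the Spark version, so treat that detail as an assumption.

```python
import math

def describe_column(values):
    """Return count, mean, sample stddev, min and max for a list of numbers."""
    n = len(values)
    mean = sum(values) / float(n)
    # Sample variance divides by n - 1; it is undefined for a single value.
    if n > 1:
        var = sum((v - mean) ** 2 for v in values) / (n - 1)
    else:
        var = float("nan")
    return {
        "count": n,
        "mean": mean,
        "stddev": math.sqrt(var),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical latitudes: three plausible San Francisco values and one placeholder.
sample_lat = [37.77, 37.78, 37.76, 90.0]
stats = describe_column(sample_lat)
print(stats["min"], stats["max"])
```

A single placeholder value is enough to pull the maximum far away from the rest of the column, which is the kind of signal to look for in the real output.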

In [2]:
crimes = hc.table('crimes')
crimes.select(crimes.longitude.cast('float').alias('long'), \
              crimes.latitude.cast('float').alias('lat')) \
      .describe().toPandas()

  summary                 long                 lat
0   count            1750133.0           1750133.0
1    mean  -122.42263853403858     37.771270995454
2  stddev   0.0309396211305354  0.4727448638706336
3     min           -122.51364            37.70788
4     max               -120.5                90.0


Which values stand out as abnormal, considering the general longitude/latitude values in San Francisco?

Assuming all anomalies are of similar nature, let's explore how many outliers like this exist. 
* Create a data frame with all these outliers
* Count how many exist
* Print 3 outlier rows.

In [3]:
outliers = crimes.filter((crimes.longitude.cast('float') > -121.0) | (crimes.latitude.cast('float') > 38.0))
print("number of outliers = %d" % outliers.count())

outliers.select("category", "description", "date_str", "longitude", "latitude").limit(3).toPandas()

number of outliers = 143


        category                           description    date_str longitude latitude
0  LARCENY/THEFT          GRAND THEFT FROM LOCKED AUTO  12/30/2005    -120.5       90
1        ASSAULT           INFLICT INJURY ON COHABITEE  12/30/2005    -120.5       90
2        ASSAULT  AGGRAVATED ASSAULT WITH BODILY FORCE  12/30/2005    -120.5       90
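The filter logic used above can be mimicked without Spark. The following is only a sketch on hypothetical rows: the helper `is_outlier` and the sample data are not part of the lab, and the coordinates are kept as strings to mirror the need for a cast.

```python
# Hypothetical rows: (category, longitude, latitude), coordinates stored as strings.
rows = [
    ("LARCENY/THEFT", "-122.41", "37.78"),
    ("ASSAULT", "-120.5", "90"),
    ("ROBBERY", "-122.45", "37.72"),
]

def is_outlier(lon, lat, max_lon=-121.0, max_lat=38.0):
    """True when a coordinate lies outside the rough San Francisco bounds."""
    return float(lon) > max_lon or float(lat) > max_lat

outlier_rows = [r for r in rows if is_outlier(r[1], r[2])]
clean_rows = [r for r in rows if not is_outlier(r[1], r[2])]
print("number of outliers = %d" % len(outlier_rows))
```

Splitting the data into `outlier_rows` and `clean_rows` with one predicate keeps the two sets complementary, so no row is counted twice or lost.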


We now move to some geo-spatial aggregation. The goal is to use ESRI's HIVE UDFs to determine the neighborhood for each crime event, by its longitude/latitude coordinates.
You can find more information about ESRI Hive UDFs here: https://github.com/Esri/spatial-framework-for-hadoop
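To build intuition for what a point-in-polygon test does, here is a small plain-Python sketch using the classic ray-casting method, followed by a per-neighborhood count. This is not ESRI's implementation: the rectangular "neighborhoods", the event coordinates, and the function names are all hypothetical, and points that fall exactly on a boundary are not handled specially.

```python
from collections import Counter

def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / float(y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangular "neighborhoods" as (longitude, latitude) vertices.
neighborhoods = {
    "West": [(-122.50, 37.70), (-122.45, 37.70), (-122.45, 37.80), (-122.50, 37.80)],
    "East": [(-122.45, 37.70), (-122.40, 37.70), (-122.40, 37.80), (-122.45, 37.80)],
}

def find_neighborhood(lon, lat):
    """Return the name of the first polygon containing the point, or None."""
    for name, polygon in neighborhoods.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None

# Hypothetical events; the last one uses the placeholder coordinates seen earlier.
events = [(-122.48, 37.75), (-122.42, 37.76), (-122.41, 37.71), (-120.5, 90.0)]

counts = Counter()
for lon, lat in events:
    name = find_neighborhood(lon, lat)
    if name is not None:  # unmatched points are dropped, as in an inner join
        counts[name] += 1
print(counts)
```

Testing every point against every polygon is quadratic in the worst case, which is one reason to spread the work across many partitions on a large dataset.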

Notes:
* The neighborhood polygon definitions have already been uploaded to HIVE as the table *sf_neighborhoods*, so we can use the ESRI Hive UDF functions to determine the neighborhood name for each crime.
* Remember to filter the data so as to remove any events with anomalous longitude/latitude values.
* Notice the "repartition(50)": it increases parallelism and makes this query run faster in Spark SQL.
* We add the various jars to make ESRI UDFs work properly.

In [4]:
hc.sql("add jar /home/jupyter/notebooks/jars/guava-11.0.2.jar")
hc.sql("add jar /home/jupyter/notebooks/jars/spatial-sdk-json.jar")
hc.sql("add jar /home/jupyter/notebooks/jars/esri-geometry-api.jar")
hc.sql("add jar /home/jupyter/notebooks/jars/spatial-sdk-hive.jar")

hc.sql("create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains'")
hc.sql("create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point'")

cf = hc.sql("""
SELECT  date_str, time, longitude, latitude, resolution, category, district, dayofweek, description
FROM crimes
WHERE longitude < -121.0 and latitude < 38.0
""").repartition(50)
cf.registerTempTable("cf")

crimes_wn = hc.sql("""
SELECT date_str, time, dayofweek, category, district, resolution, description, longitude, latitude,
       neighborho as neighborhood 
FROM sf_neighborhoods JOIN cf
WHERE ST_Contains(sf_neighborhoods.shape, ST_Point(cf.longitude, cf.latitude))
""").cache()

crimes_per_neighborhood = crimes_wn.groupBy('neighborhood').count().toPandas()
print(crimes_per_neighborhood.sort(columns='count', ascending=False))

             neighborhood   count
36  Downtown/Civic Center  285122
16        South of Market  241230
13                Mission  190606
33                Bayview  119412
28       Western Addition  113156
2      Financial District   83757
26    Castro/Upper Market   47246
7             North Beach   43650
18         Haight Ashbury   41898
25              Excelsior   40810
0          Bernal Heights   38881
23          Outer Mission   37123
27           Potrero Hill   35745
30      Visitacion Valley   34358
6          Inner Richmond   33036
21               Nob Hill   31712
35           Outer Sunset   31298
14                 Marina   30993
12             Ocean View   30724
32           Russian Hill   26045
11              Lakeshore   25716
8          Outer Richmond   23947
5         Pacific Heights   20754
17               Parkside   19860
24           Inner Sunset   18248
9      West of Twin Peaks   16618
15              Chinatown   16016
22             Noe Valley   15762
3        Golde

Store the updated crimes dataset, now including neighborhood names, in an ORC table in HIVE called "crimes_wn", using Spark's DataFrameWriter API and its saveAsTable() function:

In [5]:
crimes_wn.write.format("orc").saveAsTable("crimes_wn", mode="overwrite")

Now let's define the inline_map() helper function to draw maps with Folium:

In [6]:
from IPython.display import HTML
map_width=1000
map_height=600

def inline_map(m, width=map_width, height=map_height):
    m.create_map()
    # Escape double quotes so the HTML can sit inside the srcdoc attribute
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{}" '
                 'style="width: {}px; height: {}px; '
                 'border: none"></iframe>'.format(srcdoc, width, height))
    return embed
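The embedding trick in this helper can be tried without Folium. The sketch below wraps an arbitrary HTML string in an iframe through the `srcdoc` attribute; the names `escape_attr` and `iframe_for` are hypothetical. Unlike the helper above, it also escapes ampersands, which is the safer choice for a double-quoted attribute value, although whether it matters depends on the HTML being embedded.

```python
def escape_attr(text):
    """Escape a string for use inside a double-quoted HTML attribute."""
    # Escape '&' first so the entities added next are not double-escaped.
    return text.replace('&', '&amp;').replace('"', '&quot;')

def iframe_for(html_doc, width=1000, height=600):
    """Wrap an HTML document in an iframe via the srcdoc attribute."""
    return ('<iframe srcdoc="{}" '
            'style="width: {}px; height: {}px; '
            'border: none"></iframe>').format(escape_attr(html_doc), width, height)

snippet = iframe_for('<div id="map">hello</div>', width=400, height=300)
print(snippet)
```

In a notebook, passing the resulting string to `IPython.display.HTML` would render it inline, as `inline_map()` does.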

Use the Folium package to draw a map centered at the heart of San Francisco (Latitude 37.77, Longitude -122.4), and specify a starting zoom level of 12. 

In [7]:
import pandas as pd
import folium

sf_lat = 37.77
sf_long = -122.4

map_sf = folium.Map(location=[sf_lat, sf_long], zoom_start=12)
inline_map(map_sf)

We have pre-loaded into the "data" folder a GeoJSON file that contains the boundaries of all San Francisco neighborhoods. Use Folium's geo_json function to draw the boundaries on the map:

In [8]:
map_sf = folium.Map(location=[sf_lat, sf_long], zoom_start=12, width=map_width, height=map_height)
map_sf.geo_json(geo_path='data/sfn.geojson')
inline_map(map_sf)

Using crimes_per_neighborhood we computed earlier, plot a map color-coded with the number of crimes in each neighborhood:

In [9]:
map_sf = folium.Map(location=[sf_lat, sf_long], zoom_start=12, width=map_width, height=map_height)
map_sf.geo_json(geo_path='data/sfn.geojson', data=crimes_per_neighborhood,
                columns=['neighborhood', 'count'],
                key_on='feature.properties.neighborho',
                fill_color='YlGn', fill_opacity=0.7, line_opacity=0.2,
                legend_name='Number of crimes')
inline_map(map_sf)
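Color-coding a map comes down to assigning each count to a bin and each bin to a color. The sketch below shows one plausible way to do that with fixed thresholds; it is not Folium's actual binning algorithm, and the thresholds and hex colors are illustrative assumptions. The three counts are taken from the aggregation output earlier in the lab.

```python
from bisect import bisect_right

def color_for(count, thresholds, palette):
    """Pick the palette entry for the bin that `count` falls into."""
    # bisect_right returns how many thresholds are <= count
    return palette[bisect_right(thresholds, count)]

# Hypothetical thresholds and a 4-step yellow-to-green palette (illustrative values).
thresholds = [30000, 100000, 200000]
palette = ["#ffffcc", "#c2e699", "#78c679", "#238443"]

counts = {"Downtown/Civic Center": 285122, "Bayview": 119412, "Chinatown": 16016}
colors = {name: color_for(c, thresholds, palette) for name, c in counts.items()}
print(colors)
```

With three thresholds there are four bins, so the palette needs exactly one more entry than the threshold list.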

Use ESRI's HIVE UDFs to compute the centroid of each neighborhood, and then plot a Folium map with a simple_marker for each neighborhood, displaying the neighborhood name and number of crimes in that neighborhood: 

In [10]:
hc.sql("create temporary function ST_Centroid as 'com.esri.hadoop.hive.ST_Centroid'")
hc.sql("create temporary function ST_X as 'com.esri.hadoop.hive.ST_X'")
hc.sql("create temporary function ST_Y as 'com.esri.hadoop.hive.ST_Y'")

rdd_centroid = hc.sql("""
SELECT neighborho as neighborhood, 
       ST_X(ST_Centroid(sf_neighborhoods.shape)) as cent_longitude,
       ST_Y(ST_Centroid(sf_neighborhoods.shape)) as cent_latitude
FROM sf_neighborhoods
""")

map_sf = folium.Map(location=[sf_lat, sf_long], zoom_start=12, width=map_width, height=map_height)
s = pd.Series(index=crimes_per_neighborhood['neighborhood'].values, \
              data=crimes_per_neighborhood['count'].values.astype(float))
for n in rdd_centroid.collect():
    map_sf.simple_marker([n.cent_latitude, n.cent_longitude], \
                         popup=n.neighborhood + "=" + str(int(s[n.neighborhood])))

inline_map(map_sf)
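For intuition about what a polygon centroid is, here is a plain-Python sketch of the area-weighted centroid based on the shoelace formula. It treats longitude/latitude as planar coordinates and assumes a simple (non-self-intersecting) polygon with non-zero area; ESRI's ST_Centroid may use a different definition, so the two are not guaranteed to agree. The function name and the rectangle are hypothetical.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        area2 += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    # Centroid = sum / (6 * area), and area = area2 / 2
    return cx / (3.0 * area2), cy / (3.0 * area2)

# Hypothetical rectangular neighborhood as (longitude, latitude) vertices.
rect = [(-122.50, 37.70), (-122.40, 37.70), (-122.40, 37.80), (-122.50, 37.80)]
cent_longitude, cent_latitude = polygon_centroid(rect)
print(cent_longitude, cent_latitude)
```

For a rectangle the result is simply the midpoint, which makes it an easy sanity check before trying irregular shapes.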