# Lab 9: geo-spatial aggregation

In this lab we will further explore and analyze the crimes dataset for data quality issues, and then use geo-spatial analysis to determine the neighborhood associated with each crime event, based on its longitude/latitude coordinates. We then use Folium to plot the data on an interactive map.

First, set up the Spark context and create a HiveContext that uses the "demo" database:

In [None]:
# Set up Spark Context
from pyspark import SparkContext, SparkConf

SparkContext.setSystemProperty('spark.executor.memory', '2g')
conf = SparkConf()
conf.set('spark.executor.instances', 15)
sc = SparkContext('yarn-client', 'Spark-lab9', conf=conf)

from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql("use demo")

It's always good to inspect data for quality. We would like to do this for the longitude/latitude data in our dataset.

1. Load the crimes dataset as a Spark DataFrame
2. Use describe() to inspect the properties of the columns 'longitude' and 'latitude'

describe() computes summary statistics (count, mean, standard deviation, min, and max) for each numeric column in the DataFrame.
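To make those statistics concrete, here is a minimal pure-Python sketch of the same kind of summary for a single column. The `summarize` helper and the sample latitudes are hypothetical, and using the sample standard deviation is an assumption; this is not Spark's implementation.

```python
import math

def summarize(values):
    """Return count, mean, sample standard deviation, min and max, skipping None."""
    xs = [float(v) for v in values if v is not None]
    n = len(xs)
    mean = sum(xs) / n
    # Sample variance divides by n - 1 (assumed here; undefined for a single value)
    var = sum((x - mean) ** 2 for x in xs) / (n - 1) if n > 1 else float("nan")
    return {"count": n, "mean": mean, "stddev": math.sqrt(var),
            "min": min(xs), "max": max(xs)}

# Hypothetical latitudes: three plausible San Francisco values and one suspicious one
latitudes = [37.77, 37.78, 37.76, 90.0]
stats = summarize(latitudes)
```

A single extreme value shows up immediately in `max` and pulls `mean` well away from the typical values, which is the kind of signal to look for in the real output.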

In [None]:
crimes = hc.<YOUR CODE HERE>
crimes.<YOUR CODE HERE>

Which values stand out as abnormal, given the typical longitude/latitude values for San Francisco?

Assuming all anomalies are of a similar nature, let's find out how many such outliers exist.
* Create a DataFrame containing all these outliers
* Count how many exist
* Print 3 outlier rows.
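The idea behind these steps can be sketched in plain Python, independent of Spark. The thresholds below mirror the filter used later in this lab (valid rows have longitude < -121.0 and latitude < 38.0); the `is_outlier` helper and the sample rows are hypothetical.

```python
def is_outlier(longitude, latitude):
    """True when a point falls outside a loose bound around San Francisco."""
    # Valid rows satisfy longitude < -121.0 and latitude < 38.0,
    # so an outlier is anything that fails that test
    return not (longitude < -121.0 and latitude < 38.0)

# Hypothetical rows as (category, longitude, latitude)
rows = [
    ("THEFT", -122.41, 37.78),
    ("ASSAULT", -120.5, 90.0),
    ("FRAUD", -122.45, 37.76),
]

outlier_rows = [r for r in rows if is_outlier(r[1], r[2])]
print("number of outliers = %d" % len(outlier_rows))
```

In the notebook itself the same condition would be expressed as a DataFrame filter rather than a list comprehension.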

In [None]:
outliers = crimes.<YOUR CODE HERE>
print("number of outliers = %d" % outliers.count())

outliers.select("category", "description", "date_str", "longitude", "latitude").limit(3).toPandas()

We now move on to geo-spatial aggregation. The goal is to use ESRI's Hive UDFs to determine the neighborhood of each crime event from its longitude/latitude coordinates.
You can find more information about the ESRI Hive UDFs here: https://github.com/Esri/spatial-framework-for-hadoop

Notes:
* The neighborhood polygon definitions have already been uploaded to Hive as the table *sf_neighborhoods*, so we can use the ESRI Hive UDFs to determine the neighborhood name for each crime.
* Remember to filter the data so as to remove any events with anomalous longitude/latitude values.
* Notice the "repartition(50)": it increases parallelism and makes the Spark SQL query faster.
* We add the various jars to make ESRI UDFs work properly.
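Conceptually, the join asks of every crime point "which polygon contains it?". The sketch below illustrates one common way to answer that question, the ray-casting test, in plain Python. It is an illustration only and is not how the ESRI library implements ST_Contains; the `point_in_polygon` helper and the rectangular polygon are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count the polygon edges crossed by a horizontal ray from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / float(y2 - y1)
            if x < x_cross:
                # Each crossing to the right of the point flips inside/outside
                inside = not inside
    return inside

# Hypothetical rectangular "neighborhood" as (longitude, latitude) vertices
square = [(-122.45, 37.75), (-122.40, 37.75), (-122.40, 37.80), (-122.45, 37.80)]

print(point_in_polygon(-122.42, 37.77, square))
print(point_in_polygon(-122.30, 37.77, square))
```

Because every point has to be tested against the polygons, the work grows with the number of crimes times the number of neighborhoods, which is one reason the extra parallelism from repartitioning helps.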

In [None]:
hc.sql("add jar /home/jupyter/notebooks/jars/guava-11.0.2.jar")
hc.sql("add jar /home/jupyter/notebooks/jars/esri-geometry-api.jar")
hc.sql("add jar /home/jupyter/notebooks/jars/spatial-sdk-hive.jar")
hc.sql("add jar /home/jupyter/notebooks/jars/spatial-sdk-json.jar")

hc.sql("create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains'")
hc.sql("create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point'")

cf = hc.sql("""
SELECT  date_str, time, longitude, latitude, resolution, category, district, dayofweek, description
FROM crimes
WHERE longitude < -121.0 and latitude < 38.0
""").repartition(50)
cf.registerTempTable("cf")

crimes_wn = hc.sql("""
SELECT date_str, time, dayofweek, category, district, resolution, description, longitude, latitude,
       neighborho as neighborhood 
FROM sf_neighborhoods JOIN cf
WHERE ST_Contains(sf_neighborhoods.shape, ST_Point(cf.longitude, cf.latitude))
""").cache()

crimes_per_neighborhood = crimes_wn.groupBy('neighborhood').count().toPandas()
print(crimes_per_neighborhood.sort(columns='count', ascending=False))


Store the updated crimes dataset, now including neighborhood names, as an ORC table in Hive called "crimes_wn", using Spark's DataFrameWriter API and its saveAsTable() function.

In [None]:
crimes_wn.<YOUR CODE HERE>

Now let's define the inline_map() helper function to draw maps with Folium:

In [None]:
from IPython.display import HTML
map_width=1000
map_height=600

def inline_map(m, width=map_width, height=map_height):
    m.create_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{}" '
                 'style="width: {}px; height: {}px; '
                 'border: none"></iframe>'.format(srcdoc, width, height))
    return embed

Use the Folium package to draw a map centered at the heart of San Francisco (Latitude 37.77, Longitude -122.4), and specify a starting zoom level of 12. 

In [None]:
import pandas as pd
import folium

sf_lat = 37.77
sf_long = -122.4

map_sf = folium.<YOUR CODE HERE>
inline_map(map_sf)

We have pre-loaded a GeoJSON file into the "data" folder that contains the boundaries of all San Francisco neighborhoods. Use Folium's geo_json function to draw the boundaries on the map:

In [None]:
map_sf = folium.<YOUR CODE HERE>
map_sf.geo_json(<YOUR CODE HERE>)
inline_map(map_sf)

Using the crimes_per_neighborhood DataFrame we computed earlier, plot a map color-coded by the number of crimes in each neighborhood:
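A color-coded (choropleth) map boils down to joining a table of values to the GeoJSON features through a shared key, then turning each value into a color intensity. Here is a minimal sketch of that join in plain Python. The feature names, the counts, the `shade_features` helper, and the use of `neighborho` as the GeoJSON property name are all assumptions for illustration; check the real file for its actual property names.

```python
import json

# Hypothetical GeoJSON; geometries are omitted (null) to keep the sketch short
geojson = json.loads("""
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"neighborho": "Mission"}, "geometry": null},
  {"type": "Feature", "properties": {"neighborho": "Marina"}, "geometry": null},
  {"type": "Feature", "properties": {"neighborho": "Presidio"}, "geometry": null}
]}
""")

# Hypothetical crime counts keyed by neighborhood name
counts = {"Mission": 120, "Marina": 30}

def shade_features(geojson, counts, key):
    """Map each feature's key property to a 0..1 intensity based on its count."""
    top = float(max(counts.values()))
    shades = {}
    for feature in geojson["features"]:
        name = feature["properties"][key]
        # Features with no matching row get None so they can be drawn in a neutral color
        shades[name] = counts[name] / top if name in counts else None
    return shades

shades = shade_features(geojson, counts, "neighborho")
```

If the key values in the table and in the GeoJSON do not match exactly (spelling, case, whitespace), the affected neighborhoods end up without a value, so it is worth comparing the two sets of names before plotting.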

In [None]:
map_sf = folium.<YOUR CODE HERE>
map_sf.<YOUR CODE HERE>
inline_map(map_sf)

Use ESRI's Hive UDFs to compute the centroid of each neighborhood, and then plot a Folium map with a simple_marker for each neighborhood, displaying the neighborhood name and the number of crimes in that neighborhood:
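For intuition about what a centroid is, the sketch below computes the area-weighted centroid of a simple polygon with the shoelace formula, treating longitude/latitude as planar coordinates. This is an illustration under those assumptions and not necessarily what ST_Centroid computes; the `polygon_centroid` helper and the rectangle are hypothetical.

```python
def polygon_centroid(polygon):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = 0.0  # twice the signed area, accumulated edge by edge
    cx = 0.0
    cy = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        area2 += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    # Centroid = sum / (6 * area), and 6 * area equals 3 * area2
    return cx / (3.0 * area2), cy / (3.0 * area2)

# Hypothetical rectangular "neighborhood" as (longitude, latitude) vertices
square = [(-122.45, 37.75), (-122.40, 37.75), (-122.40, 37.80), (-122.45, 37.80)]
cent_longitude, cent_latitude = polygon_centroid(square)
```

For this rectangle the result is its geometric center; for irregular neighborhood shapes the centroid can fall near an edge or even outside the polygon, which affects where a marker appears.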

In [None]:
hc.sql("create temporary function ST_Centroid as 'com.esri.hadoop.hive.ST_Centroid'")
hc.sql("create temporary function ST_X as 'com.esri.hadoop.hive.ST_X'")
hc.sql("create temporary function ST_Y as 'com.esri.hadoop.hive.ST_Y'")

rdd_centroid = hc.sql("""
SELECT neighborho as neighborhood, 
       ST_X(ST_Centroid(sf_neighborhoods.shape)) as cent_longitude,
       ST_Y(ST_Centroid(sf_neighborhoods.shape)) as cent_latitude
FROM sf_neighborhoods
""")

map_sf = folium.Map(location=[sf_lat, sf_long], zoom_start=12, width=map_width, height=map_height)
s = pd.Series(index=crimes_per_neighborhood['neighborhood'].values,
              data=crimes_per_neighborhood['count'].values.astype(float))

for n in rdd_centroid.collect():
    map_sf.simple_marker(<YOUR CODE HERE>)
    
inline_map(map_sf)