# Lab 10: Creating a Feature Matrix using Spark-ML Pipelines

As always, create a SparkContext/HiveContext.

In [1]:
# Set up Spark Context
from pyspark import SparkContext, SparkConf
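# Import only the functions used in this lab; a wildcard import would shadow Python builtins such as sum, min and max
from pyspark.sql.functions import col, when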

SparkContext.setSystemProperty('spark.executor.memory', '2g')
conf = SparkConf()
conf.set('spark.executor.instances', 15)
conf.set('spark.sql.autoBroadcastJoinThreshold', 200*1024*1024)  # 200MB for broadcast join
sc = SparkContext('yarn-client', 'Spark-lab10', conf=conf)

from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql("use demo")

DataFrame[result: string]

Load the crimes_wn table that was created in Lab 9 as a Spark DataFrame, and filter it to keep only the records from 2011 through 2014:

In [2]:
crimes_wn_all = hc.table('crimes_wn')
crimes_wn = (crimes_wn_all.filter(crimes_wn_all.date_str.substr(7,4) >= '2011')
                          .filter(crimes_wn_all.date_str.substr(7,4) <= '2014'))
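
The `substr(7,4)` calls rely on two details that are easy to miss: `Column.substr` takes a 1-based start position plus a length, and the year comparison is a string comparison, which is safe here only because four-digit years sort lexicographically in numeric order. The plain-Python sketch below mirrors that logic without Spark. It assumes `date_str` is laid out as `MM/DD/YYYY` (inferred from the substring positions used in this lab); `spark_substr` and `in_year_range` are hypothetical helper names, and the dates are made-up samples.

```python
def spark_substr(s, start_pos, length):
    """Mimic Column.substr: start_pos is 1-based, length is a character count."""
    begin = start_pos - 1
    return s[begin:begin + length]


def in_year_range(date_str, first='2011', last='2014'):
    """Compare the 4-character year as a string, as the filter above does."""
    year = spark_substr(date_str, 7, 4)
    return first <= year <= last


# Made-up sample dates, assumed layout MM/DD/YYYY
sample_dates = ['11/22/2014', '01/05/2010', '12/31/2011', '03/20/2015']
kept = [d for d in sample_dates if in_year_range(d)]
print(kept)
```

If the year were not zero-padded to a fixed width, the string comparison would no longer be reliable and a cast to integer before filtering would be the safer choice.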

Create a new DataFrame 'crimes' via feature transformations as follows:
- Create the 'year', 'month' and 'day' features from the raw field 'date_str'
- Create the 'hour' feature from the raw field 'time'
- Create a feature 'resolved' from the 'resolution' field, with a value of 0.0 if the resolution is "NONE" and 1.0 otherwise
- Include these other fields: category, district, dayofweek, description, neighborhood

In [3]:
crimes = (crimes_wn.withColumn('year', crimes_wn.date_str.substr(7,4).astype('int'))
                   .withColumn('month', crimes_wn.date_str.substr(1,2).astype('int'))
                   .withColumn('day', crimes_wn.date_str.substr(4,2).astype('int'))
                   .withColumn('hour', crimes_wn.time.substr(1,2).astype('int'))
                   .withColumn('resolved', when(crimes_wn.resolution == 'NONE', 0.0).otherwise(1.0))
                   .select('year', 'month', 'day', 'hour', 'resolved', 'category', 'district', 'dayofweek', 
                           'description', 'neighborhood'))

crimes.cache()
crimes.limit(5).toPandas()

| | year | month | day | hour | resolved | category | district | dayofweek | description | neighborhood |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | 11 | 22 | 16 | 0 | LARCENY/THEFT | RICHMOND | Saturday | GRAND THEFT FROM LOCKED AUTO | Seacliff |
| 1 | 2014 | 11 | 21 | 18 | 0 | LARCENY/THEFT | RICHMOND | Friday | GRAND THEFT FROM UNLOCKED AUTO | Seacliff |
| 2 | 2014 | 10 | 31 | 8 | 0 | LARCENY/THEFT | RICHMOND | Friday | GRAND THEFT FROM LOCKED AUTO | Seacliff |
| 3 | 2014 | 3 | 20 | 0 | 0 | FRAUD | RICHMOND | Thursday | FRAUDULENT CREDIT APPLICATION | Seacliff |
| 4 | 2013 | 9 | 6 | 17 | 0 | LARCENY/THEFT | RICHMOND | Friday | PETTY THEFT FROM LOCKED AUTO | Seacliff |
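
To check the transformation logic on a single record, the same derivation can be sketched in plain Python. This is an illustration only, not part of the Spark job. It assumes `date_str` is `MM/DD/YYYY` and `time` is `HH:MM` (both inferred from the substring positions used in the cell above); `derive_features` is a hypothetical helper and the record is made-up sample data.

```python
def derive_features(record):
    """Derive year/month/day/hour/resolved from one raw crime record (a dict)."""
    date_str = record['date_str']   # assumed layout 'MM/DD/YYYY'
    time_str = record['time']       # assumed layout 'HH:MM'
    return {
        'year': int(date_str[6:10]),    # same characters as substr(7,4)
        'month': int(date_str[0:2]),    # same characters as substr(1,2)
        'day': int(date_str[3:5]),      # same characters as substr(4,2)
        'hour': int(time_str[0:2]),     # same characters as substr(1,2)
        # mirrors when(resolution == 'NONE', 0.0).otherwise(1.0)
        'resolved': 0.0 if record['resolution'] == 'NONE' else 1.0,
    }


# Made-up sample record for illustration
raw = {'date_str': '11/22/2014', 'time': '16:30', 'resolution': 'NONE'}
features = derive_features(raw)
print(features)
```

Note that Python slices are 0-based and end-exclusive, so `substr(7,4)` corresponds to `[6:10]`.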


Now create the weather DataFrame, which holds the precipitation (prcp), minimum temperature (tmin) and maximum temperature (tmax) for San Francisco for each day in the years 2011-2014:

In [4]:
weather = hc.sql("SELECT * FROM weather WHERE year >= 2011 AND year <= 2014 AND station = 'USW00023272'").cache()
prcp = weather.filter(weather.metric=='PRCP').withColumnRenamed('value', 'prcp').alias("prcp")
tmin = weather.filter(weather.metric=='TMIN').withColumnRenamed('value', 'tmin').alias("tmin")
tmax = weather.filter(weather.metric=='TMAX').withColumnRenamed('value', 'tmax').alias("tmax")

wdata = prcp.join(tmin, 'date_str').join(tmax, 'date_str') \
            .select(col('prcp.year'), col('prcp.month'), col('prcp.day'), 'prcp', 'tmin', 'tmax').cache()
wdata.limit(5).toPandas()

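| | year | month | day | prcp | tmin | tmax |
|---|---|---|---|---|---|---|
| 0 | 2011 | 3 | 30 | 0 | 111 | 278 |
| 1 | 2011 | 12 | 21 | 0 | 67 | 150 |
| 2 | 2012 | 1 | 31 | 0 | 83 | 139 |
| 3 | 2012 | 9 | 6 | 0 | 122 | 172 |
| 4 | 2012 | 10 | 22 | 206 | 106 | 161 |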
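
The three filters and two joins in the weather cell effectively pivot the table from long format (one row per date and metric) to wide format (one row per date with `prcp`, `tmin` and `tmax` columns). A minimal plain-Python sketch of the same reshaping follows. It assumes each row carries `date_str`, `metric` and `value` fields, as the query suggests, and at most one value per date and metric; the rows (including the `date_str` layout) are made-up sample data and `pivot_weather` is a hypothetical helper.

```python
from collections import defaultdict

METRICS = ('PRCP', 'TMIN', 'TMAX')


def pivot_weather(rows):
    """Turn long (date_str, metric, value) rows into one wide dict per date.

    Dates missing any of the three metrics are dropped, which is also
    what chaining inner joins does.
    """
    by_date = defaultdict(dict)
    for row in rows:
        if row['metric'] in METRICS:
            by_date[row['date_str']][row['metric'].lower()] = row['value']
    return {
        date: values
        for date, values in by_date.items()
        if len(values) == len(METRICS)
    }


# Made-up sample rows for illustration only
long_rows = [
    {'date_str': '20110330', 'metric': 'PRCP', 'value': 0},
    {'date_str': '20110330', 'metric': 'TMIN', 'value': 111},
    {'date_str': '20110330', 'metric': 'TMAX', 'value': 278},
    {'date_str': '20110331', 'metric': 'PRCP', 'value': 5},  # no TMIN/TMAX
]
wide = pivot_weather(long_rows)
print(wide)
```

Because `join` defaults to an inner join, any day without all three measurements disappears from `wdata`, and later from the feature matrix as well.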


Join the resulting weather dataset with the crimes dataset, using the join key (year, month, day):
* Create a DataFrame "joined" that joins the two DataFrames (crimes and weather).
* In the joined dataset the year, month and day fields appear twice. Create a Python list "keep" that contains the columns we would like to keep from the joined result, then create the "fm" DataFrame by selecting only those columns.

In [5]:
joined = wdata.join(crimes, (wdata.month==crimes.month) & (wdata.day==crimes.day) & (wdata.year==crimes.year), 'inner')

keep = ([wdata[c] for c in wdata.columns] + 
        [crimes[c] for c in crimes.columns if c not in ['year', 'month', 'day']])
fm = joined.select(*keep).cache()

fm.limit(5).toPandas()

| | year | month | day | prcp | tmin | tmax | hour | resolved | category | district | dayofweek | description | neighborhood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | 2 | 26 | 0 | 28 | 89 | 15 | 1 | FRAUD | SOUTHERN | Saturday | TRICK AND DEVICE, PETTY THEFT | Financial District |
| 1 | 2011 | 2 | 26 | 0 | 28 | 89 | 13 | 1 | ASSAULT | BAYVIEW | Saturday | INFLICT INJURY ON COHABITEE | Bayview |
| 2 | 2011 | 2 | 26 | 0 | 28 | 89 | 10 | 0 | LARCENY/THEFT | NORTHERN | Saturday | GRAND THEFT FROM LOCKED AUTO | Pacific Heights |
| 3 | 2011 | 2 | 26 | 0 | 28 | 89 | 22 | 0 | SUSPICIOUS OCC | SOUTHERN | Saturday | SUSPICIOUS OCCURRENCE | South of Market |
| 4 | 2011 | 2 | 26 | 0 | 28 | 89 | 17 | 0 | SUSPICIOUS OCC | BAYVIEW | Saturday | INVESTIGATIVE DETENTION | Potrero Hill |
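
Conceptually, the inner join pairs every crime with the weather observed on the same calendar day, and the `keep` list ensures the key columns appear only once in the result. The plain-Python sketch below illustrates that idea with made-up rows and a hypothetical `join_on_date` helper. It assumes one weather row per day; Spark plans and executes the join very differently, so this only shows the shape of the result.

```python
KEY = ('year', 'month', 'day')


def join_on_date(weather_rows, crime_rows):
    """Inner-join crimes to weather on (year, month, day), keeping keys once."""
    # Index weather by date key (assumes one weather row per day)
    weather_by_day = {tuple(w[k] for k in KEY): w for w in weather_rows}
    joined = []
    for crime in crime_rows:
        w = weather_by_day.get(tuple(crime[k] for k in KEY))
        if w is None:
            continue  # inner join: drop crimes with no weather record
        row = dict(w)  # all weather columns, including the key columns
        # add the crime columns except the duplicated key columns
        row.update({k: v for k, v in crime.items() if k not in KEY})
        joined.append(row)
    return joined


# Made-up sample rows for illustration only
weather_rows = [
    {'year': 2011, 'month': 2, 'day': 26, 'prcp': 0, 'tmin': 28, 'tmax': 89},
]
crime_rows = [
    {'year': 2011, 'month': 2, 'day': 26, 'hour': 15, 'resolved': 1.0, 'category': 'FRAUD'},
    {'year': 2011, 'month': 2, 'day': 27, 'hour': 9, 'resolved': 0.0, 'category': 'ASSAULT'},
]
fm_rows = join_on_date(weather_rows, crime_rows)
print(fm_rows)
```

The second crime has no matching weather row and is dropped, just as an inner join would drop it.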


Store this feature matrix in HDFS so that you can use it in the next lab. This is not strictly necessary when using Spark, since a DataFrame can be passed directly to the next step, but it is useful in our case because our application is split into individual labs.

Use the DataFrame writer's save() function with the ORC data source:

In [6]:
fm.write.format("orc").save("/tmp/fm", mode="overwrite")