# Lab 5.2: Joins with Spark-Core

Set up your Spark context:

In [1]:
# Set up Spark Context
from pyspark import SparkContext, SparkConf

SparkContext.setSystemProperty('spark.executor.memory', '2g')
conf = SparkConf()
conf.set('spark.executor.instances', 15)
sc = SparkContext('yarn-client', 'Spark-lab5.2', conf=conf)

Load the crimes dataset and the weather dataset into two RDDs:
* The weather file for 2013 is at weather/2013.csv (as before, load only the year 2013).
* The crimes dataset is at crime_report/sf_crimes.csv. Use Python's csv.reader to parse each line, so that commas inside quoted text fields do not split a field.
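
To see why csv.reader matters here, the minimal sketch below compares a naive split(',') with csv.reader on a single line. The sample line is hypothetical (made up for illustration, not copied from sf_crimes.csv), and the snippet needs no Spark context; it is written to run under both Python 2 and Python 3.

```python
from csv import reader

# Hypothetical sample line: the third field contains a comma inside quotes.
line = '150331521,LARCENY/THEFT,"GRAND THEFT, FROM A BUILDING",Wednesday'

naive = line.split(',')          # also splits inside the quoted field
parsed = next(reader([line]))    # respects the quotes

print(len(naive))   # 5
print(len(parsed))  # 4
print(parsed[2])    # GRAND THEFT, FROM A BUILDING
```

The naive split produces one field too many, which would shift every later column index in the record.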

In [2]:
import pandas as pd
from csv import reader

# Weather dataset: keep the first four columns (station, date, metric, value)
w_lines = sc.textFile("weather/2013.csv")
weather = w_lines.map(lambda line: line.split(',')) \
                 .map(lambda row: [row[0], row[1], row[2], row[3]]).cache()
print "%s weather readings" % weather.count()

# Crimes dataset: drop the header row (it starts with 'In'), then parse each
# line with csv.reader so that commas inside quoted fields are preserved
c_lines = sc.textFile("crime_report/sf_crimes.csv") \
            .filter(lambda line: line[:2] != 'In')
crimes = c_lines.map(lambda line: [x.strip('"') for x in next(reader([line]))]).cache()
print "%s crime events" % crimes.count()

29900150 weather readings
1750133 crime events


Print the RDD lineage graphs of "crimes" and "weather" using toDebugString().

In [3]:
print "crimes"
print crimes.toDebugString()

print "\n\nweather"
print weather.toDebugString()

crimes
(3) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
 |       CachedPartitions: 3; MemorySize: 242.5 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 |  MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]
 |  crime_report/sf_crimes.csv HadoopRDD[4] at textFile at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]


weather
(8) PythonRDD[2] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
 |       CachedPartitions: 8; MemorySize: 938.6 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 |  MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]
 |  weather/2013.csv HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 [Memory Serialized 1x Replicated]


Recall that the weather dataset schema is: station, date, metric, value

Create an RDD from the weather dataset for the PRCP metric:
1. Use only weather station USW00023272 (San Francisco)
2. Each row in the new RDD should contain: date, value

Print three example values from the new RDD. How many elements does it contain?

In [4]:
# Keep PRCP readings from station USW00023272, then emit (date, value) pairs
weather_kv = weather.filter(lambda x: x[0] == 'USW00023272' and x[2] == 'PRCP') \
                    .map(lambda x: (x[1], x[3]))
print weather_kv.take(3)
print weather_kv.count()

[(u'20130101', u'0'), (u'20130102', u'0'), (u'20130103', u'0')]
365


To enable a join between the crimes and weather datasets, create a new RDD from the crimes dataset. Each element of this RDD is a key-value tuple, where:
* The key is the date in the same format as in the weather dataset: YYYYMMDD
* The value is the whole record from the crimes dataset

In [5]:
# Build the key by rearranging the MM/DD/YYYY date (field 4) into YYYYMMDD
crimes_kv = crimes.map(lambda x: (''.join([x[4][6:], x[4][:2], x[4][3:5]]), x))
print crimes_kv.take(3)

[('20150422', ['150331521', 'LARCENY/THEFT', 'GRAND THEFT FROM A BUILDING', 'Wednesday', '04/22/2015', '18:00', 'BAYVIEW', 'NONE', '2000 Block of EVANS AV', '-122.396315619126', '37.7478113603031', '(37.7478113603031, -122.396315619126)', '15033152106304']), ('20150419', ['150341605', 'ASSAULT', 'ATTEMPTED SIMPLE ASSAULT', 'Sunday', '04/19/2015', '12:15', 'CENTRAL', 'NONE', '800 Block of WASHINGTON ST', '-122.40672716771', '37.7950566945259', '(37.7950566945259, -122.40672716771)', '15034160504114']), ('20150419', ['150341605', 'ASSAULT', 'THREATS AGAINST LIFE', 'Sunday', '04/19/2015', '12:15', 'CENTRAL', 'NONE', '800 Block of WASHINGTON ST', '-122.40672716771', '37.7950566945259', '(37.7950566945259, -122.40672716771)', '15034160519057'])]
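
The date-key conversion can be checked locally without Spark. The sketch below wraps the slicing from the cell above in a function and compares it with a datetime-based variant. Both function names are hypothetical helpers introduced only for this illustration.

```python
from datetime import datetime

def to_key_slicing(date_str):
    """Rearrange 'MM/DD/YYYY' into 'YYYYMMDD' by slicing fixed positions."""
    return ''.join([date_str[6:], date_str[:2], date_str[3:5]])

def to_key_strptime(date_str):
    """Same conversion via datetime; raises ValueError on malformed input."""
    return datetime.strptime(date_str, '%m/%d/%Y').strftime('%Y%m%d')

sample = '04/22/2015'  # date of the first record printed above
print(to_key_slicing(sample))   # 20150422
print(to_key_strptime(sample))  # 20150422
```

Slicing is cheaper per record but silently produces a wrong key if a date is not zero-padded; the strptime variant is slower and fails loudly on malformed dates. Which trade-off is preferable depends on how clean the input is.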


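Join the crimes and weather RDDs so that each crime report in the output carries the PRCP value for its date. Because weather_kv only holds 2013 readings, the inner join keeps only crimes dated 2013. Print a few records of the resulting RDD to validate your approach.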

In [6]:
# Inner join on the date key; each element is (date, (crime_record, prcp))
j1 = crimes_kv.join(weather_kv)
# Flatten each element to the crime record fields followed by prcp
j2 = j1.map(lambda kv: kv[1][0] + [kv[1][1]])
df1 = pd.DataFrame(j2.take(5),
                   columns=['ID', 'category', 'description', 'dayofweek', 'date', 'time',
                            'district', 'resolution', 'address', 'x', 'y', 'location', 'pdid', 'prcp'])
df2 = df1[['category', 'description', 'date', 'district', 'resolution', 'location', 'prcp']]
print df2


         category                                  description        date  \
0  OTHER OFFENSES        DRIVERS LICENSE, SUSPENDED OR REVOKED  08/26/2013   
1       VANDALISM    MALICIOUS MISCHIEF, VANDALISM OF VEHICLES  08/26/2013   
2   LARCENY/THEFT                  GRAND THEFT FROM A BUILDING  08/26/2013   
3   LARCENY/THEFT                 GRAND THEFT FROM LOCKED AUTO  08/26/2013   
4        BURGLARY  BURGLARY OF APARTMENT HOUSE, FORCIBLE ENTRY  08/26/2013   

    district     resolution                               location prcp  
0   SOUTHERN  ARREST, CITED   (37.775420706711, -122.403404791479)    0  
1  INGLESIDE           NONE  (37.7448909444398, -122.432375194686)    0  
2    MISSION           NONE  (37.7624209454461, -122.430625527042)    0  
3   SOUTHERN           NONE  (37.7801398610181, -122.405488408396)    0  
4    MISSION           NONE  (37.7542221906773, -122.425236707185)    0  
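
The shape of the join result can be illustrated without Spark. An RDD join on key-value pairs is an inner join that yields (key, (left_value, right_value)) for every matching pair; the plain-Python sketch below mimics that behaviour on tiny inputs. The function inner_join and the sample records are hypothetical, introduced only for illustration, and the sketch says nothing about how Spark executes the join.

```python
from collections import defaultdict

def inner_join(left, right):
    """Emit (key, (left_value, right_value)) for every matching key pair."""
    right_by_key = defaultdict(list)
    for k, v in right:
        right_by_key[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]

# Hypothetical miniature inputs shaped like crimes_kv and weather_kv.
crimes_sample = [
    ('20130826', ['VANDALISM', '08/26/2013']),
    ('20150422', ['LARCENY/THEFT', '04/22/2015']),  # no matching weather key
]
weather_sample = [('20130826', '0'), ('20130827', '5')]

joined = inner_join(crimes_sample, weather_sample)
# Same flattening step as in the cell above: record fields followed by prcp.
flattened = [v[0] + [v[1]] for _, v in joined]
print(flattened)  # [['VANDALISM', '08/26/2013', '0']]
```

The 2015 record drops out because its key has no counterpart on the weather side, which matches the joined output above containing only 2013 dates.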


Print the lineage of the joined RDD:

In [7]:
print j2.toDebugString()

(11) PythonRDD[19] at RDD at PythonRDD.scala:43 []
 |   MapPartitionsRDD[17] at mapPartitions at PythonRDD.scala:374 []
 |   ShuffledRDD[16] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(11) PairwiseRDD[15] at join at <ipython-input-6-6c205b0c349d>:1 []
    |   PythonRDD[14] at join at <ipython-input-6-6c205b0c349d>:1 []
    |   UnionRDD[13] at union at NativeMethodAccessorImpl.java:-2 []
    |   PythonRDD[11] at RDD at PythonRDD.scala:43 []
    |   PythonRDD[6] at RDD at PythonRDD.scala:43 []
    |       CachedPartitions: 3; MemorySize: 242.5 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
    |   MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:-2 []
    |   crime_report/sf_crimes.csv HadoopRDD[4] at textFile at NativeMethodAccessorImpl.java:-2 []
    |   PythonRDD[12] at RDD at PythonRDD.scala:43 []
    |   PythonRDD[2] at RDD at PythonRDD.scala:43 []
    |       CachedPartitions: 8; MemorySize: 938.6 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
