# Big Data Platforms

## Spark RDD

Author: Ashish Pujari (apujari@uchicago.edu)

In [2]:
#import statements
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('TestApp').getOrCreate()
sc = spark.sparkContext

In [4]:
%time data = sc.textFile("T:\\courses\\BigData\\data\\chicago-food-inspections\\food-inspections.csv")

Wall time: 748 ms


### Action vs Transformation

Actual (distributed) computation in Spark takes place when we execute actions, not transformations. In this case, count is the action we execute on the RDD. We can apply as many transformations as we want to our RDD, and no computation will take place until we call the first action, which in this case takes a few seconds to complete.

In [5]:
#Total Record Count
%time data.count()

Wall time: 5.1 s


178090
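
The same deferred behavior can be illustrated without a cluster. The following is a minimal plain-Python analogy, not Spark itself: the built-in `map` stands in for a transformation, and consuming the result stands in for an action such as count. The names `split_line`, `calls`, and `lines` are hypothetical and exist only for this illustration.

```python
# Plain-Python analogy of lazy evaluation (an assumption-laden sketch, not Spark code).
calls = []

def split_line(line):
    # Record each invocation so we can observe when work actually happens.
    calls.append(line)
    return line.split(",")

# Hypothetical stand-in for the lines of a text file.
lines = ["a,b", "c,d", "e,f"]

# Analogous to a transformation: this only builds a lazy pipeline.
pipeline = map(split_line, lines)
calls_before = len(calls)

# Analogous to an action: consuming the pipeline forces the computation.
record_count = sum(1 for _ in pipeline)
calls_after = len(calls)

print(calls_before, record_count, calls_after)
```

The function is not called at all when the pipeline is defined; it runs once per element only when the results are consumed, which mirrors why defining an RDD is fast while count takes seconds.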

### Map Transformation

By using the map transformation in Spark, we can apply a function to every element in our RDD. Python's lambdas are especially expressive for this purpose. In this case we want to read our data file as CSV. We can do this by applying a lambda function that splits each line on commas to each element in the RDD, as follows.

In [6]:
csv_data = data.map(lambda x: x.split(","))
head_rows = csv_data.take(5)

#print header row
print(head_rows[0])

['Inspection ID', 'DBA Name', 'AKA Name', 'License #', 'Facility Type', 'Risk', 'Address', 'City', 'State', 'Zip', 'Inspection Date', 'Inspection Type', 'Results', 'Violations', 'Latitude', 'Longitude', 'Location']
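
Splitting on every comma works for the header row, but it also splits inside double-quoted fields that contain commas, as the Violations text in this dataset does. One possible alternative, sketched here under the assumption that each record fits on a single line, is to parse each line with Python's standard `csv` module. The helper name `parse_csv_line` and the sample line are hypothetical and are not part of the original notebook or dataset.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line, keeping commas inside double-quoted fields."""
    # csv.reader accepts any iterable of lines; wrap the single line in a list.
    return next(csv.reader([line]))

# Hypothetical sample line with a quoted field that contains a comma.
sample = '1,EXAMPLE CAFE,"3. MANAGEMENT, FOOD EMPLOYEE",CHICAGO'

naive = sample.split(",")        # splits inside the quoted field
parsed = parse_csv_line(sample)  # keeps the quoted field whole

print(len(naive), len(parsed))
print(parsed[2])
```

In the notebook, such a helper could be passed to the map transformation in place of the lambda, for example `data.map(parse_csv_line)`. Quoted fields that span several physical lines would still be broken apart, because the text file is read line by line before the function is applied.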


In [11]:
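def parseLine(line):
    # Split the raw line on commas (naive: quoted fields containing commas are mis-split)
    elems = line.split(",")
    # Use column 15 (Longitude in the header row) as the key
    tag = elems[15]
    # Return a (key, row) pair
    return (tag, elems)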

%time key_csv_data = data.map(parseLine)
%time head_rows = key_csv_data.take(5)
print(head_rows[0])

Wall time: 0 ns
Wall time: 517 ms
('Longitude', ['Inspection ID', 'DBA Name', 'AKA Name', 'License #', 'Facility Type', 'Risk', 'Address', 'City', 'State', 'Zip', 'Inspection Date', 'Inspection Type', 'Results', 'Violations', 'Latitude', 'Longitude', 'Location'])


### Collect Action

In [8]:
%time head_rows = csv_data.take(1000)

Wall time: 894 ms


In [9]:
#collect() returns every record to the driver; might time out or run out of memory on a smaller machine
%time all_data = data.collect()

Wall time: 4.93 s


In [10]:
all_data[0:10]

['Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location',
 '2240307,2 POTZ & A PAN EATERY,2 POTZ & A PAN EATERY,2626283,Restaurant,Risk 1 (High),6052 S EBERHART AVE ,CHICAGO,IL,60637,2018-11-16T00:00:00,License,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: OBSERVED NO EMPLOYEE HEALTH POLICY AVAILABLE. INSTRUCTED MANAGER TO PROVIDE EMPLOYEE HEALTH POLICY. PRIORITY FOUNDATION 7-38-012(A) | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: OBSERVED NO CLEAN UP POLICY FOR VOMITING OR DIARRHEAL EVENTS. INSTRUCTED MANAGER TO PROVIDE CLEAN UP POLICY. PRIORITY FOUNDATION 7-38-005 | 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLIED AND ACCESSIBLE - Comments: OBSERVED NO HAND WASHING SIGNS AT HAND WASHING SINKS IN PREP AND WASHROOM AREAS. INSTRUCTED MANAGER TO PROVIDE HAND WASHING SIG