# GBA 6430 - Big Data Technology in Business
# Dr. Mohammad Salehan
# Spark Actions

In [1]:
airports = (
    sc
    .textFile('s3://cis4567-salehan/Spark/Data/airport-codes-na.txt') 
    .map(lambda element: element.split("\t"))
)
header = airports.first() #extract header
airports = airports.filter(lambda row : row != header)
airports.take(5)

[['Abbotsford', 'BC', 'Canada', 'YXX'], ['Aberdeen', 'SD', 'USA', 'ABR'], ['Abilene', 'TX', 'USA', 'ABI'], ['Akron', 'OH', 'USA', 'CAK'], ['Alamosa', 'CO', 'USA', 'ALS']]

In [2]:
# Setup the RDD: flights
flights = (
    sc
    .textFile('s3://cis4567-salehan/Spark/Data/departuredelays.csv', minPartitions=8)
    .map(lambda line: line.split(","))
)
header = flights.first() #extract header
flights = flights.filter(lambda row : row != header)
flights.take(5)

[['01011245', '6', '602', 'ABE', 'ATL'], ['01020600', '-8', '369', 'ABE', 'DTW'], ['01021245', '-2', '602', 'ABE', 'ATL'], ['01020605', '-4', '602', 'ABE', 'ATL'], ['01031245', '-4', '602', 'ABE', 'ATL']]

### .collect() action
`.collect()` returns all of the elements of the RDD from the workers to the driver (i.e., the master node) as a list. You should only apply `.collect()` to small `RDDs`: the whole result must fit in the driver's memory, so collecting a large RDD can crash your driver node.

In [3]:
# Return all airports elements
# filtered by WA state
airports.filter(lambda c: c[1] == "WA").collect()

[['Bellingham', 'WA', 'USA', 'BLI'], ['Moses Lake', 'WA', 'USA', 'MWH'], ['Pasco', 'WA', 'USA', 'PSC'], ['Pullman', 'WA', 'USA', 'PUW'], ['Seattle', 'WA', 'USA', 'SEA'], ['Spokane', 'WA', 'USA', 'GEG'], ['Walla Walla', 'WA', 'USA', 'ALW'], ['Wenatchee', 'WA', 'USA', 'EAT'], ['Yakima', 'WA', 'USA', 'YKM']]

### .count() action
The `.count()` action returns the number of elements in the `RDD`.

In [4]:
flights.count()

1391578

### .saveAsTextFile(...) action
The `.saveAsTextFile()` action saves your RDD as text files in the given directory; note that each partition generates a separate file.

In [7]:
#AWS
flights.saveAsTextFile("s3://gba6430-huayang-01/tables/flights")

In [9]:
# Read the saved files back to check the result
x = sc.textFile('s3://gba6430-huayang-01/tables/flights')
x.take(2)

["['01011245', '6', '602', 'ABE', 'ATL']", "['01020600', '-8', '369', 'ABE', 'DTW']"]
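Each line read back above is the string form of a Python list, not a list, because every element was written out as text. The following is a minimal sketch in plain Python (no Spark required) of two possible ways to handle this: parsing a saved line back into a list, or joining the fields with commas before saving. The helper names `parse_saved_line` and `to_csv_line` are hypothetical and are not part of the notebook.

```python
import ast

# A line exactly as it was read back from the saved text file
saved_line = "['01011245', '6', '602', 'ABE', 'ATL']"

def parse_saved_line(line):
    """Turn the string form of a list back into a real Python list."""
    return ast.literal_eval(line)

def to_csv_line(fields):
    """Join the fields with commas so a saved line is plain CSV."""
    return ",".join(fields)

record = parse_saved_line(saved_line)
csv_line = to_csv_line(record)
print(record)
print(csv_line)
```

Assuming the RDDs defined earlier, these helpers could be applied as `x.map(parse_saved_line)` after reading, or as `flights.map(to_csv_line)` before calling `.saveAsTextFile()`.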

In [11]:
x.getNumPartitions()

8

### .reduceByKey
* A transformation (not an action) that groups records by key and then aggregates the values of each key.
* The first item of each record is used as the key.
* Assumes the reducing function `f` is commutative and associative so that the result can be computed correctly in parallel.
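To see why the function must be commutative and associative, here is a minimal plain-Python sketch (no Spark required) that imitates the two stages of `reduceByKey`: combining values inside each partition, then merging the partial results. The sample data and the helper `reduce_by_key` are hypothetical; this is an illustration of the idea, not Spark's actual implementation.

```python
import operator

# Hypothetical (origin, delay) pairs, already split into two partitions
partitions = [
    [("ABE", 6), ("ABE", -8), ("ATL", 3)],
    [("ABE", -2), ("ATL", 10), ("ATL", -1)],
]

def reduce_by_key(parts, f):
    """Imitate reduceByKey: combine within each partition, then merge."""
    merged = {}
    for part in parts:
        local = {}
        for key, value in part:  # stage 1: combine inside the partition
            local[key] = f(local[key], value) if key in local else value
        for key, value in local.items():  # stage 2: merge partial results
            merged[key] = f(merged[key], value) if key in merged else value
    return merged

# Addition is commutative and associative: partition order does not matter
sums = reduce_by_key(partitions, operator.add)
sums_swapped = reduce_by_key(partitions[::-1], operator.add)

# Subtraction is neither: the result depends on partition order
diffs = reduce_by_key(partitions, operator.sub)
diffs_swapped = reduce_by_key(partitions[::-1], operator.sub)

print(sums == sums_swapped)    # True
print(diffs == diffs_swapped)  # False
```

Since Spark does not guarantee the order in which partial results are merged, a function like subtraction would give unpredictable answers.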

In [12]:
# Determine the total delay by origin airport
# map() each flight to (origin, delay), then sum the delays per origin
flights.map(lambda c: (c[3], int(c[1])))\
.reduceByKey(lambda x, y: x + y)\
.take(5)

[('ACT', 392), ('BMI', 7817), ('BPT', 1936), ('BZN', 7226), ('CRP', 10579)]

### .groupByKey
Groups values by key: each key is paired with an iterable of all of its values.

In [13]:
# Previous example using groupByKey
flights.map(lambda c: (c[3], int(c[1])))\
.groupByKey()\
.map(lambda c: (c[0], sum(c[1])))\
.take(5)

[('ACT', 392), ('BMI', 7817), ('BPT', 1936), ('BZN', 7226), ('CRP', 10579)]

## reduceByKey vs groupByKey

<table><tr><td><img src='https://cis4567-salehan.s3.amazonaws.com/img/reduce_by_key.png'></td><td><img src='https://cis4567-salehan.s3.amazonaws.com/img/group_by_key.png'></td></tr></table>

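`reduceByKey` is much more efficient because it combines the values of each key within every partition before shuffling, whereas `groupByKey` shuffles every key-value pair, which transfers a lot of unnecessary data over the network.<br>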
<a target="_blank" href="https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html">source</a>
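
The difference in shuffled data can be made concrete with a small plain-Python sketch (no Spark required). It counts how many key-value pairs would leave the partitions under each strategy. The sample data and the function names are hypothetical, and the sketch only models the idea; it does not measure real Spark network traffic.

```python
from collections import defaultdict

# Hypothetical (origin, delay) pairs, already split into two partitions
partitions = [
    [("ABE", 6), ("ABE", -8), ("ATL", 3)],
    [("ABE", -2), ("ATL", 10), ("ATL", -1)],
]

def shuffled_with_group_by_key(parts):
    """groupByKey style: every (key, value) pair is sent as-is."""
    return [pair for part in parts for pair in part]

def shuffled_with_reduce_by_key(parts):
    """reduceByKey style: each partition pre-sums, then sends one pair per key."""
    sent = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        sent.extend(local.items())
    return sent

def total_by_key(pairs):
    """Final aggregation on the receiving side."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

group_shuffle = shuffled_with_group_by_key(partitions)
reduce_shuffle = shuffled_with_reduce_by_key(partitions)

print(len(group_shuffle), len(reduce_shuffle))  # 6 4
print(total_by_key(group_shuffle) == total_by_key(reduce_shuffle))  # True
```

Both strategies give the same totals, but the pre-summed version sends fewer pairs. With over a million flight records and a comparatively small number of distinct origin airports, the gap would be far larger than in this toy example.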