### grp

# Spark: The Definitive Guide

## PART 3: Low-Level APIs 

## dataPaths

In [1]:
retailAll = '/Users/grp/sparkTheDefinitiveGuide/data/retail-data/all/'
flightData2010 = '/Users/grp/sparkTheDefinitiveGuide/data/flight-data/parquet/2010-summary.parquet'

## _Chapter #12 - Resilient Distributed Datasets (RDDs)_

-  **Spark operates on a per-partition basis when executing code**
-  Basically all DF Spark code compiles down to an RDD
-  When calling a DF transformation the underlying logic becomes a set of RDD transformations
-  SparkContext is the entry point for low-level API functionality accessed through SparkSession
-  Main reason to use RDDs is for fine grained control over physical distribution of data (**custom partitioning of data**)
-  Spark's Structured APIs automatically store data in an optimized, compressed binary format, unlike RDDs, which require manual object implementation to achieve the same functionality and performance optimization
-  _RDD performance is best via Scala/Java ... Python RDDs, just like PySpark UDFs, require serializing the data to the Python process, running the Python code there, and then serializing the results back to the JVM_
-  DF vs RDD manipulation: **RDDs manipulate raw objects whereas DFs/DSs manipulate Spark Types**   
<br>
-  RDDs:
    -  **immutable, partitioned collection of records operated on in parallel as (Python, Scala, Java) objects**
    -  "Rows" do not exist in RDDs ... data records are just raw (Python, Scala, Java) objects   
    <br>
    -  Types:
        -  "generic" RDD
        -  key-value RDD
    -  Properties:
        -  list of partitions
        -  function for computing each split
        -  list of dependencies on other RDDs
        -  optional partitioner for key-value RDDs
        -  optional preferred locations on which to compute each split (ex: block locations for HDFS)
    -  Saving Files:
        -  RDDs can be written out to plain-text files
        -  RDDs can be written out to sequence files (flat file with binary key-value pairs) if in key-value format 
        -  Spark takes each partition and writes out to target destination
    -  Caching:
        -  ability to cache or persist an RDD
        -  ability to specify a storage level [org.apache.spark.storage.StorageLevel] (combinations of memory only, disk only, and off heap)
    -  Checkpointing:
        -  saves RDD to disk so future computations on that RDD point to its partitions on disk rather than recomputing the RDD from the original source
        -  similar to caching except checkpointing is stored only on disk and not in memory (like cache)
        -  when checkpointed RDD is referenced it derives from checkpoint instead of source data, which avoids recomputing it
    -  Finding Partitions:
        -  pipe sends each partition's records through an external command (ex: "wc -l" returns the # of lines per partition in RDD)
        -  mapPartitions runs a function once per partition (ex: emit a 1 per partition and sum to get the # of partitions in RDD)
        -  mapPartitionsWithIndex also passes the partition index, which shows which partition each record in the RDD is mapped to
        -  foreachPartition iterates over all the partitions of the data without returning a result
        -  glom converts every partition in RDD to an array **be CAREFUL because collecting large partitions can crash driver**   
        <br>
-  Shared Variables:
    -  broadcast variables
    -  accumulators
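
The partition helpers listed under "Finding Partitions" can be modeled in plain Python, with no Spark session, to make their semantics concrete. This is only a sketch: a partition is assumed to be a Python list, and `split_evenly`, `map_partitions`, and `map_partitions_with_index` are hypothetical stand-ins, not Spark's implementation.

```python
# Plain-Python model of per-partition operations (no Spark required).
# ASSUMPTION: each partition is a plain list; these helpers are hypothetical
# stand-ins for the RDD methods, not Spark's actual implementation.

def split_evenly(records, num_partitions):
    """Slice records into contiguous chunks of near-equal size."""
    n = len(records)
    bounds = [(i * n) // num_partitions for i in range(num_partitions + 1)]
    return [records[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

def map_partitions(partitions, func):
    """Call func once per partition; func receives an iterator of records."""
    return [list(func(iter(part))) for part in partitions]

def map_partitions_with_index(partitions, func):
    """Like map_partitions, but func also receives the partition index."""
    return [list(func(index, iter(part))) for index, part in enumerate(partitions)]

words = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
partitions = split_evenly(words, 2)

# glom: every partition becomes one list
glommed = [list(part) for part in partitions]

# mapPartitions: emit a single 1 per partition, then sum to count partitions
ones = map_partitions(partitions, lambda records: [1])
num_partitions = sum(value for part in ones for value in part)

# mapPartitionsWithIndex: tag each record with the partition that holds it
tagged = map_partitions_with_index(
    partitions, lambda index, records: [(index, word) for word in records]
)
flat_tagged = [pair for part in tagged for pair in part]

print(glommed)
print(num_partitions)
print(flat_tagged[:3])
```

The key point is that the supplied function is called once per partition rather than once per record, which is why returning a single `1` from each call counts partitions.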

### _Chapter #12 Exercises (RDDs)_

### _RDD to DF Example_

In [2]:
for i in spark.range(3).rdd.collect(): print(i)

Row(id=0)
Row(id=1)
Row(id=2)


In [3]:
for i in spark.range(3).toDF("id").rdd.map(lambda row: row[0]).collect(): print(i)

0
1
2


In [4]:
spark.range(3).rdd.toDF().show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
+---+



### _Local Collection to RDD Example_

In [5]:
myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
words = spark.sparkContext.parallelize(myCollection, 2) # sets number of partitions

In [6]:
words.getNumPartitions()

2

In [7]:
words.setName("myWords") # names the RDD as shown in the Spark UI
words.name()
for i in words.collect(): print(i)

Spark
The
Definitive
Guide
:
Big
Data
Processing
Made
Simple


### _RDD Data Source Read Example_

In [8]:
'''
spark.sparkContext.textFile(...) # reads text file line by line
spark.sparkContext.wholeTextFiles(...) # reads key-value (fileName, textFileValue)
'''

'\nspark.sparkContext.textFile(...) # reads text file line by line\nspark.sparkContext.wholeTextFiles(...) # reads key-value (fileName, textFileValue)\n'

### _RDD Transformation Examples_:
-  distinct [removes duplicates from RDD]
-  filter [where clause]
-  map [returns value based on input]
-  flatMap [returns multiple values (splits) based on input]
-  sortBy [sorts by the given key function]
-  randomSplit [randomly splits an RDD into an Array of RDDs]

In [9]:
# distinct
print(words.distinct().count())

# filter
def startsWithS(individual):
  return individual.startswith("S")

print(words.filter(lambda word: startsWithS(word)).collect())

# map
words2 = words.map(lambda word: (word, word[0], word.startswith("S")))
print(words2.filter(lambda record: record[2]).collect())

# flatMap
print(words.flatMap(lambda word: list(word)).collect())

# sort
print(words.sortBy(lambda word: len(word) * -1).collect())

# randomSplit
fiftyFiftySplit = words.randomSplit([0.5, 0.5]) # list of weights; an optional second argument sets the random seed

10
['Spark', 'Simple']
[('Spark', 'S', True), ('Simple', 'S', True)]
['S', 'p', 'a', 'r', 'k', 'T', 'h', 'e', 'D', 'e', 'f', 'i', 'n', 'i', 't', 'i', 'v', 'e', 'G', 'u', 'i', 'd', 'e', ':', 'B', 'i', 'g', 'D', 'a', 't', 'a', 'P', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', 'M', 'a', 'd', 'e', 'S', 'i', 'm', 'p', 'l', 'e']
['Definitive', 'Processing', 'Simple', 'Spark', 'Guide', 'Data', 'Made', 'The', 'Big', ':']


### _RDD Actions Examples_:
-  reduce [reduce "aggregate" values]
-  count [count records]
-  countApprox [approximation of count based off confidence threshold]
-  countApproxDistinct [approximation of count based off relative accuracy threshold]
-  countByValue [counts # of values in RDD; returns results to memory of driver so be CAREFUL executing]
-  countByValueApprox [same as countByValue except as an approximation; Scala/Java API only]
-  first [returns first value in RDD]
-  min [min value]
-  max [max value]
-  take [takes number of values from RDD]

In [10]:
# reduce
print(spark.sparkContext.parallelize(range(1, 21)).reduce(lambda x, y: x + y))

def wordLengthReducer(leftWord, rightWord):
  if len(leftWord) > len(rightWord):
    return leftWord
  else:
    return rightWord

print(words.reduce(wordLengthReducer))

# count
print(words.count())

# countApprox
confidence = 0.95
timeoutMilliseconds = 400
print(words.countApprox(timeoutMilliseconds, confidence))

# countApproxDistinct
print(words.countApproxDistinct(0.05))
#print(words.countApproxDistinct(4, 10)) # inputs: precision, sparse precision

# countByValue
print(words.countByValue())

# countByValueApprox (not exposed on PySpark RDDs, so countApproxDistinct is shown in its place)
print(words.countApproxDistinct(0.05))

# first
print(words.first())

# min
print(spark.sparkContext.parallelize(range(1, 20)).min())

# max
print(spark.sparkContext.parallelize(range(1, 20)).max())

# take
print(words.take(3)) # returns values
print(words.takeOrdered(3)) # asc order
print(words.top(3)) # top values based on implicit order
withReplacement = True
numberToTake = 6
randomSeed = 100
print(words.takeSample(withReplacement, numberToTake, randomSeed)) # random sample from RDD

210
Processing
10
10
10
defaultdict(<class 'int'>, {'Spark': 1, 'The': 1, 'Definitive': 1, 'Guide': 1, ':': 1, 'Big': 1, 'Data': 1, 'Processing': 1, 'Made': 1, 'Simple': 1})
10
Spark
1
19
['Spark', 'The', 'Definitive']
[':', 'Big', 'Data']
['The', 'Spark', 'Simple']
['Data', 'Definitive', 'Data', 'The', 'Definitive', 'Spark']
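
One detail worth noting about the `reduce` call above: "Definitive" and "Processing" are both 10 characters long, and `wordLengthReducer` returns the right-hand word on a tie, so the winner depends on the order in which partial results are merged. The plain-Python sketch below (no Spark; the two lists are an assumed model of the RDD's two partitions) shows both outcomes.

```python
from functools import reduce

def wordLengthReducer(leftWord, rightWord):
    # same logic as the notebook cell: ties go to the right-hand word
    if len(leftWord) > len(rightWord):
        return leftWord
    else:
        return rightWord

# ASSUMPTION: the RDD's two partitions hold the words in this order
partition0 = ["Spark", "The", "Definitive", "Guide", ":"]
partition1 = ["Big", "Data", "Processing", "Made", "Simple"]

# reduce within each partition first
partial0 = reduce(wordLengthReducer, partition0)
partial1 = reduce(wordLengthReducer, partition1)

# then merge the partial results; the merge order decides the tie
merged_in_order = wordLengthReducer(partial0, partial1)
merged_reversed = wordLengthReducer(partial1, partial0)

print(merged_in_order)   # Processing
print(merged_reversed)   # Definitive
```

A reducer that breaks ties explicitly (for example, by also comparing the words themselves) would make the result independent of merge order.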


### _RDD TXT Save (Uncompressed & Compressed) Example_

In [11]:
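import shutil
# clear any previous output (saveAsTextFile fails if the target already exists); ignore_errors avoids a crash on the first run
shutil.rmtree("/Users/grp/sparkTheDefinitiveGuide/tmp/words", ignore_errors=True)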

words.saveAsTextFile("/Users/grp/sparkTheDefinitiveGuide/tmp/words")

In [12]:
import os
for i in os.listdir("/Users/grp/sparkTheDefinitiveGuide/tmp/words"): print(i)

._SUCCESS.crc
.part-00000.crc
.part-00001.crc
_SUCCESS
part-00001
part-00000


In [13]:
!head /Users/grp/sparkTheDefinitiveGuide/tmp/words/part-00000
print("\n")
!head /Users/grp/sparkTheDefinitiveGuide/tmp/words/part-00001

Spark
The
Definitive
Guide
:


Big
Data
Processing
Made
Simple


In [14]:
import shutil
# clear any previous output; ignore_errors avoids a crash when the directory does not exist yet
shutil.rmtree("/Users/grp/sparkTheDefinitiveGuide/tmp/wordsCompressed", ignore_errors=True)

words.saveAsTextFile("/Users/grp/sparkTheDefinitiveGuide/tmp/wordsCompressed",
                     compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

In [15]:
import os
for i in os.listdir("/Users/grp/sparkTheDefinitiveGuide/tmp/wordsCompressed/"): print(i)

._SUCCESS.crc
.part-00001.gz.crc
_SUCCESS
part-00000.gz
part-00001.gz
.part-00000.gz.crc


### _RDD Caching Example_

In [16]:
words.cache()

myWords ParallelCollectionRDD[26] at parallelize at PythonRDD.scala:194

### _RDD Checkpointing Example_

In [17]:
spark.sparkContext.setCheckpointDir("/Users/grp/sparkTheDefinitiveGuide/tmp/checkpointRDD")
words.checkpoint()

### _RDD Partitions Examples_:
-  pipe
-  mapPartitions
-  mapPartitionsWithIndex

In [18]:
words.pipe("wc -l").collect() # five lines per partition

['       5', '       5']

In [19]:
words.mapPartitions(lambda part: [1]).sum() # value '1' for every partition in RDD then sum to get total # of partitions

2

In [20]:
def indexedFunc(partitionIndex, withinPartIterator):
  return ["partition: {} => {}".format(partitionIndex, x) for x in withinPartIterator]

words.mapPartitionsWithIndex(indexedFunc).collect()

['partition: 0 => Spark',
 'partition: 0 => The',
 'partition: 0 => Definitive',
 'partition: 0 => Guide',
 'partition: 0 => :',
 'partition: 1 => Big',
 'partition: 1 => Data',
 'partition: 1 => Processing',
 'partition: 1 => Made',
 'partition: 1 => Simple']

In [21]:
print(spark.sparkContext.parallelize(["Hello", "World"], 2).glom().collect())
print(type(spark.sparkContext.parallelize(["Hello", "World"], 2).glom().collect()))

[['Hello'], ['World']]
<class 'list'>


## _Chapter #13 - Advanced RDDs_

-  RDD Key-Value Pairs:
    -  "ByKey" operations (ex: reduceByKey) can only be performed on the **PairRDD** type
    -  holds 2 values in each record of RDD (ex: tuple -> (key, value))   
    <br>
-  RDD Aggregation   
<br>
-  RDD Joins   
<br>
-  RDD Partitioning:
    -  Controlling Partitions:
        -  with RDDs one has control over exactly how data is physically distributed across the cluster
        -  coalesce:
            -  collapses partitions on the same worker node in order to avoid a shuffle of the data when repartitioning
        -  repartition:
            -  repartition data up or down but performs a shuffle across nodes
            -  increasing # of partitions helps with increasing level of parallelism
        -  repartitionAndSortWithinPartitions:
            -  repartition as well as specify ordering of each output partition
    -  Custom Partitioning:
        -  controls layout of the data on the cluster to avoid shuffles
        -  main goal is to even out data skews / the distribution of your data across the cluster
        -  used to help avoid **key skews** which means some keys have way more values than other keys
        -  HashPartitioner and RangePartitioner   
        <br>
-  RDD Serialization:
    -  objects looking to be parallelized must be serializable
    -  Kryo [configure "spark.serializer" to "org.apache.spark.serializer.KryoSerializer"] is faster compared to default Java serializer
    -  "spark.serializer" parameter is used for shuffling data between worker nodes and serializing RDDs to disk

### _Chapter #13 Exercises (RDDs)_

In [22]:
myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
words = spark.sparkContext.parallelize(myCollection, 2)

### _Key-Value RDD Examples_:
-  keyBy [creates a key from value]
-  Mapping over Values [maps over the values by key]
-  Extract Keys and Values [collects individual keys and individual values]
-  lookup [lookup value(s) for particular key]
-  sampleByKey [pulls sample of data by key]

In [23]:
# keyBy
print(words.map(lambda word: (word.lower(), 1)).collect())
keyword = words.keyBy(lambda word: word.lower()[0])
print(keyword.collect())

print("\n")

# Mapping over Values
print(keyword.mapValues(lambda word: word.upper()).collect())
print(keyword.flatMapValues(lambda word: word.upper()).collect())

print("\n")

# Extract Keys and Values
print(keyword.keys().collect())
print(keyword.values().collect())

print("\n")

# lookup
print(keyword.lookup("s"))

print("\n")

# sampleByKey
import random
distinctChars = words.flatMap(lambda word: list(word.lower())).distinct().collect()
sampleMap = dict(map(lambda c: (c, random.random()), distinctChars))
print(words.map(lambda word: (word.lower()[0], word)).sampleByKey(True, sampleMap, 6).collect())

[('spark', 1), ('the', 1), ('definitive', 1), ('guide', 1), (':', 1), ('big', 1), ('data', 1), ('processing', 1), ('made', 1), ('simple', 1)]
[('s', 'Spark'), ('t', 'The'), ('d', 'Definitive'), ('g', 'Guide'), (':', ':'), ('b', 'Big'), ('d', 'Data'), ('p', 'Processing'), ('m', 'Made'), ('s', 'Simple')]


[('s', 'SPARK'), ('t', 'THE'), ('d', 'DEFINITIVE'), ('g', 'GUIDE'), (':', ':'), ('b', 'BIG'), ('d', 'DATA'), ('p', 'PROCESSING'), ('m', 'MADE'), ('s', 'SIMPLE')]
[('s', 'S'), ('s', 'P'), ('s', 'A'), ('s', 'R'), ('s', 'K'), ('t', 'T'), ('t', 'H'), ('t', 'E'), ('d', 'D'), ('d', 'E'), ('d', 'F'), ('d', 'I'), ('d', 'N'), ('d', 'I'), ('d', 'T'), ('d', 'I'), ('d', 'V'), ('d', 'E'), ('g', 'G'), ('g', 'U'), ('g', 'I'), ('g', 'D'), ('g', 'E'), (':', ':'), ('b', 'B'), ('b', 'I'), ('b', 'G'), ('d', 'D'), ('d', 'A'), ('d', 'T'), ('d', 'A'), ('p', 'P'), ('p', 'R'), ('p', 'O'), ('p', 'C'), ('p', 'E'), ('p', 'S'), ('p', 'S'), ('p', 'I'), ('p', 'N'), ('p', 'G'), ('m', 'M'), ('m', 'A'), ('m', 'D'), ('m

### _Aggregation RDD Examples_:
-  countByKey:
    -  counts items (values) per each key
-  groupByKey:
    -  when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs
    -  each executor must hold all values for key in memory before applying function
    -  this can be a problem if partitions hold tons of values and could result in executor(s) OOM
-  reduceByKey:
    -  when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V
    -  combines values within each partition before the shuffle, so no executor has to hold all values for a key in memory
    -  each worker performs individual tasks before performing final reduce
-  aggregate:
    - requires start value, and 2 functions (aggregates within partitions, aggregates across partitions)
    -  performs final aggregation on driver thus if the executor results are too large the driver could face OOM
-  treeAggregate:
    -  helps to improve performance by performing subaggregations within executor to take away strain on driver
-  aggregateByKey:
    -  aggregates by key instead of partition by partition like "aggregate" function
-  combineByKey:
    -  operates on given key and merges values based on a custom function
-  foldByKey:
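    -  merges values for each key via a custom associative function and a neutral "zero value" [0 for addition, 1 for multiplication]
-  cogroup:
    -  groups the values from 2 or more key-value RDDs by key, returning (K, (Iterable<V>, Iterable<W>)) pairs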
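
The two-function aggregations above can be modeled without Spark to show where each function applies. The sketch below assumes partitions are plain lists and that `split_evenly`, `aggregate`, and `aggregate_by_key` are hypothetical helpers written for illustration; they mirror the described semantics (first function within a partition, second function across partitions) rather than Spark's internals.

```python
from functools import reduce

def split_evenly(records, num_partitions):
    """Slice records into contiguous chunks of near-equal size (assumed layout)."""
    n = len(records)
    bounds = [(i * n) // num_partitions for i in range(num_partitions + 1)]
    return [records[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

def aggregate(partitions, zero, seq_op, comb_op):
    """seq_op folds records inside each partition; comb_op merges the partials."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Same idea per key: seq_op within a partition, comb_op across partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq_op(local.get(key, zero), value)
        for key, partial in local.items():
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged

def max_func(left, right):
    return max(left, right)

def add_func(left, right):
    return left + right

# aggregate: max within each of 5 partitions, then sum the maxima
nums = split_evenly(list(range(1, 31)), 5)
total = aggregate(nums, 0, max_func, add_func)

# aggregateByKey: count letters within each partition, then keep the larger count
words = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
kv_partitions = [
    [(letter, 1) for word in part for letter in word.lower()]
    for part in split_evenly(words, 2)
]
per_key = aggregate_by_key(kv_partitions, 0, add_func, max_func)

print(total)          # 90
print(per_key["s"])   # 3
```

With 1 through 30 split into five partitions, the per-partition maxima are 6, 12, 18, 24, and 30, which sum to 90. For the letter "s", the first partition contributes a count of 1 and the second a count of 3, so the cross-partition max is 3.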

In [24]:
chars = words.flatMap(lambda word: word.lower())
KVcharacters = chars.map(lambda letter: (letter, 1))
def maxFunc(left, right):
  return max(left, right)
def addFunc(left, right):
  return left + right
nums = spark.sparkContext.parallelize(range(1,31), 5)

In [25]:
nums.getNumPartitions()

5

In [26]:
from functools import reduce

In [27]:
# countByKey
print(KVcharacters.countByKey())

print("\n")

# groupByKey
print(KVcharacters.groupByKey().map(lambda row: (row[0], reduce(addFunc, row[1]))).collect())

# reduceByKey
print(KVcharacters.reduceByKey(addFunc).collect())

print("\n")

# aggregate
print(nums.aggregate(0, maxFunc, addFunc))

# treeAggregate
depth = 3
print(nums.treeAggregate(0, maxFunc, addFunc, depth))

print("\n")

# aggregateByKey
print(KVcharacters.aggregateByKey(0, addFunc, maxFunc).collect())

print("\n")

# combineByKey
def valToCombiner(value):
  return [value]
def mergeValuesFunc(vals, valToAppend):
  vals.append(valToAppend)
  return vals
def mergeCombinerFunc(vals1, vals2):
  return vals1 + vals2
outputPartitions = 6
print(KVcharacters.combineByKey(valToCombiner, mergeValuesFunc, mergeCombinerFunc, outputPartitions).collect())

print("\n")

# foldByKey
print(KVcharacters.foldByKey(0, addFunc).collect())

print("\n")

# cogroup
import random
distinctChars = words.flatMap(lambda word: word.lower()).distinct()
charRDD = distinctChars.map(lambda c: (c, random.random()))
charRDD2 = distinctChars.map(lambda c: (c, random.random()))
for i in charRDD.cogroup(charRDD2).take(3): print(i)

defaultdict(<class 'int'>, {'s': 4, 'p': 3, 'a': 4, 'r': 2, 'k': 1, 't': 3, 'h': 1, 'e': 7, 'd': 4, 'f': 1, 'i': 7, 'n': 2, 'v': 1, 'g': 3, 'u': 1, ':': 1, 'b': 1, 'o': 1, 'c': 1, 'm': 2, 'l': 1})


[('s', 4), ('p', 3), ('r', 2), ('h', 1), ('d', 4), ('i', 7), ('g', 3), ('b', 1), ('c', 1), ('l', 1), ('a', 4), ('k', 1), ('t', 3), ('e', 7), ('f', 1), ('n', 2), ('v', 1), ('u', 1), (':', 1), ('o', 1), ('m', 2)]
[('s', 4), ('p', 3), ('r', 2), ('h', 1), ('d', 4), ('i', 7), ('g', 3), ('b', 1), ('c', 1), ('l', 1), ('a', 4), ('k', 1), ('t', 3), ('e', 7), ('f', 1), ('n', 2), ('v', 1), ('u', 1), (':', 1), ('o', 1), ('m', 2)]


90
90


[('s', 3), ('p', 2), ('r', 1), ('h', 1), ('d', 2), ('i', 4), ('g', 2), ('b', 1), ('c', 1), ('l', 1), ('a', 3), ('k', 1), ('t', 2), ('e', 4), ('f', 1), ('n', 1), ('v', 1), ('u', 1), (':', 1), ('o', 1), ('m', 2)]


[('s', [1, 1, 1, 1]), ('d', [1, 1, 1, 1]), ('l', [1]), ('v', [1]), (':', [1]), ('p', [1, 1, 1]), ('r', [1, 1]), ('c', [1]), ('k', [1]), ('t', [1, 1, 1]), ('

### _Join RDD Example_:
-  join
-  fullOuterJoin
-  leftOuterJoin
-  rightOuterJoin
-  cartesian
-  zip [zips together 2 RDDs with the same # of partitions and the same # of elements in each partition]

In [28]:
keyedChars = distinctChars.map(lambda c: (c, random.random()))
outputPartitions = 10
KVcharacters.join(keyedChars).count()
KVcharacters.join(keyedChars, outputPartitions).count()

51
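
The count of 51 from the join above follows from how an inner join pairs records: every left record is emitted once per matching right record, and `keyedChars` has exactly one record per distinct character. A plain-Python sketch of that logic (no Spark; `inner_join` is a hypothetical helper, and a fixed 0.0 replaces the random values used in the notebook):

```python
from collections import defaultdict

def inner_join(left, right):
    """Pair every left (key, value) with every right value sharing the key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]

words = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# one (letter, 1) record per character, as in KVcharacters
kv_characters = [(letter, 1) for word in words for letter in word.lower()]

# one record per distinct character, as in keyedChars
# ASSUMPTION: 0.0 is a placeholder for the notebook's random.random() values
distinct_chars = sorted(set(key for key, _ in kv_characters))
keyed_chars = [(c, 0.0) for c in distinct_chars]

joined = inner_join(kv_characters, keyed_chars)

print(len(kv_characters))  # 51
print(len(distinct_chars)) # 21
print(len(joined))         # 51
```

Because each key appears once on the right side, the join neither drops nor duplicates left records, so the output size equals the number of characters.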

In [29]:
numRange = spark.sparkContext.parallelize(range(10), 2)
words.zip(numRange).collect()

[('Spark', 0),
 ('The', 1),
 ('Definitive', 2),
 ('Guide', 3),
 (':', 4),
 ('Big', 5),
 ('Data', 6),
 ('Processing', 7),
 ('Made', 8),
 ('Simple', 9)]

### _Controlling Partitions Examples_:
-  coalesce
-  repartition

In [30]:
# collapses 2 partition RDD to 1 partition RDD to avoid shuffle operation
print(words.coalesce(1).getNumPartitions())
# repartitions 2 partition RDD to 10 partition RDD for increased parallelism but performs shuffle
print(words.repartition(10).getNumPartitions())

1
10


### _Custom Partition Example_:

In [31]:
df = spark.read.option("header", "true").option("inferSchema", "true").csv(retailAll)
rdd = df.coalesce(10).rdd

In [32]:
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



In [33]:
def partitionFunc(key):
  import random
  if key == 17850 or key == 12583:
    return 0
  else:
    return random.randint(1,2)

keyedRDD = rdd.keyBy(lambda row: row[6]) # index 6 = 7th column [CustomerID]
keyedRDD\
  .partitionBy(3, partitionFunc)\
  .map(lambda x: x[0])\
  .glom()\
  .map(lambda x: len(set(x)))\
  .take(3)

[2, 4302, 4296]
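
Note that `partitionFunc` above picks partition 1 or 2 at random for every record, so the same CustomerID can land in both partitions. When each key should live in exactly one partition, a deterministic function is needed. The sketch below models this in plain Python, assuming the target partition is `partition_func(key) % num_partitions`; `partition_by`, `stable_partition_func`, and the sample pairs are hypothetical and only illustrate the idea.

```python
# Plain-Python model of a custom partitioner (no Spark required).
HEAVY_KEYS = {17850, 12583}  # the two customer IDs singled out in the notebook

def stable_partition_func(key):
    """Isolate the heavy keys in partition 0; spread the rest by key parity.

    ASSUMPTION: keys are integers (null CustomerIDs are not handled here).
    """
    if key in HEAVY_KEYS:
        return 0
    return 1 + (key % 2)

def partition_by(pairs, num_partitions, partition_func):
    """Route each (key, value) pair to partition_func(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions

# ASSUMPTION: made-up sample records keyed by customer ID
pairs = [
    (17850, "a"), (12583, "b"), (13047, "c"),
    (13047, "d"), (13748, "e"), (17850, "f"),
]

partitions = partition_by(pairs, 3, stable_partition_func)
distinct_keys_per_partition = [len({key for key, _ in part}) for part in partitions]

print(distinct_keys_per_partition)  # [2, 1, 1]
```

Since the function depends only on the key, repeated records for the same customer always share a partition, which is what key-based operations downstream typically rely on.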

## _Chapter #14 - Distributed Shared Variables_

-  Broadcast Variables:
    -  saves large value on all worker nodes without re-sending to cluster every time (ex: a lookup table that fits in memory on each executor)
    -  avoids deserialization per task on the worker nodes every time variable is used
    -  shared immutable variables that are cached on every machine in cluster instead of serialized with every single task
    -  the cost of serializing data for every task can be quite expensive thus broadcast variables are a good alternative   
<br>
-  Accumulators:
    -  adds data together from all tasks into a shared result (ex: error logging counter and debugging)
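    -  mutable variable that tasks update and whose value is sent to driver node in an efficient manner (updates are only guaranteed to be applied once when made inside actions such as foreach; inside transformations they can be re-applied if a task is re-run)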
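
The division of labor between the two shared-variable types can be modeled in plain Python. This sketch assumes tasks are ordinary function calls over list partitions: the broadcast value is a read-only mapping every task can read, and each task accumulates into its own local counter that the driver then merges. `run_task` and the sample flight rows are hypothetical; only the column names and the lookup values come from this notebook.

```python
from types import MappingProxyType

# Broadcast model: one read-only lookup table shared by every task.
supplemental = MappingProxyType(
    {"Spark": 1000, "Definitive": 200, "Big": -300, "Simple": 100}
)

# ASSUMPTION: made-up rows that reuse the flight data's column names.
partitions = [
    [
        {"DEST_COUNTRY_NAME": "China", "ORIGIN_COUNTRY_NAME": "United States", "count": 5},
        {"DEST_COUNTRY_NAME": "Ireland", "ORIGIN_COUNTRY_NAME": "United States", "count": 7},
    ],
    [
        {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "China", "count": 3},
    ],
]

def run_task(rows):
    """One task: add to a task-local counter and return it to the driver."""
    local_total = 0
    for row in rows:
        if row["DEST_COUNTRY_NAME"] == "China":
            local_total += row["count"]
        if row["ORIGIN_COUNTRY_NAME"] == "China":
            local_total += row["count"]
    return local_total

# Accumulator model: the driver merges the per-task partial values.
acc_china = sum(run_task(rows) for rows in partitions)

# Tasks read the broadcast value but cannot modify it.
words = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
pairs = sorted(
    ((word, supplemental.get(word, 0)) for word in words),
    key=lambda pair: pair[1],
)

print(acc_china)            # 8
print(pairs[0], pairs[-1])  # ('Big', -300) ('Spark', 1000)
```

The read-only mapping stands in for immutability: an attempt to assign into `supplemental` raises a `TypeError`, much as tasks are expected to treat a broadcast value as read-only.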


### _Chapter #14 Exercises (Distributed Shared Variables)_

In [34]:
my_collection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
words = spark.sparkContext.parallelize(my_collection, 2)

### _Broadcast Example_:

In [35]:
supplementalData = {"Spark":1000, "Definitive":200,
                    "Big":-300, "Simple":100}

In [36]:
suppBroadcast = spark.sparkContext.broadcast(supplementalData)

In [37]:
suppBroadcast.value

{'Big': -300, 'Definitive': 200, 'Simple': 100, 'Spark': 1000}

In [38]:
words.map(lambda word: (word, suppBroadcast.value.get(word, 0))).sortBy(lambda wordPair: wordPair[1]).collect()

[('Big', -300),
 ('The', 0),
 ('Guide', 0),
 (':', 0),
 ('Data', 0),
 ('Processing', 0),
 ('Made', 0),
 ('Simple', 100),
 ('Definitive', 200),
 ('Spark', 1000)]

### _Accumulator Example_:

In [39]:
flights = spark.read.parquet(flightData2010)

In [40]:
# count number of flights to or from China
accChina = spark.sparkContext.accumulator(0)

def accChinaFunc(flight_row):
  destination = flight_row["DEST_COUNTRY_NAME"]
  origin = flight_row["ORIGIN_COUNTRY_NAME"]
  if destination == "China":
    accChina.add(flight_row["count"])
  if origin == "China":
    accChina.add(flight_row["count"])

# foreach is an action: it runs the function once against each row in the input DF (flights)
flights.foreach(lambda flight_row: accChinaFunc(flight_row))

In [41]:
accChina.value

953
