# Big Data Platforms

## PySpark Basic APIs - RDDs, Map, Reduce

Author: Ashish Pujari (apujari@uchicago.edu)

In [2]:
#import statements
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('SparkBasics').getOrCreate()
spark

*URL of the Spark Admin console*

http://localhost:4040/jobs/

In [4]:
#store the Spark Context in a variable. Sometimes the Context is provided directly from the environment
sc = spark.sparkContext

In [4]:
#print spark configuration settings
spark.sparkContext.getConf().getAll()

[('spark.driver.host', 'DESKTOP-RV4I2EU'),
 ('spark.driver.port', '59312'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.id', 'local-1548536639398'),
 ('spark.app.name', 'SparkBasics'),
 ('spark.ui.showConsoleProgress', 'true')]

In [5]:
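#change configuration settings on Spark
#note: _conf is a private attribute; this updates the stored configuration,
#but resource settings such as memory and cores only apply to a newly created context
conf = spark.sparkContext._conf.setAll([
    ('spark.executor.memory', '4g'),
    ('spark.app.name', 'Spark Updated Conf'),
    ('spark.executor.cores', '4'),
    ('spark.cores.max', '4'),
    ('spark.driver.memory', '4g'),
])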

In [6]:
#print spark configuration settings
spark.sparkContext.getConf().getAll()

[('spark.driver.memory', '4g'),
 ('spark.executor.memory', '4g'),
 ('spark.executor.id', 'driver'),
 ('spark.executor.cores', '4'),
 ('spark.cores.max', '4'),
 ('spark.driver.host', 'DESKTOP-RV4I2EU'),
 ('spark.app.name', 'Spark Updated Conf'),
 ('spark.driver.port', '59312'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.id', 'local-1548536639398'),
 ('spark.ui.showConsoleProgress', 'true')]

Every Spark program and shell session works as follows:

    1) Create some input RDDs from external data.
    2) Transform them to define new RDDs using transformations such as filter().
    3) Ask Spark to persist() any intermediate RDDs that will need to be reused.
    4) Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.

### Create RDD
RDDs can be created in two distinct ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) that already exists in the driver program.

Below we generate some dummy data and create an RDD using the `parallelize` method.

In [7]:
#generate 20 natural numbers
a = range(20)

#create an RDD using Parallelize
rdd = sc.parallelize(a,3)
rdd

PythonRDD[1] at RDD at PythonRDD.scala:52

### Basic RDD APIs

In [8]:
#count number of elements in the RDD
rdd.count()

20

In [37]:
#unique ID of the RDD within its SparkContext
print(rdd.id())

15


In [9]:
#print first 10 elements of RDD
rdd.take(10)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [10]:
#the collect operation returns all the contents of the RDD to the driver
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [11]:
#print number of RDD partitions
rdd.getNumPartitions()

3
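
To get a feel for what three partitions mean for the data, the following pure-Python sketch models how 20 elements could be split into contiguous slices. The helper `split_range` is hypothetical and only approximates the slicing that `parallelize` performs on a range; on a live RDD, `rdd.glom().collect()` shows the actual contents of each partition.

```python
def split_range(n, num_slices):
    """Split range(n) into num_slices contiguous chunks (hypothetical model)."""
    # Boundary i is placed at int(i * n / num_slices), so chunk sizes differ by at most one.
    bounds = [int(i * n / num_slices) for i in range(num_slices + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(num_slices)]

partitions = split_range(20, 3)
for index, part in enumerate(partitions):
    print(index, part)
```

Each slice would be processed by a separate task, which is why the partition count bounds the available parallelism.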

In [12]:
#generate 20 natural numbers with default partitions
a = range(20)

#create RDD
rdd = sc.parallelize(a)

#print number of RDD partitions
rdd.getNumPartitions()

#why do you get this many partitions?

8
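
A likely answer to the question above: when no partition count is passed, `parallelize` falls back to `sc.defaultParallelism`, and with the master set to `local[*]` that value usually equals the number of logical cores on the machine. The sketch below only inspects the core count from plain Python; treating it as the default partition count is an assumption that holds unless `spark.default.parallelism` is set explicitly.

```python
import os

# Number of logical cores visible to Python. With master 'local[*]' the default
# partition count usually matches this value (assumption, see the note above).
logical_cores = os.cpu_count() or 1
print(logical_cores)
```

On the machine that produced this notebook the value would presumably be 8.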

In [13]:
print(rdd.getStorageLevel())

Serialized 1x Replicated


In [14]:
#cache the RDD in memory
rdd.cache()

PythonRDD[6] at RDD at PythonRDD.scala:52

In [15]:
print(rdd.getStorageLevel())

Memory Serialized 1x Replicated


### Action vs Transformation

Actual (distributed) computations in Spark take place when we execute actions, not transformations. Transformations such as `filter` only describe a new RDD; we can chain as many of them as we want and nothing is computed until the first action is called. In the cell below, `collect` is the action that triggers the work, which is why it can take a few seconds to complete.

In [16]:
#filter RDD
print(rdd.filter(lambda k: k % 2 == 0).collect())

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
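
The lazy behaviour described above can be imitated in plain Python, where the built-in `filter` is also lazy. This is only an analogy, not Spark itself: the list `evaluated` is a hypothetical probe that records when the predicate actually runs.

```python
evaluated = []  # records every element the predicate has been applied to

def is_even(k):
    """Predicate that logs each evaluation so laziness becomes visible."""
    evaluated.append(k)
    return k % 2 == 0

data = range(20)
lazy = filter(is_even, data)        # "transformation": nothing is evaluated yet
calls_before = len(evaluated)       # no predicate calls so far
result = list(lazy)                 # "action": forces the evaluation
calls_after = len(evaluated)        # one predicate call per element

print(calls_before, calls_after, result)
```

The predicate runs only once the result is materialised, just as Spark schedules a job only when an action such as `collect` or `count` is called.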


### Map Transformation

By using the map transformation in Spark, we can apply a function to every element in our RDD. Python's lambdas are especially convenient for this. In the cells below we square each element, first with a named function and then with an equivalent lambda. Note the timings: `map` returns immediately because it is a transformation, while `collect` takes a few seconds because it is the action that runs the job.

In [17]:
def square(x):
    return (x*x)

%time y = rdd.map(square)

Wall time: 0 ns


In [18]:
%time y.collect()

Wall time: 3.66 s


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]

In [19]:
%time print(rdd.map(lambda x: x*x).collect())

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
Wall time: 4.3 s


In [20]:
#flatmap is similar to map
#it returns a new RDD by applying a function to each element of the RDD but output is flattened.
rdd.flatMap(lambda x: range(1, x)).collect()

[1,
 1,
 2,
 1,
 2,
 3,
 1,
 2,
 3,
 4,
 1,
 2,
 3,
 4,
 5,
 1,
 2,
 3,
 4,
 5,
 6,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18]
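
The difference between `map` and `flatMap` is easier to see on a smaller input. The sketch below is a plain-Python analogy (no Spark involved) that uses `itertools.chain.from_iterable` to stand in for the flattening step.

```python
from itertools import chain

data = range(5)

# map-style: one output element per input element (here each output is a list)
mapped = [list(range(1, x)) for x in data]

# flatMap-style: the per-element results are concatenated into one flat list
flat_mapped = list(chain.from_iterable(range(1, x) for x in data))

print(mapped)
print(flat_mapped)
```

Inputs 0 and 1 contribute nothing to the flattened result because `range(1, x)` is empty for them, which is also why the Spark output above starts with the values generated for 2.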

### Reduce

In [21]:
#sum all numbers by using reduce operator add
from operator import add
rdd.reduce(add)

190

In [22]:
#sum all numbers by using lambda function
rdd.reduce(lambda a,b: a+b)

190

In [23]:
#average all numbers
rdd.mean()

9.5

In [24]:
#this is not the right way to calculate the average:
#the function is not associative, so the result depends on how the data is partitioned
rdd.reduce(lambda a,b: (a+b)/2)

15.685546875
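
Why the pairwise average goes wrong can be explored without a cluster. The sketch below is a simplified model, not Spark's implementation: it assumes that `reduce` first reduces each partition and then reduces the partial results, and it uses the hypothetical helper `split_range` to imitate contiguous partitions. It also shows one associative alternative, reducing `(sum, count)` pairs.

```python
from functools import reduce

def split_range(n, num_slices):
    """Hypothetical model of splitting range(n) into contiguous partitions."""
    bounds = [int(i * n / num_slices) for i in range(num_slices + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(num_slices)]

def partitioned_reduce(func, partitions):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

def pairwise_avg(a, b):
    return (a + b) / 2

# A non-associative function gives different answers for different partitionings.
wrong_8 = partitioned_reduce(pairwise_avg, split_range(20, 8))  # 15.685546875 in this model
wrong_1 = partitioned_reduce(pairwise_avg, split_range(20, 1))

# Associative alternative: carry (sum, count) pairs and divide at the end.
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                      ((x, 1) for x in range(20)))
mean = total / count  # 9.5

print(wrong_8, wrong_1, mean)
```

With eight partitions the model reproduces the value printed above, while a single partition yields a different number. The `(sum, count)` version gives 9.5 regardless of how the data is split.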

### Pair RDDs

Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.

Pair RDDs can use all the transformations available to standard RDDs. Since pair RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements.

In [25]:
# create a pair RDD from key/value tuples
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("a", 5),("b", 1), ("b", 7), ("b", 3), ("b", 2), ("c",1)], 3)
print(rdd)

ParallelCollectionRDD[15] at parallelize at PythonRDD.scala:194


In [26]:
print(rdd.keys().collect())

['a', 'b', 'a', 'a', 'b', 'b', 'b', 'b', 'c']


In [27]:
print(rdd.values().collect())

[1, 2, 3, 5, 1, 7, 3, 2, 1]


In [28]:
# filter values  
print(rdd.filter(lambda keyValue: keyValue[1] > 3).collect())

[('a', 5), ('b', 7)]


In [29]:
# Invoke map operation on values   
y = rdd.mapValues(square)
print(y.collect())

[('a', 1), ('b', 4), ('a', 9), ('a', 25), ('b', 1), ('b', 49), ('b', 9), ('b', 4), ('c', 1)]


In [30]:
# Apply reduceByKey to sum the values of each key
y = rdd.reduceByKey(lambda accum, n: accum + n)
print(y.collect())

[('b', 15), ('a', 9), ('c', 1)]


In [31]:
# Define associative function  
def sumFunc(accum, n):
    return accum + n
 
# Apply reduceByKey to the pair RDD using the named associative function
y = rdd.reduceByKey(sumFunc)
print(y.collect())

[('b', 15), ('a', 9), ('c', 1)]
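
As a mental model, `reduceByKey` folds the values of each key together with the supplied function. The single-machine sketch below is an illustration only: `reduce_by_key` is a hypothetical helper, and it ignores the per-partition combining and shuffle that Spark performs.

```python
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 5), ("b", 1),
         ("b", 7), ("b", 3), ("b", 2), ("c", 1)]

def reduce_by_key(func, items):
    """Combine the values of each key with func (single-machine model)."""
    merged = {}
    for key, value in items:
        # First value for a key is stored as-is; later values are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

totals = reduce_by_key(lambda accum, n: accum + n, pairs)
print(sorted(totals.items()))
```

The totals match the Spark output above once sorted; the order returned by `collect()` on a real pair RDD is not guaranteed.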


In [32]:
#group values by key, convert each group to a list, and sort the result by key
print(sorted(rdd.groupByKey().mapValues(list).collect()))

[('a', [1, 3, 5]), ('b', [2, 1, 7, 3, 2]), ('c', [1])]
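
For comparison with `reduceByKey`, grouping keeps every value instead of combining them. The sketch below models `groupByKey().mapValues(list)` on a single machine with a `defaultdict`; it is an analogy rather than Spark's shuffle-based implementation.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 5), ("b", 1),
         ("b", 7), ("b", 3), ("b", 2), ("c", 1)]

groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)   # keep every value, in encounter order

grouped = sorted(groups.items())
print(grouped)
```

Because all values are retained, grouping generally moves more data than reducing, so a per-key aggregate such as a sum is usually better expressed with `reduceByKey`.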


In [38]:
#join RDDs
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3), ("c",6)])

In [39]:
sorted(x.join(y).collect())

[('a', (1, 2)), ('a', (1, 3))]

In [40]:
sorted(x.leftOuterJoin(y).collect())

[('a', (1, 2)), ('a', (1, 3)), ('b', (4, None))]

In [41]:
sorted(x.rightOuterJoin(y).collect())

[('a', (1, 2)), ('a', (1, 3)), ('c', (None, 6))]
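
The join semantics shown above can be modelled on plain lists of tuples. The helpers `inner_join` and `left_outer_join` below are hypothetical, quadratic-time illustrations, not how Spark executes joins.

```python
def inner_join(left, right):
    """Pair up values whose keys appear on both sides."""
    return [(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk]

def left_outer_join(left, right):
    """Inner join plus left-side rows without a match, padded with None."""
    right_keys = {k for k, _ in right}
    unmatched = [(k, (lv, None)) for k, lv in left if k not in right_keys]
    return inner_join(left, right) + unmatched

x = [("a", 1), ("b", 4)]
y = [("a", 2), ("a", 3), ("c", 6)]

inner = sorted(inner_join(x, y))
left_outer = sorted(left_outer_join(x, y))
print(inner)
print(left_outer)
```

A right outer join is the mirror image: every row of `y` is kept, and keys missing from `x` receive `None` in the first position of the value tuple.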