# PySpark 101

### An introduction to distributed computing

October 2020

### Installing PySpark

Steps for Windows:
1. Make sure you have Java installed: `java -version`
2. Make sure you have Python installed: `python --version`
3. Download Spark from http://spark.apache.org/downloads.html
4. Unzip the files with `tar xvzf spark-3.0.0-bin-hadoop2.7.tgz`
5. Set the environment variables (you need to be running as admin):
    - `setx SPARK_HOME "%USERPROFILE%\Documents\spark\spark-3.0.0-bin-hadoop2.7" /M`
    - `setx HADOOP_HOME "%USERPROFILE%\Documents\spark\spark-3.0.0-bin-hadoop2.7" /M`

### Distributed Computing
To process vast volumes of data quickly, we need to **scale out** (add more machines) rather than **scale up** (buy a bigger machine).

- **Cheaper**: Run large workloads on clusters of many nodes (i.e., smaller and cheaper machines).

- **Faster**: Computations are parallelized and distributed across the nodes.

- **Reliable**: If one node or process fails, its workload is taken over by other components in the system (also known as fault tolerance).

### Apache Spark
An open-source distributed cluster-computing framework. It extends the MapReduce model using RDDs (Resilient Distributed Datasets).

Available in Java, Scala, Python, and R. (PySpark is the Python API.)

**Key Components:**

- **Distribution**: Distribute the data

- **Parallelism**: Perform subsets of the computation simultaneously

- **Fault tolerance**: Handle component failure

# Part 1: MapReduce

### MapReduce
Recall that MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm on a cluster.

- **Map procedure**: Applies a function to each data point within a partition; partitions are processed in parallel.
  Examples: `filter()`, `map()`
- **Reduce procedure**: Summary operation that returns one value from multiple values.
  Examples: `reduce()`, `sum()`, `count()`
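
The two procedures can be sketched in plain Python, without Spark. This is only an illustration on a single in-memory list: the sample values and variable names are made up for the example, and nothing here is distributed.

```python
from functools import reduce

# Hypothetical sample data, stored as strings like lines read from a text file
data = ["4", "0", "3", "9", "8"]

# Map procedure: apply a function to every element, then keep only some of them
numbers = list(map(int, data))
large = list(filter(lambda x: x >= 8, numbers))

# Reduce procedure: collapse many values into a single one
total = reduce(lambda x, y: x + y, large)

print(numbers, large, total)
```

Spark applies the same idea, but each partition of the data runs the map step on a different core or machine.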

<img src='img/example.png'>

In [2]:
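# findspark locates the local Spark installation so pyspark can be imported
# (if the import fails, install it first with `pip install findspark`)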
import findspark
findspark.init()

ModuleNotFoundError: No module named 'findspark'

In [3]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

In [4]:
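sc = SparkContext()  # Entry point for the RDD API
ss = SparkSession.builder.getOrCreate()  # Entry point for the DataFrame API

ss.catalog.clearCache()  # Remove all cached tables from memory
sc.setLogLevel("OFF")  # Silence Spark log output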

In [5]:
sc

In [6]:
# Read data as an RDD
rdd = sc.textFile("data/numbers.txt", 8) # 8 = minimum number of partitions (matched here to the number of cores)

In [7]:
# See how the data is split across partitions (one inner list per partition)
rdd.glom().collect()

[['4', '0', '3', '6', '8', '7', '9', '9', '8', '5', '6', '5', '6'],
 ['3', '6', '5', '1', '8', '8', '1', '5', '4', '1', '5', '8', '2'],
 ['6', '8', '8', '3', '9', '9', '0', '7', '3', '6', '0', '6'],
 ['0', '5', '9', '0', '9', '5', '9', '2', '5', '4', '8', '6', '8'],
 ['5', '1', '4', '7', '4', '0', '1', '9', '5', '3', '1', '1'],
 ['0', '0', '9', '3', '0', '5', '4', '0', '8', '0', '1', '8', '5'],
 ['7', '2', '1', '9', '3', '4', '2', '0', '2', '8', '1', '0'],
 ['0', '8', '4', '1', '0', '0', '7', '3', '8', '3', '7', '3']]

Logical cores = (# of physical cores) × (# of threads that can run on each core)

To check the number of logical cores on macOS: `sysctl -n hw.ncpu`

To check the number of partitions of an RDD: `rdd.getNumPartitions()`
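
A portable alternative to the shell command is Python's standard library. The sketch below assumes nothing beyond a standard Python installation; the variable name `logical_cores` is only illustrative.

```python
import os

# os.cpu_count() reports the number of logical cores.
# It can return None when the count cannot be determined, so fall back to 1.
logical_cores = os.cpu_count() or 1

print(logical_cores)
```

The result could then be passed as the second argument of `sc.textFile()` instead of a hard-coded number.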

In [8]:
# A note about lambda functions

def add_2(x):
    return x+2

add_2_lambda = lambda x: x+2

assert add_2(3) == add_2_lambda(3)

#### Now let's convert data to integers

In [9]:
# Example of a map function
converted_rdd = rdd.map(lambda x: int(x))
converted_rdd.glom().collect()

[[4, 0, 3, 6, 8, 7, 9, 9, 8, 5, 6, 5, 6],
 [3, 6, 5, 1, 8, 8, 1, 5, 4, 1, 5, 8, 2],
 [6, 8, 8, 3, 9, 9, 0, 7, 3, 6, 0, 6],
 [0, 5, 9, 0, 9, 5, 9, 2, 5, 4, 8, 6, 8],
 [5, 1, 4, 7, 4, 0, 1, 9, 5, 3, 1, 1],
 [0, 0, 9, 3, 0, 5, 4, 0, 8, 0, 1, 8, 5],
 [7, 2, 1, 9, 3, 4, 2, 0, 2, 8, 1, 0],
 [0, 8, 4, 1, 0, 0, 7, 3, 8, 3, 7, 3]]

<img src='img/string2int.png'>

#### Now let's filter to get numbers greater than or equal to 8

In [10]:
# Example of another map-phase function: filter() keeps the elements that satisfy a condition
filtered_rdd = converted_rdd.filter(lambda x: x >= 8)
filtered_rdd.glom().collect()

[[8, 9, 9, 8],
 [8, 8, 8],
 [8, 8, 9, 9],
 [9, 9, 9, 8, 8],
 [9],
 [9, 8, 8],
 [9, 8],
 [8, 8]]

<img src='img/greater8.png'>

#### Now let's add numbers

In [11]:
# Example of a reduce function
filtered_rdd.reduce(lambda x, y: x+y)

202

**What's happening?** Within each partition, the RDD adds numbers two at a time until only one number per partition is left. The per-partition results are then sent to the driver and added the same way until only one number is left.
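
This two-stage process can be imitated in plain Python. The sketch below is not Spark code: it copies the filtered partitions shown earlier into a list of lists and uses `functools.reduce` to stand in for what each partition and the final combine step do.

```python
from functools import reduce

# Partitions copied from the output of filtered_rdd.glom().collect()
partitions = [
    [8, 9, 9, 8], [8, 8, 8], [8, 8, 9, 9], [9, 9, 9, 8, 8],
    [9], [9, 8, 8], [9, 8], [8, 8],
]

def add(x, y):
    return x + y

# Stage 1: each partition is reduced on its own (Spark does this in parallel)
partials = [reduce(add, part) for part in partitions]

# Stage 2: the per-partition results are combined into a single value
total = reduce(add, partials)

print(partials, total)
```

Because the order in which values are combined is not fixed, the function passed to `reduce()` should be associative and commutative, as addition is.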

In [12]:
# Example of another reduce function
rdd.count()

100

#### Part 1: MapReduce -- Summary
Sum of integers that are greater than or equal to 8

In [13]:
# Read data as an RDD
rdd = sc.textFile("data/numbers.txt", 8) # 8 = minimum number of partitions (matched here to the number of cores)

converted_rdd = rdd.map(lambda x: int(x)) # Map step: convert strings to integers
filtered_rdd = converted_rdd.filter(lambda x: x >= 8) # Map step: keep numbers >= 8

filtered_rdd.glom().collect() # To see results in partitions

filtered_rdd.reduce(lambda x, y: x+y) # Reduce function to add integers

202

In [14]:
# Other useful functions

rdd.getNumPartitions() # Get number of partitions
rdd.count() # Reduce function to count how many elements are in the original rdd
rdd.first() # Get first object

'4'

In [24]:
# rdd.saveAsTextFile("ex01_output") # Saves the RDD as a folder of text part files (not Parquet)

files = sc.textFile("ex01_output") # Accepts a file name or a folder (multiple text files)
files = sc.wholeTextFiles("data") # Folder: loads each file as a (filename, content) pair (good for TSA)

# Part 2: RDD Operations

### RDDs (Resilient Distributed Datasets)

An abstraction of a distributed collection of items, with operations and transformations applicable to the dataset.

They are:
- Distributed
- Immutable
- Resilient

### RDD Operations

- **Transformations**:
    - Apply a function to the elements of an RDD and return a new RDD
    - *Lazy evaluation*: Transformations are only evaluated when an action is requested
- **Actions**:
    - Trigger a computation and return a value to the Spark driver
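
Lazy evaluation can be illustrated with a plain-Python analogy. The sketch below uses a generator in place of a transformation and `sum()` in place of an action; it is not Spark code, and the helper `to_int` and the `calls` list are hypothetical names used only to record when work really happens.

```python
calls = []

def to_int(x):
    calls.append(x)  # Record every time the function actually runs
    return int(x)

data = ["4", "0", "3"]

# "Transformation": building the generator does not call to_int yet
lazy = (to_int(x) for x in data)
calls_before = len(calls)

# "Action": asking for a result forces the computation
result = sum(lazy)
calls_after = len(calls)

print(calls_before, calls_after, result)
```

In the same way, `rdd.map(...)` only records what should be done; the work runs when an action such as `collect()` or `reduce()` is called.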

### RDD Operations - Transformations

<img src="img/transformations.png">

### RDD Operations - Actions

<img src="img/actions.png">

#### Exercise: Calculate the sum of the odd numbers

### Pair RDDs

- RDDs of key-value pairs, which many operations require (for example, `reduceByKey()`).

In [28]:
rdd = sc.textFile("data/numbers.txt").map(lambda num: int(num))
rdd_map = rdd.map(lambda val: ('even',val) if val%2==0 else ('odd',val))
rdd_map.top(5)

[('odd', 9), ('odd', 9), ('odd', 9), ('odd', 9), ('odd', 9)]

In [30]:
rdd_map.keys().top(5)

['odd', 'odd', 'odd', 'odd', 'odd']

In [31]:
rdd_map.values().top(5)

[9, 9, 9, 9, 9]

#### Some transformations

In [63]:
rdd_map.sortByKey() # Returns a new RDD sorted by key (lazy; the result is not stored here)

rdd_map_keys = rdd_map.groupByKey()
rdd_map_keys.collect()
# rdd_map_keys.map(lambda val: (val[0], list(val[1]))).collect()

[('even', <pyspark.resultiterable.ResultIterable at 0x7faaef293400>),
 ('odd', <pyspark.resultiterable.ResultIterable at 0x7faaef293358>)]

#### Exercise: Which one is more efficient?

In [61]:
rdd_map.groupByKey().mapValues(lambda val: sum(val)).collect() # Option 1: group all values per key, then sum
rdd_map.reduceByKey(lambda x,y: x+y).collect() # Option 2: sum values per key directly

[('even', 202), ('odd', 233)]
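
One way to reason about the exercise is to count how many key-value pairs have to move between partitions. The plain-Python sketch below is a simplified model, not Spark code: the two partitions are hypothetical, and it assumes that `groupByKey()` moves every pair while `reduceByKey()` first sums the values per key inside each partition.

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) pairs
partitions = [
    [("even", 4), ("odd", 3), ("even", 6)],
    [("odd", 9), ("even", 8), ("odd", 5)],
]

def combine_locally(partition):
    """Sum the values per key inside a single partition."""
    sums = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)

# groupByKey-style: every pair is moved before anything is summed
pairs_moved_group = sum(len(p) for p in partitions)

# reduceByKey-style: only one partial sum per key per partition is moved
local = [combine_locally(p) for p in partitions]
pairs_moved_reduce = sum(len(d) for d in local)

# Final merge of the partial sums
totals = defaultdict(int)
for partial in local:
    for key, value in partial.items():
        totals[key] += value
totals = dict(totals)

print(pairs_moved_group, pairs_moved_reduce, totals)
```

Both approaches give the same totals, but the second moves less data, and the gap grows with the number of values per key.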

#### Some actions

In [64]:
rdd_map.countByKey() # Why is it an action?

defaultdict(int, {'even': 51, 'odd': 49})

### Pair RDD Operations - Transformations

<img src="img/pair_transf.png">

### Pair RDD Operations - Actions

<img src="img/pair_act.png">

In [None]:
sc.stop() # Don't forget to close your connection