# PySpark Tutorial
<div>
 <h2> CSCI 4283 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

Spark was originally developed using Scala, although there are Python and Java interfaces as well. This tutorial covers [most of the RDD API](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds) using Python bindings.

You may want to consult the [PySpark manual](http://spark.apache.org/docs/2.1.0/api/python/pyspark.html) as well.

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=0e6b0471fc456569a62f71c584bd25213fc4b3ea8e56783e7fb902d53a67ce25
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [2]:
from pyspark import SparkContext, SparkConf
import numpy as np
import operator

Note that I explicitly declare the number of local worker threads to use with `local[3]`.

In [3]:
conf=SparkConf().setAppName("pyspark tutorial").setMaster("local[3]")
sc = SparkContext(conf=conf)

## Partitions

RDDs are broken into multiple partitions, or slices, which are the unit of work allocation (*i.e.*, more partitions give more potential for parallelism, but too many partitions add too much overhead). By default, the number of partitions is related to your cluster size. In this example, I use `local[3]` to specify three worker threads, which is why the RDD below gets three partitions.

In [4]:
a = sc.parallelize([7, 2, 3, 1, 2, 3, 4, 5, 6, 7])

In [5]:
a.getNumPartitions()

3

We can also specify the number of partitions or slices when parallelizing a data structure:

In [6]:
a2 = sc.parallelize([7, 2, 3, 1, 2, 3, 4, 5, 6, 7], numSlices=2)

In [7]:
a2.getNumPartitions()

2
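To make the idea of slices concrete, here is a minimal plain-Python sketch of splitting a list into contiguous slices. The helper `split_into_slices` is hypothetical, and its boundary rule (`i * n // num_slices`) is an assumption, chosen because it reproduces the partition contents that appear in the outputs of this tutorial. Spark's actual partitioning logic may differ in detail.

```python
def split_into_slices(data, num_slices):
    """Hypothetical helper: split `data` into `num_slices` contiguous slices.

    Assumption: slice i covers indices [i*n//k, (i+1)*n//k). This is an
    illustration only, not Spark's implementation.
    """
    n = len(data)
    return [
        data[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


data = [7, 2, 3, 1, 2, 3, 4, 5, 6, 7]
three = split_into_slices(data, 3)
two = split_into_slices(data, 2)
print(three)  # [[7, 2, 3], [1, 2, 3], [4, 5, 6, 7]]
print(two)    # [[7, 2, 3, 1, 2], [3, 4, 5, 6, 7]]
```

Every element lands in exactly one slice, and the slices keep the original order.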

## When are partitions visible?

In general, you don't need to be aware of partitions / slices, but some code uses the underlying partition information when implementing other operations. We'll look at one example, **fold**.

**fold** takes an "identity value" and a function. It reduces each partition of the RDD with the function, starting from the identity value, and then applies the function again, once more starting from the identity value, when the per-partition values have been collected at the driver. `fold` is a variant of `reduce` in which each partition, and the final merge, starts from the identity value.

For example, assume we want to create a list from elements of an RDD using list addition. There are numerous reasons why this is a bad idea, but it helps us illustrate the impact of partitions.

The problem is that you can't use "list addition" until you have a list. For example, you need to execute:
` [] + [1] ` before you can create a list using `[] + [1] + [2]`. 

In [8]:
[] + [1] + [2]

[1, 2]

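**fold** first combines the elements within each partition and then uses the same function to combine the results from the individual partitions. **reduce** follows the same two steps, but without an identity value. Because `x + [y]` wraps its second argument in a list, the cross-partition step matters here. Let's apply **fold** to a 3-partition structure: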

In [9]:
a.fold([], lambda x,y: x + [y])

[[7, 2, 3], [1, 2, 3], [4, 5, 6, 7]]
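One way to see why the result is a list of lists is to model the two steps of `fold` in plain Python. This is a sketch, not Spark's implementation: `model_fold` is a hypothetical helper, and the partition layout is assumed from the output above.

```python
from functools import reduce


def model_fold(partitions, zero, op):
    """Hypothetical plain-Python model of fold over explicit partitions."""
    # Step 1: fold each partition on its own, starting from the zero value.
    partials = [reduce(op, part, zero) for part in partitions]
    # Step 2: fold the per-partition results, again starting from zero.
    return reduce(op, partials, zero)


# Assumed partition layout, matching the output above.
partitions = [[7, 2, 3], [1, 2, 3], [4, 5, 6, 7]]
result = model_fold(partitions, [], lambda x, y: x + [y])
print(result)  # [[7, 2, 3], [1, 2, 3], [4, 5, 6, 7]]
```

In step 1 each `y` is a number, so `x + [y]` builds a flat list per partition. In step 2 each `y` is itself a list, so the same function appends whole lists instead of concatenating them.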

Under the hood, **reduce** relies on a set of tools that handle operations within a partition and then across partitions. We're going to look at the general **aggregate** method.

### Aggregate -- generalization of reduce and fold

In order to understand the order of operations, we need a function that will illustrate that order for an RDD. The **showAdd** function will show the order of operations using parentheses.

In [14]:
def showAdd(x,y):
    return f"({x} + {y})"

In [15]:
showAdd( showAdd(1,2), 3)

'((1 + 2) + 3)'

In [16]:
oneslice = sc.parallelize([2,3,4,5,6],1)
oneslice.reduce(showAdd)

'((((2 + 3) + 4) + 5) + 6)'

If we partition the same data into two slices, we see that one partition contains `(2+3)` and the other contains `(4+5)+6`. The reduce combines these two together.

In [17]:
twoslice = sc.parallelize([2,3,4,5,6],2)
twoslice.reduce(showAdd)

'((2 + 3) + ((4 + 5) + 6))'
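The two outputs above can be reproduced with a small plain-Python model of `reduce`. This is a sketch under stated assumptions: `model_reduce` and `show_add` are hypothetical names, the partition layouts are taken from the outputs above, and nothing here calls Spark.

```python
from functools import reduce


def show_add(x, y):
    """Same idea as showAdd: record the order of operations as a string."""
    return f"({x} + {y})"


def model_reduce(partitions, op):
    """Hypothetical model: reduce inside each partition, then across them."""
    # Empty partitions contribute nothing, so skip them.
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


one = model_reduce([[2, 3, 4, 5, 6]], show_add)
two = model_reduce([[2, 3], [4, 5, 6]], show_add)
print(one)  # ((((2 + 3) + 4) + 5) + 6)
print(two)  # ((2 + 3) + ((4 + 5) + 6))
```

The grouping changes with the partition layout, which is why the operator should be associative.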

Now, let's see the semantics of **fold**:

In [18]:
oneslice.fold(1,showAdd)

'(1 + (((((1 + 2) + 3) + 4) + 5) + 6))'

In [19]:
twoslice.fold(1,showAdd)

'((1 + ((1 + 2) + 3)) + (((1 + 4) + 5) + 6))'

In other words, the identity element is combined with the first element of each partition and then once more when the per-partition results are merged. Like **reduce**, the **fold** operation really only works well for commutative, associative operators because it's applied to each slice of an RDD independently.
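A numeric version of the same experiment shows why the first argument should be a true identity for the operator. The sketch below models `fold` in plain Python; `model_fold` is a hypothetical helper and the partition layouts are assumed from the outputs above.

```python
from functools import reduce
import operator


def model_fold(partitions, zero, op):
    """Hypothetical model of fold: per-partition fold, then a final fold."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


one_slice = [[2, 3, 4, 5, 6]]
two_slices = [[2, 3], [4, 5, 6]]

# With the additive identity 0, the layout does not matter.
print(model_fold(one_slice, 0, operator.add))   # 20
print(model_fold(two_slices, 0, operator.add))  # 20

# With 1, the value is added once per partition plus once at the merge.
print(model_fold(one_slice, 1, operator.add))   # 22
print(model_fold(two_slices, 1, operator.add))  # 23
```

With a non-identity value the answer depends on the number of partitions, so the same program could give different results on different clusters.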

Recall that we explicitly specified that `twoslice` should have two slices.

In [20]:
twoslice.reduce(showAdd)

'((2 + 3) + ((4 + 5) + 6))'

In [21]:
twoslice.fold(1, showAdd)

'((1 + ((1 + 2) + 3)) + (((1 + 4) + 5) + 6))'

The **aggregate** function performs an operation like `fold` on each RDD partition and then uses a __combine function__ to join partitions.

For example, assume we have data:

In [22]:
twoPart = sc.parallelize([1,2,3,4], numSlices=2)

This data will (likely) be divided into `[1,2]` and `[3,4]`. Now assume we want to reduce the data to two values: the first is the sum of the data (10) and the second is the length of the largest partition (likely 2).

We'll have two distinct functions: `seqOp` will define operations within a partition and `combOp` will define how partition results are combined.

In [23]:
def seqOp( x, y):
    return f"({x} + S + {y})"

In [24]:
def combOp( x, y ):
    return f"({x} + C + {y})"

As with `fold`, we need a "zero-value" to start folding:

In [25]:
oneslice.aggregate( 0, seqOp, combOp )

'(0 + C + (((((0 + S + 2) + S + 3) + S + 4) + S + 5) + S + 6))'

Recall that `oneslice` has a single partition. The `seqOp` operation is applied to the elements in that single partition, starting from the identity value (0). `combOp` then combines the identity value (0) with the result from the single partition.

Now, let's see what this is like for two slices.

In [26]:
twoslice.aggregate( 0, seqOp, combOp )

'((0 + C + ((0 + S + 2) + S + 3)) + C + (((0 + S + 4) + S + 5) + S + 6))'

In this case, the values in each of the two partitions are combined using `seqOp`, starting from the identity element.

The results from the two partitions are then combined using `combOp` in *left-to-right* order, again starting from the identity element.
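Both traces can be reproduced with a short plain-Python model of `aggregate`. This is a sketch, not Spark's implementation: `model_aggregate`, `seq_op` and `comb_op` are hypothetical names, and the partition layouts are assumed from the outputs above.

```python
from functools import reduce


def seq_op(x, y):
    """Marks a within-partition step with S."""
    return f"({x} + S + {y})"


def comb_op(x, y):
    """Marks a cross-partition step with C."""
    return f"({x} + C + {y})"


def model_aggregate(partitions, zero, seq, comb):
    """Hypothetical model of aggregate over explicit partitions."""
    # Within each partition: fold with seq, starting from zero.
    partials = [reduce(seq, part, zero) for part in partitions]
    # Across partitions: fold the partial results with comb, left to right,
    # again starting from zero.
    return reduce(comb, partials, zero)


one = model_aggregate([[2, 3, 4, 5, 6]], 0, seq_op, comb_op)
two = model_aggregate([[2, 3], [4, 5, 6]], 0, seq_op, comb_op)
print(one)
print(two)
```

The only difference from the `fold` pattern is that the within-partition and cross-partition steps use separate functions.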

**aggregate** is the basis of many of the other operations in Spark. You can use it to build additional extensions, but many of the common operations we need are already built in on top of **aggregate**.
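As an illustration of building on this pattern, the earlier goal of computing the sum of the data together with the length of the largest partition can be sketched in plain Python. The helper `model_aggregate`, the accumulator shape `(total, count)`, and the `[1,2]` / `[3,4]` layout are assumptions for this example. With a real RDD, the same two functions and zero value could be passed to `twoPart.aggregate`, though that call is not run here.

```python
from functools import reduce


def model_aggregate(partitions, zero, seq, comb):
    """Hypothetical model of aggregate over explicit partitions."""
    partials = [reduce(seq, part, zero) for part in partitions]
    return reduce(comb, partials, zero)


def seq_op(acc, x):
    # Within a partition: add x to the running total and count one element.
    return (acc[0] + x, acc[1] + 1)


def comb_op(a, b):
    # Across partitions: add the totals, keep the larger element count.
    return (a[0] + b[0], max(a[1], b[1]))


# Assumed layout of sc.parallelize([1, 2, 3, 4], numSlices=2).
partitions = [[1, 2], [3, 4]]
total, largest = model_aggregate(partitions, (0, 0), seq_op, comb_op)
print(total, largest)  # 10 2
```

The accumulator has a different type from the elements (a tuple rather than a number), which is something `reduce` and `fold` cannot express directly.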