# Comparing Pure Python with PySpark

## Initial setup

The main purpose of this notebook is to illustrate the differences in set operations between Spark RDDs and Python collections.

## Discussion

### Configure the Context

The code below should be run once and only once to set up the SparkContext object as `sc`:

In [None]:
import pyspark

sc = pyspark.SparkContext('local[*]')

I'll use the naming convention `s*` for Spark RDD objects and `p*` for Python objects.

Let's create a few sets in Python:

In [None]:
pInitialSet = {1, 2, 3}
pSecondSet = {4, 5}
pThirdSet = {2, 3, 4}

Now let's do the same using RDDs:

In [None]:
sInitialSet = sc.parallelize({1, 2, 3})
sSecondSet = sc.parallelize({4, 5})
sThirdSet = sc.parallelize({2, 3, 4})

### Union

Both Python sets and RDDs name this method `union`:

In [None]:
pInitialSet.union(pSecondSet)

In [None]:
sInitialSet.union(sSecondSet).collect()
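
One difference is worth keeping in mind here: a Python set discards duplicates, whereas `RDD.union` keeps them, so `.distinct()` is needed to get set semantics when the inputs overlap. The two sets used above are disjoint, so the difference does not show. The sketch below mimics the behaviour in plain Python, using lists as a stand-in for RDD contents. The `rddLike*` names are illustrative only and are not part of PySpark.

```python
# Overlapping inputs: 2 and 3 appear in both collections.
pInitialSet = {1, 2, 3}
pThirdSet = {2, 3, 4}

# Python set union removes duplicates.
setUnion = pInitialSet.union(pThirdSet)

# Stand-in for sInitialSet.union(sThirdSet).collect(): the elements of both
# inputs are kept, duplicates included (sorted here only for readability).
rddLikeUnion = sorted(pInitialSet) + sorted(pThirdSet)

# Stand-in for sInitialSet.union(sThirdSet).distinct().collect().
rddLikeDistinct = sorted(set(rddLikeUnion))

print(setUnion)
print(rddLikeUnion)
print(rddLikeDistinct)
```

With a real RDD the element order returned by `collect()` may differ; only the presence of duplicates is the point of this sketch.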

### Intersection

Again, Python and RDDs use the same name, `intersection`, for this method.

In [None]:
pInitialSet.intersection(pThirdSet)

In [None]:
sInitialSet.intersection(sThirdSet).collect()

### Subtract

Subtraction is set difference. Python calls this `difference`; RDDs use the name `subtract`.

In [None]:
pInitialSet.difference(pThirdSet)

In [None]:
sInitialSet.subtract(sThirdSet).collect()
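
Unlike union and intersection, difference depends on the order of its operands. The following plain-Python sketch shows both directions, along with the operator form `-` that Python sets also accept. It redefines the two sets so that it can run on its own.

```python
pInitialSet = {1, 2, 3}
pThirdSet = {2, 3, 4}

# Elements of the first set that are not in the second.
leftOnly = pInitialSet.difference(pThirdSet)

# Swapping the operands gives a different result.
rightOnly = pThirdSet.difference(pInitialSet)

# The '-' operator is equivalent to .difference() for sets.
leftOnlyOperator = pInitialSet - pThirdSet

print(leftOnly, rightOnly, leftOnlyOperator)
```

The same asymmetry applies to `subtract` on RDDs: the RDD the method is called on plays the role of the first set.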

### Size, find min and max of collection

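Finding the size, minimum, or maximum of a collection differs between Python and RDDs: Python uses the built-in functions `len`, `min`, and `max`, while an RDD exposes the methods `count`, `min`, and `max`: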

In [None]:
print(len(pInitialSet))
print(min(pInitialSet))
print(max(pInitialSet))

In [None]:
print(sInitialSet.count())
print(sInitialSet.min())
print(sInitialSet.max())

### Cartesian Product

Cartesian products are not directly supported by Python collections, but they are available through the standard-library `itertools` module. So in Python you first have to import `itertools`:

In [None]:
import itertools

... then you can use `itertools.product`:

In [None]:
list(itertools.product(pInitialSet,pSecondSet))

RDDs support Cartesian products directly:

In [None]:
sInitialSet.cartesian(sSecondSet).collect()

### Aggregate

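`aggregate` is an important operation on RDDs. It follows the monoid pattern: a neutral starting value plus associative combining functions.
To aggregate, we need to pass: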

* An initial value for the accumulator
* A lambda for combining the accumulator and the value
* A lambda for combining two accumulators

E.g.:

In [None]:
sInitialSet.aggregate(
    0, # Initial value for the accumulator
    lambda acc, val: acc + val, # how do you combine the accumulator with the value?
    lambda acc1, acc2: acc1 + acc2 # How do you add two accumulators together?
)
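
To make the roles of the three arguments more concrete, here is a minimal plain-Python sketch of the idea behind `aggregate`: each partition is folded with the first lambda, and the partial results are then merged with the second. The helper `local_aggregate` and the partition split are hypothetical and only illustrate the concept; they are not how Spark is implemented.

```python
from functools import reduce


def local_aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partials with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


# Hypothetical split of the elements 1, 2, 3 into two partitions.
partitions = [[1], [2, 3]]

# Same lambdas as in the cell above: a plain sum.
total = local_aggregate(
    partitions,
    0,
    lambda acc, val: acc + val,
    lambda acc1, acc2: acc1 + acc2,
)

# The accumulator may have a different type than the elements:
# here a (sum, count) pair, from which the mean can be derived.
sumCount = local_aggregate(
    partitions,
    (0, 0),
    lambda acc, val: (acc[0] + val, acc[1] + 1),
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]),
)
mean = sumCount[0] / sumCount[1]

print(total, sumCount, mean)
```

Because the initial value is used for every partition as well as for the final merge, it should be neutral for both lambdas (such as `0` for addition); otherwise the result would depend on the number of partitions.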
    

### Persisting RDDs

RDDs can be persisted easily:

In [None]:
persistedSet = sc.parallelize(range(1,1000)).cache()
equivalentSet = sc.parallelize(range(1,1000)).persist() # Uses default storage level MEMORY_ONLY

To specify the storage level, you first have to import the `StorageLevel` type.

> In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level. The available storage levels in Python include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY, and DISK_ONLY_2.

In [None]:
from pyspark import StorageLevel

In [None]:
anotherPersistedSet = sc.parallelize(range(1,1000)).persist(StorageLevel.DISK_ONLY)

## Case Classes

Scala has case classes that can be used as types for RDDs.
Python does not have case classes. Instead, we can use namedtuples.
This approach requires that we import `namedtuple` from `collections`:

In [None]:
from collections import namedtuple
from dateutil.parser import parse # I'll need this for parsing dates

We can now use `namedtuple` to create a type. E.g.:

In [None]:
TX = namedtuple("TX", ["date", "amount"])

In [None]:
someDate = TX(parse("2018-01-02"), float("123.12"))

To access the values:

In [None]:
d, a = someDate
print(d)
print(a)
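
Besides unpacking, the fields of a namedtuple can be read by name, which is closer to how a Scala case class is used. The sketch below is self-contained and builds the date with the standard-library `datetime` instead of `dateutil`, purely so that it runs without extra packages. The variable name `tx` is illustrative.

```python
from collections import namedtuple
from datetime import datetime

TX = namedtuple("TX", ["date", "amount"])
tx = TX(datetime(2018, 1, 2), 123.12)

# Fields are available by name as well as by position.
print(tx.date.year)
print(tx.amount)
print(tx[1])

# namedtuples are immutable; _replace returns a modified copy.
updated = tx._replace(amount=200.0)

# _asdict converts the record into a dictionary keyed by field name.
asDict = tx._asdict()
print(updated)
print(asDict)
```

Since a namedtuple is still a tuple, instances of `TX` can be passed to `sc.parallelize` like any other Python values.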