# Week 5 lecture: more about RDDs

## Object-oriented programming: intro to `classes`

When calling a function in Python, we use syntax like the following:
```
min(5, 7)
```
But RDDs have this funny `.function()` syntax instead:
```
n = rdd1.count()
```
The function `count()` seems to live "inside" `rdd1`.

We can do this because Python is an *object-oriented* language.  Let's see what this means.

So far we have seen built-in types such as `int`, `float`, `str`, `list`, `dict`, and `tuple`.  In an object-oriented language we can actually make our own types:

In [53]:
# A "class" is a blueprint for a new type.
class Dog(object):
    
    # this is the constructor
    def __init__(self, breed, name):
        self.breed = breed  # these are variables that live "inside" the object
        self.name = name  # ...
        print("Birthing a new " + self.breed)
        
    # this function lives "inside" the object.
    # functions like these are sometimes called "methods" or "member functions"
    def speak(self):
        if self.breed == "chihuahua":
            print(self.name + " says nip nip")
        else:
            print(self.name + " says bow wow")
        
    # this is the destructor
    def __del__(self):
        print(self.name + " is dying")

In [40]:
barney = Dog("hound", "Barney")  # this creates an "object" or an "instance"

Birthing a new hound


In [41]:
barney.speak()  # NOTICE the . notation

Barney says bow wow


In [42]:
del barney

Barney is dying


In [43]:
annoying = Dog("chihuahua", "Frank")

Birthing a new chihuahua


In [44]:
annoying.speak()

Frank says nip nip


In [45]:
del annoying

Frank is dying


Objects are everywhere in Python.  Actually, even the simple built-in types `int`, `float`, `str`, `dict`, `list`, and `tuple` are objects.
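
A quick way to convince yourself of this is to call methods on built-in values directly. The sketch below uses only standard Python; the variable names are made up for illustration.

```python
# Built-in values carry methods, just like our Dog instances.
greeting = "hello"
shouted = greeting.upper()           # method call with the . notation
same_thing = str.upper(greeting)     # the same method, looked up on the class

numbers = [3, 1, 2]
numbers.sort()                       # some methods modify the object in place

print(shouted, same_thing, numbers)
print(isinstance(5, object), type(5).__name__)
```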

## Classes in Pyspark

In Pyspark the first class that you will encounter is `SparkContext`.  When you call it like a function `SparkContext(stuff...)` Python uses the class blueprint to create a new object (aka instance) in memory.  Now it is "alive".

So that the object can refer to itself, Python passes it as the first argument (named `self` by convention) to the class's `.__init__()` method, which performs further configuration.  In other words, Python effectively calls `SparkContext.__init__(self, stuff...)`.

Finally, `self` is returned and you can assign it to a variable (e.g. `sc` below):

In [46]:
from pyspark import SparkContext
sc = SparkContext('local', 'rdd_tutorial')  # this actually calls the .__init__ method inside the SparkContext class
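
The construction steps described above can be sketched by hand. `Point` is a hypothetical class invented for this illustration (it has nothing to do with Spark), and calling `__new__` and `__init__` directly is something you would rarely do in real code.

```python
class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

# The usual way: call the class like a function.
p = Point(1, 2)

# Roughly what Python does behind the scenes:
q = Point.__new__(Point)   # 1. create a blank object from the blueprint
Point.__init__(q, 1, 2)    # 2. pass it to __init__ as `self`
                           # 3. the finished object is handed back to the caller

print(p.x, p.y, q.x, q.y)
```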

Let's use a little bit of data to create a simple RDD (which is also an object)

In [54]:
data = [542 ,753, 2345 ,7, 3245, 5432, 76, 3]
data_rdd = sc.parallelize(data)
# data is an object (a list), and so is data_rdd (an RDD)

By convention in Python, types (aka classes, the blueprints) are usually `CamelCase`.  Variables (aka instances, aka objects) are `snake_case`.

The main exception to the `CamelCase` convention is the built-in types `int`, `float`, etc., which are all lowercase.

The RDD class defines a method (aka member function) `.map()` that we have been using.  Let's use it now to perform a simple transformation:

In [55]:
def increment(x):
    return x+1

newdata_rdd = data_rdd.map(increment)

Let's call another method, `.collect()`, to bring the transformed data back to the driver as a list:

In [56]:
result_list = newdata_rdd.collect()
result_list

[543, 754, 2346, 8, 3246, 5433, 77, 4]

How did this actually work?  The `increment` function was defined HERE (in the driver), but it was somehow "pushed" up into Spark and run on the RDD (using `.map`).

The answer is called **serialization**.

## Serialization in Python

*Serialization* is a big topic in data engineering (and programming in general).  The idea is to take something (e.g. data, or a function) that is "alive" in memory, snapshot it into raw bytes, and then push it out over the network.

The receiving machine can take these bytes and **deserialize** them, making the object "alive" again in its own memory.

There are a few standard *data* serialization formats in widespread use in data engineering, and they support many languages.  We'll talk about some of these later.

For shipping *functions*, however, we need something that is Python-specific.  Python's built-in serialization library is called `pickle` (think of pickling food).  It works like this:

In [60]:
import pickle

increment_frozen = pickle.dumps(increment)

In [61]:
type(increment_frozen)

bytes

In [62]:
len(increment_frozen)

25

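We can now ship `increment_frozen` across the network, or write it to disk, or whatever.  It is just bytes.  (Notice how short it is: `pickle` records only the function's module and name, not its code, so the receiving side must already have `increment` defined.  Pyspark relies on the `cloudpickle` library, which serializes the function's code itself.)  Now let's pretend we are on the receiving machine.  Let us make it live again: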

In [63]:
increment_thawed = pickle.loads(increment_frozen)

In [64]:
type(increment_thawed)

function

In [65]:
increment_thawed(5)

6
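
The same round trip works for plain data. A minimal sketch, with a made-up `record` dictionary standing in for whatever you want to ship:

```python
import pickle

record = {"name": "Barney", "scores": [542, 753, 7]}  # made-up example data

frozen = pickle.dumps(record)      # snapshot into raw bytes
thawed = pickle.loads(frozen)      # bring it back to life

print(type(frozen).__name__)       # bytes
print(thawed == record)            # the contents are equal...
print(thawed is record)            # ...but it is a separate object in memory
```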

## Grouping data in RDDs by key

Let's shift back to RDDs and some of their features.

One important way to slice and dice data is *grouping*.  For example, we might be interested in gathering statistics by day of week.

Let's start with a much simpler problem:  consider the following simple RDD

In [66]:
simple_rdd = sc.parallelize([6, 3, 4, 53, 654, 2, 5, 8, 1 , 65, 66, 54])

Let's say that we wanted to group this data together into two groups:  even and odd numbers.  By definition, an even number has remainder `0` when divided by `2`.  An odd number has remainder `1` when divided by `2`.

Python comes with a handy "remainder" operator (usually called the **modulo** operator) that will allow us to determine even/oddness.  Here are some examples:
```
0 % 2 = 0  # The remainder of 0 divided by 2.  Read '0 modulo 2'
1 % 2 = 1  # The remainder of 1 divided by 2.  Read '1 modulo 2'
2 % 2 = 0  # The remainder of 2 divided by 2.  Read '2 modulo 2'
3 % 2 = 1  # and so on
4 % 2 = 0
5 % 2 = 1
...
```
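
These can be checked directly in Python. The sketch below also shows one detail worth knowing: for a positive divisor, Python's `%` never returns a negative result, so the even/odd test works for negative integers too. The helper name `is_even` is made up for this example.

```python
def is_even(n):
    # n % 2 is 0 for even integers and 1 for odd integers
    return n % 2 == 0

remainders = [n % 2 for n in range(6)]
print(remainders)                   # [0, 1, 0, 1, 0, 1]
print(-3 % 2)                       # 1, not -1
print(is_even(654), is_even(53))
```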

Consider applying the following function to our dataset:

In [67]:
key_value_rdd = simple_rdd.map(lambda x: (x % 2, x))
key_value_rdd.collect()

[(0, 6),
 (1, 3),
 (0, 4),
 (1, 53),
 (0, 654),
 (0, 2),
 (1, 5),
 (0, 8),
 (1, 1),
 (1, 65),
 (0, 66),
 (0, 54)]

Notice that the new RDD `key_value_rdd` is a collection of tuples (each containing 2 elements).  We call these **key-value** pairs in Spark.  The "key" is either 0 or 1 (even or odd) in this example, and the "value" is just the original piece of data.

Another way to create the same thing is to use the `keyBy` function:

In [68]:
key_value_rdd = simple_rdd.keyBy(lambda x: x % 2)
key_value_rdd.collect()

[(0, 6),
 (1, 3),
 (0, 4),
 (1, 53),
 (0, 654),
 (0, 2),
 (1, 5),
 (0, 8),
 (1, 1),
 (1, 65),
 (0, 66),
 (0, 54)]
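
To see that the two approaches agree, here is a plain-Python sketch with no Spark involved. The helper `key_by` is hypothetical and only mimics, on an ordinary list, what `keyBy` produces.

```python
def key_by(items, key_func):
    # Pair every element with the key computed from it: (key, value)
    return [(key_func(x), x) for x in items]

simple_data = [6, 3, 4, 53, 654, 2, 5, 8, 1, 65, 66, 54]

via_map = [(x % 2, x) for x in simple_data]          # like .map(lambda x: (x % 2, x))
via_key_by = key_by(simple_data, lambda x: x % 2)    # like .keyBy(lambda x: x % 2)

print(via_map == via_key_by)
print(via_key_by[:3])
```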

Let's say we just wanted to know how many even and odd numbers there are.  We could use `countByKey` to determine this:

In [69]:
key_value_rdd.countByKey()

defaultdict(int, {0: 7, 1: 5})
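
Note that the result is an ordinary Python dictionary living on the driver, not an RDD. On a local list, the same tally could be sketched with `collections.Counter` (an analogy only, not how Spark implements it):

```python
from collections import Counter

pairs = [(x % 2, x) for x in [6, 3, 4, 53, 654, 2, 5, 8, 1, 65, 66, 54]]

# Count how often each key occurs, ignoring the values.
counts = Counter(key for key, _value in pairs)
print(dict(counts))
```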

We could even do something more interesting, e.g. sum all of the evens together and sum all of the odds together.  The thing to use here is `reduceByKey`:

In [70]:
even_odd_summed_rdd = key_value_rdd.reduceByKey(lambda x,y: x+y)
even_odd_summed_rdd.collect()

[(0, 794), (1, 127)]
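
Conceptually, `reduceByKey` combines the values that share a key, two at a time, using the function you supply. Because Spark may combine values in any order across partitions, the function should be associative and commutative (addition is both). A plain-Python sketch of the idea, with a hypothetical helper `reduce_by_key` operating on a local list:

```python
def reduce_by_key(pairs, func):
    # Fold the values of each key together, two values at a time.
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = func(totals[key], value)
        else:
            totals[key] = value
    return totals

pairs = [(x % 2, x) for x in [6, 3, 4, 53, 654, 2, 5, 8, 1, 65, 66, 54]]
print(reduce_by_key(pairs, lambda x, y: x + y))   # {0: 794, 1: 127}
```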

Instead of counting or reducing, we can simply group by key:

In [71]:
grouped_rdd = key_value_rdd.groupByKey()
grouped_rdd.collect()

[(0, <pyspark.resultiterable.ResultIterable at 0x7fc5a082a080>),
 (1, <pyspark.resultiterable.ResultIterable at 0x7fc5a082a0f0>)]

In *spirit* your output should look like
```
[(0, [6, 4, 654, 2, 8, 66, 54]),
 (1, [3, 53, 5, 1, 65])]
```
In actuality, Spark wraps each group in a `ResultIterable` object rather than a plain list, so you should see some nonsense about "iterable".  To actually get what we want (lists) we need to apply the `list()` function to each value.

Fortunately, Spark provides a `mapValues` function for just such an occasion:

In [72]:
grouped_rdd = key_value_rdd.groupByKey()
grouped_rdd = grouped_rdd.mapValues(list)
grouped_rdd.collect()

[(0, [6, 4, 654, 2, 8, 66, 54]), (1, [3, 53, 5, 1, 65])]

Another way to get to the same place (skipping the intermediate `key_value_rdd`) is to use `groupBy` instead:

In [73]:
grouped_rdd = simple_rdd.groupBy(lambda x: x % 2)
grouped_rdd = grouped_rdd.mapValues(list)
grouped_rdd.collect()

[(0, [6, 4, 654, 2, 8, 66, 54]), (1, [3, 53, 5, 1, 65])]
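
In plain Python the same grouping can be sketched with a `defaultdict`. The helper `group_by` is hypothetical and works on a local list; Spark does the equivalent across partitions.

```python
from collections import defaultdict

def group_by(items, key_func):
    groups = defaultdict(list)
    for x in items:
        groups[key_func(x)].append(x)   # file each element under its key
    return dict(groups)

simple_data = [6, 3, 4, 53, 654, 2, 5, 8, 1, 65, 66, 54]
print(group_by(simple_data, lambda x: x % 2))
```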

There are many other operations we can perform with keys.  For example, we can `.join()` two RDDs based on matching keys (just like joining tables in SQL).

We'll come to that later, though.

The documentation listing all the transformations and actions you can apply to an RDD is here:

http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.RDD