In [1]:
from __future__ import print_function

# Basic Spark 

We have discussed some [basic features of Spark](Spark_intro.ipynb) -- now we'll try to actually use the framework for some basic operations. 

In particular, this notebook will walk you through some of the basic [Spark RDD methods](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD). As you'll see, there is a lot more to it than `map` and `reduce`. It might be useful to have a look at [RDD Transformations](http://spark.apache.org/docs/latest/programming-guide.html#transformations) and [RDD Actions](http://spark.apache.org/docs/latest/programming-guide.html#actions) in the Spark programming guide, just to have an idea of what is available. 

We will explore the concept of "lineage" in Spark RDDs and construct some simple key-value pair RDDs to write our first Spark applications.

If you need a reminder of some of the Python concepts discussed earlier, you can make use of the [intro notebook](../intro/Spark_workshop_introduction.ipynb).

## Starting up the `SparkContext` locally

The most lightweight way of playing around with Spark is to run the whole Spark runtime on a single (local) machine. 

First, we need a few lines of setup (we can later move these to a startup script of some sort), and then we start the `SparkContext`:

In [2]:
import findspark
findspark.init()
import pyspark

In [3]:
sc = pyspark.SparkContext(master='local[*]')
print(sc)

<pyspark.context.SparkContext object at 0x1066e7350>


Hurrah! We have a Spark Context! Now let's get some data into the Spark universe: 

In [4]:
data = range(100)  # range works on Python 2 and 3; xrange exists only in Python 2
data_rdd = sc.parallelize(data)
print('Number of elements: ', data_rdd.count())
print('Sum and mean: ', data_rdd.sum(), data_rdd.mean())

Number of elements:  100
Sum and mean:  4950 49.5


Now if you look at your console, you will see *a lot* of output -- Spark is reporting all the stages of execution and can become rather verbose. Initially it's useful to inspect this output just to see what's going on and to see when issues arise. Later on we'll see how to quiet it down. 

In addition, each Spark application runs its own dedicated Web UI, accessible by default at `driver:4040`. In this case this is http://localhost:4040. 

This gives you a lot of nice information about the state of your job, including stats on execution time of individual tasks, available memory on all of the workers, links to worker logs, etc. You will probably begin to appreciate some of this information when things start to go wrong...

## Map/Reduce 

Let's bring some of the simple Python-only examples from the [first notebook](../intro/Spark_workshop_Introduction.ipynb) into the Spark framework. The first map function we made simply doubled the input array, so let's do the same here. 

Write the function `double_the_number` and then use this function with the `map` method of `data_rdd` to yield `double_rdd`:

In [5]:
def double_the_number(x):
    return x * 2

In [6]:
help(data_rdd.map)

Help on method map in module pyspark.rdd:

map(self, f, preservesPartitioning=False) method of pyspark.rdd.PipelinedRDD instance
    Return a new RDD by applying a function to each element of this RDD.
    
    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> sorted(rdd.map(lambda x: (x, 1)).collect())
    [('a', 1), ('b', 1), ('c', 1)]



In [7]:
double_rdd = data_rdd.map(double_the_number)

Not much happened here - or at least, no tasks were launched (you can check the console and the Web UI). Spark simply recorded that the `data_rdd` maps into `double_rdd` via the `map` method using the `double_the_number` function. You can see some of this information by inspecting the RDD debug string: 

In [8]:
print(double_rdd.toDebugString())

(4) PythonRDD[4] at RDD at PythonRDD.scala:43 []
 |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:396 []
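
The deferred execution described above can be imitated in plain Python with generators. This is only an analogy and not how Spark is implemented; `lazy_map` and the `calls` list are hypothetical names introduced for illustration.

```python
from itertools import islice

calls = []  # records every element the mapped function actually touches


def double_the_number(x):
    calls.append(x)
    return x * 2


def lazy_map(func, iterable):
    """Hypothetical stand-in for RDD.map: builds a recipe, computes nothing."""
    return (func(x) for x in iterable)


doubled = lazy_map(double_the_number, range(100))
calls_after_map = len(calls)            # nothing has been computed yet

first_ten = list(islice(doubled, 10))   # loosely analogous to take(10)
calls_after_take = len(calls)           # only the requested elements were computed

print(calls_after_map, first_ten, calls_after_take)
```

Defining the mapping costs nothing; work happens only when a result is requested, which is roughly what separates Spark transformations from actions.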


In [9]:
# comparing the first few elements of the original and mapped RDDs using take
print(data_rdd.take(10))
print(double_rdd.take(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]


Now if you go over to check on the [stages in the Spark UI](http://localhost:4040/stages/) you'll see that jobs were run to grab data from the RDD. In this case, a single task was run since all the numbers needed reside in one partition. Here we used `take` to extract a few RDD elements; it is a very convenient method for checking the data inside the RDD and debugging your map/reduce operations. 

Often, you will want to make sure that the function you define executes properly on the whole RDD. The most common way of forcing Spark to execute the mapping on all elements of the RDD is to invoke the `count` method: 

In [10]:
double_rdd.count()

100

If you now go back to the [stages page](http://localhost:4040/stages), you'll see that four tasks were run for this stage, one per partition. 

In our initial example of using `map` in pure Python code, we also used an inline lambda function. For a construct as simple as doubling the entire array, the lambda function is much neater than a separate function declaration. This works exactly the same way here.

Map the `data_rdd` to `double_lambda_rdd` by using a lambda function to multiply each element by 2: 

In [11]:
double_lambda_rdd = data_rdd.map(lambda x: x*2)
print(double_lambda_rdd.take(10))

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]


Finally, do a simple `reduce` step, adding up all the elements of `double_lambda_rdd`:

In [12]:
from operator import add
double_lambda_rdd.reduce(add)

9900

(Spark RDDs actually have a `sum` method which accomplishes essentially the same thing)
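
Because Spark reduces each partition separately and then combines the partial results, the function passed to `reduce` should be associative and commutative. A minimal plain-Python sketch of that two-stage pattern follows, assuming four equally sized partitions; the helper `partitioned_reduce` is hypothetical and not part of the Spark API.

```python
from functools import reduce
from operator import add, sub


def partitioned_reduce(func, data, num_partitions=4):
    """Hypothetical model of a distributed reduce: reduce each slice, then the partials."""
    size = len(data) // num_partitions
    partitions = [data[i * size:(i + 1) * size] for i in range(num_partitions)]
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)


doubled = [x * 2 for x in range(100)]

# Addition is associative and commutative, so the split does not matter.
sum_sequential = reduce(add, doubled)
sum_partitioned = partitioned_reduce(add, doubled)

# Subtraction is neither, so the answer depends on how the data is split.
sub_sequential = reduce(sub, doubled)
sub_partitioned = partitioned_reduce(sub, doubled)

print(sum_sequential, sum_partitioned)
print(sub_sequential == sub_partitioned)
```

With `add` both routes agree, while with `sub` they do not, which is why a function like subtraction gives partition-dependent results in a distributed reduce.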

## Filtering

A critical step in many analysis tasks is to filter down the input data. In Spark, this is another *transformation*, i.e. it takes an RDD and maps it to a new RDD via a filter function. The filter function needs to evaluate each element of the RDD to either `True` or `False`. 

Use `filter` with a lambda function to select all odd values: 

In [13]:
filtered_rdd = data_rdd.filter(lambda x : x % 2)
filtered_rdd.count()

50
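
The predicate in the cell above returns `x % 2`, i.e. the integers `0` or `1` rather than `True` or `False`. It works because Python treats `0` as false and any non-zero integer as true. A quick plain-Python check with the built-in `filter` (no Spark involved) shows the implicit and explicit forms select the same elements:

```python
data = range(100)

# Relies on truthiness: 0 is falsy, 1 is truthy.
odd_implicit = list(filter(lambda x: x % 2, data))

# Spells out the comparison so the predicate returns a real bool.
odd_explicit = list(filter(lambda x: x % 2 == 1, data))

print(len(odd_implicit), odd_implicit == odd_explicit)
```

The explicit comparison is a little longer but makes the intent of the filter easier to read.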

Of course we can now apply the `map` and double the `filtered_rdd` just as before: 

In [14]:
filtered_rdd.map(lambda x: x*2).take(10)

[2, 6, 10, 14, 18, 22, 26, 30, 34, 38]

Note that each RDD transformation returns a new RDD instance to the caller -- for example:

In [15]:
data_rdd.filter(lambda x: x % 2)

PythonRDD[12] at RDD at PythonRDD.scala:43

You can therefore string together many transformations without creating a separate variable for each step, so our `filter` + `map` step can be combined into one. Note that if we surround the operations with "( )" we can make the code more readable by placing each transformation on a separate line: 

In [16]:
composite = (data_rdd.filter(lambda x: x % 2)
                     .map(lambda x: x*2))

Again, if you now look at the [Spark UI](http://localhost:4040) you'll see that nothing actually happened -- no job was triggered. The `composite` RDD simply encodes the information needed to create it. 

If an action is executed that only requires a part of the RDD, only those parts will be computed. If we cache the RDD and only calculate a few of the elements, this will be made clear:

In [17]:
composite.cache()
composite.take(10)

[2, 6, 10, 14, 18, 22, 26, 30, 34, 38]

If you look at the [storage information](http://localhost:4040/storage/) you'll see that just a quarter of the RDD is cached. Now if we trigger the full calculation, this will increase to 100%:

In [18]:
composite.count()

50
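
Why a quarter? A rough plain-Python model of the behaviour is sketched here. It assumes the 100 elements are split into four equal slices (consistent with the `(4)` in the debug string shown earlier) and that partitions are evaluated one at a time until enough elements are found. The helper names are hypothetical.

```python
def make_partitions(data, num_partitions=4):
    """Split a list into equally sized slices (assumed partitioning)."""
    size = len(data) // num_partitions
    return [data[i * size:(i + 1) * size] for i in range(num_partitions)]


def compute_partition(part):
    """The composite pipeline: keep odd values, then double them."""
    return [x * 2 for x in part if x % 2]


partitions = make_partitions(list(range(100)))
cache = {}  # partition index -> computed result


def take(n):
    """Evaluate partitions in order, stopping once n elements are available."""
    result = []
    for idx, part in enumerate(partitions):
        if idx not in cache:
            cache[idx] = compute_partition(part)
        result.extend(cache[idx])
        if len(result) >= n:
            break
    return result[:n]


first_ten = take(10)
cached_fraction = len(cache) / float(len(partitions))
print(first_ten, cached_fraction)
```

In this model the first slice already holds more than ten odd numbers, so only one of the four partitions is ever computed and cached.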

## Key, value pair RDDs

`key`,`value` pair data is the "bread and butter" of map/reduce programming. Think of the `value` part as the meat of your data and the `key` part as some crucial metadata. For example, you might have time-series data for CO$_2$ concentration by geographic location: the `key` might be the coordinates or a time window, and `value` the CO$_2$ data itself. 

If your data can be expressed in this way, then the map/reduce computation model can be very convenient for pre-processing, cleaning, selecting, filtering, and finally analyzing your data. 

Spark offers a `keyBy` method that you can use to produce a key from your data. In practice this might not be useful often, but we'll use it here as an illustration: 

In [19]:
# key the RDD by x modulo 5
keyed_rdd = data_rdd.keyBy(lambda x: x%5)

In [20]:
keyed_rdd.take(20)

[(0, 0),
 (1, 1),
 (2, 2),
 (3, 3),
 (4, 4),
 (0, 5),
 (1, 6),
 (2, 7),
 (3, 8),
 (4, 9),
 (0, 10),
 (1, 11),
 (2, 12),
 (3, 13),
 (4, 14),
 (0, 15),
 (1, 16),
 (2, 17),
 (3, 18),
 (4, 19)]

This created keys with values 0-4 for each element of the RDD. We can now use the multitude of `key` transformations and actions that the Spark API offers. For example, we can revisit `reduce`, but this time do it by `key`: 

## `reduceByKey`

In [21]:
red_by_key = keyed_rdd.reduceByKey(add)
red_by_key.collect()

[(0, 950), (4, 1030), (1, 970), (2, 990), (3, 1010)]
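
For intuition, the same result can be reproduced in plain Python. `keyBy(f)` pairs each element `x` with `f(x)` as its key, and `reduceByKey(add)` folds together all values that share a key. This sketch uses a dictionary in place of an RDD and is not how Spark distributes the work:

```python
from operator import add

# keyBy(lambda x: x % 5): one (key, value) pair per element
pairs = [(x % 5, x) for x in range(100)]

# reduceByKey(add): combine the values key by key
totals = {}
for key, value in pairs:
    totals[key] = add(totals[key], value) if key in totals else value

print(sorted(totals.items()))
```

The order of the pairs returned by `collect()` on the real RDD can differ from the sorted order printed here, since it depends on how keys are assigned to partitions.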

Unlike the global `reduce`, `reduceByKey` is a *transformation*: it returns another RDD. Often, when we reduce by key, the dataset size is reduced enough that it is safe to pull it completely out of Spark and into the driver (i.e. this notebook). A useful way of doing this is the `collectAsMap` method, which converts the result to a Python dictionary for subsequent processing:

In [22]:
red_dict = red_by_key.collectAsMap()
red_dict

{0: 950, 1: 970, 2: 990, 3: 1010, 4: 1030}

In [23]:
# access by key
red_dict[0]

950

## `groupByKey`

If you want to collect the elements belonging to a key into a list in order to process them further, you can do this with `groupByKey`. Note that if you want to group the elements only to do a subsequent reduction, you are far better off using `reduceByKey`, because it does the reduction locally on each partition first before communicating the results to the other nodes. By contrast, `groupByKey` reshuffles the entire dataset because it has to group *all* the values for each key from all of the partitions. 
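
The difference in shuffle volume can be illustrated with a plain-Python count. This is a simplified model: it assumes four equal partitions and treats every record that leaves a partition as one shuffled record, whereas a real Spark shuffle involves more detail.

```python
from collections import defaultdict

pairs = [(x % 5, x) for x in range(100)]
partitions = [pairs[i * 25:(i + 1) * 25] for i in range(4)]  # assumed layout

# groupByKey-style: every (key, value) pair is sent across as-is.
grouped_shuffle = sum(len(part) for part in partitions)

# reduceByKey-style: sum locally first, then send one record per key per partition.
local_sums = []
for part in partitions:
    sums = defaultdict(int)
    for key, value in part:
        sums[key] += value
    local_sums.append(dict(sums))
reduced_shuffle = sum(len(sums) for sums in local_sums)

# Merging the partial sums gives the final per-key totals.
totals = defaultdict(int)
for sums in local_sums:
    for key, value in sums.items():
        totals[key] += value

print(grouped_shuffle, reduced_shuffle, dict(totals))
```

Even on this tiny dataset the local pre-aggregation cuts the number of records moved by a factor of five, and the gap grows with the number of values per key.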

In [24]:
keyed_rdd.groupByKey().collect()

[(0, <pyspark.resultiterable.ResultIterable at 0x106c00510>),
 (4, <pyspark.resultiterable.ResultIterable at 0x106c005d0>),
 (1, <pyspark.resultiterable.ResultIterable at 0x106c00650>),
 (2, <pyspark.resultiterable.ResultIterable at 0x106c00590>),
 (3, <pyspark.resultiterable.ResultIterable at 0x106c00490>)]

Note the ominous-looking `pyspark.resultiterable.ResultIterable`: this is exactly what it says, an iterable. You can turn it into a list or go through it in a loop. For example:

In [25]:
key, iterable = keyed_rdd.groupByKey().first()

In [26]:
list(iterable)

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

In [27]:
for val in iterable:
    print(val)

0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95


## `sortBy`

Use the `sortBy` method of `red_by_key` to sort the pairs by their sums in descending order, then `collect` the result into a list. 

In [28]:
# index into the (key, sum) pair; tuple-unpacking lambdas are a syntax error in Python 3
sorted_red = red_by_key.sortBy(lambda kv: kv[1], ascending=False).collect()

In [29]:
assert(sorted_red == [(4, 1030), (3, 1010), (2, 990), (1, 970), (0, 950)])
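
For comparison, the same ordering can be produced in plain Python with `sorted`. The key function indexes into the pair (`kv[1]`) rather than unpacking it in the lambda signature, because tuple parameters such as `lambda (x, count): ...` are valid only in Python 2. The list `red_pairs` simply copies the `reduceByKey` output shown earlier.

```python
red_pairs = [(0, 950), (4, 1030), (1, 970), (2, 990), (3, 1010)]

# Sort by the second element of each pair, largest first.
sorted_pairs = sorted(red_pairs, key=lambda kv: kv[1], reverse=True)
print(sorted_pairs)
```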

Finally, to shut down the `SparkContext`, call `sc.stop()`:

In [30]:
sc.stop()