In [1]:
from __future__ import print_function

# Basic Spark 

We have discussed some [basic features of Spark](Spark_intro.ipynb) -- now we'll try to actually use the framework for some basic operations. 

In particular, this notebook will walk you through some of the basic [Spark RDD methods](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD). As you'll see, there is a lot more to it than `map` and `reduce`.

We will explore the concept of "lineage" in Spark RDDs and construct some simple key-value pair RDDs to write our first Spark applications.

If you need a reminder of some of the Python concepts discussed earlier, you can make use of the [intro notebook](../intro/Spark_workshop_introduction.ipynb).

## Starting up the `SparkContext` locally

The most lightweight way of playing around with Spark is to run the whole Spark runtime on a single (local) machine. 

First, we need a few lines of setup (these can later be moved to a startup script of some sort), and then we start the `SparkContext`.

In [2]:
import sys, os, findspark
findspark.init()

In [3]:
try:
    print('SPARK_HOME set to %s' % os.environ['SPARK_HOME'])
except KeyError:
    raise KeyError('please exit, set the SPARK_HOME environment variable and restart')

SPARK_HOME set to /home/vagrant/spark


In [5]:
# set up the python path
#sys.path.insert(0, '%s/python/' % os.environ['SPARK_HOME'])
#sys.path.insert(0, '%s/python/lib/py4j-0.8.2.1-src.zip' % os.environ['SPARK_HOME'])

try:
    import pyspark
except ImportError:
    raise ImportError('make sure you actually have Spark downloaded and extracted...')

In [6]:
sc = pyspark.SparkContext(master='local[*]')
print(sc)

<pyspark.context.SparkContext object at 0x7ff018081c50>


Hurrah! We have a Spark Context! Now let's get some data into the Spark universe: 

In [7]:
data = range(100)  # range runs on both Python 2 and 3; xrange is Python 2 only
data_rdd = sc.parallelize(data)
print('Number of elements: ', data_rdd.count())
print('Sum and mean: ', data_rdd.sum(), data_rdd.mean())

Number of elements:  100
Sum and mean:  4950 49.5


Now if you look at your console, you will see *a lot* of output -- Spark is reporting all the stages of execution and can become rather verbose. Initially it's useful to inspect this output just to see what's going on and to see when issues arise. Later on we'll see how to quiet it down. 

In addition, each Spark application runs its own dedicated Web UI, accessible by default at `driver:4040`. In this case this is http://localhost:4040. 

This gives you a lot of nice information about the state of your job, including stats on execution time of individual tasks, available memory on all of the workers, links to worker logs, etc. You will probably begin to appreciate some of this information when things start to go wrong...

## Map/Reduce 

Let's bring some of the simple Python-only examples from the [first notebook](../intro/Spark_workshop_Introduction.ipynb) into the Spark framework. The first map function we made simply doubled the input array, so let's do the same here. 

Write the function `double_the_number` and then use this function with the `map` method of `data_rdd` to yield `double_rdd`:

In [8]:
def double_the_number(x):
    return x * 2

In [9]:
help(data_rdd.map)

Help on method map in module pyspark.rdd:

map(self, f, preservesPartitioning=False) method of pyspark.rdd.PipelinedRDD instance
    Return a new RDD by applying a function to each element of this RDD.
    
    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> sorted(rdd.map(lambda x: (x, 1)).collect())
    [('a', 1), ('b', 1), ('c', 1)]



In [10]:
double_rdd = data_rdd.map(double_the_number)

Not much happened here - or at least, no tasks were launched (you can check the console and the Web UI). Spark simply recorded that the `data_rdd` maps into `double_rdd` via the `map` method using the `double_the_number` function. You can see some of this information by inspecting the RDD debug string: 

In [11]:
print(double_rdd.toDebugString())

(1) PythonRDD[4] at RDD at PythonRDD.scala:43 []
 |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:396 []


In [12]:
# comparing the first few elements of the original and mapped RDDs
print(data_rdd.take(10))
print(double_rdd.take(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
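To make the "nothing happened yet" behaviour more concrete, here is a minimal plain-Python sketch of the idea. It is not how Spark is implemented: `LazySeq` is a hypothetical stand-in for an RDD that only records the steps (its lineage) when `map` is called and does the actual work when an action such as `take` is invoked.

```python
from itertools import islice


class LazySeq(object):
    """Hypothetical stand-in for an RDD: stores a recipe, not results."""

    def __init__(self, source, lineage):
        self._source = source    # zero-argument callable returning an iterator
        self.lineage = lineage   # list of recorded steps, oldest first

    @classmethod
    def parallelize(cls, data):
        return cls(lambda: iter(data), ['parallelize'])

    def map(self, f):
        # Transformation: only record the step; f is not called here.
        parent = self._source
        return LazySeq(lambda: (f(x) for x in parent()),
                       self.lineage + ['map(%s)' % f.__name__])

    def take(self, n):
        # Action: pull just enough elements through the recorded steps.
        return list(islice(self._source(), n))


calls = []


def double_the_number(x):
    calls.append(x)  # record each invocation so we can see when work happens
    return x * 2


data = LazySeq.parallelize(range(100))
doubled = data.map(double_the_number)
calls_after_map = len(calls)     # no calls yet: map only recorded the step

first_ten = doubled.take(10)
calls_after_take = len(calls)    # only the requested elements were computed

print(doubled.lineage)
print(calls_after_map, calls_after_take, first_ten)
```

The function is never called by `map` itself; it runs only for the ten elements that `take(10)` asks for, which is the same pattern you can observe in the Spark console and Web UI.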


Now if you go over to check on the [stages in the Spark UI](http://localhost:4040/stages/) you'll see that jobs were run to grab data from the RDD. In this case, a single task was run since all the numbers needed reside in one partition. Here we used `take` to extract a few RDD elements; it is a very convenient method for checking the data inside the RDD and debugging your map/reduce operations. 

Often, you will want to make sure that the function you define executes properly on the whole RDD. The most common way of forcing Spark to execute the mapping on all elements of the RDD is to invoke the `count` method: 

In [13]:
double_rdd.count()

100

If you now go back to the [stages page](http://localhost:4040/stages), you'll see that one task was run per partition for this stage (four tasks on a machine where `local[*]` yields four partitions). 
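The difference in task counts between `take` and `count` can be illustrated with a small plain-Python sketch. The helper names and the contiguous four-way split are assumptions made for illustration; Spark's own partitioning and scheduling are more involved, but the principle is the same: `take` can stop early, while `count` has to visit every partition.

```python
def split_into_partitions(data, num_partitions):
    """Split a sequence into contiguous chunks of near-equal size."""
    data = list(data)
    n = len(data)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def take(partitions, n):
    """Scan partitions in order and stop once n elements are collected."""
    out, scanned = [], 0
    for part in partitions:
        if len(out) >= n:
            break
        scanned += 1
        out.extend(part[:n - len(out)])
    return out, scanned


def count(partitions):
    """Counting needs every partition, so every partition is visited."""
    return sum(len(p) for p in partitions), len(partitions)


# Assumption: four partitions, as on a four-core machine with local[*].
partitions = split_into_partitions(range(100), 4)
taken, partitions_for_take = take(partitions, 10)
total, partitions_for_count = count(partitions)

print(taken, partitions_for_take)
print(total, partitions_for_count)
```

Here the first ten elements all live in the first chunk, so `take` touches one partition, whereas `count` touches all four.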

In our initial example of using `map` in pure Python code, we also used an inline lambda function. For a construct as simple as doubling the entire array, the lambda function is much neater than a separate function declaration. This works exactly the same way here.

Map the `data_rdd` to `double_lambda_rdd` by using a lambda function to multiply each element by 2: 

In [14]:
double_lambda_rdd = data_rdd.map(lambda x: x*2)
print(double_lambda_rdd.take(10))

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]


Finally, do a simple `reduce` step, adding up all the elements of `double_lambda_rdd`:

In [15]:
from operator import add
double_lambda_rdd.reduce(add)

9900
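It may help to see why the function passed to `reduce` should be commutative and associative. The sketch below is plain Python, not Spark, and the split into four chunks of 25 is an assumption for illustration: each chunk is reduced on its own and the partial results are then combined, which is roughly what happens when a reduction is spread over partitions.

```python
from functools import reduce
from operator import add, sub

doubled = [x * 2 for x in range(100)]

# Assumption: four contiguous partitions of 25 elements each.
partitions = [doubled[i:i + 25] for i in range(0, 100, 25)]

# Reduce each partition independently, then combine the partial results.
partial_sums = [reduce(add, part) for part in partitions]
total = reduce(add, partial_sums)

# Subtraction is neither commutative nor associative, so the partitioned
# result no longer agrees with a plain left-to-right reduction.
sequential_diff = reduce(sub, doubled)
partitioned_diff = reduce(sub, [reduce(sub, part) for part in partitions])

print(partial_sums, total)
print(sequential_diff == partitioned_diff)
```

With `add` the partitioned result matches the sum computed above; with `sub` the answer depends on how the data happens to be split, which is why such functions are not suitable for `reduce`.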

## Filtering

A critical step in many analysis tasks is to filter down the input data. In Spark, this is another *transformation*, i.e. it takes an RDD and maps it to a new RDD via a filter function. The filter function needs to evaluate each element of the RDD to either `True` or `False`. 

Use `filter` with a lambda function to select all values less than 10: 

In [16]:
filtered_rdd = data_rdd.filter(lambda x: x < 10)
filtered_rdd.count()

10

Of course we can now apply the `map` and double the `filtered_rdd` just as before: 

In [17]:
filtered_rdd.map(lambda x: x*2).take(10)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

## Key, value pair RDDs

`key`,`value` pair data is the "bread and butter" of map/reduce programming. Think of the `value` part as the meat of your data and the `key` part as some crucial metadata. For example, you might have time-series data for CO$_2$ concentration by geographic location: the `key` would be the coordinates, and `value` the CO$_2$ data itself. 

If your data can be expressed in this way, then the map/reduce computation model can be very convenient for pre-processing, cleaning, selecting, filtering, and finally analyzing your data. 

Spark offers a `keyBy` method that you can use to produce a key from your data. In practice this might not often be useful, but we'll do it here for the sake of an example: 

In [18]:
keyed_rdd = data_rdd.keyBy(lambda x: x%5)

In [19]:
keyed_rdd.take(10)

[(0, 0),
 (1, 1),
 (2, 2),
 (3, 3),
 (4, 4),
 (0, 5),
 (1, 6),
 (2, 7),
 (3, 8),
 (4, 9)]

This created keys with values 0-4 for each element of the RDD. We can now use the multitude of `key` transformations and actions that the Spark API offers. For example, we can revisit `reduce`, but this time do it by `key`: 

## `reduceByKey`

In [20]:
red_by_key = keyed_rdd.reduceByKey(add)
red_by_key.collect()

[(0, 950), (1, 970), (2, 990), (3, 1010), (4, 1030)]
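If it is not obvious where these numbers come from, the following plain-Python sketch reproduces the same result without Spark. The helpers `key_by` and `reduce_by_key` are hypothetical stand-ins written for this illustration; they mimic what the RDD methods compute, not how Spark computes it.

```python
from operator import add


def key_by(data, key_func):
    """Pair every element with a key derived from it: (key, element)."""
    return [(key_func(x), x) for x in data]


def reduce_by_key(pairs, func):
    """Merge all values that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    # Sorted only for readable output; the order of results collected
    # from Spark is not something to rely on.
    return sorted(merged.items())


keyed = key_by(range(100), lambda x: x % 5)
sums = reduce_by_key(keyed, add)

print(keyed[:5])
print(sums)
```

Each key collects every fifth number, so key 0 sums 0, 5, ..., 95 and each following key is larger by 20.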

Unlike the global `reduce`, `reduceByKey` is a *transformation*: it returns another RDD. Often, when we reduce by key, the dataset shrinks enough that it is safe to pull it completely out of Spark and into the driver (i.e. this notebook). A useful way of doing this is to convert it directly to a Python dictionary for subsequent processing: 

In [21]:
red_dict = red_by_key.collectAsMap()
red_dict

{0: 950, 1: 970, 2: 990, 3: 1010, 4: 1030}

In [22]:
# access by key
red_dict[0]

950

## `sortBy`

Use the `sortBy` method of `red_by_key` to sort the pairs by their sums in descending order, then collect the result into a list and print it out. 

In [23]:
# kv is a (key, sum) pair; sort on the sum, largest first
red_by_key.sortBy(lambda kv: kv[1], ascending=False).collect()

[(4, 1030), (3, 1010), (2, 990), (1, 970), (0, 950)]
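Once the reduced data is small enough to live in the driver, the same ordering can also be obtained with ordinary Python. This is a minimal sketch, assuming the pairs have already been collected into a list; the values are copied from the `reduceByKey` output above.

```python
# Pairs as collected from the reduceByKey step.
pairs = [(0, 950), (1, 970), (2, 990), (3, 1010), (4, 1030)]

# Sort on the second element of each pair (the sum), largest first.
by_sum_desc = sorted(pairs, key=lambda kv: kv[1], reverse=True)
top_key, top_sum = by_sum_desc[0]

print(by_sum_desc)
print(top_key, top_sum)
```

Sorting inside Spark is the better choice while the data is still large; sorting in the driver is fine for a handful of aggregated values like these.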

Finally, to shut down the `SparkContext`, call `sc.stop()`:

In [23]:
sc.stop()