# Science Data Analysis with Spark

[Spark](http://spark.apache.org) is a distributed data analysis framework that allows large data sets to be processed with minimal complexity for the user/scientist. It builds on the [Map/Reduce](http://en.wikipedia.org/wiki/MapReduce) computing paradigm, of which the Apache Hadoop project is perhaps the best-known implementation.

In this workshop we will first learn the basics of the map-reduce programming model and some associated `python` tricks. We will then use this knowledge to build simple Spark applications.

## Map-reduce 

The map-reduce programming model is quite simple: 

1. start with a collection of data and a function
2. apply the function to every element in the data collection
3. once the data has been massaged into a useful state, compute some aggregate value and return it to the user

Let's see how this works through a simple example. 

(Note: initially I will use very un-optimized and clumsy python code for the sake of clarity... we will make better-performing code by using e.g. `numpy` and/or list comprehensions/generators later on)

First, we define our data array. We're not being very creative here and just use 10 random integers between 0 and 100:

In [1]:
import random
random.seed(1)

In [2]:
data = []
for x in xrange(10) : data.append(random.randint(0,100))
data

[13, 85, 77, 25, 50, 45, 65, 79, 9, 2]

Let's say now that we want to compute the sum of all the values doubled. The most obvious (and language-agnostic) programming construct for this would be some sort of loop, in this case a `for` loop:

In [3]:
dbl_sum = 0
for x in data : 
    dbl_sum += x*2

In [4]:
dbl_sum

900

In this case, the calculation was entirely sequential -- we went through each element in `data`, doubled it, and added the result to the aggregate variable `dbl_sum`. But the two stages are separable -- we might first double all the elements in `data` and then sum them all together. This is exactly what a map-reduce operation would do. First, *map* each value to double the original, and then *reduce* the results to a single number by summing them together.
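
To make the separation concrete before bringing in `map` and `reduce`, here is a minimal sketch that splits the loop above into two passes. The names `doubled` and `total` are only illustrative, and the values of `data` are copied from the output of cell 2 so that the block runs on its own.

```python
# Values copied from the output of In [2] above
data = [13, 85, 77, 25, 50, 45, 65, 79, 9, 2]

# Stage 1 ("map"): build a new list holding every element doubled
doubled = []
for x in data:
    doubled.append(x * 2)

# Stage 2 ("reduce"): collapse the doubled list into a single number
total = 0
for x in doubled:
    total += x

print(total)  # 900
```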

As it turns out, the `python` language already includes the `map` and `reduce` functions so we can try this out immediately. First, we define the function that will be used by `map`:

In [5]:
def double_the_number(x) : 
    return x*2

Now we apply the `map` -- notice how compact this looks!

In [6]:
dbl_data = map(double_the_number, data)
dbl_data

[26, 170, 154, 50, 100, 90, 130, 158, 18, 4]

For the reduction, we will use the standard `add` operator: 

In [7]:
from operator import add

In [8]:
reduce(add, dbl_data)

900
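
The two steps can also be chained into a single expression. The sketch below is written so that it should run unchanged on Python 2.6+ and on Python 3: in Python 3, `reduce` lives in the `functools` module and `map` returns a lazy iterator rather than a list, which `reduce` consumes just as well. The values of `data` are copied from cell 2.

```python
from functools import reduce  # required on Python 3; also available on Python 2.6+
from operator import add

data = [13, 85, 77, 25, 50, 45, 65, 79, 9, 2]

def double_the_number(x):
    return x * 2

# map doubles each element, reduce folds the results together with add
dbl_sum = reduce(add, map(double_the_number, data))
print(dbl_sum)  # 900
```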

## "Lambda" functions

In the example above, our function `double_the_number` needed a lot of writing for a very simple operation -- it literally only multiplies the number by two. However, it is necessary to define a function in order to use `map` on the data array. In such cases, the concept of "in-line" functions comes in handy -- in Python, these are called "lambda" functions.

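The basic concept is that a lambda is a small anonymous function: it takes one or more arguments and returns the value of a single expression. When passed to `map`, it is called once for every item of an iterable object (list, dictionary, tuple, etc.) and the returned values make up the result. Here's how we can write the above using a lambda function: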

In [9]:
dbl_data = map(lambda x: x*2, data)
dbl_data

[26, 170, 154, 50, 100, 90, 130, 158, 18, 4]

This form has the advantage of being much more compact and allowing function creation "on the fly". The concept of in-line functions will be key to writing simple Spark applications!
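
A lambda can stand in anywhere a small function is expected, including as the two-argument function handed to `reduce`. The following sketch assumes the same `data` values as in cell 2 and imports `reduce` from `functools` so that it should also run on Python 3; the name `double` is made up for the example.

```python
from functools import reduce

data = [13, 85, 77, 25, 50, 45, 65, 79, 9, 2]

# A lambda is an expression, so it can be bound to a name like any other value
double = lambda x: x * 2
print(double(21))  # 42

# Both the map step and the reduce step expressed with in-line functions
dbl_sum = reduce(lambda a, b: a + b, map(lambda x: x * 2, data))
print(dbl_sum)  # 900
```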

## List comprehension

"List comprehension" is a complicated name for a pretty nice feature of python -- creating lists on the fly using any kind of iterable object and, you guessed it, often with the help of lambda functions. 

In many cases, a list comprehension can replace a `for` loop when creating a list of objects, and it often runs faster than the equivalent loop, largely because it avoids looking up and calling `append` on every iteration.

The basic syntax is that you enclose a `for` loop *inside* the list brackets `[]`. To make a simple (slightly contrived) example, consider: 

In [10]:
simple_list = [x for x in range(10)]
simple_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The `f(x) for x in y` construct is *extremely* powerful! Anything that can be iterated can be used as `y`. Note that in the case above, `f(x)` is just `x` itself, but it could be any function you want (including of course a lambda function!) 
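
As a small illustration of how general the construct is, the sketch below uses a range, a string and a dictionary as `y`, and both a named function and a lambda as `f`. The names `square`, `word`, `ages` and `half` are made up for this example.

```python
def square(x):
    return x * x

# f is a regular function, y is a range
squares = [square(x) for x in range(5)]
print(squares)  # [0, 1, 4, 9, 16]

# y is a string: iteration goes character by character
word = "spark"
upper_letters = [c.upper() for c in word]
print(upper_letters)  # ['S', 'P', 'A', 'R', 'K']

# y is a dictionary: iteration goes over its keys
ages = {"ann": 31, "bob": 27}
names = sorted([name for name in ages])
print(names)  # ['ann', 'bob']

# f is a lambda bound to a name
half = lambda x: x / 2.0
halves = [half(x) for x in [1, 2, 3]]
print(halves)  # [0.5, 1.0, 1.5]
```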

Let's make a simple list of tuples to see one common application of such list comprehensions:

In [11]:
tuple_list = zip([1,2,3,4], ['a','b','c','d'])
tuple_list

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

Now we want to extract just the letters out of this list:

In [12]:
[x[1] for x in tuple_list]

['a', 'b', 'c', 'd']

This notation is very elegant and allows us to do a reasonably complex operation (iterating over the list and extracting elements of a tuple into a new list) in a very simple way.
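
For comparison, here is a sketch of the explicit loop that the comprehension replaces, followed by a variant that unpacks each tuple directly in the `for` clause. `tuple_list` is written out literally so the block runs on its own, and the variable names are illustrative.

```python
tuple_list = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

# Explicit loop: create an empty list and append one letter per tuple
letters_loop = []
for pair in tuple_list:
    letters_loop.append(pair[1])

# Comprehension with unpacking: name both tuple members, keep the second
letters_comp = [letter for number, letter in tuple_list]

print(letters_loop)  # ['a', 'b', 'c', 'd']
print(letters_comp)  # ['a', 'b', 'c', 'd']
```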

Often, you will want to apply a condition on the values in the iterator when creating the new list via a list comprehension. The syntax allows for this quite simply through a trailing `if` clause. For example, if we wanted only the letters corresponding to all the even values:

In [13]:
[x[1] for x in tuple_list if x[0]%2 == 0]

['b', 'd']

A similar result can be obtained using the built-in `filter` function. Sometimes this offers an extra degree of flexibility and is preferred to the conditional inside a list comprehension. For example,

In [38]:
filtered_tuple_list = filter(lambda (x,y) : x%2 == 0, tuple_list)
filtered_tuple_list

[(2, 'b'), (4, 'd')]

Or to do the same as before: 

In [40]:
[x[1] for x in filter(lambda (i,j): i%2 == 0, tuple_list)]

['b', 'd']
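
Note that the `lambda (x,y): ...` form relies on tuple parameter unpacking, which exists only in Python 2. A sketch of an equivalent that should work on both Python 2 and Python 3 indexes into the tuple instead, and wraps `filter` in `list()` because Python 3 returns a lazy iterator there. The argument name `pair` is arbitrary.

```python
tuple_list = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

# Keep only the tuples whose first member is even
filtered_tuple_list = list(filter(lambda pair: pair[0] % 2 == 0, tuple_list))
print(filtered_tuple_list)  # [(2, 'b'), (4, 'd')]

# Extract the letters from the filtered tuples
even_letters = [pair[1] for pair in filtered_tuple_list]
print(even_letters)  # ['b', 'd']
```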

### Generator expressions

Unfortunately, lists can have considerable memory overhead when they become long enough. Often, we don't need to hold the entire list in memory, but only need the elements one by one -- this is the case with *all* reductions, for example, such as the sum we computed above.

In the cell below, two lists are actually created -- first, the one returned by `range` and once this one is iterated over, we have a second list resulting from the `x for x in range` part:

In [41]:
sum([x for x in range(100000)])

4999950000

When dealing with large amounts of data, the memory footprint becomes a serious concern and can make the difference between a program completing and crashing with an "out of memory" error.

Luckily, `python` has a neat solution for this, and it's called "generator expressions". The gist is that such an expression acts like an iterable, but only creates the items when they are requested, computing them on the fly. 

Generator expressions are written *exactly* the same way as list comprehensions, but using `()` instead of `[]`. Very nice.

So, let's see how this works:

In [31]:
sum((x for x in range(100000)))

4999950000
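
To get a rough feel for the difference, `sys.getsizeof` reports the size of the container object itself (not of the integers it refers to). In the sketch below the exact byte counts depend on the Python version and platform, so none are quoted; the point is only that the list grows with the number of elements while the generator object stays small. The outer parentheses can also be dropped when a generator expression is the sole argument of a function call.

```python
import sys

n = 100000

as_list = [x for x in range(n)]   # all n elements are stored at once
as_gen = (x for x in range(n))    # elements are produced one at a time, on request

print(sys.getsizeof(as_list))  # grows with n
print(sys.getsizeof(as_gen))   # small, independent of n

# The extra parentheses are optional when the generator is the only argument
total = sum(x for x in range(n))
print(total)  # 4999950000
```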

The downside is that the elements of a generator expression can be accessed exactly once, i.e. there is *no* indexing!

In [33]:
list_expression = [x for x in range(100)]
list_expression[5]

5

In [35]:
gen_expression = (x for x in range(100))
gen_expression[5]

TypeError: 'generator' object has no attribute '__getitem__'
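
"Exactly once" also means that a generator is spent after a single pass. The sketch below shows this, along with `itertools.islice` from the standard library as one way to reach a particular element without building a list; note that doing so consumes everything up to and including that element. The variable names are illustrative.

```python
from itertools import islice

gen_expression = (x for x in range(100))

first_pass = sum(gen_expression)
second_pass = sum(gen_expression)   # the generator is already exhausted
print(first_pass)   # 4950
print(second_pass)  # 0

# A fresh generator: skip the first five elements and take the next one
gen_expression = (x for x in range(100))
sixth = next(islice(gen_expression, 5, None))
print(sixth)  # 5

# Elements 0..5 are now consumed, so iteration resumes at 6
following = next(gen_expression)
print(following)  # 6
```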

Finally, because `range` is so common and so wasteful, `python` includes a lazy version of `range`, which is simply `xrange` -- instead of making a list, it produces the elements one by one as they are requested, but can otherwise be used in loops and comprehensions just like `range`. (In Python 3, `range` itself behaves this way and `xrange` no longer exists.)

In [42]:
(x for x in xrange(10))

<generator object <genexpr> at 0x2b3708061eb0>

In [43]:
sum((x for x in xrange(10)))

45
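
To illustrate what producing elements one by one means, here is a sketch of a hypothetical generator function, `lazy_count`, that mimics the simplest use of `xrange`. This is not how `xrange` is implemented; it only shows the idea of handing out values on demand with `yield`.

```python
def lazy_count(stop):
    """Hypothetical stand-in for xrange(stop): yield 0, 1, ..., stop - 1."""
    i = 0
    while i < stop:
        yield i   # hand one value to the caller, then pause here
        i += 1

# No list of ten elements is ever built; sum pulls the values one at a time
print(sum(lazy_count(10)))  # 45

# It can feed a comprehension just like range or xrange
print([x * 2 for x in lazy_count(5)])  # [0, 2, 4, 6, 8]
```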