# Science Data Analysis with Spark

[Spark](http://spark.apache.org) is a distributed data analysis framework that allows large data sets to be processed with a minimal amount of complexity for the user/scientist. It builds on the [Map/Reduce](http://en.wikipedia.org/wiki/MapReduce) computing paradigm, of which the Apache Hadoop project is perhaps the best-known implementation.

In this workshop we will first learn the basics of the map-reduce programming model and some associated `python` tricks. We will then use this knowledge to build simple Spark applications.

## Map-reduce 

The map-reduce programming model is quite simple: 

1. start with a collection of data and a function
2. apply the function to every element in the data collection
3. once the data has been massaged into a useful state, compute some aggregate value and return it to the user
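
The three steps above can be sketched in a few lines of plain `python`. This is only an illustration of the pattern, not how Spark or Hadoop implement it; the helper name `map_reduce` and its arguments are made up for this example.

```python
def map_reduce(func, combine, data, initial):
    """Hypothetical helper: apply `func` to every element, then fold with `combine`."""
    # step 2: apply the function to every element of the collection
    mapped = []
    for item in data:
        mapped.append(func(item))

    # step 3: compute an aggregate value from the transformed elements
    result = initial
    for value in mapped:
        result = combine(result, value)
    return result


# step 1: a collection of data and a function
numbers = [1, 2, 3]
total = map_reduce(lambda x: x * 2, lambda a, b: a + b, numbers, 0)
print(total)  # prints 12
```

The `initial` argument gives the aggregation a starting value, so an empty collection simply returns `initial`.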

Let's see how this works through a simple example. 

(Note: initially I will use very unoptimized and clumsy python code for the sake of clarity. Later on we will write better-performing code using e.g. `numpy` and/or list comprehensions/generators.)

First, we define our data array. In this case we're not very creative and just use 10 random integers in the range 0 to 100:

In [15]:
import random

In [31]:
data = []
# draw 10 random integers between 0 and 100 (inclusive)
for x in xrange(10) : data.append(random.randint(0,100))

In [18]:
data

[92, 58, 91, 36, 45, 11, 88, 66, 99, 13]

Let's say we now want to compute the sum of all the values after doubling each one. The most obvious (and language-agnostic) programming construct for this would be some sort of loop, in this case a `for` loop:

In [20]:
dbl_sum = 0
for x in data : 
    dbl_sum += x*2

In [21]:
dbl_sum

1198

In this case, the calculation was entirely sequential -- we went through each element in `data`, doubled it, and added the result to the aggregate variable `dbl_sum`. But the two stages are separable -- we might first double all the elements in `data` and then sum them all together. This is exactly what a map-reduce operation would do: first *map* each value to double the original, then *reduce* the results to a single number by summing them together.
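
To make the separation concrete, here is a small sketch that computes the same result both ways: once with the single sequential loop and once in two distinct stages. It hard-codes the ten values printed above so the result is reproducible; the variable names are only for illustration.

```python
# the ten sample values shown earlier, hard-coded for reproducibility
data = [92, 58, 91, 36, 45, 11, 88, 66, 99, 13]

# sequential version: double and accumulate in one pass
loop_total = 0
for x in data:
    loop_total += x * 2

# stage 1 ("map"): double every element, keeping the intermediate list
doubled = [x * 2 for x in data]

# stage 2 ("reduce"): collapse the intermediate list to a single number
staged_total = sum(doubled)

print("%d %d" % (loop_total, staged_total))  # prints 1198 1198
```

Because stage 1 treats every element independently, it is the part that can be spread over many workers, which is what makes the model attractive for distributed frameworks.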

As it turns out, the `python` language already includes the `map` and `reduce` functions so we can try this out immediately. First, we define the function that will be used by `map`:

In [22]:
def double_the_number(x) : 
    return x*2

Now we apply the `map` -- notice how compact this looks!

In [24]:
dbl_data = map(double_the_number, data)

In [25]:
dbl_data

[184, 116, 182, 72, 90, 22, 176, 132, 198, 26]

For the reduction, we will use the standard `add` operator: 

In [26]:
from operator import add

In [37]:
reduce(add, dbl_data)

1198
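
A portability note: the cells above assume Python 2, where `reduce` is a built-in and `map` returns a list. In Python 3, `reduce` lives in the `functools` module and `map` returns a lazy iterator, so a version that should run on both could look roughly like this (again using the ten sample values from above):

```python
from functools import reduce  # a built-in in Python 2, importable from functools in both
from operator import add


def double_the_number(x):
    return x * 2


data = [92, 58, 91, 36, 45, 11, 88, 66, 99, 13]

# wrap in list() so the result can be inspected on Python 3, where map is lazy
dbl_data = list(map(double_the_number, data))

# the third argument is the starting value, so an empty list reduces to 0
total = reduce(add, dbl_data, 0)
print(total)  # prints 1198
```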

We can use the map/reduce model on any kind of data collection -- let's see what we can do with an image:

In [46]:
import urllib2
import matplotlib.pyplot as plt

%matplotlib inline

f = urllib2.urlopen("http://google.com")
im = plt.imread(f)
plt.imshow(im)


URLError: <urlopen error [Errno 101] Network is unreachable>