# Distributed Data Analysis Workshop
## Day 1: Programming Background / Python basics

## Outline

#### 1. [MapReduce](#Map-reduce)
#### 2. [Lambda functions](#"Lambda"-functions)
#### 3. [List comprehensions](#List-comprehension) (and [Tuples](#Tuples))
#### 4. [Generator expressions](#Generator-expressions)
#### 5. [Generators](#Generators)

People have been doing this (using many machines for computation) for decades! 

But mostly for *creating* data on various HPC platforms, not processing it! 

Some interactive tools exist, but they are often specialized and hard to tailor to specific needs.

Distributed data processing has really been driven by *industry* for the past decade or so.

Enormous amounts of data sitting in data warehouses called for a sensible way to chug through it all as cheaply as possible.

---> [MapReduce (Dean & Ghemawat 2004)](http://research.google.com/archive/mapreduce.html) was born.

## Map-reduce 

The map-reduce programming model is at the heart of distributed data processing. In essence, it is quite simple: 

1. start with a collection of data and distribute it
2. define a function you want to use to operate on that data
3. apply the function to every element in the data collection (the *map* step)
4. once the data has been massaged into a useful state, compute some aggregate value and return it to the user (the *reduce* step)
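
The steps above can be sketched in plain Python on a single machine. This is only an illustration under stated assumptions: a list of chunks stands in for data distributed across workers, and `split_into_chunks`, `square` and `numbers` are hypothetical names, not part of any framework. The code should run under both Python 2.7 and Python 3.

```python
from functools import reduce
from operator import add

def split_into_chunks(items, n_chunks):
    """'Distribute' the data by cutting it into n_chunks pieces."""
    return [items[i::n_chunks] for i in range(n_chunks)]

def square(x):
    """The function we want to apply to every element."""
    return x * x

numbers = list(range(1, 11))
chunks = split_into_chunks(numbers, 3)

# map step: each chunk is processed independently of the others
mapped_chunks = [[square(x) for x in chunk] for chunk in chunks]

# reduce step: aggregate within each chunk, then combine the partial results
partial_sums = [reduce(add, chunk) for chunk in mapped_chunks]
total = reduce(add, partial_sums)
print(total)
```

Because no chunk needs to know anything about the others during the map step, each one could in principle be handled by a different machine.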

A few things to note: 

1. this is an extremely limiting programming model (compare to MPI where anything is possible)
2. strictly task-parallel --> individual tasks *never* communicate with each other (except maybe via shared variables)
3. very clear on intent *because* it is so limiting

Let's see how this works through a simple example. 

Before we get started, we need to get set up. If you have `git`, check out the repository:

```bash
> git clone git@github.com:rokroskar/spark_workshop.git
```

If you don't have `git` (why don't you?) download the zip file:

```bash
> wget https://github.com/rokroskar/spark_workshop/archive/master.zip
> unzip master.zip
```

(or if you don't have `wget` just go to http://github.com/rokroskar/spark_workshop and download from there)

### If you are using Vagrant:

```bash
> cd spark_workshop
> vagrant up
> vagrant ssh -c "notebooks/start_notebook.py --setup --launch"
```

### If you are running your own installation:

```bash
> cd spark_workshop
> notebooks/start_notebook.py --setup --launch
```

Look for output like this: 

```bash
To access the notebook, inspect the output below for the port number, then point your browser to https://localhost:<port_number>
[I 00:20:18.462 NotebookApp] Serving notebooks from local directory: /Users/rokstar/spark_workshop
[I 00:20:18.462 NotebookApp] 0 active kernels
[I 00:20:18.462 NotebookApp] The IPython Notebook is running at: https://[all ip addresses on your system]:8889/
```

and go to https://localhost:8889 (or whatever port it says the server is running on)

### Very very basic MapReduce example

First, we define our data array. In this case we're not very creative and just use 10 random integers in the range 0 to 100:

In [1]:
import random
random.seed(1) # initialized to make sure we get the same numbers every time
data = []
for x in xrange(10) : data.append(random.randint(0,100))
print data

[13, 85, 77, 25, 50, 45, 65, 79, 9, 2]


Let's say now that we want to compute the sum of all the values doubled. The most obvious choice for this is a loop, in this case a `for` loop:

In [2]:
dbl_sum = 0
for x in data : 
    dbl_sum += x*2
    
print dbl_sum

900


In this case, the calculation was entirely sequential:

we went through each element in `data`, doubled it, and added the result to the aggregate variable `dbl_sum` all in a single step. 

But the two stages are separable: 

we might *first* double all the elements in `data` and then sum them all together. 

This is exactly a map-reduce operation: 

1. *map* the values to be double the original
2. *reduce* them to a single number by summing them together. 

As it turns out, the `python` language already includes the `map` and `reduce` functions, so we can try this out immediately. First, we define the function that will be applied in the `map` step:

In [3]:
def double_the_number(x) : 
    return x*2

Now we apply the `map` -- notice how compact this looks!

In [4]:
dbl_data = map(double_the_number, data)
print dbl_data

[26, 170, 154, 50, 100, 90, 130, 158, 18, 4]


`map` implicitly loops over all of the elements of data and applies `double_the_number` to each one. 

For the reduction, we will use the standard `add` operator: 

In [5]:
from operator import add
from functools import reduce
reduce(add, dbl_data)

900
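
To make clear what `reduce` does with its arguments, here is a hand-written equivalent. It is a simplified sketch, not the library implementation: `my_reduce` is a hypothetical name, and it does not handle an empty input or an explicit initial value the way `functools.reduce` does.

```python
from functools import reduce
from operator import add

def my_reduce(func, items):
    """Fold items into one value by repeatedly applying func(accumulator, item)."""
    iterator = iter(items)
    accumulator = next(iterator)          # the first element seeds the accumulator
    for item in iterator:
        accumulator = func(accumulator, item)
    return accumulator

doubled = [26, 170, 154, 50, 100, 90, 130, 158, 18, 4]
print(my_reduce(add, doubled))                          # 900
print(my_reduce(add, doubled) == reduce(add, doubled))  # True
```

The reduction only ever holds the accumulator and the current element, never the whole collection.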

## "Lambda" functions

* our function `double_the_number` needed a lot of writing for a very simple operation 
* but! `map` *requires* a function to apply to the data array
* when the function needed is very simple, an "in-line" lambda function is a great fit

Basic idea: 

* when used with `map`, the lambda function is applied to each item of an *iterable* object (list, dictionary, tuple, etc.)
* it returns one element for each element it takes in

Here's how we can write the above using a lambda function: 

In [6]:
dbl_data = map(lambda x: x*2, data)
dbl_data

[26, 170, 154, 50, 100, 90, 130, 158, 18, 4]

This form has the advantage of being much more compact and allowing function creation "on the fly". 

The concept of in-line functions will be key to writing simple Spark applications!
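
As a small sketch of how far in-line functions can go, both the map and the reduce step of the doubling example can be written with lambdas. The names `values` and `doubled` are chosen here for illustration; the `list(...)` call is only needed under Python 3, where `map` returns a lazy iterator instead of a list.

```python
from functools import reduce

values = [13, 85, 77, 25, 50, 45, 65, 79, 9, 2]

# map with an in-line function
doubled = list(map(lambda x: x * 2, values))

# reduce with an in-line two-argument function instead of operator.add
dbl_sum = reduce(lambda a, b: a + b, doubled)
print(dbl_sum)  # 900
```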

Note that a `lambda` function is a function just like any other, and you can also give it a name (although that almost defeats the purpose of an in-line function...)

In [7]:
double_lambda = lambda x: x*2
print 'type of double_lambda is %s ' % type(double_lambda)

type of double_lambda is <type 'function'> 


## List comprehension

"List comprehension" is a complicated name for a pretty nice feature of python: creating lists on the fly using any kind of iterable object, often with the help of lambda functions. 

* in many cases, a handy replacement for `for` loops when creating lists of objects
* can sometimes perform faster than the equivalent `for` loop

A normal python list is made by 

In [8]:
my_list = [1, 2, 3, 4, 5]

The basic syntax for a *list comprehension* is that you enclose a `for` loop *inside* the list brackets `[]`. 

To make a simple (slightly contrived) example, consider: 

In [9]:
simple_list = [x for x in range(10)]
simple_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

(should really be called list *expansion*)

The equivalent `for` loop:

In [10]:
simple_list = []
for x in range(10) : 
    simple_list.append(x)
simple_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The list comprehension gives you much more concise syntax!
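
Whether the comprehension is also faster can be checked with the standard `timeit` module. The following is a rough sketch with hypothetical function names (`with_loop`, `with_comprehension`); the actual timings depend on the machine and the Python version, so no output is shown.

```python
import timeit

def with_loop(n):
    """Build the list with an explicit for loop and append."""
    result = []
    for x in range(n):
        result.append(x * 2)
    return result

def with_comprehension(n):
    """Build the same list with a list comprehension."""
    return [x * 2 for x in range(n)]

n = 100000
loop_time = timeit.timeit(lambda: with_loop(n), number=20)
comp_time = timeit.timeit(lambda: with_comprehension(n), number=20)
print("for loop:      %.3f s" % loop_time)
print("comprehension: %.3f s" % comp_time)
```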

Even neater when a conditional is used in the iteration: 

In [11]:
# only even numbers
simple_list = []
for x in range(10) : 
    if x % 2 == 0 :
        simple_list.append(x)
simple_list

[0, 2, 4, 6, 8]

In [12]:
[x for x in range(10) if x % 2 == 0]

[0, 2, 4, 6, 8]

The construct 

    f(x) for x in y  
    
is *extremely* powerful! 

Anything that can be iterated can be used as the `y`. In the case above, `f(x) = x`, but it could be any function you want (including of course a lambda function!) 
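
A few variations on this construct, as a minimal sketch with made-up example data (`word_counts` and the other names are purely illustrative):

```python
# y can be a string: iterating yields its characters
upper_letters = [c.upper() for c in "spark"]

# y can be a dictionary: iterating yields its keys
word_counts = {"map": 3, "reduce": 5}
long_keys = sorted([k for k in word_counts if len(k) > 3])

# f can be a named lambda function
square = lambda x: x * x
squares = [square(x) for x in (1, 2, 3)]

print(upper_letters)  # ['S', 'P', 'A', 'R', 'K']
print(long_keys)      # ['reduce']
print(squares)        # [1, 4, 9]
```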


### Tuples
Let's make a simple list of tuples to see one common application of such list comprehensions: 

In [13]:
tuple_list = zip([1,2,3,4], ['a','b','c','d'])
tuple_list

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

Now we want to extract just the letters out of this list:

In [14]:
[x[1] for x in tuple_list]

['a', 'b', 'c', 'd']

An even clearer syntax is to label the tuple elements that we are extracting from the list:

In [15]:
[letter for (num, letter) in tuple_list]

['a', 'b', 'c', 'd']

This notation is very elegant and allows us to do a reasonably complex operation (iterate over the list and extract elements of each tuple into a new list) in a very simple way. 

Just as we did above, a conditional can be applied to the tuple values when creating the new list.

For example, if we wanted only the letters corresponding to all the even values: 

In [16]:
[letter for (num, letter) in tuple_list if num%2 == 0]

['b', 'd']

## Filter

Sometimes some more complex logic needs to be applied to the values. 

* for such cases, use the `filter` function
* the filtering function can be any function of the form `f(x) --> boolean`

In [17]:
def filter_func(x) :
    num, letter = x
    return num%2 == 0

filtered_tuple_list = filter(filter_func, tuple_list)
filtered_tuple_list

[(2, 'b'), (4, 'd')]

We can of course use the results of `filter` also in a list comprehension:

In [18]:
[letter for (num,letter) in filter(filter_func, tuple_list)]

['b', 'd']

Slight **warning** here: 

the elements of the list are tuples and the function arguments don't expand the tuple automatically. That's why we have the extra line

    num, letter = x

which takes `num` and `letter` out of each tuple that gets passed to `filter_func`. 

The same would happen with a `lambda` function: 

In [19]:
# error
filter(lambda x,y: x%2==0, tuple_list)

TypeError: <lambda>() takes exactly 2 arguments (1 given)

In [20]:
# correct syntax
filter(lambda (x,y): x%2==0, tuple_list)

[(2, 'b'), (4, 'd')]
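
One caveat: the `lambda (x,y): ...` form relies on tuple parameter unpacking, which exists in Python 2 but was removed in Python 3 (PEP 3113). The sketch below shows two alternatives that should work in both versions; the `list(...)` calls are only needed under Python 3, where `zip` and `filter` return iterators.

```python
tuple_list = list(zip([1, 2, 3, 4], ['a', 'b', 'c', 'd']))

# index into the tuple instead of unpacking it in the argument list
evens = list(filter(lambda pair: pair[0] % 2 == 0, tuple_list))
print(evens)        # [(2, 'b'), (4, 'd')]

# or unpack inside a list comprehension, where it is always allowed
evens_again = [(num, letter) for (num, letter) in tuple_list if num % 2 == 0]
print(evens_again)  # [(2, 'b'), (4, 'd')]
```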

## Generator expressions 

Unfortunately, creating long lists can have large memory overhead. 

Often, we don't need to hold the entire lists in memory, but only need the elements one by one -- this is the case with *all* reductions, for example, such as the `sum` we used above. 

In the cell below, two lists are actually created: first, the one returned by `range`, and then, as that one is iterated over, a second list built by the `[x for x in range(...)]` comprehension:

In [21]:
sum([x for x in range(1000000)])

499999500000

When dealing with large amounts of data, the memory footprint becomes a serious concern and can make the difference between a program completing or crashing with an "out of memory" error. 

Luckily, `python` has a neat solution for this, and it's called "generator expressions". The gist is that such an expression acts like an **iterable**, but only creates the items when they are requested, computing them on the fly. 

Generator expressions use *exactly* the same syntax as list comprehensions, but with `()` instead of `[]`. Very nice. 

So, let's see how this works: 

In [22]:
print(sum(x for x in range(1000000)))

# now summing only the even numbers -- conditionals work just like in list comprehension
print(sum(x for x in range(1000000) if x % 2 == 0))

499999500000
249999500000
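
To see that the items really are computed only on request, here is a small sketch using a hypothetical helper, `noisy_double`, that records every call it receives:

```python
calls = []

def noisy_double(x):
    """Double x and record that the function was called with it."""
    calls.append(x)
    return x * 2

lazy = (noisy_double(x) for x in range(5))
print(len(calls))   # 0 -- creating the expression computes nothing

first = next(lazy)
print(first)        # 0
print(calls)        # [0] -- only the requested item was computed

rest = list(lazy)
print(rest)         # [2, 4, 6, 8]
print(calls)        # [0, 1, 2, 3, 4]
```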


The downside is that the elements of a generator expression can be accessed exactly once, i.e. there is *no* indexing!

In [23]:
list_expression = [x for x in range(100)]
list_expression[5]

5

In [24]:
gen_expression = (x for x in range(100))
gen_expression[5]

TypeError: 'generator' object has no attribute '__getitem__'
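
If a particular element is needed anyway, `itertools.islice` from the standard library can step forward to it, at the cost of consuming everything before it. A minimal sketch, which also shows that an exhausted generator yields nothing on a second pass:

```python
from itertools import islice

gen_expression = (x for x in range(100))

# islice skips ahead to position 5, consuming elements 0..5 on the way
fifth = next(islice(gen_expression, 5, 6))
print(fifth)       # 5

# the generator continues from where it stopped; earlier values are gone
following = next(gen_expression)
print(following)   # 6

# summing consumes the remaining elements 7..99
remaining = sum(gen_expression)
print(remaining)   # 4929

# a second pass over the exhausted generator finds nothing left
second_pass = sum(gen_expression)
print(second_pass) # 0
```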

Finally, because `range` is so common and so wasteful, `python` includes a lazy version of `range`, which is simply called `xrange` -- instead of making a list, it produces the elements one by one on demand, but otherwise behaves much like `range`. 

*note:* this is the default `range` behavior in python3

Compare memory usage of these two executions: 

In [25]:
%load_ext memory_profiler
%memit sum([x for x in range(100000000)])

peak memory: 3197.58 MiB, increment: 3149.74 MiB


In [26]:
%memit sum(x for x in xrange(100000000))

peak memory: 51.28 MiB, increment: 0.06 MiB
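
If `memory_profiler` is not installed, a rough impression can be had from `sys.getsizeof` in the standard library. This is only a sketch: `getsizeof` reports the size of the container object itself, not of the integers it refers to, and the exact numbers vary with the Python version and platform, so no output is shown.

```python
import sys

n = 1000000
as_list = [x for x in range(n)]
as_gen = (x for x in range(n))

# the list object grows with n; the generator object stays small
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_gen))

# both produce the same sum
print(sum(as_list) == sum(as_gen))
```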


## Generators
Closely related to generator *expressions* are *generators* - they are 

* functions that keep track of their internal state when they return 
* on next call they continue from where they left off. 

It's easy to illustrate this by writing our own version of `xrange` discussed above.

In [27]:
def my_xrange(N) :
    i = 0
    while i < N :
        yield i
        i += 1

In [28]:
gen = my_xrange(10)
gen

<generator object my_xrange at 0x1045771e0>

In [29]:
print 'first value', gen.next()
print 'next value', gen.next()

first value 0
next value 1


In [30]:
[x for x in gen]

[2, 3, 4, 5, 6, 7, 8, 9]

In [31]:
# exhausted iterator
gen.next()

StopIteration: 
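
The `gen.next()` method is specific to Python 2; the built-in `next(gen)` works in Python 2.6+ and Python 3 alike. The sketch below repeats `my_xrange` so that it is self-contained and shows three ways of dealing with the end of a generator:

```python
def my_xrange(N):
    i = 0
    while i < N:
        yield i
        i += 1

gen = my_xrange(3)

# the built-in next() resumes the generator until its next yield
print(next(gen))            # 0

# a for loop handles StopIteration for us and simply ends
for value in gen:
    print(value)            # 1, then 2

# a default value avoids the exception on an exhausted generator
print(next(gen, None))      # None

# catching the exception explicitly
try:
    next(gen)
except StopIteration:
    print("generator is exhausted")
```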

This only scratches the surface of generator functionality in `python`, but for our purposes it is enough. For a more complete discussion see e.g. [the python wiki](https://wiki.python.org/moin/Generators) and [this pretty good example](http://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/). 

Generators and generator expressions are useful in general when dealing with large data objects because they allow you to iterate through the data without ever holding all of it in memory at once. 

The concept of generators will be useful when we discuss the `mapPartitions` RDD method in Spark.
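
As a preview, here is a plain-Python sketch that does not use Spark at all. `mapPartitions` hands a function an iterator over the elements of one partition and expects an iterable back, which makes a generator function a natural fit. In this sketch a list of lists stands in for the partitions of an RDD; `double_partition` and `partitions` are hypothetical names chosen for illustration.

```python
def double_partition(iterator):
    """Consume one partition lazily and yield each element doubled."""
    for x in iterator:
        yield x * 2

# stand-in for an RDD split into three partitions (assumption for illustration)
partitions = [[13, 85, 77], [25, 50, 45], [65, 79, 9, 2]]

# apply the generator function to each partition independently
results = [list(double_partition(iter(part))) for part in partitions]
print(results)  # [[26, 170, 154], [50, 100, 90], [130, 158, 18, 4]]

# combine the per-partition results into the final sum
total = sum(sum(part) for part in results)
print(total)    # 900
```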