# Analyzing large datasets
## Day 0: Basics

## Purpose

What are we doing here? 

* analyzing data on a laptop or even a capable workstation doesn't cut it anymore
* individual data sets are quickly growing to many GB of *raw* data
* add the memory overhead of processing on top of this and you're stuck!
* on top of it all, we'd like some interactivity please! It's hard to really understand data without exploration

## Outline

#### 1. [MapReduce](#Map-reduce)
#### 2. [Lambda functions](#"Lambda"-functions)
#### 3. [List comprehensions](#List-comprehension) (and [Tuples](#Tuples))
#### 4. [Generator expressions](#Generator-expressions)
#### 5. [Generators](#Generators)

Sooner or later, data analysis needs to become *distributed*. 

This doesn't mean using more cores on your laptop - it means using many machines in a coordinated way. 

People have been doing this (using many machines for computation) for decades! 

But mostly for *creating* data on various HPC platforms, not processing it! 

Some interactive tools exist, but they are often specialized and hard to tailor to specific needs.

Distributed data processing has really been driven by *industry* for the past decade or so.

Data warehouses holding enormous amounts of data needed a sensible way to chug through it all as cheaply as possible.

---> Map/Reduce was born

## Map-reduce 

The map-reduce programming model is at the heart of distributed data processing. In essence, it is quite simple: 

1. start with a collection of data and distribute it
2. define a function you want to use to operate on that data
3. apply the function to every element in the data collection (the *map* step)
4. once the data has been massaged into a useful state, compute some aggregate value and return it to the user (the *reduce* step)
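
The steps above can be sketched on a single machine, as a rough illustration only. Here the "distribution" is faked by dealing the data out into three plain lists, each standing in for one machine; the helper `split_into_chunks` and the choice of three chunks are assumptions made for this sketch, not part of any framework. The code is written to run under both Python 2 and Python 3.

```python
from functools import reduce  # a builtin in Python 2, must be imported in Python 3
from operator import add


def split_into_chunks(items, n_chunks):
    """Hypothetical helper: deal the items out into n_chunks lists."""
    chunks = []
    for i in range(n_chunks):
        chunks.append(items[i::n_chunks])
    return chunks


def double_the_number(x):
    # step 2: the function we want to apply to every element
    return x * 2


data = [13, 85, 77, 25, 50, 45, 65, 79, 9, 2]

# step 1: "distribute" the data; each chunk stands in for one machine
chunks = split_into_chunks(data, 3)

# step 3: the map step, applied to each chunk independently
mapped_chunks = []
for chunk in chunks:
    mapped_chunks.append(list(map(double_the_number, chunk)))

# step 4: the reduce step, first within each chunk, then across chunks
partial_sums = []
for chunk in mapped_chunks:
    partial_sums.append(reduce(add, chunk))
total = reduce(add, partial_sums)

print(partial_sums)  # [210, 428, 262]
print(total)         # 900
```

The point of the sketch is that no chunk needs to know about any other chunk until the very last line, which is what makes the model easy to spread over many machines.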

Let's see how this works through a simple example. 

(Note: initially I will use very un-optimized and clumsy python code for the sake of clarity... we will make better-performing code by using e.g. `numpy` and/or list comprehensions/generators later on)

First, we define our data array. We're not being very creative here and just use 10 random integers in the range 0 - 100:

In [1]:
import random
random.seed(1) # initialized to make sure we get the same numbers every time

data = []
for x in xrange(10) : data.append(random.randint(0,100))
print data

[13, 85, 77, 25, 50, 45, 65, 79, 9, 2]


Let's say now that we want to compute the sum of all the values doubled. The most obvious choice for this is some sort of loop, in this case a `for` loop:

In [2]:
dbl_sum = 0
for x in data : 
    dbl_sum += x*2
    
print dbl_sum

900


In this case, the calculation was entirely sequential:

we went through each element in `data`, doubled it, and added the result to the aggregate variable `dbl_sum` all in a single step. 

But the two stages are separable: 

we might *first* double all the elements in `data` and then sum them all together. 

This is exactly a map-reduce operation: 

1. *map* the values to be double the original
2. *reduce* them to a single number by summing them together. 

As it turns out, the `python` language already includes the `map` and `reduce` functions so we can try this out immediately. First, we define the function that will be used as a `map`:

In [3]:
def double_the_number(x) : 
    return x*2

Now we apply the `map` -- notice how compact this looks!

In [4]:
dbl_data = map(double_the_number, data)
print dbl_data

[26, 170, 154, 50, 100, 90, 130, 158, 18, 4]


For the reduction, we will use the standard `add` operator: 

In [5]:
from operator import add
reduce(add, dbl_data)

900
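
The cells in this section are written for Python 2. As a hedged side note, the same map-reduce can be written so that it also runs under Python 3, where `reduce` lives in the `functools` module and `map` returns a lazy iterator rather than a list. The sketch below hard-codes the ten values printed above instead of regenerating them, since the random number stream differs between Python versions.

```python
from functools import reduce  # needed in Python 3, harmless in Python 2
from operator import add

data = [13, 85, 77, 25, 50, 45, 65, 79, 9, 2]


def double_the_number(x):
    return x * 2


# wrap map() in list() so that the result can be printed under Python 3 too
dbl_data = list(map(double_the_number, data))
print(dbl_data)  # [26, 170, 154, 50, 100, 90, 130, 158, 18, 4]

# reduce() happily consumes either a list or a lazy iterator
total = reduce(add, map(double_the_number, data))
print(total)  # 900
```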

## "Lambda" functions

In the example above, our function `double_the_number` needed a lot of writing for a very simple operation -- it only multiplies the number by two. However, it is necessary to define a function in order to use `map` on the data array and sometimes that function does indeed end up being very simple. In such cases, the concept of "in-line" functions comes in handy -- in Python, these are called "lambda" functions. 

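The basic concept is that a lambda function is an anonymous function defined in a single expression: it takes its arguments and returns the value of that expression. When it is passed to `map`, it is called once for each item of an iterable object (list, dictionary, tuple, etc.) and returns the corresponding new element. Here's how we can write the above using a lambda function: 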

In [6]:
dbl_data = map(lambda x: x*2, data)
dbl_data

[26, 170, 154, 50, 100, 90, 130, 158, 18, 4]

This form has the advantage of being much more compact and allowing function creation "on the fly". The concept of in-line functions will be key to writing simple Spark applications!
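
As a small sketch of the same idea, a lambda can stand in for both the map function and the reduce function. The name `double` is introduced here only for illustration, and the `data` values are hard-coded copies of the ones used above; the code should run under both Python 2 and Python 3.

```python
from functools import reduce  # a builtin in Python 2, must be imported in Python 3

data = [13, 85, 77, 25, 50, 45, 65, 79, 9, 2]

# a lambda is just a function object; binding it to a name makes it
# behave like double_the_number defined with `def`
double = lambda x: x * 2
print(double(21))  # 42

# lambdas for both the map step and the reduce step
total = reduce(lambda a, b: a + b, map(lambda x: x * 2, data))
print(total)  # 900
```

For anything longer than a single expression, a regular `def` is usually easier to read and to debug.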

## List comprehension

"List comprehension" is a complicated name for a pretty nice feature of python -- creating lists on the fly using any kind of iterable object and, you guessed it, often with the help of lambda functions. 

In many cases, a list comprehension can replace a `for` loop when creating a list of objects. It also often runs faster than the equivalent `for` loop, in part because it avoids looking up and calling the list's `append` method on every iteration. 

The basic syntax is that you enclose a `for` loop *inside* the list brackets `[]`. 

To make a simple (slightly contrived) example, consider: 

In [7]:
simple_list = [x for x in range(10)]
simple_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The equivalent `for` loop:

In [1]:
simple_list = []
for x in range(10) : 
    simple_list.append(x)
simple_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The list comprehension gives you much more concise syntax!

The construct 

    f(x) for x in y  
    
is *extremely* powerful! 

Anything that can be iterated can be used as the `y`. In the case above, `f(x)` is just `x` itself, but it could be any function you want (including of course a lambda function!) 
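
A few variations on the `f(x) for x in y` pattern, as a rough sketch. The `data` list is a hard-coded copy of the values used earlier and the name `square` is made up for this example.

```python
data = [13, 85, 77, 25, 50, 45, 65, 79, 9, 2]

# f(x) is an arithmetic expression, y is a list
doubled = [x * 2 for x in data]

# f(x) is a call to a lambda function, y is a range
square = lambda x: x * x
squares = [square(x) for x in range(5)]

# f(x) is a method call, y is a string (iterated character by character)
upper = [c.upper() for c in "abc"]

print(doubled)  # [26, 170, 154, 50, 100, 90, 130, 158, 18, 4]
print(squares)  # [0, 1, 4, 9, 16]
print(upper)    # ['A', 'B', 'C']
```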


### Tuples
Let's make a simple list of tuples to see one common application of such list comprehensions: 

In [3]:
tuple_list = zip([1,2,3,4], ['a','b','c','d'])
tuple_list

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

Now we want to extract just the letters out of this list:

In [4]:
[x[1] for x in tuple_list]

['a', 'b', 'c', 'd']

An even clearer syntax is to label the tuple elements that we are extracting from the list:

In [7]:
[letter for (num, letter) in tuple_list]

['a', 'b', 'c', 'd']

This notation is very elegant and allows us to do a reasonably complex operation (iterating over the list and extracting elements of each tuple into a new list) in a very simple way. 

Often, you will want to apply a condition on the values in the iterator when creating the new list via a list comprehension. This is done quite simply by including an `if` statement. 

For example, if we wanted only the letters corresponding to all the even values: 

In [18]:
[letter for (num, letter) in tuple_list if num%2 == 0]

['b', 'd']

A similar result can be obtained using the built-in `filter` function. 

Sometimes this offers an extra degree of flexibility and is preferred to the conditional inside a list comprehension, especially if the filtering condition is more complicated. For example, 

In [19]:
def filter_func(x) :
    num, letter = x
    return num%2 == 0

filtered_tuple_list = filter(filter_func, tuple_list)
filtered_tuple_list

[(2, 'b'), (4, 'd')]

We can of course use the results of `filter` also in a list comprehension:

In [16]:
[letter for (num,letter) in filter(filter_func, tuple_list)]

['b', 'd']

Slight **warning** here: 

the elements of the list are tuples and the function arguments don't expand the tuple automatically. That's why we have the extra line

    num, letter = x

which takes `num` and `letter` out of each tuple that gets passed to `filter_func`. 

The same would happen with a `lambda` function: 

In [22]:
# error
filter(lambda x,y: x%2==0, tuple_list)

TypeError: <lambda>() takes exactly 2 arguments (1 given)

In [23]:
# correct syntax
filter(lambda (x,y): x%2==0, tuple_list)

[(2, 'b'), (4, 'd')]
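
Note that the `lambda (x,y): ...` form relies on tuple parameter unpacking, which exists only in Python 2; it was removed in Python 3 (PEP 3113). A portable alternative, sketched below, is to accept the whole tuple as a single argument and index into it. The argument name `pair` is arbitrary.

```python
# list() around zip() and filter() makes the results lists under Python 3 as well
tuple_list = list(zip([1, 2, 3, 4], ['a', 'b', 'c', 'd']))

# take the whole tuple as one argument and index into it
evens = list(filter(lambda pair: pair[0] % 2 == 0, tuple_list))
print(evens)  # [(2, 'b'), (4, 'd')]

# unpacking in the `for` clause of a comprehension works in both versions
letters = [letter for (num, letter) in evens]
print(letters)  # ['b', 'd']
```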

## Generator expressions 

Unfortunately, lists can have considerable memory overhead when they become long enough. Often, we don't need to hold the entire list in memory, but only need the elements one by one -- this is the case with *all* reductions, for example, such as the `sum` we used above. 

In the cell below, two lists are actually created -- first, the one returned by `range` and once this one is iterated over, we have a second list resulting from the `x for x in range` part:

In [13]:
sum([x for x in range(100000)])

4999950000

When dealing with large amounts of data, the memory footprint becomes a serious concern and can make the difference between code completing and code crashing with an "out of memory" error. 

Luckily, `python` has a neat solution for this, and it's called "generator expressions". The gist is that such an expression acts like an **iterable**, but only creates the items when they are requested, computing them on the fly. 

Generator expressions work *exactly* the same way as list comprehension, but using `()` instead of `[]`. Very nice. 

So, let's see how this works: 

In [14]:
sum((x for x in range(100000)))

4999950000
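
To get a rough feeling for the difference in footprint, one can compare the two objects with `sys.getsizeof`. This is only a sketch: `getsizeof` reports the size of the container object itself (not of the integers it refers to), and the exact byte counts depend on the Python version and platform, so no numbers are quoted here.

```python
import sys

n = 100000

as_list = [x for x in range(n)]  # all n references are stored up front
as_gen = (x for x in range(n))   # nothing is computed yet

list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)

print("list: %d bytes, generator expression: %d bytes" % (list_bytes, gen_bytes))

# both give the same answer when reduced
list_total = sum(as_list)
gen_total = sum(as_gen)
print(list_total == gen_total)  # True
```

The size of the generator object does not grow with `n`, while the list's does.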

The downside is that the elements of a generator expression can only be consumed in order and exactly once: there is *no* indexing, and once an element has been produced it is gone.

In [15]:
list_expression = [x for x in range(100)]
list_expression[5]

5

In [16]:
gen_expression = (x for x in range(100))
gen_expression[5]

TypeError: 'generator' object has no attribute '__getitem__'
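
The "exactly once" behaviour is easy to miss, so here is a short sketch of it. It also shows `itertools.islice` from the standard library as one possible way to skip ahead to a given position when an element is needed by position; it still consumes everything before that position.

```python
from itertools import islice

gen_expression = (x for x in range(5))

first_pass = sum(gen_expression)   # consumes all the elements
second_pass = sum(gen_expression)  # nothing is left, so the sum is 0

print(first_pass)   # 10
print(second_pass)  # 0

# skipping ahead: drop the first 5 elements, then take the next one
gen_expression = (x for x in range(100))
sixth = next(islice(gen_expression, 5, None))
print(sixth)  # 5
```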

Finally, because `range` is so common and so wasteful, `python` includes a lazy version of `range` which is simply `xrange` -- instead of making a list, it produces the elements one by one as they are requested, but for the purposes of iteration it otherwise behaves like `range`. (In Python 3, `range` itself is lazy in this way and `xrange` no longer exists.) 

In [17]:
(x for x in xrange(10))

<generator object <genexpr> at 0x106b75fa0>

In [18]:
sum((x for x in xrange(10)))

45

## Generators
Closely related to generator *expressions* are *generators* - they are functions that keep track of their internal state even when they return to the caller. When they are called again, they continue from where they left off. 

It's easy to illustrate this with writing our own version of `xrange` discussed above. 

In [2]:
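def my_xrange(N) :
    i = 0
    while i < N :
        yield i
        i += 1
    # no explicit `raise StopIteration` is needed: when the function body
    # ends, the generator signals exhaustion on its own (and raising it
    # by hand inside a generator is an error in Python 3.7 and later)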

In [9]:
gen = my_xrange(10)
gen

<generator object my_xrange at 0x106b75dc0>

In [10]:
gen.next()

0

In [11]:
# continuing where it left off
gen.next()

1

In [12]:
[x for x in gen]

[2, 3, 4, 5, 6, 7, 8, 9]

In [13]:
# exhausted iterator
gen.next()

StopIteration: 
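
The `gen.next()` method used in the cells above is specific to Python 2. A portable spelling, sketched below, is the built-in `next()` function, which also accepts a default value to return instead of raising `StopIteration`. The generator `count_up_to` is a made-up stand-in for `my_xrange` so that the block is self-contained.

```python
def count_up_to(n):
    """Hypothetical generator: yields 0, 1, ..., n-1, like my_xrange."""
    i = 0
    while i < n:
        yield i
        i += 1


gen = count_up_to(3)

first = next(gen)   # runs the body up to the first yield
second = next(gen)  # resumes right after the yield
rest = list(gen)    # drains whatever is left

# the generator is now exhausted; a default avoids the StopIteration error
after = next(gen, None)

print(first)   # 0
print(second)  # 1
print(rest)    # [2]
print(after)   # None
```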

This only scratches the surface of generator functionality in `python`, but for our purposes it is enough. For a more complete discussion see e.g. [the python wiki](https://wiki.python.org/moin/Generators) and [this pretty good example](http://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/). 

Generators and generator expressions are useful in general when dealing with large data objects because they allow you to iterate through the data without ever holding it in memory. 

The concept of generators will be useful when we discuss the `mapPartitions` RDD method in Spark.
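
As a preview, here is a rough sketch of the shape such a function takes. Spark's `mapPartitions` hands the function an iterator over the elements of one partition and expects an iterable back, which makes a generator a natural fit. No Spark is used below: the three plain lists are an assumption standing in for partitions, and `double_partition` is a name invented for this example.

```python
def double_partition(iterator):
    """Generator in the style of a mapPartitions function: takes an
    iterator over one partition and yields the transformed elements."""
    for x in iterator:
        yield x * 2


# stand-ins for the partitions of a distributed data set
partitions = [[13, 85, 77, 25], [50, 45, 65], [79, 9, 2]]

# each partition is processed lazily and reduced to a partial sum
partial_sums = [sum(double_partition(iter(p))) for p in partitions]
total = sum(partial_sums)

print(partial_sums)  # [400, 320, 180]
print(total)         # 900
```

Because `double_partition` yields one element at a time, the doubled values of a partition are never all held in memory at once.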