### A more accurate picture of Spark Architecture

While searching for a good description of the Spark architecture, I found [this interesting blog post](http://0x0fff.com/spark-architecture/) by Alexey Grishchenko.

In particular, I like the figure below. A few things to note:
* Each executor lives in a separate JVM.
* A worker node might have several executors under it.
* A partition defines a unit of work that resides in a single memory space. There is no one-to-one relationship between partitions and executors.
  
    
    
![SparkArchitecture](http://0x0fff.com/wp-content/uploads/2015/03/Spark-Architecture-On-YARN.png)

## Partitions and Glom
To understand `glom` and how it interacts with partitioning, we work through
a simple task that uses it. `glom` turns each partition into a single list, so the resulting RDD has one element per partition.

We are given a collection of ordered lists. Each list has the form $v_0,v_1,v_2,\ldots,v_{n-1}$.  
Our goal is to compute the *variation* of each list, which is defined as  
$$\sum_{i=0}^{n-2} |v_{i+1} - v_{i}|$$
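
As a quick sanity check of the definition, here is a minimal plain-Python sketch (no Spark required). The helper name `list_variation` is introduced here only for illustration.

```python
def list_variation(values):
    """Sum of absolute differences between consecutive elements."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

print(list_variation([0, -3, 6, -9]))  # 3 + 9 + 15 = 27
print(list_variation([5]))             # a single element has no neighbour: 0
```

A list with fewer than two elements has no consecutive pairs, so its variation is 0.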

To start, we create an RDD called `A` that contains elements of the form `(key, (index, value))` where  
different `key` values correspond to different lists, `value` corresponds to $v$ above, and `index` defines the order of the elements. Unlike the variable $i$ above, the values of `index` might not be consecutive.

In [131]:
A=sc.parallelize(range(10))\
    .map(lambda x: (x%3,(x,(-1)**x*x)))

A.collect()

[(0, (0, 0)),
 (1, (1, -1)),
 (2, (2, 2)),
 (0, (3, -3)),
 (1, (4, 4)),
 (2, (5, -5)),
 (0, (6, 6)),
 (1, (7, -7)),
 (2, (8, 8)),
 (0, (9, -9))]

### Original partition
The initial RDD, generated using `parallelize`, is split into partitions without regard to the keys.  
As a result, both partitions contain elements with all 3 keys: `0,1,2`.

In [133]:
print('number of partitions=',A.getNumPartitions())

B=A.glom()
B.collect()

number of partitions= 2


[[(0, (0, 0)), (1, (1, -1)), (2, (2, 2)), (0, (3, -3)), (1, (4, 4))],
 [(2, (5, -5)), (0, (6, 6)), (1, (7, -7)), (2, (8, 8)), (0, (9, -9))]]
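
To see why neither partition is limited to a single key, the following plain-Python sketch mimics the split shown above. It assumes that the ten elements were cut into two contiguous slices of five, which matches the `glom` output of this run but is an assumption, not a statement of Spark's general contract. The helper `contiguous_slices` is hypothetical.

```python
def contiguous_slices(items, num_slices):
    """Cut items into num_slices contiguous chunks of near-equal size."""
    n = len(items)
    bounds = [i * n // num_slices for i in range(num_slices + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_slices)]

# Same records as the RDD A, built locally.
records = [(x % 3, (x, (-1) ** x * x)) for x in range(10)]
parts = contiguous_slices(records, 2)

for i, part in enumerate(parts):
    # Print the distinct keys that ended up in each slice.
    print(i, sorted({key for key, _ in part}))
```

Because the keys cycle as `0,1,2,0,1,...`, any contiguous slice of three or more elements contains all three keys.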

### Why we need to repartition
In this case, `glom` is not helping us with our task. We need all elements with the same key to be in the same partition in order to use `glom` effectively.

Repartitioning the data into two partitions with `partitionBy` (i.e. keeping the number of partitions, but shuffling items between partitions so that equal keys land together) creates a situation in which we can use `glom`.

In [134]:
B=A.partitionBy(2).glom()
B.collect()

[[(0, (0, 0)),
  (2, (2, 2)),
  (0, (3, -3)),
  (2, (5, -5)),
  (0, (6, 6)),
  (2, (8, 8)),
  (0, (9, -9))],
 [(1, (1, -1)), (1, (4, 4)), (1, (7, -7))]]

Note that one partition holds the items with keys `0,2`, while the other partition holds only items with key `1`.
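
This placement is consistent with hash partitioning. Assuming `partitionBy` uses its default partition function (`portable_hash` in PySpark, which for small non-negative integer keys reduces to the key itself), each record goes to partition `hash(key) % numPartitions`. The sketch below simulates that rule in plain Python; `simulate_partition_by` is a hypothetical helper, not a Spark API.

```python
def simulate_partition_by(records, num_partitions):
    """Place each (key, value) record in bucket hash(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

# Same records as the RDD A, built locally.
records = [(x % 3, (x, (-1) ** x * x)) for x in range(10)]
buckets = simulate_partition_by(records, 2)

for i, bucket in enumerate(buckets):
    # Print the distinct keys assigned to each simulated partition.
    print(i, sorted({key for key, _ in bucket}))
```

With two partitions, the even keys `0` and `2` share bucket 0 and the key `1` is alone in bucket 1. Changing `num_partitions` shows how the keys would be spread under this rule.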

### A function for computing the variation
We now define the function that computes the variation. The function receives a list (**not an RDD**) of `(key,(index,value))` items and outputs a list of `(key,result)` pairs.

While the list *might* have just one `key`, we cannot rely on that, so the function has to first partition the list according to `key`. What we are guaranteed is that each `key` appears only in one partition.

In [135]:
def variation(A):
    # A partition can hold more than one key, so we first split A by key.
    D={}
    for key, item in A:
        D.setdefault(key,[]).append(item)
    out=[]
    for key, L in D.items():
        if len(L)<2:
            # A list with fewer than two elements has zero variation.
            out.append((key,0))
        else:
            # Order the items by index, then sum |v[i]-v[i-1]|.
            L.sort(key=lambda x:x[0])
            d=0
            for i in range(1,len(L)):
                d+= abs(L[i][1]-L[i-1][1])
            out.append((key,d))
    return out

In [137]:
# We use flatMap rather than map because variation returns a list of (key, result) tuples.
C=B.flatMap(variation)
C.collect()

[(0, 27), (2, 20), (1, 16)]
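
The result can be cross-checked without Spark. This sketch rebuilds the same ten records locally, groups them by key, sorts each group by index, and sums the absolute differences. It is an independent check written for illustration, not the notebook's method.

```python
from collections import defaultdict

# Same records as the RDD A, built locally.
records = [(x % 3, (x, (-1) ** x * x)) for x in range(10)]

# Group the (index, value) items by key.
groups = defaultdict(list)
for key, item in records:
    groups[key].append(item)

expected = {}
for key, items in groups.items():
    # Sorting the (index, value) tuples orders them by index.
    values = [v for _, v in sorted(items)]
    expected[key] = sum(abs(b - a) for a, b in zip(values, values[1:]))

print(expected)  # {0: 27, 1: 16, 2: 20}
```

The three values agree with the Spark output; only the order of the keys differs, since Spark reports them partition by partition.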

### Take Home
* Partition your data and your computation in such a way that each partition holds the data needed to compute a part of the result.
* Use a partitioner to bring items that are related to each other to the same partition, and therefore the same machine.
* Distribute the load across all of the machines.

### Experiment

Check to see what happens if you change the number of partitions to be larger than the number of keys.