## PySpark Friends of Friends

**The notebook is complete, i.e. all the TODOs have been filled in.**

We are going to reproduce the Friends of Friends Hadoop example in Spark.
First things first, let's run it as a map/reduce program in Spark.

Node A with neighbors B and C proposes candidate triples to its neighbors:
* B A C to node B (if A<C, else B C A)
* C A B to node C (if A<B, else C B A)

Each node of a true triple receives the same proposal from both of its neighbors, and the reduce step counts them. If there are two matching proposals, we have a triple.
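
The proposal scheme above can be tried without Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: the helper `propose` and the toy `friends_lists` graph are hypothetical names introduced only for illustration, and a `Counter` stands in for the reduce step.

```python
from collections import Counter

def propose(source, friends):
    """Yield the candidate triples that `source` proposes for each pair of its friends."""
    for i, fi in enumerate(friends):
        for fj in friends[i + 1:]:
            # Proposal sent to fi: fi first, then the other two ids in ascending order.
            yield (fi, *sorted((source, fj)))
            # Proposal sent to fj: fj first, then the other two ids in ascending order.
            yield (fj, *sorted((source, fi)))

# Hypothetical toy graph: 1, 2 and 3 are mutual friends; 4 is a friend of 1 only.
friends_lists = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}

# Count identical proposals; a triple proposed twice is confirmed.
counts = Counter(t for src, fl in friends_lists.items() for t in propose(src, fl))
triangles = sorted(t for t, n in counts.items() if n >= 2)
print(triangles)  # [(1, 2, 3), (2, 1, 3), (3, 1, 2)]
```

Proposals that involve node 4 are only made once, so they are dropped, while each member of the 1-2-3 triangle ends up with one confirmed triple.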

### Map/Reduce Implementation

The function `line_to_triples` below implements the proposal step and is meant to be run in `flatMap`.

In [None]:
import numpy as np

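def line_to_triples(line: str):
    # The first id owns the friends list; the remaining ids are its friends.
    fids = np.array(line.split(), dtype=int)
    ret = []
    for i in range(1, len(fids) - 1):
        for j in range(i + 1, len(fids)):
            source = fids[0]
            fi, fj = fids[i], fids[j]
            # Proposal to fj: fj first, then source and fi in ascending order.
            if source < fi:
                ret.append([fj, source, fi])
            else:
                ret.append([fj, fi, source])
            # Proposal to fi: fi first, then source and fj in ascending order.
            if source < fj:
                ret.append([fi, source, fj])
            else:
                ret.append([fi, fj, source])
    return ret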

Simple test to show what it does.

In [None]:
line_to_triples ("1 5 8 7 9")

In [None]:
sc.stop()


Now, complete the program. You will have to use the wordcount style `<triple>, 1` to get a simpler reducer to work.

In [None]:
from pyspark import SparkContext

inputdir = "../data/simple.input"
outdir = "/tmp/outputsimple.mr"

sc = SparkContext("local", "App Name",)
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(line_to_triples)
### TODO add a WordCount style reducer
rdd = rdd.map(lambda x: (str(x), 1))
rdd = rdd.reduceByKey(lambda x, y: x + y)
### TODO Identify triples of count >=2
rdd = rdd.filter(lambda x: x[1]>1)
### TODO format output to just triples
rdd = rdd.map(lambda x: x[0])
rdd.saveAsTextFile(outdir)            

sc.stop()

#### Gotchas
Things that were confusing in the implementation:
1. `str(x)` not `x`: `reduceByKey()` needs a hashable key, and the lists returned by `line_to_triples` are not hashable, so each triple is converted to a string.
2. Warnings: `Unable to load native-hadoop library for your platform`. This can be ignored.
3. Rerunning with the same output directory
    1. Spark needs to create the output directory itself on each run.
    2. Running twice with the same `outdir` will raise an error on `rdd.saveAsTextFile(outdir)`.
4. Error on `sc = SparkContext("local", "App Name",)`
    1. Fix this by running `sc.stop()`.
    2. We should run the whole snippet in a `try` block and stop on error. But this is Jupyter.
5. Debugging with Spark: how to build up a program.
    1. I do one RDD transformation at a time and check the output.
    2. Then change the directory, add a transformation, and check the output.
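
One way to sidestep the output-directory gotcha is to delete the old directory before each run. The sketch below assumes the output lives on the local filesystem, as the `/tmp/...` paths in this notebook do. The helper name `fresh_output_dir` and the demo directory are hypothetical, not part of the notebook.

```python
import shutil
import tempfile
from pathlib import Path

def fresh_output_dir(path: str) -> str:
    """Delete `path` if it already exists so a later save can recreate it."""
    p = Path(path)
    if p.exists():
        shutil.rmtree(p)
    return str(p)

# Demo on a throwaway directory that simulates output left by a previous run.
demo = Path(tempfile.mkdtemp()) / "output.demo"
demo.mkdir()
(demo / "part-00000").write_text("stale\n")

cleared = fresh_output_dir(str(demo))
print(Path(cleared).exists())  # False
```

With this in place, a cell could call `rdd.saveAsTextFile(fresh_output_dir(outdir))`. The trade-off is that results from the previous run are lost, which is the opposite of the "change the directory" habit described above.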

### Join Implementation

In previous years, when we ran this on distributed memory, we found that an implementation using a self `join` was faster than `reduceByKey`. This is because join has an efficient implementation (based on a hash map) that does not have to send all the triples around. The concept of the program is:
* output all triples (same as before)
* add a partition number to each triple
* perform a self-join on the triples with partition numbers
* use the partition index to identify when the triples came from different lists
    * self-join will also produce joined triples from the same partition
* output triples when two proposals came from different partitions
    * i.e. two proposals from different friends lists
    * you need to avoid additional output when there is no match
    
This is a pretty different usage of Spark.  The function `mapPartitions` allows the programmer to write functions that apply to an entire partition, rather than to each individual element of an RDD.  The function `mapPartitionsWithIndex` makes the partition identifier available to differentiate behavior, analogous to having the thread ID in OpenMP.

Partition in this case refers to an input partition.  There will be one per file because that's how `sc.textFile()` works.  So, each friends list is in a different partition.
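
To see why the partition index matters, here is a plain-Python sketch of the self-join idea. It is only an illustration: `self_join` is a hypothetical stand-in that mimics what `rdd.join(rdd)` does on `(key, value)` pairs, and the `indexed` data is made up.

```python
from collections import defaultdict

def self_join(pairs):
    """Mimic rdd.join(rdd): for each key, emit every ordered pair of its values."""
    by_key = defaultdict(list)
    for key, value in pairs:
        by_key[key].append(value)
    for key, values in by_key.items():
        for left in values:
            for right in values:
                yield key, (left, right)

# Hypothetical (triple, partition index) pairs after the indexing step.
indexed = [("(2, 1, 3)", 0), ("(2, 1, 4)", 0), ("(2, 1, 3)", 2)]

joined = list(self_join(indexed))
# Keep only triples whose two proposals came from different partitions.
matches = {key for key, (a, b) in joined if a != b}
print(matches)  # {'(2, 1, 3)'}
```

Every triple joins with itself, which is why pairs with equal indexes must be discarded. A real match also shows up twice, once as `(0, 2)` and once as `(2, 0)`; the sketch uses a set to collapse the two.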

The following helper function will add an index to the triple.

In [25]:
### Add a partition index to each triple.
def add_index (idx, part):
    for p in part:
        yield str(p), str(idx)

You will need to write a function to identify when the joined triples have different indexes.

In [28]:
def filter_diff_idx (x):
    if x[1][0] != x[1][1]:
        return x[0]

In [31]:
from pyspark import SparkContext

inputdir = "../data/simple.input"
outdir = "/tmp/output.join"

sc = SparkContext("local", "App Name",)
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(line_to_triples)
rdd = rdd.mapPartitionsWithIndex(add_index)
# TODO self join
rdd = rdd.join(rdd)
# TODO filter out matches from same partition
rdd = rdd.map(filter_diff_idx)
# TODO cleanup output
rdd = rdd.filter(lambda x: x!= None)
rdd.saveAsTextFile(outdir)

sc.stop()

24/11/14 13:20:36 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
                                                                                

The following `sc.stop()` can be called alone to clean up crashed environments.

In [30]:
sc.stop()

It turns out (we will see below) that this version is not faster on shared memory. It does a lot more computation, and the reduction in traffic is not as important because memory is faster than networking.


### List Merge Version

A better idea might be to use the memory of Spark to make things faster. We are going to make the assumption that each Spark worker can hold two entire friends lists at once.  So the idea here is:
* merge pairs of lists, e.g. the lists from 100 and 200, into an RDD. You should generate an RDD with entries like:
    * `'100, 200', array([300, 319, 400])` from 100
    * `'100, 200', array([219, 300, 400])` from 200
    * note that we sorted the friends in the key so that the keys will match.
    * also note that the list only contains the remaining friends, i.e. not 100 or 200
* group these lists together
    * `'100, 200', [array([219, 300, 400]), array([300, 319, 400])]`
* when you have two arrays in the value, compute their intersection
    * `'100, 200', [300, 400]`
* output the corresponding triples
    * be careful here. It's subtle to get the output right.
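
The steps above can be sketched in plain Python before moving to RDDs. This is a minimal illustration using the 100/200 example from the list, with sets in place of NumPy arrays; the names `pair_key`, `grouped`, and `common` are hypothetical.

```python
from collections import defaultdict

def pair_key(a, b):
    """Order the two ids so both endpoints of a friendship produce the same key."""
    return (min(a, b), max(a, b))

# The two friends lists from the example above.
friends_lists = {100: [200, 300, 319, 400], 200: [100, 219, 300, 400]}

# Steps 1 and 2: emit (key, remaining friends) per friendship and group by key.
grouped = defaultdict(list)
for source, friends in friends_lists.items():
    for f in friends:
        grouped[pair_key(source, f)].append({x for x in friends if x != f})

# Step 3: intersect only where both endpoints contributed a list.
common = {k: sorted(v[0] & v[1]) for k, v in grouped.items() if len(v) == 2}
print(common)  # {(100, 200): [300, 400]}

# Step 4: one triple per common friend, with the common friend first.
triples = [(third, a, b) for (a, b), thirds in common.items() for third in thirds]
print(triples)  # [(300, 100, 200), (400, 100, 200)]
```

Keys such as `(100, 300)` receive only one list here because the friends lists of 300 and 400 are not part of the toy input, so they are skipped.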
    
You have to write any helper functions that you need. I'm including prototypes for the ones I used.

In [None]:
### Output the remaining list of friends for each friend.
def pair_lists(line: str):
    fids = np.array(line.split(), dtype=int)
    ret = []
    for i in range(1, len(fids)):
        source = fids[0]
        # The remaining friends: everyone on the list except fids[i].
        rest = np.concatenate((fids[1:i], fids[i+1:]))
        # Put the smaller id first so both endpoints generate the same key.
        lo, hi = sorted((source, fids[i]))
        ret.append([f"{lo}, {hi}", rest])
    return ret

In [None]:
pair_lists ("2 1 5 8 7 9")

Sample output from `pair_lists`
```
[['1, 2', array([5, 8, 7, 9])],
 ['2, 5', array([1, 8, 7, 9])],
 ['2, 8', array([1, 5, 7, 9])],
 ['2, 7', array([1, 5, 8, 9])],
 ['2, 9', array([1, 5, 8, 7])]]
```

In [None]:
### Output the appropriate triples after intersection.
def output_triples(x):
    output = []
    xar = np.fromstring(x[0], dtype=int, sep=",")
    for third in x[1]:
        if xar[0] < xar[1]:
            output.append((third, xar[0], xar[1]))
        else:
            output.append((third, xar[1], xar[0]))
    return output
        

In [None]:
sc.stop()

In [None]:
from pyspark import SparkContext

inputdir = "../data/simple.input"
outdir = "/tmp/output.mergenew"

sc = SparkContext("local", "App Name",)
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(pair_lists)
# group by and turn into lists
rdd = rdd.groupByKey().mapValues(list)
# keep only keys with two inputs
rdd = rdd.filter(lambda x: len(x[1])==2)
# TODO intersect the two lists
rdd = rdd.mapValues(lambda x: np.intersect1d(x[0], x[1]))
rdd = rdd.flatMap(output_triples)
rdd.saveAsTextFile(outdir)

sc.stop()

### Timings

Run on the full dataset to compare performance.

I've removed the code, but I've left the timing information for your reference.
  * M/R OK
  * Join slowest
  * Merge fastest

Conclusion: different implementations are better for different architectures.

#### Map/Reduce version

In [None]:
%%timeit -n1 -r1
from pyspark import SparkContext

inputdir = "../data/fof.input"
outdir = "/tmp/outputfof.mr1"

sc = SparkContext("local", "App Name",)
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(line_to_triples)
...
rdd.saveAsTextFile(outdir)

sc.stop()                

#### Join version

In [None]:
%%timeit -n1 -r1
from pyspark import SparkContext

inputdir = "../data/fof.input"
outdir = "/tmp/outputfof.joinc"

sc = SparkContext("local", "App Name",)
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(line_to_triples)
...
rdd.saveAsTextFile(outdir)

sc.stop()

#### List merge version

In [None]:
%%timeit -n1 -r1
from pyspark import SparkContext

inputdir = "../data/fof.input"
outdir = "/tmp/outputfof.merge1"

sc = SparkContext("local", "App Name",)
rdd = sc.textFile(f"{inputdir}/*")
...
rdd.saveAsTextFile(outdir)

sc.stop()