## PySpark Friends of Friends

We are going to reproduce the Friends of Friends Hadoop example in Spark.
First things first, let's run it as a map/reduce program in Spark.

Node A with neighbors B and C proposes candidate triples to its neighbors:
* B A C to node B (if A<C, else B C A)
* C A B to node C (if A<B, else C B A)

Every triple that closes a triangle gets two proposals, one from each of the receiving node's two neighbors, and the reduce step counts them. If there are two matching proposals, we have a triple.

#### Map/Reduce Implementation

The function `line_to_triples` below implements the proposal step and is meant to be run in `flatMap`.

In [1]:
import numpy as np

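def line_to_triples(line: str):
    """Propose candidate triples for one friends list: '<source> <friend> <friend> ...'."""
    fids = np.array(line.split(), dtype=int)
    ret = []
    # visit every unordered pair (fi, fj) of the source's friends
    for i in range(1, len(fids) - 1):
        for j in range(i + 1, len(fids)):
            source = fids[0]
            fi, fj = fids[i], fids[j]
            # proposal to fj: fj first, then source and fi in ascending order
            if source < fi:
                ret.append([fj, source, fi])
            else:
                ret.append([fj, fi, source])
            # proposal to fi: fi first, then source and fj in ascending order
            if source < fj:
                ret.append([fi, source, fj])
            else:
                ret.append([fi, fj, source])
    return ret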

Simple test to show what it does.

In [2]:
line_to_triples ("1 5 8 7 9")

[[8, 1, 5],
 [5, 1, 8],
 [7, 1, 5],
 [5, 1, 7],
 [9, 1, 5],
 [5, 1, 9],
 [7, 1, 8],
 [8, 1, 7],
 [9, 1, 8],
 [8, 1, 9],
 [9, 1, 7],
 [7, 1, 9]]
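
To see why counting proposals finds the triples, here is a minimal plain-Python sketch that does not need Spark. The helper `propose` is a hypothetical stand-in for `line_to_triples` that returns tuples instead of lists, and the four input lines are a made-up toy graph (1, 2, 3 form a triangle; 4 is only a friend of 1). A `Counter` plays the role of the wordcount-style reducer.

```python
from collections import Counter

def propose(line):
    """Pure-Python stand-in for line_to_triples; returns (target, low, high) tuples."""
    source, *friends = (int(tok) for tok in line.split())
    triples = []
    for i, fi in enumerate(friends):
        for fj in friends[i + 1:]:
            # propose to fj the triple it would form with source and fi ...
            triples.append((fj, min(source, fi), max(source, fi)))
            # ... and to fi the triple it would form with source and fj
            triples.append((fi, min(source, fj), max(source, fj)))
    return triples

# hypothetical toy input: one friends list per line
lines = ["1 2 3 4", "2 1 3", "3 1 2", "4 1"]

# count identical proposals, like reduceByKey on (<triple>, 1)
counts = Counter(t for line in lines for t in propose(line))

# a triple is confirmed when two different neighbors proposed it
confirmed = sorted(t for t, n in counts.items() if n >= 2)
print(confirmed)  # [(1, 2, 3), (2, 1, 3), (3, 1, 2)]
```

Proposals that involve node 4, such as `(4, 1, 2)`, are made only once and drop out, which is the behavior the count filter relies on.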


Now, complete the program. Emit each triple wordcount style, as `<triple>, 1`, so that a simple summing reducer works.

In [3]:
from pyspark import SparkContext

inputdir = "../data/simple.input"
outdir = "/tmp/outputsimple.mr9"

sc = SparkContext("local", "App Name")
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(line_to_triples)
# WordCount-style reducer: key each triple by its string form and sum the counts
rdd = rdd.map(lambda x: (str(x), 1))
rdd = rdd.reduceByKey(lambda x, y: x + y)
# keep triples with count >= 2, i.e. proposed by two neighbors
rdd = rdd.filter(lambda x: x[1]>1)
# format output to just triples
rdd = rdd.map(lambda x: x[0])
rdd.saveAsTextFile(outdir)

sc.stop()             


In [None]:
sc.stop()

#### Join Implementation

In previous years, when we ran this on distributed memory, we found that an implementation using a `join` was faster. This is because `join` has an efficient implementation (based on a hash map) that does not have to send all the triples around. The concept of the program is:
* output all triples (same as before)
* add a partition number to each triple
* perform a self-join on the triples with partition numbers
* use the partition index to identify when the triples came from different lists
    * the self-join will also produce joined triples from the same partition
* output triples when two proposals came from different partitions
    * i.e. two proposals from different friends lists
    * you need to avoid additional output when there is no match
    
This is a pretty different usage of Spark. The function `mapPartitions` allows the programmer to write functions that apply to an entire partition, rather than to each individual element of an RDD. The function `mapPartitionsWithIndex` makes the partition identifier available to differentiate behavior, analogous to having the thread ID in OpenMP.
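
As a rough illustration of the calling convention, the sketch below emulates `mapPartitionsWithIndex` in plain Python. The function `map_partitions_with_index` and the two toy partitions are hypothetical stand-ins, not Spark code; the point is only that the user function is called once per partition and receives the partition index together with an iterator over that partition's elements.

```python
def map_partitions_with_index(partitions, func):
    """Hypothetical stand-in: call func(index, iterator) once per partition."""
    return [list(func(idx, iter(part))) for idx, part in enumerate(partitions)]

def tag_with_partition(idx, part):
    # runs once per partition, so idx is the same for every element it yields
    for element in part:
        yield element, idx

# two made-up partitions of an RDD
partitions = [["a", "b"], ["c"]]
tagged = map_partitions_with_index(partitions, tag_with_partition)
print(tagged)  # [[('a', 0), ('b', 0)], [('c', 1)]]
```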

The following helper function will add an index to the triple.

In [3]:
### add a partition index to each triple.
def add_index(idx, part):
    for p in part:
        yield str(p), str(idx)

You will need to write a function to identify when the joined triples have different indexes.

In [4]:
def filter_diff_idx (x):
    # TODO
    pass
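
One possible shape for this step, sketched in plain Python rather than Spark. The helper `self_join` is a hypothetical emulation of a key/value self-join that yields `(key, (left, right))` for every pairing of values under a key, and `different_partitions` is an illustrative name, not the required solution. The sketch assumes that proposals from different friends lists carry different partition indexes, which holds only if each list lands in its own partition.

```python
from collections import defaultdict
from itertools import product

def self_join(pairs):
    """Emulate a (key, value) self-join: (key, (left, right)) for every pairing."""
    by_key = defaultdict(list)
    for key, value in pairs:
        by_key[key].append(value)
    return [(key, (left, right))
            for key, values in by_key.items()
            for left, right in product(values, repeat=2)]

def different_partitions(joined):
    """Keep a joined record only if its two proposals have different indexes."""
    _key, (left_idx, right_idx) = joined
    return left_idx != right_idx

# made-up (triple, partition index) records, as strings like add_index yields
tagged = [("[3, 1, 2]", "0"), ("[3, 1, 2]", "1"), ("[4, 1, 2]", "0")]

joined = self_join(tagged)  # includes every record paired with itself
# the surviving key appears once per ordering, so a set removes the repeat
matches = sorted({key for key, _ in filter(different_partitions, joined)})
print(matches)  # ['[3, 1, 2]']
```

The unmatched triple `[4, 1, 2]` is joined only with itself and is dropped by the filter, which is the "no additional output when there is no match" requirement.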

In [5]:
from pyspark import SparkContext

inputdir = "../data/simple.input"
outdir = "/tmp/output.join"

sc = SparkContext("local", "App Name")
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(line_to_triples)
rdd = rdd.mapPartitionsWithIndex(add_index)
# TODO self join
...
# TODO filter out matches from same partition
...
# TODO cleanup output
...
rdd.saveAsTextFile(outdir)

sc.stop()


The following `sc.stop()` can be called alone to clean up crashed environments.

In [6]:
sc.stop()

It turns out (we will see below) that this version is not faster on shared memory. It does a lot more computation, and the reduction in traffic is not as important because memory is faster than networking.


#### List Merge Version

A better idea might be to use the memory of Spark to make things faster. We are going to make the assumption that each Spark worker can hold two entire friends lists at once. The idea is:
* merge pairs of lists, e.g. the lists for 100 and 200, into an RDD. You should generate an RDD with entries like:
    * `'100, 200', array([300, 319, 400])` from 100
    * `'100, 200', array([219, 300, 400])` from 200
    * note that we sorted the friends in the key so that the keys will match.
    * also note that the list only contains the remaining friends, i.e. not 100 or 200
* group these lists together
    * `'100, 200', [array([219, 300, 400]), array([300, 319, 400])]`
* when you have two arrays in the value, compute their intersection
    * `'100, 200', [300, 400]`
* output the corresponding triples
    * be careful here. It's subtle to get the output right.
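
The pairing, grouping and intersection steps can be sketched in plain Python for the 100/200 example. This is only an illustration under a few assumptions: `pairs_from_line` is a hypothetical helper, tuple keys and Python sets stand in for the string keys and NumPy arrays shown in the bullets, and a dictionary stands in for `groupByKey`.

```python
from collections import defaultdict

def pairs_from_line(line):
    """Hypothetical helper: emit ((low, high), remaining friends) per friend."""
    source, *friends = (int(tok) for tok in line.split())
    records = []
    for friend in friends:
        # sort the pair so both friends lists produce the same key
        key = (min(source, friend), max(source, friend))
        # remaining friends exclude both members of the pair
        rest = {f for f in friends if f != friend}
        records.append((key, rest))
    return records

# the two friends lists from the example, written as input lines
lines = ["100 200 300 319 400", "200 100 219 300 400"]

grouped = defaultdict(list)  # stands in for groupByKey
for line in lines:
    for key, rest in pairs_from_line(line):
        grouped[key].append(rest)

# intersect only where both members of the pair contributed a list
common = {key: sorted(sets[0] & sets[1])
          for key, sets in grouped.items() if len(sets) == 2}
print(common)  # {(100, 200): [300, 400]}
```

Keys such as `(100, 300)` receive a single list in this toy input, so they are skipped, just as the `len(x[1])==2` filter would skip them.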
    
You have to write any helper functions that you need. I'm including prototypes for the ones I used.

In [7]:
### Output the remaining list of friends for each friend.
def pair_lists(line: str):
    # TODO
    pass

In [8]:
pair_lists ("2 1 5 8 7 9")

[['1, 2', array([5, 8, 7, 9])],
 ['2, 5', array([1, 8, 7, 9])],
 ['2, 8', array([1, 5, 7, 9])],
 ['2, 7', array([1, 5, 8, 9])]]

In [9]:
### Output the appropriate triples after intersection.
def output_triples(x):
    output = []
    # TODO
    return output
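
The subtlety is that every triangle is reachable from three different keys, so emitting all three orientations from each key would produce duplicates. One possible convention, offered as an assumption rather than the intended solution, is to emit from key `(a, b)` only the triple addressed to each common friend `c`, in the same order the map/reduce version uses. The function `triples_for_pair` and the input records below are hypothetical.

```python
def triples_for_pair(record):
    """Hypothetical: from ((a, b), common friends) emit (c, a, b) per common friend c."""
    (a, b), common = record  # a < b because the key was sorted
    return [(c, a, b) for c in common]

# made-up intersected records for a single triangle 1-2-3
records = [((1, 2), [3]), ((1, 3), [2]), ((2, 3), [1])]

triples = sorted(t for rec in records for t in triples_for_pair(rec))
print(triples)  # [(1, 2, 3), (2, 1, 3), (3, 1, 2)]
```

Each of the three keys contributes exactly one triple here, so each node of the triangle is reported once and nothing is repeated.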
        

In [11]:
sc.stop()

In [12]:
from pyspark import SparkContext

inputdir = "../data/simple.input"
outdir = "/tmp/output.merge"

sc = SparkContext("local", "App Name")
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(pair_lists)
# group by and turn into lists
rdd = rdd.groupByKey().mapValues(list)
# keep only keys with two inputs
rdd = rdd.filter(lambda x: len(x[1])==2)
# TODO intersect the two lists
...
rdd = rdd.flatMap(output_triples)
rdd.saveAsTextFile(outdir)

sc.stop()

                                                                                

### Timings

Run on the full dataset to compare performance.

I've removed the code, but I've left the timing information for your reference.
  * M/R: OK (3min 30s)
  * Join: slowest (8min 25s)
  * Merge: fastest (2min 32s)

Conclusion: different implementations are better for different architectures.

#### Map/Reduce version

In [13]:
%%timeit -n1 -r1
from pyspark import SparkContext

inputdir = "../data/fof.input"
outdir = "/tmp/outputfof.mr"

sc = SparkContext("local", "App Name")
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(line_to_triples)
...
rdd.saveAsTextFile(outdir)

sc.stop()                

                                                                                

3min 30s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


#### Join version

In [14]:
%%timeit -n1 -r1
from pyspark import SparkContext

inputdir = "../data/fof.input"
outdir = "/tmp/outputfof.join"

sc = SparkContext("local", "App Name")
rdd = sc.textFile(f"{inputdir}/*")
rdd = rdd.flatMap(line_to_triples)
...
rdd.saveAsTextFile(outdir)

sc.stop()

                                                                                

8min 25s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


#### List merge version

In [15]:
%%timeit -n1 -r1
from pyspark import SparkContext

inputdir = "../data/fof.input"
outdir = "/tmp/outputfof.merge"

sc = SparkContext("local", "App Name")
rdd = sc.textFile(f"{inputdir}/*")
...
rdd.saveAsTextFile(outdir)

sc.stop()

                                                                                

2min 32s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
