# Big Data Analytics with Spark

Apache Spark is an open-source cluster computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. (From Wikipedia)

This tutorial walks you through installing Spark and running some classic big data programs on it. We will use Spark to re-implement some functions from earlier homework assignments.

Note that Spark keeps data in memory, so a test dataset that is too large can cause problems on a single machine. The test dataset here is therefore only 100 lines, but the same code scales up easily.

## Q0: Install Spark
### For Mac
#### Install brew

```bash
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
```

#### Install Spark 1.6 via brew

```bash
brew install homebrew/versions/apache-spark16
```

#### Link Spark with Jupyter Notebook
Add the following environment variables:
```bash
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
```

#### Test Spark with example

```bash
run-example org.apache.spark.examples.SparkPi
```

#### Start notebook with Spark
```bash
pyspark --packages graphframes:graphframes:0.1.0-spark1.6 --executor-memory 4g --driver-memory 4g
```

In [1]:
import subprocess

In [2]:
p = subprocess.Popen('run-example org.apache.spark.examples.SparkPi',\
                     shell=True, stdout=subprocess.PIPE, \
                     stderr=subprocess.STDOUT)
for line in p.stdout.readlines():
    print line,
retval = p.wait()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/10/19 00:41:01 INFO SparkContext: Running Spark version 1.6.2
16/10/19 00:41:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/19 00:41:06 INFO SecurityManager: Changing view acls to: yuanchen
16/10/19 00:41:06 INFO SecurityManager: Changing modify acls to: yuanchen
16/10/19 00:41:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yuanchen); users with modify permissions: Set(yuanchen)
16/10/19 00:41:12 INFO Utils: Successfully started service 'sparkDriver' on port 63326.
16/10/19 00:41:12 INFO Slf4jLogger: Slf4jLogger started
16/10/19 00:41:12 INFO Remoting: Starting remoting
16/10/19 00:41:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.3.30:63327]
16/10/19 00:41:12 INFO Utils: Successfully started s

The following output indicates that the installation was successful:
```
    16/10/18 19:16:13 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1031 bytes result sent to driver
    16/10/18 19:16:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1031 bytes result sent to driver
    16/10/18 19:16:13 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 512 ms on localhost (1/2)
    16/10/18 19:16:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 543 ms on localhost (2/2)
    16/10/18 19:16:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    16/10/18 19:16:13 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:36) finished in 0.558 s
    16/10/18 19:16:13 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 0.745534 s
    Pi is roughly 3.14766
```


In [102]:
import sys
from operator import add
from graphframes import *
import re
from random import random

## Warm-up: write your own Pi Estimation
To get more familiar with Spark's map and reduce functions, we will finish the following code for Pi estimation. The general idea is to estimate Pi by "throwing darts" at a circle, a technique called Monte Carlo integration:  
https://en.wikipedia.org/wiki/Monte_Carlo_integration

- We pick random points in the unit square ((0, 0) to (1, 1)) and count how many fall inside the unit circle. Only a quarter of the circle lies in the square, so the fraction should be Pi / 4, and multiplying it by 4 gives our estimate.
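
The estimate is noisy when the number of samples is small. A minimal local sketch (plain Python, no Spark; the helper name `estimate_pi` and the fixed seed are assumptions made for illustration) shows the same dart-throwing idea and how the error behaves as the sample count grows:

```python
import math
import random

def estimate_pi(num_samples, seed=0):
    """Estimate Pi by sampling points in the unit square (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:  # the point falls inside the quarter circle
            hits += 1
    return 4.0 * hits / num_samples

for n in (1000, 100000):
    estimate = estimate_pi(n)
    print("n=%d estimate=%.4f error=%.4f" % (n, estimate, abs(estimate - math.pi)))
```

The standard error of a Monte Carlo estimate like this shrinks roughly as 1/sqrt(n), so 1000 samples typically give only one or two correct digits.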

In [108]:
def sample(p):
    """
    return the result of one sample
    """
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

def add(a,b):
    """
    return the sum of two values

    add is already available as a built-in operator (operator.add), so you do
    not actually need to implement it. It is written out here to show how to
    write a function for the reducer.
    """
    return a + b

# generate 1000 samples
experiment_index = sc.parallelize(xrange(0, 1000))
# use map and the sample function to get the result of each sample
experiment_result = experiment_index.map(sample)
# reduce to get the total number of hits

experiment_count = experiment_result.reduce(add)

print 4.0 * experiment_count / 1000


3.232


## Q1: Implement Word Count
Word count is a classic MapReduce example, and we can also do it with Spark.

In the map step, we map every word in the text file to a tuple (word, 1).

In the reduce step, we reduce the tuples by key, adding up the values to get the final count for each word.
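
A minimal local sketch of these two steps, using plain Python lists instead of an RDD (the input lines and variable names are made up for illustration, and sorting plus `groupby` only stands in for what `reduceByKey` does on a cluster):

```python
from functools import reduce
from itertools import groupby

lines = ["a b a", "b c"]  # hypothetical input, one string per line

# "map" step: flatten the lines into words, then pair each word with 1
pairs = [(word, 1) for line in lines for word in line.split()]

# "reduce" step: group the pairs by word and add up the 1s for each word
pairs.sort(key=lambda pair: pair[0])
counts = {
    word: reduce(lambda a, b: a + b, (n for _, n in group))
    for word, group in groupby(pairs, key=lambda pair: pair[0])
}
print(counts)
```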

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. The Spark documentation recommends three ways to do this; this tutorial uses two of them:

- Local defs inside the function calling into Spark, for longer code.
- Lambda expressions, for simple functions that can be written as an expression. (Lambdas do not support multi-statement functions or statements that do not return a value.)

We will try both ways in the following parts.
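
As a small illustration of the difference, the sketch below uses Python's built-in `map` in place of an RDD's `map`; the names `clean_word` and `to_pair` and the sample words are hypothetical. Multi-statement logic goes into a def, while a single expression fits in a lambda:

```python
import re

def clean_word(word):
    """Multi-statement logic needs a def: strip non-word characters, then lowercase."""
    stripped = re.sub(r'\W+', '', word)
    return stripped.lower()

# a single-expression transformation fits in a lambda
to_pair = lambda word: (word, 1)

words = ["Spark,", "spark!"]  # hypothetical sample words
pairs = list(map(to_pair, map(clean_word, words)))
print(pairs)
```

Either kind of function can be passed to an RDD transformation such as `map` in the same way.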


## Q1.1 Word Count Implementation
The following functions are written as named functions to pass to Spark, which suits longer code:
- split_line: split an input line into words
- word_pair: generate a word pair (word, 1)
- count: add up the counts of two word pairs

We then combine flatMap, map and reduceByKey with the functions above to get the word count:
- word_count: use the three functions above to generate the final result

In [6]:
def split_line(line):
    """ Split a line into words and keep only alphanumeric words
    Args: 
        line (string): an input line
    Returns:
        list: a list of words
    """
    words = line.split()
    words = [re.sub(r'\W+', '', word) for word in words]
    res = filter(lambda word: re.match(r'\w+', word), words)
    return res

In [7]:
line = "realDonaldTrump Trump"
split_line(line)

['realDonaldTrump', 'Trump']

In [8]:
def word_pair(word):
    """ generate word pair like (word, 1)
    Args: 
        word (string): an input word
    Returns:
        tuple: a word pair
    """
    return (word, 1)

In [9]:
word = "realDonaldTrump"
word_pair(word)

('realDonaldTrump', 1)

In [10]:
def count(a, b):
    """ add two counts together
    Args: 
        a (int): an input count
        b (int): an input count
    Returns:
        int : result count
    """
    return (a + b)

In [11]:
a = 1
b = 1
count(a, b)

2

In [14]:
def word_count(path):
    """ Count the words in a text file using Spark's flatMap, map and reduceByKey functions
    Args: 
        path (string): path to the file
    Returns:
        dict: a dict mapping each word to its count
    """
    text_file = sc.textFile(path)
    words = text_file.flatMap(split_line)
    pairs = words.map(word_pair)
    counts = pairs.reduceByKey(count)
    count_dict = dict(counts.collect())
    return count_dict

In [67]:
path = "edges.csv"
d = word_count(path)
print d['TrumpGolf']

25


## Q1.2 Word Count Implementation With Lambda function
We can also use lambda functions in Spark, but note that lambdas do not support multi-statement functions or statements that do not return a value.
In most cases, though, the code in each step is simple enough that a lambda is a good choice.

In [68]:
def word_count(path):
    """ Count the words in a text file using Spark
    Args: 
        path (string): path to the file

    Returns:
        dict: a dict mapping each word to its count
    """
    text_file = sc.textFile(path)
    count = text_file.flatMap(lambda line: line.split(' '))\
                    .map(lambda word: (word, 1))\
                    .reduceByKey(lambda a, b: a+b)
    count_dict = dict(count.collect())
    return count_dict

In [69]:
path = "edges.csv"
d = word_count(path)
print d['TrumpGolf']

25


## Q2: Calculate PageRank Using GraphFrames

We learned how to calculate PageRank in Homework 2, and we all noticed that computing it on a large matrix is really slow. So what if we want to scale up?

Spark may be the answer.

https://github.com/graphframes/graphframes

Since GraphX has no Python API, we use GraphFrames, which wraps GraphX algorithms.

We first load data from the "edges.csv" file into a vertex DataFrame with a unique ID column 'id'.

Then we load data from the same "edges.csv" file to create an edge DataFrame with 'src' and 'dst' columns.

Next, we create a GraphFrame from the vertex and edge DataFrames.

Finally, we use the built-in pageRank function to get the PageRank.

In [70]:
def load_file(path):
    """ Load text file into RDD
    Args: 
        path (string): path to the file
    Returns:
        MapPartitionsRDD
    """
    text_file = sc.textFile(path)
    return text_file

Running the following code should give this result:
    
    [u'realDonaldTrump Trump', u'realDonaldTrump TrumpGolf', u'realDonaldTrump TiffanyATrump', u'realDonaldTrump IngrahamAngle', u'realDonaldTrump mike_pence', u'realDonaldTrump TeamTrump', u'realDonaldTrump DRUDGE_REPORT', u'realDonaldTrump MrsVanessaTrump', u'realDonaldTrump LaraLeaTrump', u'realDonaldTrump seanhannity', u'realDonaldTrump foxnation', u'realDonaldTrump CLewandowski_', u'realDonaldTrump AnnCoulter', u'realDonaldTrump DiamondandSilk', u'realDonaldTrump KatrinaCampins', u'realDonaldTrump KatrinaPierson', u'realDonaldTrump MichaelCohen212', u'realDonaldTrump foxandfriends', u'realDonaldTrump MELANIATRUMP', u'realDonaldTrump GeraldoRivera', u'realDonaldTrump ericbolling', u'realDonaldTrump RealRomaDowney', u'realDonaldTrump MarkBurnettTV', u'realDonaldTrump garyplayer', u'realDonaldTrump MagicJohnson', u'realDonaldTrump VinceMcMahon', u'realDonaldTrump DanScavino', u'realDonaldTrump TrumpWaikiki', u'realDonaldTrump TrumpDoral', u'realDonaldTrump TrumpCharlotte', u'realDonaldTrump TrumpLasVegas', u'realDonaldTrump TrumpChicago', u'realDonaldTrump TrumpGolfDC', u'realDonaldTrump TrumpGolfLA', u'realDonaldTrump EricTrump', u'realDonaldTrump morningmika', u'realDonaldTrump JoeNBC'...]

In [71]:
path = "edges.csv"
file_rdd = load_file(path)
print file_rdd.collect()

[u'realDonaldTrump Trump', u'realDonaldTrump TrumpGolf', u'realDonaldTrump TiffanyATrump', u'realDonaldTrump IngrahamAngle', u'realDonaldTrump mike_pence', u'realDonaldTrump TeamTrump', u'realDonaldTrump DRUDGE_REPORT', u'realDonaldTrump MrsVanessaTrump', u'realDonaldTrump LaraLeaTrump', u'realDonaldTrump seanhannity', u'realDonaldTrump foxnation', u'realDonaldTrump CLewandowski_', u'realDonaldTrump AnnCoulter', u'realDonaldTrump DiamondandSilk', u'realDonaldTrump KatrinaCampins', u'realDonaldTrump KatrinaPierson', u'realDonaldTrump MichaelCohen212', u'realDonaldTrump foxandfriends', u'realDonaldTrump MELANIATRUMP', u'realDonaldTrump GeraldoRivera', u'realDonaldTrump ericbolling', u'realDonaldTrump RealRomaDowney', u'realDonaldTrump MarkBurnettTV', u'realDonaldTrump garyplayer', u'realDonaldTrump MagicJohnson', u'realDonaldTrump VinceMcMahon', u'realDonaldTrump DanScavino', u'realDonaldTrump TrumpWaikiki', u'realDonaldTrump TrumpDoral', u'realDonaldTrump TrumpCharlotte', u'realDonaldTr

In [72]:
def create_vertex(file_rdd):
    """ Create a Vertex DataFrame
    Args: 
        file_rdd (RDD): RDD of input lines, as returned by load_file
    Returns:
        DataFrame[id: string]
    """
    def get_vertex(line):
        return line.split(' ')
    v_rdd = file_rdd.flatMap(get_vertex)
    v_rdd = v_rdd.map(lambda x: (x, )).distinct()
    v_df = v_rdd.toDF(['id'])
    return v_df

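Running the following code should give a result like this (the row order can differ between runs, since distinct() does not guarantee an ordering):
```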
    +---------------+
    |             id|
    +---------------+
    |   HeyItsLindaC|
    | vanderkimberly|
    |  TiffanyATrump|
    |     D29Gillian|
    | HanksYanksGolf|
    | MissVanZutphen|
    |      foxnation|
    |             AP|
    |  CLewandowski_|
    |   MELANIATRUMP|
    |         JoeNBC|
    |TrumpGolfPhilly|
    |    TrumpPanama|
    |     adwnewyork|
    | KateRuthBrewer|
    | KatrinaCampins|
    |  IvankaJewelry|
    | chelseahandler|
    |          TNGCJ|
    |    IvankaTrump|
    +---------------+
    only showing top 20 rows
```


In [73]:
v_df = create_vertex(file_rdd)
v_df.show()

+---------------+
|             id|
+---------------+
|   EricTrumpFdn|
|   TrumpNewYork|
|     TrumpTower|
|      TrumpGolf|
|          Trump|
|  TiffanyATrump|
|   TrumpWaikiki|
| MichaelBreedGA|
|    ericbolling|
|    trumpwinery|
| TrumpTurnberry|
|realDonaldTrump|
|   TrumpJupiter|
|  Cmiddaughgolf|
|       parscale|
|   JeffreyMoser|
|   MichaelBreed|
|  TrumpScotland|
|      foxnation|
|    TrumpGolfHV|
+---------------+
only showing top 20 rows



In [89]:
def create_edge(file_rdd):
    """ Create an Edge DataFrame
    Args: 
        file_rdd (RDD): RDD of input lines, as returned by load_file
    Returns:
        DataFrame[src: string, dst: string, relationship: string]
        sort by src
    """
    def to_edge(line):
        # each line is "src dst"; every edge gets the same 'friend' label
        words = line.split(' ')
        return (words[0], words[1], 'friend')
    e_rdd = file_rdd.map(to_edge)
    e_df = e_rdd.toDF(['src','dst','relationship'])
    e_df = e_df.sort('src')
    return e_df

Running the following code should give a result in this form:
```
    +---------------+---------------+------------+
    |            src|            dst|relationship|
    +---------------+---------------+------------+
    |realDonaldTrump|          Trump|      friend|
    |realDonaldTrump|      TrumpGolf|      friend|
    |realDonaldTrump|  TiffanyATrump|      friend|
    |realDonaldTrump|  IngrahamAngle|      friend|
    |realDonaldTrump|     mike_pence|      friend|
    |realDonaldTrump|      TeamTrump|      friend|
    |realDonaldTrump|  DRUDGE_REPORT|      friend|
    |realDonaldTrump|MrsVanessaTrump|      friend|
    |realDonaldTrump|   LaraLeaTrump|      friend|
    |realDonaldTrump|    seanhannity|      friend|
    |realDonaldTrump|      foxnation|      friend|
    |realDonaldTrump|  CLewandowski_|      friend|
    |realDonaldTrump|     AnnCoulter|      friend|
    |realDonaldTrump| DiamondandSilk|      friend|
    |realDonaldTrump| KatrinaCampins|      friend|
    |realDonaldTrump| KatrinaPierson|      friend|
    |realDonaldTrump|MichaelCohen212|      friend|
    |realDonaldTrump|  foxandfriends|      friend|
    |realDonaldTrump|   MELANIATRUMP|      friend|
    |realDonaldTrump|  GeraldoRivera|      friend|
    +---------------+---------------+------------+
    only showing top 20 rows
```

In [90]:
e_df = create_edge(file_rdd)
e_df.show()

+-----+---------------+------------+
|  src|            dst|relationship|
+-----+---------------+------------+
|Trump|  TiffanyATrump|      friend|
|Trump|   TrumpJupiter|      friend|
|Trump| TrumpPalmBeach|      friend|
|Trump|      TrumpGolf|      friend|
|Trump|     TrumpTower|      friend|
|Trump|    trumpwinery|      friend|
|Trump| TrumpTurnberry|      friend|
|Trump|TrumpBedminster|      friend|
|Trump|TrumpNationalNY|      friend|
|Trump|  TrumpScotland|      friend|
|Trump|   TrumpWaikiki|      friend|
|Trump|   TrumpDoonbeg|      friend|
|Trump|    TrumpPanama|      friend|
|Trump|   TrumpToronto|      friend|
|Trump|    TrumpHotels|      friend|
|Trump|   TrumpChicago|      friend|
|Trump|  TrumpLasVegas|      friend|
|Trump|TrumpGolfPhilly|      friend|
|Trump| TrumpColtsNeck|      friend|
|Trump|    TrumpGolfHV|      friend|
+-----+---------------+------------+
only showing top 20 rows



In [79]:
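def get_pagerank(v_df, e_df, d=0.85, iters=100):
    """ Create a DataFrame with 'id' and 'pagerank' column
    Args: 
        v_df (DataFrame): vertex DataFrame with an 'id' column
        e_df (DataFrame): edge DataFrame with 'src' and 'dst' columns
        d (float): damping factor; 1 - d is passed as resetProbability
        iters (int): number of PageRank iterations
    Returns:
        DataFrame[id: string, pagerank: double]: a DataFrame with 'id' and 'pagerank' column
    """
    g = GraphFrame(v_df, e_df)
    res = g.pageRank(resetProbability=1 - d, maxIter=iters)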
    res = res.vertices.select("id", "pagerank")
    return res
    

Running the following code should give a result like this:
```
    +---------------+-------------------+
    |             id|           pagerank|
    +---------------+-------------------+
    | KatrinaPierson| 0.1531138204693519|
    |   MichaelBreed|0.15580118291790124|
    |    TrumpHotels|0.15385936436797615|
    |  TrumpScotland|0.15385936436797615|
    |  foxandfriends| 0.1531138204693519|
    |      foxnation| 0.1531138204693519|
    |   CallawayGolf|0.15580118291790124|
    |    TrumpGolfHV|0.15385936436797615|
    |     TrumpDoral|0.15697318483732803|
    |TrumpFerryPoint|0.15385936436797615|
    |  oreillyfactor| 0.1531138204693519|
    |  CLewandowski_| 0.1531138204693519|
    |   VinceMcMahon| 0.1531138204693519|
    |  DRUDGE_REPORT| 0.1531138204693519|
    |     SIGolfPlus|0.15580118291790124|
    |   MELANIATRUMP| 0.1531138204693519|
    |     DanScavino| 0.1531138204693519|
    |    CRTurnberry|0.15580118291790124|
    |MichaelCohen212| 0.1531138204693519|
    |         JoeNBC| 0.1531138204693519|
    +---------------+-------------------+
    only showing top 20 rows

```

In [91]:
pagerank = get_pagerank(v_df, e_df)
pagerank.sort('id').show()

+--------------+-------------------+
|            id|           pagerank|
+--------------+-------------------+
|   AMurrayGolf|0.15580118291790124|
|    AnnCoulter| 0.1531138204693519|
|BrendanAMurphy|0.15580118291790124|
| CLewandowski_| 0.1531138204693519|
|   CRTurnberry|0.15580118291790124|
|  CallawayGolf|0.15580118291790124|
| Cmiddaughgolf|0.15580118291790124|
| DRUDGE_REPORT| 0.1531138204693519|
|    DanScavino| 0.1531138204693519|
|DiamondandSilk| 0.1531138204693519|
|DonaldJTrumpJr|0.15697318483732803|
|     EricTrump| 0.1627743677552293|
|  EricTrumpFdn|0.15385936436797615|
| GeraldoRivera| 0.1531138204693519|
|HanksYanksGolf|0.15580118291790124|
| IngrahamAngle| 0.1531138204693519|
|   IvankaTrump|0.15697318483732803|
|  JeffreyMoser|0.15580118291790124|
|        JoeNBC| 0.1531138204693519|
|KatrinaCampins| 0.1531138204693519|
+--------------+-------------------+
only showing top 20 rows



## Q3: Implement your own PageRank Function

Now that we are getting familiar with Spark, we are going to implement our own PageRank function with it. In Spark, most of the processing is done in functions passed to transformations such as map and flatMap, so some of the following functions yield their results as a generator.  
First, we need to parse every line into a tuple (follower, followee) and group the tuples by follower.  
You may use a lambda function or the pre-defined function parse_relations to parse an input line.

We start by implementing the functions that will be passed to Spark:
- parse_relations: parse an input line into a pair (follower, followee)

In [82]:
def parse_relations(line):
    """This function is passed to map to split a line into follower and followee
    Args: 
        line (string): input line
    Returns:
        tuple: tuple of follower and followee
    """
    words = line.split(' ')
    return (words[0], words[1])

- init_rank: take a grouped input pair and give its follower an initial rank of 1.0

In [83]:
def init_rank(edge):
    """Set the initial rank of each node
    Args: 
        edge (tuple): (node, <pyspark.resultiterable.ResultIterable object>)
    Returns:
        tuple: tuple of (follower, 1.0)
    """
    follower = edge[0]
    return follower, 1.0

- computeContribs: write a generator that calculates a node's contributions to the ranks of the nodes it points to

In [84]:
def computeContribs(nodes, rank):
    """Yield (node, contribution) pairs, splitting rank evenly among nodes"""
    num_nodes = len(nodes)
    for node in nodes:
        yield (node, rank / num_nodes)
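
To see what the generator produces, it can be called directly on a plain list. The node names and the rank value below are hypothetical; in the Spark job, `flatMap` consumes the generator and flattens the yielded pairs into one RDD:

```python
def computeContribs(nodes, rank):
    # same logic as the function above, repeated so this snippet runs on its own
    num_nodes = len(nodes)
    for node in nodes:
        yield (node, rank / num_nodes)

# hypothetical node that follows three accounts and currently has rank 1.5
contribs = list(computeContribs(['a', 'b', 'c'], 1.5))
print(contribs)
```

Each of the three followed nodes receives an equal share of the rank, here 0.5.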

- new_rank: calculate a node's new rank from the sum of the contributions it received

In [85]:
def new_rank(rank):
    """Calculate new rank
    Args: 
        rank (float): sum of the contributions received by a node
    Returns:
        float: new rank
    """
    new_rank = rank * 0.85 + 0.15
    return new_rank
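
Written as a formula, the update that computeContribs, the sum over contributions, and new_rank implement together can be sketched as follows. The symbols $r_t$, $\mathrm{in}(v)$ and $\mathrm{out}(u)$ are notation introduced here and do not appear in the code: $\mathrm{in}(v)$ is the set of nodes with an edge to $v$ that currently have a rank, and $\mathrm{out}(u)$ is the set of nodes that $u$ points to.

$$r_{t+1}(v) = 0.15 + 0.85 \sum_{u \in \mathrm{in}(v)} \frac{r_t(u)}{|\mathrm{out}(u)|}$$

This is the unnormalized form of PageRank: the ranks are not scaled to sum to 1, and a computed rank never falls below 0.15, which is consistent with the values shown in the outputs.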


- update_contribution: unpack a joined (node, (neighbors, rank)) record and compute its contributions

In [86]:
def update_contribution(rank):
    # rank is a joined record: (node, (neighbors, current_rank))
    return computeContribs(rank[1][0], rank[1][1])

- Put it all together: use the functions implemented above to finish the page_rank function

In [87]:
def page_rank(file_rdd):
    """Compute the PageRank of every node in the input RDD
    Args: 
        file_rdd (RDD): RDD of input lines, as returned by load_file
    Returns:
        list: a list of tuples of node and PageRank, sort by key
    """
        
    relations = file_rdd.map(parse_relations)
    edges = relations.distinct().groupByKey()
    ranks = edges.map(init_rank)
    for iteration in range(10):
        # Calculates current node contributions to the rank of other nodes.
        contribs = edges.join(ranks).flatMap(update_contribution)
        # Re-calculates current node's ranks based on neighbor nodes.
        ranks = contribs.reduceByKey(add)
        ranks = ranks.mapValues(new_rank)
    ranks = ranks.sortByKey()
    res = ranks.collect()
    return res
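
To sanity-check the logic without a cluster, the same steps can be mirrored in plain Python on a tiny graph. This is only a sketch under assumptions: `local_page_rank`, its parameters and the three-node cycle are hypothetical and introduced here for illustration. The dictionaries stand in for the RDDs, and the `if node in ranks` check plays the role of the inner join.

```python
from collections import defaultdict

def local_page_rank(pairs, iterations=10, damping=0.85):
    """Plain-Python mirror of page_rank for small inputs (hypothetical helper)."""
    # follower -> set of distinct followees, like distinct().groupByKey()
    edges = defaultdict(set)
    for follower, followee in pairs:
        edges[follower].add(followee)
    # every node with outgoing edges starts at rank 1.0, like init_rank
    ranks = {node: 1.0 for node in edges}
    for _ in range(iterations):
        contribs = defaultdict(float)
        for node, neighbours in edges.items():
            if node in ranks:  # inner join: only nodes that have a rank contribute
                share = ranks[node] / len(neighbours)
                for neighbour in neighbours:
                    contribs[neighbour] += share  # like reduceByKey(add)
        # like mapValues(new_rank)
        ranks = {n: damping * total + (1 - damping) for n, total in contribs.items()}
    return sorted(ranks.items())  # like sortByKey().collect()

# hypothetical three-node cycle: a -> b -> c -> a
print(local_page_rank([("a", "b"), ("b", "c"), ("c", "a")]))
```

In a cycle every node receives exactly one full contribution, so all ranks stay at 1.0. On graphs where some node receives no contributions, that node drops out of `ranks` after the update, just as it does after `reduceByKey` in the Spark version.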
    

After finishing the page_rank function, run the following code to calculate the PageRank of the input file.
You should get a result like this:

```
    [(u'TrumpChicago', 0.15697318483734526),
     (u'GeraldoRivera', 0.15311382046935743),
     (u'piersmorgan', 0.15311382046935743),
     (u'TrumpBedminster', 0.15385936436798783),
     (u'CLewandowski_', 0.15311382046935743),
     (u'parscale', 0.1558011829179162),
     (u'IvankaTrump', 0.15697318483734526),
     (u'TrumpPalmBeach', 0.15966054728590404),
     (u'greta', 0.15311382046935743),
     (u'foxandfriends', 0.15311382046935743),
     (u'SIGolfPlus', 0.1558011829179162),
     ...
```

In [88]:
page_rank(file_rdd)

[(u'AMurrayGolf', 0.1558011829179162),
 (u'AnnCoulter', 0.15311382046935743),
 (u'BrendanAMurphy', 0.1558011829179162),
 (u'CLewandowski_', 0.15311382046935743),
 (u'CRTurnberry', 0.1558011829179162),
 (u'CallawayGolf', 0.1558011829179162),
 (u'Cmiddaughgolf', 0.1558011829179162),
 (u'DRUDGE_REPORT', 0.15311382046935743),
 (u'DanScavino', 0.15311382046935743),
 (u'DiamondandSilk', 0.15311382046935743),
 (u'DonaldJTrumpJr', 0.15697318483734526),
 (u'EricTrump', 0.16277436775526147),
 (u'EricTrumpFdn', 0.15385936436798783),
 (u'GeraldoRivera', 0.15311382046935743),
 (u'HanksYanksGolf', 0.1558011829179162),
 (u'IngrahamAngle', 0.15311382046935743),
 (u'IvankaTrump', 0.15697318483734526),
 (u'JeffreyMoser', 0.1558011829179162),
 (u'JoeNBC', 0.15311382046935743),
 (u'KatrinaCampins', 0.15311382046935743),
 (u'KatrinaPierson', 0.15311382046935743),
 (u'LaraLeaTrump', 0.15311382046935743),
 (u'MELANIATRUMP', 0.15311382046935743),
 (u'MagicJohnson', 0.15311382046935743),
 (u'MarkBurnettTV', 0.

## Some thoughts beyond this tutorial:
If you increase the number of iterations to a larger value, there is a high probability that the code will crash. To figure out why, we need a deeper understanding of RDDs.
An RDD is described by five components, in two groups:
- lineage info: partitions, dependencies, computation (like map, filter, join).
- optimization info: partitioner, preferred locations.

Fault recovery is very important in Spark and other distributed systems. In MPI, fault recovery is not convenient to achieve, so we have to save checkpoints to make an MPI program recoverable. As the iterations go on, the cost of fault recovery grows larger and larger.

Assume that the probability that a single task succeeds is p, and that one iteration consists of N tasks. The probability that this iteration does not need fault recovery is
            $$p^N$$

So the probability that iterations 1 to k-1 run cleanly and iteration k is the first to need fault recovery is:
            $$p^{N(k-1)}(1 - p^N)$$

From the probabilities above, we can see that as the number of iterations increases, it becomes more and more likely that some iteration needs fault recovery, so we need more and more resources for it.
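
The two expressions can be evaluated numerically with a short sketch. It reads p as the probability that a single task succeeds (the reading under which $p^N$ is the chance of a clean iteration); the function names and the values of p and N are hypothetical:

```python
def prob_iteration_clean(p, n_tasks):
    """Probability that all n_tasks tasks of one iteration succeed: p^N."""
    return p ** n_tasks

def prob_first_recovery_at(p, n_tasks, k):
    """Probability that iterations 1..k-1 are clean and iteration k needs recovery."""
    clean = prob_iteration_clean(p, n_tasks)
    return clean ** (k - 1) * (1 - clean)

def prob_recovery_within(p, n_tasks, k):
    """Probability that at least one of the first k iterations needs recovery."""
    return 1 - prob_iteration_clean(p, n_tasks) ** k

p, n_tasks = 0.999, 100  # hypothetical values
for k in (1, 10, 100):
    print("k=%d P(recovery within k iterations)=%.4f" % (k, prob_recovery_within(p, n_tasks, k)))
```

Summing `prob_first_recovery_at` over iterations 1 to k gives `prob_recovery_within`, which approaches 1 as k grows.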
