## Why Spark?

Data science almost always involves big data. To handle big data, you need parallelization, and parallelization is a hard problem. Spark is designed for data-intensive work, and its abstractions make parallel processing much easier. With the Spark framework, you can focus on data processing and model building. Because of its simplicity, performance, and flexibility, Spark is popular among big data developers and data scientists.

![spark-concepts](http://chuantu.biz/t6/269/1522566870x-1566683256.png)

## Tutorial content

We will cover the following topics in this tutorial:
- [Spark and MapReduce](#Spark-and-MapReduce)
- [Launch Apache Spark on AWS](#Launch-Apache-Spark-on-AWS)
- [Spark Programming Basics](#Spark-Programming-Basics)
- [Advanced Apache Spark](#Advanced-Apache-Spark)
- [Optimization Suggestions](#Optimization-Suggestions)
- [Example: Calculate Pi](#Example:-Calculate-Pi)
- [Example: Logistic Regression](#Example:-Logistic-Regression)

## Spark and MapReduce

Apache Hadoop's MapReduce is a programming model for processing large amounts of data. The model has two components: a map function and a reduce function. Users write a map function that turns the input into intermediate **key/value** pairs, and a reduce function that merges all values sharing the same intermediate key. MapReduce relies on Hadoop's distributed file system (HDFS). At the beginning of a job, MapReduce reads the input data from HDFS; at the end of the job, it stores the result back to HDFS. MapReduce has become very popular over the years because many real tasks can be expressed easily with it. However, many tasks today (such as machine learning) are iterative. Each iteration can be expressed as one MapReduce job, but each job must load its input from HDFS and write its output back to HDFS, which causes a performance bottleneck.
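
To make the map, shuffle, and reduce phases concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code: `map_reduce`, `emit_first_letter`, and the sample lines are hypothetical names and data chosen only for illustration.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Imitate one MapReduce job on a single machine."""
    intermediate = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            # Shuffle phase: group values by their intermediate key.
            intermediate[key].append(value)
    # Reduce phase: merge all values that share a key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


# Hypothetical job: length of the longest word starting with each letter.
lines = ["spark keeps data in memory", "mapreduce stores data on disk"]


def emit_first_letter(line):
    return [(word[0], len(word)) for word in line.split()]


longest = map_reduce(lines, emit_first_letter, lambda key, values: max(values))
print(longest["s"], longest["m"], longest["d"])  # 6 9 4
```

In a real cluster the grouped values would be written to and read from HDFS between jobs, which is exactly the cost described above.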

Spark is designed to remove this bottleneck. Spark tries to keep data in memory while still providing fault tolerance, which can accelerate iterative computation by more than 10x. To do this, Spark introduces a new data abstraction called **Resilient Distributed Datasets** (RDDs). RDDs can be cached in memory across computational stages and reused by multiple jobs. Fault tolerance of RDDs is achieved with **lineage** information, that is, a record of all the operations that were performed to reach the current state of an RDD. If a partition of an RDD is lost, Spark can use the lineage to recompute that partition from its parent RDDs.
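
The lineage idea can be sketched with a toy class in plain Python. This is only an illustration of the principle, not how Spark implements RDDs; `ToyRDD` and its `recover` method are hypothetical names.

```python
class ToyRDD:
    """A toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # the dataset this one was derived from
        self.fn = fn                  # per-element function applied to the parent

    def map(self, fn):
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)

    def recover(self, index):
        """Recompute one lost partition from the parent using the lineage."""
        if self.partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("source data lost; nothing to recompute from")
            parent_part = self.parent.recover(index)
            self.partitions[index] = [self.fn(x) for x in parent_part]
        return self.partitions[index]


source = ToyRDD([[1, 2], [3, 4]])
squared = source.map(lambda x: x * x)
shifted = squared.map(lambda x: x + 1)

# Simulate losing partition 1 of both derived datasets.
squared.partitions[1] = None
shifted.partitions[1] = None

recovered = shifted.recover(1)
print(recovered)  # [10, 17]
```

Only the lost partition is recomputed; partition 0 is never touched.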

![compare](http://chuantu.biz/t6/269/1522566882x-1566683256.png)

Spark is a brilliant programming model. Now I'd like to show you how to write Apache Spark programs in Python that can be deployed on real clusters and solve real tasks. **Amazing!!**

## Launch Apache Spark on AWS

To start, you need to launch your own Apache Spark cluster on AWS.

I highly recommend [Flintrock](https://github.com/nchammas/flintrock#configurable-cli-defaults), a command-line tool for launching Spark clusters on AWS. I also strongly suggest reading the tool's README.md.

Remember to set up the environment variables AWS\_SECRET\_ACCESS_KEY and AWS\_ACCESS\_KEY\_ID before using the tool.

## Clarification

**I ran all programs on an AWS cluster launched by Flintrock and pasted the output into the notebook cells.**

**So the code here can't be executed directly in a Jupyter notebook.**

## Spark Programming Basics

Let's see a Spark WordCount program in Python:

In [None]:
from pyspark import SparkConf, SparkContext

APP_NAME = "Spark WordCount Application"

conf = SparkConf().setAppName(APP_NAME)
sc = SparkContext(conf=conf)
textRDD = sc.textFile("s3a://tutorial-688/README.md")
counts = (textRDD.flatMap(lambda line: line.split(' '))
                .map(lambda term: (term, 1))
                .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("s3a://tutorial-688/output/wordcount_output")

I used Flintrock's README.md as the input file. Here are the input and part of the output:

Input: <https://s3.amazonaws.com/tutorial-688/README.md>

Output: <https://s3.amazonaws.com/tutorial-688/output/wordcount_output/part-00000>

To run the previous code, launch your cluster first, then log in to the master node using:

With Flintrock, we use **Amazon S3** for input and output. To run the program above, you need to create a bucket in S3 first. Use <font color=INDIANRED>s3a://</font> as the prefix (e.g. <font color=INDIANRED>s3a://bucket/path/to/file</font>) to read and write data.

Make sure to export **AWS\_SECRET\_ACCESS\_KEY** and **AWS\_ACCESS\_KEY\_ID** in the environment of the master node:

Then create <font color=INDIANRED>spark_wordcount.py</font> containing the WordCount program above.

Launch the WordCount task on the Spark cluster:

In [None]:
spark-submit spark_wordcount.py

After the task completes, the result is stored at <font color=INDIANRED>s3a://tutorial-688/output/wordcount_output</font>.

**Line by line** explanation of how it works:

1. Use <font color=INDIANRED>SparkConf()</font> to create a configuration object, <font color=INDIANRED>conf</font>; <font color=INDIANRED>setAppName()</font> sets the application name. You can also set other attributes, see [pyspark API](https://spark.apache.org/docs/latest/api/python/pyspark.html) for details.
2. Use <font color=INDIANRED>conf</font> to create a <font color=INDIANRED>SparkContext</font>. A <font color=INDIANRED>SparkContext</font> represents a connection to a Spark cluster. You can use it to create RDDs and shared variables (introduced later), such as broadcast variables, on that cluster.
3. With the <font color=INDIANRED>SparkContext</font> set up as <font color=INDIANRED>sc</font>, we use <font color=INDIANRED>textFile()</font> to read data from S3 and create an RDD of strings, named <font color=INDIANRED>textRDD</font>.
4. Next, split each line using a lambda function. <font color=INDIANRED>flatMap()</font> applies the lambda to every element of the RDD and flattens the results. At this point, we have an RDD of terms.
5. To convert terms into (key, value) pairs, we use <font color=INDIANRED>map()</font> on the RDD. Together, the previous <font color=INDIANRED>flatMap()</font> and this <font color=INDIANRED>map()</font> play the role of the mapper in Hadoop MapReduce.
6. <font color=INDIANRED>reduceByKey()</font> plays the role of the reducer in Hadoop MapReduce. It adds up all values with the same intermediate key, so the final value is the frequency of each word in README.md.
7. <font color=INDIANRED>saveAsTextFile()</font> saves the RDD to <font color=INDIANRED>s3a://tutorial-688/output/wordcount_output</font>.
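
If you don't have a cluster at hand, the data flow of steps 4 to 6 can be traced in plain Python. This is only a local imitation of what the three RDD operations compute, not Spark code; the two sample lines are made up and stand in for the lines of README.md.

```python
from collections import Counter
from itertools import chain

lines = ["to be or", "not to be"]  # stand-in for the lines of README.md

# flatMap: split every line and flatten the per-line lists into one sequence.
terms = list(chain.from_iterable(line.split(' ') for line in lines))

# map: turn each term into a (term, 1) pair.
pairs = [(term, 1) for term in terms]

# reduceByKey: add up the values of all pairs that share a key.
counts = Counter()
for term, one in pairs:
    counts[term] += one

print(terms)         # ['to', 'be', 'or', 'not', 'to', 'be']
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```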

## Advanced Apache Spark

In this section, I'll introduce some ideas and concepts of Apache Spark. They are important when you design and implement a Spark program, and helpful when you try to optimize it.

### Actions and Transformations

Transformations derive new RDDs from existing RDDs. Transformations are lazy: they aren't actually executed until an action needs their result. Actions, such as reduce(), count(), and collect(), don't return RDDs; most of them return a result that is held in the driver's memory. The lazy policy reduces computation overhead and gives Spark the opportunity to combine and optimize operations. Chaining transformations builds the RDD lineage, which can be represented as a Directed Acyclic Graph (DAG) of RDDs. With this graph, a lost RDD can be recomputed from its parents. This is the heart of fault tolerance in Spark.

Advice: Use as few actions as possible when designing your Spark programs.
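
Here is a small plain-Python sketch of lazy evaluation. It is not Spark's implementation; `LazyDataset` and `traced_double` are hypothetical names, and the class only mimics the idea that transformations are recorded while actions do the work.

```python
class LazyDataset:
    """Records transformations and only runs them when an action is called."""

    def __init__(self, data, steps=()):
        self.data = data
        self.steps = steps  # pending transformations, in order

    def map(self, fn):
        # Transformation: returns a new dataset and runs nothing yet.
        return LazyDataset(self.data, self.steps + (fn,))

    def collect(self):
        # Action: run every recorded transformation now.
        result = []
        for x in self.data:
            for fn in self.steps:
                x = fn(x)
            result.append(x)
        return result


calls = []


def traced_double(x):
    calls.append(x)  # remember that the function really ran
    return 2 * x


doubled = LazyDataset([1, 2, 3]).map(traced_double)
calls_before_action = len(calls)      # nothing has run yet
values = doubled.collect()            # the action triggers the work
calls_after_first_action = len(calls)
doubled.collect()                     # a second action reruns the whole chain
calls_after_second_action = len(calls)

print(calls_before_action, values, calls_after_first_action, calls_after_second_action)
# 0 [2, 4, 6] 3 6
```

The second `collect()` repeats all the work, which is one reason to keep the number of actions small (or to cache, as discussed later).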

### Wide and Narrow Dependencies

* Narrow dependencies (NDs): Each child RDD partition needs only one parent RDD partition. In the WordCount example above, <font color=INDIANRED>flatMap()</font> and <font color=INDIANRED>map()</font> are ND transformations. Because each partition is processed independently, ND transformations can be executed on the worker node that already holds the parent partition, so NDs are fast.

	```
	(child RDD partition) <---- (parent RDD partition)
	```

* Wide dependencies (WDs): Each child RDD partition may need multiple parent RDD partitions. WD transformations typically come with a **shuffle** (the same as in MapReduce). A shuffle reorganizes all partitions, so it is very expensive: it involves transferring data over the network and is prone to [straggler](http://pages.cs.wisc.edu/~dkhan/sparkstraglers.pdf) problems. In the WordCount program above, <font color=INDIANRED>reduceByKey()</font> is a WD transformation; the shuffle makes sure that all pairs with the same key end up on the same worker node.

	```
	                      <---- (parent RDD partition 0)
	(child RDD partition) <---- (parent RDD partition 1)
	                      <---- (parent RDD partition 2)
	```
	
![dependencies](http://chuantu.biz/t6/269/1522566825x-1404812823.png)
<center>Dependencies in the WordCount program above</center>
                        
Advice: Avoid wide dependencies as much as possible.
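
The difference between the two kinds of dependencies can be imitated on plain Python lists. This is a sketch under simplifying assumptions: the sample pairs are made up, `shuffle` is a hypothetical helper, and the sum of character codes stands in for Spark's real hash partitioner.

```python
from collections import defaultdict

# Two parent partitions of (word, 1) pairs, like the output of map() in WordCount.
parent_partitions = [
    [("spark", 1), ("aws", 1), ("spark", 1)],
    [("aws", 1), ("spark", 1), ("s3", 1)],
]

# Narrow dependency: child partition i is built from parent partition i only.
upper = [[(word.upper(), n) for word, n in part] for part in parent_partitions]


def shuffle(partitions, num_children):
    """Wide dependency: route every pair to a child partition chosen by its key."""
    children = [defaultdict(list) for _ in range(num_children)]
    for part in partitions:
        for key, value in part:
            # Deterministic stand-in for a hash partitioner.
            target = sum(map(ord, key)) % num_children
            children[target][key].append(value)
    return children


shuffled = shuffle(parent_partitions, 2)
# After the shuffle, each key lives in exactly one child, so it can be reduced locally.
reduced = [{key: sum(values) for key, values in child.items()} for child in shuffled]
print(reduced)
```

Every child partition had to look at every parent partition, which is the data movement that makes a shuffle expensive on a real cluster.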

### Apache Spark Monitor UI

Each action forms a **job**. The ND transformations between two WDs form a **stage**. The work done on one partition within a stage is called a **task**. Each job consists of one or more stages.

This gives a rough equation, assuming every stage has the same number of partitions: NumberOfTasks = NumberOfPartitions * NumberOfStages
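
As a rough illustration of this counting, the sketch below cuts a chain of operations into stages at each wide operation and multiplies by the partition count. `split_into_stages`, the list of wide operations, and the partition count of 4 are assumptions for this example; Spark's real scheduler works on the RDD graph, not on operation names.

```python
def split_into_stages(operations, wide_ops=("reduceByKey", "groupByKey", "join", "repartition")):
    """Group a chain of operations into stages, cutting before each wide one."""
    stages, current = [], []
    for op in operations:
        if op in wide_ops and current:
            stages.append(current)  # a shuffle ends the current stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


wordcount_ops = ["textFile", "flatMap", "map", "reduceByKey", "saveAsTextFile"]
stages = split_into_stages(wordcount_ops)
num_partitions = 4  # assumed value for illustration
num_tasks = num_partitions * len(stages)

print(len(stages), num_tasks)  # 2 8
```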

![stage](http://chuantu.biz/t6/269/1522566895x-1566683256.png)

During the execution of the program, you can visit <font color=INDIANRED>http://[master-node-public-dns]:4040</font> for the Monitor UI.

To see completed jobs, visit <font color=INDIANRED>http://[master-node-public-dns]:18080</font>.

### Partition

By default, large files in HDFS are divided into blocks of 64MB (the default block size depends on the Hadoop version), and each block becomes a partition. In the WordCount program above, <font color=INDIANRED>textFile(file_name)</font> may create an RDD with only one partition if the file is smaller than 64MB. In that case, there is only one task in each stage. In Spark, each task is assigned to one core of one slave machine, which means only one core of one slave machine works during the whole execution. We should therefore pay close attention to the number of partitions when designing a Spark program.

In [None]:
# set the minimum partition number
min_partitions = 16  # example value; choose it based on the cores in your cluster

textRDD = sc.textFile("/input/README.md", minPartitions=min_partitions)
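
To get a feeling for the numbers, here is a back-of-the-envelope estimate of one partition per block. `default_partitions` is a hypothetical helper and a simplification: real Spark also takes the requested minimum partition number and Hadoop's input split rules into account.

```python
import math


def default_partitions(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Estimate input partitions as one per block (at least one)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))


small_file = default_partitions(10 * 1024)            # a README-sized file
large_file = default_partitions(1024 * 1024 * 1024)   # a 1 GB file

print(small_file, large_file)  # 1 16
```

A single partition means a single task per stage, which is why a small input file should be given a minimum partition number explicitly.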

## Optimization Suggestions

### 1. Don't forget to use broadcast variables.

Spark has two types of shared variables: **broadcast variables and accumulators**. Because transformations are lazy and tasks may be re-executed, accumulator updates made inside transformations can be applied more than once and give unexpected results, so I don't recommend using accumulators there. Combined with Python lambda functions, **broadcast variables** are very convenient. Here is an example that uses a broadcast variable to count stop words.

In [None]:
# stop_words is assumed to be a list of stop words defined elsewhere
stop_words_set = set(stop_words)
bc_set = sc.broadcast(stop_words_set)

textRDD = sc.textFile("/input/README.md")
counts = (textRDD.flatMap(lambda line: line.split(' '))
                .map(lambda term: 1 if term in bc_set.value else 0)
                .reduce(lambda a, b: a + b))
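
To see why an accumulator can surprise you while a read-only shared value stays safe, here is a small pure-Python simulation (no Spark involved). `ToyAccumulator` and `run_task` are hypothetical names, and the "retry" is simulated by simply calling the task a second time.

```python
class ToyAccumulator:
    """A shared counter that tasks update as a side effect."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def run_task(partition, stop_words, acc):
    """Count stop words in one partition, also reporting through the accumulator."""
    hits = sum(1 for term in partition if term in stop_words)
    acc.add(hits)  # side effect: applied every time the task runs
    return hits    # return value: the driver keeps only the latest result


stop_words = frozenset({"the", "a"})  # read-only, like a broadcast value
partitions = [["the", "cat"], ["a", "the", "dog"]]
acc = ToyAccumulator()

results = [run_task(p, stop_words, acc) for p in partitions]
# Simulate the first task being executed again, for example after a failure.
results[0] = run_task(partitions[0], stop_words, acc)

print(sum(results), acc.value)  # 3 4
```

The returned values give the correct count of 3, while the accumulator over-counts because the re-executed task added its hits twice.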

### 2. Reasonable partitions.

A common choice is **TotalCores * 4** partitions, where **TotalCores = number of slaves * number of cores per slave**. I highly recommend partitioning the RDD when creating it. You can either pass the partition number to <font color=INDIANRED>parallelize()</font> and <font color=INDIANRED>textFile()</font>, or call <font color=INDIANRED>RDD.repartition()</font>.
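
The rule of thumb is easy to wrap in a helper. `suggested_partitions` is a hypothetical function name, and the cluster size of 4 slaves with 4 cores each is only an example.

```python
def suggested_partitions(num_slaves, cores_per_slave, factor=4):
    """Rule of thumb from this section: TotalCores * 4."""
    total_cores = num_slaves * cores_per_slave
    return total_cores * factor


partition_number = suggested_partitions(num_slaves=4, cores_per_slave=4)
print(partition_number)  # 64
```

The result is the number you would pass to the partition argument of the RDD-creating call.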

### 3. Cache RDDs.

Caching frequently used results avoids repeated computation. For example,

In [None]:
textRDD = sc.textFile(FILE_NAME)
sourceRDD = textRDD.flatMap(...)
sourceRDD.cache()
# sourceRDD.persist(pyspark.storagelevel.StorageLevel.MEMORY_AND_DISK)

RDDa = sourceRDD.map(...)
RDDb = sourceRDD.map(...)

RDDa.count()
RDDb.count()

With <font color=INDIANRED>sourceRDD.cache()</font>, RDDb can reuse **sourceRDD** without executing <font color=INDIANRED>textRDD.flatMap(...)</font> again, because the action <font color=INDIANRED>RDDa.count()</font> has already computed and cached the result.

When the RDD is very large (it can't fit in memory), I recommend using <font color=INDIANRED>RDD.persist(MEMORY_AND_DISK)</font> instead of <font color=INDIANRED>RDD.cache()</font>.

Additionally, when a cached RDD is going to be updated, remember to call <font color=INDIANRED>RDD.unpersist()</font> first, then update it.
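
The saving from caching can be made visible by counting recomputations in a pure-Python stand-in. `CountingSource` and `expensive_flat_map` are hypothetical names; the class only mimics the idea that a cached dataset is computed once and then reused by later actions.

```python
class CountingSource:
    """Stands in for sourceRDD and counts how often its data is recomputed."""

    def __init__(self, compute):
        self.compute = compute
        self.cached = None
        self.use_cache = False
        self.computations = 0

    def cache(self):
        self.use_cache = True
        return self

    def data(self):
        if self.use_cache and self.cached is not None:
            return self.cached  # reuse the stored result
        self.computations += 1
        result = self.compute()
        if self.use_cache:
            self.cached = result
        return result


def expensive_flat_map():
    return [w for line in ["a b", "c d e"] for w in line.split()]


uncached = CountingSource(expensive_flat_map)
len(uncached.data())  # first action
len(uncached.data())  # second action recomputes everything

cached = CountingSource(expensive_flat_map).cache()
len(cached.data())    # first action computes and stores the result
len(cached.data())    # second action reuses it

print(uncached.computations, cached.computations)  # 2 1
```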

In [None]:
cachedRDD.unpersist()
cachedRDD = cachedRDD.map(...)
cachedRDD.cache()

### Others

1. Keep the locality of large RDDs if possible.
2. Don't break the original data structure if possible.

## Example: Calculate Pi

In this example, we use [Monte Carlo integration](https://en.wikipedia.org/wiki/Monte_Carlo_integration) to calculate Pi. The idea is simple: we throw beans into a 2x2 square and use the fraction of beans that land inside the inscribed unit circle to estimate the circle's area, which equals Pi. Since the square has area 4, Pi is approximately 4 times that fraction.

With Spark, all slave nodes can work simultaneously, and the master node adds all the results together.
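
Before running on a cluster, the estimate can be checked locally in plain Python. This sketch assumes 4 "tasks" of 50,000 beans each and a fixed seed, both chosen only for illustration; a plain `sum` stands in for the work that Spark would distribute.

```python
import math
import random


def count_in_circle(n, rng):
    """Throw n beans into the 2x2 square and count those inside the unit circle."""
    hits = 0
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            hits += 1
    return hits


rng = random.Random(42)   # fixed seed so the run is repeatable
tasks = [50000] * 4       # one entry per simulated task
total_n = sum(tasks)

# Each task counts its own beans; the partial counts are then added together.
count = sum(count_in_circle(n, rng) for n in tasks)
pi_estimate = 4.0 * count / total_n

print("Local Pi = {:.4f} (math.pi = {:.4f})".format(pi_estimate, math.pi))
```

With 200,000 beans the estimate is usually within about 0.01 of the true value; more beans give a tighter estimate.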

![Monte Carlo](http://chuantu.biz/t6/269/1522566910x-1566683256.png)

In [None]:
import sys
from pyspark import SparkConf, SparkContext
from random import uniform

APP_NAME = "Spark Pi"

if __name__ == "__main__":
    conf = SparkConf().setAppName(APP_NAME)
    sc = SparkContext(conf=conf)

    total_cores = int(sys.argv[1])
    
    # total number of beans
    total_n = 1000000 * total_cores
    task_list = [1000000] * total_cores


    def monte_carlo(n):
        '''casting beans in square

        :param n: number of beans
        :return: [0, 1, 0, 1, ...], 0 means not in circle, 1 means in circle
        '''
        ret = []
        for i in range(n):
            x = uniform(-1, 1)
            y = uniform(-1, 1)

            # <= 1 means in circle
            ret.append(1 if x ** 2 + y ** 2 <= 1 else 0)
        return ret

    # count is number of beans in circle
    count = (sc.parallelize(task_list, total_cores)
             .flatMap(monte_carlo)
             .reduce(lambda a, b: a + b))

    pi = 4.0 * count / total_n

    print("Spark Pi = {:.7f}".format(pi))

To run this code,

In [None]:
spark-submit pi.py 16

## Example: Logistic Regression

In this example, we try to use Spark to do logistic regression.

Remember to install numpy in your cluster before executing the code.

In [None]:
flintrock run-command test-cluster 'sudo yum install -y gcc'
flintrock run-command test-cluster 'pip install --user numpy'

To simplify the process and test the correctness of the LR program directly, I generated a binary classification dataset that follows a linear distribution.

Dataset Link: <https://s3.amazonaws.com/tutorial-688/final_data>

![dataset](http://chuantu.biz/t6/269/1522566850x-1566683256.png)
<center> Visualization of the Dataset </center>

In [None]:
import sys
import numpy as np
import math
from pyspark import SparkContext
import time

APP_NAME = "Spark LR"
INPUT_FILE = "s3a://tutorial-688/data_small.csv"


def sigmoid(x):
    '''sigmoid

    :param x: x = X.dot(w)
    :return: probability of y = 1
    '''
    return 1 / (1 + math.exp(-x))


def gd_partition(samples, weights):
    '''get weight updates

    :param samples: iterable, training samples of one partition
    :param weights: np.array(), weights for current iteration, shape: (num_features,)
    :return: list with one np.array(), the summed weight updates of this partition
    '''
    weight_updates = np.zeros(num_features)

    for sample in samples:
        label = float(sample[0])
        # np.array(map(...)) does not build a numeric array in Python 3
        value = np.array(sample[1:], dtype=float)

        pred = sigmoid(value.dot(weights))
        diff = label - pred

        # gradient ascent step; the L2 penalty shrinks the weights, hence the minus sign
        weight_updates += alpha * (diff * value - L * weights)

    # mapPartitions() expects an iterable, so wrap the array to keep it as one element
    return [weight_updates]


def pred_partition(samples, weights):
    '''predict the label of current samples

    :param samples: list, training samples
    :param weights: np.array(), shape: (1 x num_features)
    :return: list, [0, 1, 0, 1, ...] 0 means wrong, 1 means correct
    '''
    ret = []
    for sample in samples:
        label = float(sample[0])
        # convert the string features directly (np.array(map(...)) breaks in Python 3)
        value = np.array(sample[1:], dtype=float)

        pred = sigmoid(value.dot(weights))
        pred_label = 1 if pred > 0.5 else 0
        ret.append((1, 1 if pred_label == label else 0))
    return ret


if __name__ == "__main__":

    total_cores = int(sys.argv[1])
    max_iters = int(sys.argv[2])
    num_features = int(sys.argv[3])

    # according to optimization 2
    partition_number = total_cores * 4

    w_initial_value = 0.001
    alpha = 0.01
    alpha_decay = 0.95
    # L2 regularization parameter
    L = 0.01

    sc = SparkContext(appName=APP_NAME)

    t0 = time.time()
    samples_rdd = sc.textFile(INPUT_FILE, partition_number).map(lambda x: x.split(','), preservesPartitioning=True)
    samples_rdd.cache()

    w = np.ones(num_features) * w_initial_value

    for i in range(max_iters):
        print("Now is {}th iteration, complete {:.2f}".format(i+1, (i+1)/max_iters))

        # apply updates to current weights
        w += samples_rdd.mapPartitions(lambda x: gd_partition(x, w)).reduce(lambda a, b: a + b)
        alpha *= alpha_decay

    t1 = time.time()
    print("Training complete, time cost is {:.2f}m.".format((t1-t0)/60))

    # mapPartitions() can convert each partition of the source RDD into multiple elements of the result
    # mapPartitions() is called once for each Partition
    res = samples_rdd.mapPartitions(lambda x: pred_partition(x, w)).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    t2 = time.time()
    print("Test complete, time cost is {:.2f}m.".format((t2-t1)/60))
    print("Accuracy: {:.2f}".format(res[1]/res[0]))

We ran it for 100 iterations; the accuracy reached 98%.

In the code above, I use <font color=INDIANRED>mapPartitions()</font> instead of <font color=INDIANRED>map()</font>. The reason is that <font color=INDIANRED>mapPartitions()</font> is faster than <font color=INDIANRED>map()</font> here, because the function is called only once per partition rather than once per element. In addition, we pass <font color=INDIANRED>preservesPartitioning=True</font> when creating the initial RDD, which corresponds to the first item under "Others" in the Optimization Suggestions.
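
The per-partition pattern can be tried without a cluster. The sketch below is a pure-Python imitation under simplifying assumptions: the four samples are made-up toy data (the second feature is a constant bias term), there is no regularization or learning-rate decay, and list comprehensions stand in for `mapPartitions()` and `reduce()`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partition_gradient(samples, weights):
    """Sum (label - prediction) * features over one partition."""
    grad = [0.0] * len(weights)
    for label, features in samples:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (label - pred) * x
    return grad


# Hypothetical toy data: the label is 1 when the first feature is positive.
partitions = [
    [(1, [2.0, 1.0]), (0, [-1.5, 1.0])],
    [(1, [1.0, 1.0]), (0, [-2.0, 1.0])],
]

weights = [0.0, 0.0]
alpha = 0.1
for _ in range(100):
    # One gradient per partition (the role of mapPartitions) ...
    grads = [partition_gradient(p, weights) for p in partitions]
    # ... then add the partial gradients together (the role of reduce).
    total = [sum(g[j] for g in grads) for j in range(len(weights))]
    weights = [w + alpha * t for w, t in zip(weights, total)]

samples = [s for p in partitions for s in p]
correct = sum(
    1 for label, features in samples
    if (sigmoid(sum(w * x for w, x in zip(weights, features))) > 0.5) == (label == 1)
)
accuracy = correct / len(samples)
print(accuracy)  # 1.0
```

Each partition contributes a single gradient vector per iteration, so only one small object per partition has to travel back to the driver.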

To run this code, use

In [None]:
spark-submit spark_lr.py 4 100 2

## Summary and references

This tutorial is a basic introduction to Spark programming, covering the motivation for data scientists to learn Spark, the basic Spark Python API, and ideas for optimization. Some concepts may be hard for beginners to understand. The best learning route is to run some basic examples first, like **wordcount** and **calculate pi**. If you have trouble understanding the **Logistic Regression** example, reviewing the advanced and optimization sections above would be a good choice.

1. [Spark original paper](https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf)
2. [MapReduce original paper](https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/mapreduce-osdi04.pdf)
3. [RDD programming guide (APIs)](http://spark.apache.org/docs/latest/rdd-programming-guide.html)
4. [Spark quick start](http://spark.apache.org/docs/latest/quick-start.html)
5. [Narrow and wide dependencies](https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies)
6. [Apache Spark examples](https://spark.apache.org/examples.html)
7. [Spark Monitor UI](https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html)