## Tutorial Introduction

This tutorial walks through the basic and high-level Apache Spark APIs, with a focus on Spark Streaming. All code was run on Windows 7.

Apache Spark is an open-source, fast, general-purpose cluster computing system. It was originally developed at the University of California, Berkeley's AMPLab, and was later donated to the Apache Software Foundation, which has maintained it since. Spark provides a high-level interface for programming clusters with implicit data parallelism and fault tolerance. _(To make this tutorial easy to review in a Jupyter notebook, I ran Spark on a multi-core machine in pseudo-distributed local mode, which is intended for development and testing.)_

Spark is claimed to be up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Three design choices account for most of the difference:
1. MapReduce persists each step's results to disk (e.g. HDFS, the Hadoop Distributed File System), where the next step reads them back as input. Spark can instead pass the result of one step directly to the next, which saves a lot of disk and network I/O. This is practical because memory has become cheap and 64-bit platforms can address terabytes of it.
2. Apache Spark has an advanced Directed Acyclic Graph (DAG) execution engine that supports acyclic data flow and in-memory computing. It can combine many operations into a single stage, whereas MapReduce schedules them as multiple stages.
3. Apache Spark saves a lot of Java Virtual Machine (JVM) startup time by keeping a long-running executor JVM on each cluster node, whereas MapReduce starts a new JVM for each task.

## 0. Include Spark in Jupyter Notebook

Download pre-built Spark package: [Package Link](http://spark.apache.org/downloads.html).

Options selected while writing this tutorial:
- Spark release: 2.0.1 (Oct 03, 2016)
- Package type: Pre-built for Hadoop 2.7 and later
- Download type: direct download

Download `spark-2.0.1-bin-hadoop2.7.tgz`, extract it into the same folder as this notebook file, then make Spark available to the notebook as follows:

In [1]:
import os
import sys
import random
import time

In [2]:
spark_home = os.path.join(os.getcwd(), 'spark-2.0.1-bin-hadoop2.7')
os.environ['SPARK_HOME'] = spark_home
# make the bundled PySpark and Py4J packages importable
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.10.3-src.zip'))

To check that Spark is available in this notebook, try to create a `SparkContext`. There should be no warning or exception.

In [3]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

from operator import add
from pyspark.sql import SparkSession

conf = SparkConf()

# In local mode, give the number of CPU cores Spark may use in the brackets, or use * to let Spark detect it
conf.setMaster("local[*]")
conf.setAppName("Spark Tutorial")

# specify the memory Spark can use
conf.set("spark.executor.memory", "1g")    
sc = SparkContext(conf = conf)

<b>Note1</b> : If the exception `Java gateway process exited before sending the driver its port number` is thrown on Windows, edit line 33 of `spark-2.0.1-bin-hadoop2.7/bin/spark-class2.cmd` and remove the double quotes.

<b>Note2</b> : While writing the tutorial, I encountered a `global name 'accumulators' is not defined` exception from `context.py`. I added <code>print(accumulators)</code> to the body of the <code>_do_init</code> function before the problematic code, and the exception mysteriously disappeared...

<b>Note3</b> : In a cluster environment, pass the cluster's master URL to <code>conf.setMaster()</code>, for example that of a Spark standalone, [Mesos](http://mesos.apache.org/) or [YARN](http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html) cluster.

<b>Note4</b> : Always call <code>sc.stop()</code> before creating another SparkContext.

In [6]:
# check that Spark is working: count the number of words in the LICENSE file

spark = SparkSession.builder.appName("Spark Tutorial").getOrCreate()
lines = spark.read.text(os.path.join(spark_home, 'LICENSE')).rdd.map(lambda r: r[0])
counts = lines.flatMap(lambda x: x.split(' ')).map(lambda x: 1).reduce(add)
print(counts)

3453


The result should be 3453.

## 1. Spark Basic

### 1.1 Create Resilient Distributed Datasets (RDDs)

Spark's cluster programming API is centered on the RDD data structure. An RDD is a read-only multiset of data items that can easily be distributed among cluster nodes, and it gives Spark a convenient way to keep data in memory.

In [4]:
# create an RDD from a list (parallelizing an existing collection), then find the max value
test_list = [random.randint(1, sys.maxsize) for i in range(10000)]
distData = sc.parallelize(test_list)
max_val = distData.reduce(lambda a, b: a if a > b else b)
print(max_val == max(test_list))

True
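
As a rough illustration of why the `reduce` call above can run in parallel, the sketch below mimics the idea in plain Python, with no Spark required: the data is split into partitions, each partition is reduced on its own, and the partial results are reduced again. The helpers `split_into_partitions` and `partitioned_reduce` are hypothetical names used only here and are not part of the Spark API. The sketch assumes the function passed to `reduce` is associative and commutative, as `max` and `add` are.

```python
from functools import reduce
import random


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions interleaved chunks (hypothetical helper)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def partitioned_reduce(func, data, num_partitions=4):
    """Reduce each non-empty partition on its own, then reduce the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


def bigger(a, b):
    """Same rule as the lambda used with distData.reduce()."""
    return a if a > b else b


random.seed(0)
values = [random.randint(1, 10**6) for _ in range(10000)]

# The partitioned result matches the sequential one because `bigger`
# gives the same answer regardless of grouping or order.
print(partitioned_reduce(bigger, values) == max(values))
```

A function such as subtraction, which is neither associative nor commutative, would give results that depend on how the data happens to be partitioned.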


In [6]:
# create an RDD from an existing data file

# Count the number of occurrences of each word in the LICENSE file
test_file = sc.textFile(os.path.join(spark_home, 'LICENSE'))  # reads it as a collection of lines
word_counts = test_file.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
output = word_counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

First ten word-count pairs:
<pre>
: 1430
all: 3
customary: 1
(org.antlr:ST4:4.0.4: 1
managed: 1
Works,: 2
APPENDIX:: 1
details.: 2
granting: 1
Subcomponents:: 1
</pre>

### 1.2 Persist RDD Object in Memory

An RDD is created lazily: the operations that build it are executed only when an action such as `reduce` needs the data, and only the reduced result is returned. The computed contents are not kept, so if the same RDD object is used again later, the operations that build it are executed again. If building the RDD is time consuming, Spark can persist it in memory for future use.

In [14]:
# make a larger file by repeating LICENSE 2000 times so that reading it takes longer; the resulting file is around 34.5 MB
license_path = os.path.join(spark_home, 'LICENSE')
large_license_path = os.path.join(spark_home, 'LARGE_LICENSE')
with open(license_path) as infile:
    license_text = infile.read()
with open(large_license_path, 'w') as outfile:
    for i in range(2000):
        outfile.write(license_text)

In [8]:
line_len = sc.textFile(large_license_path).map(lambda x: len(x))
line_len.persist()    # persist to memory

PythonRDD[23] at RDD at PythonRDD.scala:48

In [9]:
%%time

# time with persisted line_len
max_len = line_len.reduce(lambda a,b : a if a>b else b)
min_len = line_len.reduce(lambda a,b : a if a<b else b)
total_len = line_len.reduce(add)

print("%d %d %d" % (max_len, min_len, total_len))

139 0 35024000
Wall time: 8.32 s


In [10]:
line_len.unpersist()   # remove persisted RDD from memory, compare the time for the same operations

PythonRDD[23] at RDD at PythonRDD.scala:48

In [11]:
%%time

# time without persisted line_len
max_len = line_len.reduce(lambda a,b : a if a>b else b)
min_len = line_len.reduce(lambda a,b : a if a<b else b)
total_len = line_len.reduce(add)

print("%d %d %d" % (max_len, min_len, total_len))

139 0 35024000
Wall time: 13.3 s
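
The timing difference above comes from recomputation. The toy class below is a plain-Python sketch of that behaviour, not Spark's implementation: `LazyDataset` is a hypothetical stand-in for an RDD that stores only the recipe for its data and counts how often the recipe actually runs.

```python
class LazyDataset(object):
    """Toy stand-in for an RDD: stores a recipe and only runs it on demand."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self._cached = None          # filled by the first action if persisted
        self._persisted = False
        self.compute_calls = 0       # how many times the recipe actually ran

    def persist(self):
        self._persisted = True
        return self

    def unpersist(self):
        self._persisted = False
        self._cached = None
        return self

    def _materialize(self):
        if self._persisted and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        if self._persisted:
            self._cached = data
        return data

    def reduce(self, func):
        """An 'action': triggers the computation and returns a single value."""
        data = self._materialize()
        result = data[0]
        for item in data[1:]:
            result = func(result, item)
        return result


lines = ["a bb", "ccc", "", "dddd e"]
line_len = LazyDataset(lambda: [len(x) for x in lines])

line_len.persist()
line_len.reduce(max)
line_len.reduce(min)
print(line_len.compute_calls)   # the recipe ran once; the second action used the cache

line_len.unpersist()
line_len.compute_calls = 0
line_len.reduce(max)
line_len.reduce(min)
print(line_len.compute_calls)   # the recipe ran once per action
```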


### 1.3 Use Complex Function in Map and Reduce

We can also pass more complex, named functions to `map` and `reduce`.

In [12]:
def complexMap(s):
    '''
    Map a line to (first word, word count) if it qualifies, otherwise to (None, 0)
    '''
    words = s.strip().split(" ")

    words_num = len(words)

    # only count lines with more than ten words whose first word starts with an alphabetic character
    if words_num > 10 and words[0][0].isalpha():
        return (words[0], words_num)
    else:
        return (None, 0)

In [13]:
def complexReduce(a, b):
    '''
    conditional sum reduce
    '''
    if a[0] and b[0]:
        return ("Total", a[1]+b[1])
    
    if a[0]:
        return ("Total", a[1])
    
    if b[0]:
        return ("Total", b[1])
    
    return (None, 0)

In [14]:
sc.textFile(spark_home+'LICENSE').map(complexMap).reduce(complexReduce)

('Total', 583)
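
Because these are ordinary Python functions, they can be checked on a handful of lines before being handed to Spark. The sketch below does this with the built-in `map` and `functools.reduce`. The sample lines are made up for illustration, and the two functions are repeated (with `complexMap` returning the word count it computes) so that the block runs on its own.

```python
from functools import reduce


def complexMap(s):
    """Map a line to (first word, word count) if it qualifies, else (None, 0)."""
    words = s.strip().split(" ")
    words_num = len(words)
    if words_num > 10 and words[0][0].isalpha():
        return (words[0], words_num)
    return (None, 0)


def complexReduce(a, b):
    """Conditional sum: only pairs with a non-empty key contribute."""
    if a[0] and b[0]:
        return ("Total", a[1] + b[1])
    if a[0]:
        return ("Total", a[1])
    if b[0]:
        return ("Total", b[1])
    return (None, 0)


# Made-up sample lines, not taken from the LICENSE file
sample_lines = [
    "Licensed under the Apache License, Version 2.0 (the License); you may not use",
    "short line",
    "   ",
    "1. this line has more than ten words but starts with a digit so it is skipped",
    "Unless required by applicable law or agreed to in writing, software is distributed",
]

mapped = [complexMap(line) for line in sample_lines]
result = reduce(complexReduce, mapped)
print(result)
```

Only the first and last sample lines qualify, with 13 words each. Note that if the dataset held a single record, `reduce` would never call `complexReduce`, so the key would remain the line's first word instead of `'Total'`.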

## 2. Spark Streaming

Spark Streaming is an extension of the core Spark API for processing streaming data. It can ingest data from various sources such as Kafka, Flume, Kinesis, or TCP sockets, and a Spark Streaming application checks for newly arrived data at a predefined time interval. Spark provides many high-level data processing methods such as <code>map</code>, <code>reduce</code>, <code>join</code>, <code>window</code> and their variants. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards.

Spark’s machine learning and graph processing algorithms can also be applied to data streams.

<img src="streaming-arch.png"/>

Spark Streaming provides a high-level abstraction for streaming data called a discretized stream (DStream). DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or from other DStreams. Internally, a DStream is represented as a sequence of RDDs.

<img src="streaming-dstream.png"/>

### 2.0 Spark Streaming Workflow

1. Spark Streaming context initialization
2. Input DStream creation (streaming data source)
3. Streaming data processing definition, i.e. applying transformation and output operations to DStreams
4. Start the application with <code>streamingContext.start()</code>
5. Wait for processing to finish with <code>streamingContext.awaitTermination()</code>. The streaming application terminates when an exception occurs or <code>streamingContext.stop()</code> is called.

### 2.1 Initialize Spark Streaming Context

In [4]:
ssc = StreamingContext(sc, 1)  # time interval is defined as 1 second

<b>Note1</b> : Spark Streaming needs at least one node/thread to receive data, plus additional nodes/threads to process it. In pseudo-distributed local mode, using `local` or `local[1]` as the master URL leaves Spark no thread for data processing, because the only thread acts as the data receiver.

<b>Note2</b> : The batch interval should be set based on the application's latency requirements and the available cluster resources so that the streaming application stays stable, i.e. it must be able to process data as fast as the data is generated and received. More details on choosing the batch interval are in the [Spark documentation](http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval).
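
The stability condition in Note2 can be made concrete with a little arithmetic. The sketch below is a simplified model, not Spark's scheduler: it assumes batches are processed strictly one after another and that every batch is ready exactly at the end of its interval. The function `scheduling_delays` is a hypothetical helper written for this illustration.

```python
def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waits before its processing can start.

    Batch k is ready at (k + 1) * batch_interval. Batches are processed one
    after another, so a batch waits whenever the previous one is still running.
    """
    delays = []
    previous_finish = 0.0
    for k, proc_time in enumerate(processing_times):
        ready_at = (k + 1) * batch_interval
        start = max(ready_at, previous_finish)
        delays.append(start - ready_at)
        previous_finish = start + proc_time
    return delays


# Processing (0.5 s) is faster than the 1 s interval: no batch ever waits.
print(scheduling_delays(1.0, [0.5] * 5))

# Processing (1.5 s) is slower than the 1 s interval: the wait grows with every batch.
print(scheduling_delays(1.0, [1.5] * 5))
```

In the second case the delay grows by 0.5 s per batch and never recovers, which is the unstable situation the note warns about.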

### 2.2 Create Input DStream

In a cluster environment, an input DStream can be created from a socket, text files, or binary files, each with its own initialization function. Text and binary files must be hosted on a Hadoop-compatible file system, and a file is picked up only when it is <b>moved</b> or <b>renamed</b> into the monitored directory. Spark reads a file's data once, at the moment it finds the new file name, so a file that is still being copied or edited in place may be seen as empty.
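
One way to satisfy the move-or-rename requirement is to write each file completely in a staging directory and only then rename it into the monitored directory. The sketch below shows the idea on the local file system with throwaway directories. The helper `publish_file` and the directory and file names are hypothetical, and a real job would point `textFileStream()` at the monitored directory.

```python
import os
import shutil
import tempfile


def publish_file(content, monitored_dir, filename, staging_dir):
    """Write the file completely in staging_dir, then move it into monitored_dir.

    staging_dir is assumed to be on the same file system as monitored_dir, so
    the rename happens in one step and a reader never sees a half-written file.
    """
    staging_path = os.path.join(staging_dir, filename)
    with open(staging_path, 'w') as f:
        f.write(content)
    final_path = os.path.join(monitored_dir, filename)
    os.rename(staging_path, final_path)
    return final_path


# Demo with temporary directories that are removed again at the end
base = tempfile.mkdtemp()
staging_dir = os.path.join(base, 'staging')
monitored_dir = os.path.join(base, 'incoming')
os.mkdir(staging_dir)
os.mkdir(monitored_dir)

publish_file("hello spark\n", monitored_dir, 'batch-0001.txt', staging_dir)
print(os.listdir(monitored_dir))
shutil.rmtree(base)
```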

Since this tutorial runs on a local file system, and Spark Streaming is more stable on a distributed file system, we generate the input DStream another way: <code>queueStream()</code>. It builds a data stream from a collection of RDDs and feeds the StreamingContext one RDD per time interval.

We use the Twitter API to get a random sample of the most recent tweets (around 1% of all new tweets) and combine a number of tweets into a group. Each group is transformed into a distributed dataset that can be operated on in parallel, and a series of such datasets is turned into a data stream using <code>queueStream()</code>.

In [16]:
# install twitter package into Jupyter notebook
!pip install twitter
import twitter

In [5]:
def connect_twitter():
    '''
    Connect to Twitter with developer API keys, then call Twitter API on TwitterStream
    '''
    # replace the placeholders with your own Twitter developer credentials
    consumer_key = "YOUR_CONSUMER_KEY"
    consumer_secret = "YOUR_CONSUMER_SECRET"
    access_token = "YOUR_ACCESS_TOKEN"
    access_secret = "YOUR_ACCESS_SECRET"
    auth = twitter.OAuth(token=access_token, token_secret=access_secret,
                         consumer_key=consumer_key, consumer_secret=consumer_secret)
    return twitter.TwitterStream(auth=auth)

twitter_stream = connect_twitter()

In [6]:
def get_tweet(content_generator):
    '''
    Get valid Twitter content from a generator returned by Twitter sample API
    '''
    while True:
        tweet = next(content_generator)
        # skip delete notices and any other message that carries no tweet text
        if 'delete' in tweet or 'text' not in tweet:
            continue

        return tweet['text'].encode('utf-8')

In [7]:
def gen_rdd_queue(twitter_stream, tweets_num=10, queue_len=10):
    '''
    Generate a list of RDDs from groups of tweets; this list will be turned into a data stream
    '''
    rddQueue = []
    
    # Get most recent tweets samples
    content_generator = twitter_stream.statuses.sample(block=True)
    
    for q in range(queue_len):
        contents = []
        for i in range(tweets_num):
            contents.append(get_tweet(content_generator))
        
        # Generate the distributed dataset from a group of tweets content
        rdd = ssc.sparkContext.parallelize(contents, 5)
        
        rddQueue.append(rdd)
        
    return rddQueue

rddQueue = gen_rdd_queue(twitter_stream)

### 2.3 Stream Data Processing Definition

At every time interval, we update the running word counts with the new samples and print the most frequent words.

In [8]:
def process_tweet(new_values, last_sum):
    '''
    Word count update function
    '''
    return sum(new_values) + (last_sum or 0)
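
To see how an update function like the one above is used, the sketch below imitates one `updateStateByKey` step per batch in plain Python. It is a simplified model rather than Spark's implementation: `update_state_by_key` is a hypothetical helper, the two batches are made-up text standing in for tweets, and the update function is repeated so the block runs on its own.

```python
def process_tweet(new_values, last_sum):
    """Same update rule as above: add this batch's counts to the running total."""
    return sum(new_values) + (last_sum or 0)


def update_state_by_key(state, batch_pairs, update_func):
    """Plain-Python imitation of one updateStateByKey step (hypothetical helper).

    state       -- dict of key -> value carried over from earlier batches
    batch_pairs -- list of (key, value) pairs that arrived in this batch
    """
    grouped = {}
    for key, value in batch_pairs:
        grouped.setdefault(key, []).append(value)

    new_state = {}
    # Every key seen so far is updated, even if this batch has no new values for it.
    for key in set(state) | set(grouped):
        new_state[key] = update_func(grouped.get(key, []), state.get(key))
    return new_state


# Two made-up batches, one per time interval
batches = [
    ["RT hello world", "RT spark"],
    ["hello again"],
]

state = {}
for batch in batches:
    pairs = [(word, 1) for line in batch for word in line.split(" ")]
    state = update_state_by_key(state, pairs, process_tweet)
    # Sort by count (descending), breaking ties by word, and show the top three
    top = sorted(state.items(), key=lambda kv: (-kv[1], kv[0]))
    print(top[:3])
```

For a word that first appears in a batch, `last_sum` is `None`, which is why the update function falls back to `0`.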

In [9]:
def output(time, rdd, top=10):
    '''
    Print the words with the most occurrences
    '''
    read = rdd.take(top)

    print("Time: %s:" % time)

    for record in read:
        print(record)

    print("")

### 2.4 Start and Stop the Streaming Application

In [None]:
ssc.checkpoint("./checkpoint-tweet")

# Get input DStream
lines = ssc.queueStream(rddQueue)

# Split the raw tweet text on spaces to get words, then count word occurrences and sort in descending order
counts = lines.flatMap(lambda line: line.split(" "))\
              .map(lambda word: (word, 1))\
              .updateStateByKey(process_tweet)\
              .transform(lambda rdd: rdd.sortBy(lambda x: x[1],False))

# Print the result for each time interval
counts.foreachRDD(output)

ssc.start()
print("Spark Streaming started")

# To save time, the streaming application is stopped manually after 30 seconds
time.sleep(30)

ssc.stop(stopSparkContext=False, stopGraceFully=True)
print("Spark Streaming finished")

Sample output:

Spark Streaming started<br/>
Time: 2016-11-03 18:19:05:<br/>
('RT', 8)<br/>
("j'ai", 2)<br/>
('de', 2)<br/>
('en', 2)<br/>
('que', 2)<br/>
('\xd8\x8c', 2)<br/>
('#MourinhoOut', 1)<br/>
(')', 1)<br/>
('somos', 1)<br/>
('m\xc3\xa1s', 1)<br/>
('#Gala9GH17', 1)<br/>
<br/>
Time: 2016-11-03 18:19:06:<br/>
('RT', 13)<br/>
('e', 4)<br/>
('\xd8\xa7\xd9\x84\xd9\x84\xd9\x87', 3)<br/>
('\xd9\x88\xd8\xa7\xd9\x84\xd9\x84\xd9\x87', 2)<br/>
('\xce\xba\xce\xb1\xce\xb9', 2)<br/>
('en', 2)<br/>
('\xd9\x85\xd9\x86', 2)<br/>
('de', 2)<br/>
("j'ai", 2)<br/>
('que', 2)<br/>
('\xd8\xb9\xd9\x86', 2)<br/>
<br/>
Time: 2016-11-03 18:19:07:<br/>
('RT', 18)<br/>
('de', 6)<br/>
('e', 5)<br/>
('to', 4)<br/>
('for', 4)<br/>
('no', 4)<br/>
('\xd8\xa7\xd9\x84\xd9\x84\xd9\x87', 3)<br/>
('it', 3)<br/>
('', 2)<br/>
('dos', 2)<br/>
('win', 2)<br/>
...

## 3. Further Resources

### 3.1 Spark DataFrames and SQL

A Spark DataFrame is a distributed collection of data organized into named columns, which allows more powerful operations than plain lambda functions on RDDs. It is conceptually equivalent to a table in a relational database or a data frame in Python, but with more optimizations under the hood. Spark SQL provides the ability to execute SQL queries on Spark data.

Reference Link: <http://spark.apache.org/docs/latest/sql-programming-guide.html>

### 3.2 Spark Machine Learning Library (MLlib)

MLlib is a rich library of common machine learning algorithms and utilities that makes machine learning applications easier to build and scale. The functionality provided by MLlib can easily be applied to RDDs.

Reference Link: <http://spark.apache.org/docs/latest/ml-guide.html>

## 4. References

1. Apache Spark Wikipedia: (https://en.wikipedia.org/wiki/Apache_Spark)
2. Spark Programming Guides: (http://spark.apache.org/docs/latest/programming-guide.html)
3. Why Spark is faster: (https://www.quora.com/What-makes-Spark-faster-than-MapReduce)