
# Introduction
This tutorial introduces some basic methods for using Spark for data analysis.
It assumes you already know MapReduce. If not, please refer to this [link](https://en.wikipedia.org/wiki/MapReduce). You only need to read the "Logical view" part, which covers the basic computation model and the map and reduce concepts.

Spark is an open source cluster computing framework developed at the UC Berkeley AMPLab. It uses in-memory primitives that can make it up to 100x faster than traditional, disk-based MapReduce for certain applications.

To use Spark, you need to write a driver program to connect to a Spark cluster of workers.



Let's begin!

## Set up for Spark and Notebook
#### Install Spark on Mac OS
```
    > brew install scala
    > brew install hadoop
    > brew install apache-spark
    > pyspark 
```
Type `sc` in the pyspark shell. You should see output like this: `<pyspark.context.SparkContext at 0x11222ab90>`
Great! Spark is working! Now stop the pyspark shell with Ctrl+D.
    
#### Set up environment variables
```
    > export PYSPARK_DRIVER_PYTHON="jupyter" 
    > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" 
    > pyspark 
``` 
The PYSPARK_DRIVER_PYTHON parameter and the PYSPARK_DRIVER_PYTHON_OPTS parameter are used to launch the PySpark shell in Jupyter Notebook. 

**Now run pyspark again. A new Jupyter notebook should open in your browser, where you can run the following test code. Close the old notebook; all of the code below should run in the new one.**


In [142]:
# should have output like this:
# <pyspark.context.SparkContext at 0x11222ab90>
sc

<pyspark.context.SparkContext at 0x10d9c9b90>

## Basic Concepts 
#### SparkContext, RDD

- In the PySpark shell, a special interpreter-aware **SparkContext** is already created for you, in the variable called sc. Making your own SparkContext will not work.

- At the heart of the Spark framework lies a data abstraction called **Resilient Distributed Datasets** (**RDD**s), which allows a distributed dataset to remain in memory on the nodes of a cluster during the various stages of a computation. RDDs also store lineage information about the data: a record of all the operations that were performed to bring an RDD to its present state. This way, if a node in a Spark cluster fails, the in-memory data that was lost can be reloaded from the source (often the distributed file system), and the operations recorded in the lineage can be re-applied to bring the data back to its present state. Thus, data can remain in memory through multiple stages of transformation without spilling to disk, and applications can run many times faster than with traditional frameworks that rely on disk accesses between stages.
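
The lineage idea can be illustrated without a cluster. The following is a toy, plain-Python sketch, not Spark's implementation: `ToyRDD` and its methods are hypothetical names invented here. It records the source and the list of operations, so that a lost in-memory result can be rebuilt by replaying them.

```python
# Toy illustration of lineage; ToyRDD is a hypothetical class, not part of PySpark.
class ToyRDD(object):
    def __init__(self, load_source, lineage=()):
        self.load_source = load_source   # function that re-reads the source data
        self.lineage = tuple(lineage)    # recorded operations, in order
        self.cached = None               # in-memory result, which may be lost

    def map(self, func):
        # Record the operation instead of running it right away.
        return ToyRDD(self.load_source, self.lineage + (func,))

    def compute(self):
        # If nothing is cached, rebuild from the source by replaying the lineage.
        if self.cached is None:
            data = list(self.load_source())
            for func in self.lineage:
                data = [func(x) for x in data]
            self.cached = data
        return self.cached


lengths = ToyRDD(lambda: ["one", "two", "three"]).map(len).map(lambda n: n * 10)
first_result = lengths.compute()   # computed from the source
lengths.cached = None              # simulate losing the in-memory data
recovered = lengths.compute()      # rebuilt from source + recorded operations
```

After the simulated loss, `recovered` holds the same values as `first_result`, because the two recorded `map` steps are applied to the reloaded source again.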

#### Creating RDD

- There are two ways to create RDDs: **parallelizing** an existing collection in your driver program, or **loading from file** in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Once created, the RDD can be operated on in parallel.

#### Look at your RDD & Output
- To print all elements on the driver, you can use the `collect()` method to first bring the RDD to the driver. This can cause the driver to run out of memory when the data is big, though, because `collect()` fetches the entire RDD to a single machine. If you only need to print a few elements of the RDD, a safer approach is to use `take()`. You can also use `count()` to get the total number of elements.

- There are different methods to save your output as files to the Hadoop file system or the local file system. The following is an example of saving to the local file system using `saveAsTextFile()`.


In [143]:
# create rdd by parallelizing an existing collection
rdd1 = sc.parallelize([1,2,3,4,5])
rdd1

ParallelCollectionRDD[762] at parallelize at PythonRDD.scala:475

In [158]:
# create rdd from external file
rdd2 = sc.textFile("data.txt")
rdd2

data.txt MapPartitionsRDD[828] at textFile at null:-1

In [145]:
# count elements in rdd
print rdd1.count()
# take the first 4 elements (not a random sample)
print rdd1.take(4)
# take the first element
print rdd1.first()
# collect the whole rdd into a Python list on the driver
rdd1.collect()

5
[1, 2, 3, 4]
1


[1, 2, 3, 4, 5]

In [146]:
# print rdd2
print rdd2.take(2)
print rdd2.first()
# a list of str
rdd2.collect()

[u'one', u'two']
one


[u'one', u'two', u'three']

In [147]:
# output to textfile
rdd2.saveAsTextFile('sample_output1')
# the output is saved as a directory named sample_output1, with one part file per partition
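
Because the output is a directory rather than a single file, reading it back locally means concatenating its `part-*` files (Hadoop-style output also contains a `_SUCCESS` marker file). The sketch below is one hedged way to do that in plain Python. `read_parts` is a hypothetical helper, and the demo directory is built by hand to mimic the layout rather than being produced by Spark.

```python
import glob
import os
import tempfile


def read_parts(output_dir):
    """Return all lines from the part-* files of a saveAsTextFile-style directory."""
    lines = []
    # Sort so that part-00000, part-00001, ... are read in partition order.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    return lines


# Hand-made directory that mimics the layout (an assumption for this demo).
demo_dir = tempfile.mkdtemp()
for name, content in [("part-00000", "one\ntwo\n"),
                      ("part-00001", "three\n"),
                      ("_SUCCESS", "")]:
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(content)

restored = read_parts(demo_dir)
```

In a real run you would pass the directory given to `saveAsTextFile()`, for example `read_parts('sample_output1')`.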

## RDD Operations
RDDs support two types of operations: 
* **Transformation**, which creates a new RDD from an existing one

* **Action**, which returns a value to the driver program after running a computation on the dataset.

For example, `map` is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, `reduce` is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel `reduceByKey` that returns a distributed dataset).
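
As a rough single-machine analogy (plain Python, no Spark involved), the built-in `map` and `functools.reduce` show the same split: one step produces a new dataset, the other collapses a dataset into a single value.

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

# Like rdd.map(...): produces a new dataset with one output per input element.
squared = list(map(lambda x: x * x, data))

# Like rdd.reduce(...): aggregates the dataset into one value.
total = reduce(lambda a, b: a + b, squared)
```

The difference in Spark is that the elements are spread over the cluster, so only the final value of the action travels back to the driver.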

#### Common methods of Transformation and Action
- Transformation

``` map, flatMap, filter, sample, union, intersection, distinct, groupByKey, reduceByKey, aggregateByKey, sortByKey, join```

- Action

```reduce, collect, count, first, take, countByKey, saveAsTextFile```

Most of them are pretty straightforward, as suggested by their names, and using the right method can improve the performance of your computation a lot. We show how to use some of them next.


#### filter

In [160]:
rdd = sc.parallelize([1,2,3])
print 'rdd = ', rdd.collect()
filtered = rdd.filter(lambda x: x > 1)
print 'filter: ', filtered.collect()

rdd =  [1, 2, 3]
filter:  [2, 3]


#### map vs. flatMap

In [172]:
# compare map, flatMap
rdd1 = rdd.flatMap(lambda x:[x / 2, x])
print 'flatMap: ', rdd1.collect()
rdd2= rdd.map(lambda x:[x / 2, x])
print 'map: ',rdd2.collect()

flatMap:  [0, 1, 1, 2, 1, 3]
map:  [[0, 1], [1, 2], [1, 3]]
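
Note that the output above relies on Python 2 integer division: `x / 2` gives `0` for `x = 1`. On Python 3 the same expression gives `0.5`, so `//` is needed to get the same numbers. The following plain-Python sketch (no Spark required, names chosen here for illustration) reproduces both results and shows that `flatMap` is `map` followed by flattening one level.

```python
from itertools import chain


def half_and_self(x):
    # // is integer division on both Python 2 and Python 3
    return [x // 2, x]


data = [1, 2, 3]

# Like rdd.map(half_and_self): one list per input element.
mapped = [half_and_self(x) for x in data]

# Like rdd.flatMap(half_and_self): the per-element lists are flattened.
flat_mapped = list(chain.from_iterable(half_and_self(x) for x in data))
```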


#### reduce vs. reduceByKey

In [174]:
# compare reduceByKey, reduce
rdd = sc.parallelize([1,2,3])
rdd2= rdd.map(lambda x:[x / 2, x])
print 'rdd = ', rdd.collect()
# total is a number instead of rdd
total = rdd.reduce(lambda a,b: a + b)
print 'reduce: ', total
print 'rdd2 = ', rdd2.collect()
print 'reduceByKey:', rdd2.reduceByKey(lambda a,b: a+b).collect()


rdd =  [1, 2, 3]
reduce:  6
rdd2 =  [[0, 1], [1, 2], [1, 3]]
reduceByKey: [(0, 1), (1, 5)]
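
To make the semantics of `reduceByKey` concrete, here is a minimal single-machine sketch in plain Python. `reduce_by_key` is a hypothetical helper written for this illustration, not a PySpark function.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func, similar in spirit to reduceByKey."""
    merged = {}
    for key, value in pairs:
        # The first value for a key is kept as is; later ones are merged in.
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())


pairs = [(0, 1), (1, 2), (1, 3)]
summed = reduce_by_key(pairs, lambda a, b: a + b)
```

In Spark the merging happens on each partition first and then across partitions, which is why the function passed to `reduceByKey` should be associative and commutative.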


#### groupByKey

In [175]:
rdd2= rdd.map(lambda x:[x / 2, x])
print 'rdd2 = ', rdd2.collect()
print 'groupByKey:', rdd2.groupByKey().collect()
rdd3 = rdd2.groupByKey()
print 'convert to list: ', rdd3.map(lambda x: {x[0]: list(x[1])}).collect()

rdd2 =  [[0, 1], [1, 2], [1, 3]]
groupByKey: [(0, <pyspark.resultiterable.ResultIterable object at 0x110e74710>), (1, <pyspark.resultiterable.ResultIterable object at 0x110e74e90>)]
convert to list:  [{0: [1]}, {1: [2, 3]}]
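
For comparison, a plain-Python sketch of what `groupByKey` produces (`group_by_key` is a hypothetical helper for this illustration): every value is kept and collected under its key, and any aggregation happens afterwards.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values of each key into a list, similar in spirit to groupByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


pairs = [(0, 1), (1, 2), (1, 3)]
grouped = group_by_key(pairs)

# Aggregating the groups gives the same sums as the reduceByKey example.
sums = {key: sum(values) for key, values in grouped.items()}
```

When only an aggregate per key is needed, `reduceByKey` is usually the better choice in Spark, since it can combine values within each partition before data is moved between nodes.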


All **transformations** in Spark are **`lazy`**, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the `persist` (or `cache`) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
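
The interplay of laziness, recomputation, and `persist` can be mimicked with a small counter. This is a toy model in plain Python under simplifying assumptions: `LazyPipeline` is a hypothetical class, not a Spark API, and it handles only a single map step.

```python
class LazyPipeline(object):
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self.persisted = False
        self.cache = None
        self.evaluations = 0   # how many times the map step actually ran

    def persist(self):
        # Only marks the dataset; nothing is computed yet.
        self.persisted = True
        return self

    def collect(self):
        # An "action": this is the point where work really happens.
        if self.cache is not None:
            return self.cache
        self.evaluations += 1
        result = [self.func(x) for x in self.source]
        if self.persisted:
            self.cache = result
        return result


lengths = LazyPipeline(["one", "two", "three"], len)
runs_before_action = lengths.evaluations   # defining the pipeline ran nothing

lengths.collect()
lengths.collect()
runs_without_persist = lengths.evaluations  # recomputed once per action

lengths.persist()
lengths.collect()
lengths.collect()
runs_with_persist = lengths.evaluations     # computed once more, then served from cache
```

The counters end up as 0, 2, and 3: no work before the first action, one evaluation per action without `persist`, and a single further evaluation once the dataset is marked as persisted.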

#### lazy computing of transformation

In [176]:
rdd2 = sc.textFile("data.txt")
print 'rdd2 = ', rdd2.collect()
# get length of each line in rdd2
lineLengths_rdd = rdd2.map(lambda s: len(s))
# Important!! It is not immediately computed, due to laziness
print lineLengths_rdd
# see what is in the rdd, collect to a list
lineLengths_rdd.collect()

rdd2 =  [u'one', u'two', u'three']
PythonRDD[939] at RDD at PythonRDD.scala:48


[3, 3, 5]

#### persist

In [177]:
rdd2 = sc.textFile("data.txt")
lineLengths_rdd = rdd2.map(lambda s: len(s))
# mark the rdd to be kept in memory; persist() itself computes nothing,
# the data is cached the first time an action computes it
lineLengths_rdd.persist()
# sum up all lengths in the rdd (this first action computes and caches it)
total1 = lineLengths_rdd.reduce(lambda a, b: a + b)
# the second action reuses the cached data; without persist(),
# total2 would be calculated starting from rdd2 again
total2 = lineLengths_rdd.reduce(lambda a, b: a + b)
total1, total2

(11, 11)

#### function call
As you may have noticed, we have been using lambda expressions for the simple functions passed to map and reduce. We can also define a longer function and pass it to Spark.

The following example defines a function that counts each character in a given string.


In [152]:
from collections import defaultdict
def my_func(line):
    ch_count = defaultdict(int)
    for ch in line:
        if ch != ' ':
            ch_count[ch] += 1
    return list(ch_count.items())
my_func('three')

[('h', 1), ('r', 1), ('e', 2), ('t', 1)]

Call the defined function in `flatMap()`

In [153]:
# flatMap
rdd2 = sc.textFile("data.txt")
print 'rdd2 = ', rdd2
rdd3 = rdd2.flatMap(my_func)
print 'rdd3 = ', rdd3.collect()
# reduceByKey
rdd4 = rdd3.reduceByKey(lambda a,b:a + b)
# rdd4 is the count for each character in all three words of rdd2
print 'rdd4 = ', rdd4.collect()

rdd2 =  data.txt MapPartitionsRDD[798] at textFile at null:-1
rdd3 =  [(u'e', 1), (u'o', 1), (u'n', 1), (u'o', 1), (u't', 1), (u'w', 1), (u'h', 1), (u'r', 1), (u'e', 2), (u't', 1)]
rdd4 =  [(u'e', 3), (u'o', 2), (u'w', 1), (u'h', 1), (u'r', 1), (u't', 2), (u'n', 1)]
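
A quick way to sanity-check the `rdd4` result on a small input is to count the characters locally with `collections.Counter`. This plain-Python sketch assumes `data.txt` contains exactly the three lines shown in the earlier outputs.

```python
from collections import Counter

# Contents of data.txt as shown in the outputs above (assumed to be the full file).
lines = ["one", "two", "three"]

# Count every non-space character across all lines.
char_counts = Counter(ch for line in lines for ch in line if ch != " ")
```

The counts agree with `rdd4`: `e` appears 3 times, `o` and `t` twice, and the remaining characters once.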


Call the defined function in `filter()`

In [183]:
lines = ['this is a good log','error: something wrong','exception: unknown']
error_words = ['exception', 'error']
def has_error(line):
    for w in error_words:
        if w in line:
            return True
    return False
err = sc.parallelize(lines)\
    .filter(has_error)
print err
print err.collect()

PythonRDD[954] at RDD at PythonRDD.scala:48
['error: something wrong', 'exception: unknown']
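
The predicate passed to `filter()` is an ordinary Python function, so it can be written and tested locally before it is handed to Spark. As a sketch, the same check can be expressed more compactly with `any()`. The list comprehension below plays the role of `filter()` on a plain list.

```python
error_words = ['exception', 'error']


def has_error(line):
    # True as soon as one of the error words occurs in the line.
    return any(word in line for word in error_words)


lines = ['this is a good log', 'error: something wrong', 'exception: unknown']
matches = [line for line in lines if has_error(line)]
```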


## Examples of data analysis using Spark 
We will be analyzing the tweet data from homework 2. To avoid displaying too much data, the code runs on a small subset of the original data. The data fields are as follows:
```screen_name,created_at,retweet_count,favorite_count,text```

#### Example 1
Let's first look at the familiar word count: we count the words in all tweets.
It takes only one chained expression!

    

In [155]:
text = sc.textFile("smalltweet.csv")

counts = text.flatMap(lambda line: line.split(",")[-1].split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
counts.take(4)

[(u'and', 3), (u'Matt', 1), (u'#MAGA', 1), (u'questions', 1)]
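
One caveat: `line.split(",")[-1]` keeps only the part after the last comma, so a tweet whose text itself contains commas is truncated. If the file quotes such fields (an assumption about how `smalltweet.csv` was written), the standard `csv` module can parse each line instead. The sketch below uses a made-up row and a hypothetical helper `tweet_text`. It is written for Python 3, since the Python 2 `csv` module does not support Unicode input.

```python
import csv


def tweet_text(line):
    """Return the last CSV field of a line, honouring quoted commas."""
    fields = next(csv.reader([line]))
    return fields[-1]


# Hypothetical row, made up for illustration; it follows the five-field layout above.
row = 'some_user,2016-09-08,10,20,"Thank you, Ohio!"'

naive_words = row.split(",")[-1].split()   # loses everything before the last comma
csv_words = tweet_text(row).split()        # keeps the whole text field
```

In the Spark pipeline, this would replace the lambda with something like `text.flatMap(lambda line: tweet_text(line).split())`. Fields that span multiple lines would still need separate handling.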

#### Example 2  

Find the most frequent words for each Twitter user: return the top 5 words and their counts for each user, excluding stopwords.

In [188]:
# get the English stopwords, as we did in hw3
from collections import Counter
import nltk
STOPWORDS=nltk.corpus.stopwords.words('english')
THRESHOLD = 1

# return a user's 5 most frequent words together with their counts
def top5((name, list_wordcount)):
    dic = {word:count for word, count in list_wordcount}
    top5 = Counter(dic).most_common(5)
    return name, top5

text = sc.textFile("smalltweet.csv")

user_tweet = text.map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)>4 and line[0] != 'screen_name') \
    .map(lambda line: (line[0],line[-1].split())) 
print 'user_tweet:', user_tweet.take(1)

user_wordcount = user_tweet.flatMap(lambda (user, tweet) : [(user, word) for word in tweet]) \
        .filter(lambda(user, word): word not in STOPWORDS) \
        .map(lambda key : (key, 1)) \
        .reduceByKey(lambda a,b : a + b) \
        .filter(lambda (key,count): count > THRESHOLD) \
        .map(lambda ((name, word), count): (name,(word,count)))
print 'user_wordcount:', user_wordcount.take(1)

user_top5 = user_wordcount.groupByKey() \
            .map(top5)
user_top5.collect()

user_tweet: [(u'realDonaldTrump', [u'Final', u'poll', u'results', u'from', u'NBC', u'on', u'last', u'nights', u'Commander-in-Chief', u'Forum.', u'Thank', u'you!', u'#ImWithYou', u'#MAGA', u'https://t.co/C5ipaxUN7B'])]
user_wordcount: [(u'realDonaldTrump', (u'poll', 3))]


[(u'realDonaldTrump',
  [(u'last', 5),
   (u'nights', 3),
   (u'Mexico', 3),
   (u'results', 3),
   (u'Hillary', 3)])]
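
The ranking step inside `top5` can be tried on its own. The sketch below isolates it in plain Python. `top_n` is a hypothetical helper, and the sample counts are invented for illustration rather than taken from the dataset.

```python
from collections import Counter


def top_n(word_counts, n=5):
    """Return the n (word, count) pairs with the highest counts."""
    return Counter(dict(word_counts)).most_common(n)


# Invented (word, count) pairs for one user.
sample = [("last", 5), ("poll", 2), ("Mexico", 4),
          ("the", 1), ("results", 3), ("Hillary", 6)]
top3 = top_n(sample, n=3)
```

When several words share the same count, as with the count of 3 in the output above, which of the tied words make it into the top 5 depends on how ties are ordered, so the result should be read with that in mind.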

#### Example 3

Let's do some data analysis on a graph dataset.
The dataset is the same one we used in hw2: edges.csv
The data fields are as follows:
```follower, followee```

* How many distinct edges?

In [192]:
edges = sc.textFile("edges.csv") 
edge_count = edges.distinct() \
            .count()
print 'distinct edge count = ', edge_count

distinct edge count =  16180


* How many distinct users?

In [194]:
edges = sc.textFile("edges.csv") 
user_count = edges.flatMap(lambda line: line.split(',')) \
            .distinct() \
            .count()
print 'distinct user count = ', user_count

distinct user count =  12405
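
On a single machine, both counts correspond to building sets. The following plain-Python sketch uses a few invented `follower,followee` lines to show the logic. The user names are hypothetical and not taken from `edges.csv`.

```python
# Invented 'follower,followee' lines; the last one duplicates the first.
edge_lines = ["a,b", "a,c", "b,c", "a,b"]

# Like edges.distinct(): identical lines collapse into one edge.
distinct_edges = set(edge_lines)

# Like edges.flatMap(lambda line: line.split(',')).distinct():
# every endpoint of every edge, counted once.
distinct_users = set(user for line in edge_lines for user in line.split(","))
```

If `edges.csv` starts with a header line, that line would be counted as an edge and its column names as users, so it may need to be filtered out first, as was done for `screen_name` in Example 2.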
