![](images/spark.png)
<img style="float: right" src="images/surfsara.png">
## **Introduction to Apache Spark**

Below are a number of exercises in PySpark. Press Shift-Enter to execute the code in a cell. You can use code completion by pressing Tab.

During the exercises you may want to refer to [The PySpark documentation](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html#pyspark.RDD) for more information on possible transformations and actions.

Let us first create a simple RDD, based on a list of words. We will be using two partitions here for the RDD. We will use a SparkContext sc that has already been created for us.

In [None]:
import numpy as np

wordsList = ['Dog', 'Cat', 'Rabbit', 'Hare', 'Deer', 'Gull', 'Woodpecker', 'Mole']
wordsRDD = sc.parallelize(wordsList, 2)
# Print out the type of wordsRDD
print type(wordsRDD)
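
To picture what "two partitions" means, here is a minimal plain-Python sketch (no Spark involved) that splits the word list into contiguous chunks. The helper `split_into_partitions` is hypothetical and only illustrates the idea; the way Spark actually slices the data may differ.

```python
# Plain-Python illustration of partitioning; not part of PySpark.
def split_into_partitions(items, num_partitions):
    """Split `items` into `num_partitions` contiguous chunks of near-equal size."""
    n = len(items)
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

words = ['Dog', 'Cat', 'Rabbit', 'Hare', 'Deer', 'Gull', 'Woodpecker', 'Mole']
partitions = split_into_partitions(words, 2)

# Each inner list stands for the part of the data held by one partition
print(partitions)
```

In the notebook itself, `wordsRDD.glom().collect()` returns the elements grouped per partition, so you can inspect the real layout.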

## **Map transformation**

We now want to change all words in the wordsRDD to their plural form. We will do this using a [map](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=map#pyspark.RDD.map) transformation.
Remember that the map function will apply the function to each element of the RDD. 

First, we will write a simple function that takes a single word as argument and returns the word with an 's' added to it. In the next step we will use this function in a map transformation of the wordsRDD.

Take a look at the function below and fill in the code at the tag `<FILL IN>`.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
def makePlural(word):
    """Adds an 's' to `word`.

    Note:
        This is a simple function that only adds an 's'.  

    Args:
        word (str): A string.

    Returns:
        str: A string with 's' added to it.
    """
    return <FILL IN>

print makePlural('cat')

Next, we will use the makePlural function as input for the map transformation on wordsRDD.
The action [collect()](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=collect#pyspark.RDD.collect) transfers the content of the RDD to the driver. Note that a large RDD may be spread over many machines. In that case calling collect may not be a good idea, since all of its data must fit in the memory of the driver.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
pluralRDD = wordsRDD.map<FILL IN>
print pluralRDD.collect()

## **Using lambda functions**

We can achieve the same functionality by using lambda functions. In this case we define makePlural as a lambda function. 

Hint: The map function needs a lambda function as argument. This function needs one argument, let's call that x. The body of the function adds an 's' to the end of x.

In [None]:
# A lambda function for adding s at the end of a string
lambdaPluralRDD = wordsRDD.map(lambda x : x + 's')
print lambdaPluralRDD.collect()
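
A lambda is just an unnamed function, so a named function and a lambda with the same body can be used interchangeably. The sketch below shows this with Python's built-in `map` on a local list instead of an RDD. The function `shout` and the sample list are made up for illustration.

```python
# A named function and an equivalent lambda; both take one argument
def shout(word):
    return word.upper() + '!'

shout_lambda = lambda word: word.upper() + '!'

animals = ['dog', 'cat']

# Python's built-in map applies the function to every element of the list
with_def = list(map(shout, animals))
with_lambda = list(map(shout_lambda, animals))

print(with_def)
print(with_lambda)
```

Both calls produce the same list. A lambda is convenient for short one-off functions, while a named function can carry a docstring and be tested separately.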

Let's do another map transformation. For each word in wordsRDD determine its length. The Python function len(s) returns the length of a string s.

You can do this with a lambda function, but there is another way... 

In [None]:
# TODO: Replace <FILL IN> with appropriate code
wordLengths = (<FILL IN>
                 .collect())
print wordLengths

Test your solution by running the following cell.

In [None]:
# TEST Length of each word 
from test_helper import Test
   
Test.assertEquals(wordLengths, [3, 3, 6, 4, 4, 4, 10, 4],
                  'incorrect values')

## **Key Value Pairs**
In order to count words in parallel we are going to use an RDD which consists of simple key value pairs. We will call this RDD wordPairs and it will be the result of a transformation of wordsRDD. For every word in wordsRDD we want to have a (word, 1) tuple. Please fill in the code in the next cell at the place indicated and run the test.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
wordPairs = <FILL IN>
print wordPairs.collect()

In [None]:
# TEST Pair RDDs (1f)
Test.assertEquals(wordPairs.collect(),
                  [('Dog', 1), ('Cat', 1), ('Rabbit', 1), ('Hare', 1), ('Deer', 1), ('Gull', 1), ('Woodpecker', 1), ('Mole', 1)],
                  'incorrect value for wordPairs')

## **reduceByKey**

Next, we are going to count all words by using [reduceByKey](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=reducebykey#pyspark.RDD.reduceByKey).

reduceByKey expects the RDD to consist of key value pairs and it will perform a reduce operation per key. 
It needs a two-argument function as input that works on the values only. Remember that a reduce function takes two arguments and reduces all elements to a single value; reduceByKey does this separately for the values of each key.
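
The difference between a plain reduce and a reduce per key can be sketched in plain Python on a local list. This is only an illustration of the idea: `reduce_by_key` is a hypothetical helper, not the way Spark implements it, and the sample pairs are made up.

```python
from functools import reduce
from operator import add

# A plain reduce folds all elements into one value: ((1 + 2) + 3) + 4
total = reduce(add, [1, 2, 3, 4])

def reduce_by_key(pairs, func):
    """Apply the two-argument `func` to the values of each key separately."""
    result = {}
    for key, value in pairs:
        if key in result:
            # Combine the running result for this key with the new value
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result

pairs = [('a', 1), ('b', 5), ('a', 2), ('a', 3)]

# Here we keep the maximum value per key; note that func never sees the keys
per_key = reduce_by_key(pairs, max)

print(total)
print(sorted(per_key.items()))
```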


In [None]:
# TODO: Replace <FILL IN> with appropriate code
# Note that reduceByKey takes in a function that accepts two values and returns a single value
# The function that is input to reduceByKey only works on the values. Spark will execute this function per key

#from operator import add

wordCounts = wordPairs.reduceByKey(lambda x,y : <FILL IN>)
print wordCounts.collect()

In [None]:
# TEST Counting using reduceByKey (2c)
Test.assertEquals(sorted(wordCounts.collect()),
                  [('Cat', 1), ('Deer', 1), ('Dog', 1), ('Gull', 1), ('Hare', 1), ('Mole', 1), ('Rabbit', 1), ('Woodpecker', 1)],
                  'incorrect value for wordCounts')

## **groupByKey**

Another transformation on RDDs is [groupByKey](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=reducebykey#pyspark.RDD.groupByKey).

groupByKey works on key value pairs, tuples in Python. It groups all values with the same key together in a list. 
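
As a rough plain-Python picture of what grouping by key produces, consider the sketch below. The helper `group_by_key` and the sample pairs are made up for illustration; in Spark the grouped values come back as an iterable rather than a plain list.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

pairs = [('fruit', 'apple'), ('veg', 'kale'), ('fruit', 'pear')]
grouped = group_by_key(pairs)

# One (key, list of values) record per distinct key
print(sorted(grouped.items()))
```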

In [None]:
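# wordPairsRDD has not been defined yet at this point, so we first build a small pair RDD
# (the same three-word example is used again further below)
wordPairsRDD = sc.parallelize(["one", "two", "two", "three", "three", "three"]).map(lambda word: (word, 1))
wc = wordPairsRDD.groupByKey()

# The take action returns the first n records of the RDD, in contrast with collect(), which returns the
# complete contents of the RDD.
# Here we take 3 records to print out, which in this case happens to be the complete RDD...

print wc.take(3)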

When printing out these records we see tuples with readable keys 'one', 'two', 'three', followed by ResultIterable objects, which Python does not know how to print. These objects are iterables containing the values. We can print their contents by converting them to proper Python lists. To convert a ResultIterable y to a list we can simply use list(y).

Let's think about how to do this. The RDD is a list of tuples (x,y), where y is the ResultIterable which we want to convert.
A (lambda) function to convert one record would then take as input (x,y) and return (x, list(y)).
We then use map to do this for the entire RDD.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# Provide a lambda function to convert each ResultIterable object to a list

wc1 = <FILL IN>
print wc1.take(3)

As we have seen, the result of groupByKey is an RDD of which the records are tuples. Each tuple is a key and a list with values. We can access the elements of the list by simply treating them as Python lists. 

Let's print the first element of all these lists... which are all 1s of course

In [None]:
print wc1.map(lambda (x,y): y[0]).collect()

## **groupByKey vs reduceByKey**
Here we will demonstrate the difference between [groupByKey](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=groupbykey#pyspark.RDD.groupByKey) (or grouping in general) and reduceByKey. Both can be used in similar ways and will, when applied correctly, lead to the same answers. However, the way they are computed by Spark is quite different. 
First, take a look at reduceByKey once more. 

In [None]:
# This example will perform a word count by using reduceByKey

words = ["one", "two", "two", "three", "three", "three"]
# We create the RDD, immediately followed by a map - in a single statement
# we could have done this in two steps as we have done above
wordPairsRDD = sc.parallelize(words).map(lambda word: (word, 1))

wordCountsWithReduce = (wordPairsRDD.reduceByKey(lambda x,y: x+y) 
                        .collect())
print wordCountsWithReduce

In [None]:
Test.assertEquals(sorted(wordCountsWithReduce), [('one', 1), ('three', 3), ('two', 2)], 'Error in word count with Reduce')

#### **reduceByKey**
The picture below shows how reduceByKey is computed on different workers. The reduceByKey function in the figure is equivalent to the one in the previous cell.

![reduceByKey](https://dl.dropboxusercontent.com/u/7526640/reduce.png)
(Picture by DataBricks)
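
The idea in the figure, reducing locally on each worker before anything is sent over the network, can be sketched in plain Python. The partition layout below is an assumption made for illustration, and `local_reduce` is a hypothetical helper, not a Spark function.

```python
def local_reduce(partition, func):
    """Reduce the values per key within a single list of (key, value) pairs."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

# Assumed layout: the six (word, 1) pairs spread over two workers
partitions = [
    [('one', 1), ('two', 1), ('three', 1)],
    [('two', 1), ('three', 1), ('three', 1)],
]

# Step 1: each worker reduces its own partition
partials = [local_reduce(p, lambda x, y: x + y) for p in partitions]

# Step 2: only the partial results are shuffled and then merged per key
shuffled = [pair for partial in partials for pair in sorted(partial.items())]
final = local_reduce(shuffled, lambda x, y: x + y)

# Number of records that would cross the network without and with local reduction
records_without_combine = sum(len(p) for p in partitions)
records_with_combine = len(shuffled)

print(sorted(final.items()))
print(records_without_combine)
print(records_with_combine)
```

With only six records the saving is small, but it grows with the number of repeated keys per partition.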

Now let's see how this is different from groupByKey.

In [None]:
wc = wordPairsRDD.groupByKey()
print wc.take(5)

In [None]:
# TODO: Replace <FILL IN> with appropriate code
# Add a transformation after groupByKey that uses a lambda function to sum the values of each key
wordCountsWithGroup = (wordPairsRDD
                       .groupByKey()
                       <FILL IN>
                       .collect())
        
print wordCountsWithGroup 

In [None]:
Test.assertEquals(sorted(wordCountsWithGroup), [('one', 1), ('three', 3), ('two', 2)], 'Error in word count with Group')

## **groupByKey**

The figure below shows an example of [groupByKey](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=groupby#pyspark.RDD.groupByKey). Note that all key-value pairs are sent across the network between workers. This leads to a lot of network traffic, which hampers performance. 

Also, similar to MapReduce, Spark determines to which machine a pair should be sent by calling a partitioning function on the key of the pair. Note that if there are many keys, each with very few values, this approach scales badly.

In general groupByKey should be avoided, particularly when using large data sets. You can also look at [*foldByKey*](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=foldbykey#pyspark.RDD.foldByKey) and [*combineByKey*](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=combinebykey#pyspark.RDD.combineByKey) for alternatives.

![groupByKey](images/groupby.png)
(Picture by DataBricks)
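
The partitioning function mentioned above can be pictured as "hash the key, then take the remainder modulo the number of partitions". The sketch below is plain Python and uses `zlib.crc32` purely as a stand-in hash, which is an assumption for illustration; Spark uses its own hash function. The helper `partition_for` is hypothetical.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a partition index; crc32 is a stand-in for Spark's hash."""
    return zlib.crc32(key.encode('utf-8')) % num_partitions

pairs = [('one', 1), ('two', 1), ('two', 1), ('three', 1)]
num_partitions = 2

# Every pair is placed in the bucket chosen by its key alone
buckets = [[] for _ in range(num_partitions)]
for key, value in pairs:
    buckets[partition_for(key, num_partitions)].append((key, value))

print(buckets)
```

Because the target only depends on the key, all pairs with the same key end up in the same bucket, which is what makes per-key grouping possible after the shuffle.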

## **Optional: Tweets Analysis**

As an example of how to analyse a file, we will look at a file with Dutch tweets. We will read in the file by making use of [sc.textFile](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=textfile#pyspark.SparkContext.textFile).

We will be using 4 partitions here. Take a look at the line where we filter and then map the data to utf-8 encoding. Note the way transformations are chained together.

Print out the first tweet in the RDD by making use of the [take](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=take#pyspark.RDD.take) action. It needs as argument the number of elements in the RDD that it will send to the driver.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
# Use take to print out the first tweet

tweetRDD = (sc.textFile('data/tweets.txt', 4)
            .filter(lambda x : len(x) > 0)
            .map(lambda x: x.encode('utf-8')))

print tweetRDD.<FILL IN>

## **Conversion to JSON**

Next, we are going to parse the tweets, which are stored as JSON strings. Each tweet becomes a dictionary where each key is an attribute of the tweet. Some attributes, like *user*, have sub-attributes.

In the next cell the conversion takes place and the first tweet is shown. 

Just execute the cell, there's nothing to fill in.

In [None]:
import json
import re

jsonTweetRDD = tweetRDD.map(lambda x: json.loads(x))
parsed = jsonTweetRDD.take(1)
print json.dumps(parsed, indent=2)

## **Access to fields in the tweets**

We have made some selections for you to show how to access fields in tweets.

This is pure Python, although the data is contained in an RDD. You probably see what's going on here.

In [None]:
jsonTweetRDDtext = jsonTweetRDD.map(lambda x: [x['text'],
                                               x['created_at'],
                                               x['entities']['hashtags'],
                                               x['user']['name'], 
                                               x['user']['screen_name'], 
                                               x['user']['followers_count'], 
                                               x['user']['description']])
print jsonTweetRDDtext.take(1)
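
The same kind of field access can be tried on a single record outside Spark. The tweet below is a made-up, heavily shortened example (real tweets have many more attributes), so the field values are assumptions for illustration only.

```python
import json

# A made-up, minimal tweet as one line of JSON text
line = '{"text": "Hallo wereld", "user": {"screen_name": "example_user", "followers_count": 42}}'

# json.loads turns the JSON text into nested Python dictionaries
tweet = json.loads(line)

# Nested attributes are reached by indexing step by step
selected = [tweet['text'],
            tweet['user']['screen_name'],
            tweet['user']['followers_count']]

# dict.get with a default avoids a KeyError when an attribute is absent
hashtags = tweet.get('entities', {}).get('hashtags', [])

print(selected)
print(hashtags)
```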

## **RDDs are distributed over workers**

It is important to understand which code is executed on the workers and which on the driver. Moving data between the driver and the workers is very expensive.

RDDs are distributed over workers and transformations define a sequence of RDDs. Never try to define an RDD inside an RDD, and be aware of which code is executed by the driver.

Let's make a quick list of all attributes in a tweet. We'll do it the wrong way first, by doing a map on the RDD.

In [None]:
print jsonTweetRDD.map(lambda x: x.keys()).take(1) 

## **Another attempt**

The previous code is inefficient: the map is defined over all tweets in the RDD, so it describes an RDD with all keys for all tweets, while we only need the keys of one tweet. It is better to take a single tweet and then compute the keys outside the RDD. Note that the computation of the keys is then done by the driver.

Try to do this in a single statement.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
print jsonTweetRDD.take<FILL IN>

## **Selecting Text**

We select only the text from the tweets and clean it up a bit. Then we [cache](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=cache#pyspark.RDD.cache) the new RDD. 

In [None]:
# Select the text of each tweet, replace all non-word characters by spaces and cache the result
tweetTextRDD = (jsonTweetRDD.map(lambda x: x['text'])
                            .map(lambda x: x.encode('utf-8'))
                            .map(lambda x: re.sub(r'[^\w]',' ', x))                   
                            .cache()
                            )
print tweetTextRDD.take(1)

## **Filtering**

Use the [filter](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=filter#pyspark.RDD.filter) transformation to select only tweets that contain the word 'ik'.
Filter takes a boolean function as argument and returns those elements of the RDD for which this function returns True.

Make sure to convert the tweets to lower case before filtering.
Then count the selected tweets using [count](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=count#pyspark.RDD.count) and print the result.
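
To illustrate what a boolean function for filter looks like, here is a plain-Python sketch on a local list, using Python's built-in `filter`. The predicate `is_long` and the sample words are made up and are deliberately unrelated to the exercise.

```python
def is_long(word):
    """Boolean function: True for words with more than 4 characters."""
    return len(word) > 4

words = ['Dog', 'Rabbit', 'Hare', 'Woodpecker']

# filter keeps only the elements for which the function returns True
long_words = list(filter(is_long, words))

# Counting the remaining elements is a separate step
number_of_long_words = len(long_words)

print(long_words)
print(number_of_long_words)
```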

In [None]:
# TODO: Replace <FILL IN> with appropriate code
ikRDD = tweetTextRDD.<FILL IN>
count = <FILL IN>
print count

In [None]:
Test.assertEquals(count, 221, 'Wrong count')

## **Counting words in tweets**
Next count the words in tweets by applying four transformations in a chain on tweetTextRDD.

Note that tweetTextRDD is an RDD which contains lines (strings).

First, use string split on a single white space for every line in the RDD. Instead of Map, use [flatMap](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=flatmap#pyspark.RDD.flatMap) (Why?)<br>
Second, we only want to see words with a length larger than 1.<br> 
Third, use a map transformation to convert each word to lower case and create a (word, 1) tuple.<br>
Finally, use reduceByKey to add the result for each word.<br>

We will print the result by making use of the [takeOrdered](https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=takeordered#pyspark.RDD.takeOrdered.) action.
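
The difference between map and flatMap in the first step can be seen in plain Python. The sketch below mimics both on a local list of made-up lines; `itertools.chain.from_iterable` plays the role of the flattening.

```python
from itertools import chain

lines = ['to be or', 'not to be']

# map-like: one output element per line, so we get a list of lists
mapped = [line.split(' ') for line in lines]

# flatMap-like: the per-line lists are flattened into one list of words
flat_mapped = list(chain.from_iterable(line.split(' ') for line in lines))

print(mapped)
print(flat_mapped)
```

Only the flattened form has one element per word, which is what the later (word, 1) step needs.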

In [None]:
# TODO: Replace <FILL IN> with appropriate code
# Use flatMap, filter, map and reduceByKey to split the lines, filter out words shorter than 2 characters,
# create (word, 1) tuples and add up the results
aRDD = (tweetTextRDD.<FILL IN>
        .<FILL IN>
        .<FILL IN>
        .<FILL IN>)
print aRDD.takeOrdered(35, lambda x : -x[1])
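
The key function passed to takeOrdered above negates the count, so the elements with the smallest key are the words with the largest counts. A plain-Python analogue of this trick uses `heapq.nsmallest` on a local list; the counts below are made up for illustration.

```python
import heapq

# Made-up (word, count) pairs
counts = [('de', 12), ('ik', 7), ('het', 9), ('een', 3)]

# Taking the smallest elements by negated count yields the largest counts first
top_two = heapq.nsmallest(2, counts, key=lambda pair: -pair[1])

print(top_two)
```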