# Apache Spark | Word Count
We will cover some Spark functions to build a word counter and test it on *La Divina Commedia* by Dante Alighieri.

In [None]:
# Import Spark
# NOTE: This may differ depending on your system!
import os
execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))

## Part 1: Creating a base RDD and pair RDDs

### Create a base RDD
We'll start by generating a base RDD by using a Python list and the `sc.parallelize` method.  Then we'll print out the type of the base RDD.
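
The second argument to `sc.parallelize` is the number of partitions the list is spread across. As a rough, Spark-free sketch of what that means, the helper below (`split_into_slices`, a hypothetical name and not a Spark function) cuts a list into four contiguous slices. The boundaries Spark actually picks may differ; this only illustrates the idea.

```python
def split_into_slices(items, num_slices):
    """Cut items into num_slices contiguous, roughly equal slices (illustrative only)."""
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

words = ['cat', 'elephant', 'rat', 'rat', 'cat']
slices = split_into_slices(words, 4)
print(slices)
```

On a real RDD, `wordsRDD.glom().collect()` shows the actual contents of each partition.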

In [None]:
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)

print type(wordsRDD)

### Pluralize 

Let's use a `map()` transformation to add the letter 's' to each string in the base RDD we just created.

In [None]:
# One way of completing the function
def makePlural(word):
    return word + 's'

print makePlural('cat')

### Apply `makePlural` to the base RDD

Now pass each item in the base RDD into a [map()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map) transformation that applies the `makePlural()` function to each element. Then call the [collect()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collect) action to see the transformed RDD.
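
`map()` is a transformation, so it only describes a computation; nothing runs until an action such as `collect()` asks for results. A plain-Python generator behaves in a loosely similar way and can serve as an analogy. This is not Spark code, and `make_plural` and the `calls` list are illustrative names that simply mirror `makePlural`:

```python
calls = []  # records each time the function actually runs

def make_plural(word):
    calls.append(word)
    return word + 's'

words = ['cat', 'elephant', 'rat', 'rat', 'cat']

# Like a transformation: builds a lazy pipeline, the function is not called yet.
lazy_plurals = (make_plural(w) for w in words)
calls_before = len(calls)

# Like an action: forces evaluation and returns a local list.
plurals = list(lazy_plurals)
calls_after = len(calls)
```

Here `calls_before` is 0 and `calls_after` is 5: the function only runs once the results are requested.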

In [None]:
# TODO: Replace <FILL IN> with appropriate code
pluralRDD = wordsRDD.<FILL IN>
print pluralRDD.collect()

### Pass a `lambda` function to `map`

Let's create the same RDD using a `lambda` function.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
pluralLambdaRDD = wordsRDD.<FILL IN>
print pluralLambdaRDD.collect()

### Length of each word

Now use `map()` and a `lambda` function to return the number of characters in each word.  We'll `collect` this result directly into a variable.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
pluralLengths = (pluralRDD
                 .<FILL IN>
                 .collect())
print pluralLengths

### Pair RDDs

The next step in writing our word counting program is to create a new type of RDD, called a pair RDD. A pair RDD is an RDD where each element is a pair tuple `(k, v)` where `k` is the key and `v` is the value. In this example, we will create a pair consisting of `('<word>', 1)` for each word element in the RDD.
We can create the pair RDD by passing a `lambda` function to the `map()` transformation.
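
In plain Python terms, each element simply becomes a two-item tuple. A minimal sketch on a local list (not Spark code):

```python
words = ['cat', 'elephant', 'rat', 'rat', 'cat']

# Each word becomes a (key, value) pair with an initial count of 1.
pairs = [(word, 1) for word in words]

# Tuple unpacking gives access to the key and the value of a pair.
first_key, first_value = pairs[0]
```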

In [None]:
# TODO: Replace <FILL IN> with appropriate code
wordPairs = wordsRDD.map(<FILL IN>)
print wordPairs.collect()

## Part 2: Counting with pair RDDs

Now, let's count the number of times a particular word appears in the RDD. There are multiple ways to perform the counting, but some are much less efficient than others.

### `groupByKey()` approach
An approach you might first consider (we'll see shortly that there are better ways) is based on using the [groupByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey) transformation. As the name implies, the `groupByKey()` transformation groups all the elements of the RDD with the same key into a single list in one of the partitions.

There are two problems with using `groupByKey()`:
  + The operation requires a lot of data movement to move all the values into the appropriate partitions.
  + The lists can be very large. Consider a word count of English Wikipedia: the lists for common words (e.g., the, a, etc.) would be huge and could exhaust the available memory in a worker.

Use `groupByKey()` to generate a pair RDD of type `('word', iterable)`.
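
To see why the grouped lists can grow large, here is a plain-Python sketch of the grouping step using `collections.defaultdict`. It is an analogy only; Spark additionally has to move the values between partitions. Every individual `1` is kept, so a key that occurs a million times would carry a million-element list:

```python
from collections import defaultdict

pairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]

grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)  # every value is stored, none are combined

# Counts only appear once each full list is summed.
counts = {key: sum(values) for key, values in grouped.items()}
```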

In [None]:
# TODO: Replace <FILL IN> with appropriate code
# Note that groupByKey requires no parameters
wordsGrouped = wordPairs.<FILL IN>
for key, value in wordsGrouped.collect():
    print '{0}: {1}'.format(key, list(value))

### Use `groupByKey()` to obtain the counts

Using the `groupByKey()` transformation creates an RDD containing 3 elements, each of which is a pair of a word and a Python iterable.

Now sum the iterator using a `map()` transformation.  The result should be a pair RDD consisting of (word, count) pairs.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
wordCountsGrouped = wordsGrouped.map(<FILL IN>)
print wordCountsGrouped.collect()

### Counting using `reduceByKey`

A better approach is to start from the pair RDD and then use the [reduceByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) transformation to create a new pair RDD. The `reduceByKey()` transformation gathers together pairs that have the same key and applies the function provided to two values at a time, iteratively reducing all of the values to a single value. `reduceByKey()` operates by applying the function first within each partition on a per-key basis and then across the partitions, allowing it to scale efficiently to large datasets.
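
The two-stage behaviour can be sketched in plain Python. The partition layout below is made up for illustration, and `reduce_partition` is a hypothetical helper, not a Spark API. The point is that only small per-partition results, rather than every individual pair, need to be combined across partitions:

```python
from operator import add

# Hypothetical layout of the pair RDD across three partitions.
partitions = [
    [('cat', 1), ('elephant', 1)],
    [('rat', 1), ('rat', 1)],
    [('cat', 1)],
]

def reduce_partition(pairs, func):
    """Combine values per key, two at a time, within one collection of pairs."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

# Stage 1: each partition reduces its own pairs locally.
local_results = [reduce_partition(p, add) for p in partitions]

# Stage 2: the per-partition results are merged with the same function.
merged_pairs = [item for result in local_results for item in result.items()]
word_counts = reduce_partition(merged_pairs, add)
```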

In [None]:
# TODO: Replace <FILL IN> with appropriate code
# Note that reduceByKey takes in a function that accepts two values and returns a single value
wordCounts = wordPairs.reduceByKey(<FILL IN>)
print wordCounts.collect()

### All together

The expert version of the code performs the `map()` to pair RDD, `reduceByKey()` transformation, and `collect` in one statement.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
wordCountsCollected = (wordsRDD
                       <FILL IN>
                       .collect())
print wordCountsCollected

## Part 3: Finding unique words and a mean value

### Unique words

Calculate the number of unique words in `wordsRDD`.  You can use other RDDs that you have already created to make this easier.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
uniqueWords = wordCounts.<FILL IN>
print uniqueWords

### Mean using `reduce`

Find the mean number of occurrences per unique word in `wordCounts`.

Use a `reduce()` action to sum the counts in `wordCounts` and then divide by the number of unique words.  First `map()` the pair RDD `wordCounts`, which consists of (key, value) pairs, to an RDD of values.
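
The arithmetic can be checked locally with `functools.reduce`, which applies a two-argument function pairwise in the same spirit as the RDD `reduce()` action. This sketch works on a plain list of pairs, not on an RDD:

```python
from functools import reduce
from operator import add

word_counts = [('cat', 2), ('elephant', 1), ('rat', 2)]

values = [count for _, count in word_counts]  # keep only the value of each pair
total = reduce(add, values)                   # (2 + 1) + 2 = 5
mean = total / float(len(word_counts))        # 5 / 3
```

The `float()` conversion matters in Python 2, where dividing one integer by another truncates the result.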

In [None]:
# TODO: Replace <FILL IN> with appropriate code
from operator import add
totalCount = (wordCounts
              <FILL IN>
             )
average = totalCount / float(uniqueWords)
print round(average, 2)

In [None]:
assert round(average, 2) == 1.67

## Part 4: Apply word count to a file

In this section we will finish developing our word count application.  We'll have to build the `wordCount` function, deal with real-world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data.

### `wordCount` function

First, define a function for word counting.  You should reuse the techniques that have been covered in earlier parts of this lab.  This function should take in an RDD that is a list of words like `wordsRDD` and return a pair RDD that has all of the words and their associated counts.
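
When testing `wordCount` on small inputs, a local reference result is handy. One option, sketched here on a plain list rather than an RDD, is `collections.Counter` from the standard library:

```python
from collections import Counter

words = ['cat', 'elephant', 'rat', 'rat', 'cat']

# Counter tallies occurrences locally; sorting makes the result order-independent.
expected = sorted(Counter(words).items())
```

Comparing `sorted(wordCount(rdd).collect())` against such a list avoids depending on the order in which Spark returns the pairs.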

In [None]:
# TODO: Replace <FILL IN> with appropriate code
def wordCount(wordListRDD):
    """Creates a pair RDD with word counts from an RDD of words.

    Args:
        wordListRDD (RDD of str): An RDD consisting of words.

    Returns:
        RDD of (str, int): An RDD consisting of (word, count) tuples.
    """
    return wordListRDD.<FILL IN>
print wordCount(wordsRDD).collect()

In [None]:
assert sorted(wordCount(wordsRDD).collect()) == [('cat', 2), ('elephant', 1), ('rat', 2)]

### Capitalization and punctuation

Real-world files are more complicated than the data we have been using in this lab. Some of the issues we have to address are:
  + Words should be counted independent of their capitalization (e.g., Spark and spark should be counted as the same word).
  + All punctuation should be removed.
  + Any leading or trailing spaces on a line should be removed.

The function `removePunctuation` below converts all text to lower case, removes any punctuation, and removes leading and trailing spaces.

In [None]:
import re
import string
def removePunctuation(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        text (str): A string.

    Returns:
        str: The cleaned up string.
    """
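    # Escape the punctuation so that ']' and '\' are matched as literal characters.
    pattern = re.compile('[%s]' % re.escape(string.punctuation))
    t = pattern.sub('', text.lower())
    # Collapse runs of spaces left behind, then trim both ends.
    return re.sub(' +', ' ', t).strip()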

print removePunctuation('Hi, you!')
print removePunctuation(' No under_score!')
print removePunctuation(' *      Remove punctuation then spaces  * ')

In [None]:
assert removePunctuation(" The Elephant's   4 cats.  ") == 'the elephants 4 cats'
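
To make the order of operations concrete, this sketch applies the cleaning steps one at a time to the string from the assertion. It escapes the punctuation set with `re.escape`, an assumption made here so that every punctuation character is treated literally inside the character class:

```python
import re
import string

text = " The Elephant's   4 cats.  "

lowered = text.lower()
# Remove every character listed in string.punctuation.
no_punct = re.sub('[%s]' % re.escape(string.punctuation), '', lowered)
collapsed = re.sub(' +', ' ', no_punct)  # runs of spaces become one space
cleaned = collapsed.strip()              # drop leading and trailing spaces
```

Stripping last matters: removing punctuation can itself expose new leading or trailing spaces.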

### Load a text file

For the next part of this lab, we will use *La Divina Commedia* by Dante Alighieri. To convert a text file into an RDD, we use the `SparkContext.textFile()` method. We also apply the `removePunctuation()` function defined above using a `map()` transformation to strip out the punctuation and change all text to lower case.  Since the file is large, we use `zipWithIndex()` to pair each line with its index and `take(15)` so that only the first 15 lines are displayed.
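
As a rough local analogy, pairing each element with its position and keeping only the first few looks like this in plain Python. The `sample_lines` list is made-up sample input, not the contents of the file:

```python
sample_lines = [
    'nel mezzo del cammin di nostra vita',
    '',
    'mi ritrovai per una selva oscura',
]

# zipWithIndex-style pairing: (element, index), with the index starting at 0.
indexed = [(line, i) for i, line in enumerate(sample_lines)]

# take(n)-style truncation: only the first n elements are returned.
first_two = indexed[:2]
```

Note that the element comes first and the index second, and that blank lines receive an index like any other line.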

In [None]:
danteRDD = sc.textFile('input/divina_commedia.txt').map(removePunctuation)
danteRDD.zipWithIndex().take(15)

### Words from lines

Before we can use the `wordCount()` function, we have to address two issues with the format of the RDD:
  + The first issue is that we need to split each line by its spaces.
  + The second issue is that we need to filter out empty elements.

Apply a transformation that will split each element of the RDD by its spaces. For each element of the RDD, you should apply Python's string [split()](https://docs.python.org/2/library/string.html#string.split) function. You might think that a `map()` transformation is the way to do this, but think about what the result of the `split()` function will be.
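
The difference in result shape can be seen on a small local list. This is a plain-Python sketch of map-style versus flatMap-style behaviour, using made-up sample lines rather than the real file:

```python
from itertools import chain

lines = ['nel mezzo del cammin', '', 'di nostra vita']

# map-style: one output element per line, each of them a list of words.
mapped = [line.split(' ') for line in lines]

# flatMap-style: the per-line lists are concatenated into one flat list.
flat = list(chain.from_iterable(line.split(' ') for line in lines))
```

An empty line splits to `['']`, so the flat list contains an empty string for every blank line in the input.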

In [None]:
# TODO: Replace <FILL IN> with appropriate code
danteWordsRDD = danteRDD.<FILL IN>
danteWordCount = danteWordsRDD.count()
print danteWordCount

In [None]:
assert danteWordCount == 96972

**Remove empty elements**

The next step is to filter out the empty elements.  Remove all entries where the word is `''`.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
danteWordsRDD2 = danteWordsRDD.<FILL IN>
danteWordsCount2 = danteWordsRDD2.count()
print danteWordsCount2

In [None]:
assert danteWordsCount2 == 96561

### Count the words

We now have an RDD that contains only words.  Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the top 25 words by using the `takeOrdered()` action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.

You'll notice that many of the words are very common words (articles, prepositions, and the like). These are called stopwords. In a later lab, we will see how to eliminate them from the results.
Use the `wordCount()` function and `takeOrdered()` to obtain the 25 most common words and their counts.
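
`takeOrdered(n, key)` returns the `n` smallest elements in ascending order of `key`, so obtaining the largest counts requires a key that reverses the order. The standard library's `heapq.nsmallest` follows the same convention and can illustrate the trick locally (the counts below are invented sample data):

```python
import heapq

counts = [('cat', 2), ('elephant', 1), ('rat', 5), ('dog', 3)]

# Smallest by count: ascending order on the value part of each pair.
bottom_two = heapq.nsmallest(2, counts, key=lambda pair: pair[1])

# Negating the count reverses the order, so the largest counts come first.
top_two = heapq.nsmallest(2, counts, key=lambda pair: -pair[1])
```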

In [None]:
# takeOrdered sorts ascending by key, so negate the count to get the largest first
top25WordsAndCounts = wordCount(danteWordsRDD2).takeOrdered(25, key=lambda s: -s[1])
print '\n'.join(u'{0}: {1}'.format(w, c) for w, c in top25WordsAndCounts)