# Building a word count application

### Overview

This lab builds on the techniques covered in the Spark tutorial to develop a simple word count application. I will write code that finds the most common words using fundamental Spark functions such as `map`, `flatMap`, `filter`, and `reduceByKey`. The text data used in this work is Shakespeare's "Macbeth", taken from Open Source Shakespeare (http://www.opensourceshakespeare.org/).

### Procedure

① Create an RDD from the text data, replacing punctuation marks with spaces and lowercasing each line

② Split each line into words on spaces and filter out empty elements

③ Count the words, sort by count, and show the 15 most common words in Macbeth
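
Step ② needs the filter because replacing marks with spaces leaves runs of consecutive spaces, and `split(' ')` returns an empty string for each extra space. The following is a minimal plain-Python sketch of the same split-and-filter logic on one line of the cleaned text. The helper name `split_words` is hypothetical and is not part of the lab code.

```python
def split_words(line):
    """Split on single spaces and drop the empty strings, as in step 2."""
    return [w for w in line.split(' ') if w != '']

sample = "    in thunder  lightning  or in rain  "

raw_pieces = sample.split(' ')   # includes '' for every extra space
words = split_words(sample)      # only the real words remain

print(raw_pieces)
print(words)
```

For this particular cleanup, calling `line.split()` with no argument would give the same list, since it splits on runs of whitespace and discards empty strings.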

In [1]:
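def wordCount(wordListRDD):
    """Return an RDD of (word, count) pairs from an RDD of words."""
    # Pair each word with 1, then sum the 1s for each distinct word
    return wordListRDD.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)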
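
To make the behaviour of `wordCount` concrete, here is a rough plain-Python emulation of the `map` and `reduceByKey` steps on a small list. It is only a sketch of the semantics: `reduce_by_key` is a hypothetical stand-in, and Spark itself merges values within and across partitions instead of in a single local loop.

```python
def reduce_by_key(pairs, func):
    """Merge the values of equal keys with func (local sketch of reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

words = ['witch', 'when', 'witch', 'thunder', 'when', 'witch']
pairs = [(w, 1) for w in words]                    # map step: word -> (word, 1)
counts = reduce_by_key(pairs, lambda a, b: a + b)  # reduceByKey step

print(sorted(counts.items()))
```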

In [2]:
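# sc is the SparkContext provided by the notebook
macbethRDD = (sc
              .textFile("Macbeth.txt")
              # Replace the marks , . - ? [ ] with spaces, then lowercase the line
              .map(lambda x: x.replace(',', ' ').replace('.', ' ').replace('-', ' ')
                              .replace('?', ' ').replace('[', ' ').replace(']', ' ')
                              .lower()))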

# Show the first 10 cleaned lines with their line numbers
print('\n'.join(macbethRDD
                .zipWithIndex()  # to (line, lineNum)
                .map(lambda pair: '{0}: {1}'.format(pair[1], pair[0]))  # to 'lineNum: line'
                .take(10)))

0:  thunder and lightning  enter three witches 
1: 
2:     first witch  when shall we three meet again
3:     in thunder  lightning  or in rain  
4: 
5:     second witch  when the hurlyburly's done 
6:     when the battle's lost and won  5
7: 
8:     third witch  that will be ere the set of sun  
9: 
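
The chained `replace` calls handle only the six marks listed, so any other marks in the text (for example `;`, `:` or `!`) stay attached to the words. One possible alternative, shown as a sketch and not as the method used in this lab, is a single regular expression that replaces every character other than lowercase letters, digits, apostrophes, and whitespace. The name `clean_line` and the choice of characters to keep are assumptions.

```python
import re

# Matches anything that is not a lowercase letter, digit, apostrophe or whitespace
NON_WORD_CHARS = re.compile(r"[^a-z0-9'\s]")

def clean_line(line):
    """Lowercase the line and replace punctuation marks with spaces."""
    return NON_WORD_CHARS.sub(' ', line.lower())

print(clean_line("When the hurlyburly's done,"))
```

A function like this could be passed directly to `.map(clean_line)`. Doing so would likely change the word counts reported in this lab, because more marks would be stripped.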


In [3]:
macbethWordsRDD = (macbethRDD
                   .flatMap(lambda line: line.split(' '))
                   .filter(lambda w: w != ''))
macbethWordsCount = macbethWordsRDD.count()
# takeOrdered returns the smallest keys first, so negate the count to get the largest counts
top15WordsAndCounts = wordCount(macbethWordsRDD).takeOrdered(15, key=lambda kv: -kv[1])

print('\n'.join('{0}: {1}'.format(w, c) for w, c in top15WordsAndCounts))
print(macbethWordsCount)

the: 731
and: 565
to: 398
of: 342
i: 314
macbeth: 267
a: 250
that: 225
in: 205
my: 192
you: 190
is: 185
not: 159
with: 155
it: 145
18882
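
`takeOrdered` returns the elements with the smallest key first, which is why the count is negated to bring the largest counts to the front. The same idea can be sketched locally with `heapq.nsmallest`, here applied to five of the pairs from the output above, given in shuffled order.

```python
import heapq

word_counts = [('of', 342), ('the', 731), ('i', 314), ('and', 565), ('to', 398)]

# Smallest negated count == largest count, mirroring takeOrdered(n, key=...)
top3 = heapq.nsmallest(3, word_counts, key=lambda kv: -kv[1])

print(top3)
```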
