# Finished!

## Word Count
Counting the number of occurrences of each word in a text is one of the most popular first exercises when learning Map-Reduce programming. It is the equivalent of `Hello World!` in regular programming.

We will do it in two ways: a simpler way, where sorting is done after the RDD is collected, and a more sparky way, where the sorting is also done using an RDD.

In [None]:
# First, check that the text file is where we expect it to be
%ls ../Data/Moby-Dick.txt

### Read the text file into an RDD
Note that, because execution is lazy, this does not necessarily mean that the file content has actually been read.

In [None]:
%%time
text_file = sc.textFile(u'../Data/Moby-Dick.txt')
type(text_file)
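
Lazy evaluation can be illustrated without Spark by using a Python generator. The sketch below is an analogy only and says nothing about how Spark is implemented; `lazy_lines`, `log`, and the two sample lines are hypothetical names and data introduced for this example.

```python
# Analogy only: a generator defers work until its items are requested,
# much as an RDD defers reading until an action is called.
log = []  # records when "reading" actually happens

def lazy_lines(lines):
    """Yield lines one at a time, logging each read."""
    for line in lines:
        log.append("read: " + line)
        yield line

plan = lazy_lines(["Call me Ishmael.", "Some years ago"])
reads_before = len(log)   # building the plan has read nothing yet

first = next(plan)        # requesting an item triggers the first read
reads_after = len(log)

print(reads_before, reads_after, first)
```

Creating `plan` costs almost nothing, which is why the `%%time` figure for `sc.textFile` is typically tiny.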

### Count the words
Next, we count the number of times each word occurs in the text. Again, this only sets up the plan; nothing is computed yet.

In [None]:
%%time
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
type(counts)
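
To make the three transformations concrete, here is a minimal pure-Python sketch of what `flatMap`, `map`, and `reduceByKey` compute, run on a two-line sample instead of the book. It imitates the result only; Spark would distribute this work across partitions. The sample `lines` and all variable names are assumptions made for illustration.

```python
from collections import defaultdict

lines = ["the whale the sea", "the ship"]  # hypothetical sample input

# flatMap: split every line and flatten the results into one list of words
words = [word for line in lines for word in line.split(" ")]

# map: pair each word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of equal keys with a + b
totals = defaultdict(int)
for word, one in pairs:
    totals[word] = totals[word] + one

local_counts = dict(totals)
print(local_counts)
```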

### Have a look at the execution plan
Note that the earliest node in the dependency graph is the file `../Data/Moby-Dick.txt`. It is possible that even the first element in that file has not yet been read!

In [None]:
print(counts.toDebugString())

### Count!
Finally, we count the number of distinct words and the total number of word occurrences.
Note that in this cell the lazy execution model finally performs some actual work.

In [None]:
%%time
Count = counts.count()  # number of distinct words
Sum = counts.map(lambda pair: pair[1]).reduce(lambda x, y: x + y)  # total number of words
print('count=%f, sum=%f, mean=%f' % (Count, Sum, float(Sum) / Count))

### Collect the `counts` RDD into the driver node
This also takes significant work.

In [None]:
%%time
C=counts.collect()
type(C)

### Sort
Now that we have collected the `counts` RDD into the driver node, we no longer rely on Spark. The following two cells are plain Python commands.

In [None]:
C.sort(key=lambda x: x[1])  # sort in place by count, ascending
print('Most common words: %s' % (C[-10:],))
print('Least common words: %s' % (C[:10],))
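
When only the extremes are needed, a full sort does more work than required. As a sketch, Python's standard `heapq` module can select the largest or smallest entries directly; the list `sample` below is a made-up stand-in for `C`, and `k` is set to 2 only to keep the output short.

```python
import heapq

# Hypothetical stand-in for the collected list C of (word, count) pairs
sample = [("the", 5), ("whale", 2), ("sea", 3), ("ship", 1), ("a", 4)]

# Select the k entries with the largest / smallest counts without sorting the whole list
k = 2
most_common = heapq.nlargest(k, sample, key=lambda pair: pair[1])
least_common = heapq.nsmallest(k, sample, key=lambda pair: pair[1])

print(most_common)
print(least_common)
```

Unlike `C[-10:]`, which lists the top entries in ascending order, `heapq.nlargest` returns them largest first.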

### Compute the mean number of occurrences per word.

In [None]:
Count2 = len(C)  # number of distinct words
Sum2 = sum(i for w, i in C)  # total number of words
print('count2=%f, sum2=%f, mean2=%f' % (Count2, Sum2, float(Sum2) / Count2))


## Word Count in Pure Spark
We now show how to perform word count, including sorting, using only RDD operations, returning just the top 10 words to the driver node.

In [None]:
%%time
RDD=text_file.flatMap(lambda x: x.split(' '))\
    .filter(lambda x: x!='')\
    .map(lambda word: (word,1))
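
The `filter` step matters because `split(' ')` yields empty strings wherever spaces are repeated or a line ends in a space. A small check in plain Python, using a made-up line (the variable names are illustrative only):

```python
line = "the  white whale "   # hypothetical line with a double space and a trailing space

raw = line.split(" ")                    # empty strings appear between and after the spaces
cleaned = [w for w in raw if w != ""]    # same test as the filter step
default = line.split()                   # split() with no argument skips runs of whitespace

print(raw)
print(cleaned)
print(default)
```

The first version of the pipeline had no such filter, so its counts may include the empty string as a "word".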

In [None]:
%%time
RDD1=RDD.reduceByKey(lambda x,y:x+y)

In [None]:
%%time
RDD2 = RDD1.map(lambda pair: (pair[1], pair[0]))  # swap (word, count) into (count, word)

In [None]:
%%time
RDD3 = RDD2.sortByKey(ascending=False)  # sort by count, largest first

In [None]:
print(RDD3.toDebugString())

In [None]:
%%time
RDD3.take(10)
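
The swap in `RDD2` is there because `sortByKey` orders pairs by their first element. The following pure-Python sketch mimics the last three steps on made-up counts; it mirrors the logic only and involves no Spark. The names `word_counts`, `swapped`, `ordered`, and `top2` are hypothetical.

```python
# Hypothetical (word, count) pairs, standing in for RDD1
word_counts = [("whale", 2), ("the", 5), ("sea", 3)]

# Like RDD2: swap each pair so the count becomes the key
swapped = [(count, word) for word, count in word_counts]

# Like RDD3: sort by key in descending order
ordered = sorted(swapped, key=lambda pair: pair[0], reverse=True)

# Like take(10): keep only the first few results (2 here, to keep it short)
top2 = ordered[:2]
print(top2)
```

PySpark also provides `takeOrdered` and `top`, both of which accept a `key` function and might make the swap unnecessary; whether they are faster on this data set has not been measured here.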