## Word Count
Counting the number of occurrences of each word in a text is one of the most popular first exercises when learning Map-Reduce programming. It is the equivalent of `Hello World!` in regular programming.

We will do it in two ways: a simpler way, where the sorting is done after the RDD is collected, and a more Spark-like way, where the sorting is also done using an RDD.

In [1]:
# First, check that the text file is where we expect it to be
%ls -l ../../Data/Moby-Dick.txt

-rw-rw-r-- 1 elliott elliott 1237539 Mar 11 14:21 ../../Data/Moby-Dick.txt


In [2]:
#start the SparkContext
from pyspark import SparkContext
sc = SparkContext(master="local[3]")

### Define an RDD that will read the file
Note that, because execution is lazy, defining the RDD does not necessarily mean that the file content has actually been read.
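
The idea of lazy execution can be illustrated with an analogy in plain Python. This is only a sketch using generators, not Spark's actual mechanism; the names `read_lines`, `reads`, and `source` are made up for the illustration.

```python
# Plain-Python analogy for lazy evaluation (not Spark code).
reads = []  # records every line that is actually "read"

def read_lines(lines):
    """Generator standing in for a lazily read text file."""
    for line in lines:
        reads.append(line)  # side effect so we can observe when work happens
        yield line

source = ["call me ishmael", "some years ago"]  # hypothetical sample lines

# Building the pipeline does no work: nothing has been read yet.
pipeline = (line.upper() for line in read_lines(source))
reads_before = len(reads)

# Consuming the pipeline (the analogue of a Spark action) triggers the reads.
result = list(pipeline)
reads_after = len(reads)
```

Here `reads_before` stays at zero because defining the pipeline only describes the work. The lines are read when `list(pipeline)` asks for the results.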

In [3]:
%%time
text_file = sc.textFile(u'../../Data/Moby-Dick.txt')
type(text_file)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 502 ms


### Define the code for counting the words
* Split each line on spaces and drop the empty strings.
* Map each `word` to `(word,1)`.
* Count the number of occurrences of each word.
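
To see what each step produces, the same steps can be emulated on a small list in plain Python. This is a sketch of the semantics only, with no Spark and no parallelism; the sample `lines` are hypothetical.

```python
from collections import defaultdict

lines = ["the whale  the sea", "the ship"]  # hypothetical sample input

# flatMap: split every line on single spaces and flatten into one list of tokens
tokens = [tok for line in lines for tok in line.split(" ")]
# filter: drop the empty strings produced by consecutive spaces
words = [tok for tok in tokens if tok != ""]
# map: pair each word with the count 1
pairs = [(word, 1) for word in words]
# reduceByKey: add up the values that share the same key
totals = defaultdict(int)
for word, one in pairs:
    totals[word] += one
word_counts = dict(totals)
```

The double space in the first line shows why the `filter` step is needed: splitting on `" "` yields an empty string between the two spaces.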

In [4]:
%%time
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .filter(lambda x: x!='')\
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
type(counts)

CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 103 ms


### Have a look at the execution plan
Note that the earliest node in the dependency graph is the file `../../Data/Moby-Dick.txt`. 

In [5]:
print counts.toDebugString()

(2) PythonRDD[6] at RDD at PythonRDD.scala:48 []
 |  MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:422 []
 |  ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(2) PairwiseRDD[3] at reduceByKey at <timed exec>:1 []
    |  PythonRDD[2] at reduceByKey at <timed exec>:1 []
    |  ../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
    |  ../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []


### Count!
Finally, we count the number of distinct words and the total number of word occurrences.
Only now does the lazy execution model perform actual work, which takes a significant amount of time.

In [6]:
%%time
Count=counts.count()
Sum=counts.map(lambda (w,i): i).reduce(lambda x,y:x+y)
print 'Count=%f, sum=%f, mean=%f'%(Count,Sum,float(Sum)/Count)

Count=33301.000000, sum=212113.000000, mean=6.369568
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 912 ms


## Finding the most common words
* `counts`: RDD with 33301 pairs of the form `(word,count)`. 
* Find the 5 most frequent words.
* **Method1:** `collect` and sort on head node.
* **Method2:** Pure Spark, `collect` only at the end.

## Method1: `collect` and sort on head node 
### Collect the RDD into the driver node
* Collect can take significant time.

In [7]:
%%time
C=counts.collect()
print type(C)

<type 'list'>
CPU times: user 24 ms, sys: 0 ns, total: 24 ms
Wall time: 115 ms


### Sort 
* The RDD has been collected into a list on the driver node.
* We are no longer using Spark parallelism.
* The sort is done in plain Python.
* This will not scale to very large documents.
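
When only the few most or least common words are needed, a full sort of the collected list is not strictly required. The sketch below uses the standard-library `heapq` module on a small hypothetical list standing in for the collected pairs. It is an alternative to the sort used in this notebook, not a description of it.

```python
import heapq

# Hypothetical stand-in for the collected list of (word, count) pairs.
collected = [("to", 4440), ("funereal", 1), ("the", 13609),
             ("of", 6481), ("a", 4477), ("and", 5883)]

# Keep only the 3 largest and the 2 smallest counts without sorting the whole list.
most_common = heapq.nlargest(3, collected, key=lambda pair: pair[1])
least_common = heapq.nsmallest(2, collected, key=lambda pair: pair[1])
```

This still runs entirely on the driver, so the scaling concern remains. PySpark RDDs also offer `top` and `takeOrdered`, which select the extreme elements without first collecting everything.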

In [8]:
C.sort(key=lambda x:x[1])
print 'most common words\n','\n'.join(['%s:\t%d'%c for c in C[-5:]])
print '\nLeast common words\n','\n'.join(['%s:\t%d'%c for c in C[:5]])

most common words
to:	4440
a:	4477
and:	5883
of:	6481
the:	13609

Least common words
funereal:	1
unscientific:	1
lime-stone,:	1
shouted,:	1
pitch-pot,:	1


### Compute the mean number of occurrences per word

In [9]:
Count2=len(C)
Sum2=sum([i for w,i in C])
print 'count2=%f, sum2=%f, mean2=%f'%(Count2,Sum2,float(Sum2)/Count2)


count2=33301.000000, sum2=212113.000000, mean2=6.369568


## **Method2:** Pure Spark, `collect` only at the end.
* Collect into the head node only the most frequent words.
* Requires multiple **stages**.

**Step 1** Split, clean, and map to `(word,1)`

In [10]:
%%time
RDD=text_file.flatMap(lambda x: x.split(' '))\
    .filter(lambda x: x!='')\
    .map(lambda word: (word,1))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 51 µs


**Step 2** Count occurrences of each word.

In [11]:
%%time
RDD1=RDD.reduceByKey(lambda x,y:x+y)

CPU times: user 8 ms, sys: 4 ms, total: 12 ms
Wall time: 27.7 ms


**Step 3** Reverse `(word,count)` to `(count,word)` and sort by key

In [12]:
%%time
RDD2=RDD1.map(lambda (c,v):(v,c))
RDD3=RDD2.sortByKey(False)

CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 394 ms
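
The effect of Step 3 can be sketched on a small list in plain Python. The pairs below are hypothetical stand-ins for the contents of `RDD1`. The code only mirrors the semantics of `map` followed by `sortByKey(False)`, not how Spark distributes the sort.

```python
# Hypothetical (word, count) pairs, standing in for the contents of RDD1.
word_counts = [("of", 6481), ("the", 13609), ("and", 5883)]

# map: swap each pair so that the count becomes the key
count_words = [(count, word) for word, count in word_counts]
# sortByKey(False): order by key, largest first
sorted_pairs = sorted(count_words, key=lambda pair: pair[0], reverse=True)
# take(2): keep only the first two elements
top_two = sorted_pairs[:2]
```

The swap is needed because `sortByKey` orders pairs by their first element, and after Step 2 the first element is the word rather than its count.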


### Full execution plan

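We now have a complete plan to compute the most common words in the text. No results have been collected yet. One caveat: in PySpark, `sortByKey` on an RDD with more than one partition runs a sampling job to choose partition boundaries, which probably accounts for the 394 ms wall time in Step 3, so the file `Moby-Dick.txt` may already have been read at this point.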

For more on execution plans and lineage, see [Jacek Laskowski's blog](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-lineage.html#toDebugString).

In [13]:
print 'RDD3:'
print RDD3.toDebugString()

RDD3:
(2) PythonRDD[19] at RDD at PythonRDD.scala:48 []
 |  MapPartitionsRDD[18] at mapPartitions at PythonRDD.scala:422 []
 |  ShuffledRDD[17] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(2) PairwiseRDD[16] at sortByKey at <timed exec>:2 []
    |  PythonRDD[15] at sortByKey at <timed exec>:2 []
    |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
    |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
    +-(2) PairwiseRDD[10] at reduceByKey at <timed exec>:1 []
       |  PythonRDD[9] at reduceByKey at <timed exec>:1 []
       |  ../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
       |  ../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []


In [15]:
print text_file.toDebugString()

(2) ../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
 |  ../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []


**Step 4** Take the top 5 words. **Only now does the computer execute the full plan!**

In [38]:
%%time
C=RDD3.take(5)
print 'most common words\n','\n'.join(['%d:\t%s'%c for c in C])

most common words
13609:	the
6481:	of
5883:	and
4477:	a
4440:	to
CPU times: user 4.96 ms, sys: 2.17 ms, total: 7.13 ms
Wall time: 46.9 ms


## Exercise
A `k`-mer is a sequence of `k` consecutive words. 

For example, the `3`-mers in the line `you are my sunshine my only sunshine` are

* `you are my`
* `are my sunshine`
* `my sunshine my`
* `sunshine my only`
* `my only sunshine`

For the sake of simplicity we consider only the `k`-mers that appear in a single line. In other words, we ignore `k`-mers that span more than one line.
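
The definition can be made concrete with a small plain-Python helper that lists the `k`-mers of a single line. The name `line_kmers` is hypothetical and the sketch does not use Spark. It covers only the per-line definition, not the full exercise.

```python
def line_kmers(line, k):
    """Return the k-mers (tuples of k consecutive words) found in one line."""
    words = line.split()  # split on whitespace, ignoring repeated spaces
    # Slide a window of k words across the line; too-short lines yield nothing.
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

example = "you are my sunshine my only sunshine"
kmers = line_kmers(example, 3)
```

Because each line is handled on its own, `k`-mers that would span two lines never appear, which matches the simplification stated above.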

Write a function, using Spark all the way to the end, to find the top 10 `k`-mers in a given text for a given `k`.

Specifically, write functions with the following signatures:
```python
def map_kmers(text,k):
    # text: an RDD of text lines. Lines contain only lower-case letters and spaces. Spaces should be ignored.
    # k: length of `k`-mers
    # singles: an RDD of pairs of the form (tuple of k words,1)
    return singles
def count_kmers(singles):
    # singles: as above
    # counts: RDD of the form: (tuple of k words, number of occurrences)
    return counts
def sort_counts(count):
    # count: as above
    # sorted_counts: RDD of the form (number of occurrences, tuple of k words) sorted in decreasing number of occurrences.
    return sorted_counts
```
Combining these operations as follows:
```python 
k=3
singles=map_kmers(text,k)
count=count_kmers(singles)
sorted_counts=sort_counts(count)
C=sorted_counts.take(5)
print 'most common %d-mers\n'%k,'\n'.join(['%d:\t%s'%c for c in C])
```

should print the 5 most common 3-mers.

In [1]:
sc.stop()
