# Word Count

Counting the number of occurrences of each word in a text is a popular first exercise in map-reduce.

### Task
**Input:** A text file consisting of words separated by spaces.  
**Output:** A list of words and their counts, sorted from the most to the least common.

We will use the book "Moby Dick" as our input.
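
As a point of reference before turning to Spark, the task can be sketched on a single machine with the Python standard library. This is only a baseline under the assumption that the whole text fits in memory; the helper name `word_count` and the sample sentence are made up for illustration.

```python
from collections import Counter

def word_count(text):
    """Return (word, count) pairs sorted from most to least common."""
    # Split each line on single spaces and drop empty strings,
    # mirroring the task statement above.
    words = [w for line in text.splitlines() for w in line.split(" ") if w != ""]
    return Counter(words).most_common()

sample = "the whale and the sea and the ship"
result = word_count(sample)
print(result[:2])
```

The rest of the notebook computes the same quantity with Spark, which matters once the input no longer fits on one machine.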

In [1]:
#start the SparkContext
import findspark
findspark.init()

from pyspark import SparkContext
sc=SparkContext(master="local[4]")

# set import path
import sys
sys.path.append('../../Utils/')
from plan2table import plan2table

### Download data file from S3

In [2]:
%%time
import requests
data_dir='../../Data'
filename='Moby-Dick.txt'
url = "https://mas-dse-open.s3.amazonaws.com/"+filename
local_path = data_dir+'/'+filename

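# Copy URL content to local_path
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
with open(local_path, 'wb') as f:  # the context manager closes the file
    f.write(r.content)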

# check that the text file is where we expect it to be
!ls -l $local_path

-rw-r--r--  1 yoavfreund  staff  1257260 Feb 24 13:09 ../../Data/Moby-Dick.txt
CPU times: user 146 ms, sys: 48.9 ms, total: 195 ms
Wall time: 1.64 s


### Define an RDD that will read the file
* As execution is **lazy**, defining the RDD does not necessarily mean that any of the file content has actually been read.
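
Lazy execution can be illustrated without Spark using a Python generator. The sketch below is an analogy only, not how Spark is implemented; `read_lines` and the `log` list are hypothetical names used to show when the data is actually consumed.

```python
log = []

def read_lines(lines):
    for line in lines:
        log.append("read: " + line)  # records the moment a line is actually consumed
        yield line

source = ["call me ishmael", "some years ago"]
lazy_lines = read_lines(source)                 # defines the computation; nothing runs yet
upper = (line.upper() for line in lazy_lines)   # still nothing has been read

reads_before = len(log)   # no line has been consumed so far
result = list(upper)      # forcing the result triggers the reads
reads_after = len(log)
print(reads_before, reads_after, result)
```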

In [3]:
%%time
text_file = sc.textFile(data_dir+'/'+filename)
type(text_file)

CPU times: user 2.43 ms, sys: 2.61 ms, total: 5.05 ms
Wall time: 667 ms


### Steps for counting the words

* split line by spaces.
* map `word` to `(word,1)`
* count the number of occurrences of each word.
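
The three steps above can be sketched in plain Python to show what each one does to the data. The helpers `flat_map` and `reduce_by_key` are hypothetical single-machine stand-ins for the Spark operations of similar name; they are not Spark APIs.

```python
from itertools import chain
from operator import add

def flat_map(func, items):
    # Apply func to every item and concatenate the resulting lists.
    return chain.from_iterable(func(x) for x in items)

def reduce_by_key(func, pairs):
    # Combine all values that share a key using the binary function func.
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

lines = ["the whale  the sea", "the ship"]        # note the double space
words = flat_map(lambda line: line.split(" "), lines)
not_empty = filter(lambda w: w != "", words)      # the double space yields an empty string
key_values = map(lambda w: (w, 1), not_empty)
toy_counts = reduce_by_key(add, key_values)
print(toy_counts)
```

The filter step is needed because splitting on a single space produces empty strings wherever two spaces are adjacent.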

In [4]:
%%time
words =     text_file.flatMap(lambda line: line.split(" "))
not_empty = words.filter(lambda x: x!='')
key_values=  not_empty.map(lambda word: (word, 1)) 
counts=     key_values.reduceByKey(lambda a, b: a + b)

CPU times: user 10.2 ms, sys: 3.81 ms, total: 14 ms
Wall time: 83.7 ms


### We have a plan!
In this cell we defined the execution plan, but we have not started to execute it.

Note that preparing the plan took ~100ms, which is a non-trivial amount of time, but still much less than the time it will take to execute it.
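
To make the idea of a plan concrete, here is a toy sketch of an object that stores a recipe rather than data. `ToyRDD` is a hypothetical class written for this illustration; it is far simpler than a real RDD and ignores partitioning and shuffles.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: it stores a recipe, not data."""

    def __init__(self, source, steps=()):
        self.source = source        # any iterable of records
        self.steps = tuple(steps)   # (name, function) pairs, applied in order

    def map(self, func):
        step = ("map", lambda it: map(func, it))
        return ToyRDD(self.source, self.steps + (step,))

    def filter(self, pred):
        step = ("filter", lambda it: filter(pred, it))
        return ToyRDD(self.source, self.steps + (step,))

    def plan(self):
        # The recorded transformations, oldest first.
        return [name for name, _ in self.steps]

    def collect(self):
        # Only here is the recipe actually applied to the data.
        data = iter(self.source)
        for _, step in self.steps:
            data = step(data)
        return list(data)

calls = []

def square(x):
    calls.append(x)   # records when the function really runs
    return x * x

planned = ToyRDD(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before = len(calls)      # the plan exists, but square has not run
result = planned.collect()     # executing the plan runs square on every record
print(planned.plan(), calls_before, result)
```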

### Let's have a look at the execution plan
Note that the earliest node in the dependency graph is the file `../../Data/Moby-Dick.txt`. 

### Print the execution plan for each of the RDDs
Note that the execution plans for `words`, `not_empty` and `key_values` are the same apart from the ID of the top-level `PythonRDD`.

In [5]:
plan2table(text_file.toDebugString())


| Execution plan   | RDD |
| :---------------------------------------------------------------- | :------------: |
|`(2)_../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at Native`| rdd |
|`_/__../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodA`| rdd |


In [6]:
plan2table(words.toDebugString())


| Execution plan   | RDD |
| :---------------------------------------------------------------- | :------------: |
|`(2)_PythonRDD[6] at RDD at PythonRDD.scala:48 []`| rdd |
|`_/__../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at Native`| rdd |
|`_/__../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodA`| rdd |


In [7]:
plan2table(not_empty.toDebugString())


| Execution plan   | RDD |
| :---------------------------------------------------------------- | :------------: |
|`(2)_PythonRDD[7] at RDD at PythonRDD.scala:48 []`| rdd |
|`_/__../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at Native`| rdd |
|`_/__../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodA`| rdd |


In [8]:
plan2table(key_values.toDebugString())


| Execution plan   | RDD |
| :---------------------------------------------------------------- | :------------: |
|`(2)_PythonRDD[8] at RDD at PythonRDD.scala:48 []`| rdd |
|`_/__../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at Native`| rdd |
|`_/__../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodA`| rdd |


In [9]:
plan2table(counts.toDebugString())


| Execution plan   | RDD |
| :---------------------------------------------------------------- | :------------: |
|`(2)_PythonRDD[9] at RDD at PythonRDD.scala:48 []`| rdd |
|`_/__MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:436 []`| rdd |
|`_/__ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:0 [`| rdd |
|`_+-(2)_PairwiseRDD[3] at reduceByKey at <timed exec>:4 []`| rdd |
|`____/__PythonRDD[2] at reduceByKey at <timed exec>:4 []`| rdd |
|`____/__../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at Nat`| rdd |
|`____/__../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMeth`| rdd |


| Execution plan   | RDD |
| :---------------------------------------------------------------- | :------------: |
|`(2)_PythonRDD[6] at RDD at PythonRDD.scala:48 []`| **counts** |
|`_/__MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:436 []`| **---"---** |
|`_/__ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:0 [`| **---"---** |
|`_+-(2)_PairwiseRDD[3] at reduceByKey at <timed exec>:4 []`| **---"---** |
|`____/__PythonRDD[2] at reduceByKey at <timed exec>:4 []`| **words, not_empty, key_values** |
|`____/__../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at Nat`| **text_file** |
|`____/__../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMeth`| **---"---** |

### Count!
Finally, we count the number of times each word has occurred.
Only now does the lazy execution model perform actual work, which takes a significant amount of time.

In [10]:
%%time
## Run #1
Count=counts.count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurrences per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurrences per word=6.37
CPU times: user 12 ms, sys: 5.69 ms, total: 17.6 ms
Wall time: 1.27 s


In [11]:
%%time
## Run #2
Count=counts.count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurrences per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurrences per word=6.37
CPU times: user 10.8 ms, sys: 4.46 ms, total: 15.2 ms
Wall time: 133 ms


In [12]:
%%time
## Run #3, cache
Count=counts.cache().count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurrences per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurrences per word=6.37
CPU times: user 8.47 ms, sys: 4.21 ms, total: 12.7 ms
Wall time: 152 ms


In [13]:
%%time
#Run #4
Count=counts.count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurrences per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurrences per word=6.37
CPU times: user 9.82 ms, sys: 4.37 ms, total: 14.2 ms
Wall time: 83.8 ms


In [14]:
%%time
#Run #5
Count=counts.count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurrences per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurrences per word=6.37
CPU times: user 12 ms, sys: 5.11 ms, total: 17.1 ms
Wall time: 90.7 ms


## Finding the most common words
* `counts`: RDD with 33781 pairs of the form `(word,count)`. 
* Find the 5 most frequent words. 
* **Method1:** `collect` and sort on head node.
* **Method2:** Pure Spark, `collect` only at the end.

### Method1: `collect` and sort on head node 
#### Collect the RDD into the driver node
* Collect can take significant time.

In [15]:
%%time
C=counts.collect()

CPU times: user 55.2 ms, sys: 12.6 ms, total: 67.9 ms
Wall time: 107 ms


#### Sort 
* The RDD has been collected into a list on the driver node.
* We are no longer using Spark parallelism.
* The sort is done in plain Python.
* This will not scale to very large documents.
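
When only the few most common words are needed, the driver does not have to sort the whole list. The sketch below compares a full sort with `heapq.nlargest` from the standard library on a handful of `(word, count)` pairs taken from this notebook's output. It is an optional alternative, not the method used in the cell that follows.

```python
import heapq

# A few (word, count) pairs as they would appear in the collected list.
pairs = [("to", 4510), ("a", 4533), ("EBook", 1),
         ("and", 5951), ("of", 6587), ("the", 13766)]

# Full sort: orders every element, then keeps the first three.
by_sort = sorted(pairs, key=lambda p: p[1], reverse=True)[:3]

# Partial selection: keeps track of only the three largest while scanning.
by_heap = heapq.nlargest(3, pairs, key=lambda p: p[1])

print(by_heap)
```

Both approaches still require the full list to be on the driver, so neither removes the scaling limit noted above.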

In [16]:
C.sort(key=lambda x:x[1])
print('most common words\n','\n'.join(['%s:\t%d'%c for c in C[-5:]]))
print('\nLeast common words\n','\n'.join(['%s:\t%d'%c for c in C[:5]]))

most common words
 to:	4510
a:	4533
and:	5951
of:	6587
the:	13766

Least common words
 EBook:	1
Author::	1
Last:	1
January:	1
Posting:	1


### Compute the mean number of occurrences per word.

In [17]:
Count2=len(C)
Sum2=sum([i for w,i in C])
print('count2=%f, sum2=%f, mean2=%f'%(Count2,Sum2,float(Sum2)/Count2))


count2=33781.000000, sum2=215133.000000, mean2=6.368462


### **Method2:** Pure Spark, `collect` only at the end.
* Collect into the head node only the most frequent words.
* Requires multiple **stages**

#### Step 1 split, clean and map to `(word,1)`

In [18]:
%%time
word_pairs=text_file.flatMap(lambda x: x.split(' '))\
    .filter(lambda x: x!='')\
    .map(lambda word: (word,1))

CPU times: user 43 µs, sys: 1 µs, total: 44 µs
Wall time: 71.8 µs


#### **Step 2** Count occurrences of each word.

In [19]:
%%time
counts=word_pairs.reduceByKey(lambda x,y:x+y)

CPU times: user 8.53 ms, sys: 2.59 ms, total: 11.1 ms
Wall time: 51.5 ms


#### **Step 3** Reverse `(word,count)` to `(count,word)` and sort by key
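
The reason for the swap is that sorting by key orders pairs by their first element, so the count has to come first. A plain-Python sketch of the same idea on a toy list (the variable names here are illustrative and are not the RDDs defined in this notebook):

```python
toy_counts = [("of", 6587), ("the", 13766), ("a", 4533), ("and", 5951), ("to", 4510)]

# Swap each (word, count) pair so that the count becomes the key.
reversed_pairs = [(count, word) for word, count in toy_counts]

# Sort by key in descending order, analogous to sortByKey(False).
sorted_pairs = sorted(reversed_pairs, key=lambda kv: kv[0], reverse=True)

top3 = sorted_pairs[:3]
print(top3)
```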

In [20]:
%%time
reverse_counts=counts.map(lambda x:(x[1],x[0]))   # reverse order of word and count
sorted_counts=reverse_counts.sortByKey(False)

CPU times: user 21 ms, sys: 7.31 ms, total: 28.4 ms
Wall time: 490 ms


#### Full execution plan

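We now have a complete plan to compute the most common words in the text, but the final result has not been computed yet. One caveat: unlike the other transformations, `sortByKey` is not entirely lazy. With more than one partition it samples the data up front to choose partition boundaries, which likely explains the 490 ms wall time of the previous cell.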

For more on execution plans and lineage, see [Jacek Laskowski's online book](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-lineage.html#toDebugString).

In [21]:
print('word_pairs:')
plan2table(word_pairs.toDebugString())
print('\ncounts:')
plan2table(counts.toDebugString())
print('\nreverse_counts:')
plan2table(reverse_counts.toDebugString())
print('\nsorted_counts:')
plan2table(sorted_counts.toDebugString())

word_pairs:

| Execution plan   | RDD |
| :---------------------------------------------------------------- | :------------: |
|`(2)_PythonRDD[30] at RDD at PythonRDD.scala:48 []`| rdd |
|`_/__../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at Native`| rdd |
|`_/__../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodA`| rdd |

counts:

| Execution plan   | RDD |
| :---------------------------------------------------------------- | :------------: |
|`(2)_PythonRDD[31] at RDD at PythonRDD.scala:48 []`| rdd |
|`_/__MapPartitionsRDD[23] at mapPartitions at PythonRDD.scala:436 []`| rdd |
|`_/__ShuffledRDD[22] at partitionBy at NativeMethodAccessorImpl.java:0 `| rdd |
|`_+-(2)_PairwiseRDD[21] at reduceByKey at <timed exec>:1 []`| rdd |
|`____/__PythonRDD[20] at reduceByKey at <timed exec>:1 []`| rdd |
|`____/__../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at Nat`| rdd |
|`____/__../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMeth`| rdd |

reverse_counts:

sorted_counts:

| Execution plan   | RDD |
| :---------------------------------------------------------------- | :------------: |
|`(2)_PythonRDD[20] at RDD at PythonRDD.scala:48 []`| **sorted_counts** |
|`_/__MapPartitionsRDD[19] at mapPartitions at PythonRDD.scala:436 []`| **---"---** |
|`_/__ShuffledRDD[18] at partitionBy at NativeMethodAccessorImpl.java:0 `| **---"---** |
|`_+-(2)_PairwiseRDD[17] at sortByKey at <timed exec>:2 []`| **---"---** |
|`____/__PythonRDD[16] at sortByKey at <timed exec>:2 []`| **counts, reverse_counts** |
|`____/__MapPartitionsRDD[13] at mapPartitions at PythonRDD.scala:436 []`| **---"---** |
|`____/__ShuffledRDD[12] at partitionBy at NativeMethodAccessorImpl.java`| **---"---** |
|`____+-(2)_PairwiseRDD[11] at reduceByKey at <timed exec>:1 []`| rdd |
|`_______/__PythonRDD[10] at reduceByKey at <timed exec>:1 []`| **word_pairs** |
|`_______/__../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at `| **---"---** |
|`_______/__../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeM`| **---"---** |

#### **Step 4** Take the top 5 words. **Only now is the full plan executed!**

In [22]:
%%time
C=sorted_counts.take(5)   # sorted_counts is the RDD defined in Step 3
print('most common words\n','\n'.join(['%d:\t%s'%c for c in C]))

### Summary
This was our first pyspark program, hurray!

What did we learn?

1) The core data structure of Spark is an RDD. 

2) An RDD is a distributed array. However, it appears to the programmer as an abstract object on which the programmer can perform `map` and `reduce` operations.

3) `map` operations transform an RDD into another RDD.

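4) `reduce` and `collect` operations bring results back to the driver as ordinary Python objects: `collect` returns a Python list, while `reduce` returns a single value.

5) Once an RDD has been reduced or collected, the result lives on a single computer, the head node, and any further processing of it is no longer parallel.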

### Next
Complete the HW notebook, submit it for evaluation, and continue to the next class, in which we will learn more about spark.