## Word Count
Counting the number of occurrences of each word in a text is one of the most popular first exercises when learning Map-Reduce programming. It is the equivalent of `Hello World!` in regular programming.

In [1]:
#start the SparkContext
import findspark
findspark.init()

from pyspark import SparkContext
sc=SparkContext(master="local[4]")

## Read text into an RDD

### Download data file from S3

In [2]:
%%time
import urllib.request  # the `request` submodule must be imported explicitly
data_dir='../../Data'
filename='Moby-Dick.txt'
f = urllib.request.urlretrieve ("https://mas-dse-open.s3.amazonaws.com/"+filename, data_dir+'/'+filename)

# First, check that the text file is where we expect it to be
!ls -l $data_dir/$filename

-rw-r--r--  1 yoavfreund  staff  1257260 Feb  6 19:20 ../../Data/Moby-Dick.txt
CPU times: user 37.2 ms, sys: 26 ms, total: 63.2 ms
Wall time: 1.54 s


### Define an RDD that will read the file
Note that, because execution is lazy, defining the RDD does not necessarily mean that the file's contents have actually been read.

In [3]:
%%time
text_file = sc.textFile(data_dir+'/'+filename)
type(text_file)

CPU times: user 3.63 ms, sys: 3.79 ms, total: 7.42 ms
Wall time: 786 ms
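
To make the idea of lazy execution concrete, here is a minimal plain-Python analogy using generators. It is not Spark code and does not show how Spark works internally; the `sample` lines, the `log` list and the `read_lines` helper are hypothetical names introduced only for this illustration.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
log = []  # records when work is actually performed

def read_lines(lines):
    """Generator standing in for a lazily read text file (hypothetical helper)."""
    for line in lines:
        log.append("read: " + line)
        yield line

# Hypothetical stand-in for the file contents.
sample = ["Call me Ishmael.", "Some years ago"]

# Building the pipeline does no work, much like defining an RDD.
lazy_upper = (line.upper() for line in read_lines(sample))
work_before = len(log)

# Consuming the pipeline forces the reads, much like a Spark action.
result = list(lazy_upper)
work_after = len(log)

print(work_before, work_after)
print(result)
```

Nothing is logged until `list(...)` consumes the generator, which is the same pattern we see with `sc.textFile`: the definition is cheap, and the work happens when a result is requested.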


## Counting the words
* Split each line by spaces.
* Map each `word` to `(word,1)`.
* Count the number of occurrences of each word.

In [4]:
%%time
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .filter(lambda x: x!='')\
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
type(counts)

CPU times: user 11.3 ms, sys: 3.89 ms, total: 15.2 ms
Wall time: 123 ms
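
The following plain-Python sketch mimics, on a single machine, what each step of the pipeline does. It is only an illustration of the semantics of `flatMap`, `filter`, `map` and `reduceByKey`, not of how Spark executes them; the `lines` list is a hypothetical stand-in for the text file.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample lines standing in for the text file.
lines = ["the whale the sea", "", "the  ship"]

# flatMap: split every line and flatten the pieces into one sequence.
tokens = chain.from_iterable(line.split(" ") for line in lines)

# filter: drop the empty strings produced by blank lines and repeated spaces.
words = (w for w in tokens if w != "")

# map: pair each word with the count 1.
pairs = ((w, 1) for w in words)

# reduceByKey: combine the values of equal keys with addition.
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n

word_counts = dict(totals)
print(word_counts)
```

The sample also shows why the `filter` step is needed: splitting on a single space turns a blank line or a double space into empty strings, which would otherwise be counted as a word.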


### Have a look at the execution plan
Note that the earliest node in the dependency graph is the file `../../Data/Moby-Dick.txt`. 

In [5]:
counts.toDebugString().splitlines()

[b'(2) PythonRDD[6] at RDD at PythonRDD.scala:48 []',
 b' |  MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:436 []',
 b' |  ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:0 []',
 b' +-(2) PairwiseRDD[3] at reduceByKey at <timed exec>:1 []',
 b'    |  PythonRDD[2] at reduceByKey at <timed exec>:1 []',
 b'    |  ../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []',
 b'    |  ../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []']

### Count!
Finally we count the number of times each word has occurred.
Only now does the lazy execution model perform actual work, which takes a significant amount of time.

In [6]:
%%time
Count=counts.count()
Sum=counts.map(lambda x:x[1]).reduce(lambda x,y:x+y)
print('Different words=%5.0f, total words=%6.0f, mean no. occurrences per word=%4.2f'%(Count,Sum,float(Sum)/Count))

Different words=33781, total words=215133, mean no. occurrences per word=6.37
CPU times: user 10.3 ms, sys: 4.27 ms, total: 14.6 ms
Wall time: 1.31 s
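
As a rough single-machine picture of what the cell above computes, the sketch below applies the same count, map and reduce logic to an ordinary Python list. The `pairs` list is a hypothetical stand-in for the `counts` RDD, and `functools.reduce` only imitates the role of the RDD's `reduce` action.

```python
from functools import reduce

# Hypothetical (word, count) pairs standing in for the `counts` RDD.
pairs = [("the", 3), ("whale", 1), ("sea", 1), ("ship", 1)]

# Number of distinct words: plays the role of counts.count().
distinct = len(pairs)

# Keep only the counts (the .map step), then add them pairwise (the .reduce step).
total = reduce(lambda x, y: x + y, map(lambda pair: pair[1], pairs))

mean = total / distinct
print(distinct, total, mean)
```

Note that the reduction yields a single number rather than a collection, which is why the result can be used directly in ordinary Python arithmetic.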


## Finding the most common words
* `counts`: RDD with 33781 pairs of the form `(word,count)`. 
* Find the 5 most frequent words. 
* **Method1:** `collect` and sort on head node.
* **Method2:** Pure Spark, `collect` only at the end.

### Method1: `collect` and sort on head node 
#### Collect the RDD into the driver node
* Collect can take significant time.

In [7]:
%%time
C=counts.collect()

CPU times: user 12 ms, sys: 5.58 ms, total: 17.6 ms
Wall time: 78.2 ms


#### Sort 
* The RDD has been collected into a list in the driver node.
* We are no longer using Spark parallelism.
* The sort is done in Python.
* This will not scale to very large documents.

In [8]:
C.sort(key=lambda x:x[1])
print('most common words\n','\n'.join(['%s:\t%d'%c for c in C[-5:]]))
print('\nLeast common words\n','\n'.join(['%s:\t%d'%c for c in C[:5]]))

most common words
 to:	4510
a:	4533
and:	5951
of:	6587
the:	13766

Least common words
 EBook:	1
Author::	1
Last:	1
January:	1
Posting:	1
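
If only the few most common words are needed, sorting the whole collected list is more work than necessary. One possible alternative, sketched below under the assumption that the pairs are already in the driver as a Python list, is `heapq.nlargest` from the standard library. The `collected` list is a hypothetical stand-in for `C`, filled with a few of the counts printed above.

```python
import heapq

# Hypothetical stand-in for the collected list `C` of (word, count) pairs.
collected = [
    ("the", 13766), ("EBook", 1), ("of", 6587), ("and", 5951),
    ("a", 4533), ("to", 4510), ("Last", 1),
]

# Select the 3 pairs with the largest counts without sorting the whole list.
top3 = heapq.nlargest(3, collected, key=lambda pair: pair[1])
print(top3)
```

This still runs entirely on the driver, so it does not remove the scaling limitation of Method 1; it only avoids the full sort.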


### Compute the mean number of occurrences per word.

In [9]:
Count2=len(C)
Sum2=sum([i for w,i in C])
print('count2=%f, sum2=%f, mean2=%f'%(Count2,Sum2,float(Sum2)/Count2))


count2=33781.000000, sum2=215133.000000, mean2=6.368462


### **Method2:** Pure Spark, `collect` only at the end.
* Collect only the most frequent words into the head node.
* Requires multiple **stages**

#### Step 1 split, clean and map to `(word,1)`

In [10]:
%%time
RDD=text_file.flatMap(lambda x: x.split(' '))\
    .filter(lambda x: x!='')\
    .map(lambda word: (word,1))

CPU times: user 28 µs, sys: 0 ns, total: 28 µs
Wall time: 33.1 µs


#### **Step 2** Count occurrences of each word.

In [11]:
%%time
RDD1=RDD.reduceByKey(lambda x,y:x+y)

CPU times: user 5.97 ms, sys: 2.71 ms, total: 8.68 ms
Wall time: 20.5 ms


#### **Step 3** Reverse `(word,count)` to `(count,word)` and sort by key

In [12]:
%%time
RDD2=RDD1.map(lambda x:(x[1],x[0]))   # reverse order of word and count
RDD3=RDD2.sortByKey(False)

CPU times: user 14.9 ms, sys: 4.86 ms, total: 19.7 ms
Wall time: 313 ms


#### Full execution plan

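We now have a complete plan to compute the most common words in the text. No result has been computed or returned to the driver yet! (Strictly speaking, PySpark's `sortByKey` is not fully lazy: it samples the keys to choose partition boundaries, which likely explains why Step 3 took longer than Steps 1 and 2.)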

For more on execution plans and lineage, see [Jacek Laskowski's blog](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-lineage.html#toDebugString).

In [13]:
print('RDD3:')
RDD3.toDebugString().splitlines()

RDD3:


[b'(2) PythonRDD[19] at RDD at PythonRDD.scala:48 []',
 b' |  MapPartitionsRDD[18] at mapPartitions at PythonRDD.scala:436 []',
 b' |  ShuffledRDD[17] at partitionBy at NativeMethodAccessorImpl.java:0 []',
 b' +-(2) PairwiseRDD[16] at sortByKey at <timed exec>:2 []',
 b'    |  PythonRDD[15] at sortByKey at <timed exec>:2 []',
 b'    |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:436 []',
 b'    |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []',
 b'    +-(2) PairwiseRDD[10] at reduceByKey at <timed exec>:1 []',
 b'       |  PythonRDD[9] at reduceByKey at <timed exec>:1 []',
 b'       |  ../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []',
 b'       |  ../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []']
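
One way to read a plan like this is to look for the `ShuffledRDD` lines: each marks a shuffle, and shuffles are where Spark splits a job into stages. The sketch below is plain Python and assumes the plan lines are pasted in as bytes, exactly as they were printed above; the variable names are hypothetical.

```python
# Lines copied from the RDD3 plan shown above (bytes, as PySpark returned them here).
plan = [
    b'(2) PythonRDD[19] at RDD at PythonRDD.scala:48 []',
    b' |  MapPartitionsRDD[18] at mapPartitions at PythonRDD.scala:436 []',
    b' |  ShuffledRDD[17] at partitionBy at NativeMethodAccessorImpl.java:0 []',
    b' +-(2) PairwiseRDD[16] at sortByKey at <timed exec>:2 []',
    b'    |  PythonRDD[15] at sortByKey at <timed exec>:2 []',
    b'    |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:436 []',
    b'    |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []',
    b'    +-(2) PairwiseRDD[10] at reduceByKey at <timed exec>:1 []',
    b'       |  PythonRDD[9] at reduceByKey at <timed exec>:1 []',
    b'       |  ../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []',
    b'       |  ../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []',
]

# Decode the bytes so the plan can be handled as ordinary text.
decoded = [line.decode("utf-8") for line in plan]

# Each ShuffledRDD line corresponds to one shuffle in the lineage.
shuffles = [line.strip() for line in decoded if "ShuffledRDD" in line]
num_shuffles = len(shuffles)
print(num_shuffles)
```

Here one shuffle comes from `reduceByKey` and one from `sortByKey`, which is consistent with the earlier remark that Method 2 requires multiple stages.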

#### **Step 4** Take the top 5 words. **Only now does Spark execute the full plan!**

In [14]:
%%time
C=RDD3.take(5)
print('most common words\n','\n'.join(['%d:\t%s'%c for c in C]))

most common words
 13766:	the
6587:	of
5951:	and
4533:	a
4510:	to
CPU times: user 9.05 ms, sys: 3.19 ms, total: 12.2 ms
Wall time: 134 ms
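
A full distributed sort is not the only way to obtain the top few words. The plain-Python sketch below illustrates one alternative idea: each partition keeps only its own best candidates, and the short candidate lists are then merged. The `partitions` data and the `top_k` helper are hypothetical. PySpark's RDD API offers the built-in actions `top` and `takeOrdered` for this kind of task; the sketch only imitates the idea and should not be read as their implementation.

```python
import heapq

def top_k(pairs, k):
    """Return the k (word, count) pairs with the largest counts (hypothetical helper)."""
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])

# Hypothetical split of (word, count) pairs across two partitions.
partitions = [
    [("the", 13766), ("a", 4533), ("Last", 1)],
    [("of", 6587), ("and", 5951), ("to", 4510), ("EBook", 1)],
]

k = 2
# Each partition keeps only its own k best candidates (work that can run in parallel).
local_best = [top_k(part, k) for part in partitions]

# Merging the short candidate lists yields the global top k.
global_best = top_k([pair for best in local_best for pair in best], k)
print(global_best)
```

The appeal of this design is that only `k` pairs per partition have to travel to the merging step, instead of shuffling and sorting every pair.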


### Summary
This was our first pyspark program, hurray!

What did we learn?

1) The core data structure of Spark is an RDD. 

2) An RDD is a distributed array. However, it appears to the programmer as an abstract object on which the programmer can perform `map` and `reduce` operations.

3) `map` operations transform an RDD into another RDD.

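4) `reduce` and `collect` operations turn an RDD back into an ordinary Python object on the head node: `collect` returns a list and `reduce` returns a single value.

5) Once an RDD has been reduced or collected, your program is running on a single computer - the head node.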

### Next
Complete the HW notebook, submit it for evaluation, and continue to the next class, in which we will learn more about Spark.