## Initialize Spark
Here we initialize Spark in local mode with 3 worker threads (`master="local[3]"`), so that it uses 3 of the 4 cores on my laptop.

In [2]:
from pyspark import SparkContext
sc = SparkContext(master="local[3]")

## Word Count
Counting the number of occurrences of words in a text is one of the most popular first exercises when learning Map-Reduce programming. It is the equivalent of `Hello World!` in regular programming.

We will do it in two ways: a simpler way, where sorting is done after the RDD is collected, and a more sparky way, where the sorting is also done using an RDD.

In [3]:
# First, check that the text file is where we expect it to be
filename='../../Distributed\ Computation/notebooks/Moby-Dick.txt'
%ls -l $filename

-rw-r--r--  1 yoavfreund  503  1237539 Jan  4 13:20 ../../Distributed Computation/notebooks/Moby-Dick.txt


### Read the text file into an RDD
Note that, because execution is lazy, this does not necessarily mean that the file content has actually been read.

In [4]:
%%time
text_file = sc.textFile(filename)
type(text_file)

CPU times: user 1.4 ms, sys: 1.59 ms, total: 2.99 ms
Wall time: 537 ms
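
To make the idea of lazy execution concrete, here is a minimal plain-Python sketch that uses a generator as a stand-in for an RDD. It is only an analogy and not how Spark is implemented; the names `read_lines` and `log`, and the sample lines, are hypothetical.

```python
# Plain-Python analogy for lazy execution (not Spark code).
log = []  # records when a line is actually read

def read_lines(lines):
    """Yield lines one at a time, logging each read."""
    for line in lines:
        log.append("read: " + line)
        yield line

# Building the pipeline does no work: nothing has been read yet.
lazy_lines = read_lines(["first line", "second line"])
upper = (line.upper() for line in lazy_lines)
reads_before = len(log)

# Consuming the pipeline (the analogue of a Spark action) triggers the reads.
result = list(upper)
reads_after = len(log)

print(reads_before, reads_after)
```

The log stays empty until `list(upper)` forces the generator to run, much as the file is not necessarily read until an action is called on the RDD.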


### Count the words
Next, we count the number of times each word occurs in the text. Again, this only sets up the execution plan; no counting happens yet.

In [5]:
%%time
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
print type(counts)

<class 'pyspark.rdd.PipelinedRDD'>
CPU times: user 9.29 ms, sys: 3.58 ms, total: 12.9 ms
Wall time: 178 ms
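
As a rough illustration of what the three steps compute, the following plain-Python sketch mirrors `flatMap`, `map` and `reduceByKey` on a tiny made-up input. It runs eagerly on one machine, so it shows only the logic, not Spark's distributed or lazy behavior; the variable names are hypothetical.

```python
from collections import defaultdict

lines = ["the whale and the sea", "the ship"]  # made-up sample input

# flatMap: split every line and flatten the results into one list of words
words = [word for line in lines for word in line.split(" ")]

# map: pair every word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of equal keys with a + b
totals = defaultdict(int)
for word, n in pairs:
    totals[word] = totals[word] + n

word_counts = dict(totals)
print(word_counts)
```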


### Have a look at the execution plan
Note that the earliest node in the dependency graph is the text file `Moby-Dick.txt`. It is possible that even the first element in that file has not yet been read!

In [6]:
print counts.toDebugString()

(2) PythonRDD[6] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(2) PairwiseRDD[3] at reduceByKey at <timed exec>:1 []
    |  PythonRDD[2] at reduceByKey at <timed exec>:1 []
    |  ../../Distributed\ Computation/notebooks/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
    |  ../../Distributed\ Computation/notebooks/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []


### Count!
Finally, we compute the number of distinct words (`Count`), the total number of word occurrences (`Sum`), and their ratio.
Note that in this cell the lazy execution model finally performs some actual work.

In [7]:
%%time
Count=counts.count()
Sum=counts.map(lambda (w,i): i).reduce(lambda x,y:x+y)
print 'count=%f, sum=%f, mean=%f'%(Count,Sum,float(Sum)/Count)

count=33302.000000, sum=216169.000000, mean=6.491172
CPU times: user 11.1 ms, sys: 4.02 ms, total: 15.1 ms
Wall time: 1.37 s


### Collect the `counts` RDD into the driver node
This also requires actual work, since `collect()` is an action.

In [8]:
%%time
C=counts.collect()
type(C)

CPU times: user 21.6 ms, sys: 6.31 ms, total: 27.9 ms
Wall time: 119 ms


### Sort 
Now that we have collected the `counts` RDD into the driver node as the Python list `C`, we no longer rely on Spark. The following two cells are plain Python commands.

In [9]:
C.sort(key=lambda x:x[1])
print 'most common words',C[-10:]
print 'Least common words',C[:10]

most common words [(u'I', 1724), (u'his', 2415), (u'that', 2680), (u'in', 3824), (u'', 4056), (u'to', 4440), (u'a', 4477), (u'and', 5883), (u'of', 6481), (u'the', 13609)]
Least common words [(u'funereal', 1), (u'unscientific', 1), (u'lime-stone,', 1), (u'shouted,', 1), (u'pitch-pot,', 1), (u'cod-liver', 1), (u'prices', 1), (u'prefix', 1), (u'slew.', 1), (u'too?"', 1)]


### Compute the mean number of occurrences per word

In [10]:
Count2=len(C)
Sum2=sum([i for w,i in C])
print 'count2=%f, sum2=%f, mean2=%f'%(Count2,Sum2,float(Sum2)/Count2)


count2=33302.000000, sum2=216169.000000, mean2=6.491172


## Word Count in Pure Spark
We now show how to perform word count, including sorting, using RDDs, returning to the driver node just the top 10 words.

In [11]:
%%time
RDD=text_file.flatMap(lambda x: x.split(' '))\
    .filter(lambda x: x!='')\
    .map(lambda word: (word,1))

CPU times: user 35 µs, sys: 13 µs, total: 48 µs
Wall time: 45.1 µs
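
The `filter` step in the cell above is there because `split(' ')` produces empty strings for consecutive spaces, trailing spaces and empty lines; this is why the first version counted `u''` as a word 4056 times. A small plain-Python sketch of that behavior, using a made-up input line:

```python
line = "a  b "  # two spaces between the words, one trailing space

tokens_single_space = line.split(" ")  # splits on every single space
tokens_filtered = [t for t in tokens_single_space if t != ""]
tokens_whitespace = line.split()       # splits on runs of whitespace

empty_line_tokens = "".split(" ")      # an empty line yields one empty token

print(tokens_single_space)
print(tokens_filtered)
print(tokens_whitespace)
print(empty_line_tokens)
```

Calling `split()` with no argument would avoid the empty tokens without a separate filter, though it also splits on tabs and other whitespace, so the resulting counts may differ slightly from the ones shown here.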


In [12]:
%%time
RDD1=RDD.reduceByKey(lambda x,y:x+y)

CPU times: user 7.8 ms, sys: 2.43 ms, total: 10.2 ms
Wall time: 16.5 ms


In [13]:
%%time
RDD2=RDD1.map(lambda (c,v):(v,c))

CPU times: user 20 µs, sys: 5 µs, total: 25 µs
Wall time: 27.9 µs


In [14]:
%%time
RDD3=RDD2.sortByKey(False)

CPU times: user 19.2 ms, sys: 5.14 ms, total: 24.4 ms
Wall time: 510 ms


In [15]:
print RDD3.toDebugString()

(2) PythonRDD[19] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[18] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[17] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(2) PairwiseRDD[16] at sortByKey at <timed exec>:1 []
    |  PythonRDD[15] at sortByKey at <timed exec>:1 []
    |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:374 []
    |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:-2 []
    +-(2) PairwiseRDD[10] at reduceByKey at <timed exec>:1 []
       |  PythonRDD[9] at reduceByKey at <timed exec>:1 []
       |  ../../Distributed\ Computation/notebooks/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
       |  ../../Distributed\ Computation/notebooks/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []


In [16]:
%%time
RDD3.take(10)

CPU times: user 5.53 ms, sys: 4.37 ms, total: 9.9 ms
Wall time: 168 ms


[(13609, u'the'),
 (6481, u'of'),
 (5883, u'and'),
 (4477, u'a'),
 (4440, u'to'),
 (3824, u'in'),
 (2680, u'that'),
 (2415, u'his'),
 (1724, u'I'),
 (1646, u'with')]
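
The last three steps (swapping each pair, `sortByKey(False)` and `take(10)`) can be mirrored in plain Python as follows. This is only a sketch of the logic on made-up data with hypothetical variable names; unlike the RDD version, it sorts everything in the memory of a single process.

```python
word_counts = [("the", 3), ("whale", 1), ("of", 2)]  # made-up (word, count) pairs

# map: swap each pair so that the count becomes the key
swapped = [(count, word) for (word, count) in word_counts]

# sortByKey(False): sort by key in descending order
ordered = sorted(swapped, key=lambda pair: pair[0], reverse=True)

# take(2): keep only the first two elements
top2 = ordered[:2]
print(top2)
```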

In [17]:
sc.stop()