# Word Counting with Hadoop / Spark

The aim of this document is to show and compare several ways of counting words with MapReduce.


## Data

Following [this blog entry](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/), we use the following texts as data, keeping the file names unchanged.

- [The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson](http://www.gutenberg.org/etext/20417)
- [The Notebooks of Leonardo Da Vinci](http://www.gutenberg.org/etext/5000)
- [Ulysses by James Joyce](http://www.gutenberg.org/ebooks/4300)

First of all, we put these text files on HDFS. We assume that *the current directory contains only the three text files*. Then we can upload them with the following two commands:

    $ hadoop fs -mkdir word_count
    $ hadoop fs -put * word_count

We can check with the following command that the files are in the directory `word_count`.

    $ hadoop fs -ls -R word_count
    -rw-r--r--   1 crescent supergroup     674570 2017-02-26 21:23 word_count/20417.txt.utf-8
    -rw-r--r--   1 crescent supergroup    1580927 2017-02-26 21:23 word_count/4300-0.txt
    -rw-r--r--   1 crescent supergroup    1428841 2017-02-26 21:23 word_count/5000-8.txt

### Remarks

- For the terminology and the idea behind MapReduce, see [this tutorial](https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm).
- To illustrate the structure of the code, each implementation is kept as small as possible.

## Hadoop Streaming

Hadoop Streaming allows us to use MapReduce in the most intuitive way, in any programming language. We only need two executable scripts/binaries: `mapper` and `reducer`. These correspond to "Map" and "Reduce" in MapReduce, respectively. Both read lines from stdin and write tab-separated key-value pairs to stdout.

- Unfortunately, a script written in [Python 3 does not work](http://serverfault.com/q/807839).
- The following code is therefore written in Perl.

### mapper

In [1]:
with open('mapper.pl','r') as fo:
    for line in fo:
        print(line.rstrip())

#!/usr/bin/perl

use strict;

while (my $line = <STDIN>) {
  chomp($line);
  # split ' ' also skips leading whitespace, so no empty word is emitted
  my @words = split ' ', $line;
  foreach my $word (@words) {
    print $word . "\t1\n";
  }
}

exit;


### reducer

In [2]:
with open('reducer.pl','r') as fo:
    for line in fo:
        print(line.rstrip())

#!/usr/bin/perl

use strict;

my $current_word = undef;
my $current_count = 0;

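while (my $line = <STDIN>) {
  my ($word, $count) = split /\s+/, $line;

  # A new key starts: emit the finished word first (nothing to emit before the first key).
  if (!defined $current_word || $word ne $current_word) {
    emit_pair($current_word,$current_count) if defined $current_word;
    $current_word = $word;
    $current_count = 0;
  }

  $current_count += $count;
}

# Emit the last word, unless the input was empty.
emit_pair($current_word,$current_count) if defined $current_word;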

exit;

sub emit_pair{
  print "$_[0]\t$_[1]\n";
}


---

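We can check with the following command whether the scripts work. Here the text files are assumed to be in a local directory `data`, and `sort` plays the role of the shuffle phase between the mapper and the reducer.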

    $ cat data/* | ./mapper.pl | sort -k1,1 | ./reducer.pl
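
The pipeline above (map, sort by key, reduce) can also be sketched in a few lines of plain Python. This is only an illustration of the logic, not a translation of the Perl scripts; the function names `map_words`, `reduce_counts` and `word_count` and the sample input are made up for this sketch.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of consecutive pairs that share a key."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    # sorted() stands in for `sort -k1,1`, i.e. for the shuffle phase.
    return list(reduce_counts(sorted(map_words(lines), key=itemgetter(0))))


if __name__ == "__main__":
    sample = ["to be or not", "to be"]  # made-up sample input
    for word, count in word_count(sample):
        print(f"{word}\t{count}")
```

As in the reducer script, the reduce step relies on the pairs being sorted by key, so that all pairs of one word arrive consecutively.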

The following command starts the MapReduce procedure.

    $ hadoop jar /opt/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    > -mapper mapper.pl -reducer reducer.pl \
    > -input word_count -output word_count_output \
    > -file mapper.pl -file reducer.pl

The output directory `word_count_output` must not exist. If it exists, we must delete it (for example with `hadoop fs -rm -r word_count_output`) or use a different directory.

## mrjob

`mrjob` is a [Python library for MapReduce](https://pythonhosted.org/mrjob/).

- We can write a multi-step job in a single script.
- We can run the script locally for testing, without MapReduce.
- We do not have to put the input files on HDFS in advance, and the result is written to stdout.

In [3]:
with open('with_mrjob.py','r') as fo:
    for line in fo:
        print(line.rstrip())

#!/usr/bin/python3

from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        for word in line.rstrip().split():
            yield (word, 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    WordCount.run()


---
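
In the script above the combiner and the reducer have the same body. The following plain-Python sketch (it does not use `mrjob`; the helper names `combine` and `reduce_partials` and the input splits are made up) illustrates why this is safe for word counting: summing partial sums gives the same totals as summing all the ones directly.

```python
from collections import Counter


def combine(lines):
    """Combiner: pre-aggregate the words of one input split."""
    partial = Counter()
    for line in lines:
        partial.update(line.split())
    return partial


def reduce_partials(partials):
    """Reducer: add up the partial counts coming from all splits."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two made-up input splits, standing in for the data seen by two mappers.
splits = [["a b a", "c a"], ["b b", "a"]]
combined = reduce_partials(combine(split) for split in splits)
direct = Counter(word for split in splits for line in split for word in line.split())
print(combined == direct)
```

A combiner is an optimization that reduces the amount of data sent to the reducers, so the result should not depend on whether or how often it runs.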

Executing the following command, we can check the code without MapReduce:

    $ ./with_mrjob.py data/* > count_without_mr

We can execute the code with MapReduce by the following command.

    $ ./with_mrjob.py -r hadoop hdfs:///user/crescent/word_count/* > count_with_mr

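With `-r hadoop` we can also specify local files instead of files on HDFS.

Note that the results might be different. In the comparison below, the local run counts the byte order mark `\ufeff` at the beginning of a file as part of a word, while the run with MapReduce does not.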

In [4]:
from subprocess import Popen, PIPE

diff = Popen(["diff","count_with_mr","count_without_mr"],stdout=PIPE).communicate()[0]
print(diff.decode('utf8'))

23973c23973
< "The"	3524
---
> "The"	3523
27938a27939,27940
> "\ufeff"	1
> "\ufeffThe"	1
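
The entries containing `\ufeff` in the diff above come from a byte order mark at the beginning of a file. The following sketch, with a made-up byte string instead of the real files, shows how the mark sticks to the first word when the bytes are decoded as plain UTF-8, and how Python's `utf-8-sig` codec removes it. Whether the input files should be cleaned this way is a choice this document does not make.

```python
import codecs

# Made-up file content: a UTF-8 byte order mark followed by text.
raw = codecs.BOM_UTF8 + "The quick fox\n".encode("utf-8")

# Decoding as plain UTF-8 keeps U+FEFF glued to the first word,
# because str.split() does not treat U+FEFF as whitespace.
with_bom = raw.decode("utf-8").split()

# The "utf-8-sig" codec drops a leading byte order mark if one is present.
without_bom = raw.decode("utf-8-sig").split()

print(with_bom)
print(without_bom)
```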



## Spark

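Spark can often execute a MapReduce-style procedure more efficiently than Hadoop MapReduce, mainly because it can keep intermediate data in memory instead of writing it to disk between steps. Moreover, the Spark API makes it simple and convenient to write such a task.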

In [5]:
from pyspark import SparkContext
sc = SparkContext() # this takes a while

In [6]:
text_files = sc.textFile('hdfs://localhost:9000/user/crescent/word_count')
#text_files = sc.textFile('data') ## for local files
text_files

hdfs://localhost:9000/user/crescent/word_count MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [7]:
text_files.take(10) ## first 10 elements

['The Project Gutenberg EBook of The Outline of Science, Vol. 1 (of 4), by ',
 'J. Arthur Thomson',
 '',
 'This eBook is for the use of anyone anywhere at no cost and with',
 'almost no restrictions whatsoever.  You may copy it, give it away or',
 're-use it under the terms of the Project Gutenberg License included',
 'with this eBook or online at www.gutenberg.org',
 '',
 '',
 'Title: The Outline of Science, Vol. 1 (of 4)']

In [8]:
## write a MR procedure
word_count = text_files.flatMap(lambda line: [(w,1) for w in line.rstrip().split()])\
                       .reduceByKey(lambda a,b: a+b)
word_count ## this does not start the MR procedure. (lazy evaluation)

PythonRDD[7] at RDD at PythonRDD.scala:48
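
To make the meaning of `flatMap` and `reduceByKey` concrete, here is a plain-Python sketch of the same computation. The helpers `flat_map` and `reduce_by_key` are hypothetical names written for this illustration; they are not Spark APIs, and unlike Spark they run eagerly on a single machine.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to every item and concatenate the resulting lists."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Merge the values of each key with the binary function func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["to be or not", "to be"]  # made-up sample input
pairs = flat_map(lambda line: [(w, 1) for w in line.split()], lines)
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)
```

The function passed to `reduce_by_key` only ever sees two values of the same key, which is why `lambda a,b: a+b` is enough to count words.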

In [9]:
word_count.take(10) ## starts the MapReduce procedure. It finishes within one minute.

[('(Multifarnham.', 1),
 ('_non', 1),
 ('Lucifer,', 1),
 ('divers', 17),
 ('Wanted,', 1),
 ('528:', 1),
 ('bob.', 6),
 ('(black', 3),
 ('vuole,', 1),
 ('Ladies’', 1)]