<a href="https://colab.research.google.com/github/smduarte/spbd-2223/blob/main/lab2/SPBD_Labs_mapreduce2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MrJob MapReduce Python Example

Word count implemented in pure Python, using the MrJob library.

[MrJob](https://mrjob.readthedocs.io/en/latest/) can be used to write MapReduce jobs and run them on several platforms.

Some key advantages:
+ Write **multi-step** MapReduce jobs in pure Python;
+ Test on your **local machine**;
+ Deploy jobs on several cloud platforms from different vendors.

In [1]:
#@title Download the dataset and install MrJob
!wget -q -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0
!pip install mrjob --quiet
!wget -q -O /etc/mrjob.conf https://raw.githubusercontent.com/smduarte/spbd-2223/main/lab2/mrjob.conf

### MrJob WordCount Example
Read the words from input and count them.

The processing is split into two main phases:

+ The mapper emits, for each input line, the number of words in that line, always under the same key (`"words"`);
+ The reducer sums all the values produced by the mapper stage, yielding the total word count.

Using MrJob, a MapReduce job can be expressed in a single Python class, with one method for each phase. The reducer is called separately for each key, with an iterator over the values to be reduced.
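To make the data flow concrete, the two phases can be imitated in plain Python, without MrJob. This is only a minimal sketch: `run_job` is a hypothetical helper that stands in for the framework's shuffle step (grouping values by key), and the sample lines are made up for illustration.

```python
import string
from collections import defaultdict

# Translation table that deletes punctuation, including the « » quotes
PUNCTUATION = str.maketrans('', '', string.punctuation + '«»')

def mapper(_, line):
    # emit the number of words in the line under a single key
    line = line.strip().translate(PUNCTUATION)
    yield "words", len(line.split())

def reducer(key, values):
    # called once per key, with all the values emitted for that key
    yield key, sum(values)

def run_job(lines):
    # "shuffle": group every value emitted by the mapper under its key
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            groups[key].append(value)
    # reduce each key independently
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

# Made-up sample input, not taken from the dataset
sample = ["Hello, world!", "«MapReduce» in pure Python."]
print(run_job(sample))  # {'words': 6}
```

Because every line is emitted under the same key, the reducer runs exactly once and receives one count per input line.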

In [2]:
%%file wordcount.py

import string
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # remove leading and trailing whitespace
        line = line.strip()
        # remove punctuation characters
        line = line.translate(str.maketrans('', '', string.punctuation+'«»'))
        # emit the number of words in the line under a single key
        yield "words", len(line.split())

    def reducer(self, key, values):
        # sum the per-line counts emitted for this key
        yield key, sum(values)

if __name__ == '__main__':
    MRWordCount.run()

Writing wordcount.py


## Execution of MrJob programs

### Local Execution

In [3]:
import wordcount

# prepare the mapreduce job for local execution
mr_job = wordcount.MRWordCount(args=['-r', 'local', 'os_maias.txt'])

# execute the job and print the output results
with mr_job.make_runner() as runner:
    runner.run()
    for key, value in mr_job.parse_output(runner.cat_output()):
        print(key, value)

words 213359


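## Supplying a combiner

A combiner runs on the mapper's output before the shuffle, pre-aggregating values locally so that less data has to be sent to the reducers. Since summing is associative and commutative, the combiner here can use the same logic as the reducer.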

In [4]:
%%file wordcount2.py

import string
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # remove leading and trailing whitespace
        line = line.strip()
        # remove punctuation characters
        line = line.translate(str.maketrans('', '', string.punctuation+'«»'))
        # emit the number of words in the line under a single key
        yield "words", len(line.split())

    def combiner(self, key, values):
        # pre-aggregate the counts produced by one map task
        yield key, sum(values)

    def reducer(self, key, values):
        # sum the (partially aggregated) counts for this key
        yield key, sum(values)

if __name__ == '__main__':
    MRWordCount.run()

Writing wordcount2.py


In [5]:
import wordcount2

# prepare the mapreduce job for local execution
mr_job = wordcount2.MRWordCount(args=['-r', 'local', 'os_maias.txt'])

# execute the job and print the output results
with mr_job.make_runner() as runner:
    runner.run()
    for key, value in mr_job.parse_output(runner.cat_output()):
        print(key, value)

words 213359
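The result is the same with or without the combiner; what changes is how many pairs travel from the map tasks to the reducer. The sketch below illustrates this in plain Python, without MrJob. The `combine` helper and the per-task outputs are hypothetical, chosen only to show the effect.

```python
from collections import defaultdict

def combine(pairs):
    """Pre-aggregate one map task's (key, count) pairs, as a combiner would."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return list(totals.items())

# Hypothetical output of two map tasks, one pair per input line
map_task_outputs = [
    [("words", 7), ("words", 3), ("words", 5)],
    [("words", 4), ("words", 6)],
]

# Pairs that would reach the reducer in each scenario
without_combiner = [pair for task in map_task_outputs for pair in task]
with_combiner = [pair for task in map_task_outputs for pair in combine(task)]

total_a = sum(count for _, count in without_combiner)
total_b = sum(count for _, count in with_combiner)
print(len(without_combiner), len(with_combiner), total_a, total_b)  # 5 2 25 25
```

A combiner is an optimization: a job should produce the same output whether or not the framework actually runs it, which is why its output must have the same shape as the mapper's.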


### MrJob Hadoop Execution

The main difference is that the input files and the output results need to be stored in HDFS (the Hadoop Distributed File System).
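As a rough sketch of what changes on the command line, the job arguments would select the Hadoop runner (`-r hadoop`) and point at HDFS locations for both input and output (`--output-dir`). The `hadoop_args` helper and the HDFS paths below are hypothetical, and nothing here contacts a cluster; it only builds the argument list.

```python
def hadoop_args(input_uri, output_uri):
    # '-r hadoop' selects the Hadoop runner; '--output-dir' sets where results go
    return ['-r', 'hadoop', input_uri, '--output-dir', output_uri]

# Hypothetical HDFS locations, for illustration only
args = hadoop_args('hdfs:///user/someone/os_maias.txt',
                   'hdfs:///user/someone/wordcount_out')
print(args)
```

With a Hadoop cluster available, such a list would take the place of `['-r', 'local', 'os_maias.txt']` when constructing the job.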

For now, we will use MrJob in local mode.