##mrjob##

__mrjob__ is a software package developed by the restaurant recommendation company _Yelp_. 
Its goal is to simplify the deployment of map-reduce jobs based on streaming and Python onto different 
frameworks, such as Hadoop on a private cluster or Hadoop on AWS (called EMR).

* You can read more about mrjob here: https://pythonhosted.org/mrjob/index.html  
* and you can clone it from GitHub here: https://github.com/yelp/mrjob

In this notebook we run a simple word-count example, add some logging commands to it, and look at three modes of running the job: inline, local, and EMR.

**mrjob Command line** is described here: https://pythonhosted.org/mrjob/guides/emr-tools.html

### This notebook
This notebook is an example of performing a very simple word counting task using mrjob.

In [62]:
import os
import sys
from time import time

# Get environment variables set from utils/setup.sh
home_dir = os.environ['HOME']
root_dir = os.environ['BD_GitRoot']

# Add utils to the python system path
sys.path.append(root_dir + '/utils')
# Read AWS credentials from 'EC2_VAULT'/Creds.pkl 
from read_mrjob_creds import *
(key_id, secret_key, s3_bucket, username) = read_credentials()
print s3_bucket,key_id,username

examples_dir = root_dir + '/data/text/'
!ls -l $examples_dir

s3://yoavfreunddefault/ AKIAIH6ZWS6WZ7FQMZ7A Yoav_Freund
total 2472
-rw-r--r--  1 yoavfreund  staff     1296 Jun  2 09:57 CorruptedParagraph.txt
-rw-r--r--  1 yoavfreund  staff  1257260 Jun  2 09:57 Moby-Dick.txt
-rw-r--r--  1 yoavfreund  staff     1468 Jun  2 09:57 OneParagraph.txt


In [63]:
%%writefile mr_word_freq_count.py
"""
The classic MapReduce job: count the frequency of words.

A few additional twists:
1. We use counters to keep track of the tasks.
2. This job has two steps, rather than just one. The first step counts the number of times each word occurs.
   The second sorts the words from the least to the most common.
"""

from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import sys

WORD_RE = re.compile(r"[\w']+")
Large=10000000  # maximal number of occurrences of a single word.

#logfile=open('logFile','w')
logfile=sys.stderr

log_marker='##LOG## '

class MRWordFreqCount(MRJob):

    def mapper1(self, _, line):
        for word in WORD_RE.findall(line):
            logfile.write(log_marker+'mapper1 '+word.lower()+'\n')
            self.increment_counter('group', 'mapper1', 1)
            yield (word.lower(), 1)

    def combiner1(self, word, counts):
        l_counts=[c for c in counts]  # extract list from iterator
        S=sum(l_counts)
        logfile.write(log_marker+'combiner1 '+word+' ['+','.join([str(c) for c in l_counts])+']='+str(S)+'\n')
        self.increment_counter('group', 'combiner1', 1)
        yield (word, S)

    def reducer1(self, word, counts):
        l_counts=[c for c in counts]  # extract list from iterator
        S=sum(l_counts)
        logfile.write(log_marker+'reducer1 '+word+' ['+','.join([str(c) for c in l_counts])+']='+str(S)+'\n')
        self.increment_counter('group', 'reducer1', 1)
        yield (word, S)

        
    def mapper2(self,word,count):
        logfile.write(log_marker+'mapper2 '+word+': '+str(count)+'\n')
        self.increment_counter('group', 'mapper2', 1)
        yield('%010d'%count,word) # add leading zeros so that lexical sorting is the same as numerical
        
    def reducer2(self,count,word):
        wlist=[w for w in word]                      
        logfile.write(log_marker+'reducer2 '+count+': '+wlist[0]+'\n')
        self.increment_counter('group', 'reducer2', 1)
        yield(count,wlist)
        
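    def steps(self):
        # Note: combiner1 is defined above but is not registered in these steps.
        # To enable it, pass combiner=self.combiner1 to the first MRStep.
        return [
            MRStep(mapper=self.mapper1,
                   reducer=self.reducer1),
            MRStep(mapper=self.mapper2,
                   reducer=self.reducer2)
        ]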



if __name__ == '__main__':
    MRWordFreqCount.run()


Writing mr_word_freq_count.py
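
The zero-padding in `mapper2` matters because the keys passed to the second step are compared as strings, not as numbers. Here is a minimal sketch of the effect, using made-up counts rather than output from the job:

```python
# Hypothetical counts, chosen only to illustrate the ordering problem.
counts = [3, 25, 100, 7]

# Sorting the plain string form is lexical: '100' sorts before '25'.
plain_keys = sorted(str(c) for c in counts)

# Sorting the zero-padded form (same format string as mapper2)
# gives the same order as sorting the numbers themselves.
padded_keys = sorted('%010d' % c for c in counts)

print(plain_keys)
print(padded_keys)
```

The width of 10 digits is enough as long as no count exceeds 9999999999.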


In [64]:
!ls -l $examples_dir

total 2472
-rw-r--r--  1 yoavfreund  staff     1296 Jun  2 09:57 CorruptedParagraph.txt
-rw-r--r--  1 yoavfreund  staff  1257260 Jun  2 09:57 Moby-Dick.txt
-rw-r--r--  1 yoavfreund  staff     1468 Jun  2 09:57 OneParagraph.txt


In [65]:
!cat $examples_dir/OneParagraph.txt

Another thing. Flask was the last person down at the dinner, and Flask
is the first man up. Consider! For hereby Flask's dinner was badly
jammed in point of time. Starbuck and Stubb both had the start of him;
and yet they also have the privilege of lounging in the rear. If Stubb
even, who is but a peg higher than Flask, happens to have but a small
appetite, and soon shows symptoms of concluding his repast, then Flask
must bestir himself, he will not get more than three mouthfuls that day;
for it is against holy usage for Stubb to precede Flask to the deck.
Therefore it was that Flask once admitted in private, that ever since he
had arisen to the dignity of an officer, from that moment he had never
known what it was to be otherwise than hungry, more or less. For what
he ate did not so much relieve his hunger, as keep it immortal in him.
Peace and satisfaction, thought Flask, have for ever departed from
my stomach. I am an officer; but, how I wish I could fish a bit of
old-

In [66]:
# Running code in "inline" mode
!python mr_word_freq_count.py $examples_dir/OneParagraph.txt > counts_local.txt

using configs in /Users/yoavfreund/.mrjob.conf
creating tmp directory /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150602.221456.513517
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150602.221456.513517/step-0-mapper_part-00000
##LOG## mapper1 another
##LOG## mapper1 thing
##LOG## mapper1 flask
##LOG## mapper1 was
##LOG## mapper1 the
##LOG## mapper1 last
##LOG## mapper1 person
##LOG## mapper1 down
##LOG## mapper1 at
##LOG## mapper1 the
##LOG## mapper1 dinner
##LOG## mapper1 and
##LOG## mapper1 flask
##LOG## mapper1 is
##LOG## mapper1 the
##LOG## mapper1 first
##LOG## mapper1 man
##LOG## mapper1 up
##LOG## mapper1 consider
##LOG## mapper1 for
##LOG## mapper1 hereby
##LOG## mapper1 flask's
##LOG## mapper1 dinner
##LOG## mapper1 was
##LOG## mapper1 badly
##LOG## mapper1 jammed
##LOG## mapper1 in
##LOG## mapper1 point
##LOG## mapper1 of
##LOG## mapper1 time
##LOG## mapper1 starbuck
##LOG## mapper1 and
##LOG##

In [67]:
!tail counts_local.txt

"0000000001"	["admitted", "aft", "ahab", "all", "also", "am", "ample", "another", "any", "appetite", "arisen", "ate", "awful", "badly", "be", "beef", "besides", "bestir", "bit", "both", "cabin", "capacity", "concluding", "consider", "could", "day", "deck", "departed", "did", "dignity", "do", "down", "dumfoundered", "even", "fashioned", "first", "fish", "forecastle", "fruits", "glory", "go", "grudge", "happens", "hereby", "higher", "himself", "holy", "how", "hunger", "hungry", "immortal", "insanity", "jammed", "keep", "known", "last", "less", "life", "light", "lounging", "man", "mast", "mere", "moment", "mouthfuls", "much", "must", "my", "never", "now", "obtain", "official", "old", "once", "or", "order", "otherwise", "peace", "peep", "peg", "pequod", "person", "point", "precede", "private", "privilege", "promotion", "rear", "relieve", "repast", "satisfaction", "shows", "silly", "since", "sitting", "sky", "small", "soon", "starbuck", "start", "stomach", "symptoms", "then", "therefore", "

## Different modes of running a mrjob map-reduce job ##

Once the mapper, combiner and reducer have been written and tested, you can run the job on different types of infrastructure:

1. __inline__: run the job as a single process on the local machine.
1. __local__: run the job on the local machine, but using multiple processes to simulate parallel processing.
1. __EMR__ (Elastic MapReduce): run the job on a Hadoop cluster running on the Amazon cloud.

Below we run the same process we ran at the top using __local__ instead of the default __inline__. Observe that in this case the reducers have some non-trivial work to do even when combiners are used.
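
The reason reducers still have work to do is that a combiner only sees the output of a single mapper task. When the input is divided among several mappers, each one produces its own partial count for a word, and the reducer has to add these up. The following is a pure-Python sketch of that flow and does not use mrjob. The two input splits and the function names `map_and_combine` and `shuffle` are illustrative assumptions.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")  # same pattern as in the job above


def map_and_combine(lines):
    """Emit (word, 1) pairs for one input split and combine them locally."""
    partial = defaultdict(int)
    for line in lines:
        for word in WORD_RE.findall(line):
            partial[word.lower()] += 1
    return dict(partial)


def shuffle(partials):
    """Group the partial counts from all splits by word."""
    grouped = defaultdict(list)
    for partial in partials:
        for word, count in partial.items():
            grouped[word].append(count)
    return dict(grouped)


# Two hypothetical input splits, as if handled by two separate mapper tasks.
split_a = ["Flask was the last person down at the dinner"]
split_b = ["and Flask is the first man up"]

grouped = shuffle([map_and_combine(split_a), map_and_combine(split_b)])

# The reducer sums the partial counts it receives for each word.
totals = {word: sum(counts) for word, counts in grouped.items()}

print('the: %s -> %d' % (grouped['the'], totals['the']))
print('flask: %s -> %d' % (grouped['flask'], totals['flask']))
```

In a single-process run with one split, each word reaches the reducer with a single partial count, so the sum is trivial.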

## Running in local mode

In [68]:
from time import time
t0=time()
!python mr_word_freq_count.py --runner=local $examples_dir/OneParagraph.txt > counts_local.txt
t1=time()
print 'total time',t1-t0

using configs in /Users/yoavfreund/.mrjob.conf
creating tmp directory /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150602.221507.472995
writing wrapper script to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150602.221507.472995/setup-wrapper.sh
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150602.221507.472995/step-0-mapper_part-00000
> sh -ex setup-wrapper.sh /Users/yoavfreund/anaconda/bin/python mr_word_freq_count.py --step-num=0 --mapper /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150602.221507.472995/input_part-00000 > /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150602.221507.472995/step-0-mapper_part-00000
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150602.221507.472995/step-0-mapper_part-00001
> sh -ex setup-wrapper.sh /Users/yoavfreund/anacond

In [69]:
!wc counts_local.txt
!tail -200 counts_local.txt

      10     170    1567 counts_local.txt
"0000000001"	["admitted", "aft", "ahab", "all", "also", "am", "ample", "another", "any", "appetite", "arisen", "ate", "awful", "badly", "be", "beef", "besides", "bestir", "bit", "both", "cabin", "capacity", "concluding", "consider", "could", "day", "deck", "departed", "did", "dignity", "do", "down", "dumfoundered", "even", "fashioned", "first", "fish", "forecastle", "fruits", "glory", "go", "grudge", "happens", "hereby", "higher", "himself", "holy", "how", "hunger", "hungry", "immortal", "insanity", "jammed", "keep", "known", "last", "less", "life", "light", "lounging", "man", "mast", "mere", "moment", "mouthfuls", "much", "must", "my", "never", "now", "obtain", "official", "old", "once", "or", "order", "otherwise", "peace", "peep", "peg", "pequod", "person", "point", "precede", "private", "privilege", "promotion", "rear", "relieve", "repast", "satisfaction", "shows", "silly", "since", "sitting", "sky", "small", "soon", "starbuck", "start", "st

## Setting up configuration

In [49]:
from find_waiting_flow import *
flows_dict = find_waiting_flow(key_id,secret_key)
print 'first dict is:',flows_dict[0]
flow_id, node = (flows_dict[0]['flow_id'],flows_dict[0]['node'])
print 'chosen flow=',flow_id, node 
#input_file = 'hdfs://'+node+':9000/weather.raw_data/ALL.csv'

0 j-2YTN78MP8WQG9 ec2-54-161-53-198.compute-1.amazonaws.com WAITING
1 j-11HA0APWXXDQN ec2-54-81-217-77.compute-1.amazonaws.com WAITING
2 j-34852Y03UYJ9J ec2-54-224-0-158.compute-1.amazonaws.com WAITING
3 j-3R0H4CX9PUGQZ ec2-54-162-188-199.compute-1.amazonaws.com WAITING
4 j-1FVV0CQW6H8NB ec2-54-144-57-57.compute-1.amazonaws.com WAITING
5 j-23GGEIWFSWTLY ec2-54-162-37-207.compute-1.amazonaws.com WAITING
6 j-2OBNM3R7JOBUM ec2-54-81-125-208.compute-1.amazonaws.com WAITING
first dict is: {'node': u'ec2-54-161-53-198.compute-1.amazonaws.com', 'flow_state': u'WAITING', 'flow_id': u'j-2YTN78MP8WQG9'}
chosen flow= j-2YTN78MP8WQG9 ec2-54-161-53-198.compute-1.amazonaws.com


## Running in EMR mode on existing job flow (hadoop cluster)

In [59]:
# It is important that the output directory be empty before the job is started
# otherwise the job will crash (the idea is to prevent you from over-writing the output
# of a long job by mistake)
output_dir='s3://yoavfreunddefault/simple_mrjob/'
!s3cmd del --recursive $output_dir

File s3://yoavfreunddefault/simple_mrjob/_SUCCESS deleted
File s3://yoavfreunddefault/simple_mrjob/part-00000 deleted
File s3://yoavfreunddefault/simple_mrjob/part-00001 deleted
File s3://yoavfreunddefault/simple_mrjob/part-00002 deleted
File s3://yoavfreunddefault/simple_mrjob/part-00003 deleted
File s3://yoavfreunddefault/simple_mrjob/part-00004 deleted
File s3://yoavfreunddefault/simple_mrjob/part-00005 deleted
File s3://yoavfreunddefault/simple_mrjob/part-00006 deleted
File s3://yoavfreunddefault/simple_mrjob/part-00007 deleted
File s3://yoavfreunddefault/simple_mrjob/part-00008 deleted


In [60]:
!python mr_word_freq_count.py -r emr  $examples_dir/Moby-Dick.txt --emr-job-flow-id=$flow_id --output-dir=$output_dir  --no-output

using configs in /Users/yoavfreund/.mrjob.conf
creating tmp directory /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150602.215802.890698
Copying non-input files into s3://yoavfreunddefault/tmp/mr_word_freq_count.yoavfreund.20150602.215802.890698/files/
Adding our job to existing job flow j-2YTN78MP8WQG9
Job launched 32.3s ago, status RUNNING: Running step (mr_word_freq_count.yoavfreund.20150602.215802.890698: Step 1 of 2)
Job launched 64.6s ago, status RUNNING: Running step (mr_word_freq_count.yoavfreund.20150602.215802.890698: Step 1 of 2)
Job launched 97.3s ago, status RUNNING: Running step (mr_word_freq_count.yoavfreund.20150602.215802.890698: Step 1 of 2)
Job launched 129.2s ago, status RUNNING: Running step (mr_word_freq_count.yoavfreund.20150602.215802.890698: Step 2 of 2)
Job launched 161.8s ago, status RUNNING: Running step (mr_word_freq_count.yoavfreund.20150602.215802.890698: Step 2 of 2)
Job completed.
Running time was 140.0s (not counting 

In [61]:
# Here is the output, divided into parts: one part file per reduce task.
!s3cmd ls $output_dir

2015-06-02 22:00         0   s3://yoavfreunddefault/simple_mrjob/_SUCCESS
2015-06-02 22:00     21169   s3://yoavfreunddefault/simple_mrjob/part-00000
2015-06-02 22:00     98421   s3://yoavfreunddefault/simple_mrjob/part-00001
2015-06-02 22:00      6112   s3://yoavfreunddefault/simple_mrjob/part-00002
2015-06-02 22:00      8733   s3://yoavfreunddefault/simple_mrjob/part-00003
2015-06-02 22:00     14708   s3://yoavfreunddefault/simple_mrjob/part-00004
2015-06-02 22:00     36529   s3://yoavfreunddefault/simple_mrjob/part-00005
2015-06-02 22:00      5237   s3://yoavfreunddefault/simple_mrjob/part-00006
2015-06-02 22:00      6532   s3://yoavfreunddefault/simple_mrjob/part-00007
2015-06-02 22:00     10790   s3://yoavfreunddefault/simple_mrjob/part-00008
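
Once the part files have been copied to the local machine, they can be merged back into a single sorted list. The sketch below assumes the line format seen in `counts_local.txt` above: a JSON-encoded key, a tab, then a JSON-encoded value. The part contents and the helper names `parse_output_line` and `merge_parts` are hypothetical.

```python
import json


def parse_output_line(line):
    """Split one output line into (count, words): JSON key, tab, JSON value."""
    key, value = line.rstrip('\n').split('\t', 1)
    return int(json.loads(key)), json.loads(value)


def merge_parts(parts):
    """Merge the lines of several part files into one list sorted by count."""
    records = [parse_output_line(line) for part in parts for line in part]
    return sorted(records, key=lambda rec: rec[0])


# Hypothetical contents of two part files (not real job output).
part_0 = ['"0000000003"\t["ahab"]\n']
part_1 = ['"0000000001"\t["aft", "beef"]\n', '"0000000007"\t["the"]\n']

merged = merge_parts([part_0, part_1])
for count, words in merged:
    print('%d %s' % (count, ', '.join(words)))
```

Each element of `parts` can be any iterable of lines, so an open file object for a downloaded `part-0000N` file would work in place of the lists.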


### Getting the detailed output

To get the detailed output, use the script `/utils/get_emr_logs.py`. 
This is an interactive script that helps you identify the map-reduce steps that you ran.
It is especially useful when things go wrong.

Use the script **in a terminal, not in the notebook**.  The cell below contains the log for finding and downloading the files from the two steps above.


### Finding the relevant information in the logs

As you see above, a single MR step generates a large number of files. These files are organized in two directories:

* **steps**, which contains the job-level logs. These logs contain mostly messages from the Hadoop system. However, they do contain one piece that is relevant for us: the sum of the counters is given in `syslog` (see below). The counters are accumulated across all of the tasks, so they provide a useful summary of what the job achieved in total.
* **task-attempts**, which contains the task-level logs. Recall that each map and each reduce command gets partitioned into a large number of *tasks*. Each of those generates three files: `syslog`, `stdout` and `stderr`. Of those, the most useful for debugging are the `stderr` files, to which we sent our log messages. (Don't make the mistake of sending the log to `stdout`. If you do that, log messages from the mapper will be received as key-value pairs by the combiners/reducers and cause them to crash.)
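
Because the job prefixes every log message with `##LOG## `, its own messages are easy to separate from Hadoop's in a `stderr` file. The following is a minimal sketch, assuming the log format written by `mr_word_freq_count.py` above. The sample lines and the helper name `tally_log_lines` are hypothetical.

```python
from collections import Counter

LOG_MARKER = '##LOG## '  # same marker the job writes to stderr


def tally_log_lines(lines, marker=LOG_MARKER):
    """Count marked log lines per stage (the first word after the marker)."""
    tally = Counter()
    for line in lines:
        if line.startswith(marker):
            fields = line[len(marker):].split()
            if fields:
                tally[fields[0]] += 1
    return tally


# Sample stderr content. The marked lines follow the job's log format;
# the unmarked line is a made-up stand-in for a Hadoop system message.
sample_stderr = [
    'some message from the Hadoop system\n',
    '##LOG## mapper1 another\n',
    '##LOG## mapper1 thing\n',
    '##LOG## reducer1 flask [1,1]=2\n',
]

tally = tally_log_lines(sample_stderr)
for stage in sorted(tally):
    print('%s %d' % (stage, tally[stage]))
```

The per-stage totals obtained this way should correspond to the values of the counters that the job increments alongside each log message.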