## mrjob ##

__mrjob__ is a software package developed by the restaurant recommendation company _Yelp_. 
Its goal is to simplify the deployment of map-reduce jobs based on streaming and Python onto different 
frameworks, such as Hadoop on a private cluster or Hadoop on AWS (called EMR).

* You can read more about mrjob here: https://pythonhosted.org/mrjob/index.html  
* You can clone it from GitHub here: https://github.com/yelp/mrjob

In this notebook we run a simple word-count example, add some logging commands to it, and look at different modes of running the job.

**mrjob Command line** is described here: https://pythonhosted.org/mrjob/guides/emr-tools.html

In [1]:
import os
import sys
from time import time

# Get environment variables set from utils/setup.sh
home_dir = os.environ['HOME']
root_dir = os.environ['BD_GitRoot']

# Add utils to the python system path
sys.path.append(root_dir + '/utils')
# Read AWS credentials from 'EC2_VAULT'/Creds.pkl 
from read_mrjob_creds import *
(key_id, secret_key, s3_bucket, username) = read_credentials()
print s3_bucket,key_id,username

examples_dir = root_dir + '/data/text/'
!ls -l $examples_dir

s3://cse-devagarwal/ AKIAIJJJKJ3L43JB5HRA devagarwal
total 2472
-rw-r--r--  1 devagarwal  staff     1296 Apr  2 16:02 CorruptedParagraph.txt
-rw-r--r--  1 devagarwal  staff  1257260 Apr  2 16:02 Moby-Dick.txt
-rw-r--r--  1 devagarwal  staff     1468 Apr  2 16:02 OneParagraph.txt


In [14]:
%%writefile mr_word_freq_count.py
#!/usr/bin/python
# Copyright 2009-2010 Yelp
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re
from sys import stderr

WORD_RE = re.compile(r"[\w']+")

#logfile=open('log','w')
logfile=stderr

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            logfile.write('mapper '+word.lower()+'\n')
            yield (word.lower(), 1)

#    def combiner(self, word, counts):
#        #yield (word, sum(counts))
#        l_counts=[c for c in counts]  # extract list from iterator
#        S=sum(l_counts)
#        logfile.write('combiner '+word+' ['+','.join([str(c) for c in l_counts])+']='+str(S)+'\n')
#        yield (word, S)

    def reducer(self, word, counts):
        #yield (word, sum(counts))
        l_counts=[c for c in counts]  # extract list from iterator
        S=sum(l_counts)
        logfile.write('reducer '+word+' ['+','.join([str(c) for c in l_counts])+']='+str(S)+'\n')
        yield (word, S)

if __name__ == '__main__':
    MRWordFreqCount.run()


Overwriting mr_word_freq_count.py


In [153]:
%%writefile mr_word_freq_count.py
#!/usr/bin/python
# Copyright 2009-2010 Yelp
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from sys import stderr

WORD_RE = re.compile(r"[\w']+")
Large=10000000  # maximal number of occurrences of a single word.

logfile=open('logFile','w')
#logfile=stderr

class MRWordFreqCount(MRJob):

    def mapper1(self, _, line):
        for word in WORD_RE.findall(line):
            logfile.write('mapper '+word.lower()+'\n')
            yield (word.lower(), 1)

    def reducer1(self, word, counts):
        #yield (word, sum(counts))
        l_counts=[c for c in counts]  # extract list from iterator
        S=sum(l_counts)
        logfile.write('reducer '+word+' ['+','.join([str(c) for c in l_counts])+']='+str(S)+'\n')
        yield (word, S)
        
    def mapper2(self,word,count):
        logfile.write('mapper2 '+word+': '+str(count)+'\n')
        yield('%010d'%count,word) # add leading zeros so that lexical sorting is the same as numerical
        
    def reducer2(self,count,word):
        wlist=[w for w in word]
        logfile.write('reducer2 '+count+': '+wlist[0]+'\n')
        yield(count,wlist)
        
    def steps(self):
        return [
            MRStep(mapper=self.mapper1,
                   reducer=self.reducer1),
            MRStep(mapper=self.mapper2,
                   reducer=self.reducer2)
        ]



if __name__ == '__main__':
    MRWordFreqCount.run()


Overwriting mr_word_freq_count.py


In [130]:
!ls -l $examples_dir

total 2472
-rw-r--r--  1 yoavfreund  staff     1296 Mar 27 12:23 CorruptedParagraph.txt
-rw-r--r--@ 1 yoavfreund  staff  1257260 Mar 27 12:23 Moby-Dick.txt
-rw-r--r--  1 yoavfreund  staff     1468 Mar 27 12:23 OneParagraph.txt


In [131]:
!cat $examples_dir/OneParagraph.txt

Another thing. Flask was the last person down at the dinner, and Flask
is the first man up. Consider! For hereby Flask's dinner was badly
jammed in point of time. Starbuck and Stubb both had the start of him;
and yet they also have the privilege of lounging in the rear. If Stubb
even, who is but a peg higher than Flask, happens to have but a small
appetite, and soon shows symptoms of concluding his repast, then Flask
must bestir himself, he will not get more than three mouthfuls that day;
for it is against holy usage for Stubb to precede Flask to the deck.
Therefore it was that Flask once admitted in private, that ever since he
had arisen to the dignity of an officer, from that moment he had never
known what it was to be otherwise than hungry, more or less. For what
he ate did not so much relieve his hunger, as keep it immortal in him.
Peace and satisfaction, thought Flask, have for ever departed from
my stomach. I am an officer; but, how I wish I could fish a bit of
old-

In [132]:
# Running code in "inline" mode
!python mr_word_freq_count.py $examples_dir/Moby-Dick.txt > counts_local.txt

using configs in /Users/yoavfreund/.mrjob.conf
creating tmp directory /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224213.528652
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224213.528652/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224213.528652/step-0-mapper-sorted
> sort /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224213.528652/step-0-mapper_part-00000
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224213.528652/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224213.528652/step-1-mapper_part-00000
Counters from step 2:
  (no counters found)
writing to /v

In [133]:
!cat counts_local.txt

"0000000002"	["'and", "'bout", "'canallers", "'come", "'damn", "'dention", "'gainst", "'is", "'mong", "'oh", "'so", "'stern", "'teak", "'then", "'there", "'twill", "'very", "'we", "'well", "'where", "'who's", "'why", "'will", "'you", "15th", "1671", "1775", "1778", "180", "1807", "1820", "1825", "1828", "1836", "1851", "1st", "2001", "24", "3d", "40", "400", "45", "50", "500", "501", "64", "72", "89", "a'ready", "a'top", "aback", "abaft", "abandonment", "abased", "abashed", "abating", "abed", "abiding", "ablutions", "abode", "abridged", "abruptly", "absorbing", "abundantly", "accelerating", "accept", "accepted", "accessible", "accessory", "accommodate", "accompany", "accomplish", "accurately", "acquaintance", "acquaintances", "acquired", "acre", "actions", "adequate", "adjacent", "adjoining", "admeasurements", "admirably", "admiral", "admiral's", "admitting", "admonished", "adopted", "adoration", "adroop", "adult", "affidavit", "affording", "affords", "affright", "affrights", "afire", 

**Exercise:** create a two-step job which, after counting the number of occurrences of each word, sorts the words in order from most to least frequent.
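One possible starting point for the exercise, offered as a sketch rather than the intended solution. As the comment in `mapper2` notes, keys are sorted lexically, so the second step needs a fixed-width key. Zero-padding `count` gives ascending order; zero-padding `Large - count` (using the `Large` constant defined in the job above) reverses it. The helper names `ascending_key` and `descending_key` are made up for illustration, and the sample counts are taken from the output shown later in this notebook.

```python
Large = 10000000  # same constant as in the job above: an upper bound on any count

def ascending_key(count):
    # zero-padded, so lexical order equals numerical order
    return '%010d' % count

def descending_key(count):
    # padding (Large - count) reverses the order: larger counts get smaller keys
    return '%010d' % (Large - count)

# a few (word, count) pairs used only as sample input
counts = {'because': 93, 'flask': 100, 'voyage': 103, 'high': 108, 'pequod': 124}

# without padding, '93' sorts after '124' because strings compare character by character
unpadded = sorted(str(c) for c in counts.values())
by_asc = sorted(counts, key=lambda w: ascending_key(counts[w]))
by_desc = sorted(counts, key=lambda w: descending_key(counts[w]))

print(unpadded)
print(by_asc)
print(by_desc)
```

In a job, the same expression would replace the key emitted by the second mapper; the reducer can recover the count as `Large - int(key)`.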

## What is the meaning of "yield" ?

The keyword __yield__ is somewhat similar to __return__. However, while __return__ terminates the function and returns the result, a function that contains __yield__ behaves differently: calling it returns an object called a __generator__ without executing any of the function body. Each time a value is requested from the generator, the function runs until it reaches a __yield__ command, the yielded value is returned, and the function pauses (but does not terminate) until the next value is requested.

Here is a simple example:

In [24]:
def myrange(start,stop,step):
    value=start
    while value<=stop:
        yield value
        value += step
print [x for x in myrange(1.0,3.0,0.3)]

[1.0, 1.3, 1.6, 1.9000000000000001, 2.2, 2.5, 2.8]


In [25]:
print myrange(1.0,3.0,0.3)

<generator object myrange at 0x107c19640>


In [26]:
gen1=myrange(1.0,3.0,0.3)
gen2=myrange(2.0,5.0,0.7)
print 'gen1:',[x for x in gen1]
print 'gen1:',[x for x in gen1]  # after the generator terminated, it does not yield any more values.
print 'gen2:',[x for x in gen2]

gen1: [1.0, 1.3, 1.6, 1.9000000000000001, 2.2, 2.5, 2.8]
gen1: []
gen2: [2.0, 2.7, 3.4000000000000004, 4.1000000000000005, 4.800000000000001]
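The pause-and-resume behaviour can be observed directly by pulling values one at a time with the built-in `next()`. This is a small sketch: `countdown` is a made-up generator used only for illustration, and the `trace` list records when each part of the function body actually runs.

```python
trace = []  # records when each part of the function body runs

def countdown(n):
    trace.append('started')
    while n > 0:
        yield n                              # hand back one value and pause here
        trace.append('resumed after %d' % n)
        n -= 1
    trace.append('finished')

gen = countdown(2)             # creates the generator; no body code has run yet
trace_at_creation = list(trace)

first = next(gen)              # runs the body up to the first yield
second = next(gen)             # resumes after the yield and runs to the next one
leftover = list(gen)           # runs the body to its end; nothing more is yielded

print(trace_at_creation)
print('%s %s %s' % (first, second, leftover))
print(trace)
```

The empty `trace_at_creation` shows that calling `countdown(2)` executes none of the body; the final `trace` shows the body running in pieces, one piece per requested value.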


A generator is similar to an array or a list: all of these are __iterable__ objects. However, while lists store all of their values in memory and can be read in any order, generators create the values on the fly and can only be traversed __once__ and __in order__.

It is the fact that values are generated on the fly and then discarded that makes generators attractive when processing large amounts of data: only a small amount of intermediate results, the outputs of the mapper which are inputs to the reducer, need to be stored in memory. How much depends on the communication speed between mappers and reducers.
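A rough illustration of the memory argument, comparing a list that holds many values with a generator that produces the same values. The helper name `squares` and the value of `n` are arbitrary choices for this sketch. Note that `sys.getsizeof` reports only the size of the container object itself (not the numbers it refers to), which is enough to show the difference in scale.

```python
import sys

def squares(n):
    # produce n values one at a time instead of storing them
    for i in range(n):
        yield i * i

n = 100000
as_list = [i * i for i in range(n)]   # all n values are held in memory at once
as_gen = squares(n)                    # values are produced on demand

list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)

total = sum(as_gen)                    # consumes the generator without building a list
print('list: %d bytes, generator: %d bytes' % (list_bytes, gen_bytes))
print(total)
```

The size of the generator object does not depend on `n`, whereas the list grows with it.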

It is instructive to see how generators can be cascaded by passing a generator as a parameter to another generator.

In [27]:
def mycumul(values):   # values can be a list or a generator.
    s=0
    for value in values:
        s+=value
        yield s

In [28]:
# Here we pass a generator as an input to another generator.
gen3=mycumul(myrange(1.0,3.0,0.3))   

In [29]:
print 'gen3:',[x for x in gen3]

gen3: [1.0, 2.3, 3.9, 5.8, 8.0, 10.5, 13.3]


## Different modes of running a mrjob map-reduce job ##

Once the mapper, combiner and reducer have been written and tested, you can run the job on different types of infrastructure:

1. __inline__: run the job as a single process on the local machine.
1. __local__: run the job on the local machine, but using multiple processes to simulate parallel processing.
1. __EMR__ (Elastic Map Reduce): run the job on a Hadoop cluster running on the Amazon cloud.

Below we run the same process we ran at the top using __local__ instead of the default __inline__. Observe that in this case the reducers have some non-trivial work to do even when combiners are used.
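To see why reducers still have work to do when combiners are used, here is a plain-Python simulation of the map, combine, sort and reduce phases. It is not how mrjob is implemented; the function names (`map_phase`, `group_by_key`, `combine_phase`, `reduce_phase`) and the two-way split of the input are assumptions made for illustration. Each combiner sees the output of only one mapper, so a word that appears in both splits reaches the reducer as two partial counts that must still be added.

```python
import re
from itertools import groupby
from operator import itemgetter

WORD_RE = re.compile(r"[\w']+")

def map_phase(lines):
    # same logic as the job's mapper: emit (word, 1) for every word
    for line in lines:
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

def group_by_key(pairs):
    # sort by key, then collect the values that share a key
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def combine_phase(pairs):
    # runs on the output of a single mapper only
    for word, counts in group_by_key(pairs):
        yield (word, sum(counts))

def reduce_phase(pairs):
    # keeps the incoming counts so we can inspect what the reducer received
    for word, counts in group_by_key(pairs):
        yield (word, counts, sum(counts))

# Two input splits, as if handled by two separate mapper processes.
# The lines are the first two lines of OneParagraph.txt.
split_a = ["Another thing. Flask was the last person down at the dinner, and Flask"]
split_b = ["is the first man up. Consider! For hereby Flask's dinner was badly"]

combined = (list(combine_phase(map_phase(split_a))) +
            list(combine_phase(map_phase(split_b))))
result = dict((word, (counts, total))
              for word, counts, total in reduce_phase(combined))

print(result['the'])     # partial counts from both splits, then their sum
print(result['flask'])   # appears in one split only
```

For `the`, the reducer receives one partial count per split and has to add them; for `flask`, which occurs only in the first split (the regular expression keeps `flask's` as a separate word), the combiner has already done all the work.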

## Running in local mode

In [134]:
from time import time
t0=time()
!python mr_word_freq_count.py --runner=local $examples_dir/Moby-Dick.txt > counts_local.txt
t1=time()
print 'total time',t1-t0

using configs in /Users/yoavfreund/.mrjob.conf
creating tmp directory /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224334.457009
writing wrapper script to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224334.457009/setup-wrapper.sh
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224334.457009/step-0-mapper_part-00000
> sh -ex setup-wrapper.sh /Users/yoavfreund/anaconda/bin/python mr_word_freq_count.py --step-num=0 --mapper /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224334.457009/input_part-00000 > /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224334.457009/step-0-mapper_part-00000
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150504.224334.457009/step-0-mapper_part-00001
> sh -ex setup-wrapper.sh /Users/yoavfreund/anacond

In [141]:
!wc counts_emr.txt
!tail -200 counts_local.txt

     291   18201  208231 counts_emr.txt
"0000000093"	["because"]
"0000000094"	["that's"]
"0000000095"	["face", "few", "get"]
"0000000096"	["however", "nantucket", "sun"]
"0000000097"	["does", "went"]
"0000000098"	["he's", "shall", "strange"]
"0000000099"	["hold", "it's"]
"0000000100"	["flask"]
"0000000101"	["moment"]
"0000000102"	["end", "new", "sail"]
"0000000103"	["voyage"]
"0000000105"	["sight"]
"0000000107"	["body", "stand"]
"0000000108"	["high"]
"0000000110"	["along", "heard", "work"]
"0000000112"	["times"]
"0000000113"	["another", "saw", "thy"]
"0000000114"	["nothing", "poor", "towards"]
"0000000115"	["make"]
"0000000116"	["called"]
"0000000117"	["just", "place", "why"]
"0000000118"	["found", "she"]
"0000000119"	["don't"]
"0000000120"	["between"]
"0000000121"	["think", "till"]
"0000000123"	["something"]
"0000000124"	["pequod"]
"0000000125"	["whale's"]
"0000000126"	["both", "under"]
"0000000127"	["feet"]
"0000000128"	["mast"]
"0000000129"	["each", "soon"]
"0000000130"	["came", "ha

## Setting up configuration

In [5]:
from find_waiting_flow import *
flows_dict = find_waiting_flow(key_id,secret_key)
flow_id, node = (flows_dict[0]['flow_id'],flows_dict[0]['node'])
print flow_id, node 
input_file = 'hdfs://'+node+':9000/weather.raw_data/ALL.csv'

1 j-2NAO98NKRH37G ec2-54-198-235-167.compute-1.amazonaws.com WAITING
2 j-29TWCEID490R1 ec2-23-20-223-54.compute-1.amazonaws.com WAITING
3 j-1K4OUFYCVV74C ec2-54-159-119-133.compute-1.amazonaws.com WAITING
4 j-3V3JJACKWH5O0 ec2-54-224-209-44.compute-1.amazonaws.com RUNNING
j-2NAO98NKRH37G ec2-54-198-235-167.compute-1.amazonaws.com


In [71]:
#flow_id='j-1I5Z82JUB89Q1'

## Running in EMR mode on existing job flow (Hadoop cluster)

In [6]:
import uuid

# Create unique output directory in the student's s3_bucket
output_dir = s3_bucket + str(uuid.uuid4()) + "/"

print output_dir

s3://cse-devagarwal_tmp4/1fa269e5-7367-4ab0-a7f7-c320a9a2158c/


In [7]:
#!python mr_word_freq_count.py -r emr $root_dir/README.md --emr-job-flow-id=$flow_id --output-dir=$output_dir  > counts_emr.txt
!python mr_word_freq_count.py -r emr  $examples_dir/Moby-Dick.txt --emr-job-flow-id=$flow_id --output-dir=$output_dir  > counts_emr.txt

using configs in /home/sachin/.mrjob.conf
creating tmp directory /tmp/mr_word_freq_count.sachin.20150515.190457.508930
Copying non-input files into s3://cse-devagarwal_tmp4/tmp/mr_word_freq_count.sachin.20150515.190457.508930/files/
Adding our job to existing job flow j-3V3JJACKWH5O0
Job on job flow j-3V3JJACKWH5O0 failed with status WAITING: Waiting after step failed
Logs are in s3://cse255-emr/log/j-3V3JJACKWH5O0/
ec2_key_pair_file not specified, going to S3
Scanning S3 logs for probable cause of failure
Waiting 5.0s for S3 eventual consistency
Attempting to terminate job...
Traceback (most recent call last):
  File "mr_word_freq_count.py", line 62, in <module>
    MRWordFreqCount.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 461, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 479, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 151, in execute


In [152]:
!mrjob fetch-logs --list $flow_id

using configs in /Users/yoavfreund/.mrjob.conf
Task attempts:

Steps:

Jobs:

Nodes:

Removing all files in s3://yoavfreunddefault/tmp/no_script.yoavfreund.20150505.003755.068466/


In [151]:
!ls -lrt

total 2072
-rw-r--r--  1 yoavfreund  staff    1800 Mar 27 12:23 mr_word_freq_counters.py
-rw-r--r--  1 yoavfreund  staff    6449 Apr 29 21:49 Jobs_and_runners.ipynb
-rw-r--r--  1 yoavfreund  staff   63120 May  2 21:02 Simple use of mrjob-WithCounters.ipynb
-rw-r--r--  1 yoavfreund  staff    2012 May  4 15:42 mr_word_freq_count.py
-rw-r--r--  1 yoavfreund  staff  208231 May  4 15:43 counts_local.txt
-rw-r--r--  1 yoavfreund  staff       0 May  4 15:58 log
-rw-r--r--  1 yoavfreund  staff  208231 May  4 16:02 counts_emr.txt
-rw-r--r--  1 yoavfreund  staff    3201 May  4 17:26 counts_only.txt
-rw-r--r--  1 yoavfreund  staff  556934 May  4 17:28 Simple use of mrjob.ipynb


In [146]:
!wc counts_emr.txt
!cat counts_emr.txt

     291   18201  208231 counts_emr.txt
"0000000018"	["plenty", "rush", "revealed", "precious", "stone", "picture", "islands", "west", "whereas", "yonder", "voyages", "he'll", "flag", "beach", "brave", "breath", "yard", "interest", "kings", "opposite", "leave", "spite", "pieces", "nameless", "pots", "understand", "instance", "jump", "modern", "regularly", "roll", "lo", "thine", "spirit", "seized", "license", "mentioned", "prodigious", "longer", "inside", "lead", "glory", "horizon", "delight", "guess", "floated", "drink", "content", "beef", "bedford", "chains", "blacksmith", "hook", "intense", "observed", "sought", "rate", "thereby", "lowering", "period", "lips", "height", "crossing", "engaged", "evening", "future", "dinner", "gives", "dropping", "difference", "careful", "agreement", "cannibal", "hill", "example", "box"]
"0000000024"	["cause", "hunt", "exactly", "orders", "perils", "remains", "lad", "tall", "swift", "ran", "year", "pursuit", "pulling", "striking", "regular", "war", "wal

In [147]:
!cut -b 2-11 counts_emr.txt > counts_only.txt
!head -100 counts_only.txt

0000000003
0000000018
0000000024
0000000030
0000000039
0000000045
0000000051
0000000066
0000000072
0000000087
0000000093
0000000105
0000000126
0000000153
0000000168
0000000228
0000000249
0000000297
0000000321
0000000378
0000000438
0000000588
0000000783
0000001335
0000000001
0000000016
0000000022
0000000037
0000000043
0000000058
0000000064
0000000070
0000000079
0000000085
0000000091
0000000103
0000000118
0000000124
0000000130
0000000151
0000000193
0000000205
0000000211
0000000232
0000000247
0000000268
0000000307
0000000334
0000000376
0000000382
0000000409
0000000415
0000000436
0000000538
0000000619
0000000646
0000000943
0000001171
0000001543
0000001645
0000001747
0000000008
0000000014
0000000020
0000000029
0000000035
0000000041
0000000056
0000000062
0000000077
0000000083
0000000098
0000000101
0000000116
0000000137
0000000143
0000000158
0000000164
0000000170
0000000185
0000000224
0000000245
0000000272
0000

In [19]:
!ls ../../data/text/Moby-Dick.txt

../../data/text/Moby-Dick.txt


In [47]:
'%5d'%5

'    5'