## mrjob ##

__mrjob__ is a software package developed by the restaurant recommendation company _Yelp_. 
Its goal is to simplify the deployment of map-reduce jobs, written in Python and based on Hadoop streaming, onto different 
frameworks such as Hadoop on a private cluster or Hadoop on AWS (called EMR).

* You can read more about mrjob here: https://pythonhosted.org/mrjob/index.html  
* You can clone it from GitHub here: https://github.com/yelp/mrjob

In this notebook we run a simple word-count example, add some logging commands to it, and look at several modes of running the job.

In [2]:
import os
import sys
from time import time

# Get environment variables set from utils/setup.sh
home_dir = os.environ['HOME']
root_dir = os.environ['BD_GitRoot']

# Add utils to the python system path
sys.path.append(root_dir + '/utils')
# Read AWS credentials from 'EC2_VAULT'/Creds.pkl 
from read_mrjob_creds import *
(key_id, secret_key, s3_bucket, username) = read_credentials()
print s3_bucket

examples_dir = root_dir + '/data/text/'
!ls -l $examples_dir

s3://cse-devagarwal_tmp4/
total 1236
-rw-rw-r-- 1 sachin sachin    1296 Mar 27 14:08 CorruptedParagraph.txt
-rw-rw-r-- 1 sachin sachin 1257260 Mar 27 14:08 Moby-Dick.txt
-rw-rw-r-- 1 sachin sachin    1468 Mar 27 14:08 OneParagraph.txt


In [17]:
filename=examples_dir+'mr_word_freq_count.py'
print filename

!ls -al $filename

/Users/yoavfreund/BigData/UCSD_BigData/data/text/mr_word_freq_count.py
ls: /Users/yoavfreund/BigData/UCSD_BigData/data/text/mr_word_freq_count.py: No such file or directory


In [29]:
%%writefile mr_word_freq_count.py
#!/usr/bin/python
# Copyright 2009-2010 Yelp
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re
from sys import stderr

WORD_RE = re.compile(r"[\w']+")

#logfile=open('log','w')
logfile=stderr

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            logfile.write('mapper '+word.lower()+'\n')
            self.increment_counter('group', 'mapper', 1)
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        #yield (word, sum(counts))
        l_counts=[c for c in counts]  # extract list from iterator
        S=sum(l_counts)
        logfile.write('combiner '+word+' ['+','.join([str(c) for c in l_counts])+']='+str(S)+'\n')
        self.increment_counter('group', 'combiner', 1)
        yield (word, S)

    def reducer(self, word, counts):
        #yield (word, sum(counts))
        l_counts=[c for c in counts]  # extract list from iterator
        S=sum(l_counts)
        logfile.write('reducer '+word+' ['+','.join([str(c) for c in l_counts])+']='+str(S)+'\n')
        self.increment_counter('group', 'reducer', 1)
        yield (word, S)

if __name__ == '__main__':
    MRWordFreqCount.run()


Overwriting mr_word_freq_count.py


In [30]:
!ls -l $examples_dir

total 2472
-rw-r--r--  1 yoavfreund  staff     1296 Mar 27 12:23 CorruptedParagraph.txt
-rw-r--r--@ 1 yoavfreund  staff  1257260 Mar 27 12:23 Moby-Dick.txt
-rw-r--r--  1 yoavfreund  staff     1468 Mar 27 12:23 OneParagraph.txt


In [11]:
!cat $examples_dir/OneParagraph.txt

Another thing. Flask was the last person down at the dinner, and Flask
is the first man up. Consider! For hereby Flask's dinner was badly
jammed in point of time. Starbuck and Stubb both had the start of him;
and yet they also have the privilege of lounging in the rear. If Stubb
even, who is but a peg higher than Flask, happens to have but a small
appetite, and soon shows symptoms of concluding his repast, then Flask
must bestir himself, he will not get more than three mouthfuls that day;
for it is against holy usage for Stubb to precede Flask to the deck.
Therefore it was that Flask once admitted in private, that ever since he
had arisen to the dignity of an officer, from that moment he had never
known what it was to be otherwise than hungry, more or less. For what
he ate did not so much relieve his hunger, as keep it immortal in him.
Peace and satisfaction, thought Flask, have for ever departed from
my stomach. I am an officer; but, how I wish I could fish a bit of
old-

In [15]:
# Running code in "inline" mode
!python mr_word_freq_count.py $examples_dir/OneParagraph.txt > counts_local.txt

using configs in /Users/yoavfreund/.mrjob.conf
creating tmp directory /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.230020.602985
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.230020.602985/step-0-mapper_part-00000
Counters from step 1:
  group:
    combiner: 160
    mapper: 278
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.230020.602985/step-0-mapper-sorted
> sort /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.230020.602985/step-0-mapper_part-00000
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.230020.602985/step-0-reducer_part-00000
Counters from step 1:
  group:
    combiner: 160
    mapper: 278
    reducer: 160
Moving /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.230020.602985/step-0-reducer_part-00000 -

In [60]:
!cat counts_local.txt

"a"	5
"admitted"	1
"aft"	1
"against"	2
"ahab"	1
"all"	1
"also"	1
"am"	1
"ample"	1
"an"	2
"and"	7
"another"	1
"any"	1
"appetite"	1
"arisen"	1
"as"	2
"at"	3
"ate"	1
"awful"	1
"badly"	1
"be"	1
"beef"	1
"before"	2
"besides"	1
"bestir"	1
"bit"	1
"both"	1
"but"	3
"cabin"	1
"capacity"	1
"concluding"	1
"consider"	1
"could"	1
"day"	1
"deck"	1
"departed"	1
"did"	1
"dignity"	1
"dinner"	3
"do"	1
"down"	1
"dumfoundered"	1
"even"	1
"ever"	2
"fashioned"	1
"first"	1
"fish"	1
"flask"	9
"flask's"	2
"for"	5
"forecastle"	1
"from"	2
"fruits"	1
"get"	2
"glory"	1
"go"	1
"grudge"	1
"had"	5
"happens"	1
"have"	3
"he"	4
"hereby"	1
"higher"	1
"him"	2
"himself"	1
"his"	2
"holy"	1
"how"	1
"hunger"	1
"hungry"	1
"i"	5
"if"	2
"immortal"	1
"in"	7
"insanity"	1
"is"	3
"it"	5
"jammed"	1
"keep"	1
"known"	1
"last"	1
"less"	1
"life"	1
"light"	1
"lounging"	1
"man"	1
"mast"	1
"mere"	1
"moment"	1
"more"	2
"mouthfuls"	1
"much"	1
"must"	1

**Exercise:** create a two-step job which, after counting the number of occurrences of each word, sorts the words in order from most to least frequent.
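As a starting point, the logic of the two steps can be sketched in plain Python before wiring it into mrjob (where a multi-step job is defined by overriding the job's `steps()` method). This is only an illustration of the data flow, not the mrjob solution: `count_words`, `sort_by_frequency` and `sample_lines` are hypothetical names introduced here, and the alphabetical tie-breaking is an assumption.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")

def count_words(lines):
    """Step 1: produce (word, count) pairs, as the word-count job does."""
    counts = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            counts[word.lower()] += 1
    return list(counts.items())

def sort_by_frequency(word_counts):
    """Step 2: order the pairs from most to least frequent.
    Ties are broken alphabetically (an arbitrary choice for this sketch)."""
    return sorted(word_counts, key=lambda pair: (-pair[1], pair[0]))

# Made-up input lines, used only to exercise the two steps.
sample_lines = ["Flask was the last person down", "and Flask is the first man up"]
ranked = sort_by_frequency(count_words(sample_lines))
print(ranked[:3])
```

In a real job the second step cannot sort in memory like this; one common approach is to let the shuffle phase do the sorting by emitting the count as part of the key.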

## What is the meaning of "yield" ?

The keyword __yield__ is somewhat similar to __return__. However, while __return__ terminates the function and returns the result, 
a function that contains __yield__ behaves differently: calling it returns an object called a __generator__, without executing the body of the function even once. Each time a value is then requested from the generator, the function runs until it reaches a __yield__ command, the yielded value is returned, and the function halts (but does not terminate) until the next value is requested.

Here is a simple example:

In [24]:
def myrange(start,stop,step):
    value=start
    while value<=stop:
        yield value
        value += step
print [x for x in myrange(1.0,3.0,0.3)]

[1.0, 1.3, 1.6, 1.9000000000000001, 2.2, 2.5, 2.8]


In [25]:
print myrange(1.0,3.0,0.3)

<generator object myrange at 0x107c19640>
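To make the halt-and-resume behaviour visible, here is a small sketch that requests values one at a time with the built-in `next()`. The function `countdown` is a hypothetical helper, not part of the notebook, and `print` is written with parentheses so that the cell runs under both Python 2 and Python 3.

```python
def countdown(n):
    print("body started")      # runs only when the first value is requested
    while n > 0:
        yield n                # hand back n and suspend here
        n -= 1
    print("body finished")     # runs when the generator is exhausted

gen = countdown(2)             # nothing is printed: the body has not run yet
first = next(gen)              # prints "body started" and returns 2
second = next(gen)             # resumes after the yield and returns 1
remaining = list(gen)          # prints "body finished"; no values are left
```

Creating the generator costs almost nothing; the work happens only when values are pulled from it.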


In [26]:
gen1=myrange(1.0,3.0,0.3)
gen2=myrange(2.0,5.0,0.7)
print 'gen1:',[x for x in gen1]
print 'gen1:',[x for x in gen1]  # after the generator terminated, it does not yield any more values.
print 'gen2:',[x for x in gen2]

gen1: [1.0, 1.3, 1.6, 1.9000000000000001, 2.2, 2.5, 2.8]
gen1: []
gen2: [2.0, 2.7, 3.4000000000000004, 4.1000000000000005, 4.800000000000001]


A generator is similar to an array or a list: all of these are __iterable__ objects. However, while lists store all of their values in memory and can be read in any order, generators create the values on the fly and can only be traversed __once__ and __in order__.

It is the fact that values are generated on the fly and then discarded that makes generators attractive when processing large amounts of data: only a small amount of intermediate results, the outputs of the mapper which are inputs to the reducer, needs to be stored in memory. How much depends on the communication speed between mappers and reducers.
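The memory difference can be checked directly. The sketch below compares the size of a list with the size of a generator that produces the same values; the value of `n` is an arbitrary choice, and exact byte counts depend on the Python build, so none are quoted here.

```python
import sys

n = 100000
as_list = [i * i for i in range(n)]   # all n values are stored at once
as_gen = (i * i for i in range(n))    # values are produced one at a time

# getsizeof reports the container itself (for the list, its array of references),
# not the integers it refers to, so the list's true footprint is even larger.
list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)

total = sum(as_gen)                   # consumes the generator without building a list
print(gen_bytes < list_bytes)
```

The size of the generator object does not grow with `n`, whereas the list grows linearly.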

It is instructive to see how generators can be cascaded by passing a generator as a parameter to another generator.

In [27]:
def mycumul(values):   # values can be a list or a generator.
    s=0
    for value in values:
        s+=value
        yield s

In [28]:
# Here we pass a generator as an input to another generator.
gen3=mycumul(myrange(1.0,3.0,0.3))   

In [29]:
print 'gen3:',[x for x in gen3]

gen3: [1.0, 2.3, 3.9, 5.8, 8.0, 10.5, 13.3]


## Different modes of running an mrjob map-reduce job ##

Once the mapper, combiner and reducer have been written and tested, you can run the job on different types of infrastructure:

1. __inline__: run the job as a single process on the local machine.
1. __local__: run the job on the local machine, but using multiple processes to simulate parallel processing.
1. __hadoop__: run the job on a Hadoop cluster (such as the one we have in SDSC).
1. __EMR__ (Elastic MapReduce): run the job on a Hadoop cluster running on the Amazon cloud.

Below we run the same process we ran at the top using __local__ instead of the default __inline__. Observe that in this case the reducers have some non-trivial work to do even when combiners are used.
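Why the reducers still have work to do can be seen in a plain-Python simulation. Each combiner only sees the output of its own mapper, so a word that occurs in several input splits reaches the reducer as several partial sums. This is a simplified sketch, not how mrjob is implemented: the functions `mapper` and `combine` and the two hand-made `splits` are assumptions made for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def combine(pairs):
    """Sum the counts per word within ONE mapper's output."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())

# Two input splits, as if handled by two separate mapper processes.
splits = [["flask was last", "flask is first"], ["stubb and flask"]]

combined = []
for split in splits:
    pairs = []
    for line in split:
        pairs.extend(mapper(line))
    combined.append(combine(pairs))

# Reducer side: group the partial sums by word, then add them up.
grouped = defaultdict(list)
for partial in combined:
    for word, count in partial:
        grouped[word].append(count)
totals = dict((word, sum(counts)) for word, counts in grouped.items())

print("flask partial sums: %s, total: %d" % (grouped["flask"], totals["flask"]))
```

With a single split, as in inline mode on a small file, every word arrives at the reducer with exactly one partial sum, which is why the combiner and reducer counters were equal in the inline run above.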

## Running in local mode

In [61]:
t0=time()
!python mr_word_freq_count.py --runner=local $examples_dir/OneParagraph.txt > counts_local.txt
t1=time()
print 'total time',t1-t0

using configs in /Users/yoavfreund/.mrjob.conf
creating tmp directory /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.042414.760337
writing wrapper script to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.042414.760337/setup-wrapper.sh
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.042414.760337/step-0-mapper_part-00000
> sh -ex setup-wrapper.sh /Users/yoavfreund/anaconda/bin/python mr_word_freq_count.py --step-num=0 --mapper /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.042414.760337/input_part-00000 | sort | sh -ex setup-wrapper.sh /Users/yoavfreund/anaconda/bin/python mr_word_freq_count.py --step-num=0 --combiner > /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150502.042414.760337/step-0-mapper_part-00000
writing to /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq

In [63]:
!tail -10 counts_local.txt

"vanity"	1
"vengeance"	1
"was"	6
"were"	1
"what"	2
"when"	1
"who"	1
"will"	1
"wish"	1
"yet"	1


## Setting up configuration

In [None]:
from find_waiting_flow import *
flow_id = find_waiting_flow(key_id,secret_key)
flow_id

from Admin_flows import *
node = ''
job_flows=find_all_flows(key_id,secret_key)
for job in job_flows:
    if job.jobflowid == flow_id:
        node = job.masterpublicdnsname
print flow_id
print node
 
input_file = 'hdfs://'+node+':9000/weather.raw_data/ALL.csv'
 

j-153AYJ9PNJYUE WAITING
j-2NAO98NKRH37G WAITING
j-2UD4PDR1WMGH0 WAITING
j-29TWCEID490R1 WAITING
j-1K4OUFYCVV74C WAITING
j-3V3JJACKWH5O0 WAITING
got job runner
made EMR connection
<boto.emr.emrobject.JobFlow object at 0x7fab0d199610> no_script.devagarwal.20150512.221755.637027 j-153AYJ9PNJYUE WAITING
<boto.emr.emrobject.JobFlow object at 0x7fab0d0cdb90> no_script.devagarwal.20150512.230829.532600 j-2NAO98NKRH37G WAITING
<boto.emr.emrobject.JobFlow object at 0x7fab0d762150> no_script.devagarwal.20150515.015901.815658 j-2UD4PDR1WMGH0 WAITING
<boto.emr.emrobject.JobFlow object at 0x7fab0d762410> no_script.devagarwal.20150515.015927.004268 j-29TWCEID490R1 WAITING
<boto.emr.emrobject.JobFlow object at 0x7fab0d1f65d0> no_script.devagarwal.20150515.021620.182099 j-1K4OUFYCVV74C WAITING
<boto.emr.emrobject.JobFlow object at 0x7fab0d216710> no_script.devagarwal.20150515.021737.859314 j-3V3JJACKWH5O0 WAITING
got job runner

In [32]:
#flow_id='j-1I5Z82JUB89Q1'

## Running in EMR mode on existing job flow (Hadoop cluster)

In [33]:
import uuid

# Create unique output directory in the student's s3_bucket
output_dir = s3_bucket + str(uuid.uuid4()) + "/"

print output_dir

s3://dse-yfreund/bd92b0b6-3d1b-461d-921d-a87514ffbadd/


In [34]:
#!python mr_word_freq_count.py -r emr $root_dir/README.md --emr-job-flow-id=$flow_id --output-dir=$output_dir  > counts_emr.txt
!python mr_word_freq_count.py -r emr  $examples_dir/Moby-Dick.txt --emr-job-flow-id=$flow_id --output-dir=$output_dir  > counts_emr.txt

using configs in /Users/yoavfreund/.mrjob.conf
creating tmp directory /var/folders/80/c2kfvdvx5cx570r4vlzqgb840000gq/T/mr_word_freq_count.yoavfreund.20150503.035920.264922
Copying non-input files into s3://dse-yfreund/tmp/mr_word_freq_count.yoavfreund.20150503.035920.264922/files/
Adding our job to existing job flow j-1R1T1KONOLLZW
Job launched 30.6s ago, status WAITING: Waiting after step completed
Job launched 61.4s ago, status RUNNING: Running step (mr_word_freq_count.yoavfreund.20150503.035920.264922: Step 1 of 1)
Job launched 92.0s ago, status RUNNING: Running step (mr_word_freq_count.yoavfreund.20150503.035920.264922: Step 1 of 1)
Job completed.
Running time was 62.0s (not counting time spent waiting for the EC2 instances)
ec2_key_pair_file not specified, going to S3
Fetching counters from S3...
Waiting 5.0s for S3 eventual consistency
Counters may not have been uploaded to S3 yet. Try again in 5 minutes with: mrjob fetch-logs --counters j-1R1T1KONOLLZW
Counters from step 1:
  (n

In [46]:
!tail -40 counts_emr.txt

"wilds"	1
"will"	396
"winces"	1
"wind"	69
"windbound"	1
"wines"	1
"winnebago"	1
"wipe"	1
"witchery"	1
"withal"	3
"withdrawal"	1
"withdrawals"	1
"withhold"	2
"without"	164
"witty"	2
"womb"	1
"women's"	1
"wondered"	2
"wonderfully"	2
"wonderfulness"	1
"wondrously"	2
"wonst"	1
"wooden"	27
"woods"	10
"wool"	2
"woracious"	1
"worshipped"	3
"wrestlings"	1
"wretched"	8
"wrinkles"	15
"writers"	3
"writing"	5
"wrought"	6
"yawning"	3
"yells"	2
"yojo's"	2
"you're"	6
"yourselbs"	1
"yourself"	26
"zones"	3


In [19]:
!ls ../../data/text/Moby-Dick.txt

../../data/text/Moby-Dick.txt
