# DATASCI W261: Machine Learning at Scale

# Write some words to a file

In [1]:
!echo foo foo quux labs foo bar quux > WordCount.txt

# MrJob class for wordcount

In [2]:
%%writefile WordCount.py
from mrjob.job import MRJob
import re

# Match runs of word characters and apostrophes, e.g. "don't"
WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
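    def mapper(self, _, line):
        # Emit (word, 1) for every word on the input line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # Pre-aggregate the counts for a word on the map side
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Sum the partial counts for a word into its final total
        yield word, sum(counts)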

if __name__ == '__main__':
    MRWordFreqCount.run()

Writing WordCount.py


# Run the code from the command line locally

In [3]:
!python WordCount.py WordCount.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/WordCount.ssatpati.20151001.154753.264949
writing to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/WordCount.ssatpati.20151001.154753.264949/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/WordCount.ssatpati.20151001.154753.264949/step-0-mapper-sorted
> sort /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/WordCount.ssatpati.20151001.154753.264949/step-0-mapper_part-00000
writing to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/WordCount.ssatpati.20151001.154753.264949/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/WordCount.ssatpati.20151001.154753.264949/step-0-reducer_part-00000 -> /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/WordCount.ssatp

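The code above is straightforward. The mapper emits a (word, 1) key-value pair for every word, and the combiner sums the counts for each word locally, on the mapper's output. Finally, the reducer sums the partial counts for each word to produce the totals.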
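To make the data flow concrete, the following is a minimal plain-Python sketch of the same map, combine, and reduce stages, with no mrjob or Hadoop involved. The helper names `sum_by_key` and `word_count`, and the division of the sample text into two input splits, are assumptions made for illustration only; a real runner decides how the input is split.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    # Emit (word, 1) for every word on the line, as in MRWordFreqCount.mapper
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def sum_by_key(pairs):
    # Group (word, count) pairs by word and sum the counts.
    # The combiner and the reducer both apply this same logic.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def word_count(splits):
    combined = []
    for split in splits:
        mapped = [pair for line in split for pair in mapper(line)]
        # Combiner: pre-aggregate within one split (hypothetical split boundaries)
        combined.extend(sum_by_key(mapped))
    # Reducer: sum the partial counts across all splits
    return sum_by_key(combined)


# The sample text from WordCount.txt, divided into two illustrative splits
splits = [["foo foo quux labs"], ["foo bar quux"]]
result = word_count(splits)
for word, count in result:
    print((word, count))
```

In this sketch the combiner turns the first split's two `('foo', 1)` pairs into a single `('foo', 2)` before the reduce stage, which is the saving a combiner is meant to provide.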

# Run the code through python driver locally

####  Reminder: You cannot use the programmatic runner functionality in the same file as your job class. That is because the file with the job class is sent to Hadoop to be run. Therefore, the job file cannot attempt to start the Hadoop job, or you would be recursively creating Hadoop jobs!

Use make_runner() to run an MRJob:
1. Separate the driver from the MapReduce job.
2. The job can then be run from within a Python notebook.
3. Each mrjob job is a separate class and should live in its own file, apart from the driver.

In [4]:
from WordCount import MRWordFreqCount
mr_job = MRWordFreqCount(args=['WordCount.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    # stream_output() yields the job's raw output lines
    for line in runner.stream_output():
        # Parse each line into a (key, value) tuple; print() works on Python 2 and 3
        print(mr_job.parse_output_line(line))

('bar', 1)
('foo', 3)
('labs', 1)
('quux', 2)


# Put your access key info in the configuration file

## Check your configuration file path

If the config file path is empty, you need to set the environment variable MRJOB_CONF to '~/.mrjob.conf'.

See the find_mrjob_conf function at the following link to understand how mrjob finds the config file:
http://pydoc.net/Python/mrjob/0.4.0/mrjob.conf/

In [1]:
from mrjob import conf
conf.find_mrjob_conf()

'/Users/ssatpati/.mrjob.conf'
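The sketch below approximates the lookup order that find_mrjob_conf() appears to follow in mrjob 0.4: the path in MRJOB_CONF if that variable is set, then `~/.mrjob.conf`, then `/etc/mrjob.conf`, returning the first path that exists. The helper names `candidate_conf_paths` and `first_existing` are hypothetical, and the real function may differ in detail between mrjob versions, so treat this as an illustration rather than the library's implementation.

```python
import os


def candidate_conf_paths(environ=None):
    # Build the list of locations to check, in priority order (assumed order)
    environ = os.environ if environ is None else environ
    paths = []
    if 'MRJOB_CONF' in environ:
        # An explicit MRJOB_CONF setting is checked first
        paths.append(os.path.expanduser(environ['MRJOB_CONF']))
    # Per-user config file in the home directory
    paths.append(os.path.expanduser(os.path.join('~', '.mrjob.conf')))
    # System-wide config file
    paths.append('/etc/mrjob.conf')
    return paths


def first_existing(paths):
    # Return the first path that exists on disk, or None if none does
    for path in paths:
        if os.path.exists(path):
            return path
    return None


print(first_existing(candidate_conf_paths()))
```

If this prints `None`, no config file was found in any of the assumed locations, which corresponds to the "no configs found; falling back on auto-configuration" message in the local run.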

## Create or replace .mrjob.conf file
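A minimal sketch of the file contents, assuming the mrjob 0.4-era option names `aws_access_key_id` and `aws_secret_access_key` under the `emr` runner. The helper `render_mrjob_conf` and the placeholder key values are hypothetical; substitute your own credentials, and check the mrjob documentation for your version, since option names can change between releases.

```python
def render_mrjob_conf(access_key_id, secret_access_key):
    # Build the YAML text for a minimal .mrjob.conf with EMR credentials.
    # The option names are assumed from mrjob 0.4; the values are placeholders.
    lines = [
        "runners:",
        "  emr:",
        "    aws_access_key_id: %s" % access_key_id,
        "    aws_secret_access_key: %s" % secret_access_key,
    ]
    return "\n".join(lines) + "\n"


conf_text = render_mrjob_conf("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")
print(conf_text)
```

Save the resulting text to the path reported by find_mrjob_conf() (for example `~/.mrjob.conf`), and keep that file out of version control because it contains secrets.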

# Run the code from the command line on AWS

In [2]:
!python WordCount.py WordCount.txt -r emr

using configs in /Users/ssatpati/.mrjob.conf
using existing scratch bucket mrjob-d5ed1dc31babbe2c
using s3://mrjob-d5ed1dc31babbe2c/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/WordCount.ssatpati.20151003.032533.193129
writing master bootstrap script to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/WordCount.ssatpati.20151003.032533.193129/b.py

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

Copying non-input files into s3://mrjob-d5ed1dc31babbe2c/tmp/WordCount.ssatpati.20151003.032533.193129/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-20024AKKNNBRA
Created new job flow j-20024AKKNNBRA
Job launched 30.4s ago, status STARTING: Provisioning Amazon EC2 capacity
Job lau

# Run the code through a Python driver on AWS

In [7]:
from WordCount import MRWordFreqCount
mr_job = MRWordFreqCount(args=['WordCount.txt', '-r', 'emr'])
with mr_job.make_runner() as runner:
    runner.run()
    # stream_output() yields the job's raw output lines
    for line in runner.stream_output():
        # Parse each line into a (key, value) tuple; print() works on Python 2 and 3
        print(mr_job.parse_output_line(line))



('bar', 1)
('foo', 3)
('labs', 1)
('quux', 2)
