## mrjob ##

__mrjob__ is a software package developed by the restaurant recommendation company _Yelp_. 
Its goal is to simplify the deployment of map-reduce jobs, written in Python and based on Hadoop streaming, onto different 
frameworks, such as Hadoop on a private cluster or Hadoop on AWS (called EMR).

* You can read more about mrjob here: https://pythonhosted.org/mrjob/index.html  
* You can clone it from GitHub here: https://github.com/yelp/mrjob

In this notebook we run a simple word-count example, add some logging commands to it, and look at several modes of running the job.

In [None]:
import os
import sys

# Get environment variables set from utils/setup.sh
home_dir = os.environ['HOME']
root_dir = os.environ['BD_GitRoot']

# Add utils to the python system path
sys.path.append(root_dir + '/utils')

# Read AWS credentials from 'EC2_VAULT'/Creds.pkl 
from read_mrjob_creds import *
(key_id, secret_key, s3_bucket, username) = read_credentials()

examples_dir = root_dir + '/notebooks/mrjob/'
!ls -l $examples_dir

In [None]:
filename=examples_dir+'mr_word_freq_count.py'
print filename

!ls -al $filename

In [None]:
# load the example code from mrjob as a starting point
%load $filename

In [None]:
%%writefile mr_word_freq_count.py
#!/usr/bin/python
# Copyright 2009-2010 Yelp
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re
from sys import stderr

WORD_RE = re.compile(r"[\w']+")

#logfile=open('log','w')
logfile=stderr

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            logfile.write('mapper '+word.lower()+'\n')
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        #yield (word, sum(counts))
        l_counts=[c for c in counts]  # extract list from iterator
        S=sum(l_counts)
        logfile.write('combiner '+word+' ['+','.join([str(c) for c in l_counts])+']='+str(S)+'\n')
        yield (word, S)

    def reducer(self, word, counts):
        #yield (word, sum(counts))
        l_counts=[c for c in counts]  # extract list from iterator
        S=sum(l_counts)
        logfile.write('reducer '+word+' ['+','.join([str(c) for c in l_counts])+']='+str(S)+'\n')
        yield (word, S)

if __name__ == '__main__':
    MRWordFreqCount.run()


In [None]:
!python mr_word_freq_count.py $root_dir/README.md > counts_local.txt

In [None]:
!cat counts_local.txt

## What is the meaning of "yield"?

The keyword __yield__ is somewhat similar to __return__. However, while __return__ terminates the function and returns the result, 
a function that contains __yield__ behaves differently: calling it returns an object called a __generator__, without executing any of the function body. Each time the next value is requested from the generator (for example by a `for` loop), the function runs until it reaches a __yield__ command, hands back that value, and pauses (but does not terminate) with its local variables intact until the next value is requested.
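
The pause-and-resume behaviour can be traced step by step with the built-in `next`. The sketch below is only an illustration: the names `countdown` and `trace` are made up for this example, and the code is written so that it should run under both Python 2 and Python 3.

```python
trace = []  # records the order in which things happen


def countdown(n):
    """Hypothetical generator function that yields n, n-1, ..., 1."""
    trace.append('body started')
    while n > 0:
        trace.append('about to yield %d' % n)
        yield n          # hand back n and pause here
        n -= 1
    trace.append('body finished')


gen = countdown(2)       # creates the generator; the body has not run yet
trace.append('generator created')

first = next(gen)        # runs the body up to the first yield
trace.append('got %d' % first)

second = next(gen)       # resumes after the first yield, stops at the second
trace.append('got %d' % second)

rest = list(gen)         # resumes again; the loop ends, so nothing is left

for event in trace:
    print(event)
```

Note that `'generator created'` is recorded before `'body started'`: the body only begins to run when the first value is requested.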

Here is a simple example:

In [None]:
def myrange(start,stop,step):
    value=start
    while value<=stop:
        yield value
        value += step
print [x for x in myrange(1.0,3.0,0.3)]

In [None]:
print myrange(1.0,3.0,0.3)

In [None]:
gen1=myrange(1.0,3.0,0.3)
gen2=myrange(2.0,5.0,0.7)
print 'gen1:',[x for x in gen1]
print 'gen1:',[x for x in gen1]  # after the generator terminated, it does not yield any more values.
print 'gen2:',[x for x in gen2]

A generator is similar to an array or a list: all of these are __iterable__ objects. However, while lists store all of their values in memory and can be read in any order, generators create the values on the fly and can only be traversed __once__ and __in order__.

It is the fact that values are generated on the fly and then discarded that makes generators attractive when processing large amounts of data: only a small amount of intermediate results (the outputs of the mapper, which are the inputs to the reducer) needs to be stored in memory. How much depends on the communication speed between mappers and reducers.
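
As a rough illustration of the memory argument, the sketch below compares a list holding many values with a generator that produces the same values one at a time. The function name `squares` and the size `N` are made up for this example. `sys.getsizeof` reports only the size of the container object itself, not of the numbers it refers to, so the real gap is larger than the one printed.

```python
import sys


def squares(n):
    """Hypothetical generator: yields 0, 1, 4, ... one value at a time."""
    i = 0
    while i < n:
        yield i * i
        i += 1


N = 100000

as_list = list(squares(N))   # all N values are held in memory at once
as_gen = squares(N)          # no values have been computed yet

list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)

# Both can be consumed by sum(); the generator never holds more than one value.
total_list = sum(as_list)
total_gen = sum(as_gen)

print('list container:   %d bytes' % list_bytes)
print('generator object: %d bytes' % gen_bytes)
print('same total: %s' % (total_list == total_gen))
```

The size of the generator object does not depend on `N`, whereas the list grows with the number of values it stores.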

It is instructive to see how generators can be cascaded by passing a generator as a parameter to another generator.

In [None]:
def mycumul(values):   # values can be a list or a generator.
    s=0
    for value in values:
        s+=value
        yield s

In [None]:
# Here we pass a generator as an input to another generator.
gen3=mycumul(myrange(1.0,3.0,0.3))   

In [None]:
print 'gen3:',[x for x in gen3]

## Different modes of running an mrjob map-reduce job ##

Once the mapper, combiner and reducer have been written and tested, you can run the job on different types of infrastructure:

1. __inline__: run the job as a single process on the local machine.
1. __local__: run the job on the local machine, but using multiple processes to simulate parallel processing.
1. __hadoop__: run the job on a Hadoop cluster (such as the one we have at SDSC).
1. __EMR__ (Elastic MapReduce): run the job on a Hadoop cluster running on the Amazon cloud.

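Below we run the same job we ran at the top, using __local__ instead of the default __inline__. Observe that in this case the reducers have some non-trivial work to do even when combiners are used: each combiner only sees the output of its own mapper process, so a word that occurs in more than one part of the input reaches the reducer as several partial counts.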
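
The effect can be imitated in plain Python, without mrjob. The sketch below is a simplified simulation and not how mrjob implements its runners: it assumes that the input is divided into splits, that each split is handled by one mapper followed by its own combiner, and that the reducer receives the combined counts from all splits. The helper names `combine` and `run` are made up for this example.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every word in the line."""
    for word in WORD_RE.findall(line):
        yield (word.lower(), 1)


def combine(pairs):
    """Sum the counts per word within the output of a single mapper."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())


def run(splits):
    """Return, for each word, the list of counts the reducer would receive.

    splits is a list of lists of lines, one inner list per simulated mapper.
    """
    reducer_input = defaultdict(list)
    for lines in splits:
        pairs = (pair for line in lines for pair in mapper(line))
        for word, count in combine(pairs):
            reducer_input[word].append(count)
    return dict(reducer_input)


lines = ["the cat sat", "the dog sat", "the end"]

one_split = run([lines])                    # a single mapper sees everything
two_splits = run([lines[:2], lines[2:]])    # two mappers, two combiners

print("one split:  the -> %s" % one_split['the'])
print("two splits: the -> %s" % two_splits['the'])
```

With a single split the reducer receives one count for `the` and has nothing left to add. With two splits it receives two partial counts that it still has to sum.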

## Running in local mode

In [None]:
!python mr_word_freq_count.py --runner=local $root_dir/README.md > counts_local.txt

In [None]:
!cat counts_local.txt

## Setting up configuration

In [None]:
from find_waiting_flow import *
flow_id = find_waiting_flow(key_id,secret_key)

## Running in EMR mode on an existing job flow (Hadoop cluster)

In [None]:
import uuid

# Create unique output directory in the student's s3_bucket
output_dir = s3_bucket + str(uuid.uuid4()) + "/"

print output_dir

In [None]:
!python mr_word_freq_count.py -r emr $root_dir/README.md --emr-job-flow-id=$flow_id --output-dir=$output_dir  > counts_emr.txt

In [None]:
!cat counts_emr.txt

In [None]:
# TODO: mr_travelling_salesman is missing from GitHub repo
%load $root_dir/examples/mr_travelling_salesman/README.rst

### HW ###
1) Look around in the examples directory
2) Write a map-reduce job that computes the PCA of a large set of vectors (use as input the max_temp profiles in 
   /home/ubuntu/data/weather/SAMPLE_TMAX.csv)

**Hint:** One map-reduce job is enough. You might think that you first need to compute the means $\mu_i=E(X_i)$ and then, in a second pass, compute
$$cov(X_i,X_j) = E((X_i-\mu_i)(X_j-\mu_j))$$
However, recall the formula 
$$ var(X) \doteq E((X-\mu)^2) = E(X^2) - E(X)^2 $$
This formula can be generalized to the $cov$ matrix.
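
For the scalar case, the variance formula can be checked numerically. The sketch below covers only the one-dimensional variance, not the PCA exercise itself, and all function names in it are made up for this example. It shows that the three running sums $n$, $\sum x$ and $\sum x^2$ are enough, and that sums computed on separate chunks of the data can simply be added together.

```python
def partial_sums(values):
    """One pass over the values: return (n, sum of x, sum of x squared)."""
    n, s1, s2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        s1 += x
        s2 += x * x
    return (n, s1, s2)


def merge(a, b):
    """Add two triples of partial sums component by component."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def variance_from_sums(sums):
    """var(X) = E(X^2) - E(X)^2, computed from the running sums."""
    n, s1, s2 = sums
    mean = s1 / n
    return s2 / n - mean * mean


def two_pass_variance(values):
    """Reference: compute the mean first, then the mean squared deviation."""
    values = list(values)
    mean = sum(values) / float(len(values))
    return sum((x - mean) ** 2 for x in values) / len(values)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Pretend the data was processed in two separate chunks.
merged = merge(partial_sums(data[:3]), partial_sums(data[3:]))

var_one_pass = variance_from_sums(merged)
var_two_pass = two_pass_variance(data)

print('one pass: %.6f' % var_one_pass)
print('two pass: %.6f' % var_two_pass)
```

One caveat: when the mean is large compared with the spread of the data, subtracting two nearly equal numbers in the one-pass formula can lose floating-point precision.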

In [None]:
!wc /home/ubuntu/data/weather/SAMPLE_TMAX.csv