## Introduction and Motivation

Big data processing is an important part of data science. By making use of big data, we can obtain interesting insights and more reliable results.
<img src=http://chuantu.biz/t6/269/1522551559x-1566683256.png />

<br> The figure above shows the Google search trends for Big Data (red) and Data Science (blue), indicating growing search interest in both areas over the years.

<br>This tutorial introduces a popular programming model, **MapReduce**, which focuses on processing large datasets in parallel.
By introducing the basic concepts and walking through several examples of programming with MapReduce in detail, it aims to give you a good understanding of the framework and make you more comfortable using it in your future work.

## Basic Idea of MapReduce in a Programmer's View
MapReduce was created to make parallel programming over large datasets simpler. Previously, it took a large amount of code to parallelize a computation across thousands of machines, distribute data among them, and recover from failures. With the MapReduce programming model, things become much easier: a programmer only needs to write the code that is directly related to the computation on the dataset, without worrying about the issues caused by running it on a cluster of machines.
 
 <br> Before learning to program with Hadoop MapReduce (an implementation of MapReduce), it would be helpful to know about the basic workflow of a MapReduce job.
<img src=http://chuantu.biz/t6/269/1522551484x-1404812823.png />
<center><figcaption>MapReduce workflow overview (Referenced from [Advanced Cloud Computing](https://www.cs.cmu.edu/~15719/) course of CMU)</figcaption></center>

<br> A MapReduce job is divided into map tasks and reduce tasks. A map task runs the user-defined map function, while a reduce task runs the user-defined reduce function. The input and output of a MapReduce job are stored in a distributed file system, **HDFS**. The main workflow of a MapReduce job is as follows.

+ **Map Phase**: Each map task runs the map function on its own portion of the input, taking in a <font color=INDIANRED>(key, value)</font> pair and emitting zero or more intermediate <font color=INDIANRED>(key, value)</font> pairs. The output is sorted by key within the mapper and stored on the mapper's local disk.


+ **Reduce Phase**:
  + **Shuffle stage**:  <font color=INDIANRED>(key, value)</font> pairs are sent to the corresponding reducers for further processing. Usually, the reducer for a key is chosen by `hash(key) % number_of_reducers`, regardless of which mapper produced the pair, which guarantees that each key is assigned to exactly one reducer. This stage is time-consuming because it usually involves reading from the disks of remote machines.
  + **Merge & Sort stage**: <font color=INDIANRED>(key, value)</font> pairs from different mappers are merged and sorted in the reducer. Values for the same key can optionally be pre-aggregated by specifying a combiner function, which typically runs on the mapper's output before the shuffle.
  + **Reduce stage**: Each reducer runs the reduce function on one or more keys. It takes in the mappers' output <font color=INDIANRED>(key, value)</font> pairs, which have been merged and sorted by key. The output of the reduce function is also a set of <font color=INDIANRED>(key, value)</font> pairs, which are written to HDFS as the final output of the MapReduce job.
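
The partitioning rule from the shuffle stage can be sketched in a few lines. This is only an illustration: the helper name `assign_reducer` and the sample pairs are hypothetical, and `zlib.crc32` is used as a stand-in hash because Python's built-in `hash` for strings can differ between processes.

```python
import zlib

def assign_reducer(key, number_of_reducers):
    """Return the index of the reducer responsible for `key`."""
    # crc32 is stable across runs and machines, unlike Python's str hash
    return zlib.crc32(key.encode("utf-8")) % number_of_reducers

# Intermediate pairs as two different mappers might emit them
mapper_a_output = [("foo", 1), ("bar", 1), ("foo", 1)]
mapper_b_output = [("foo", 1), ("quux", 1)]

number_of_reducers = 3
partitions = {i: [] for i in range(number_of_reducers)}
for pair in mapper_a_output + mapper_b_output:
    partitions[assign_reducer(pair[0], number_of_reducers)].append(pair)

# Every occurrence of a key ends up in the same partition
for reducer_id, pairs in partitions.items():
    print(reducer_id, pairs)
```

Whichever mapper emitted a `foo` pair, all of them land in one partition, which is what lets a single reducer see every value for that key.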

Within each stage, tasks can be processed in parallel. For example, different mappers run in parallel, each processing its own split. However, a reducer can start executing the reduce function only after all mappers have finished their tasks and its own shuffle and merge & sort stages are complete for its input partition.
<br>Understanding the basic concepts of MapReduce will help you design, optimize, and debug your own programs. With MapReduce, you only need to design the data processing itself by working out your map and reduce functions.

 #### Map Function 
HDFS stores files in fixed-size blocks (64MB by default in Hadoop 1, 128MB in Hadoop 2 and later), and each block of an input file becomes an **input split** for the MapReduce workflow. Each mapper processes one input split by executing the map function on it.
<br> The map function takes in a <font color=INDIANRED>(key, value)</font> pair. Sometimes it is a <font color=INDIANRED>(, line)</font> pair, where no key is specified and the value is a line of the input file. The function can generate any number of intermediate <font color=INDIANRED>(key, value)</font> pairs.
 
 #### Reduce Function 
 The reduce function takes in the <font color=INDIANRED>(key, value)</font> pairs produced by the map function, grouped by key, and produces the final <font color=INDIANRED>(key, value)</font> pairs that will be stored in HDFS.
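
To make the two functions concrete, here is a minimal in-memory sketch of the whole flow. The names `map_fn`, `reduce_fn` and `run_job` are hypothetical, and the sketch ignores distribution, HDFS, and fault tolerance; it only mirrors the map, sort/group, and reduce order described above.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    """Map: one input line -> zero or more (word, 1) pairs."""
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    """Reduce: one key and all of its values -> final (key, value) pairs."""
    yield key, sum(values)

def run_job(lines):
    # Map phase: apply map_fn to every input record
    intermediate = [pair for line in lines for pair in map_fn(None, line)]
    # Shuffle / merge & sort: bring identical keys next to each other
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: call reduce_fn once per distinct key
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(key, (value for _, value in group)))
    return results

print(run_job(["foo foo quux", "labs foo bar quux"]))
# [('bar', 1), ('foo', 3), ('labs', 1), ('quux', 2)]
```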

## Simulate a Hadoop MapReduce Streaming 
After this brief overview of the MapReduce workflow, let's jump into the classic "WordCount" example and show how Hadoop MapReduce streaming (one implementation of MapReduce) works on our local machine.

<br> We are simulating the MapReduce workflow on a local machine. Instead of reading input from and writing output to HDFS or the mapper's local disk, we will read from standard input and write to standard output.

<br> **Note that most of the code cannot be run directly in the notebook. Follow the instructions and run the Python files in the directory from the command line.**

My map and reduce functions have a simple design. The map function takes in a line and emits a (word,1) pair for every word in it. The reducer aggregates all values for the same key to get the final count of each word.

Here is the map function saved in `WordCountmapper.py`

In [21]:
#!/usr/bin/env python
# Change the shebang line above if your Python 2/3 interpreter (3 preferred) lives elsewhere
import sys

# Read lines from standard input
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # Emit a tab-separated (word, 1) pair on standard output
        print('{0}\t{1}'.format(word, 1))

And the reducer function is saved in `WordCountReducer.py`.

In [22]:
#!/usr/bin/env python
# Change the shebang line above if your Python 2/3 interpreter (3 preferred) lives elsewhere
import sys

# Running total for every word seen so far
word_count_dict = {}

# Input comes from STDIN as tab-separated (word, count) lines
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip blank or malformed lines
        continue

    word_count_dict[word] = word_count_dict.get(word, 0) + count

for word in word_count_dict:
    print('{0}\t{1}'.format(word, word_count_dict[word]))

You could simulate on your local machine with the command<br>
`echo "foo foo quux labs foo bar quux" | ./WordCountmapper.py | sort | ./WordCountReducer.py`

In [23]:
import subprocess
command = 'echo "foo foo quux labs foo bar quux" | ./WordCountmapper.py | sort | ./WordCountReducer.py'
output = subprocess.check_output(command, shell=True)
print(output.decode())

labs	1
quux	2
foo	3
bar	1



Although we only simulate the job locally with standard input and output, **note that `WordCountmapper.py` and `WordCountReducer.py` are still a correct implementation of word count for running on real clusters.**
If you are interested, please check out the [introduction video of CMU's Cloud Computing course](https://youtu.be/39tUXYjW4co), which shows you step by step how to run your code on clusters.
<br> It is always good practice to run on a small dataset locally before you launch a job on a large dataset on real clusters.

Although the reduce function above produces a correct word count, it does not take advantage of the fact that the keys are already sorted before the reduce function runs. We can optimize it to save space and avoid memory problems on large datasets: because identical words arrive consecutively, the reducer only needs to keep the current word and its running count. The following implementation assumes the pairs are sorted in advance. It is saved in `WordCountReducer2.py`

In [None]:
#!/usr/bin/env python
import sys

curr_word = None
curr_count = 0

# Input comes from STDIN, already sorted by word
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip blank or malformed lines
        continue

    if word == curr_word:
        curr_count += count
    else:
        # A new word starts: emit the finished word, then reset the counter
        if curr_word is not None:
            print('{0}\t{1}'.format(curr_word, curr_count))
        curr_word = word
        curr_count = count

# Emit the last word, if any input was seen
if curr_word is not None:
    print('{0}\t{1}'.format(curr_word, curr_count))

You could simulate on your local machine with the command <br>
`echo "foo foo quux labs foo bar quux" | ./WordCountmapper.py | sort | ./WordCountReducer2.py`

In [31]:
command = 'echo "foo foo quux labs foo bar quux" | ./WordCountmapper.py | sort | ./WordCountReducer2.py'
output = subprocess.check_output(command, shell=True)
print(output.decode())

bar	1
foo	3
labs	1
quux	2
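
The same sorted-input idea can also be expressed with `itertools.groupby` from the standard library. This is an alternative sketch, not one of the files used above; the function names `parse` and `reduce_sorted_lines` are hypothetical, and it assumes tab-separated `word<TAB>count` lines that are already sorted by word.

```python
import io
from itertools import groupby

def parse(lines):
    """Yield (word, count) pairs, skipping blank or malformed lines."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        if sep and count.isdigit():
            yield word, int(count)

def reduce_sorted_lines(lines):
    """Sum counts per word; only one group is processed at a time."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

# Stand-in for the sorted mapper output that would arrive on standard input
sorted_input = io.StringIO("bar\t1\nfoo\t1\nfoo\t1\nfoo\t1\nlabs\t1\nquux\t1\nquux\t1\n")
for word, total in reduce_sorted_lines(sorted_input):
    print('{0}\t{1}'.format(word, total))
```

In a real streaming reducer, `sys.stdin` would take the place of the `io.StringIO` stand-in.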



## Introduction of MRJob

From the previous example, you may have noticed that writing map and reduce functions by hand is not easy, even for a simple job. You can only run one MapReduce job in each run. If your program consists of many MapReduce steps, you have to set up clusters multiple times.

<br> Here I would like to introduce a Python library called `mrjob`, which makes it much easier and more intuitive to write programs that consist of several MapReduce steps. First, let's install it:

`pip install mrjob`

Now I'd like to show you how a WordCount program is written using `mrjob`. The program is saved in `./wordcount.py`. The input is the same as before, but saved in `./input.txt`.
Run the program from the command line in the working directory: <br>
```python wordcount.py input.txt```

In [None]:
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):
    def mapper(self, _, line):
        # The key is ignored; the value is one raw line of input
        for word in line.split():
            # Emit a (word, 1) pair for every word
            yield word, 1

    def reducer(self, word, counts):
        # counts is a generator over all values emitted for this word
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

In [43]:
command = 'python wordcount.py input.txt'
output = subprocess.check_output(command, shell=True)
print(output.decode())

"bar"	1
"foo"	3
"labs"	1
"quux"	2



+ With `mrjob`, all you have to do is write the mapper, combiner, and reducer functions in a very intuitive way. I'm not going to introduce the combiner function in detail here, but a reference is given at the end.
  + For the mapper, the input is a (key,value) pair that has already been parsed. Note that the value is raw input, so you have to convert it to the data type you want.
  + For the reducer, the input is (key,values), where key is generated by the previous mapper and values is a Python `generator` that yields all values emitted by all mappers for the given key.
  
  In the word count example, the map function converts a raw input line into several `(word,1)` pairs. The reducer function handles one key and all of its values in the generator, using `sum` to aggregate them.
  
+ `mrjob` lets you write all MapReduce steps in one single file.

+ It lets you test locally, run on a Hadoop cluster, run on Amazon EMR, and run on Google Cloud Dataproc with simple configuration.
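
Because the reducer's `values` argument is a generator, it can be iterated only once. The plain-Python sketch below (no `mrjob` required; all function names are hypothetical) shows the pitfall and one way around it when a reducer needs both a sum and a count.

```python
def broken_average(values):
    total = sum(values)          # consumes the generator
    count = len(list(values))    # generator is now empty, so count == 0
    return total / count if count else None

def working_average(values):
    values = list(values)        # materialize once; fine when a key's values fit in memory
    return sum(values) / len(values) if values else None

def ratings():
    # Stand-in for the values generator a reducer receives for one key
    yield from (4, 5, 3)

print(broken_average(ratings()))   # None: the second pass saw nothing
print(working_average(ratings()))  # 4.0
```

Materializing the values is the simplest fix; for keys with very many values, accumulating the sum and the count in a single loop avoids holding the list in memory.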

## Example Application: Calculate Movie Similarities

In this section, I'm going to use a more complex example to show how to write a Hadoop MapReduce program in Python using `mrjob`.

Knowing the similarity between two movies can help us do things like recommendations. Similarity can be defined in many ways. The measure I use is [Pearson's correlation coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) (which was used in HW2 RD), computed over the ratings that the same users gave to both movies.

By measuring the correlation coefficient between different movies, the program outputs the top K movie pairs that are most similar.
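
As a reference point for the similarity measure, here is a small standard-library sketch of the Pearson correlation written in terms of sums. The function name `pearson` and the sample ratings are made up for illustration, and returning `None` for zero variance is a choice made for this sketch only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists, computed from sums.

    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    Returns None when either list has zero variance (r is undefined).
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    if denominator == 0:
        return None
    return (n * sxy - sx * sy) / denominator

# Two movies rated by the same five users (made-up ratings)
movie_a = [5, 4, 3, 2, 1]
movie_b = [4, 4, 3, 1, 2]
print(pearson(movie_a, movie_b))
```

A value close to 1 means users who liked one movie tended to like the other; a value close to -1 means the opposite.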

### Dataset
The dataset I use is [movielens_100k](https://grouplens.org/datasets/movielens/100k/), which collects real users' ratings of movies. The dataset contains 100k ratings from 1000 users on 1700 movies.
<br>The input data is in u.data, each line is one record of (user_id   movie_id    rating   timestamp)
<br>The movie info is in u.item, each line is one record of (movie id | movie title |...)
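
As a rough illustration of the two record layouts, the snippet below parses one line of each kind. The helper names are hypothetical, and the sample lines are made up to match the layouts described above (whitespace-separated fields in `u.data`, `|`-separated fields in `u.item`); they are not copied from the dataset.

```python
def parse_rating(line):
    """Split a u.data record into (user_id, movie_id, rating, timestamp)."""
    user_id, movie_id, rating, timestamp = line.split()
    return user_id, movie_id, float(rating), int(timestamp)

def parse_movie(line):
    """Return (movie_id, movie_title) from a u.item record."""
    fields = line.rstrip('\n').split('|')
    return fields[0], fields[1]

# Hypothetical sample records following the layouts described above
print(parse_rating("12\t345\t4\t880000000"))
print(parse_movie("345|Some Movie (1990)|01-Jan-1990||"))
```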


### Steps of my design:

+ Step one:
  + mapper ( `mapper_extract_data` ): Read a line from u.data and emit a (user_id,(movie_id,rating)) pair
  + reducer ( `reducer_combine_singleUser_ratings` ): Put all ratings from one user in a list and produce a (user_id, [(movie_id1,rating1),(movie_id2,rating2),(movie_id3,rating3),...]) pair


+ Step two:
  + mapper ( `mapper_generate_rating_pairs` ): The input is the output of the previous step. It takes every combination of two (movie_id, rating) entries in the list and produces rating pairs ((m1,m2),(r1,r2)) from the same user. Note that the movie ids in each pair are kept in a fixed order, so that ratings of the same movie pair from different users are treated as the same key and go to one reducer. This step generates a huge dataset to shuffle.
  + reducer ( `reducer_cosine_similarity` ): Combine all rating pairs of the same movie pair and calculate the correlation between the two movies from these ratings (despite its name, the function computes the Pearson correlation). Note that if no more than 10 users rated both movies, the result is discarded. The output is (None, ((m1,m2),coefficient))

+ Step three:
  + reducer ( `find_topK_moviepairs` ): Before the reducer runs, an initializer creates a dictionary that maps `movie_id` to `movie_name`. All input pairs go to one machine because they all have `key=None`, which enables global ranking. Note that some different movie_ids correspond to the same movie, such as 266 and 680; such results are eliminated.
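
The pairing logic of step two's mapper can be tried out on its own with `itertools.combinations`. In this sketch the function name `rating_pairs` and the sample ratings are hypothetical; movie ids are strings and are compared as strings, and fixing the order inside each pair is what makes ratings of the same two movies from different users share one key.

```python
from itertools import combinations

def rating_pairs(all_ratings):
    """Yield ((m1, m2), (r1, r2)) with m1 < m2 for every pair of rated movies."""
    for (m1, r1), (m2, r2) in combinations(all_ratings, 2):
        if m1 == m2:
            continue  # ignore duplicate ratings of the same movie
        if m1 < m2:
            yield (m1, m2), (r1, r2)
        else:
            yield (m2, m1), (r2, r1)

# Two users who rated the same movies in a different order
user_a = [("50", 5.0), ("172", 4.0), ("133", 1.0)]
user_b = [("172", 3.0), ("50", 4.0)]

for key, ratings in rating_pairs(user_a):
    print(key, ratings)
print(list(rating_pairs(user_b)))
```

A user with n ratings produces n(n-1)/2 pairs, which is why this step generates so much data to shuffle.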
  
The steps are configured in the `steps` function.
Also, I add a command-line argument for the movie info file in `configure_args`.
<br> The file is saved as `./movierating.py`

In [None]:
from mrjob.job import MRJob
from mrjob.step import MRStep
import numpy as np
import math
from itertools import combinations

class MRMovieSimilarity(MRJob):

    # (user_id,movie_id,rating,timestamp) -> (user_id,(movie_id,rating))
    def mapper_extract_data(self, _, record):
        user_id, movie_id, rating, timestamp = record.split()
        yield user_id, (movie_id, rating)

    # (user_id,(movie_id,rating)) -> (user_id,[(m1,r1),(m2,r2)...])
    def reducer_combine_singleUser_ratings(self, user_id, rating_info):
        all_ratings = []
        for movie_id, rating in rating_info:
            all_ratings.append((movie_id, rating))
        yield user_id, all_ratings

    # (userid,[(m1,r1),(m2,r2)...]) -> ((m1,m2),(r1,r2))...
    def mapper_generate_rating_pairs(self,_,all_ratings):
        for movie1info,movie2info in combinations(all_ratings,2):
            m1 = movie1info[0]
            r1 = movie1info[1]
            m2 = movie2info[0]
            r2 = movie2info[1]
            if m1 < m2:
                yield (m1,m2),(r1,r2)
            if m2 < m1:
                yield (m2,m1),(r2,r1)

    def calculate_correlation(self,x,y):
        # Pearson correlation computed from sums:
        # r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))
        n = len(x)
        x = np.array(x)
        y = np.array(y)

        x_sum = x.sum()
        y_sum = y.sum()

        x_sqr_sum = (x ** 2).sum().item()
        y_sqr_sum = (y ** 2).sum().item()
        xy_sum = (x * y).sum()

        res = (n * xy_sum - x_sum * y_sum) / (
            math.sqrt(n * x_sqr_sum - x_sum ** 2) * math.sqrt(n * y_sqr_sum - y_sum ** 2))

        return res

    # (m1,m2),[(r1,r2),...] -> (None,((m1,m2),Pearson correlation))
    # Note: despite its name, this reducer computes the Pearson correlation, not cosine similarity
    def reducer_cosine_similarity(self,movie_pair,rating_pair):
        rating_1 = []
        rating_2 = []
        for r1,r2 in rating_pair:
            rating_1.append(float(r1))
            rating_2.append(float(r2))
        # Keep only movie pairs rated by more than 10 common users
        if len(rating_1) > 10:
            sim = self.calculate_correlation(rating_1,rating_2)
            yield None,(movie_pair,sim)

    def map_movie_names(self,movie_pair):
        if  self.name_dict[movie_pair[0]] == self.name_dict[movie_pair[1]]:
            return None
        return self.name_dict[movie_pair[0]],self.name_dict[movie_pair[1]]

    def find_topK_moviepairs(self,_,sim_info):
        K = 20
        cnt = 0
        sorted_siminfo = sorted(list(sim_info), key=lambda a: float(a[1]), reverse=True)
        for movie_pair,score in sorted_siminfo:
            if self.map_movie_names(movie_pair):
                yield self.map_movie_names(movie_pair),score
                cnt += 1
            if cnt >= K:
                break

    def get_movies_names(self):
        # Build the movie_id -> movie_name lookup from the file passed via --items
        self.name_dict = {}
        with open(self.options.items, encoding="ISO-8859-1") as f:
            for line in f:
                fields = line.split('|')
                self.name_dict[fields[0]] = fields[1]


    def configure_args(self):
        super(MRMovieSimilarity, self).configure_args()
        self.add_file_arg('--items', help='u.item (which contains movie info) path')


    def steps(self):
        return [MRStep(mapper=self.mapper_extract_data, reducer=self.reducer_combine_singleUser_ratings),
                MRStep(mapper=self.mapper_generate_rating_pairs, reducer = self.reducer_cosine_similarity),
                MRStep( reducer_init=self.get_movies_names, reducer=self.find_topK_moviepairs)]

if __name__ == '__main__':
    MRMovieSimilarity.run()


Use python3 to run the program 
```
 python3 ./movierating.py ./u.data --items ./u.item
```
It takes about 10 minutes to run on a local machine.

In [41]:
# This may fail on your machine; if so, specify your python3 interpreter explicitly
# and make sure numpy and mrjob are installed in that environment, for example
# "/Users/momo/anaconda/envs/MRtutorial/bin/python ./movierating.py ./u.data --items ./u.item"
command = 'python3 ./movierating.py ./u.data --items ./u.item'
output = subprocess.check_output(command, shell=True)
print(output.decode())

["Jackie Chan's First Strike (1996)", "Wings of Desire (1987)"]	0.9728148411099612
["Fire Down Below (1997)", "Shadow Conspiracy (1997)"]	0.966190518580404
["Air Bud (1997)", "Free Willy 3: The Rescue (1997)"]	0.9658676311477036
["Tin Drum, The (Blechtrommel, Die) (1979)", "True Romance (1993)"]	0.9624980365728483
["Chain Reaction (1996)", "Shadow Conspiracy (1997)"]	0.958569328934402
["Sling Blade (1996)", "Some Folks Call It a Sling Blade (1993)"]	0.9579176872327142
["Event Horizon (1997)", "Wes Craven's New Nightmare (1994)"]	0.9510703097740938
["Hamlet (1996)", "Raise the Red Lantern (1991)"]	0.9386740768532977
["Once Upon a Time... When We Were Colored (1995)", "Rosewood (1997)"]	0.9343666883698424
["Local Hero (1983)", "Raise the Red Lantern (1991)"]	0.9340289249886088
["Young Guns (1988)", "Beverly Hills Ninja (1997)"]	0.928950669521923
["War, The (1994)", "Outbreak (1995)"]	0.9280125662443223
["D3: The Mighty Ducks (1996)", "Batman Forever (1995)"]	0.9264365695804326
["Fly Away

If you have an Amazon AWS account, you can run this program on real EMR clusters as well. See the [quick reference](https://pythonhosted.org/mrjob/guides/emr-quickstart.html) for help.


## Some Thoughts on MapReduce
+ Each MapReduce step reads its input from and writes its output to HDFS, which is time-consuming. As a result, MapReduce is not well suited to iterative jobs. (Use Spark instead.)
+ MapReduce jobs may suffer from imbalanced workloads across reducers.
+ Pay attention to memory settings when running MapReduce jobs on clusters to avoid out-of-memory errors.
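
The workload imbalance mentioned above is easy to reproduce in a toy setting. In this sketch (hypothetical data and helper name, with `zlib.crc32` standing in for the cluster's hash function), one very frequent key forces a single reducer to handle most of the records, no matter how many reducers there are.

```python
import zlib
from collections import Counter

def assign_reducer(key, number_of_reducers):
    """Stable stand-in for hash(key) % number_of_reducers."""
    return zlib.crc32(key.encode("utf-8")) % number_of_reducers

# 90 records share one hot key; 10 other keys appear once each
keys = ["the"] * 90 + ["word{}".format(i) for i in range(10)]

number_of_reducers = 4
load = Counter(assign_reducer(key, number_of_reducers) for key in keys)

for reducer_id in range(number_of_reducers):
    print("reducer", reducer_id, "records:", load[reducer_id])
```

Adding reducers does not help here, because all records with the hot key must still go to the same reducer.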

## References

1. MapReduce Paper: https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/mapreduce-osdi04.pdf
2. mrjob documentation: https://pythonhosted.org/mrjob/index.html
3. Set up EMR clusters to run MapReduce program in python: https://youtu.be/39tUXYjW4co
4. Another python MR library (lower level and more flexible): https://github.com/klbostee/dumbo/wiki/Short-tutorial
5. MapReduce combiners: https://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm