# What Is MapReduce?

* MapReduce is a method for distributing a task across multiple nodes
* Each node processes data stored on that node (where possible)
* Consists of two phases: 
    - Map
    - Reduce
    
<img src="files/Figures/hadoop_8.png" width="750cm">

# What about the bandwidth?

<img src="files/Figures/hadoop_9.png" width="1500cm">

# Features of MapReduce

* Automatic parallelization and distribution
* Fault-tolerance
* Status and monitoring tools
* A clean abstraction for programmers
    - MapReduce programs are usually written in Java
* MapReduce abstracts all the **‘housekeeping’** away from the developer
    - Developer can concentrate simply on writing the Map and Reduce functions

# MapReduce: The Mapper

* Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic
    - Multiple Mappers run in parallel, each processing a portion of the input data
* The Mapper reads data in the form of key/value pairs
* It outputs zero or more key/value pairs
    - ```map(in_key, in_value) -> (inter_key, inter_value) list```
* The Mapper may use or completely ignore the input key
    - For example, a standard pattern is to read a line of a file at a time
        - The key is the byte offset into the file at which the line starts 
        - The value is the contents of the line itself
        - Typically the key is considered irrelevant
* If the Mapper writes any output at all, it must be in the form of key/value pairs
* What can we do with a Mapper?
    - Select part of the input
    - Transform the text into the right format
    - Apply functions
    - Filter records (outliers, for instance)
    - ...
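
As a rough illustration of these uses, the sketch below models a Mapper as a Python generator that selects a field, transforms it, and filters outliers. The record layout (`sensor_id,reading`), the name `filter_mapper`, and the valid range of 0 to 100 are assumptions made up for this example; none of them come from Hadoop itself.

```python
def filter_mapper(in_key, in_value):
    """Hypothetical Mapper: in_key is the byte offset of the line (ignored),
    in_value is one line of the assumed form 'sensor_id,reading'."""
    sensor_id, raw = in_value.strip().split(",")  # select parts of the input
    reading = float(raw)                          # transform to the right format
    if 0.0 <= reading <= 100.0:                   # filter outliers (assumed range)
        yield (sensor_id, reading)                # emits zero or one pair per line


# Three input records: (byte offset, line contents)
lines = [(0, "a,12.5"), (7, "b,999.0"), (15, "a,40.0")]
pairs = [pair for key, value in lines for pair in filter_mapper(key, value)]
print(pairs)  # [('a', 12.5), ('a', 40.0)]
```

The second record is dropped, which shows that a Mapper may emit nothing for a given input.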

# MapReduce Example: Word Count

* Count the number of occurrences of each word in a large amount of input data
```
Map(input_key, input_value)
    foreach word w in input_value:
        emit(w, 1)
```

* Input to the Mapper
```
(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')
```

* Output from the Mapper
```
('the', 1), ('cat', 1), ('sat', 1), ('on', 1),
('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1),
('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
```
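
The same Mapper can be sketched in plain Python to check the output above. This is only a local simulation: `wordcount_mapper` and `records` are illustrative names, and splitting on whitespace is an assumed tokenization rule.

```python
def wordcount_mapper(input_key, input_value):
    # input_key is the byte offset of the line; word count ignores it.
    for word in input_value.split():
        yield (word, 1)


records = [
    (3414, "the cat sat on the mat"),
    (3437, "the aardvark sat on the sofa"),
]

# Apply the Mapper to every record and collect all emitted pairs.
mapped = [pair for key, value in records for pair in wordcount_mapper(key, value)]
print(mapped)
```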

<img src="files/Figures/hadoop_3.png" width="500cm">

# MapReduce: The Reducer

* After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
* This list is given to a Reducer
    - There may be a single Reducer, or multiple Reducers
    - All values associated with a particular intermediate key are guaranteed to go to the same Reducer
    - The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
    - This step is known as the ‘shuffle and sort’
* The Reducer outputs zero or more final key/value pairs
    - These are written to HDFS
    - In practice, the Reducer usually emits a single key/value pair for each input key
* What can we do with a Reducer?
    - Nothing
    - Data aggregation
    - Descriptive statistics
    - Data ordering
    - ...

# Example Reducer: Sum Reducer

* Add up all the values associated with each intermediate key:
```
reduce(output_key, intermediate_vals)
   set count = 0
   foreach v in intermediate_vals:
       count += v
   emit(output_key, count)
```

* Reducer output:
```
('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)
```
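
A minimal Python sketch of the sum Reducer, fed with the grouped word-count data in sorted key order. The `grouped` list is written out by hand here to stand in for what the shuffle and sort would deliver; `sum_reducer` is an illustrative name.

```python
def sum_reducer(output_key, intermediate_vals):
    # Add up all values that share the same intermediate key.
    count = 0
    for v in intermediate_vals:
        count += v
    yield (output_key, count)


# Hand-written stand-in for the output of the shuffle and sort.
grouped = [
    ("aardvark", [1]), ("cat", [1]), ("mat", [1]), ("on", [1, 1]),
    ("sat", [1, 1]), ("sofa", [1]), ("the", [1, 1, 1, 1]),
]

reduced = [pair for key, vals in grouped for pair in sum_reducer(key, vals)]
for pair in reduced:
    print(pair)
```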

<img src="files/Figures/hadoop_4.png" width="350cm">

# Putting It All Together

* The overall word count process

<img src="files/Figures/hadoop_5.png" width="1500cm">
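
The whole pipeline can be imitated on a single machine. The sketch below is a toy, in-memory model of the three stages (map, shuffle and sort, reduce); `run_mapreduce` is a hypothetical helper written for this example, not a Hadoop API, and it ignores distribution, fault tolerance, and HDFS entirely.

```python
from collections import defaultdict


def mapper(_offset, line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        yield (word, 1)


def reducer(key, values):
    # Emit a single (word, total) pair per key.
    yield (key, sum(values))


def run_mapreduce(records, mapper, reducer):
    # Map phase, with intermediate values grouped by key (the "shuffle").
    intermediate = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in mapper(key, value):
            intermediate[inter_key].append(inter_value)
    # Reduce phase: keys are handed to the reducer in sorted order.
    output = []
    for inter_key in sorted(intermediate):
        output.extend(reducer(inter_key, intermediate[inter_key]))
    return output


records = [
    (3414, "the cat sat on the mat"),
    (3437, "the aardvark sat on the sofa"),
]
result = run_mapreduce(records, mapper, reducer)
print(result)
```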

# MapReduce: Shuffling & Sorting

* Shuffling is the process of transferring data from the mappers to the reducers
* Shuffling can start even before the map phase has finished, to save some time
* Sorting saves the reducer even more time by making it easy to tell when a new reduce() call should start: a new call begins whenever the next key in the sorted input differs from the previous one.
* Each reduce task receives a list of key/value pairs, but it must call the reduce() method, which takes a key and a list of values, so it has to group the values by key.
* This grouping is easy when the input is pre-sorted locally in the map phase and then simply merge-sorted in the reduce phase (since each reducer gets data from many mappers).

<img src="files/Figures/aIGRQ.png" width="1500cm">

_Note that in the image, shuffling is called the copy phase_
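
To make the merge-and-group idea concrete, here is a small sketch using only the Python standard library. It assumes two mappers have each sorted their output locally; `mapper_1`, `mapper_2`, and `calls` are illustrative names. `heapq.merge` combines the sorted runs without re-sorting, and `itertools.groupby` starts a new group each time the key changes, which mirrors how a new reduce() call begins.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Output of two mappers, each already sorted locally by key.
mapper_1 = [("cat", 1), ("mat", 1), ("on", 1), ("sat", 1), ("the", 1), ("the", 1)]
mapper_2 = [("aardvark", 1), ("on", 1), ("sat", 1), ("sofa", 1), ("the", 1), ("the", 1)]

# Merge the sorted runs lazily; no full re-sort is needed.
merged = heapq.merge(mapper_1, mapper_2, key=itemgetter(0))

# A change of key marks the start of the next reduce() call.
calls = []
for key, group in groupby(merged, key=itemgetter(0)):
    values = [value for _, value in group]
    calls.append((key, values))

for key, values in calls:
    print(key, values)
```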

# Why Do We Care About Counting Words?

* Word count is challenging over massive amounts of data
    - Using a single compute node would be too time-consuming
    - Using distributed nodes requires moving data
    - The number of unique words can easily exceed the available RAM
    - Would need a hash table on disk
    - Would need to partition the results (sort and shuffle)
* The fundamentals of statistics are often simple aggregate functions
* Most aggregation functions are distributive in nature
    - e.g., max, min, sum, count, and (via partial sums and counts) mean and variance
* MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
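
The distributive property can be illustrated with a short sketch: each partition produces a small tuple of partial results, and the tuples are combined pairwise in any order. The names `partial_stats` and `combine` and the sample data are assumptions for this example. The variance uses the simple sum-of-squares formula (population variance), which is fine for illustration but can lose precision on large or badly scaled data.

```python
from functools import reduce


def partial_stats(values):
    # Per-partition summary: (count, sum, sum of squares, min, max).
    return (len(values), sum(values), sum(v * v for v in values),
            min(values), max(values))


def combine(a, b):
    # Merging two summaries only needs the summaries, not the raw data.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2],
            min(a[3], b[3]), max(a[4], b[4]))


# Two partitions, as if stored on two different nodes.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]

count, total, total_sq, lo, hi = reduce(combine, map(partial_stats, partitions))
mean = total / count
variance = total_sq / count - mean ** 2  # population variance

print(count, lo, hi, mean, variance)
```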