## Map/Reduce

* A structured programming model for data parallelism
  * Geared toward data-intensive applications rather than compute-intensive ones
  * Data-parallel, distributed-memory, data-intensive
* Map/Reduce is co-deployed with the **Google File System**
  * Large-scale parallel file system for large objects
  * Designed on the assumption that component failures are the normal case
  * M/R requires such a global data service
* Open-source re-implementation
  * Hadoop: Map/Reduce programming
  * HDFS: Hadoop Distributed File System

### Some important properties

* Automated parallelism
  * the programmer does not specify or manage the number of worker processes
  * same program runs on any M/R cluster
* Functional programming concepts
  * break the program into a data-parallel portion (`map`) and a reduction (`reduce`)
  * user-defined (procedural) functions for each task, executed in a functional framework
  * No *working memory* (state carried between calls) in either mappers or reducers. Each function is applied independently to its input: one key/value pair for a mapper, one key and its values for a reducer.
* Sequential/streaming data access
  * data are streamed from disk to mappers
  * mapper outputs a stream (of keys and values)
  * stream shuffled to reducers
  * reducers output to file system
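
The properties above can be sketched with a toy, in-process engine. This is a minimal illustration, not how a real M/R cluster is implemented: the names `run_job`, `mapper`, `reducer`, and `num_workers` are hypothetical, and a thread pool stands in for cluster workers. The point is that the user-supplied functions keep no state between calls, so the engine is free to choose the number of workers and the result does not change.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def mapper(record):
    # Stateless: the output depends only on this one record.
    yield (record % 3, record)

def reducer(key, values):
    # Stateless: sees only the values shuffled to this key.
    return key, sum(values)

def run_job(records, mapper, reducer, num_workers):
    """Toy engine; num_workers is chosen here, not by the job's author."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Map phase: each record is processed independently.
        mapped = pool.map(lambda r: list(mapper(r)), records)
        # Shuffle phase: group the intermediate values by key.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)
        # Reduce phase: one independent call per key.
        reduced = pool.map(lambda kv: reducer(*kv), sorted(groups.items()))
        return dict(reduced)

records = range(10)
one_worker = run_job(records, mapper, reducer, num_workers=1)
four_workers = run_job(records, mapper, reducer, num_workers=4)
print(one_worker == four_workers)
```

The same `mapper` and `reducer` run unchanged under either worker count, which is the sense in which the program is independent of the cluster it runs on.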

  
<img src="https://cdn-images-1.medium.com/max/1600/1*KKm4roOpsum147kKk5qp7A.jpeg" width=768 />

### Key/Value

* All M/R data are key/value pairs
  * keys are interpreted by the M/R system and must be sortable
  * values are not interpreted by the M/R system; they are only used inside the user's functions
* Colors in the figure above represent keys
  * the mapper does not change them (in this example)
  * the engine shuffles them and delivers each key to a reducer
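
The difference between keys and values can be sketched with a sort-based shuffle. This is an assumption about one simple way to group by key, not a description of any particular engine; `shuffle` and `mapper_output` are hypothetical names. Only the keys are compared, while the values are passed through untouched, so they can be of any type.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Group (key, value) pairs by key, visiting keys in sorted order."""
    # Sorting compares keys only, which is why keys must be sortable;
    # values are carried along without being inspected.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

# Hypothetical mapper output: colors as keys, arbitrary objects as values.
mapper_output = [
    ("red", {"id": 1}),
    ("blue", b"\x00\x01"),
    ("red", None),
    ("green", 3.5),
    ("blue", [1, 2]),
]
shuffled = list(shuffle(mapper_output))
for key, values in shuffled:
    print(key, values)
```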

### Map/Reduce Example

The word count example from the original Google paper: produce a count of the occurrences of each word in a set of documents.

```python
def map_fn(key: str, value: str):
    # key: document name (file name)
    # value: document contents
    for w in value.split():  # words are split on whitespace here
        yield (w, "1")  # emit an intermediate (word, "1") pair
```

Mapper outputs `key=word, value="1"` for each word. Note that the output key and input key are different.
  * map is a **transformation** of the key/value schema
  * the function has no side effects: its only output is the emitted key/value pairs

```python
def reduce_fn(key: str, values):
    # key: a word
    # values: an iterator over the counts emitted for that word
    result = 0
    for v in values:
        result += int(v)
    yield str(result)  # emit the total count as a string
```
The reducer sums the counts for each word. Some properties:
  * the reducer receives the list of all values for a key
  * reduce cannot change the key; it emits a value that is reduced from the list
  * like map, reduce is a user-defined function
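
To see the two functions work together, here is a small end-to-end sketch. The `map_fn` and `reduce_fn` definitions mirror the map and reduce functions above; the `word_count` driver, the whitespace tokenization, and the sample `documents` are assumptions made for illustration, with a dictionary standing in for the engine's shuffle.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    # key: a word; values: the counts emitted for that word
    yield str(sum(int(v) for v in values))

def word_count(documents):
    # Map, then group intermediate values by key (the "shuffle").
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    # Reduce: one call per key; the key itself is passed through unchanged.
    results = {}
    for word, counts in sorted(groups.items()):
        for total in reduce_fn(word, counts):
            results[word] = total
    return results

# Hypothetical input documents.
documents = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
counts = word_count(documents)
print(counts)
```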
  
### WordCount Visualized

<img src="https://www.researchgate.net/profile/Oscar_Pereira3/publication/270448794/figure/fig6/AS:295098651824130@1447368409317/Word-count-program-flow-executed-with-MapReduce-5.png" width=768 title="from Oscar Pereira @ ResearchGate" />

### What can you do?

Map/Reduce may seem like a limited programming framework, but many string-processing programs fit naturally in this model:
  * Grep
  * Web-log processing
  * Reverse web-link graph
  * Term vectors per host
  * Inverted index
  * Distributed sort
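
As one illustration of how a program from this list fits the model, here is a sketch of an inverted index: map emits a (word, document name) pair for each word, and reduce collects the sorted list of documents for each word. The function names, the `build_index` driver, and the sample documents are hypothetical, and a dictionary again stands in for the shuffle.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit each distinct word once per document, keyed by the word.
    for word in set(contents.split()):
        yield word, doc_name

def reduce_fn(word, doc_names):
    # All documents containing this word arrive together; sort them.
    yield sorted(doc_names)

def build_index(documents):
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, doc in map_fn(name, contents):
            groups[word].append(doc)
    index = {}
    for word, docs in sorted(groups.items()):
        for posting_list in reduce_fn(word, docs):
            index[word] = posting_list
    return index

# Hypothetical input documents.
documents = {"a.txt": "map reduce map", "b.txt": "map shuffle"}
index = build_index(documents)
print(index)
```

Compared with word count, only the two user-defined functions change; the map, shuffle, and reduce structure stays the same.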