## Map/Reduce Semantics and Systems

### Types and transformations
* Map is a transformation
  * Input domain to output domain
* Reduce is a collection
  * No domain change
  
$$
\begin{aligned}
\text{map}(k_1, v_1) &\rightarrow \text{list}(k_2, v_2) \\
\text{reduce}(k_2, \text{list}(v_2)) &\rightarrow \text{list}(v_2)
\end{aligned}
$$

* Google's C++ implementation is based entirely on strings
  * User code must convert to structured types
* Hadoop! has type wrappers
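
The signatures above can be made concrete with a small WordCount sketch. This is illustrative Python, not the Google or Hadoop API: the names `wc_map`, `wc_reduce`, and `run_job`, and the in-memory grouping, are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical type aliases mirroring map(k1, v1) and reduce(k2, list(v2)).
K1, V1 = str, str   # e.g. file name, file contents
K2, V2 = str, int   # e.g. word, count


def wc_map(key: K1, value: V1) -> list[tuple[K2, V2]]:
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word, 1) for word in value.split()]


def wc_reduce(key: K2, values: list[V2]) -> list[V2]:
    """reduce(k2, list(v2)) -> list(v2): sum the counts for one word."""
    return [sum(values)]


def run_job(inputs, map_fn, reduce_fn):
    """Apply map_fn to every input, group by k2, then apply reduce_fn."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, v2s) for k2, v2s in groups.items()}


counts = run_job(
    [("a.txt", "the cat sat"), ("b.txt", "the dog sat")],
    wc_map,
    wc_reduce,
)
print(counts)
```

Note that map changes the domain (file name and text become word and count), while reduce returns values in the same domain it receives.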

### Parallelism in Map/Reduce

* How much potential parallelism is there in mappers? In reducers?


.

.

......spoiler alert.......
  
.  

.
* Mappers: up to a parallel process for each input (typically a file)
* Reducers: up to a parallel process for each key
* So for the WordCount example
  * two files = two mappers
  * five different words = five reducers
  * but this scales with the input


Differentiating between mappers/reducers and map/reduce processes:
* A cluster will typically configure the number of available physical processes
  * this number is typically much smaller than the potential parallelism
  * we refer to `mappers` as the potential parallelism and `map processes` as the number of physical processes running map functions.
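
The distinction can be checked with a few lines of Python. This is a minimal sketch under assumed inputs: the two tiny "files" and the helper names `potential_parallelism` and `tasks_per_process` are made up for illustration.

```python
import math


def potential_parallelism(files: dict[str, str]) -> tuple[int, int]:
    """Return (mappers, reducers): one mapper per file, one reducer per key."""
    mappers = len(files)
    distinct_words = {word for text in files.values() for word in text.split()}
    return mappers, len(distinct_words)


def tasks_per_process(tasks: int, processes: int) -> int:
    """Upper bound on tasks each physical process must run in turn."""
    return math.ceil(tasks / processes)


# Two files containing five different words in total.
files = {"f1.txt": "a b c a", "f2.txt": "d e a b"}
mappers, reducers = potential_parallelism(files)
print(mappers, reducers)

# With only two reduce processes configured, each handles several reducers.
print(tasks_per_process(reducers, 2))
```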

### Map/Reduce Runtimes

From the Google paper https://www.usenix.org/legacy/events/osdi04/tech/dean.html

<img src="images/mr.png" width=512 />

* Automatically partition input data
  * 16-64 MB chunks for Google
* Create M map tasks: one for each chunk
  * Assign available workers (up to M) to tasks
* Write intermediate pairs to local (to worker) disk
* R reduce tasks (defined by user) read and process intermediate results
* Output is up to R files available on shared file system
* Master tracks state
  * Assignment of M map tasks and R reduce tasks to workers
  * State and liveness of the above
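
The flow above can be sketched in a single process. This is a toy model, not the paper's implementation: chunks are counted in records rather than megabytes, the intermediate regions live in memory instead of on local disk, and the use of `zlib.crc32` as a stable hash for choosing among the R reduce tasks is an assumption.

```python
import zlib
from collections import defaultdict


def split_input(records: list[str], records_per_chunk: int) -> list[list[str]]:
    """Partition the input into chunks; each chunk becomes one map task."""
    return [records[i:i + records_per_chunk]
            for i in range(0, len(records), records_per_chunk)]


def region_for(key: str, r: int) -> int:
    """Pick one of the R reduce tasks for a key (stable hash, assumed)."""
    return zlib.crc32(key.encode("utf-8")) % r


def run_wordcount(records: list[str], records_per_chunk: int, r: int):
    chunks = split_input(records, records_per_chunk)
    m = len(chunks)  # M map tasks, one per chunk
    # Intermediate pairs are bucketed into R regions. For brevity all map
    # tasks share one in-memory structure instead of per-worker local disks.
    regions = [defaultdict(list) for _ in range(r)]
    for chunk in chunks:  # one iteration per map task
        for line in chunk:
            for word in line.split():
                regions[region_for(word, r)][word].append(1)
    # Each reduce task processes its region and produces one output "file".
    outputs = [{w: sum(c) for w, c in sorted(region.items())}
               for region in regions]
    return m, outputs


records = ["the cat sat", "the dog sat", "a cat ran", "a dog ran", "the end"]
m, outputs = run_wordcount(records, records_per_chunk=2, r=3)
print(m, outputs)
```

Every key lands in exactly one of the R outputs, which is why the job produces up to R files rather than one.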


### Systems Issues

The map/reduce runtime must deal with:
* Master failure
  * Checkpoint/restart, classic distributed systems/replication problem
* Failed worker
  * Heartbeat liveness detection, restart
* Slow worker
  * Backup tasks
* Locality of processing to data
  * A big deal that they don't really solve
  * But much subsequent research does
* Task granularity
  * Metadata size and protocol scaling (not inherent parallelism) limit the size of M and R
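
As one illustration of the failed-worker case, here is a minimal sketch of heartbeat-based liveness detection. The `Master` class, its method names, and the timeout policy are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Master:
    timeout: float  # seconds of silence before a worker is presumed dead
    last_seen: dict[str, float] = field(default_factory=dict)
    assignment: dict[str, Optional[str]] = field(default_factory=dict)

    def heartbeat(self, worker: str, now: float) -> None:
        """Record that a worker reported in at time `now`."""
        self.last_seen[worker] = now

    def assign(self, task: str, worker: str) -> None:
        """Record that a task is running on a worker."""
        self.assignment[task] = worker

    def reap(self, now: float) -> list[str]:
        """Mark silent workers as failed and return tasks to reschedule."""
        dead = {w for w, t in self.last_seen.items() if now - t > self.timeout}
        rescheduled = []
        for task, worker in list(self.assignment.items()):
            if worker in dead:
                self.assignment[task] = None  # back to idle, needs restart
                rescheduled.append(task)
        for worker in dead:
            del self.last_seen[worker]
        return sorted(rescheduled)


master = Master(timeout=10.0)
master.heartbeat("w1", now=0.0)
master.heartbeat("w2", now=0.0)
master.assign("map-0", "w1")
master.assign("map-1", "w2")
master.heartbeat("w2", now=8.0)   # w1 stays silent
print(master.reap(now=12.0))
```

A slow worker would not trip this timeout, which is why backup tasks are handled as a separate mechanism.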
  
### Google File System: The Data Service

* Goals
  * Wide distribution
  * Commodity hardware
  * High (aggregate) performance
* Different assumptions than traditional file systems
  * Component failures are normal behavior
  * Files are huge (new to Google environment ca. 2004)
* Most files are written append-only
  * Mandate append-only writing to realize good I/O properties
  * Why append only?
      * reduce contention -- logical locking of tail rather than physical locking of offset
      * no data reorganization for writes

GFS architecture (from https://www.cs.rutgers.edu/~pxk/417/notes/16-dfs.html)
    
<img src="images/gfs.png" width=512 />

<img src="images/gfs2.png" width=512 />


* Reliable services
  * Master, scheduler, and lock services are fault tolerant
* Data are triple replicated
  * On nodes that have independent failure properties (different racks, power supplies, networks)
  * This became standard practice in cloud key/value stores for a decade (now superseded by distributed erasure coding)
  * Tolerates two failures
  * Re-replicated on failure detection
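
A sketch of the placement idea, assuming racks are the only failure domain considered. The functions `place_replicas` and `re_replicate` and the node and rack names are invented for illustration; a real system would also weigh factors such as disk utilization.

```python
def place_replicas(nodes: dict[str, str], copies: int = 3) -> list[str]:
    """Choose `copies` nodes, at most one per rack. `nodes` maps node -> rack."""
    chosen, used_racks = [], set()
    for node, rack in sorted(nodes.items()):
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough independent racks")


def re_replicate(replicas: list[str], failed: str,
                 nodes: dict[str, str], copies: int = 3) -> list[str]:
    """Drop the failed node and restore the replica count."""
    survivors = [n for n in replicas if n != failed]
    used_racks = {nodes[n] for n in survivors}
    candidates = sorted(n for n in nodes if n != failed and n not in survivors)
    # Stable sort: nodes on racks not yet holding a replica come first.
    candidates.sort(key=lambda n: nodes[n] in used_racks)
    return survivors + candidates[: copies - len(survivors)]


nodes = {"n1": "rackA", "n2": "rackA", "n3": "rackB",
         "n4": "rackC", "n5": "rackD"}
replicas = place_replicas(nodes)
print(replicas)
print(re_replicate(replicas, failed="n3", nodes=nodes))
```

With three copies on independent racks, two simultaneous failures still leave one readable replica from which the others can be rebuilt.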
  
Here is the original diagram from https://www.usenix.org/legacy/events/osdi04/tech/dean.html

<img src="images/gfs.orig.png" width=700 />

GFS is an **out of band** file system.
  * metadata and data are separate, coordinated services

### How GFS changed the world

* Atomic checkpoint and append
  * Major mode for writing
  * Great semantics for limited usage
* Abandon POSIX file system semantics
* In-memory metadata at Master
  * Gotta keep it small
  * Even for big data (scale metadata memory in proportion to aggregate storage)
* Re-replication
  * Keep three, detect missing on read/write
  * Forget reliable storage, forget RAID
* Design for failures (not recovery)


## Semantics

Let's start with some definitions:

#### Shuffle

This is the routing process of mapper outputs to reducer inputs.

<img src="images/hadoopsem.png" width=512 />

#### Partition

* Partition is the output file of a reducer _process_ (not a single reducer).
  * Contains many reducer keys
 
#### Combiner

This image is linked from https://data-flair.training/blogs/hadoop-combiner-tutorial/.  Please refer to their page.

<img src="https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2017/09/mapreduce-program-with-combiner.jpg" width=512 />



Combiner is a function that runs on the outputs of the mapper before the shuffle.

* Combiner executes on the mapper's $ \langle key,value \rangle$ output while in memory at the mapper
  * It is possible to write unique combiner and reduce classes
  * It is common to use the reducer as a combiner
  * Combiner must have algebraic properties (the operation should be associative and commutative), i.e. `reduce(combine(A),combine(B)) == reduce(A,B)`
  * Traditional reduce operators (aggregates and extrema) work in combiners
* Combiner in WordCount:
  * compute sum output by mapper for each key and send a single aggregated value to reducers
* Combiner for Maximum: 
  * compute maximum value for each key.  
  * Reducer computes a maximum of maxima.
* __Caution!!!__ Your homework assignment cannot use a combiner.  I will ask you why.
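
The algebraic property can be checked directly for the two examples above. In this sketch the helper `fold_by_key` (a made-up name) plays the role of both combiner and reducer, and the mapper outputs are small hand-written lists.

```python
import operator


def fold_by_key(pairs, op):
    """Fold all values that share a key with `op`; return sorted (key, value)."""
    acc = {}
    for key, value in pairs:
        acc[key] = op(acc[key], value) if key in acc else value
    return sorted(acc.items())


# WordCount: outputs of two mappers, still in memory at each mapper.
mapper_a = [("the", 1), ("cat", 1), ("the", 1)]
mapper_b = [("the", 1), ("dog", 1)]

combined_a = fold_by_key(mapper_a, operator.add)   # combiner at mapper A
combined_b = fold_by_key(mapper_b, operator.add)   # combiner at mapper B
with_combiner = fold_by_key(combined_a + combined_b, operator.add)
without_combiner = fold_by_key(mapper_a + mapper_b, operator.add)
print(with_combiner == without_combiner)

# Fewer pairs cross the network when the combiner runs first.
print(len(mapper_a + mapper_b), len(combined_a + combined_b))

# Maximum: the reducer computes a maximum of per-mapper maxima.
temps_a = [("x", 3), ("x", 9)]
temps_b = [("x", 5)]
max_of_maxima = fold_by_key(
    fold_by_key(temps_a, max) + fold_by_key(temps_b, max), max)
print(max_of_maxima)
```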


### The Map/Reduce Sorting Guarantee

* Map: extracts a sorting key from the value
$$
 \langle key, value \rangle \rightarrow \langle sort\_key, output\_value \rangle
$$
* Shuffle does not sort strictly:
  * it routes each pair to a reducer based on the sort key.
  * typically, hashing maps each key to a reducer
  * Keys are sorted as they are presented to the reducer

Here is the guarantee:

_We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate
a sorted output file per partition, which is useful when
the output file format needs to support efficient random
access lookups by key, or users of the output find it convenient to have the data sorted._

The ordering guarantee covers sorting within each partition only.
* To sort completely:
  * All output to a single partition (use one reducer)
  * Customize the shuffle function (quite complex and can introduce skew)
  * The default shuffle uses hashing (for load balance)

* The Google paper optimizes sort by customizing shuffle so that partitions are ordered, not randomized
  * Run an M/R job to learn the key distribution
  * Specify a shuffle based on evenly partitioning the key distribution
  * This is also how Hadoop!’s sort record worked
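
The difference between the default hashed shuffle and an ordered shuffle can be sketched as follows. This is a toy illustration: `zlib.crc32` stands in for the default hash, the whole key list is used as the "sample" of the key distribution, and all function names are hypothetical.

```python
import bisect
import zlib


def hash_partition(key: str, r: int) -> int:
    """Default-style shuffle: spread keys over r partitions by hash."""
    return zlib.crc32(key.encode("utf-8")) % r


def split_points(sample: list[str], r: int) -> list[str]:
    """Pick r-1 keys that cut a sorted sample into r roughly equal ranges."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // r] for i in range(1, r)]


def range_partition(key: str, points: list[str]) -> int:
    """Ordered shuffle: partition i holds keys below partition i+1's keys."""
    return bisect.bisect_right(points, key)


def partitions(keys, assign, r):
    """Route keys to partitions; keys are sorted within each partition."""
    parts = [[] for _ in range(r)]
    for key in keys:
        parts[assign(key)].append(key)
    return [sorted(part) for part in parts]


keys = ["pear", "apple", "fig", "kiwi", "plum", "date", "lime", "grape"]
r = 3

hashed = partitions(keys, lambda k: hash_partition(k, r), r)
points = split_points(keys, r)
ranged = partitions(keys, lambda k: range_partition(k, points), r)

print(points)
print(ranged)
# Concatenating range partitions in order yields a fully sorted output.
print([k for part in ranged for k in part] == sorted(keys))
```

Both shuffles give sorted partitions, but only the range-based one lets the output files be concatenated into a total order. If the sample misrepresents the key distribution, the ranges become uneven, which is the skew risk noted above.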