In [None]:
# Install the lorem package, a generator of random text that looks like Latin.
!pip install lorem

Collecting lorem
  Downloading lorem-0.1.1-py3-none-any.whl (5.0 kB)
Installing collected packages: lorem
Successfully installed lorem-0.1.1


# Map Reduce Overview


The objective of this notebook is to write a word count application in Python using the map-reduce process. A Java version is well explained on [this page](https://www.dezyre.com/hadoop-tutorial/hadoop-mapreduce-wordcount-tutorial).

![domain decomposition](https://github.com/pnavaro/big-data/blob/master/notebooks/images/domain_decomp.png?raw=1)

credits: https://computing.llnl.gov/tutorials/parallel_comp

Reference: https://en.wikipedia.org/wiki/MapReduce

* **Map**: The Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.
* **Shuffle**: Shuffling is the process by which the system sorts the map output and transfers it from the mappers to the reducers as input. Without the shuffle phase, the reducers would have no input at all, or would not receive the output of every mapper.
* **Reduce**: The framework calls the application's Reduce function once for each unique key, in sorted order. The Reduce function can iterate through the values associated with that key and produce zero or more outputs.

![MapReduce](https://github.com/pnavaro/big-data/blob/master/notebooks/images/mapreduce.jpg?raw=1)
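
The three phases above can be sketched in plain Python on a tiny word count. This is only an illustration under stated assumptions: the names `mapper`, `shuffle`, `reducer` and the sample `lines` are hypothetical, everything runs in a single process, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict
from functools import reduce

def mapper(line):
    # Map: emit one (word, 1) pair for each word in the line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group the values by key and return the groups sorted by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce: combine all the values of one key into a single count.
    return key, reduce(lambda x, y: x + y, values)

# Hypothetical sample input, one string per "document line".
lines = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = [pair for line in lines for pair in mapper(line)]
counts = [reducer(key, values) for key, values in shuffle(mapped)]
print(counts)
```

Each function only sees its own slice of the data: `mapper` one line, `reducer` one key. That isolation is what lets a framework run many mappers and reducers in parallel.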

# Python's built-in function `map`

The built-in function `map(func, seq)` applies the function `func` to every element of the sequence `seq`. In Python 3 it returns a lazy iterator over the results rather than a new list.

In [None]:
def f(x):
    return x * x

rdd = [2, 6, -3, 7]
res = map(f, rdd)
res  # res is a map object, i.e. a lazy iterator

<map at 0x7f120d4b9350>

In [None]:
print(*res)

4 36 9 49
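
Because `map` returns an iterator, its results can be consumed only once. The short sketch below illustrates this; the variable names `first_pass` and `second_pass` are chosen here for illustration only.

```python
def f(x):
    return x * x

res = map(f, [2, 6, -3, 7])

first_pass = list(res)   # consumes every element of the iterator
second_pass = list(res)  # the iterator is now exhausted

print(first_pass)
print(second_pass)
```

If the results are needed more than once, it is usually simplest to store them in a list right away, as in `list(map(f, rdd))`.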


In [None]:
from operator import mul
rdd1, rdd2 = [2, 6, -3, 7], [1, -4, 5, 3]
res = map(mul, rdd1, rdd2)  # element-wise product of rdd1 and rdd2

In [None]:
print(*res)

2 -24 -15 21
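
When `map` receives several sequences of different lengths, it stops as soon as the shortest one is exhausted. A small sketch, with the illustrative names `short` and `longer`:

```python
from operator import mul

short, longer = [2, 6], [1, -4, 5, 3]

# Only the pairs (2, 1) and (6, -4) are formed; 5 and 3 are ignored.
products = list(map(mul, short, longer))
print(products)
```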


# Python's standard library function `functools.reduce`

The function `reduce(func, seq)` repeatedly applies the two-argument function `func` to the elements of the sequence `seq`, from left to right, and returns a single value. For example, `reduce(f, [1, 2, 3, 4, 5])` calculates `f(f(f(f(1, 2), 3), 4), 5)`.

In [None]:
from functools import reduce
from operator import add
rdd = list(range(1,6))
reduce(add, rdd) # computes ((((1+2)+3)+4)+5)

15

In [None]:
reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])

15

In [None]:
reduce(lambda x, y: x - y, [1, 2, 3, 4, 5], 0)  # initial value 0: computes (((((0-1)-2)-3)-4)-5)

-15
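
The optional third argument of `reduce` is an initial value placed before the first element of the sequence. The sketch below compares the two call forms on a non-commutative operation and shows what happens with an empty sequence; the variable names are illustrative.

```python
from functools import reduce
from operator import sub

values = [1, 2, 3, 4, 5]

without_init = reduce(sub, values)    # ((((1-2)-3)-4)-5)
with_init = reduce(sub, values, 0)    # (((((0-1)-2)-3)-4)-5)
empty_with_init = reduce(sub, [], 0)  # the initial value is returned as is

# Without an initial value, reducing an empty sequence raises TypeError.
try:
    reduce(sub, [])
    raised = False
except TypeError:
    raised = True

print(without_init, with_init, empty_with_init, raised)
```

Passing an initial value is a common way to make a reduction safe when the input may be empty.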