## MapReduce

MapReduce is a parallel programming model for processing large data sets across a cluster of computers.

### Computation is done via two functions.
1. Mapper
2. Reducer

### Basics of MapReduce
1. Send many documents to many mappers. Each mapper performs the same mapping on its own documents and produces a series of intermediate key-value pairs.

2. Shuffle these intermediate results so that all key-value pairs with the same key are sent to the same reducer for processing.

3. Each reducer can then produce one final key-value pair for each key it receives.
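
The three steps above can be sketched in a few lines of single-process Python. This only illustrates the data flow and is not how a cluster actually executes it; the names `map_fn`, `shuffle`, `reduce_fn`, and the sample `documents` are hypothetical.

```python
from collections import defaultdict

def map_fn(document):
    # Step 1: emit one (key, value) pair per word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Step 2: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Step 3: collapse the values for one key into one final pair.
    return key, sum(values)

documents = ["a b a", "b c"]  # hypothetical input, one string per mapper

intermediate = []
for doc in documents:
    intermediate.extend(map_fn(doc))

result = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
print(result)
```

On a real cluster each call to `map_fn` and `reduce_fn` could run on a different machine, because no call depends on the result of another call of the same kind.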

**Before we move on, let's count words with a plain Python dictionary to get a feel for what MapReduce does.**

In [82]:
import string

def word_count(sentence):
    word_counts = {}

    for word in sentence.split():
        # strip punctuation from the word
        key = ''.join(char for char in word if char not in string.punctuation)
        # skip tokens that were nothing but punctuation
        if not key:
            continue
        word_counts[key] = word_counts.get(key, 0) + 1
    print(word_counts)

    

sentence = "Hi, there. My name is Uidam."
word_count(sentence)

{'Hi': 1, 'there': 1, 'My': 1, 'name': 1, 'is': 1, 'Uidam': 1}


**If the text were really large, this script would take an extremely long time to run. In that case, it makes sense to employ MapReduce.**

### 1. Mapper

Recall that a mapper takes in a document (in this case, a collection of words, such as the text of a book) and returns a series of intermediate key-value pairs.



In [84]:
import string

# Collects the emitted pairs so the reducer cell can read them.
# (The name shadows the built-in list; it is kept because the reducer uses it.)
list = []

def mapper(sentence):
    for word in sentence.split():
        # clean the data: strip punctuation and lowercase
        key = ''.join(char for char in word if char not in string.punctuation)
        key = key.lower()
        # skip tokens that were nothing but punctuation
        if not key:
            continue

        # emit a key-value pair
        pair = "{0}\t{1}".format(key, 1)
        print(pair)
        list.append(pair)

sentence = "Hi, there. My name is Uidam. Uidam is My name"
mapper(sentence)

hi	1
there	1
my	1
name	1
is	1
uidam	1
uidam	1
is	1
my	1
name	1


### 2. Reducer

These key-value pairs are shuffled to ensure that every pair with the same key ends up on the same reducer.

Each reducer then performs some operation on the values for a particular key.
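
One common way to guarantee that equal keys meet at the same reducer is to choose the reducer from a hash of the key. The sketch below assumes that approach; `partition`, `NUM_REDUCERS`, and the sample `pairs` are hypothetical, and `zlib.crc32` is used only because Python's built-in `hash` for strings can differ between runs.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical number of reducers

def partition(key, num_reducers=NUM_REDUCERS):
    # A deterministic hash of the key picks the reducer, so equal keys
    # always map to the same reducer index.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

pairs = [("hi", 1), ("my", 1), ("name", 1), ("my", 1), ("name", 1)]

# reducer index -> list of pairs routed to that reducer
reducer_inputs = defaultdict(list)
for key, value in pairs:
    reducer_inputs[partition(key)].append((key, value))

for index in sorted(reducer_inputs):
    print(index, reducer_inputs[index])
```

Different keys may still share a reducer; the only guarantee is that a single key is never split across two of them.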

In [98]:
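# Sorting stands in for the shuffle step: identical keys become adjacent.
list.sort()

def reducer():
    word_count = 0
    old_key = None

    for line in list:
        data = line.split("\t")

        # skip malformed lines
        if len(data) != 2:
            continue

        this_key, count = data

        # the key changed, so the previous key's total is complete
        if old_key is not None and old_key != this_key:
            print("{0}\t{1}".format(old_key, word_count))
            word_count = 0

        old_key = this_key
        # counts are whole numbers, so accumulate them as integers
        word_count += int(count)

    # emit the total for the last key
    if old_key is not None:
        print("{0}\t{1}".format(old_key, word_count))

reducer()

hi	1
is	2
my	2
name	2
there	1
uidam	2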
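
The same sort-then-scan logic can be written more compactly with `itertools.groupby`, which groups consecutive items that share a key. This is an alternative sketch, not the reducer used above; `reduce_sorted` and the sample `lines` are hypothetical names.

```python
from itertools import groupby

def reduce_sorted(lines):
    # Parse "key<TAB>count" lines, skipping malformed ones.
    pairs = [line.split("\t") for line in lines]
    pairs = [p for p in pairs if len(p) == 2]

    # groupby only merges adjacent items, so sort by key first.
    pairs.sort(key=lambda p: p[0])

    totals = []
    for key, group in groupby(pairs, key=lambda p: p[0]):
        totals.append((key, sum(int(count) for _, count in group)))
    return totals

lines = ["hi\t1", "is\t1", "my\t1", "is\t1", "my\t1"]
for key, total in reduce_sorted(lines):
    print("{0}\t{1}".format(key, total))
```

The explicit sort plays the role of the shuffle: without it, `groupby` would report the same key more than once whenever its pairs were not adjacent.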



### 3. More complex MapReduce

A very common open source implementation of the MapReduce model is **Hadoop**.

To make it easier for programmers to complete complicated tasks using the processing power of **Hadoop**, a lot of infrastructure has been built on top of it.

Two of the most common are ***Hive and Pig.***

**1. Hive**
- Developed by Facebook.
- Lets you run MapReduce jobs through a SQL-like query language.
- That language is the Hive Query Language (HiveQL).

**2. Pig**
- Developed by Yahoo!; excels in some areas.
- Uses a procedural language called Pig Latin.
- Lets you be more explicit about how your data processing is executed.

**A bunch of other products are out there:**
- **Mahout** for machine learning,
- **Giraph** for graph analysis, and
- **Cassandra**, a hybrid of a key-value and a column-oriented database.
