### Generators (to help understand map-reduce)

A generator is an object you can pass to the built-in `next` function (in Python 2, you call its `.next()` method instead). Each call returns the next value until the values run out, at which point a `StopIteration` exception is raised. Objects that behave this way are called iterators, and a generator is one kind of iterator.

Normal functions return a single value (or a list of values) all at once with `return`.  Once a function returns, the state of its local variables is lost.  **Generator functions `yield` control back to the calling program along with a value, but the local state of the function is preserved, so the next call to `next` resumes where the generator left off.**

In [1]:
# an example of a function returning a generator
def hold_client(name):
    yield 'Hello, %s! You will be connected soon' % name
    yield 'Dear %s, could you please wait a bit.' % name
    yield 'Sorry %s, we will play a nice music for you!' % name
    yield '%s, your call is extremely important to us!' % name

In [2]:
wait = hold_client('Frank')

In [3]:
type(wait)

generator

In [4]:
#wait.next() #python 2
next(wait) #python 3

'Hello, Frank! You will be connected soon'
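
The cells above only ask for the first value. As a minimal sketch of what happens when the values run out, the snippet below redefines `hold_client` so it stands alone and then drains it. The helper `drain` is a made-up name for illustration, not part of any library.

```python
def hold_client(name):
    yield 'Hello, %s! You will be connected soon' % name
    yield 'Dear %s, could you please wait a bit.' % name
    yield 'Sorry %s, we will play a nice music for you!' % name
    yield '%s, your call is extremely important to us!' % name


def drain(gen):
    """Hypothetical helper: call next() until StopIteration is raised."""
    messages = []
    while True:
        try:
            messages.append(next(gen))
        except StopIteration:
            # The generator body has finished; there is nothing left to yield.
            break
    return messages


wait = hold_client('Frank')
first = next(wait)        # runs the body up to the first yield
remaining = drain(wait)   # resumes right after the first yield
print(first)
print(len(remaining))     # 3 messages were left

# A for loop catches StopIteration for you, so this is the usual way to consume one:
for message in hold_client('Frank'):
    print(message)
```

Once a generator is exhausted it stays exhausted, so calling `drain(wait)` a second time returns an empty list.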

### But why use a generator?

In [5]:
from sys import getsizeof

# list
print("Evaluating list")
lst = [x for x in range(0,1000)]
print("Type: ", type(lst))
print("Length: ", len(lst))
print("Size (bytes): ", getsizeof(lst))

Evaluating list
Type:  <class 'list'>
Length:  1000
Size (bytes):  9024


In [6]:
# generator
print("Evaluating generator")
gen = (x for x in range(0,1000))
print("Type: ", type(gen))
# print("Length: ", len(gen))  # raises TypeError: a generator has no len()
print("Size (bytes): ", getsizeof(gen))

Evaluating generator
Type:  <class 'generator'>
Size (bytes):  88


In [8]:
# gen.next() Python 2
next(gen)

1

**Generators are memory efficient because they produce values only as they are requested instead of holding them all in memory.  This is useful for large datasets.  Additionally, it is sometimes useful to keep the state of a function between calls instead of starting from scratch.**

A classic example of the utility of generators is finding the prime numbers in a range of values.  See a nice description [here](https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/).
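
As a rough illustration of the prime number case (a sketch written for these notes, not the code from the linked post), the generator below yields primes one at a time and never builds the full list. The names `is_prime` and `primes` are chosen here for illustration.

```python
from itertools import islice


def is_prime(n):
    """Trial division; good enough for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def primes(start=2):
    """Yield primes >= start forever, one per call to next()."""
    n = start
    while True:
        if is_prime(n):
            yield n   # pause here; n is remembered for the next call
        n += 1


# The generator is infinite, so take only as many values as needed.
first_ten = list(islice(primes(), 10))
print(first_ten)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Because `n` survives between calls, the search continues from the last prime found instead of restarting at 2 each time.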

### Hadoop Map Reduce in Python
- Hadoop Map Reduce jobs are natively written in Java
- mrjob is a Python library that makes use of _Hadoop Streaming_, which lets scripts written in other languages act as the mapper and reducer
    - Basically, each step reads its input from stdin and writes its output to stdout
- An mrjob script is typically run from the command line: `$ python python_mr_file.py textfile.txt`

To install:
`$ pip install mrjob`  
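
To make the stdin/stdout idea concrete, here is a minimal sketch of what a streaming-style mapper looks like when written by hand. It is plain Python and is not how mrjob is implemented internally; the function name `streaming_mapper` and the sample text are assumptions made for this example.

```python
import io
import sys


def streaming_mapper(infile, outfile):
    """Read text lines from infile and write one 'word<TAB>1' line per word."""
    for line in infile:
        for word in line.split():
            outfile.write('%s\t%d\n' % (word, 1))


# In a real streaming job the streams would be sys.stdin and sys.stdout.
# An in-memory stream stands in for stdin here so the sketch runs anywhere.
fake_stdin = io.StringIO('hello world\nhello again\n')
streaming_mapper(fake_stdin, sys.stdout)
```

mrjob takes care of this plumbing, so a job only has to define methods that yield key, value pairs.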


### Basic structure of an mrjob file

```python
from mrjob.job import MRJob

class MyMR(MRJob):
    
    def mapper(self, key, value):
        pass
    
    def reducer(self, key, values):  # values: iterator over every value emitted for this key
        pass
    
if __name__ == '__main__': 
    MyMR.run()
```
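
Between the mapper and the reducer, the framework groups the mapper's output by key. The sketch below imitates that flow in plain Python, without mrjob, to show why a reducer receives one key together with an iterator over that key's values. `run_local` is a hypothetical stand-in for the framework, and the mapper and reducer are ordinary functions rather than `MRJob` methods.

```python
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    for word in line.split():
        yield (word, 1)


def reducer(word, counts):
    yield (word, sum(counts))


def run_local(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    # Map: every input line produces zero or more (key, value) pairs.
    mapped = [pair for line in lines for pair in mapper(None, line)]
    # Shuffle/sort: bring pairs that share a key next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce: one call per key, with an iterator over that key's values.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reducer(key, values))
    return results


print(run_local(['hello world', 'hello again']))
# [('again', 1), ('hello', 2), ('world', 1)]
```

On a real cluster the map and reduce calls are spread across machines, but the contract each function sees is the same.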

### Using MRJob for Word Count

Q: Count the frequency of words using MRJob.


- Create the `WordCount.py` file.

In [9]:
%%writefile WordCount.py

from mrjob.job import MRJob

class MyMR(MRJob):
    
    # key = _ (unused; None for plain text input), value = line
    def mapper(self, _, line):
        words = line.split()
        
        for word in words:
            yield (word, 1)
    
    # key = word, values = iterator over the counts emitted for that word
    def reducer(self, word, counts):
        yield (word, sum(counts))
    
if __name__ == '__main__': 
    MyMR.run()

Writing WordCount.py


- Run it locally on the `words.txt` file

In [10]:
!python WordCount.py words.txt # output goes to stdout; the leading ! runs the command in a shell

No configs found; falling back on auto-configuration
Creating temp directory /var/folders/ck/09slxnrw8xjf1_001s7djbx80000gn/T/WordCount.courtney.20171018.041814.798084
Running step 1 of 1...
Streaming final output from /var/folders/ck/09slxnrw8xjf1_001s7djbx80000gn/T/WordCount.courtney.20171018.041814.798084/output...
"again"	1
"hello"	2
"is"	2
"line"	2
"second"	1
"the"	2
"third"	1
"this"	2
"world"	1
Removing temp directory /var/folders/ck/09slxnrw8xjf1_001s7djbx80000gn/T/WordCount.courtney.20171018.041814.798084...


In [11]:
!python WordCount.py words.txt > wordout.txt # stdout is saved to wordout.txt; the log messages below go to stderr, so they still print

No configs found; falling back on auto-configuration
Creating temp directory /var/folders/ck/09slxnrw8xjf1_001s7djbx80000gn/T/WordCount.courtney.20171018.041826.857052
Running step 1 of 1...
Streaming final output from /var/folders/ck/09slxnrw8xjf1_001s7djbx80000gn/T/WordCount.courtney.20171018.041826.857052/output...
Removing temp directory /var/folders/ck/09slxnrw8xjf1_001s7djbx80000gn/T/WordCount.courtney.20171018.041826.857052...


### Something to note

mrjob only cares that mappers, combiners, and reducers take a key and a value as parameters and that they yield key, value pairs.  The parameter names are arbitrary and do not need to match from one step to the next.

In [14]:
%%writefile WordCount_gibberish.py

from mrjob.job import MRJob

class MyMR(MRJob):
    
    # key = abba, value = rice_crispy_treats
    def mapper(self, abba, rice_crispy_treats):
        words = rice_crispy_treats.split()
        
        for word in words:
            yield (word, 1)
    
    # key = SpeedRacer, value = Mach5
    def reducer(self, SpeedRacer, Mach5):
        yield (SpeedRacer, sum(Mach5))
    
if __name__ == '__main__': 
    MyMR.run()

Writing WordCount_gibberish.py


In [15]:
!python WordCount_gibberish.py words.txt #output to standard out

No configs found; falling back on auto-configuration
Creating temp directory /tmp/WordCount_gibberish.frank.20170823.162107.464760
Running step 1 of 1...
Streaming final output from /tmp/WordCount_gibberish.frank.20170823.162107.464760/output...
"the"	2
"third"	1
"this"	2
"world"	1
"again"	1
"hello"	2
"is"	2
"line"	2
"second"	1
Removing temp directory /tmp/WordCount_gibberish.frank.20170823.162107.464760...
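
One way to see why the names do not matter: the arguments are matched by position, not by name. The sketch below assumes that calling convention, which is consistent with the run above, and uses a hypothetical `call_mapper` helper in place of the framework.

```python
def mapper_plain(key, line):
    for word in line.split():
        yield (word, 1)


def mapper_gibberish(abba, rice_crispy_treats):
    for word in rice_crispy_treats.split():
        yield (word, 1)


def call_mapper(mapper, line):
    """Hypothetical stand-in for the framework: passes (key, value) by position."""
    return list(mapper(None, line))


line = 'this is the third line'
same = call_mapper(mapper_plain, line) == call_mapper(mapper_gibberish, line)
print(same)  # True
```

Descriptive names such as `word` and `counts` are still worth using, since they document what each step expects.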
