# SI 618: Data Manipulation and Analysis
## 11a - Big Data I

### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

## Introduction to Big Data

### **NOTE: this notebook requires mrjob; install it with `conda install -y -c conda-forge mrjob`**

[Distributed Computing I](assets/distcomp1.pdf)

### Generators

In [1]:
def f():
    return 1

In [2]:
def f123():
    yield 1
    yield 2
    yield 3

In [3]:
def f123x(x):
    for i in range(x):
        yield i

In [4]:
f()

1

In [5]:
x = f()

In [6]:
x

1

In [7]:
f123()

<generator object f123 at 0x00000266B5232660>

In [8]:
x = f123()

In [9]:
x

<generator object f123 at 0x00000266B52326D8>

In [10]:
for thing in x:
    print(thing)

1
2
3


In [11]:
x = f123x(10)

In [12]:
x

<generator object f123x at 0x00000266B5232750>

In [13]:
for thing in x:
    print(thing)

0
1
2
3
4
5
6
7
8
9


In [14]:
x = f123x(10)

In [15]:
for thing in x:
    print(thing)
    break

0


In [16]:
for thing in x:
    print(thing)
    break

1


In [17]:
next(x)

2
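
The cells above show that a generator keeps its position between uses: each `for` loop or `next()` call resumes where the previous one stopped. The following is a minimal sketch of the same behavior that also shows what happens once the generator is exhausted. It redefines `f123x` so that it runs on its own; the variable names (`gen`, `first`, `second`, `rest`, `after_end`) are illustrative only.

```python
def f123x(x):
    """Yield the integers 0, 1, ..., x - 1, one at a time."""
    for i in range(x):
        yield i


gen = f123x(4)

first = next(gen)    # runs the function body up to the first yield
second = next(gen)   # resumes right after the previous yield
rest = list(gen)     # consumes every value that is left

# An exhausted generator produces nothing more. Passing a default to
# next() returns that default instead of raising StopIteration.
after_end = next(gen, None)

print(first, second, rest, after_end)
```

To iterate over the values again, call `f123x(4)` again to create a fresh generator, as the cells above do with `x = f123x(10)`.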

### Our first mrjob script

Note the use of the cell magic ```%%file```, which writes the contents of a cell out to a file. We need this because mrjob expects each job to be defined in its own script file:

In [18]:
%%file word_count.py
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    # input: self, in_key (unused, hence _), in_value (one line of text)
    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    # input: self, in_key from mapper, all in_values the mapper yielded for that key
    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == "__main__":
    MRWordFrequencyCount.run()

Writing word_count.py


Now we can run the script from the "command line", supplying the file to analyze as the first argument. Today's zip file includes data/simpletext.txt, which contains the text from the slides.

In [20]:
!python word_count.py data/simpletext.txt

"chars"	41
"lines"	3
"words"	9


No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory C:\Users\dteng\AppData\Local\Temp\word_count.dteng.20190408.222817.270834
job output is in C:\Users\dteng\AppData\Local\Temp\word_count.dteng.20190408.222817.270834\output
Streaming final output from C:\Users\dteng\AppData\Local\Temp\word_count.dteng.20190408.222817.270834\output...
Removing temp directory C:\Users\dteng\AppData\Local\Temp\word_count.dteng.20190408.222817.270834...


### <font color="magenta">Q1: Explain what the yield statements in the above script do. Provide a list of what the first few iterations through the mapper() step would yield.</font>

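For each input line, the mapper yields three key-value pairs: ("chars", the number of characters in the line), ("words", the number of whitespace-separated words in the line), and ("lines", 1). The reducer then receives all the values yielded for each key and yields their sum, which gives the totals for the whole file.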
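
To make the flow of key-value pairs concrete, here is a rough plain-Python sketch of the map, shuffle, and reduce phases, written without mrjob. The three input lines are an assumption made for illustration (the contents of data/simpletext.txt are not shown in this notebook), and the names `lines`, `mapped`, `grouped`, and `totals` are hypothetical.

```python
from collections import defaultdict

# Assumed input for illustration only; the real file may differ.
lines = ["Dear Bear River", "Car Car River", "Dear Car Bear"]


def mapper(_, line):
    # Same three yields as the mapper in word_count.py
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


def reducer(key, values):
    # Same logic as the reducer in word_count.py
    yield key, sum(values)


# Map phase: call the mapper once per line and collect every pair it yields.
mapped = [pair for line in lines for pair in mapper(None, line)]

# Shuffle phase: group the values by key, as the framework does between steps.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: call the reducer once per key.
totals = dict(pair for key in sorted(grouped) for pair in reducer(key, grouped[key]))

print(mapped[:3])  # the pairs yielded for the first line
print(totals)
```

With these assumed lines, the first mapper call yields `('chars', 15)`, `('words', 3)`, and `('lines', 1)`, and the totals come out to 41 characters, 3 lines, and 9 words, which agrees with the job output above.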

### <font color="magenta">Q2: Repeat the above cell using the Tale of Two Cities text file (data/totc.txt). Report your findings below.</font>

In [21]:
!python word_count.py data/totc.txt

"chars"	760429
"lines"	16273
"words"	138883


No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory C:\Users\dteng\AppData\Local\Temp\word_count.dteng.20190408.223336.445873
job output is in C:\Users\dteng\AppData\Local\Temp\word_count.dteng.20190408.223336.445873\output
Streaming final output from C:\Users\dteng\AppData\Local\Temp\word_count.dteng.20190408.223336.445873\output...
Removing temp directory C:\Users\dteng\AppData\Local\Temp\word_count.dteng.20190408.223336.445873...


The file contains 16,273 lines, 138,883 words, and 760,429 characters.

### Now let's look at a slightly more complicated example:

In [22]:
%%file most_used_word.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")


class MRMostUsedWord(MRJob):
    STOPWORDS = {'i', 'we', 'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            if word.lower() not in self.STOPWORDS:
                yield (word.lower(), 1)

    def combiner_count_words(self, word, counts):
        # optimization: sum the words we've seen so far
        yield (word, sum(counts))

    def reducer_count_words(self, word, counts):
        # send all (num_occurrences, word) pairs to the same reducer.
        # num_occurrences is so we can easily use Python's max() function.
        yield None, (sum(counts), word)

    # discard the key; it is just None
    def reducer_find_max_word(self, _, word_count_pairs):
        # each item of word_count_pairs is (count, word),
        # so yielding one results in key=counts, value=word
        yield max(word_count_pairs)


if __name__ == '__main__':
    MRMostUsedWord.run()

Writing most_used_word.py


In [23]:
!python most_used_word.py data/simpletext.txt

3	"car"


No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 2...
Creating temp directory C:\Users\dteng\AppData\Local\Temp\most_used_word.dteng.20190408.223456.139356
Running step 2 of 2...
job output is in C:\Users\dteng\AppData\Local\Temp\most_used_word.dteng.20190408.223456.139356\output
Streaming final output from C:\Users\dteng\AppData\Local\Temp\most_used_word.dteng.20190408.223456.139356\output...
Removing temp directory C:\Users\dteng\AppData\Local\Temp\most_used_word.dteng.20190408.223456.139356...


### <font color="magenta">Q3: Explain what the yield statements in the above script do. Provide a list of what the first few iterations through the mapper() step would yield.</font>

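The mapper extracts each word from a line, lowercases it, skips it if it is a stop word, and yields the pair (word, 1). The combiner and the first reducer sum these 1s per word. The first reducer then yields (None, (count, word)), so that every pair reaches the same second-step reducer, which yields the pair with the largest count.  

So, the first few iterations through the mapper would yield:  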
dear, 1  
bear, 1  
river, 1  
car, 1  
car, 1  
river, 1  
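
The two steps can be approximated in plain Python, without mrjob, to show why putting the count first in the `(count, word)` tuple lets `max()` pick the most used word. This is only a sketch under stated assumptions: the input lines are assumed for illustration (data/simpletext.txt is not shown here), `STOPWORDS` is a shortened, hypothetical subset of the set in the script, and the combiner is omitted because it only pre-sums counts and should not change the result.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")

# Shortened, hypothetical subset of the STOPWORDS set in most_used_word.py
STOPWORDS = {"the", "a", "of", "and"}

# Assumed input for illustration only; the real file may differ.
lines = ["Dear Bear River", "Car Car River", "Dear Car Bear"]


def mapper_get_words(_, line):
    # Same logic as the mapper in most_used_word.py
    for word in WORD_RE.findall(line):
        if word.lower() not in STOPWORDS:
            yield word.lower(), 1


# Step 1, map phase: collect every (word, 1) pair.
mapped = [pair for line in lines for pair in mapper_get_words(None, line)]

# Step 1, reduce phase: sum the 1s per word.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

# reducer_count_words yields (None, (count, word)), so all pairs share one key.
pairs = [(count, word) for word, count in counts.items()]

# Step 2: tuples compare element by element, so max() compares counts first
# and falls back to alphabetical order of the word only to break ties.
best = max(pairs)

print(mapped[:6])
print(best)
```

With these assumed lines, the first six mapper pairs match the list above and `best` is `(3, 'car')`, consistent with the `3	"car"` output of the job.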

### <font color="magenta">Q4: Run the above script on the Tale of Two Cities text file (data/totc.txt).  What answer do you get?</font>

In [24]:
!python most_used_word.py data/totc.txt

661	"said"


No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 2...
Creating temp directory C:\Users\dteng\AppData\Local\Temp\most_used_word.dteng.20190408.223728.334842
Running step 2 of 2...
job output is in C:\Users\dteng\AppData\Local\Temp\most_used_word.dteng.20190408.223728.334842\output
Streaming final output from C:\Users\dteng\AppData\Local\Temp\most_used_word.dteng.20190408.223728.334842\output...
Removing temp directory C:\Users\dteng\AppData\Local\Temp\most_used_word.dteng.20190408.223728.334842...


After stop words are excluded, the most common word in the Tale of Two Cities is 'said', which occurs 661 times.