# Hadoop and MapReduce
### Jack Bennetto
#### April 10, 2017

## Objectives

By the end of this class, we will be able to:

- Explain how HDFS stores large files on a cluster.
- Describe MapReduce, and how it relates to Hadoop.
- Explain types of problems which benefit from MapReduce.
- Write MapReduce (multi-step) jobs in Python using MRJob.
- Speed up MapReduce using combiners, map-only jobs, and counters. 

## Agenda

Morning
 * Big data
 * HDFS
 * MapReduce
 * Examples with MRJob

Afternoon
 * Hive & Pig
 * Combiners
 * Map-only jobs and counters
 * Multi-step jobs

## Motivation

(and caveats)

## Big Data

"Big data is *high volume*, *high velocity* and/or *high variety* information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." – Gartner, Inc.

In particular, we often talk about data that is too large to fit in the memory of a single computer, or that must be processed faster than a single machine allows. In this case we need a distributed (as opposed to local) architecture that spreads the data across the disks, memory, and processors of multiple machines.

Distributed systems are great, but have some challenges.

Local                     | Distributed
--------------------------|-------------
Simple to understand and program     | Hard to understand and program; not always appropriate
No communications overhead | Overhead passing data between computers
Limited memory, cpu, disk  | Lots of memory, cpu, disks
Single point of failure    | May be fault tolerant


### Big-data problem

We have 100 TB of sales data that looks like this:

ID    |Date          |Store  |State |Product   |Amount
--    |----          |-----  |----- |-------   |------
101   |11/13/2014    |100    |WA    |331       |300.00
104   |11/18/2014    |700    |OR    |329       |450.00

What are some of the questions we could answer if we could process this huge data set?

- How many transactions were there by store, by state?
- How many transactions were there by product?
- How many transactions were there by week, month, year?
- How many transactions were there by store, state, product, month?
- How much revenue did we make by store, state?
- How much revenue did we make by product?
- How much revenue did we make by week, month, year?
- How much revenue did we make by store, state, product, month?

### Statistical Uses

Why are these interesting?

- These questions can help us figure out which products are selling
  in which markets, at what time of the year.
- Using statistical algorithms such as regression or random forests we
  can predict sales.

What kinds of sales can we predict?
  
- How much of each product will sell in each store next week.
- How much of each product to stock in inventory.
- If there are any large-scale trends.
- If there are any blips in the data.

### Engineering Problem

To answer these questions we have to solve two problems:

- Store 100 TB of data
- Process 100 TB of data

Here is our starting point:

- To solve this problem we have been provided with 1000 commodity Linux servers.
- How can we organize these machines to store and process this data?

## Hadoop Intro

Hadoop is a cluster operating system. It is made up of:

- HDFS, which coordinates storing large amounts of data on a
  cluster.

- MapReduce, which coordinates processing data across a cluster of
  machines.

### Google Papers

Hadoop, HDFS, and MapReduce are open source implementations of the
ideas in these papers from Google and Stanford.

- Paper #1: [2003] The Google File System     
    <http://research.google.com/archive/gfs-sosp2003.pdf>

- Paper #2: [2004] MapReduce: Simplified Data Processing on Large Clusters    
    <http://research.google.com/archive/mapreduce-osdi04.pdf>

- Paper #3: [2006] Bigtable: A Distributed Storage System for Structured Data
    <http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf>


### Hadoop Analogy

System     |Analogy
------     |-------
Hadoop     |Cluster Operating System
HDFS       |Cluster Disk Drive
MapReduce  |Cluster CPU

- Hadoop clusters are made up of commodity Linux machines.
- Each machine is weak, limited, and may fail.
- Hadoop combines these machines into something more powerful than any part.  

## HDFS

<img src="images/hdfs-data-distribution.png">

### HDFS Notes

- HDFS breaks up large files into 128 MB blocks.

- The system stores 3 replicas of each block.

- When a machine goes down the NameNode daemon makes the DataNode
  daemons rereplicate the lost blocks.
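
To get a feel for these numbers, here is a back-of-the-envelope sketch that applies the block size and replication factor above to the 100 TB, 1000-machine scenario from earlier. It assumes binary units (1 TB = 1024 × 1024 MB) and perfectly even placement of replicas, neither of which a real cluster guarantees.

```python
import math

BLOCK_SIZE_MB = 128   # HDFS block size from the notes above
REPLICATION = 3       # replicas stored for each block
DATA_TB = 100         # size of the sales data set
MACHINES = 1000       # commodity servers available

# Assumption: binary units, 1 TB = 1024 * 1024 MB.
data_mb = DATA_TB * 1024 * 1024

n_blocks = math.ceil(data_mb / BLOCK_SIZE_MB)
n_replicas = n_blocks * REPLICATION

# Assumption: replicas are spread perfectly evenly across the machines.
replicas_per_machine = n_replicas / MACHINES
disk_per_machine_gb = DATA_TB * REPLICATION * 1024 / MACHINES

print("blocks:", n_blocks)
print("block replicas:", n_replicas)
print("replicas per machine:", replicas_per_machine)
print("disk per machine (GB):", disk_per_machine_gb)
```

Under these assumptions each machine holds a few thousand block replicas and roughly 300 GB of raw data, which is well within reach of a commodity server.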
  
Questions:

* What are the implications of block sizes that large?

* In this picture how many machines can crash before we lose data?

* If a machine crashes, the system rereplicates the lost blocks, and
then the machine rejoins the cluster. What happens to the block replication
count?

## MapReduce

Question:

Suppose you want to count how many times each word appears across a large number of large text files. How would you spread the work across multiple machines?

<img src="images/map-reduce-key-partition.png">

MapReduce Notes
---------------

How does MapReduce work?

- The developer provides mapper and reducer code.
- The mapper function transforms individual records and attaches a key to each record.
- All the records with the same key end up on the same reducer.
- For each key the reduce function combines the records with that key.

Which machines run mappers and which run reducers?

- The JobTracker tries to run the mappers on the machines where the
  blocks of input data are located.
- This is called data locality – ideally, the mapper does not need to
  pull data across the network.
- The reducers are assigned randomly to machines which have memory and
  CPUs currently available.
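
One common way to make every record with a given key reach the same reducer is to hash the key and take the remainder modulo the number of reducers. The sketch below illustrates the idea; `partition` is a hypothetical helper, and `zlib.crc32` is used only as a stand-in hash because Python's built-in string hash changes between runs.

```python
import zlib

def partition(key, n_reducers):
    """Choose a reducer index for a key (hypothetical helper)."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers

# Mapper output for the six sample sales records, keyed by state.
mapped = [("WA", 1), ("OR", 1), ("CA", 1), ("CA", 1), ("WA", 1), ("CA", 1)]
n_reducers = 2

# Send each record to the bucket (reducer) chosen by its key.
buckets = {i: [] for i in range(n_reducers)}
for key, value in mapped:
    buckets[partition(key, n_reducers)].append((key, value))

for i, records in buckets.items():
    print("reducer", i, records)
```

Because the reducer index depends only on the key, all the records for one state land in the same bucket no matter which mapper produced them.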

Questions:

 * How many mappers does each job get?
 * How many reducers does each job get?
 * Suppose I want to find out how many sales transactions are in a data set for each state. What key should the mapper output?

### MapReduce Using MRJob


Hadoop is technically a Java library from Apache. You probably don't want to use it in Java, because it's not easy.

The following is a "simple" word count on a text file.


```java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
           extends Mapper<Object, Text, Text, IntWritable>{

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
      }

    public static class IntSumReducer
           extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

## MRJob

MRJob is a Python wrapper around Hadoop created by Yelp (luckily for you). It makes your life *a lot* easier.

Here's the same program in Python.

```python
from mrjob.job import MRJob

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
```

## Sales Data

Here is the sales data we are going to analyze.

In [14]:
%%writefile sales.txt
#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
106    11/19/2014     202     CA     331        330.00
103    11/17/2014     101     WA     373        750.00
105    11/19/2014     202     CA     321        200.00

Overwriting sales.txt


## Transactions By State

Q: How many transactions were there for each state?

We can't run MapReduce from Jupyter, so we'll write a `.py` file and run it.

This includes a class with two methods, a mapper and a reducer.

In [15]:
%%writefile SaleCount.py
from mrjob.job import MRJob

class SaleCount(MRJob):
    
    def mapper(self, _, line):
        if line.startswith('#'):
            return
        fields = line.split()
        state = fields[3]
        store = fields[2]
        yield (state, 1)
        
    def reducer(self, state, counts): 
        yield state, sum(counts)
        
if __name__ == '__main__': 
    SaleCount.run()

Overwriting SaleCount.py


- Run it locally.

In [17]:
!python SaleCount.py sales.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180403.235707.063161

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180403.235707.063161/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180403.235707.063161/step-0-mapper-sorted
> sort /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180403.235707.063161/step-0-mapper_part-00000
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180403.235707.06

- Check the output.

In [18]:
!cat output.txt

"CA"	3
"OR"	1
"WA"	2


Questions:

* Suppose instead of counting transactions by state we want to count
transactions by store. What should we change in the code above?
* Suppose instead of counting transactions we want to find total
revenue by state. What should we change in the code above?


## Using MapReduce For Statistics

- Using MapReduce we can calculate statistics for any factors.
- Our factor or condition becomes the key.
- The parameter that we want to calculate the statistic on becomes
  the value.
- The reducer contains the logic to apply the statistic.
- The statistic can be sum, count, average, stdev, etc.
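
As an illustration of this pattern, the sketch below computes the average sale amount per state on the sample sales data. It runs in plain Python rather than under MRJob: sorting and `itertools.groupby` stand in for the shuffle, and the `mapper` and `reducer` functions are simplified versions without `self`.

```python
from itertools import groupby
from operator import itemgetter

SALES = """\
#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
106    11/19/2014     202     CA     331        330.00
103    11/17/2014     101     WA     373        750.00
105    11/19/2014     202     CA     321        200.00
"""

def mapper(line):
    # The factor (state) is the key; the measured quantity is the value.
    if line.startswith('#'):
        return
    fields = line.split()
    yield fields[3], float(fields[5])

def reducer(state, amounts):
    # The reducer holds the logic for the statistic, here the mean.
    amounts = list(amounts)
    yield state, sum(amounts) / len(amounts)

# Sorting the pairs by key stands in for the shuffle-and-sort phase.
pairs = sorted(pair for line in SALES.splitlines() for pair in mapper(line))

means = {}
for state, group in groupby(pairs, key=itemgetter(0)):
    for key, mean in reducer(state, (amount for _, amount in group)):
        means[key] = mean

print(means)
```

Swapping the body of `reducer` for a sum, count, or standard deviation changes the statistic without touching the mapper.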

## Using MRJob for Word Count

First, we create an input file.

In [19]:
%%writefile input.txt
hello world
this is the second line
this is the third line
hello again

Overwriting input.txt


- Create the `WordCount.py` file.

In [20]:
%%writefile WordCount.py
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class WordCount(MRJob):
    
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)
            
    def reducer(self, word, counts): 
        yield word, sum(counts)
        
if __name__ == '__main__': 
    WordCount.run()

Overwriting WordCount.py


- Run it locally.

In [22]:
!python WordCount.py input.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/WordCount.jackbennetto.20180404.000115.516612

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/WordCount.jackbennetto.20180404.000115.516612/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/WordCount.jackbennetto.20180404.000115.516612/step-0-mapper-sorted
> sort /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/WordCount.jackbennetto.20180404.000115.516612/step-0-mapper_part-00000
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/WordCount.jackbennetto.20180404.000115.51

- Check the output.

In [23]:
!cat output.txt

"again"	1
"hello"	2
"is"	2
"line"	2
"second"	1
"the"	2
"third"	1
"this"	2
"world"	1


### Word Count Notes

- WordCount is used as a standard distributed application.

- A large corpus can require more storage than the disk on a single
  machine.

- WordCount generalizes to other counting applications such as
  counting clicks by category.

# Afternoon Lecture

## MapReduce Abstractions

### Why Hive and Pig

- Instead of writing MapReduce programs, what if we could write SQL?
- Hive and Pig let you write MapReduce programs in SQL-like languages.
- These are then converted to MapReduce on the fly.
- We will look at Spark SQL tomorrow, which fills the same niche.

### Hive Example

```sql
SELECT user.*
FROM user
WHERE user.active = 1;
```

### Hive

- Hive was developed at Facebook.
- It translates SQL to generate MapReduce code.
- Its dialect of SQL is called HiveQL.
- Data scientists can use SQL instead of MapReduce to process data.

### Pig Example

```pig
user = LOAD 'user';
active_user = FILTER user BY active == 1;
dump active_user;
```

### Pig

- Pig was developed at Yahoo.
- It solves the same problem as Hive.
- Pig uses a custom scripting language called Pig Latin instead of SQL.
- Pig Latin resembles scripting languages like Python and Perl.
- Pig is frequently used for processing unstructured or badly formed
  data.

## Advanced MapReduce Applications

### Combiner

- Communication between nodes can be a major bottleneck in MapReduce.
- Reducing the records before shuffling them saves bandwidth/disk usage.
- The *combiner* shrinks the number of records locally before shuffling.
- A reducer can only be used as a combiner if it is commutative and associative.
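
The effect of a combiner can be seen in a small plain-Python sketch. The two lists below are hypothetical outputs of two mappers working on different input splits, and `combine` plays the role of both the combiner and the reducer, which is valid here because addition is commutative and associative.

```python
from collections import defaultdict

def combine(pairs):
    # Sum the values for each key; usable as a combiner and as a reducer.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical mapper output from two input splits.
split_a = [("CA", 1), ("WA", 1), ("CA", 1)]
split_b = [("CA", 1), ("OR", 1), ("WA", 1)]

# Without a combiner, every mapper record is shuffled to the reducers.
shuffled_plain = split_a + split_b

# With a combiner, each mapper pre-aggregates its own records first.
shuffled_combined = combine(split_a) + combine(split_b)

print(len(shuffled_plain), "records shuffled without a combiner")
print(len(shuffled_combined), "records shuffled with a combiner")

# The reducer produces the same answer either way.
print(combine(shuffled_plain))
print(combine(shuffled_combined))
```

The saving is tiny on six records, but it grows with the number of repeated keys per mapper, which is typically large for jobs like word count.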

### Transactions By State Using Combiner

Q: How many transactions were there for each state?

- Create the `SaleCountFast.py` file.

In [24]:
%%writefile SaleCountFast.py
from mrjob.job import MRJob

class SaleCountFast(MRJob):
    
    def mapper(self, _, line):
        if line.startswith('#'):
            return
        fields = line.split()
        state = fields[3]
        yield (state, 1)
        
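    def combiner(self, state, counts):
        # Pre-aggregate the counts on each mapper before the shuffle.
        yield state, sum(counts)

    def reducer(self, state, counts):
        # Still needed: merges the partial sums coming from every mapper.
        yield state, sum(counts)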
        
if __name__ == '__main__': 
    SaleCountFast.run()

Writing SaleCountFast.py


- Run it locally.

In [26]:
!python SaleCountFast.py sales.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountFast.jackbennetto.20180404.000451.662372

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountFast.jackbennetto.20180404.000451.662372/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
Moving /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountFast.jackbennetto.20180404.000451.662372/step-0-mapper_part-00000 -> /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountFast.jackbennetto.20180404.000451.662372/output/part-00000
Streaming final output from /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountFast.jackb

- Check the output.

In [27]:
!cat output.txt

"CA"	3
"OR"	1
"WA"	2


Questions:

 * Can we use the reduce function as a combiner if we are calculating
the total number of sales transactions per state?
 * Can we use the reduce function as a combiner if we are calculating
the average transaction revenue per state?

### Using Map-Only Job To Clean Data

Q: Write an application that extracts all the `CA` sales records.

- This only requires transforming records, without consolidating them.

- Any time we don't have to consolidate records we can use a *Map
  Only* job.

- Create the `SaleExtract.py` file.

In [28]:
%%writefile SaleExtract.py
from mrjob.job import MRJob

class SaleExtract(MRJob):
    
    def mapper(self, _, line):
        if line.startswith('#'): 
            return
        fields = line.split()
        state = fields[3]
        if state != 'CA': 
            return
        yield (state, line)
        

if __name__ == '__main__': 
    SaleExtract.run()

Writing SaleExtract.py


- Run it locally.

In [29]:
!python SaleExtract.py sales.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleExtract.jackbennetto.20180404.000502.348834

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleExtract.jackbennetto.20180404.000502.348834/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
Moving /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleExtract.jackbennetto.20180404.000502.348834/step-0-mapper_part-00000 -> /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleExtract.jackbennetto.20180404.000502.348834/output/part-00000
Streaming final output from /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleExtract.jackb

- Check the output.

In [30]:
!cat output.txt

"CA"	"102    11/15/2014     203     CA     321        200.00"
"CA"	"106    11/19/2014     202     CA     331        330.00"
"CA"	"105    11/19/2014     202     CA     321        200.00"


### Map-Only Applications

Here are some other applications of map-only jobs.

- Web-crawler that finds out how many jobs are on Craigslist for a
  particular keyword.
- Application that maps property addresses to property back-taxes by
  scraping county databases.

Question:

 * Do map-only applications shuffle and sort the data?


### Counters

Q: Count how many transactions there were in California and Washington.

- One way to solve this problem is to use a MapReduce application we did before.
- However, if we have a fixed number of categories we want to count we can use counters.
- If we use counters we no longer need a reduce phase, and can use a map-only job.
- Hadoop limits each job to 120 counters by default, so counters cannot be used to count an unknown number of categories.

Create the `SaleCount1.py` file.

In [31]:
%%writefile SaleCount1.py
from mrjob.job import MRJob

class SaleCount1(MRJob):
    
    def mapper(self, _, line):
        if line.startswith('#'): 
            return
        fields = line.split()
        state = fields[3]
        if state == 'CA':
            self.increment_counter('State', 'CA', 1)
        if state == 'WA':
            self.increment_counter('State', 'WA', 1)

if __name__ == '__main__': 
    SaleCount1.run()

Writing SaleCount1.py


- Run it locally.

In [32]:
!python SaleCount1.py sales.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount1.jackbennetto.20180404.000506.966566

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount1.jackbennetto.20180404.000506.966566/step-0-mapper_part-00000
Counters from step 1:
  State:
    CA: 3
    WA: 2
Moving /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount1.jackbennetto.20180404.000506.966566/step-0-mapper_part-00000 -> /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount1.jackbennetto.20180404.000506.966566/output/part-00000
Streaming final output from /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount1.jackbennetto.

- There should not be any output. The counter values were printed when
  the job was executed.

In [33]:
!cat output.txt

### Counter Notes

- Counters can be incremented in both the map and the reduce phase.
- Counter values from all the machines participating in a MapReduce
  job are aggregated to compute a job-wide value.
- Counter values are printed out when the job completes and are also
  accessible on the Hadoop Web UI that stores job history.
- Counters have a group name and a counter name.
- Group names help organize counters.
- Here is how we increment a counter:
  `self.increment_counter(group_name, counter_name, 1)`
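
The aggregation step can be pictured with Python's `collections.Counter`. This is a sketch of the idea only: the two task counters and the standalone `increment_counter` function below are hypothetical stand-ins, not part of MRJob.

```python
from collections import Counter

def increment_counter(counters, group_name, counter_name, amount=1):
    # Counters are identified by a (group name, counter name) pair.
    counters[(group_name, counter_name)] += amount

# Hypothetical: two map tasks, each seeing half of the sales records.
task_1 = Counter()
task_2 = Counter()

for state in ["WA", "OR", "CA"]:
    if state in ("CA", "WA"):
        increment_counter(task_1, "State", state)

for state in ["CA", "WA", "CA"]:
    if state in ("CA", "WA"):
        increment_counter(task_2, "State", state)

# The framework adds up the per-task values into job-wide totals.
job_counters = task_1 + task_2

for (group, name), value in sorted(job_counters.items()):
    print(group, name, value)
```

The job-wide totals match the `CA: 3` and `WA: 2` values reported in the job log above.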

Questions:

SalesStrategy Inc employs 100,000 part-time sales partners to sell
their products. The salespeople get monthly bonuses based on the
number of transactions they ring up. Should SalesStrategy use counters
to calculate these bonuses? Why or why not?

### Map-Only Job Observations

- Map-only jobs are the multi-machine equivalent of the
  multi-threading and multi-processing exercises we did earlier.
- Like our multi-threading and multi-processing applications, map-only
  jobs break up a larger problem into smaller chunks and then work on
  a particular chunk.
- Any time we have a problem where we don't need to reconcile or
  consolidate records we should use map-only jobs.
- Map-only jobs are much faster than regular MapReduce jobs.
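
The single-machine analogue might look like the sketch below, which extracts the `CA` records from three chunks of the sales data in parallel. It assumes threads as a stand-in for cluster machines; `extract_ca` and the way the records are split into chunks are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_ca(chunk):
    # Each worker filters its own chunk; no results need to be merged by key.
    return [line for line in chunk if line.split()[3] == "CA"]

# Hypothetical split of the sales records into three chunks.
chunks = [
    ["101 11/13/2014 100 WA 331 300.00", "104 11/18/2014 700 OR 329 450.00"],
    ["102 11/15/2014 203 CA 321 200.00", "106 11/19/2014 202 CA 331 330.00"],
    ["103 11/17/2014 101 WA 373 750.00", "105 11/19/2014 202 CA 321 200.00"],
]

with ThreadPoolExecutor(max_workers=3) as pool:
    parts = list(pool.map(extract_ca, chunks))

# The final output is just the concatenation of each worker's output.
ca_records = [line for part in parts for line in part]
for line in ca_records:
    print(line)
```

Each worker writes its results independently, just as each mapper in a map-only job writes its own output file.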

Question:

 * Why are map-only jobs faster than regular MapReduce jobs?

## Chaining Jobs Together

It's possible to chain multiple MapReduce steps together. Each mapper takes the key-value pairs emitted by the step before it, and each reducer takes those pairs grouped by key.

Rather than overriding the `mapper` and `reducer` methods, we override the `steps` method to return a list of `MRStep`s, each with (optionally) a mapper, combiner, and reducer. First, a simple example that returns the total sales only for states whose total is greater than 500.00 (similar to a `HAVING` clause in SQL).

In [34]:
%%writefile SaleCountTwoStep.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class SaleCount(MRJob):
    
    def mapper1(self, _, line):
        if line.startswith('#'):
            return
        fields = line.split()
        state = fields[3]
        amount = float(fields[5])
        yield (state, amount)
        
    def reducer1(self, state, amounts):
        yield state, sum(amounts)
    
    def having(self, state, amount):
        if amount > 500:
            yield state, amount
        
    def steps(self):
        return [
            MRStep(mapper=self.mapper1, reducer=self.reducer1),
            MRStep(mapper=self.having)
        ]
        
if __name__ == '__main__': 
    SaleCount.run()

Writing SaleCountTwoStep.py


Run locally and look at the output.

In [35]:
!python SaleCountTwoStep.py sales.txt > output.txt
!cat output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountTwoStep.jackbennetto.20180404.000516.647728

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountTwoStep.jackbennetto.20180404.000516.647728/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountTwoStep.jackbennetto.20180404.000516.647728/step-0-mapper-sorted
> sort /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountTwoStep.jackbennetto.20180404.000516.647728/step-0-mapper_part-00000
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCountTwoS
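
The captured output above ends before the result appears. As a cross-check, the same two steps can be traced in plain Python on the six sample records. This sketch does not use MRJob: the functions drop `self`, and `group_by_key` is a hypothetical helper that stands in for the shuffle.

```python
from collections import defaultdict

SALES = [
    "101    11/13/2014     100     WA     331        300.00",
    "104    11/18/2014     700     OR     329        450.00",
    "102    11/15/2014     203     CA     321        200.00",
    "106    11/19/2014     202     CA     331        330.00",
    "103    11/17/2014     101     WA     373        750.00",
    "105    11/19/2014     202     CA     321        200.00",
]

def mapper1(_, line):
    fields = line.split()
    yield fields[3], float(fields[5])

def reducer1(state, amounts):
    yield state, sum(amounts)

def having(state, amount):
    if amount > 500:
        yield state, amount

def group_by_key(pairs):
    # Stand-in for the shuffle: collect values by key, in sorted key order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

# Step 1: mapper followed by reducer gives the total per state.
mapped = [pair for line in SALES for pair in mapper1(None, line)]
totals = [pair for state, amounts in group_by_key(mapped)
          for pair in reducer1(state, amounts)]

# Step 2: a mapper with no reducer filters the totals.
result = [pair for state, total in totals for pair in having(state, total)]
print(result)  # [('CA', 730.0), ('WA', 1050.0)]
```

Oregon's total of 450.00 falls below the threshold, so only California and Washington survive the second step.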

Here's a trickier example that takes advantage of the sorting in MapReduce.

Q: Find word frequencies and sort the result by frequency. 

This requires running two MapReduce jobs.
 - The first will calculate word frequencies.
 - The second will sort them.


First, create `MostUsedWords.py`.

In [36]:
%%writefile MostUsedWords.py
from mrjob.job  import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")

class MostUsedWords(MRJob):

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer_count_words(self, word, counts):
        count_sum = '%03d'%sum(counts) 
        yield (count_sum, word)

    def reducer_sort(self, count, words):
        for word in words:
            yield (word, count)

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_sort)
        ]

if __name__ == '__main__':
    MostUsedWords.run()

Writing MostUsedWords.py


- Run it locally.

In [37]:
!python MostUsedWords.py input.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/MostUsedWords.jackbennetto.20180404.000535.593879

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/MostUsedWords.jackbennetto.20180404.000535.593879/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/MostUsedWords.jackbennetto.20180404.000535.593879/step-0-mapper-sorted
> sort /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/MostUsedWords.jackbennetto.20180404.000535.593879/step-0-mapper_part-00000
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/MostUsedWords.jackbennett

- Check the output.

In [38]:
!cat output.txt

"again"	"001"
"second"	"001"
"third"	"001"
"world"	"001"
"hello"	"002"
"is"	"002"
"line"	"002"
"the"	"002"
"this"	"002"


Hadoop Streaming API
-----------------------

- Why are we left-padding the counts with zeros?

- MRJob is a wrapper around the Hadoop Streaming API.

- The Hadoop Streaming API converts all intermediate types to strings for comparison.

- So `123` will be smaller than `59` because it starts with `1` which
  is less than `5`.
  
- To get around this in MRJob if we want our data to sort numerically
  we have to left-pad the numbers with zeros.
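
The difference between string order and numeric order is easy to reproduce in plain Python. This short sketch uses the same `'%03d'` format as the word-count example:

```python
counts = [123, 59, 7]

# Compared as strings, the numbers sort character by character.
plain = sorted(str(n) for n in counts)

# Left-padded to a fixed width, string order matches numeric order.
padded = sorted('%03d' % n for n in counts)

print(plain)   # ['123', '59', '7']
print(padded)  # ['007', '059', '123']
```

The padding width has to be large enough for the biggest value expected, otherwise longer numbers will again sort out of order.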


### Sorting Sales Data

Q: Find the total sales per state and then sort by sales to find the
state with the highest sales total.

- We can use a multi-step MRJob to do this.

- Sort sales data using two steps.

In [39]:
%%writefile SaleCount.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class SaleCount(MRJob):
   
    def mapper1(self, _, line):
        if line.startswith('#'):
            return
        fields = line.split()
        amount = float(fields[5])
        state = fields[3]
        yield (state, amount)

    def reducer1(self, state, amounts):
        amount = '%07.2f'%sum(amounts) 
        yield (state, amount)
    
    def mapper2(self, state, amount):
        yield (amount, state)

    def reducer2(self, amount, states):
        for state in states: 
            yield (state, amount)
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper1, reducer=self.reducer1),
            MRStep(mapper=self.mapper2, reducer=self.reducer2)
        ]

if __name__ == '__main__':
    SaleCount.run()

Overwriting SaleCount.py


- Run it locally.

In [40]:
!python SaleCount.py sales.txt --jobconf mapred.reduce.tasks=2 > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180404.000544.608820

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180404.000544.608820/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180404.000544.608820/step-0-mapper-sorted
> sort /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180404.000544.608820/step-0-mapper_part-00000
writing to /var/folders/m9/8htjs0dj34d0qgtnbq5w_2fm0000gn/T/SaleCount.jackbennetto.20180404.000544.60

- Check the output.

In [41]:
!cat output.txt

"OR"	"0450.00"
"CA"	"0730.00"
"WA"	"1050.00"


## Appendix: an Aside on Generators

Most Python functions run and return a value. That value might be a single number or string, or it might be a list or dictionary. It might be a tuple, effectively returning multiple things. Or it might be `None`, effectively nothing at all (this is implicit if there isn't a return statement). But in all these cases, the return value is one object returned all at once.

The `yield` statement allows a function to return many objects as they become available, a bit at a time. Calling any function that contains a `yield` returns a **generator**. Let's see what it looks like.

In [16]:
def return_arguments_as_generator(a, b, c):
    yield a
    yield b
    yield c

In [23]:
g = return_arguments_as_generator(3, 4, 5)
print(g)

<generator object return_arguments_as_generator at 0x1153baa98>


We can iterate over a generator with a for loop, or convert it into a list.

In [18]:
for x in g:
    print(x)

3
4
5


In [19]:
list(return_arguments_as_generator(10, 20, 30))

[10, 20, 30]

So what's happening here? When we call the function, it doesn't actually run yet; it just returns a generator. Calling the `next` function on the generator advances the function until it hits a `yield` statement and returns the `yield`ed value.

Note we have to recreate the generator; the old one is finished.

In [24]:
g = return_arguments_as_generator(7, 8, 9)
next(g)

7

In [25]:
next(g)

8

In [26]:
next(g)

9

If the function reaches a `return` statement (either explicitly, or implicitly by reaching the end of the function), the call to `next` raises a `StopIteration` exception.

In [27]:
next(g)

StopIteration: 

More often, generators are used to return values from a loop. We could write a version of the `range` function like this.

In [48]:
def my_range(start, end, step=1):
    '''Similar to built-in range function,
    but both start and end are required,
    and step must be positive'''
    i = start
    while i < end:
        yield i
        i += step

In [49]:
list(my_range(3, 10))

[3, 4, 5, 6, 7, 8, 9]
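
The mappers and reducers in the MRJob examples are generators of exactly this kind. As a sketch, here is the word-count mapper written as a plain function (without `self`) and driven by hand:

```python
def mapper(_, line):
    # Yields one (key, value) pair per word instead of building a list.
    for word in line.split():
        yield (word, 1)

pairs = mapper(None, "hello big world")

# Nothing has run yet; next() advances to the first yield.
first = next(pairs)

# list() consumes whatever the generator has left.
rest = list(pairs)

print(first)  # ('hello', 1)
print(rest)   # [('big', 1), ('world', 1)]
```

Because pairs are produced one at a time, the caller can process each pair as soon as it is yielded, without the mapper first building the full list of results.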