## Hadoop! A First Example

I've installed Map/Reduce and an example in Gigantum. Run it in a terminal with:

```bash
cd code/hadoop/wordcount
/usr/local/hadoop-2.9.2/bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
/usr/local/hadoop-2.9.2/bin/hadoop jar wc.jar WordCount /mnt/labbook/input/wordcount /mnt/labbook/output/untracked/output
```
### What did we learn??

* Hadoop programs are written in Java (natively)
  * Compiled and packed into a Java archive file.
* The Hadoop executable takes a jar archive and submits it to the engine
  * You should think about this as a job submission engine, similar to HPC schedulers.
* Hadoop takes **HDFS** file paths as input and writes **HDFS** files as output
  * These are actually local file system paths.....more on this later.
  
### A Java Map/Reduce Program

This example is taken from the [Map/Reduce tutorial](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html).  Everyone should do this tutorial.

[WordCount.java](./hadoop/wordcount/WordCount.java)

A Map/Reduce program contains classes that define the `map()` and `reduce()` functions by extending the base classes in:
* Java package `org.apache.hadoop.mapreduce`
  * `Mapper` (base class for the mapper function)
  * `Reducer` (base class for the reducer function)
* Paradigm
  * Extend the base classes and override `map()` and `reduce()`
  * Called by the Hadoop! runtime
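
The paradigm above (you supply a map function and a reduce function, and the runtime calls them) can be sketched in plain Java without Hadoop. This is only an illustration: `MiniMapper`, `MiniReducer`, and `MiniRuntime` are hypothetical names invented here, not part of the Hadoop API, and the real runtime spreads these calls across many processes and nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical stand-ins for Hadoop's Mapper and Reducer; NOT the Hadoop API.
interface MiniMapper {
  void map(String line, BiConsumer<String, Integer> emit);
}

interface MiniReducer {
  int reduce(String key, Iterable<Integer> values);
}

public class MiniRuntime {
  // The "runtime": calls map() once per input line, groups the emitted
  // values by key (the shuffle), then calls reduce() once per key.
  public static Map<String, Integer> run(List<String> lines,
                                         MiniMapper mapper,
                                         MiniReducer reducer) {
    Map<String, List<Integer>> shuffled = new TreeMap<>();
    for (String line : lines) {
      mapper.map(line,
          (k, v) -> shuffled.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
    }
    Map<String, Integer> output = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
      output.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
    }
    return output;
  }

  public static void main(String[] args) {
    // User code: emit (word, 1) for every token in the line.
    MiniMapper tokenizer = (line, emit) -> {
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        emit.accept(itr.nextToken(), 1);
      }
    };
    // User code: sum all values that share a key.
    MiniReducer intSum = (key, values) -> {
      int sum = 0;
      for (int v : values) {
        sum += v;
      }
      return sum;
    };
    System.out.println(run(List.of("a b a", "b a"), tokenizer, intSum));
  }
}
```

Running `main` prints `{a=3, b=2}`. The user code never loops over the input or groups keys itself; that control flow belongs to the runtime.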
  
  
#### Mapper
```java
 public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
```

* Defines a schema for input and output key/value pairs
  * `<Object, Text, Text, IntWritable>`
  * takes any input key, expects `Text` as the value
  * outputs a `Text` key and an `IntWritable` value
* Map/Reduce has specific types that it uses as input/output
  * `Text` is analogous to Java's `String`
  * `IntWritable` is analogous to the built-in `int`
  * These types are classes that Hadoop! knows how to serialize, marshal, etc.
* Context is a handle to the Hadoop! runtime
  * `context.write(word, one);` outputs from mapper to shuffle.
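
To make "classes that Hadoop! knows how to serialize" concrete, here is a minimal plain-Java sketch. Hadoop's `Writable` interface declares `write(DataOutput)` and `readFields(DataInput)`; the class `IntBox` below is a hypothetical stand-in that mimics that shape and is not the real `IntWritable`. It also shows why the mapper can reuse a single `word` object: the state is mutable and gets overwritten on each read or `set`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical class for illustration; it mimics the write/readFields shape
// of Hadoop's Writable types but is NOT the real IntWritable.
public class IntBox {
  private int value;

  public IntBox() {}

  public IntBox(int value) {
    this.value = value;
  }

  public int get() {
    return value;
  }

  public void set(int value) {
    this.value = value;
  }

  // Serialize this object's state to a binary stream.
  public void write(DataOutput out) throws IOException {
    out.writeInt(value);
  }

  // Overwrite this object's state from a binary stream (the object is reused).
  public void readFields(DataInput in) throws IOException {
    value = in.readInt();
  }

  // Write a value to bytes and read it back into a fresh object.
  public static int roundTrip(int n) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      new IntBox(n).write(new DataOutputStream(bytes));
      IntBox copy = new IntBox();
      copy.readFields(
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy.get();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(42)); // prints 42
  }
}
```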

  
#### Reducer

```java
public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
```

* Reducer schema must be of the form `<A,B,A,B>`
  * DANGER: reduce is not a transformation, so you cannot change the key type
  * Doing so will break the system (silently?? It used to.)
  * Seems like a poor design
  * When we compare with Google MR pseudocode there's no reduce key type.


#### The Rest. Job Setup

```java
   public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
```

* Configure a job: a class with a `public static void main(..)` entry point to be run by Hadoop!
* Assign output types (seems redundant)
* Assign input and output directories
* Configure mapper, reducer, _combiner???_
  * We'll talk about combiners later
* Create a client to manipulate the running job
* LAUNCH! (on whatever Hadoop! cluster is configured)
* Wait for completion
  

### Runtime Systems and Execution Modes

Hadoop is a set of services (master, scheduler, workers, HDFS) that can be run in different configurations:
* Cluster setup, i.e. __fully distributed__ = each service is run as a scalable service on multiple nodes.
    * This is the deployment scenario for big data
* Single-node setup, i.e. __pseudo distributed__ = configures all cluster services as Java processes in a single computer
    * runs exactly the same as cluster, but with one node.
    * node must be installed and configured to run this way
* Local (__standalone__) = runs all services in a single Java process
  * good for development and debugging only (not scalable)
  * this is what we did
  * requires no configuration, just a Java install.
  
For our homework we will run the __Amazon EMR__ service = Elastic MapReduce.
* Submit a jar file to launcher and configure data in S3
* EMR builds a cluster, runs your job, and puts output in S3
* Cluster computing at arbitrary scalability on demand.
    