# `hadoop`

*soon we will need the following file, and it takes a considerable amount of time to download. start downloading it now*

https://resources.oreilly.com/examples/0636920035275/raw/master/hfpd3.vmdk.gz

## the problem(s)

reach all the way back in your memory to two lectures ago, when we talked about `aws` `dynamodb`. the proposed use case for `dynamodb` was an ambiguous "webby" one:

*we're reading and writing way too much data way too fast for our one machine*

this sentiment is reflective of a modern data reality that often goes by the buzziest of buzzwords:

**big data**

*mandatory caveat: big data $\neq$ data science*

traditional data analyses were optimized for powerful single machines, such as monolithic `sql` servers

*note: this is not to say that cluster computing didn't exist; indeed it was one of the main computational frameworks in the early days of computers*

our exponential growth in disk space and memory space per dollar spent fueled a lot of this work and innovation.

basically, for a long while our ability to *compute* data grew faster than our ability to *create* or *acquire* data. in the modern world, though, that notion is absolute history.

take, for example, a relatively trivial process for modern computation: word counts for a document of several MBs:

In [27]:
%%bash
rm -f /tmp/shakespeare.txt
wget --quiet -O /tmp/shakespeare.txt.zip https://github.com/bbengfort/hadoop-fundamentals/raw/master/data/shakespeare.txt.zip
unzip /tmp/shakespeare.txt.zip -d /tmp
ls -alh /tmp/shak*

Archive:  /tmp/shakespeare.txt.zip
  inflating: /tmp/shakespeare.txt    
-rw-r--r-- 1 zlamberty zlamberty 8.5M May 22  2009 /tmp/shakespeare.txt
-rw-r--r-- 1 zlamberty zlamberty 2.8M Nov 30 14:31 /tmp/shakespeare.txt.zip


In [29]:
%%bash
head /tmp/shakespeare.txt

hamlet@0		HAMLET
hamlet@8	
hamlet@9	
hamlet@10		DRAMATIS PERSONAE
hamlet@29	
hamlet@30	
hamlet@31	CLAUDIUS	king of Denmark. (KING CLAUDIUS:)
hamlet@74	
hamlet@75	HAMLET	son to the late, and nephew to the present king.
hamlet@131	


In [30]:
with open('/tmp/shakespeare.txt', 'r') as f:
    s = f.read()

print(s[:100])

hamlet@0		HAMLET
hamlet@8	
hamlet@9	
hamlet@10		DRAMATIS PERSONAE
hamlet@29	
hamlet@30	
hamlet@31	CL


In [31]:
import collections

wordct = collections.Counter(
    word.lower()
    for line in s.split('\n')
    for word in line.strip().split('\t')
    if word
)

wordct.most_common(10)

[('[exit]', 571),
 ('[exeunt]', 566),
 ('gloucester', 480),
 ('|', 476),
 ('falstaff', 472),
 ('hamlet', 382),
 ('othello', 292),
 ('brutus', 285),
 ('iago', 273),
 ('clown', 262)]

that was easy, but it relied on some important features:

1. I had enough disk space to have that 8.5MB file stored locally
2. I had enough memory to load that 8.5MB file's contents directly into memory

obviously, that isn't always the case. It's not even hard to think of counter-examples:

1. a larger text corpus (e.g. all of wikipedia, 10TB as of 2015, or publicly available SEC EDGAR filings)
2. any reasonably large image recognition project
3. the logs of web traffic for any modestly sized website or service
4. IoT information (usage records of your smartphone or headphones, e.g.)
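
the memory requirement, at least, can be relaxed without leaving the single machine. here is a minimal sketch, assuming the same tab-delimited layout as `/tmp/shakespeare.txt`, that streams the file one line at a time instead of calling `f.read()`. the function name `count_fields` is made up for illustration.

```python
import collections


def count_fields(path):
    """Count lower-cased, tab-separated fields without loading the whole file."""
    counts = collections.Counter()
    with open(path, 'r') as f:
        # iterating over the file handle yields one line at a time, so
        # memory use is bounded by the longest line plus the counter itself
        for line in f:
            counts.update(
                field.lower()
                for field in line.strip().split('\t')
                if field
            )
    return counts
```

this only helps with memory. the disk space requirement is untouched, and the counter still grows with the number of distinct keys.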

so, back to `dynamodb`. when we (theoretically) started to run into resource issues for our single-machine architecture, we decided to change the way we were doing things and start "scaling horizontally" -- choose an architecture and software that can spread the storage and computation burden across multiple machines

for `dynamodb` we were attempting to distribute out our database writes and reads as actions, but the test scenario I laid out above was one of

+ data storage
+ resource availability

`hadoop` is the *de facto* operating system for distributed computing

it is a software solution that abstracts out all the "hard stuff" (the complicated networking and resource marshaling) that needs to happen to get multiple computers on the same page, and instead provides the user (you) with a single api for

+ accessing distributed files (`hdfs`)
+ securing computational resources and memory (`yarn`)

## `hadoop` nuts and bolts

let's dig into the details of distributed computing a bit

### terminology

+ **node**: a single machine (real or virtual)
+ **cluster**: a collection of *nodes* which can communicate with each other
+ **master**: a *node* which can request information from or delegate tasks to other *nodes*
+ **worker**: a *node* which merely receives, processes, and responds to requests

### basic concepts

in the database world we had certain requirements a `dbms` needed to meet to ensure that all clients of that database service could share those resources (the `ACID` principles)

similarly, for distributed computing to be well defined and robust, we have four requirements:

1. *fault tolerance*: if one computer goes down, we're good. if it comes back, we're even gooder
2. *recoverability*: we don't lose data when things fail
3. *consistency*: results shouldn't depend on jobs failing or succeeding
4. *scalability*: more data means longer time, not failure; we can increase resources if desired

`hadoop` addresses these requirements by making the following decisions:

+ data is distributed across many nodes in the cluster; each node prefers its local data
+ all data is chunked into blocks (say, 128 MB) and is *replicated* (copied to other nodes)
+ jobs (computations) are broken into tasks applied to single blocks
+ jobs are completely unaware that they are distributed
+ worker nodes don't care about each other
+ tasks are redundant and repeatable
+ master nodes handle allocation of all resources (storage, cpus, memory)
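
the chunking and replication decisions can be sketched as a toy, under some loud assumptions: the function names are made up, and the round-robin placement is a simplification chosen for readability (the real `hdfs` placement policy is rack-aware and more involved). the replication factor of 3 is used here as a plausible default.

```python
import itertools


def split_into_blocks(file_size, block_size=128 * 1024 ** 2):
    """Return the sizes (in bytes) of the blocks a file would be cut into."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(n_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes, round-robin.

    NOTE: round-robin is an assumption made for this toy; it is not
    the placement policy hdfs actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    ring = itertools.cycle(nodes)
    return {
        block_id: [next(ring) for _ in range(replication)]
        for block_id in range(n_blocks)
    }


blocks = split_into_blocks(300 * 1024 ** 2)   # a 300 MB file -> 3 blocks
placement = place_replicas(len(blocks), ['node1', 'node2', 'node3', 'node4'])
for block_id, replicas in placement.items():
    print(block_id, replicas)
```

with this layout any single node can disappear and every block still has at least two live copies, which is the point of the fault tolerance and recoverability requirements.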

### `hadoop` architecture

`hadoop` as an operating system is basically just two pieces of software:

1. `hdfs` (a program for handling distributed file storage)
2. `yarn` (a program for handling distributed resource allocation)

together, these two processes conspire to enforce some of those design decisions above: namely, to make sure that all data is robustly distributed and that all distributed tasks are working on local data

`hdfs` and `yarn` are the defaults and they were built to work in tandem, but either can be replaced:

+ you could change storage methods (e.g. `hdfs` replaced by `s3`)
+ you could change resource managers or computational layers on top of storage (e.g. `yarn` replaced by `mesos`)

#### a `hadoop` cluster

`hadoop` is software. the hardware is a cluster of computers. the benefit you get from using `hdfs` and `yarn` is a set of abstracted `api`s that hide cluster administration details and tasks from you.

to put it another way: `hadoop` lets someone else do the hard task of distribution so you can do what you came here to do (analysis)

when we've talked before about databases or `aws` `REST` `api`s, we've often called them *services* and the programs we wrote to utilize those services *clients*

both `hdfs` and `yarn` have several *services* that our tools (*clients*) use

`hdfs` services:

+ `NameNode` (master): stores the directory tree, file metadata, and the locations of each file's blocks on the cluster. this is the access point for `hdfs` usage
+ `Secondary NameNode` (master): performs housekeeping, checkpointing. not a backup `NameNode`
+ `DataNode` (worker): local `io` for `hdfs` blocks

the basic interaction with `hdfs`:

1. client asks `NameNode` where data lives.
2. `NameNode` tells client
3. client is responsible for going and getting data from `DataNode`
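
the three steps above can be mimicked with a toy model. every class and method name here is hypothetical (this is not the `hdfs` client api); the only thing the sketch tries to show is that file contents never pass through the `NameNode`.

```python
class DataNode:
    """Worker: stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: knows which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}  # path -> [(block_id, [DataNode, ...]), ...]

    def get_block_locations(self, path):
        return self.block_map[path]


def read_file(namenode, path):
    """Client: ask the NameNode where the data lives, then fetch it ourselves."""
    contents = []
    for block_id, datanodes in namenode.get_block_locations(path):  # steps 1, 2
        contents.append(datanodes[0].read_block(block_id))          # step 3
    return ''.join(contents)


dn1, dn2 = DataNode('dn1'), DataNode('dn2')
dn1.blocks['blk_0'] = 'to be or '
dn2.blocks['blk_0'] = 'to be or '   # replica of the first block
dn2.blocks['blk_1'] = 'not to be'

nn = NameNode()
nn.block_map['/user/student/hamlet.txt'] = [('blk_0', [dn1, dn2]), ('blk_1', [dn2])]

print(read_file(nn, '/user/student/hamlet.txt'))  # to be or not to be
```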

`yarn` services:

+ `ResourceManager` (master): allocates and monitors resources (memory, cores), schedules jobs
+ `ApplicationMaster` (master): coordinates a particular app after `ResourceManager` has scheduled it
+ `NodeManager` (worker): runs tasks and reports on task status

the basic interaction with `yarn` is very similar:

1. client asks `ResourceManager` for resources
2. `ResourceManager` assigns `ApplicationMaster` instance to manage the individual application
3. `ApplicationMaster` submits a job to a single `NodeManager`, tracks all submitted jobs
4. `NodeManager` executes incoming assigned tasks

to give you a sense of scale for a typical `hadoop` cluster

+ 20 - 30 workers and one master can handle 10s of terabytes of data in simultaneous workflows
+ a dedicated server for each master service (resource absolutism) is needed once you have hundreds of nodes
+ multiple masters are needed when you start talking about thousands of nodes

#### details on `hdfs`

`hdfs` is a file system on top of another filesystem. in many respects, it behaves like you're used to the `linux` filesystem behaving (with slightly different commands). there are a few nuances worth discussing, however.

##### blocks

files are blocked into large (e.g. 128MB) chunks. this means that a file larger than that will be separated up into different blocks. it's worth noting: this is effectively the *only* sense in which the block size matters

a small file will not wastefully take up the remainder of its block on the underlying filesystem.

that's not to say there isn't a problem with small files, though -- there is. It's not wasteful disk usage, it's wasteful *resource* usage. we will discuss `mappers` and `reducers` later, but for now it suffices to say: when we distribute tasks, we distribute them to blocks (as we said above).

if one of those blocks contains a small amount of information, that will be pretty wasteful.

better a million files of 100 MB than a billion files of 0.1 MB
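
the arithmetic behind that slogan, as a minimal sketch. it assumes the 128 MB block size used above and one task per block; the helper name `blocks_needed` is made up.

```python
import math

MB = 1024 ** 2
BLOCK_SIZE = 128 * MB


def blocks_needed(n_files, file_size, block_size=BLOCK_SIZE):
    """Total number of blocks: every file occupies at least one block."""
    return n_files * max(1, math.ceil(file_size / block_size))


big_files = blocks_needed(1_000_000, 100 * MB)
small_files = blocks_needed(1_000_000_000, 0.1 * MB)

print(big_files)                 # 1000000
print(small_files)               # 1000000000
print(small_files // big_files)  # 1000
```

both layouts hold the same total volume of data, but the small-file layout needs a thousand times as many blocks, and therefore a thousand times as many tasks to schedule and block records to track.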

## a demo `hadoop` environment

a single `hadoop` (or, related, `aws` `emr`) environment is often a large, complicated, expensive, and unruly engineering project.

to avoid the hassle of constantly building up complicated development environments, many developers will create a *virtual execution environment* in a *virtual machine*.

we are going to build one such virtual environment right now using oracle's `virtualbox` and the Ubuntu 14.04 `vmdk` provided by the authors of our textbook

<div align="center">**walkthrough: installing `virtualbox` and a `hadoop` virtual machine**</div>

1. if you didn't start the download of https://resources.oreilly.com/examples/0636920035275/raw/master/hfpd3.vmdk.gz at the beginning of class, do so now
2. download `virtualbox` for your os and follow instructions: https://www.virtualbox.org/wiki/Downloads
3. unzip the `vmdk` file once it is downloaded
    1. in a terminal, `gunzip hfpd3.vmdk.gz`
4. create the VM
    1. open `virtualbox` and click the "new" button
    2. name it whatever you want, change the type to `linux`, and make the version "ubuntu 64-bit"
    3. set the memory however you want (I'll go high because yolo)
    4. select "Use an existing virtual hard disk file" and navigate to the `vmdk` file
    5. start up the VM. password is `password`
    6. click "Devices > Insert guest additions cd" and then run it (again, password is `password`)
    7. restart the VM when finished (now you can resize!)
    8. back in the `virtualbox` program, navigate to "Settings", and on the "General > Advanced" tab make "Shared Clipboard" "Bidirectional"

note: [the `cloudera` vm](https://www.cloudera.com/downloads/quickstart_vms/5-12.html) is actually pretty excellent to use and I highly recommend it for your general development and hacking.

I opted for the course-specific `vmdk` so that we would avoid configuration and implementation discrepancies as much as is possible, and also because the `cloudera` download requires you provide a lot of identifying information and I am attempting to respect privacy when possible

<div align="center">**starting `hadoop`**</div>

1. log in to your `hadoop` vm
2. execute the following:

```bash
sudo -H -u hadoop $HADOOP_HOME/sbin/start-dfs.sh
sudo -H -u hadoop $HADOOP_HOME/sbin/start-yarn.sh

# demonstrate it worked
hadoop fs -mkdir -p /user/student
hadoop fs -chown student:student /user/student
hadoop fs -ls -h /
```

### working with a distributed file system

#### basic file system operations

many of the common `linux` command line file system tools are available with the same names in `hadoop`. try

```bash
hadoop fs -help
```

(note the single-dash parameters and curse the `java` gods)

tired of reading those 4000 lines? try any one subcommand too:

```bash
hadoop fs -help ls
hadoop fs -help chmod
```

out on the etherwebs, you may see floating around commands such as

```bash
hdfs dfs -ls /
```

`hdfs dfs` is *related to* `hadoop fs`, but is not exactly the same. `hadoop fs` defaults to looking at `hdfs` files, but is actually file-system agnostic(ish), and supports local files (via the `file://` scheme), `s3` files, `ftp` services, and any others people have been kind enough to implement.

`hdfs dfs` *only* works with `hdfs`.

to demonstrate how `hadoop fs` can be used with local files as well, try out

```bash
hadoop fs -ls file:///tmp/
```

none of this is to say *prefer* `hadoop fs` or *avoid* `hdfs dfs`. just knowing what the difference is may help you avoid some confusion when you try the subcommands or flags of one and don't experience the same result as you would with the other.

let's prepare our `hadoop` cluster to actually do some `hadoop`-y stuff:

```bash
mkdir -p ~/code && cd ~/code
git clone https://github.com/bbengfort/hadoop-fundamentals.git
cd hadoop-fundamentals/data
unzip shakespeare.txt.zip
hadoop fs -copyFromLocal shakespeare.txt shakespeare.txt
```

the files we save in `hadoop` are generally enormous. it's good to know right away how to read portions of such large files:

+ `hadoop fs -cat shakespeare.txt | less`
+ `hadoop fs -cat shakespeare.txt | head` (this aborts the streaming when `head` has had enough)
+ `hadoop fs -tail shakespeare.txt`

#### other `hdfs` interfaces

finally, there are `http` integrations. in particular, check out the web interface for the `DataNode`s found at `datanode_url:50075` (for our `hadoop` vm, try `127.0.0.1:50075`)

### working with distributed computing

as we said above, `yarn` is the main resource manager and one of the main access points for computation. in the original instance of `hadoop`, however, the computational framework was software called `mapreduce`

knowing what `mapreduce` is helps illuminate the engineering paradigm at play in `hadoop` programs

#### `mapreduce`: a functional programming model

`mapreduce` was [first proposed](https://research.google.com/archive/mapreduce.html) by google developers as a way of performing easily distributable computations

the name comes from the two "pieces":

+ a `map` function takes input as a series of key-value pairs ("kvps") and performs the same computation on each pair, generating a (possibly empty) sequence of intermediate kvps
    + this is where analysis happens (usually)
    + e.g. filter: take a key, check if it belongs in a list of acceptable keys, emit the kvp if yes, pass silently if no
+ a `reduce` function takes a key and an iterator of values and processes the values, usually to determine some aggregate statistic

these functions ought to be stateless functional programming functions
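
a minimal sketch of what such a pair of functions can look like in plain `python`, using generators (`yield`) to stand in for "emitting" a kvp. the whitelist and function names are made up for illustration; the point is that the output of each function depends only on its arguments.

```python
ACCEPTABLE_KEYS = {'hamlet', 'othello'}   # hypothetical whitelist


def filter_mapper(key, value):
    """map: emit the pair unchanged if the key is acceptable, else nothing."""
    if key in ACCEPTABLE_KEYS:
        yield key, value


def max_reducer(key, values):
    """reduce: collapse all values for one key into one aggregate (here: max)."""
    yield key, max(values)


pairs = [('hamlet', 3), ('iago', 7), ('othello', 5)]
kept = [kvp for k, v in pairs for kvp in filter_mapper(k, v)]

print(kept)                                          # [('hamlet', 3), ('othello', 5)]
print(list(max_reducer('hamlet', iter([3, 9, 4]))))  # [('hamlet', 9)]
```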

#### `mapreduce`: implemented on a cluster

the `mapreduce` framework is great for a distributed computation environment because it assumes many of the central tenets of the distribution framework. specifically, because mappers and reducers are stateless functions, they can be executed by a worker node to independently work on any number of blocks and emit their responses back to the master node.

mappers are already set: individual blocks are key-value pairs where the keys are file or line metadata and the values are the contents of the file / line. we can distribute the mapper function to any number of workers and let them process blocks at their own pace without any outside information

reducers need all the output values for a single key across all processed blocks, so we have to wait until all mappers are done to "reduce".

conceptually there is one reduce call per output key; the keys themselves are divided among a configurable number of reducer tasks, which are distributed among the workers

because reducers expect to get the keys emitted by mappers and **all** values for those keys, we need to perform a shuffle and sort of those intermediate kvps before we can reduce. this stage is called exactly that: *shuffle and sort*

so, in the end, we have a general framework:

+ input: `hdfs` kvps
+ mapping: input kvps are processed by mappers and generate intermediate kvps
+ shuffle and sort: take the generated keys, partition the key space, and assign keys to reducers
+ reduce: take the keys and the iterated list of values and reduce them to aggregate kvps

it's kvps all the way down!
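
the "partition the key space" step is the least obvious one, so here is a toy version. this is a sketch under assumptions: the function names are made up, and hashing the key modulo the number of reducers is used as a simple stand-in for whatever partitioner a real job is configured with.

```python
import collections
import zlib


def partition(key, n_reducers):
    """Map a key to a reducer index; a given key always lands on the same one."""
    # zlib.crc32 is used because python's built-in hash() of a str
    # changes between interpreter runs
    return zlib.crc32(key.encode('utf-8')) % n_reducers


def shuffle_and_sort(intermediate_kvps, n_reducers):
    """Group mapper output into one sorted {key: [values]} table per reducer."""
    buckets = [collections.defaultdict(list) for _ in range(n_reducers)]
    for key, value in intermediate_kvps:
        buckets[partition(key, n_reducers)][key].append(value)
    return [dict(sorted(bucket.items())) for bucket in buckets]


kvps = [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
for i, bucket in enumerate(shuffle_and_sort(kvps, n_reducers=2)):
    print(i, bucket)
```

whichever bucket a key ends up in, *all* of its values end up there with it, and a bucket may well hold several keys.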

##### mapreduce examples

we already counted words in the shakespeare corpus, in memory in plain `python`, and the pseudo-code which can fit this wordcount problem into `mapreduce` is not that different:

```python
def mapper(dockey, line):
    for word in line.split():
        emit(word, 1)
        
def reducer(word, values):
    emit(word, sum(val for val in values))
```
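
to actually run that pseudo-code on a laptop, one option is to replace `emit` with `yield` and write a tiny single-process driver. this is a sketch for building intuition, not how `hadoop` executes jobs; `run_mapreduce` is a made-up helper.

```python
import itertools
import operator


def mapper(dockey, line):
    for word in line.split():
        yield word, 1


def reducer(word, values):
    yield word, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle and sort, and reduce in a single local process."""
    # map: every input kvp produces zero or more intermediate kvps
    intermediate = [kvp for key, value in records for kvp in mapper(key, value)]
    # shuffle and sort: bring all values for the same key together
    intermediate.sort(key=operator.itemgetter(0))
    grouped = itertools.groupby(intermediate, key=operator.itemgetter(0))
    # reduce: one call per key, with an iterator over that key's values
    output = []
    for key, group in grouped:
        output.extend(reducer(key, (value for _, value in group)))
    return output


lines = [(0, 'to be or not to be'), (1, 'that is the question')]
print(run_mapreduce(lines, mapper, reducer))
# [('be', 2), ('is', 1), ('not', 1), ('or', 1),
#  ('question', 1), ('that', 1), ('the', 1), ('to', 2)]
```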

### submitting a mapreduce job to `yarn`

`yarn` is responsible for scheduling tasks, so if we would like to perform some task we need to give it to `yarn`.

one way (and the most basic) is to create a `jar` file (compiled `java` code) and to pass that directly to `yarn` using the `hadoop jar` command.

in [the github repo](https://github.com/bbengfort/hadoop-fundamentals) for the "Data Analytics with Hadoop" O'Reilly book, we have been provided with a couple `java` files to implement a simple `mapreduce` word count job

+ [`WordCount.java`](https://github.com/bbengfort/hadoop-fundamentals/blob/master/wordcount/WordCount/WordCount.java)
+ [`WordMapper.java`](https://github.com/bbengfort/hadoop-fundamentals/blob/master/wordcount/WordCount/WordMapper.java)
+ [`SumReducer.java`](https://github.com/bbengfort/hadoop-fundamentals/blob/master/wordcount/WordCount/SumReducer.java)

let's compile and run that code on the shakespeare corpus.

first thing's first, let's compile our `java` code into a `jar` file

```bash
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
cd ~/code/hadoop-fundamentals/wordcount/WordCount/
hadoop com.sun.tools.javac.Main *.java
jar cf wc.jar WordCount.class WordMapper.class SumReducer.class
```

second thing's second, let's fix a simple permission problem on our local machine and then another one on our `hdfs`.

```bash
sudo chmod g+w /var/app/hadoop/data
sudo su hadoop
hadoop fs -chmod g+w /
```

final thing's final, we can submit the `jar` file to `yarn` by calling

```bash
hadoop jar wc.jar WordCount shakespeare.txt wordcounts
```

we can track the results of that job via a web interface at 127.0.0.1:8088
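
the same information can be pulled programmatically. a hedged sketch: it assumes the `ResourceManager`'s rest endpoint `/ws/v1/cluster/apps` is reachable on that same port and returns json shaped like `{"apps": {"app": [...]}}`; the function names are made up, and the field names picked out below are the ones I'd expect but are worth checking against your cluster's actual response.

```python
import json
import urllib.request


def fetch_apps(base_url='http://127.0.0.1:8088'):
    """Ask the ResourceManager for its list of applications (parsed json)."""
    request = urllib.request.Request(
        base_url + '/ws/v1/cluster/apps',
        headers={'Accept': 'application/json'},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode('utf-8'))


def summarize_apps(payload):
    """Reduce the response to (id, name, state, finalStatus) tuples."""
    # assumption: payload looks like {"apps": {"app": [...]}}, and "apps"
    # is null when nothing has been submitted yet
    apps = (payload.get('apps') or {}).get('app', [])
    return [
        (app.get('id'), app.get('name'), app.get('state'), app.get('finalStatus'))
        for app in apps
    ]
```

from inside the vm, `summarize_apps(fetch_apps())` would then list one tuple per submitted job.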

## not using `java` with `hadoop` streaming

so we were able to write `java` code to create `mapreduce` jobs. super.

I mean... not knowing `java` is a bit of a problem though. not to be ungrateful.

this is just what we get out of the box with `hadoop` `mapreduce`.

+ java api with input, output, map and reduce functions, job params exposed as *job configuration*
+ jobs get packaged into a jar which is passed to the `ResourceManager` by the *job client*
+ `ResourceManager` handles the rest

but what if you don't want to write `java` code that implements this same workflow over and over and over again?

or just don't want to write `java` code *at all*, because you already did everything you needed to do in `python`?

*hadoop streaming* is here to help.

### `hadoop` streaming

hadoop streaming is a `java` util which can take any executable (in *any* language!) and use that as a mapper or reducer or combiner.

really, this is just a super hacky `jar` file that is submitted in the same way as our `wc.jar` in our example above. the difference is that you pass executable scripts or commands as parameters to this `hadoop`-specific `jar` file

note: the word "streaming" is used because the input and output method is unix streams (`stdin`, `stdout`), not in reference to streaming data.

this is actually pretty cool, because we know how to access those streams:

+ `python`: `sys` module
+ `R`: `file("stdin")` (I think? who even knows. does anyone?)

when we develop a `mapper.py` script, know the following:

+ *each* mapper launches the executable. spin-up time sucks for obvious reasons
+ `hadoop streaming` parses input data into lines of text and pipes them through `stdin`
+ `python` streaming script parses those lines of text and prints (to `stdout`) kvps delimited in some way (default is `\t`)
+ these intermediate kvps are scooped up by `hadoop streaming` again and passed on to the reducer
+ the mapper gets an entire block via `sys.stdin`. so it doesn't receive a *file*, or a *line number*, it receives a file handle to a block. that's important.

the `reducer.py` script follows much of the same logic, but in addition:

+ the reducer doesn't receive a key and an iterable, it reads shuffled and sorted kvp records (like a table) from stdin (they are in the `a\tb` format)
+ a single reducer task will always get *all* records for a given key, but *may* get more than one key (so your reducer isn't handed a single key; it needs logic to detect when the key changes)

for both files (and for any file in any language being used as a `hadoop streaming` script), the shebang (`#!`) declaration at the top of the file is important -- it tells the streaming process (a bash shell) how to execute the script (e.g. in `python`)
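
putting those bullets together, here is a minimal word count sketch for streaming. assumptions, loudly: the file name `wc_streaming.py` is made up, and the mapper and reducer live in one file selected by a `map` / `reduce` argument purely to keep the example self-contained (two separate scripts, `mapper.py` and `reducer.py`, work the same way).

```python
#!/usr/bin/env python
"""wc_streaming.py: word count mapper and reducer for hadoop streaming."""
import itertools
import sys

SEP = '\t'


def run_mapper(stdin, stdout):
    """Read raw lines of text, write one `word<TAB>1` record per word."""
    for line in stdin:
        for word in line.split():
            stdout.write('{}{}1\n'.format(word.lower(), SEP))


def parse_records(stdin):
    """Turn `key<TAB>value` lines into (key, value) tuples."""
    for line in stdin:
        key, _, value = line.rstrip('\n').partition(SEP)
        yield key, value


def run_reducer(stdin, stdout):
    """Sum the counts per key; records for one key arrive consecutively."""
    records = parse_records(stdin)
    # several keys can arrive on the same stream, so group them ourselves
    for key, group in itertools.groupby(records, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        stdout.write('{}{}{}\n'.format(key, SEP, total))


# local dry run, no cluster needed:
#   cat shakespeare.txt | ./wc_streaming.py map | sort | ./wc_streaming.py reduce
if __name__ == '__main__' and sys.argv[1:2] in (['map'], ['reduce']):
    job = run_mapper if sys.argv[1] == 'map' else run_reducer
    job(sys.stdin, sys.stdout)
```

the `sort` in the local dry run plays the role of shuffle and sort: `itertools.groupby` only works because identical keys are adjacent by the time the reducer sees them.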


# END OF LECTURE

next lecture: [`spark`](012_spark.ipynb)