# Introduction to Hadoop

Now let's try out Hadoop. The Docker image we just set up should currently be running.

## Verify that Hadoop is running

In the terminal window that is running Docker, run the following:

```bash
jps
```

This will show you that `DataNode`, `NameNode` and `SecondaryNameNode` are running:

```bash
141 NameNode
574 Jps
267 DataNode
435 SecondaryNameNode
```
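If you would rather check the listing programmatically, the sketch below parses `jps` output and reports which daemons are missing. The function name `missing_daemons` and the constant `EXPECTED_DAEMONS` are hypothetical names chosen for this illustration, and the sample text is copied from the listing above.

```python
# Daemon names taken from the jps listing above.
EXPECTED_DAEMONS = {"NameNode", "DataNode", "SecondaryNameNode"}


def missing_daemons(jps_output):
    """Return the expected daemons that do not appear in `jps` output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line is "<pid> <process name>"; skip anything else.
        if len(parts) == 2 and parts[0].isdigit():
            running.add(parts[1])
    return EXPECTED_DAEMONS - running


sample = """141 NameNode
574 Jps
267 DataNode
435 SecondaryNameNode"""

print(sorted(missing_daemons(sample)))  # prints []
```

An empty result means all three daemons were found.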

## Examine data

Alright, we already have some data in this Docker container that we'll be using. Let's examine it. 

The data is located in the text file `/root/textdata/44604.txt.utf-8` and comes from this Gutenberg ebook: 

 * [How to Become an Engineer by Frank W. Doughty](http://www.gutenberg.org/ebooks/44604.txt.utf-8)
 
Note: We are using the plain text `utf-8` encoding!
 
```bash
head /root/textdata/44604.txt.utf-8
```

This should print something like:

```
The Project Gutenberg eBook, How to Become an Engineer, by Frank W. Doughty


This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
```

## Put data in HDFS

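First, make some directories **in the Hadoop Distributed File System (HDFS)!** The `-p` flag creates any missing parent directories and does not complain if a directory already exists.

```bash
hdfs dfs -mkdir -p /user/root/
hdfs dfs -mkdir -p /user/root/gutenberg
```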

Let’s check that they exist:

```bash
hdfs dfs -ls /
hdfs dfs -ls /user/
```

There may be directories printed other than the ones we created.

Yay!

OK, now put some data into HDFS:

```bash
hdfs dfs -put /root/textdata/* /user/root/gutenberg
```

Make sure the data is in HDFS:

```bash
hdfs dfs -ls /user/root/gutenberg
```

Yay!

## Create the mapper and reducer 

This will be done directly in our Docker container. First, change to root's home directory.

```bash
cd ~
```

You can use vi or nano to create the following two files. Note that these files use Python 2. Python 3 will not work, as TextBlob has only been installed for Python 2 in this Docker container.

Our mapper `count_mapper.py` includes the following code:

```python
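#!/usr/bin/env python
from __future__ import print_function

import sys
from textblob import TextBlob

# Hadoop streaming feeds the input to the mapper line by line on stdin.
for line in sys.stdin:
    # Python 2 reads bytes from stdin; decode so TextBlob gets unicode text.
    line = line.decode('utf-8')
    words = TextBlob(line).words
    for word in words:
        # Encode back to bytes and emit one "word<TAB>1" record per word.
        word = word.encode('utf-8')
        print("{}\t{}".format(word, 1))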
```
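To get a feel for what the mapper emits without installing TextBlob, here is a minimal sketch that produces records in the same `word<TAB>1` format. It is not the tutorial's mapper: `map_line` and `WORD_RE` are hypothetical names, and the regular expression is only a rough stand-in for TextBlob's tokenizer, so the exact word boundaries may differ.

```python
import re

# Stand-in tokenizer (an assumption): runs of letters and apostrophes.
# TextBlob's `.words` handles punctuation and contractions differently.
WORD_RE = re.compile(r"[A-Za-z']+")


def map_line(line):
    """Yield one 'word<TAB>1' record per token, like count_mapper.py."""
    for word in WORD_RE.findall(line):
        yield "{}\t{}".format(word, 1)


for record in map_line("This eBook is for the use of anyone"):
    print(record)
```

Note that the mapper does no counting of its own: a word that appears twice simply produces two records, and adding them up is left to the reducer.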

And our reducer `count_reducer.py` looks like this:

```python
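#!/usr/bin/env python
from __future__ import print_function

import sys

current_word = None
current_count = 0

# Hadoop sorts the mapper output by key, so equal words arrive together.
for line in sys.stdin:
    line = line.rstrip('\n')
    if not line:
        continue
    word, count = line.split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        # A new word starts: emit the total for the previous one.
        if current_word is not None:
            print('{}\t{}'.format(current_word, current_count))
        current_word = word
        current_count = count

# Emit the last word, if there was any input at all.
if current_word is not None:
    print('{}\t{}'.format(current_word, current_count))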
```
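The reducer only compares each word with the previous one, so it relies on Hadoop sorting the mapper output by key before the reduce phase. The sketch below illustrates that dependency with `itertools.groupby`, which also groups only adjacent equal keys. The name `reduce_sorted` and the `mapped` sample records are made up for this illustration, and `sorted()` stands in for Hadoop's shuffle step.

```python
from itertools import groupby


def reduce_sorted(records):
    """Sum the counts of adjacent records that share the same word."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


# Hypothetical mapper output, in the order the mapper would emit it.
mapped = ["to\t1", "be\t1", "or\t1", "not\t1", "to\t1", "be\t1"]

# Sorting stands in for the shuffle step that Hadoop runs between the phases.
for word, total in reduce_sorted(sorted(mapped)):
    print("{}\t{}".format(word, total))
```

Passing `mapped` without sorting would yield six separate records, each with a count of 1, because the repeated words are not adjacent.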

## Let's run it!

Before running the job, make sure the mapper and reducer files are executable:

```bash
chmod +x /root/count_mapper.py
chmod +x /root/count_reducer.py
```

Now, run the MapReduce job:

```bash
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-file /root/count_mapper.py \
-mapper /root/count_mapper.py \
-file /root/count_reducer.py \
-reducer /root/count_reducer.py \
-input /user/root/gutenberg/* \
-output /user/root/book-output
```

The output should look something like the following:

```bash
16/12/18 19:45:12 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/root/count_mapper.py, /root/count_reducer.py] [] /tmp/streamjob7522181193963040200.jar tmpDir=null
16/12/18 19:45:13 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/12/18 19:45:13 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/12/18 19:45:13 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
16/12/18 19:45:14 INFO mapred.FileInputFormat: Total input paths to process : 1
16/12/18 19:45:14 INFO mapreduce.JobSubmitter: number of splits:1
16/12/18 19:45:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local236049766_0001
...  
...
16/12/18 19:46:10 INFO streaming.StreamJob: Output directory: /user/root/book-output
```

Booom! :boom: It's running.

## Looking at the output

Once it's done,

```bash
hdfs dfs -ls /user/root/book-output
```

should show that there is a `_SUCCESS` file (showing we did it!) and
another file called `part-00000`.

This `part-00000` is our output. To look inside:

```bash
hdfs dfs -cat /user/root/book-output/part-00000
```

or just

```bash
hdfs dfs -cat /user/root/book-output/*
```

will show the output of our job!

If you want to see the most common words, run:

```bash
hdfs dfs -cat /user/root/book-output/* | sort -rnk2 | head
```
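The same ranking can be sketched in Python, which may be handy if you want to process the counts further. This is only an illustration: `top_words` is a hypothetical helper, and the `sample` lines are invented data in the `word<TAB>count` format of `part-00000`, not real results from the book.

```python
import heapq


def top_words(lines, n=10):
    """Return the n (word, count) pairs with the largest counts."""
    counts = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.split("\t", 1)
        counts.append((word, int(count)))
    return heapq.nlargest(n, counts, key=lambda pair: pair[1])


# Hypothetical lines in the same 'word<TAB>count' format as part-00000.
sample = ["engineer\t42", "the\t310", "steam\t17", "of\t188"]
print(top_words(sample, n=2))
```

To use it on the real output, you could feed it the lines produced by `hdfs dfs -cat /user/root/book-output/*`.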

## NOTE

If something went wrong with the MapReduce job, or you fix something and want to run it again, the second run will fail with a different error. This error will say that the `book-output` directory already exists in HDFS.

This error is thrown to avoid overwriting previous results. If you want to rerun the job anyway, delete the output directory first so that it can be created again:

```bash
hdfs dfs -rm -r /user/root/book-output
```
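If you find yourself rerunning the job often, the two steps can be kept together. The sketch below only builds the two commands used in this tutorial as argument lists and prints them; it does not execute anything. `rerun_commands`, `OUTPUT_DIR` and `STREAMING_JAR` are hypothetical names for this illustration, and the paths are the ones used earlier on this page.

```python
OUTPUT_DIR = "/user/root/book-output"
STREAMING_JAR = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar"


def rerun_commands(output_dir=OUTPUT_DIR):
    """Return the commands, in order, for a clean rerun of the job."""
    # Step 1: remove the previous output so the job can create it again.
    remove_old_output = ["hdfs", "dfs", "-rm", "-r", output_dir]
    # Step 2: the same streaming job as before, writing to output_dir.
    run_job = [
        "hadoop", "jar", STREAMING_JAR,
        "-file", "/root/count_mapper.py",
        "-mapper", "/root/count_mapper.py",
        "-file", "/root/count_reducer.py",
        "-reducer", "/root/count_reducer.py",
        "-input", "/user/root/gutenberg/*",
        "-output", output_dir,
    ]
    return [remove_old_output, run_job]


for command in rerun_commands():
    print(" ".join(command))
```

Inside the container, each list could be passed to `subprocess.check_call` to actually run it.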