# Running Jobs on Hadoop

The aim of this exercise is to get familiar with Hadoop. We will show how to run Hadoop applications and create our own Python-based streaming application for parsing log file data.

## Configure Hadoop Environment and Test Applications

For the following exercise we use two example applications that ship with the standard Hadoop distribution. The cluster runs Hortonworks HDP 2.3.2 on Amazon Web Services (EC2). First, we set two variables that point to the `jar` files containing the example applications and the streaming application.

In [1]:
HADOOP_EXAMPLES="/usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar"
HADOOP_STREAMING="/usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming.jar"

## 1. Hadoop Services (HDFS and YARN)

In [2]:
!hadoop dfsadmin -report

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Configured Capacity: 425150103552 (395.95 GB)
Present Capacity: 396067639296 (368.87 GB)
DFS Remaining: 394627694592 (367.53 GB)
DFS Used: 1439944704 (1.34 GB)
DFS Used%: 0.36%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
report: Access denied for user radical. Superuser privilege is required
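
The summary part of the report can be cross-checked with a few lines of Python. The following is a minimal sketch, not part of the original notebook: the helper name `parse_dfs_report` is hypothetical, and the report text is pasted in by hand from the output above. For these figures, the reported `DFS Used%` matches `DFS Used` divided by `Present Capacity`.

```python
import re


def parse_dfs_report(text):
    """Collect the byte counts from 'Name: <bytes> (<human readable>)' lines."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^([^:]+):\s+(\d+)\s+\(", line)
        if match:
            values[match.group(1).strip()] = int(match.group(2))
    return values


# Summary lines copied from the report shown above.
report = """\
Configured Capacity: 425150103552 (395.95 GB)
Present Capacity: 396067639296 (368.87 GB)
DFS Remaining: 394627694592 (367.53 GB)
DFS Used: 1439944704 (1.34 GB)
"""

stats = parse_dfs_report(report)
used_pct = 100.0 * stats["DFS Used"] / stats["Present Capacity"]
print("DFS used: %.2f%%" % used_pct)  # prints "DFS used: 0.36%"
```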


In [3]:
!yarn node -list -all

15/11/08 15:13:34 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-63-179-69.ec2.internal:8188/ws/v1/timeline/
15/11/08 15:13:34 INFO client.RMProxy: Connecting to ResourceManager at ip-10-63-179-69.ec2.internal/10.63.179.69:8050
Total Nodes:4
         Node-Id	     Node-State	Node-Http-Address	Number-of-Running-Containers
ip-10-63-179-69.ec2.internal:45454	        RUNNING	ip-10-63-179-69.ec2.internal:8042	                           0
ip-10-145-0-4.ec2.internal:45454	        RUNNING	ip-10-145-0-4.ec2.internal:8042	                           0
ip-10-218-164-206.ec2.internal:45454	        RUNNING	ip-10-218-164-206.ec2.internal:8042	                           0
ip-10-179-174-236.ec2.internal:45454	        RUNNING	ip-10-179-174-236.ec2.internal:8042	                           0
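
To check the node states programmatically, the listing can be parsed as plain text. This is a rough sketch under the assumption that every node row has four whitespace-separated fields, as in the output above; `parse_node_list` is a hypothetical helper, and the sample text is a shortened copy of that output.

```python
def parse_node_list(output):
    """Return (node_id, state, running_containers) for each node row."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        # Node rows have four fields, a host:port node id and a numeric
        # container count; this skips the log lines and the header row.
        if len(fields) == 4 and ":" in fields[0] and fields[3].isdigit():
            nodes.append((fields[0], fields[1], int(fields[3])))
    return nodes


sample = "\n".join([
    "15/11/08 15:13:34 INFO client.RMProxy: Connecting to ResourceManager at ip-10-63-179-69.ec2.internal/10.63.179.69:8050",
    "Total Nodes:4",
    "Node-Id Node-State Node-Http-Address Number-of-Running-Containers",
    "ip-10-63-179-69.ec2.internal:45454 RUNNING ip-10-63-179-69.ec2.internal:8042 0",
    "ip-10-145-0-4.ec2.internal:45454 RUNNING ip-10-145-0-4.ec2.internal:8042 0",
])

nodes = parse_node_list(sample)
running = [node_id for node_id, state, _ in nodes if state == "RUNNING"]
print("%d of %d listed nodes are RUNNING" % (len(running), len(nodes)))
```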


## 2. Terasort

In [4]:
!yarn jar $HADOOP_EXAMPLES teragen 100000 teragen

15/11/07 22:28:54 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-63-179-69.ec2.internal:8188/ws/v1/timeline/
15/11/07 22:28:54 INFO client.RMProxy: Connecting to ResourceManager at ip-10-63-179-69.ec2.internal/10.63.179.69:8050
15/11/07 22:28:55 INFO terasort.TeraSort: Generating 100000 using 2
15/11/07 22:28:55 INFO mapreduce.JobSubmitter: number of splits:2
15/11/07 22:28:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1446932642552_0007
15/11/07 22:28:55 INFO impl.YarnClientImpl: Submitted application application_1446932642552_0007
15/11/07 22:28:55 INFO mapreduce.Job: The url to track the job: http://ip-10-63-179-69.ec2.internal:8088/proxy/application_1446932642552_0007/
15/11/07 22:28:55 INFO mapreduce.Job: Running job: job_1446932642552_0007
15/11/07 22:29:01 INFO mapreduce.Job: Job job_1446932642552_0007 running in uber mode : false
15/11/07 22:29:01 INFO mapreduce.Job:  map 0% reduce 0%
15/11/07 22:29:06 INFO mapreduce.Job:  map 100% reduce 

In [8]:
!yarn jar $HADOOP_EXAMPLES terasort teragen teraout

15/10/30 23:05:14 INFO terasort.TeraSort: starting
15/10/30 23:05:15 INFO input.FileInputFormat: Total input paths to process : 2
Spent 97ms computing base-splits.
Spent 2ms computing TeraScheduler splits.
Computing input splits took 100ms
Sampling 2 splits of 2
Making 1 from 100000 sampled records
Computing parititions took 278ms
Spent 380ms computing partitions.
15/10/30 23:05:15 INFO impl.TimelineClientImpl: Timeline service address: http://radical-9:8188/ws/v1/timeline/
15/10/30 23:05:15 INFO client.RMProxy: Connecting to ResourceManager at radical-9/10.20.108.87:8050
15/10/30 23:05:16 INFO mapreduce.JobSubmitter: number of splits:2
15/10/30 23:05:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1446245362115_0002
15/10/30 23:05:16 INFO impl.YarnClientImpl: Submitted application application_1446245362115_0002
15/10/30 23:05:16 INFO mapreduce.Job: The url to track the job: http://radical-9.radical-cybertools.org:8088/proxy/application_1446245362115_0002/
15/10/30 23:05

## 3. Word Count

Count the words contained in the log file located at `/data/nasa/NASA_access_log_Jul95`.

In [17]:
!hdfs dfs -rm -r wordcount-out

15/10/30 23:51:14 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://radical-10:8020/user/radical/wordcount-out' to trash at: hdfs://radical-10:8020/user/radical/.Trash/Current


In [22]:
!hdfs dfs -text /data/nasa/NASA_access_log_Jul95 | head

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0
205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985
d104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] "GET / H

In [18]:
!yarn jar $HADOOP_EXAMPLES wordcount /data/nasa/ wordcount-out/

15/10/30 23:51:20 INFO impl.TimelineClientImpl: Timeline service address: http://radical-9:8188/ws/v1/timeline/
15/10/30 23:51:20 INFO client.RMProxy: Connecting to ResourceManager at radical-9/10.20.108.87:8050
15/10/30 23:51:20 INFO input.FileInputFormat: Total input paths to process : 1
15/10/30 23:51:20 INFO mapreduce.JobSubmitter: number of splits:2
15/10/30 23:51:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1446245362115_0004
15/10/30 23:51:21 INFO impl.YarnClientImpl: Submitted application application_1446245362115_0004
15/10/30 23:51:21 INFO mapreduce.Job: The url to track the job: http://radical-9.radical-cybertools.org:8088/proxy/application_1446245362115_0004/
15/10/30 23:51:21 INFO mapreduce.Job: Running job: job_1446245362115_0004
15/10/30 23:51:26 INFO mapreduce.Job: Job job_1446245362115_0004 running in uber mode : false
15/10/30 23:51:26 INFO mapreduce.Job:  map 0% reduce 0%
15/10/30 23:51:38 INFO mapreduce.Job:  map 67% reduce 0%
15/10/30 23:51:41 INF
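
What the job computes can be imitated locally on a handful of lines. The sketch below assumes that words are simply whitespace-separated tokens; the function names `map_words` and `reduce_counts` are illustrative and are not taken from the Hadoop example code.

```python
from collections import Counter


def map_words(line):
    """Map step: emit (word, 1) for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


# First two lines of the log file, as shown above.
lines = [
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
]

pairs = [pair for line in lines for pair in map_words(line)]
counts = reduce_counts(pairs)
print(counts.most_common(3))
```

Because tokens are split on whitespace only, punctuation stays attached to the words, so `"GET` and `HTTP/1.0"` are counted as words in their own right.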

## 4. Log Parsing

Use the commands `head`, `cat`, `uniq`, `wc`, `sort`, `find`, `xargs`, and `awk` to evaluate the NASA log file:

- Which page was requested most often?
- What was the most frequent return code?
- How many errors occurred? What is the percentage of errors?

Implement a Python version of this Unix shell analysis, using the `mapreduce_streaming.py` script shown below as a template.

Run the Python script inside a Hadoop Streaming job.
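
One possible shape for the map and reduce steps is sketched here. It is an illustration only and does not reproduce the template's `map()`, `reduce()`, and `emit()` functions: `LOG_PATTERN`, `map_line`, `mapper`, and `reducer` are hypothetical names, and the regular expression assumes that every line follows the Common Log Format seen in the log file. The mapper emits one tab-separated `status<TAB>1` pair per line; the reducer relies on Hadoop Streaming delivering its input sorted by key.

```python
import re
import sys
from itertools import groupby

# Common Log Format: host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\S+)\s*$')


def map_line(line):
    """Return (status_code, 1) for a well-formed log line, else None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.group(4), 1


def mapper(stream_in, stream_out):
    """Write one 'status<TAB>1' line per parsable input line."""
    for line in stream_in:
        pair = map_line(line)
        if pair is not None:
            stream_out.write("%s\t%d\n" % pair)


def reducer(stream_in, stream_out):
    """Sum the values of consecutive lines that share a key (input must be sorted)."""
    rows = (line.rstrip("\n").split("\t", 1) for line in stream_in)
    for key, group in groupby(rows, key=lambda row: row[0]):
        total = sum(int(value) for _, value in group)
        stream_out.write("%s\t%d\n" % (key, total))


if __name__ == "__main__":
    # Dispatch on the first argument, mirroring "script.py map" / "script.py reduce".
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper(sys.stdin, sys.stdout)
    elif len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer(sys.stdin, sys.stdout)
```

Emitting the requested path (the second token of `match.group(3)`) instead of the status code would count pages rather than return codes. Locally, the pipeline can be tried by piping a sample of the log through the map step, `sort`, and the reduce step.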

In [28]:
!cat mapreduce_streaming.py 

#!/usr/bin/python
#
# Licensed to Cloudera, Inc. under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  Cloudera, Inc. licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#
# Template for python Hadoop streaming.  Fill in the map() and reduce()
# functions, which should call emit(), as appropriate.
#
# Test your script with
#  cat input | python map_reduc

In [32]:
!yarn jar $HADOOP_STREAMING -input /data/nasa -output logs-parsed \
                            -file mapreduce_streaming.py \
                            -mapper "python mapreduce_streaming.py map" \
                            -reducer "python mapreduce_streaming.py reduce" 

15/10/31 00:00:57 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapreduce_streaming.py] [/usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob8755056418131512842.jar tmpDir=null
15/10/31 00:00:58 INFO impl.TimelineClientImpl: Timeline service address: http://radical-9:8188/ws/v1/timeline/
15/10/31 00:00:58 INFO client.RMProxy: Connecting to ResourceManager at radical-9/10.20.108.87:8050
15/10/31 00:00:58 INFO impl.TimelineClientImpl: Timeline service address: http://radical-9:8188/ws/v1/timeline/
15/10/31 00:00:58 INFO client.RMProxy: Connecting to ResourceManager at radical-9/10.20.108.87:8050
15/10/31 00:00:59 INFO mapred.FileInputFormat: Total input paths to process : 1
15/10/31 00:00:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.20.108.86:50010
15/10/31 00:00:59 INFO net.NetworkTopology: Adding a new node: /default-

In [33]:
!hdfs dfs -text logs-parsed/*

200	1701534
302	46573
304	132627
400	5
403	54
404	10845
500	62
501	14
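
With the per-code counts available, the remaining questions reduce to simple arithmetic. The sketch below copies the job output shown above into a string and treats every 4xx and 5xx status code as an error, which is an assumption about what should count as an error.

```python
# Job output copied from above, one "code<TAB>count" pair per line.
output = "\n".join([
    "200\t1701534",
    "302\t46573",
    "304\t132627",
    "400\t5",
    "403\t54",
    "404\t10845",
    "500\t62",
    "501\t14",
])

counts = {}
for line in output.splitlines():
    code, count = line.split("\t")
    counts[int(code)] = int(count)

total = sum(counts.values())
# Assumption: client (4xx) and server (5xx) errors both count as errors.
errors = sum(n for code, n in counts.items() if code >= 400)
most_frequent = max(counts, key=counts.get)

print("most frequent return code: %d" % most_frequent)
print("errors: %d of %d requests (%.2f%%)" % (errors, total, 100.0 * errors / total))
# prints:
# most frequent return code: 200
# errors: 10980 of 1891714 requests (0.58%)
```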
