# 3. MapReduce

For this exercise we are going to use MapReduce in local mode, i.e. we won't be running anything on the cluster!
 
## 3.1. Use the commands `head`, `cat`, `uniq`, `wc`, `sort`, `find`, `xargs`, `awk` to evaluate the NASA log file:

* Data File:  <https://github.com/scalable-infrastructure/exercise-2018/blob/master/data/nasa/NASA_access_log_Jul95.gz>
* Which page was called the most?
* What was the most frequent return code?
* How many errors occurred? What is the percentage of errors?


In [1]:
!pwd
!gzip -k -d -f  ../data/nasa/NASA_access_log_Jul95.gz

/home/hpc/pn69si/mnmda001/git/exercise-2018/03_MapReduce


In [2]:
!ls ../data/nasa

NASA_access_log_Jul95  NASA_access_log_Jul95.gz


In [3]:
%%time
!cat ../data/nasa/NASA_access_log_Jul95 | awk  '{print $(NF-1)}'| sort | uniq -c

1701534 200
  46573 302
 132627 304
      5 400
     54 403
  10845 404
     62 500
     14 501
      1 alyssa.p
CPU times: user 923 ms, sys: 220 ms, total: 1.14 s
Wall time: 17.9 s


Compute Percentages

In [14]:
data = !cat ../data/nasa/NASA_access_log_Jul95 | awk  '{print $(NF-1)}'| sort | uniq -c

In [29]:
df=pd.DataFrame.from_records([i.split() for i in data], columns=["Count", "HTTP RC"])
df["Count"]=pd.to_numeric(df["Count"], errors='coerce')
df["Counts_pct"]=(df["Count"]/df["Count"].sum()*100) 
df

Unnamed: 0,Count,HTTP RC,Counts_pct
0,1701534,200,89.946636
1,46573,302,2.461946
2,132627,304,7.01094
3,5,400,0.000264
4,54,403,0.002855
5,10845,404,0.573289
6,62,500,0.003277
7,14,501,0.00074
8,1,alyssa.p,5.3e-05


## 3.2 Implement a Python version of this Unix Shell script using this script as template! Run the Python script inside an Hadoop Streaming job.

Template: <https://github.com/scalable-infrastructure/scalable-infrastructure.github.io/blob/master/src/map_reduce.py>

In [2]:
import os
os.environ["HADOOP_HOME"]="/naslx/projects/pn69si/mnmda001/software/hadoop-2.7.5"
os.environ["PATH"]="/naslx/projects/pn69si/mnmda001/software/hadoop-2.7.5/bin:"+os.environ["PATH"]

In [5]:
!hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar -info

Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  -input          <path> DFS input file(s) for the Map step.
  -output         <path> DFS output directory for the Reduce step.
  -mapper         <cmd|JavaClassName> Optional. Command to be run as mapper.
  -combiner       <cmd|JavaClassName> Optional. Command to be run as combiner.
  -reducer        <cmd|JavaClassName> Optional. Command to be run as reducer.
  -file           <file> Optional. File/dir to be shipped in the Job jar file.
                  Deprecated. Use generic option "-files" instead.
  -inputformat    <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>
                  Optional. The input format class.
  -outputformat   <TextOutputFormat(default)|JavaClassName>
                  Optional. The output format class.
  -partitioner    <JavaClassName>  Optional. The partitioner class.
  -numReduceTasks <num> Optional. Number of reduce tasks.
  -inputreader    <spec> Optional. 

In [3]:
%%time
!rm -rf nasa-out
!hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
             -input `pwd`/../data/nasa/NASA_access_log_Jul95 -output nasa-out \
             -mapper "map_reduce_solution.py map" -reducer "map_reduce_solution.py reduce"  \
             -file `pwd`/map_reduce_solution.py 

18/04/05 09:15:18 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/hpc/pn69si/mnmda001/git/exercise-2018/03_MapReduce/map_reduce_solution.py] [] /tmp/streamjob7563165842408199726.jar tmpDir=null
18/04/05 09:15:20 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/04/05 09:15:20 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/04/05 09:15:20 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
18/04/05 09:15:21 INFO mapred.FileInputFormat: Total input paths to process : 1
18/04/05 09:15:21 INFO mapreduce.JobSubmitter: number of splits:7
18/04/05 09:15:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local810022510_0001
18/04/05 09:15:24 INFO mapred.LocalDistributedCacheManager: Localized file:/home/hpc/pn69si/mnmda001/git/exercise-2018/03_MapReduce/map_reduce_so

---

## 3.3 Run the program Terasort on 1 GB of data - each record that TeraGen generates is 100 Bytes in size:

    hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar teragen <number_of_records> <output_directory>

    hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar terasort <input_directory> <output_directory>

Measure the runtime for each step!

In [8]:
%%time
!hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar teragen 10000000 teragen-1GB

18/04/03 15:02:04 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/04/03 15:02:04 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/04/03 15:02:05 INFO terasort.TeraSort: Generating 10000000 using 1
18/04/03 15:02:05 INFO mapreduce.JobSubmitter: number of splits:1
18/04/03 15:02:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1151238665_0001
18/04/03 15:02:06 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/04/03 15:02:06 INFO mapreduce.Job: Running job: job_local1151238665_0001
18/04/03 15:02:06 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/04/03 15:02:06 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/04/03 15:02:06 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
18/04/03 15:02:07 INFO mapred.LocalJobRunner: Waiting for map tasks
18/04/03 15:02:07

In [9]:
%%time
!hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar teragen 10000000 terasort-1GB

18/04/03 15:03:21 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/04/03 15:03:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/04/03 15:03:22 INFO terasort.TeraSort: Generating 10000000 using 1
18/04/03 15:03:22 INFO mapreduce.JobSubmitter: number of splits:1
18/04/03 15:03:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local722040421_0001
18/04/03 15:03:23 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/04/03 15:03:23 INFO mapreduce.Job: Running job: job_local722040421_0001
18/04/03 15:03:23 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/04/03 15:03:23 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/04/03 15:03:23 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
18/04/03 15:03:23 INFO mapred.LocalJobRunner: Waiting for map tasks
18/04/03 15:03:23 I

In [19]:
%%time
!hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar teravalidate terasort-1GB teravalidate-1GB

18/03/11 22:30:55 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/03/11 22:30:55 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/03/11 22:30:56 INFO input.FileInputFormat: Total input paths to process : 1
Spent 282ms computing base-splits.
Spent 15ms computing TeraScheduler splits.
18/03/11 22:30:56 INFO mapreduce.JobSubmitter: number of splits:1
18/03/11 22:30:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1518461612_0001
18/03/11 22:30:57 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/03/11 22:30:57 INFO mapreduce.Job: Running job: job_local1518461612_0001
18/03/11 22:30:57 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/03/11 22:30:57 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/03/11 22:30:57 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitte