# Hadoop Streaming assignment 1: Words Rating

The purpose of this task is to create your own WordCount program for processing a Wikipedia dump and to learn the basic concepts of MapReduce.

In this task you have to find the 7th most popular word and its count in Wikipedia data (`/data/wiki/en_articles_part`), with words ranked in descending order of popularity (most popular first).

There are several points for this task:

1) As an output, you have to get the 7th word and its quantity separated by a tab character.

2) You must use a second job to obtain a totally ordered result.

3) Do not forget to redirect all trash and output to /dev/null.
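
Point 2 can be illustrated locally. The sketch below spreads a few made-up words over four "reducers" and sorts each part on its own, then compares the concatenation with one global sort. The `partition` helper and its use of `zlib.crc32` are assumptions chosen for determinism; Hadoop's default partitioner hashes keys differently.

```python
import zlib

def partition(key, num_reducers):
    """Deterministic stand-in for a hash partitioner (crc32 is an assumption)."""
    return zlib.crc32(key.encode('utf-8')) % num_reducers

words = ['the', 'of', 'and', 'in', 'to', 'a', 'is', 'was']  # made-up sample keys

# Route every key to one of four parts, as four reducers would receive them
parts = {i: [] for i in range(4)}
for w in words:
    parts[partition(w, 4)].append(w)

# Each reducer sorts only the keys it received
per_part = [sorted(parts[i]) for i in range(4)]
concatenated = [w for part in per_part for w in part]

# A single reducer sees every key, so its output is totally ordered
single = sorted(words)

print(per_part)
print(concatenated == single)
```

Each part is sorted internally, but reading the parts one after another gives a global order only if the partitioner happens to split the key range in order, which a hash partitioner does not guarantee.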

Here you can find a draft of the main steps of the task. You may also use other methods to obtain the solution.

####  Written for Python 3.6 (`/opt/conda/bin/python`).

## Step 1. Create mapper and reducer.

<b>Hint:</b>  The demo task contains almost all the necessary pieces to complete this assignment. You may use the demo to implement the first MapReduce job.

In [153]:
%%writefile mapper1.py

import sys
import re

for line in sys.stdin:
    try:
        article_id, content = line.strip().split('\t', 1)
    except ValueError as e:
        continue
    words = re.split(r'\W+', content)
    for word in words:
        if not word:
            continue  # skip empty strings left at the edges by re.split
        sys.stderr.write('reporter:counter:Wiki stats,Total words,1\n')
        print(f'{word.lower()}\t1')

Overwriting mapper1.py
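
To see how the mapper's tokenization behaves, the sketch below applies the same `\W+` pattern to a made-up line. `tokenize` is a hypothetical helper, not part of the assignment. It shows that splitting on `\W+` can leave an empty string at the edge of a line, so filtering empty tokens keeps them out of the word counter.

```python
import re

def tokenize(content):
    """Split text on runs of non-word characters and lowercase each token."""
    return [w.lower() for w in re.split(r'\W+', content) if w]

# The trailing '!' leaves an empty string at the end of the raw split
raw = re.split(r'\W+', 'Hello, world!')
print(raw)

# The helper filters the empty token and lowercases the rest
print(tokenize('Hello, world!'))
```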


In [11]:
%%writefile reducer1.py

import sys

current_key = None
total_amt = 0

for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        continue
    if current_key != key:
        if current_key:
            print(f'{current_key}\t{total_amt}')
        total_amt = 0
        current_key = key
    total_amt += count

if current_key:
    print(f'{current_key}\t{total_amt}')

Overwriting reducer1.py
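
Before submitting the job, the map and reduce logic can be checked locally. The sketch below is a minimal in-memory imitation of map, shuffle (sort), and reduce on made-up input; `map_words` and `reduce_counts` are hypothetical names, and `itertools.groupby` stands in for the hand-written key tracking in `reducer1.py`.

```python
import re
from itertools import groupby

def map_words(lines):
    """Emit (word, 1) pairs from 'article_id<TAB>content' lines."""
    for line in lines:
        try:
            _article_id, content = line.strip().split('\t', 1)
        except ValueError:
            continue  # skip lines without a tab
        for word in re.split(r'\W+', content):
            if word:
                yield word.lower(), 1

def reduce_counts(pairs):
    """Sum counts per key; pairs must already be sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

sample = ['1\tthe cat and the hat', '2\tThe end']  # made-up articles
# sorted() plays the role of the shuffle phase between map and reduce
counts = dict(reduce_counts(sorted(map_words(sample))))
print(counts)
```

Because summing is associative, the same reduce function can also serve as a combiner, which is how the first job in this notebook uses `reducer1.py`.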


## Step 2. Create sort job.

<b>Hint:</b> You may use a MapReduce comparator to solve this step. Make sure that the keys are sorted in ascending order.

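To use the comparator, the count (`amt`) has to be the key, because the sort is applied to keys.

If the mapper emits `amt` as the key, the framework sorts the records during the shuffle, and the reducer only has to swap the fields back and print the result.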
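
Streaming keys are compared as text by default, so swapping the fields is not enough on its own: the comparison also has to be numeric. The sketch below uses made-up counts to contrast a text sort with a numeric sort, both in descending order. This is the distinction the `n` and `r` flags in `-k1,1nr` address.

```python
# (word, count) pairs as the first job might produce them; values are made up
pairs = [('a', 9), ('the', 100), ('of', 10)]

# Swap to (count, word) so the count becomes the sort key
swapped = [(str(count), word) for word, count in pairs]

# Descending sort with keys compared as text: '9' sorts above '100'
as_text = sorted(swapped, key=lambda kv: kv[0], reverse=True)

# Descending sort with keys compared as numbers
as_number = sorted(swapped, key=lambda kv: int(kv[0]), reverse=True)

print([k for k, _ in as_text])    # ['9', '100', '10']
print([k for k, _ in as_number])  # ['100', '10', '9']
```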

In [94]:
%%writefile mapper_sort.py

import sys

for line in sys.stdin:
    try:
        word, amt = line.strip().split('\t',1)
        amt = int(amt)
    except ValueError as e:
        continue
    
    print(f'{amt}\t{word}')

Overwriting mapper_sort.py


In [101]:
%%writefile reducer_sort.py

import sys

for line in sys.stdin:
    try:
        amt, word = line.strip().split('\t',1)
        amt = int(amt)
    except ValueError as e:
        continue
    
    print(f'{word}\t{amt}')

Overwriting reducer_sort.py


## Step 3. Bash commands

<b> Hint: </b> For printing the exact row you may use basic UNIX commands. For instance, sed/head/tail/... (if you know other commands, you can use them).

To run both jobs, you must use two consecutive yarn commands. Remember that the input for the second job is the output of the first job.
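
For the hint about printing an exact row, the selection that `sed -n '7p'` performs can be sketched in Python as follows. `nth_line` is a hypothetical helper and the rows are made up.

```python
from itertools import islice

def nth_line(lines, n):
    """Return the n-th line (1-based) of an iterable, or None if it is shorter."""
    return next(islice(lines, n - 1, n), None)

rows = ['the\t3', 'of\t2', 'and\t1']  # made-up sorted output
print(nth_line(rows, 2))
```

`islice` stops reading once the requested row has been reached, so the rest of the input is never consumed.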

In [127]:
%%time
%%bash

OUT_DIR="assignment1_"$(date +"%s%6N")
HADOOP_JAR="/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar"

hdfs dfs -rm -r -skipTrash assignment* > /dev/null

# Code for your first job
yarn jar ${HADOOP_JAR} \
    -D mapreduce.job.name="First Job" \
    -D mapreduce.job.reduces=4 \
    -files mapper1.py,reducer1.py \
    -mapper "/opt/conda/bin/python mapper1.py" \
    -combiner "/opt/conda/bin/python reducer1.py" \
    -reducer "/opt/conda/bin/python reducer1.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null

OUT_DIR2="final_result"

hdfs dfs -rm -r -skipTrash final_result* > /dev/null

# Code for your second job
yarn jar ${HADOOP_JAR} \
    -D mapreduce.job.name="Second Job" \
    -D mapreduce.job.reduces=1 \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.map.output.key.field.separator=$'\t' \
    -D mapreduce.partition.keycomparator.options="-k1,1nr" \
    -files mapper_sort.py,reducer_sort.py \
    -mapper "/opt/conda/bin/python mapper_sort.py" \
    -reducer "/opt/conda/bin/python reducer_sort.py" \
    -input ${OUT_DIR} \
    -output ${OUT_DIR2} > /dev/null

# Code for obtaining the results: prints the top 7 rows, the 7th of which is the answer
# (sed -n '7p' would print only that row)
hdfs dfs -cat ${OUT_DIR2}/part-00000 | sed -n '1,7p'

the	831724
of	448016
and	344105
in	298796
to	242846
a	242158
is	126579


19/03/18 13:53:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/03/18 13:53:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/03/18 13:53:44 INFO mapred.FileInputFormat: Total input files to process : 1
19/03/18 13:53:45 INFO mapreduce.JobSubmitter: number of splits:2
19/03/18 13:53:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552910315332_0017
19/03/18 13:53:45 INFO impl.YarnClientImpl: Submitted application application_1552910315332_0017
19/03/18 13:53:45 INFO mapreduce.Job: The url to track the job: http://12da6f52529f:8088/proxy/application_1552910315332_0017/
19/03/18 13:53:45 INFO mapreduce.Job: Running job: job_1552910315332_0017
19/03/18 13:53:51 INFO mapreduce.Job: Job job_1552910315332_0017 running in uber mode : false
19/03/18 13:53:51 INFO mapreduce.Job:  map 0% reduce 0%
19/03/18 13:54:07 INFO mapreduce.Job:  map 21% reduce 0%
19/03/18 13:54:13 INFO mapreduce.Job:  map 32% reduce 0%
19/03/18 13:54:19 INFO 

CPU times: user 10 ms, sys: 10 ms, total: 20 ms
Wall time: 1min 44s


In [128]:
# Result of the first job
!hdfs dfs -ls assignment1_1552917220194804

Found 5 items
-rw-r--r--   1 jovyan supergroup          0 2019-03-18 13:54 assignment1_1552917220194804/_SUCCESS
-rw-r--r--   1 jovyan supergroup     767815 2019-03-18 13:54 assignment1_1552917220194804/part-00000
-rw-r--r--   1 jovyan supergroup     771088 2019-03-18 13:54 assignment1_1552917220194804/part-00001
-rw-r--r--   1 jovyan supergroup     765018 2019-03-18 13:54 assignment1_1552917220194804/part-00002
-rw-r--r--   1 jovyan supergroup     768466 2019-03-18 13:54 assignment1_1552917220194804/part-00003


In [135]:
!hdfs dfs -cat assignment1_1552917220194804/part-00003 | head

0	14891
000	8186
00000	5
00000000000	1
00000035	1
0000003a	1
00004	2
00008	1
00010111	1
00011	2
cat: Unable to write to output stream.


In [129]:
# Result of the second job
!hdfs dfs -cat final_result/part-00000 | sed -n '1,7p'

the	831724
of	448016
and	344105
in	298796
to	242846
a	242158
is	126579
