# Hadoop Streaming assignment 4: Word group

Write a WordCount program for all the names in the dataset. A name is a word with the following properties:

1. The first character is not a digit (other characters can be digits).
2. The first character is uppercase, and all other characters that are letters are lowercase.
3. Occurrences of the word that do not meet condition (2), counted regardless of case, make up less than 0.5% of all its occurrences in the dataset.

Order by count, most popular first. Output format:

name <tab> count

The result is the 5th line in the output

The result on the sample dataset:
french 5742
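
The first two properties listed above can be checked per word, without any dataset-wide statistics. The following is a minimal sketch of such a check; the function name `looks_like_name` is hypothetical and not part of the assignment, and the third property (the 0.5% threshold) is deliberately left out because it needs counts over the whole dataset.

```python
def looks_like_name(word):
    """Check only the first two properties of a name (hypothetical helper).

    The 0.5% threshold is not checked here because it requires
    counts collected over the whole dataset.
    """
    # Property 1: the first character must exist and must not be a digit
    if not word or word[0].isdigit():
        return False
    # Property 2a: the first character is uppercase
    if not word[0].isupper():
        return False
    # Property 2b: every remaining letter is lowercase (digits are allowed)
    return all(ch.islower() for ch in word[1:] if ch.isalpha())
```

For example, `looks_like_name("French")` is true, while `looks_like_name("FRENCH")` and `looks_like_name("2pac")` are false.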

## Step 1. Create the mapper.

<b>Hint:</b> Create a mapper that counts the total number of words and the number of stop words. You may write these counts to `sys.stderr`, which makes it possible to parse them in later steps.

Example of such a redirection (Python 2 syntax):

`print >> sys.stderr, "reporter:counter:Wiki stats,Total words,%d" % count`

Remember the distributed cache. If we add the option `-files mapper.py,reducer.py,/datasets/stop_words_en.txt`, then `mapper.py`, `reducer.py`, and `stop_words_en.txt` will be in the same directory on the datanodes. Hence the mapper must use the relative path `stop_words_en.txt` to access this file.
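
The counter format and the relative-path rule can be sketched together as follows. This is only an illustration under stated assumptions: the helper names `load_stop_words` and `map_lines` and the counter name `Stop words` are hypothetical, and the tokenization with `\w+` is simpler than the one used in the mapper of this notebook.

```python
import re
import sys


def load_stop_words(path="stop_words_en.txt"):
    """Read stop words from the relative path provided by -files (assumed name)."""
    with open(path) as f:
        return set(line.strip().lower() for line in f if line.strip())


def map_lines(lines, stop_words, out=sys.stdout, err=sys.stderr):
    """Emit 'word<TAB>1' for non-stop words and report two counters on stderr."""
    total = 0
    stops = 0
    for line in lines:
        parts = line.strip().split("\t", 1)
        if len(parts) != 2:
            continue  # skip lines without an article id
        for word in re.findall(r"\w+", parts[1], flags=re.UNICODE):
            total += 1
            if word.lower() in stop_words:
                stops += 1
                continue
            out.write("{}\t1\n".format(word.lower()))
    # Hadoop Streaming reads counter updates from stderr lines of the form
    # reporter:counter:<group>,<counter>,<amount>
    err.write("reporter:counter:Wiki stats,Total words,{}\n".format(total))
    err.write("reporter:counter:Wiki stats,Stop words,{}\n".format(stops))
```

In an actual mapper script the function would be called as `map_lines(sys.stdin, load_stop_words())`. Writing the counters once at the end, rather than once per word, keeps the stderr traffic small.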

In [18]:
%%writefile mapper1.py
import sys
import re

path = 'stop_words_en.txt'

# Read stop words into a set for constant-time membership tests
with open(path, "r") as f:
    stop_words = set(f.read().splitlines())

for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        # Skip lines without a tab-separated article id
        continue

    # Raw string avoids invalid-escape warnings for \W and \s
    words = re.split(r'\W*\s+\W*', text, flags=re.UNICODE)

    for word in words:
        if word.lower() not in stop_words and word.isalpha():
            print("{}\t{}".format(word.lower(), 1))


Overwriting mapper1.py


## Step 2. Create the reducer.

Create a reducer that accumulates the mapper output. You may also implement a combiner; it can speed up the computation (see the Week 2 lectures for details).
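
A combiner for this job has to emit the same `word<TAB>count` format that the mapper produces, so a reducer that adds an extra column cannot be reused as the combiner. The sketch below shows one possible combiner core; the function name `combine` is hypothetical, and it assumes its input is already sorted by word, as it is when Hadoop Streaming invokes a combiner.

```python
from itertools import groupby


def combine(lines):
    """Yield 'word<TAB>count' with counts summed per word.

    Assumes the input lines are sorted by word, so equal words are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # Keep only well-formed 'word<TAB>integer' records
    valid = (p for p in pairs if len(p) == 2 and p[1].isdigit())
    for word, group in groupby(valid, key=lambda p: p[0]):
        yield "{}\t{}".format(word, sum(int(count) for _, count in group))
```

Saved in a script (for example a hypothetical `combiner1.py` that prints every item of `combine(sys.stdin)`), it could be attached to the job with the `-combiner` option of Hadoop Streaming, with the script added to `-files`.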

In [8]:
%%writefile reducer1.py

import sys

current_word = None
current_count = 0
word_group = None

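for line in sys.stdin:
    try:
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip malformed lines
        continue

    if current_word != word:
        if current_word:
            # The word group key is the word's letters in sorted order
            word_group = "".join(sorted(current_word))
            print("{}\t{}\t{}".format(current_word, word_group, current_count))
        current_count = 0
        current_word = word

    current_count += count

if current_word:
    # Recompute the key for the last word; the old value belongs to the previous word
    word_group = "".join(sorted(current_word))
    print("{}\t{}\t{}".format(current_word, word_group, current_count))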

Overwriting reducer1.py


## Step 3. Create second MapReduce.


In [11]:
%%writefile mapper2.py

import sys

for line in sys.stdin:
    try:
        word, word_group, count = line.strip().split('\t', 2)
        count = int(count)
    except ValueError:
        # Skip malformed lines
        continue
    # Put the group key first so that Hadoop sorts the records by it
    print("{}\t{}\t{}".format(word_group, word, count))


Overwriting mapper2.py


In [22]:
%%writefile reducer2.py

import sys

current_count = 0
words_in_group = set()
current_group = None

for line in sys.stdin:
    try:
        word_group, word, count = line.strip().split('\t', 2)
        count = int(count)
    except ValueError:
        # Skip malformed lines
        continue

    if current_group != word_group:
        # Emit only groups that contain more than one distinct word
        if current_group and len(words_in_group) > 1:
            print("{}\t{} {}\t{}".format(current_count, current_group, len(words_in_group), ",".join(sorted(words_in_group))))
        current_count = 0
        current_group = word_group
        words_in_group = set()

    words_in_group.add(word)
    current_count += count

# Apply the same size filter to the last group
if current_group and len(words_in_group) > 1:
    print("{}\t{} {}\t{}".format(current_count, current_group, len(words_in_group), ",".join(sorted(words_in_group))))

Overwriting reducer2.py
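
Before submitting the two jobs, the grouping logic can be checked locally on a handful of words. The sketch below reproduces both stages in memory under simplifying assumptions: the function name `word_groups` is hypothetical, stop-word filtering is omitted, and the result is returned as tuples rather than printed in the reducer's output format.

```python
from collections import Counter, defaultdict


def word_groups(words):
    """Return (total_count, key, sorted_words) for keys shared by 2+ distinct words."""
    # Stage 1: count every lowercased word
    counts = Counter(w.lower() for w in words)
    # Stage 2: group words by their letters in sorted order
    groups = defaultdict(dict)
    for word, count in counts.items():
        groups["".join(sorted(word))][word] = count
    result = []
    for key, members in groups.items():
        if len(members) > 1:  # keep only groups with more than one word
            result.append((sum(members.values()), key, sorted(members)))
    # Largest total count first
    return sorted(result, reverse=True)
```

For the input `["listen", "silent", "Listen", "enlist", "cat", "act", "dog"]` the function returns two groups: one with key `eilnst` and total count 4, and one with key `act` and total count 2. The word `dog` is dropped because its group has a single member.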


## Step 4. Bash commands



In [15]:
%%bash

OUT_DIR_1="task_4_1_"$(date +"%s%6N")
NUM_REDUCERS=4

hdfs dfs -rm -r -skipTrash ${OUT_DIR_1}* > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="WordGroup" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper1.py,reducer1.py,/datasets/stop_words_en.txt \
    -mapper "python mapper1.py" \
    -reducer "python reducer1.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR_1} > /dev/null

OUT_DIR_2="task_4_2_"$(date +"%s%6N")
NUM_REDUCERS_2=1
hdfs dfs -rm -r -skipTrash ${OUT_DIR_2} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="WordGroup2" \
    -D mapreduce.job.reduces=${NUM_REDUCERS_2} \
    -files mapper2.py,reducer2.py \
    -mapper 'python mapper2.py' \
    -reducer 'python reducer2.py' \
    -input ${OUT_DIR_1} \
    -output ${OUT_DIR_2} > /dev/null

# Code for obtaining the results
hdfs dfs -cat ${OUT_DIR_2}/part-00000 | grep "english,"

rm: `task_4_1_1566082094172337*': No such file or directory
19/08/17 22:48:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/08/17 22:48:21 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/08/17 22:48:24 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1252)
	at java.lang.Thread.join(Thread.java:1326)
	at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:927)
	at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:578)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:755)
19/08/17 22:48:24 INFO mapred.FileInputFormat: Total input files to process : 1
19/08/17 22:48:24 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1252)
	at java.lang.Thread.join(Thread.java:1326)
	at org.apache.hadoop.hdfs.DataSt