# Hadoop Streaming assignment 3: Name count

Write a WordCount-style program that counts all the names in the dataset. A name is a word with the following properties:

1. The first character is not a digit (other characters can be digits).
2. The first character is uppercase, and all other characters that are letters are lowercase.
3. Fewer than 0.5% of the word's occurrences, counted regardless of case, fail condition (2).

Order the output by count, most popular first. Output format:

name <tab> count

The result is the 5th line of the output.

The result on the sample dataset:
french 5742
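
The name definition above can be sketched as two small checks: one on the format of a single occurrence and one on the ratio accumulated for a word. This is a minimal sketch, not the reference solution. The helper names `has_name_format` and `is_name` are hypothetical, and treating "less than 0.5%" as a strict inequality is an assumption about how the condition is meant to be read.

```python
def has_name_format(word):
    """Format check: first character is an uppercase letter, other letters are lowercase."""
    if not word:
        return False
    first, rest = word[0], word[1:]
    # rest == rest.lower() ignores digits and punctuation, so "R2d2" passes.
    return first.isalpha() and first.isupper() and rest == rest.lower()


def is_name(name_count, total_count):
    """Ratio check: fewer than 0.5% of all occurrences break the name format."""
    violations = total_count - name_count
    # Integer arithmetic avoids float rounding: violations / total < 1 / 200.
    return total_count > 0 and violations * 200 < total_count
```

Here `name_count` is the number of occurrences that pass `has_name_format`, and `total_count` is the number of occurrences of the word in any case.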

## Step 1. Create the mapper.

<b>Hint:</b> Create a mapper that calculates the Total words and Stop words amounts. You may write this information to sys.stderr, which makes it possible to parse the data in the next steps.

Example of such a redirection:

`print >> sys.stderr, "reporter:counter:Wiki stats,Total words,%d" % count`

Remember the distributed cache. If you add the option `-files mapper.py,reducer.py,/datasets/stop_words_en.txt`, then `mapper.py`, `reducer.py`, and `stop_words_en.txt` will be placed in the same directory on the datanodes. The mapper must therefore use the relative path `stop_words_en.txt` to access this file.
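
A minimal sketch of how a mapper might combine the two hints, assuming the `-files` option above has placed `stop_words_en.txt` in the task's working directory and that the file holds one word per line (the file layout is an assumption). The helpers `load_stop_words` and `report_counter` are hypothetical names; the `reporter:counter:<group>,<counter>,<amount>` line format is the one shown in the hint.

```python
import sys


def load_stop_words(path="stop_words_en.txt"):
    """Read the cached file through a relative path; one word per line is assumed."""
    with open(path) as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def report_counter(group, counter, amount, stream=sys.stderr):
    """Write a Hadoop Streaming counter update to stderr."""
    stream.write("reporter:counter:{},{},{}\n".format(group, counter, amount))
```

Inside a mapper loop, a call such as `report_counter("Wiki stats", "Total words", len(words))` would add the number of words on the current line to the counter.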

In [12]:
%%writefile mapper1.py
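#! /usr/bin/python
import sys
import re

for line in sys.stdin:
    try:
        # Input format: article_id <tab> article text
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        continue  # skip lines without a tab separator
    # Split on whitespace, dropping punctuation adjacent to it.
    words = re.split(r'\W*\s+\W*', text.strip())
    for word in words:
        if not word:
            continue  # re.split can yield empty strings
        # Condition 1: the first character is a letter, not a digit.
        cond1 = word[0].isalpha()
        # Condition 2: first letter not lowercase, remaining letters lowercase.
        cond2 = not word[0].islower() and word[1:].islower()
        if cond1:
            # Emit: lowercased word, occurrence count, name-format flag.
            print("{}\t{}\t{}".format(word.lower(), 1, int(cond2)))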


Overwriting mapper1.py


## Step 2. Create the reducer.

Create a reducer that accumulates the mapper output. You may also implement a combiner, which can optimize and speed up your computations (see the Week 2 lectures for more details).
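
A combiner for this job could pre-aggregate mapper output before it is shuffled. The sketch below assumes mapper records of the form `word <tab> count <tab> name_flag` and keeps the same field order in its output, because a combiner's output is fed to the reducer in place of the raw mapper records. The function name `combine` is hypothetical.

```python
from itertools import groupby


def combine(lines):
    """Sum both counters per word; the input must already be sorted by word."""
    records = (line.rstrip("\n").split("\t") for line in lines)
    # Skip records that do not have exactly three fields.
    valid = (rec for rec in records if len(rec) == 3)
    for word, group in groupby(valid, key=lambda rec: rec[0]):
        total = names = 0
        for _, count, name_count in group:
            total += int(count)
            names += int(name_count)
        # Same layout as the mapper output: word, count, name count.
        yield "{}\t{}\t{}".format(word, total, names)
```

Wrapped in a script that passes `sys.stdin` to `combine` and prints each record, this could be supplied to Hadoop Streaming through the `-combiner` option. Note that `reducer1.py` below writes the name count before the total count, so it could not serve as the combiner without changing that order.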

In [11]:
%%writefile reducer1.py
#! /usr/bin/python
# Input (sorted by word): word <tab> count <tab> name_flag
# Output: word <tab> name_count <tab> total_count
import sys

current_word = None
current_count = 0
sum_name_count = 0
for line in sys.stdin:
    try:
        word, count, name_count = line.strip().split('\t', 2)
        count = int(count)
        name_count = int(name_count)
        if current_word != word:
            if current_word:
                print("{}\t{}\t{}".format(current_word,sum_name_count,current_count))
            current_count = 0
            current_word = word
            sum_name_count=0
            
        current_count  += count
        sum_name_count += name_count
        
    except Exception as e:
        continue
    
if current_word:
    print("{}\t{}\t{}".format(current_word,sum_name_count,current_count))

Overwriting reducer1.py


## Step 3. Create the second MapReduce job.


In [21]:
%%writefile mapper2.py

import sys

for line in sys.stdin:
    try:
        word, name_count,count = line.strip().split('\t', 2)
        count = int(count)
        name_count = int(name_count)
        # Put name_count first so the job can sort on it as the key.
        print("{}\t{}\t{}".format(name_count,count,word))
        
    except Exception as e:
        continue


Overwriting mapper2.py


In [22]:
%%writefile reducer2.py

import sys

for line in sys.stdin:
    try:
        name_count, count, word = line.strip().split('\t', 2)
        count = int(count)
        name_count = int(name_count)
        # Keep words that appear in name format in at least 99.5% of occurrences.
        if float(name_count)/float(count) >= 0.995:
            print("{0}\t{1}".format(word, name_count))
    except Exception as e:
        continue

Overwriting reducer2.py
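
The second job can be sanity-checked locally on a few records before it is submitted. The sketch below is a stand-in under stated assumptions, not the Hadoop pipeline itself: `second_stage` is a hypothetical helper that takes first-stage records (`word <tab> name_count <tab> count`), keeps words whose name ratio reaches a threshold (0.995 by default, mirroring the constant in `reducer2.py`), and orders them by name count, largest first.

```python
def second_stage(lines, threshold=0.995):
    """Filter first-stage records by name ratio and rank them by name count."""
    names = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip records without exactly three fields
        word, name_count, count = parts[0], int(parts[1]), int(parts[2])
        if count > 0 and name_count / float(count) >= threshold:
            names.append((word, name_count))
    # Most popular first; ties are broken alphabetically (an arbitrary choice).
    names.sort(key=lambda item: (-item[1], item[0]))
    return ["{}\t{}".format(word, n) for word, n in names]
```

On the cluster the ordering comes from the sort phase rather than from an in-memory sort, so this helper is only suitable for small test inputs.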


## Step 4. Bash commands



In [13]:
%%bash

OUT_DIR_1="task_3_1_"$(date +"%s%6N")
NUM_REDUCERS=4

hdfs dfs -rm -r -skipTrash ${OUT_DIR_1}* > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="NamesCount" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper1.py,reducer1.py \
    -mapper "python mapper1.py" \
    -reducer "python reducer1.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR_1} > /dev/null

OUT_DIR_2="task_3_2_"$(date +"%s%6N")
NUM_REDUCERS_2=1
hdfs dfs -rm -r -skipTrash ${OUT_DIR_2} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k1,1nr" \
    -D mapreduce.job.name="NamesCount2" \
    -D mapreduce.job.reduces=${NUM_REDUCERS_2} \
    -files mapper2.py,reducer2.py \
    -mapper 'python mapper2.py' \
    -reducer 'python reducer2.py' \
    -input ${OUT_DIR_1} \
    -output ${OUT_DIR_2} > /dev/null

# Code for obtaining the results
hdfs dfs -cat ${OUT_DIR_2}/part-00000 | sed -n "5p;8q"


rm: `task_3_1_1566061416127044': No such file or directory
19/08/17 17:03:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/08/17 17:03:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/08/17 17:03:44 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1252)
	at java.lang.Thread.join(Thread.java:1326)
	at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:927)
	at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:578)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:755)
19/08/17 17:03:44 INFO mapred.FileInputFormat: Total input files to process : 1
19/08/17 17:03:44 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1252)
	at java.lang.Thread.join(Thread.java:1326)
	at org.apache.hadoop.hdfs.DataStr