# Hadoop Streaming assignment 3: Name Count

Write a WordCount-style program that counts all the names in the dataset. A name is a word with the following properties:

1) The first character is not a digit (other characters can be digits).

2) The first character is uppercase, all the other characters that are letters are lowercase.

3) Among all occurrences of this word in the dataset, counted regardless of case, fewer than `0.5%` fail condition (2).

Order the output by count, most popular first. Output format:

    name <tab> count

The result is the 5th line in the output.

The result on the sample dataset:

    french 5742
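
The three conditions and the ordering can be checked on small inputs without Hadoop. The following is a minimal in-memory sketch, assuming Python 3 and a simple `\w+` tokenization; it is not the tokenization used by the streaming jobs below, so counts on real data may differ. The names `count_names` and `max_other_share` are hypothetical.

```python
import re
from collections import Counter

def count_names(texts, max_other_share=0.005):
    """Return (name, count) pairs sorted by count, most frequent first."""
    total = Counter()      # occurrences of each word, case-insensitive
    name_like = Counter()  # occurrences that satisfy conditions 1) and 2)
    for text in texts:
        for word in re.findall(r"\w+", text):
            if word[0].isdigit():
                continue  # condition 1): must not start with a digit
            key = word.lower()
            total[key] += 1
            # condition 2): uppercase first character, no uppercase afterwards
            if word[0].isupper() and not any(c.isupper() for c in word[1:]):
                name_like[key] += 1
    names = [
        (key, name_like[key])
        for key in total
        # condition 3): share of occurrences failing condition 2) is below the limit
        if name_like[key]
        and (total[key] - name_like[key]) / total[key] < max_other_share
    ]
    # Most popular first; ties are broken alphabetically for a stable result.
    return sorted(names, key=lambda pair: (-pair[1], pair[0]))
```

For example, `count_names(["Alice met Bob", "Bob met Alice and Bob"])` keeps `bob` and `alice`, while a text such as `"Bob bob"` yields no names because half of the occurrences are lowercase.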



## Step 1. Create the 1st mapper and reducer: condition 2)

In [1]:
%%writefile mapper1.py

import sys
import re

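for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        # Skip malformed lines without a tab-separated text field.
        continue

    # Split on whitespace, dropping punctuation adjacent to it.
    words = re.split(r'\W*\s+\W*', text.strip())
    for word in words:
        if not word:
            # re.split returns empty strings when a separator starts or ends the text.
            continue

        cond1 = word[0].isalpha()  # condition 1): the first character is a letter
        # condition 2): first character is not lowercase, the rest contains
        # lowercase letters and no uppercase ones
        cond2 = not word[0].islower() and word[1:].islower()

        if cond1:
            # Emit: lowercased word, occurrence count, name-like flag.
            print("{}\t{:d}\t{:d}".format(word.lower(), 1, int(cond2)))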

Overwriting mapper1.py
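
To see what the mapper emits for a given word, the splitting pattern and the two checks can be tried in isolation. The sketch below reuses the pattern from `mapper1.py`, written as a raw string; the helper names `tokenize` and `flags` are hypothetical and exist only for this illustration.

```python
import re

def tokenize(text):
    # Same pattern as in mapper1.py: split on whitespace and strip
    # punctuation that touches the whitespace.
    return re.split(r'\W*\s+\W*', text.strip())

def flags(word):
    """Return (lowercased word, 1, name-like flag), or None if the word is skipped."""
    if not word or not word[0].isalpha():
        return None
    cond2 = not word[0].islower() and word[1:].islower()
    return (word.lower(), 1, int(cond2))

print(tokenize("Hello, world! (French) text."))
print(flags("French"), flags("french"), flags("A"), flags("B52"))
```

Two details are worth noting. Punctuation is removed only next to whitespace, so the last token of a text keeps its trailing punctuation (`text.` above). Also, `str.islower()` returns `False` for a string with no cased characters, so a single-letter word such as `A` or a token such as `B52` gets a name-like flag of `0`.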


In [2]:
%%writefile reducer1.py

import sys

current_key = None
word_total = 0
name_total = 0

for line in sys.stdin:
    try:
        key, word_count, name_count = line.strip().split('\t', 2)
        word_count = int(word_count)
        name_count = int(name_count)
        
        if current_key != key:
            if current_key:
                print("{}\t{:d}\t{:d}".format(current_key, word_total, name_total))
                
            current_key = key
            word_total = 0
            name_total = 0
        
        word_total += word_count
        name_total += name_count
        
    except Exception as e:
        # Report on stderr so the error text does not end up in the job output.
        sys.stderr.write("{}\n".format(e))
        continue

if current_key:
    print("{}\t{:d}\t{:d}".format(current_key, word_total, name_total))

Overwriting reducer1.py
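
The reducer relies on Hadoop delivering its input sorted by key, so all lines for one word arrive consecutively. The same aggregation can be sketched with `itertools.groupby` and tested locally on a sorted list of lines. This is an alternative formulation for illustration, not the code the job runs, and `reduce_sorted` is a hypothetical name.

```python
from itertools import groupby

def reduce_sorted(lines):
    """Sum both counters for each run of identical keys in key-sorted input."""
    rows = (line.rstrip('\n').split('\t', 2) for line in lines)
    for key, group in groupby(rows, key=lambda row: row[0]):
        word_total = 0
        name_total = 0
        for _, word_count, name_count in group:
            word_total += int(word_count)
            name_total += int(name_count)
        yield key, word_total, name_total

sample = ["french\t1\t1\n", "french\t1\t0\n", "paris\t1\t1\n"]
for key, word_total, name_total in reduce_sorted(sample):
    print("{}\t{:d}\t{:d}".format(key, word_total, name_total))
```

Like the streaming reducer, this only works when equal keys are adjacent; on unsorted input the same word would be emitted more than once.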


## Step 2. Create the 2nd mapper and reducer: condition 3)

In [3]:
%%writefile mapper2.py

import sys

current_key = None
word_total = 0
name_total = 0

for line in sys.stdin:
    try:
        key, word_count, name_count = line.strip().split('\t', 2)
        word_count = int(word_count)
        name_count = int(name_count)
        
        if current_key != key:
            if current_key:
                print("{}\t{}\t{}".format(name_total, word_total, current_key))
                
            current_key = key
            word_total = 0
            name_total = 0
        
        word_total += word_count
        name_total += name_count
        
    except ValueError as e:
        # Report on stderr so the error text does not end up in the job output.
        sys.stderr.write("{}\n".format(e))
        continue

if current_key:
    print("{}\t{}\t{}".format(name_total, word_total, current_key))


Overwriting mapper2.py


In [4]:
%%writefile reducer2.py


import sys

total_count = 0
total_caps = 0
current_word = None

for line in sys.stdin:
    try:
        caps_count, count, key = line.strip().split('\t', 2)
        count = int(count)
        caps_count = int(caps_count)
        
        if key != current_word:
            
            if current_word and float(total_caps) / float(total_count) >= 0.995: 
                print("{}\t{:d}".format(current_word, total_caps))
            
            total_count = 0
            total_caps = 0
            current_word = key


        total_caps += caps_count
        total_count += count

    except Exception as e:
        # Report on stderr so the error text does not end up in the job output.
        sys.stderr.write("{}\n".format(e))
        continue
        
if current_word and float(total_caps) / float(total_count) >= 0.995: 
    print("{}\t{:d}".format(current_word, total_caps))

Overwriting reducer2.py
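
The reducer keeps a word when the share of name-like occurrences is at least `0.995`, which also accepts the boundary case where exactly `0.5%` of occurrences fail condition 2). If condition 3) is read strictly ("fewer than `0.5%`"), the check can be written with integers only, which avoids floating-point comparison at the boundary. The sketch below is one possible formulation under that reading; `is_name` is a hypothetical helper.

```python
def is_name(total_caps, total_count):
    """True when fewer than 0.5% of occurrences fail condition 2)."""
    other = total_count - total_caps
    # other / total_count < 0.005  <=>  other * 1000 < 5 * total_count
    return total_count > 0 and other * 1000 < 5 * total_count

print(is_name(200, 200))   # every occurrence is name-like
print(is_name(199, 200))   # exactly 0.5% fail condition 2)
print(is_name(996, 1000))  # 0.4% fail condition 2)
```

With the strict reading, 199 name-like occurrences out of 200 are rejected, whereas the `>= 0.995` comparison in `reducer2.py` accepts them.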


## Step 3. Bash commands

Hint: to print a specific row you can use basic UNIX commands such as sed, head, or tail (or any other command you know).

To run both jobs, you must use two consecutive yarn commands. Remember that the input of the second job is the output of the first job.
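
When ordering by count, the counts have to be compared as numbers rather than as strings, otherwise `9` sorts above `100`. The sketch below illustrates the difference on a few `name<TAB>count` lines and picks the n-th row of the ordered result; the sample lines and the name `nth_line_by_count` are made up for the example.

```python
def nth_line_by_count(lines, n):
    """Return the n-th line (1-based) of 'name<TAB>count' lines, largest count first."""
    ordered = sorted(lines, key=lambda line: int(line.split('\t')[1]), reverse=True)
    return ordered[n - 1]

lines = ["a\t9", "b\t10", "c\t100"]

# Numeric comparison: 100 > 10 > 9.
print(nth_line_by_count(lines, 1))

# String comparison puts "9" first because "9" > "1".
print(sorted(lines, key=lambda line: line.split('\t')[1], reverse=True)[0])
```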

In [5]:

%%bash

OUT_DIR_1="assignment3_1_"$(date +"%s%6N")
OUT_DIR_2="assignment3_2_"$(date +"%s%6N")
NUM_REDUCERS=4

# Code for your first job
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapper1.py,reducer1.py \
    -mapper 'python mapper1.py' \
    -reducer 'python reducer1.py' \
    -numReduceTasks ${NUM_REDUCERS} \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR_1} > /dev/null

# Code for your second job
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k1,3nr" \
    -files mapper2.py,reducer2.py \
    -mapper 'python mapper2.py' \
    -reducer 'python reducer2.py' \
    -numReduceTasks 1 \
    -input ${OUT_DIR_1} \
    -output ${OUT_DIR_2} > /dev/null

# Code for obtaining the results
hdfs dfs -cat ${OUT_DIR_2}/part-00000 | sed -n "5p;8q"

hdfs dfs -rm -r -skipTrash ${OUT_DIR_1}* > /dev/null
hdfs dfs -rm -r -skipTrash ${OUT_DIR_2}* > /dev/null

french	5740


19/05/20 01:53:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/05/20 01:53:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/05/20 01:53:08 INFO mapred.FileInputFormat: Total input files to process : 1
19/05/20 01:53:08 INFO mapreduce.JobSubmitter: number of splits:2
19/05/20 01:53:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1558315267356_0007
19/05/20 01:53:08 INFO impl.YarnClientImpl: Submitted application application_1558315267356_0007
19/05/20 01:53:08 INFO mapreduce.Job: The url to track the job: http://5472775a3ed4:8088/proxy/application_1558315267356_0007/
19/05/20 01:53:08 INFO mapreduce.Job: Running job: job_1558315267356_0007
19/05/20 01:53:14 INFO mapreduce.Job: Job job_1558315267356_0007 running in uber mode : false
19/05/20 01:53:14 INFO mapreduce.Job:  map 0% reduce 0%
19/05/20 01:53:30 INFO mapreduce.Job:  map 51% reduce 0%
19/05/20 01:53:36 INFO mapreduce.Job:  map 71% reduce 0%
19/05/20 01:53:40 INFO 