# Hadoop Streaming assignment 3: Name Count

Write a WordCount-style program that counts all the names in the dataset. A name is a word with the following properties:

1. The first character is not a digit (other characters can be digits).
2. The first character is uppercase, and all other characters that are letters are lowercase.
3. Fewer than 0.5% of the word's occurrences, counted case-insensitively across the dataset, fail condition (2).
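
As a quick illustration of condition (3), the sketch below counts the case variants of a single word and computes the share that fails conditions (1) and (2). The helper names `looks_like_name` and `non_name_share` and the sample counts are made up for illustration; they do not come from the dataset.

```python
def looks_like_name(word):
    """Conditions (1) and (2): uppercase first letter, lowercase letters after it."""
    return (len(word) > 1 and word[0].isalpha()
            and word[0].isupper() and word[1:].islower())


def non_name_share(variant_counts):
    """Percentage of occurrences (all case variants) that fail the name test."""
    total = sum(variant_counts.values())
    failing = sum(n for form, n in variant_counts.items()
                  if not looks_like_name(form))
    return 100.0 * failing / total


# Hypothetical counts for two words, keyed by the form seen in the text.
french = {"French": 996, "french": 3, "FRENCH": 1}
may = {"May": 600, "may": 400}

print(non_name_share(french))  # 4 of 1000 fail: below 0.5%, counts as a name
print(non_name_share(may))     # 400 of 1000 fail: not a name
```

Under these assumed counts, `French` would be kept and `May` rejected, since 40% of its occurrences are lowercase.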

Sort by count in descending order (most popular first). Output format:

*name < tab > count*

The result is the 5th line in the output.

In [1]:
%%writefile name_count_mapper.py

import sys
import re

# Python 2 only: make implicit str <-> unicode conversions use UTF-8.
reload(sys)
sys.setdefaultencoding('utf-8')

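def is_name(word):
    # Conditions (1) and (2): the first character is an uppercase letter and
    # the rest contains at least one letter, all of them lowercase.
    if len(word) < 2:
        return False
    return word[0].isalpha() and word[0].isupper() and word[1:].islower()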

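for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError:
        # Skip lines that do not contain an article id and a text field.
        continue
    # Split on whitespace, dropping punctuation adjacent to it.
    words = re.split(r'\W*\s+\W*', text, flags=re.UNICODE)
    for word in words:
        name_flag = int(is_name(word))
        # Emit: lowercased word, name flag (0 or 1), occurrence count (1).
        print "%s\t%d\t%d" % (word.lower(), name_flag, 1)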

Writing name_count_mapper.py
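
To see what the mapper's tokenizer does, the same regular expression can be tried on a single line locally. This is a Python 3 sketch for experimentation, not the code that runs on the cluster; the constant name `TOKEN_SEPARATOR` and the sample sentence are invented.

```python
import re

# Same pattern as in the mapper: a run of whitespace plus any adjacent punctuation.
TOKEN_SEPARATOR = re.compile(r'\W*\s+\W*', flags=re.UNICODE)

sample = "Paris, the capital of France. (Paris) is old."
tokens = TOKEN_SEPARATOR.split(sample)
print(tokens)
# ['Paris', 'the', 'capital', 'of', 'France', 'Paris', 'is', 'old.']
```

Punctuation is stripped only where it touches whitespace, so the last token keeps its period and is emitted under the separate key `old.`.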


In [2]:
%%writefile name_count_reducer.py

import sys

current_key = None
word_sum = 0
name_sum = 0

for line in sys.stdin:
    try:
        key, name_count, word_count = line.strip().split('\t', 2)
        name_count = int(name_count)
        word_count = int(word_count)
    except ValueError:
        # Skip malformed records (for example, records with an empty key).
        continue
    if current_key != key:
        # Key changed: emit the previous word if fewer than 0.5% of its
        # occurrences failed the name test, i.e. condition (3) holds.
        if current_key and float(word_sum - name_sum) / word_sum * 100 < 0.5:
            print '%s\t%d' % (current_key, word_sum)
        current_key = key
        name_sum = 0
        word_sum = 0
    name_sum += name_count
    word_sum += word_count

# Flush the last key.
if current_key and float(word_sum - name_sum) / word_sum * 100 < 0.5:
    print '%s\t%d' % (current_key, word_sum)

Writing name_count_reducer.py
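
The reducer relies on Hadoop delivering records sorted by key, so all lines for one word arrive consecutively. Below is a minimal local sketch of the same aggregation, written for Python 3 with `itertools.groupby`; the function name `reduce_names`, its `threshold` parameter, and the sample records are illustrative assumptions, not part of the assignment code.

```python
from itertools import groupby


def reduce_names(sorted_lines, threshold=0.5):
    """Yield (word, total) for words whose non-name share is below threshold percent."""
    records = (line.rstrip('\n').split('\t') for line in sorted_lines)
    for key, group in groupby(records, key=lambda rec: rec[0]):
        name_sum = word_sum = 0
        for _, name_flag, count in group:
            name_sum += int(name_flag)
            word_sum += int(count)
        if 100.0 * (word_sum - name_sum) / word_sum < threshold:
            yield key, word_sum


# Hypothetical mapper output after the shuffle: word<TAB>name_flag<TAB>1, sorted by key.
lines = (["french\t0\t1\n"] + ["french\t1\t1\n"] * 299
         + ["may\t0\t1\n", "may\t1\t1\n"])
print(list(reduce_names(lines)))  # [('french', 300)]
```

In this toy input, `french` fails the name test once in 300 occurrences (about 0.33%) and is kept, while `may` fails half the time and is dropped.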


In [3]:
%%writefile name_count_sorting_mapper.py

import sys

for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
        # Swap the fields so that the count becomes the sort key.
        print "%d\t%s" % (count, key)
    except ValueError:
        continue

Writing name_count_sorting_mapper.py


In [4]:
%%writefile name_count_sorting_reducer.py

import sys

for line in sys.stdin:
    try:
        count, key = line.strip().split('\t', 1)
        count = int(count)
        # Swap the fields back to the required name<tab>count format.
        print "%s\t%d" % (key, count)
    except ValueError:
        continue

Writing name_count_sorting_reducer.py
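
The second job sorts by count: the mapper moves the count into the key position, the comparator options `-k1,1nr` order the keys numerically in reverse, and the reducer swaps the fields back. Here is a local Python 3 sketch of that round trip, with `sorted` standing in for the shuffle; the function names are hypothetical and the sample counts other than `french 5753` are invented.

```python
def sorting_mapper(lines):
    """name<TAB>count -> count<TAB>name, so the count becomes the key."""
    for line in lines:
        name, count = line.strip().split('\t', 1)
        yield "%d\t%s" % (int(count), name)


def sorting_reducer(lines):
    """count<TAB>name -> name<TAB>count, restoring the required output format."""
    for line in lines:
        count, name = line.strip().split('\t', 1)
        yield "%s\t%d" % (name, int(count))


counts = ["alpha\t12", "french\t5753", "beta\t900"]  # hypothetical first-job output
# A numeric, descending sort on the first field plays the role of the shuffle.
shuffled = sorted(sorting_mapper(counts),
                  key=lambda rec: int(rec.split('\t')[0]),
                  reverse=True)
result = list(sorting_reducer(shuffled))
print(result)  # ['french\t5753', 'beta\t900', 'alpha\t12']
```

A plain text sort in descending order would place `900` ahead of `5753`, which is why the numeric flag in the comparator options matters.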


In [6]:
%%bash

NAME_COUNT_OUT_DIR="namecount_result"
NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash ${NAME_COUNT_OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming nameCount" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files name_count_mapper.py,name_count_reducer.py \
    -mapper "python name_count_mapper.py" \
    -reducer "python name_count_reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${NAME_COUNT_OUT_DIR} > /dev/null

NAME_COUNT_SORT_OUT_DIR="namecount_sort_result"
NUM_REDUCERS=1

hdfs dfs -rm -r -skipTrash ${NAME_COUNT_SORT_OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming nameCountSorting" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D map.output.key.field.separator=\t \
    -D mapreduce.partition.keycomparator.options=-k1,1nr \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files name_count_sorting_mapper.py,name_count_sorting_reducer.py \
    -mapper "python name_count_sorting_mapper.py" \
    -reducer "python name_count_sorting_reducer.py" \
    -input ${NAME_COUNT_OUT_DIR} \
    -output ${NAME_COUNT_SORT_OUT_DIR} > /dev/null    

hdfs dfs -cat ${NAME_COUNT_SORT_OUT_DIR}/part-00000 | head -5 | tail -1

french	5753


18/08/02 10:59:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/08/02 10:59:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/08/02 10:59:55 INFO mapred.FileInputFormat: Total input files to process : 1
18/08/02 10:59:55 INFO mapreduce.JobSubmitter: number of splits:2
18/08/02 10:59:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533143790063_0007
18/08/02 10:59:55 INFO impl.YarnClientImpl: Submitted application application_1533143790063_0007
18/08/02 10:59:55 INFO mapreduce.Job: The url to track the job: http://6b428e5a5e59:8088/proxy/application_1533143790063_0007/
18/08/02 10:59:55 INFO mapreduce.Job: Running job: job_1533143790063_0007
18/08/02 11:00:01 INFO mapreduce.Job: Job job_1533143790063_0007 running in uber mode : false
18/08/02 11:00:01 INFO mapreduce.Job:  map 0% reduce 0%
18/08/02 11:00:17 INFO mapreduce.Job:  map 42% reduce 0%
18/08/02 11:00:23 INFO mapreduce.Job:  map 55% reduce 0%
18/08/02 11:00:29 INFO 