# Hadoop Streaming

MapReduce jobs are usually written in Java. However, Hadoop has a feature with the somewhat misleading name [Hadoop Streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html), which lets you write mappers and reducers in Python or any other scripting language, such as `shell`.

# Writing Some Code

First, we need to code our mapper. With Hadoop Streaming, a mapper is a script that reads text from the standard input until EOF and writes text line by line to the standard output. For example, it can look like the following file:

In [2]:
!cat /home/borisshminke/Downloads/mapper.py

#!/usr/bin/python

counter = 0
while True:
    try:
        counter += 1
        input()
    except EOFError:
        break
print(counter)


Mind the first line (the so-called [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix))): it's very important to keep it in place, since Hadoop doesn't know where your favourite Python executable is located. It could just as well be `#!/usr/bin/perl` or `#!/usr/bin/bash`.

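This script does nothing interesting: it goes through its input line by line until EOF, counting as it goes, and then prints the counter. Note that the counter is incremented before `input()` is called, so the final, failed read at EOF is counted too, and the printed value is one more than the number of lines actually read.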

Of course, you can code anything more complicated: import additional packages, define functions, classes, and so forth.
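As an illustration, here is a minimal sketch of an alternative line counter. It is not the `mapper.py` shown above: it iterates over `sys.stdin` instead of calling `input()` in a loop, so only lines that were actually read are counted. The function names `count_lines` and `main` are hypothetical, and the sketch assumes a Python 3 interpreter.

```python
"""Line-counting mapper sketch for Hadoop Streaming (assumes Python 3)."""
import sys


def count_lines(stream):
    """Return the number of lines that can be read from a text stream."""
    # Iterating over a file-like object yields one line at a time and simply
    # stops at EOF, so nothing is counted for the failed read at the end.
    return sum(1 for _ in stream)


def main():
    # A streaming mapper reads from stdin and writes its result to stdout.
    print(count_lines(sys.stdin))

# When saved as a mapper script, the file would start with a shebang line
# and end with a call to main().
```

Keeping the counting logic in a function that accepts any stream makes it easy to check locally, for example with `io.StringIO`, before uploading anything to the cluster.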

The reducer looks similar, since it generally does the same trick: it goes through the lines of the standard input and prints something to the standard output. The main difference is that it has the output of the mapper as its input.

In [3]:
!cat /home/borisshminke/Downloads/reducer.py

#!/usr/bin/python

counter = 0
while True:
    try:
        line = input()
    except EOFError:
        break
    counter += int(line)
print(counter)


This reducer sums integer values (which are the line counts produced by the mapper).
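The same idea can be sketched with the summing logic in a function of its own. This is an illustrative variant, not the `reducer.py` above; `sum_counts` and `main` are hypothetical names. It assumes the reducer may receive lines with trailing whitespace (the outputs later in this notebook end with a tab) or empty lines, and it tolerates both.

```python
"""Summing reducer sketch for Hadoop Streaming (assumes Python 3)."""
import sys


def sum_counts(lines):
    """Sum the integers found one per line, skipping blank lines."""
    total = 0
    for line in lines:
        value = line.strip()  # drops the newline and any trailing tab
        if value:
            total += int(value)
    return total


def main():
    # A streaming reducer reads the mapper output from stdin.
    print(sum_counts(sys.stdin))

# When saved as a reducer script, the file would start with a shebang line
# and end with a call to main().
```

Because `sum_counts` accepts any iterable of strings, it can be tried on a plain list such as `["3\n", "4\t\n"]` without a cluster.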

# Pushing Your Code to the Cluster

Hadoop lives on a cluster, and your MapReduce jobs will run on the cluster too. Hadoop __can't__ execute any code from a local machine directly. That means we need to put our code on HDFS somehow.

In [2]:
!hdfs dfs -mkdir /user/qlr/code
!hdfs dfs -put \
    file:///home/user/Downloads/mapper.py \
    /user/qlr/code
!hdfs dfs -put \
    file:///home/user/Downloads/reducer.py \
    /user/qlr/code

In [3]:
!hdfs dfs -ls /user/qlr/code

Found 2 items
-rw-r--r--   2 root hadoop        138 2020-11-19 14:28 /user/qlr/code/mapper.py
-rw-r--r--   2 root hadoop        150 2020-11-19 14:28 /user/qlr/code/reducer.py


Remember to do that whenever you want to update your MapReduce jobs!

# Running MapReduce

We will run our Hadoop Streaming job with `hadoop jar` and the Hadoop Streaming jar (Hadoop 3 also offers the equivalent `mapred streaming` command). The parameters are as follows:
* `-files` - a comma-separated list of our source code files __on HDFS__. For Python these are plain Python scripts; for Java they would be `jar` files
* `-input` - a file on HDFS to feed to the mapper
* `-output` - a location on HDFS in which to put the results (the output of the reducer); it must not exist yet
* `-mapper` - the name of the mapper script
* `-reducer` - the name of the reducer script
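To make the role of each parameter concrete, the sketch below assembles the command line as a list of arguments. The helper `streaming_command` is hypothetical and only builds the list; it does not run anything. The jar path and HDFS paths are the ones used in this notebook.

```python
import shlex


def streaming_command(jar, files, input_path, output_path, mapper, reducer):
    """Build the argument list for a Hadoop Streaming job (nothing is executed)."""
    return [
        "hadoop", "jar", jar,
        # -files is a generic option, so it goes before the streaming options.
        "-files", ",".join(files),
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


command = streaming_command(
    jar="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar",
    files=[
        "hdfs:///user/qlr/code/mapper.py",
        "hdfs:///user/qlr/code/reducer.py",
    ],
    input_path="/user/qlr/data/yelp_academic_dataset_review.json",
    output_path="/user/qlr/data/result",
    mapper="mapper.py",
    reducer="reducer.py",
)

# Print a shell-quoted version that could be pasted into a terminal.
print(" ".join(shlex.quote(arg) for arg in command))
```

A list like this could be passed to `subprocess.run` on a machine where the `hadoop` executable is available.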

Let the magic happen!

In [9]:
!hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar \
    -files hdfs:///user/qlr/code/mapper.py,hdfs:///user/qlr/code/reducer.py \
    -input /user/qlr/data/yelp_academic_dataset_review.json \
    -output /user/qlr/data/result \
    -mapper mapper.py \
    -reducer reducer.py

packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar] /tmp/streamjob5407524167963061102.jar tmpDir=null
20/11/19 14:36:20 INFO client.RMProxy: Connecting to ResourceManager at cluster-big-data-uca-m/10.132.0.2:8032
20/11/19 14:36:20 INFO client.AHSProxy: Connecting to Application History server at cluster-big-data-uca-m/10.132.0.2:10200
20/11/19 14:36:20 INFO client.RMProxy: Connecting to ResourceManager at cluster-big-data-uca-m/10.132.0.2:8032
20/11/19 14:36:20 INFO client.AHSProxy: Connecting to Application History server at cluster-big-data-uca-m/10.132.0.2:10200
20/11/19 14:36:21 INFO mapred.FileInputFormat: Total input files to process : 1
20/11/19 14:36:21 INFO mapreduce.JobSubmitter: number of splits:48
20/11/19 14:36:21 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/19 14:36:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_160579198626

Let's see the results:

In [10]:
!hdfs dfs -ls /user/qlr/data/result

Found 8 items
-rw-r--r--   2 root hadoop          0 2020-11-19 14:38 /user/qlr/data/result/_SUCCESS
-rw-r--r--   2 root hadoop          8 2020-11-19 14:38 /user/qlr/data/result/part-00000
-rw-r--r--   2 root hadoop          9 2020-11-19 14:38 /user/qlr/data/result/part-00001
-rw-r--r--   2 root hadoop          9 2020-11-19 14:38 /user/qlr/data/result/part-00002
-rw-r--r--   2 root hadoop          9 2020-11-19 14:38 /user/qlr/data/result/part-00003
-rw-r--r--   2 root hadoop          9 2020-11-19 14:38 /user/qlr/data/result/part-00004
-rw-r--r--   2 root hadoop          8 2020-11-19 14:38 /user/qlr/data/result/part-00005
-rw-r--r--   2 root hadoop          9 2020-11-19 14:38 /user/qlr/data/result/part-00006


Mind the file `_SUCCESS`: it appears only when a MapReduce job has finished successfully. The reducers write their output to files, one per reducer, and here they are: `part-0000*`

In [11]:
!hdfs dfs -cat /user/qlr/data/result/part-00002

1019785	


In [12]:
!hdfs dfs -cat /user/qlr/data/result/part-0000*

682511	
1022698	
1019785	
1517790	
1220647	
838812	
1718927	


In [15]:
!wc -l /home/user/Downloads/yelp_academic_dataset_review.json

8021122 /home/user/Downloads/yelp_academic_dataset_review.json


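These are partial line counts: each of the seven numbers comes from one reducer, and together they should add up to the number of lines in our input file. They actually add up to 8021170, which is 48 more than the 8021122 lines reported by `wc -l`. The job log shows 48 input splits, and the mapper counts one extra time at EOF, so each of the 48 mapper runs over-counts by one.

We got several partial sums rather than a single total because the job ran in parallel. You may wonder why it was slow then. One likely reason is that Python is an interpreted language, so the job would probably run faster if we used C++ or Java for our mappers and reducers.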
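The effect of running in parallel can be reproduced locally with a rough simulation. Everything in this sketch is an assumption made for illustration: the input is cut into equal chunks as a stand-in for HDFS input splits, mapper outputs are routed to reducers with `zlib.crc32` as a stand-in for Hadoop's partitioner, and `buggy_mapper` mimics the order of operations in `mapper.py` (count first, then try to read) on an in-memory stream.

```python
import io
import zlib


def buggy_mapper(stream):
    """Mimic mapper.py: increment the counter, then try to read a line."""
    counter = 0
    while True:
        counter += 1
        if stream.readline() == "":  # an empty string signals EOF
            break
    return counter


def simulate(lines, n_splits, n_reducers):
    """Run one mapper per split and sum mapper outputs per reducer."""
    chunk = -(-len(lines) // n_splits)  # ceiling division
    splits = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
    mapper_outputs = [buggy_mapper(io.StringIO("".join(s))) for s in splits]
    parts = [0] * n_reducers
    for value in mapper_outputs:
        # Stand-in partitioner: hash the output text to pick a reducer.
        index = zlib.crc32(str(value).encode()) % n_reducers
        parts[index] += value
    return splits, parts


lines = ["review {}\n".format(i) for i in range(1000)]
splits, parts = simulate(lines, n_splits=48, n_reducers=7)
print(parts)                    # one partial sum per simulated reducer
print(sum(parts) - len(lines))  # over-count: one per split

# The same arithmetic on the numbers recorded in this notebook.
recorded = [682511, 1022698, 1019785, 1517790, 1220647, 838812, 1718927]
print(sum(recorded) - 8021122)  # 48, the number of splits in the job log
```

In the simulation, as in the real run, the partial sums add up to the number of lines plus the number of splits.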

<hr>

Removing an erroneous folder:

In [19]:
!hdfs dfs -ls /user/qlr/data

Found 2 items
drwxr-xr-x   - root hadoop          0 2020-11-19 14:38 /user/qlr/data/result
-rw-r--r--   2 root hadoop 6325565224 2020-11-19 13:54 /user/qlr/data/yelp_academic_dataset_review.json


In [20]:
!hdfs dfs -rm -R /user/qlr/test_folder

Deleted /user/qlr/test_folder


In [21]:
!hdfs dfs -ls /user/qlr

Found 2 items
drwxr-xr-x   - root hadoop          0 2020-11-19 14:28 /user/qlr/code
drwxr-xr-x   - root hadoop          0 2020-11-19 14:36 /user/qlr/data


# Do It Yourself
* code a mapper for counting characters, not lines

In [28]:
!hdfs dfs -cat /user/qlr/data/yelp_academic_dataset_review.json | head -n 1

{"review_id":"xQY8N_XvtGbearJ5X4QryQ","user_id":"OwjRMXRC0KyPrIlcjaXeFQ","business_id":"-MhfebM0QIsKt87iDN-FNw","stars":2.0,"useful":5,"funny":0,"cool":0,"text":"As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go!\n\nTucked away near the gelateria and the garden, the Gallery is pretty much hidden from view. It's what real estate agents would call \"cozy\" or \"charming\" - basically any euphemism for small.\n\nThat being said, you can still see wonderful art at a gallery of any size, so why the two *s you ask? Let me tell you:\n\n* pricing for this, while relatively inexpensive for a Las Vegas attraction, is completely over the top. For the space and the amount of art you can fit in there, it is a bit much.\n* it's not kid friendly at all. Seriously, don't bring them.\n* the security is n

In [31]:
!hdfs dfs -put \
    file:///home/user/Downloads/mapper-char.py \
    /user/qlr/code

In [32]:
!hdfs dfs -ls /user/qlr/code

Found 3 items
-rw-r--r--   2 root hadoop        151 2020-11-19 14:52 /user/qlr/code/mapper-char.py
-rw-r--r--   2 root hadoop        138 2020-11-19 14:28 /user/qlr/code/mapper.py
-rw-r--r--   2 root hadoop        150 2020-11-19 14:28 /user/qlr/code/reducer.py


In [34]:
!hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar \
    -files hdfs:///user/qlr/code/mapper-char.py,hdfs:///user/qlr/code/reducer.py \
    -input /user/qlr/data/yelp_academic_dataset_review.json \
    -output /user/qlr/data/result_char_count \
    -mapper mapper-char.py \
    -reducer reducer.py

packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar] /tmp/streamjob2451061377900387608.jar tmpDir=null
20/11/19 14:53:12 INFO client.RMProxy: Connecting to ResourceManager at cluster-big-data-uca-m/10.132.0.2:8032
20/11/19 14:53:12 INFO client.AHSProxy: Connecting to Application History server at cluster-big-data-uca-m/10.132.0.2:10200
20/11/19 14:53:12 INFO client.RMProxy: Connecting to ResourceManager at cluster-big-data-uca-m/10.132.0.2:8032
20/11/19 14:53:12 INFO client.AHSProxy: Connecting to Application History server at cluster-big-data-uca-m/10.132.0.2:10200
20/11/19 14:53:12 INFO mapred.FileInputFormat: Total input files to process : 1
20/11/19 14:53:12 INFO mapreduce.JobSubmitter: number of splits:48
20/11/19 14:53:13 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/19 14:53:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_160579198626

In [35]:
!hdfs dfs -cat /user/qlr/data/result_char_count/part-0000*

10797192	
9204228	
7549263	
9178011	
15470253	
13660029	
6331122	
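The `mapper-char.py` script used above is not shown. One possible version is sketched below, under the assumption that "characters" means decoded text characters with newlines included; `count_characters` and `main` are hypothetical names. Its totals would not necessarily match the figures above, since the counting rule of the original script is unknown.

```python
"""Character-counting mapper sketch (assumes Python 3, newlines included)."""
import sys


def count_characters(stream):
    """Return the total number of characters in a text stream."""
    # len(line) counts decoded characters, not bytes, and includes the
    # trailing newline of each line.
    return sum(len(line) for line in stream)


def main():
    print(count_characters(sys.stdin))

# When saved as a mapper script, the file would start with a shebang line
# and end with a call to main().
```

Since each mapper still prints a single integer, the summing reducer from the line-counting job can be reused unchanged.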


* code a reducer to count lines and characters in one job

In [55]:
!hdfs dfs -rm -R /user/qlr/data/result_char_lines_count
!hdfs dfs -rm -R /user/qlr/code/mapper-char-lines.py
!hdfs dfs -rm -R /user/qlr/code/reducer-char-lines.py

Deleted /user/qlr/data/result_char_lines_count
Deleted /user/qlr/code/mapper-char-lines.py
Deleted /user/qlr/code/reducer-char-lines.py


In [56]:
!hdfs dfs -put \
    file:///home/user/Downloads/mapper-char-lines.py \
    /user/qlr/code
!hdfs dfs -put \
    file:///home/user/Downloads/reducer-char-lines.py \
    /user/qlr/code

In [57]:
!hdfs dfs -ls /user/qlr/code

Found 5 items
-rw-r--r--   2 root hadoop        210 2020-11-19 15:20 /user/qlr/code/mapper-char-lines.py
-rw-r--r--   2 root hadoop        151 2020-11-19 14:52 /user/qlr/code/mapper-char.py
-rw-r--r--   2 root hadoop        138 2020-11-19 14:28 /user/qlr/code/mapper.py
-rw-r--r--   2 root hadoop        314 2020-11-19 15:20 /user/qlr/code/reducer-char-lines.py
-rw-r--r--   2 root hadoop        150 2020-11-19 14:28 /user/qlr/code/reducer.py


In [58]:
!hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar \
    -files hdfs:///user/qlr/code/mapper-char-lines.py,hdfs:///user/qlr/code/reducer-char-lines.py \
    -input /user/qlr/data/yelp_academic_dataset_review.json \
    -output /user/qlr/data/result_char_lines_count \
    -mapper mapper-char-lines.py \
    -reducer reducer-char-lines.py

packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar] /tmp/streamjob8127088355966363102.jar tmpDir=null
20/11/19 15:20:55 INFO client.RMProxy: Connecting to ResourceManager at cluster-big-data-uca-m/10.132.0.2:8032
20/11/19 15:20:55 INFO client.AHSProxy: Connecting to Application History server at cluster-big-data-uca-m/10.132.0.2:10200
20/11/19 15:20:55 INFO client.RMProxy: Connecting to ResourceManager at cluster-big-data-uca-m/10.132.0.2:8032
20/11/19 15:20:55 INFO client.AHSProxy: Connecting to Application History server at cluster-big-data-uca-m/10.132.0.2:10200
20/11/19 15:20:55 INFO mapred.FileInputFormat: Total input files to process : 1
20/11/19 15:20:55 INFO mapreduce.JobSubmitter: number of splits:48
20/11/19 15:20:55 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/19 15:20:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_160579198626

In [59]:
!hdfs dfs -cat /user/qlr/data/result_char_lines_count/part-0000*

number of lines: 1528677; number of characters: 13758012	
number of lines: 1368722; number of characters: 12318426	
number of lines: 1195213; number of characters: 10756854	
number of lines: 1201206; number of characters: 10810791	
number of lines: 1011019; number of characters: 9099117	
number of lines: 1016714; number of characters: 9150372	
number of lines: 699619; number of characters: 6296526	
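The `mapper-char-lines.py` and `reducer-char-lines.py` scripts are not shown either. A possible pair is sketched below under stated assumptions: the mapper emits one record per split in a made-up `lines<TAB>characters` format, and the reducer adds up both columns and prints them in the same wording as the output above. All function names are hypothetical.

```python
"""Sketch of a mapper/reducer pair counting lines and characters together."""
import sys


def map_counts(stream):
    """Mapper side: return (lines, characters) for one input split."""
    n_lines = n_chars = 0
    for line in stream:
        n_lines += 1
        n_chars += len(line)  # newline included
    return n_lines, n_chars


def reduce_counts(records):
    """Reducer side: add up 'lines<TAB>characters' records."""
    total_lines = total_chars = 0
    for record in records:
        fields = record.split()
        if len(fields) != 2:
            continue  # skip blank or malformed records
        total_lines += int(fields[0])
        total_chars += int(fields[1])
    return total_lines, total_chars


def mapper_main():
    print("{}\t{}".format(*map_counts(sys.stdin)))


def reducer_main():
    print("number of lines: {}; number of characters: {}".format(
        *reduce_counts(sys.stdin)))

# The mapper script would end with mapper_main(), the reducer script with
# reducer_main(); both would start with a shebang line.
```

Emitting both counts in one record keeps them together when records are distributed among reducers.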


* code a mapper to sum over the `compliment_count` field value
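A starting point for this exercise could look like the sketch below. It assumes every input line is one complete JSON object, as in the sample record shown earlier, and that a record without the `compliment_count` field contributes zero. The names `sum_field` and `main` are hypothetical.

```python
"""Mapper sketch summing a numeric JSON field (assumes one object per line)."""
import json
import sys


def sum_field(stream, field="compliment_count"):
    """Return the sum of `field` over all JSON records in the stream."""
    total = 0
    for line in stream:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        record = json.loads(line)
        # Assumption: a missing field counts as zero.
        total += record.get(field, 0)
    return total


def main():
    print(sum_field(sys.stdin))

# When saved as a mapper script, the file would start with a shebang line
# and end with a call to main().
```

Each mapper prints one partial sum, so the summing reducer from the first job could be reused to combine them.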