# Hadoop Streaming

Usually MapReduce jobs are written in Java. Nevertheless, Hadoop has a feature called somewhat misleadingly [Hadoop Streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html) which enables one to use Python or any other script language such as `shell` for developing mappers and reducers.

# Writing Some Code

First, we need to code our mapper. In case of Hadoop Streaming, a mapper is a script which gets some text from the standard input until the EOF and produces some text line by line to the standard output. For example, it can be like the following file:

In [2]:
!cat /home/borisshminke/Downloads/mapper.py

#!/usr/bin/python

counter = 0
while True:
    try:
        counter += 1
        input()
    except EOFError:
        break
print(counter)


Mind the first line (so-called [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix))) - it's very important to keep it in place, since Hadoop doesn't know where your favourite Python executable is located. It could even be `#!/usr/bin/perl` or `#!/usr/bin/bash` as well.

This script does nothing interesting, it simply goes through the file line by line until EOF and counts lines. Then it prints the total number of lines in a file.

Of course, you can code anything more complicated: import additional packages, define functions, classes, and so forth.

The reducer looks similar since generally it does the same trick: goes through the lines of the standard input and prints something to the standard output. The main difference is that it has the output of the mapper as it's input.

In [3]:
!cat /home/borisshminke/Downloads/reducer.py

#!/usr/bin/python

counter = 0
while True:
    try:
        line = input()
    except EOFError:
        break
    counter += int(line)
print(counter)


This reducer sums integer values (which are the line counts produced by the mapper).

# Pushing your code to the cluster

Hadoop lives on a cluster, and your MapReduce jobs will run on the cluster too. Hadoop __can't__ execute any code from a local machine directly. That means we need to put out code to HDFS somehow.

In [4]:
!hdfs dfs -mkdir /user/borisshminke/code
!hdfs dfs -put \
    file:///home/borisshminke/Downloads/mapper.py \
    /user/borisshminke/code
!hdfs dfs -put \
    file:///home/borisshminke/Downloads/reducer.py \
    /user/borisshminke/code

In [5]:
!hdfs dfs -ls /user/borisshminke/code

Found 2 items
-rw-r--r--   1 root hadoop        139 2020-11-17 09:26 /user/borisshminke/code/mapper.py
-rw-r--r--   1 root hadoop        150 2020-11-17 09:26 /user/borisshminke/code/reducer.py


Remember to do that whenever your want to update your MapReduce jobs!

# Running MapReduce

We will use the `mapred streaming` command for running our Hadoop Streaming job. The description of parameters follows:
* files - here we put a comma-separated list of our source code files __on HDFS__. In case of Python they are simple Python scripts but in case of Java they would be `jar` files
* input - a file on HDFS to input to the mapper
* output - some location to HDFS where to put the results (the output of the reducer)
* mapper - the name of the mapper script
* reducer - the name of the reducer script

Let magic happen!

In [6]:
!hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar \
    -files hdfs:///user/borisshminke/code/mapper.py,hdfs:///user/borisshminke/code/reducer.py \
    -input /user/borisshminke/data/yelp_academic_dataset_review.json \
    -output /user/borisshminke/data/result \
    -mapper mapper.py \
    -reducer reducer.py

packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar] /tmp/streamjob3531300907387079280.jar tmpDir=null
20/11/17 09:26:44 INFO client.RMProxy: Connecting to ResourceManager at cluster-df2a-m/10.164.0.3:8032
20/11/17 09:26:44 INFO client.AHSProxy: Connecting to Application History server at cluster-df2a-m/10.164.0.3:10200
20/11/17 09:26:44 INFO client.RMProxy: Connecting to ResourceManager at cluster-df2a-m/10.164.0.3:8032
20/11/17 09:26:44 INFO client.AHSProxy: Connecting to Application History server at cluster-df2a-m/10.164.0.3:10200
20/11/17 09:26:45 INFO mapred.FileInputFormat: Total input files to process : 1
20/11/17 09:26:45 INFO mapreduce.JobSubmitter: number of splits:48
20/11/17 09:26:45 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/17 09:26:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605604180264_0001
20/11/17 09:26:46 INFO im

Let's see the results:

In [7]:
!hdfs dfs -ls /user/borisshminke/data/result

Found 4 items
-rw-r--r--   1 root hadoop          0 2020-11-17 09:31 /user/borisshminke/data/result/_SUCCESS
-rw-r--r--   1 root hadoop          9 2020-11-17 09:31 /user/borisshminke/data/result/part-00000
-rw-r--r--   1 root hadoop          9 2020-11-17 09:31 /user/borisshminke/data/result/part-00001
-rw-r--r--   1 root hadoop          9 2020-11-17 09:31 /user/borisshminke/data/result/part-00002


Mind the file \_SUCCESS. It appears only when a MapReduce job finished successfully. Since the reducer outputs files, here they are - `part-0000*`

In [8]:
!hdfs dfs -cat /user/borisshminke/data/result/part-0000*

3247036	
2044848	
2729286	


These are the numbers of lines in our input file.

That happened because our mapper was run in parallel. You can wonder why it was slow then. The answer is quite simple - Python is an interpreted language, so it would be faster if we were using C++ or Java for our mappers and reducers.

# Do It Yourself
* code a mapper for counting characters, not lines
* code a reducer to count lines and characters in one job
* code a mapper to sum over the `compliment_count` field value