MapReduce WordCount
In MapReduce we will write three parts of the program: the Mapper, which implements the Map function; the Reducer, which implements the Reduce function; and the main class, which configures the complete job.
The input and output of both the Map and the Reduce functions are key-value tuples. Hadoop defines several special classes to store the values in these tuples. In this first lab we will use the IntWritable object, which stores an int, and the Text object, which stores a String. The provided source code shows how to initialise and use these objects.
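For illustration, here is a minimal sketch of how these wrapper objects are created and read (the class name WritableDemo is just for this example):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class WritableDemo {
        public static void main(String[] args) {
            // Wrap plain Java values in Hadoop's serialisable wrapper types
            IntWritable count = new IntWritable(1); // stores an int
            Text word = new Text("sherlock");       // stores a String

            // Unwrap them back into plain Java values
            int n = count.get();
            String s = word.toString();

            // Writables are mutable, so a single instance can be reused
            count.set(n + 1);
            System.out.println(s + " -> " + count.get());
        }
    }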
Basic flow: first we run the Mapper. This function splits the full text into individual words and generates one tuple per word (with the key being the word as a String, and the value being 1). The Reducer receives as input the aggregated results of the Mapper execution: for each unique key, it gets the list of values that the Mappers generated for that key. With that information it can count the number of occurrences of the word and produce, as the result for that word, another tuple, with the key being a Text containing the word and the value being an IntWritable with the total count.
IntWritable and Text are the Hadoop equivalents of the int and String Java types. They follow their own class hierarchy so that they can be moved over the network during the shuffle and partition steps. Note that the code for splitting the text into separate words is slightly different, but the emit command is equivalent to write in real MapReduce code. Look at the signature of TokenizerMapper and try to understand how it relates to the pairs <k1,v1> and <k2,v2>.
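As a reference, here is a minimal sketch of the two classes, closely following the standard Hadoop WordCount example (the provided source code may differ in details, and in your project the classes may be nested inside WordCount rather than top-level):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // <k1,v1> = <LongWritable offset in the file, Text line of input>
    // <k2,v2> = <Text word, IntWritable 1>
    class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit <word, 1>
            }
        }
    }

    // <k2,list(v2)> = <Text word, [1, 1, ...]>
    // <k3,v3> = <Text word, IntWritable total>
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit <word, total count>
        }
    }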
Create an input folder
$ mkdir input
WordCount.java is a MapReduce program that orchestrates the two classes we have just defined.
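A minimal sketch of this driver, again following the standard WordCount example (assuming the Mapper and Reducer class names used above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);

            // Wire up the two classes defined above
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);

            // Types of the output <key, value> pairs
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // args[0] = input folder, args[1] = output folder
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }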
In order to create a package ready to be executed by Hadoop, go to the project root folder and invoke the ant command:
$ ant clean dist
If the code compiles correctly, a compressed jar file named HadoopTest.jar will appear in a newly created dist/ folder:
$ ls dist
Execute the job from the base folder of your project by running the following line:
$ hadoop-local jar dist/HadoopTest.jar WordCount input out
*[ ] How many times does the word Sherlock appear in the file?
The part-r-00000 file holds the answer.
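Rather than scanning the file by eye, you can search it directly (assuming the output folder is named out, as in the command above):

$ grep -w Sherlock out/part-r-00000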
The Hadoop job outputs a set of statistics after completion. From these you can see the number of key-value pairs generated by the Mapper, the number of unique keys received by the Reducer, and the final output of the Reducer. Have a look at these values and answer the following two questions:
*[ ] How many words does the document have in total?
*[ ] How many unique words does the document have?
Once you have completed the base WordCount exercise, try the following extension:
*[ ] Change the MapReduce job so that only the words that appear 3 or more times are stored in the output file.
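One possible approach (a sketch; only the reduce function needs to change) is to apply the threshold just before writing the result:

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        if (sum >= 3) {          // only emit words that reach the threshold
            result.set(sum);     // 'result' is the Reducer's IntWritable field
            context.write(key, result);
        }
    }

Note that if your driver registers the Reducer as a Combiner, remove that setting for this variant: filtering partial, per-Mapper sums would change the final counts.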
These questions are substantially more challenging than the previous ones, so only try them if you are up to date with the module labs and assignments.
*[ ] Change the Hadoop job so that instead of counting words it counts bigrams (pairs of consecutive words). You can use the format "word1,word2", encoded in a Text (String) object, for the keys sent between the Mapper and the Reducer (see the first sketch at the end of this section).
*[ ] Change the Hadoop job so that it computes the total counts of word lengths for the dataset, that is, the number of words of length 1, 2, 3... appearing in the text. Take into account that in this case you should change the intermediate key-value types to [IntWritable, IntWritable] instead of [Text, IntWritable]; you will need to modify all three Java files so that the data types are updated consistently throughout the job (see the second sketch at the end of this section).
*[ ] What is the most common word length? You can discover that by inspecting the output file manually, or by running a Unix command over the results of the job:
$ sort -n -k2 part-r-00000 | tail
However, that option would not be feasible with a large dataset. Try to define a second MapReduce job for computing this, using the output folder of the first job as the input of the second.
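For the bigram variant, only the map function needs to change. A minimal sketch, assuming each line is split on whitespace (this simple version ignores bigrams that span line boundaries):

    // Emits <"word1,word2", 1> for each pair of consecutive words in a line
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().trim().split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i++) {
            word.set(tokens[i] + "," + tokens[i + 1]);
            context.write(word, one); // 'word' and 'one' as in TokenizerMapper
        }
    }

For the word-length variant, the intermediate key type changes, so the class declarations and the driver configuration must change with it: the Reducer becomes a Reducer<IntWritable, IntWritable, IntWritable, IntWritable>, and the driver must call job.setOutputKeyClass(IntWritable.class). A sketch of the new Mapper (LengthMapper is a hypothetical name):

    // <k2,v2> = <IntWritable word length, IntWritable 1>
    class LengthMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final IntWritable length = new IntWritable();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                length.set(itr.nextToken().length());
                context.write(length, one); // emit <length, 1>
            }
        }
    }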