Apache Hadoop setup and running

This readme outlines how we set up Hadoop for our testing environment.

The test we used for Hadoop is the WordCount sample program, provided by Apache, which counts the number of occurrences of each word in a given dataset, and can be found here.

Machine setup

We followed DigitalOcean's install guide, found here.

Install Hadoop dependencies

Java (at time of initial setup, our version was "1.8.0_151")
```
sudo apt-get install default-jdk
```
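
To compare the installed Java version with the one noted above, a small helper along these lines may be useful. It is only a sketch: `parse_java_version` is a hypothetical name, and it assumes the first line of `java -version` output contains the version in double quotes.

```shell
# Extract the quoted version string from one line of `java -version` output.
# Example input: openjdk version "1.8.0_151"
parse_java_version() {
  printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p'
}

# Usage (java -version writes to stderr, hence 2>&1):
#   parse_java_version "$(java -version 2>&1 | head -n 1)"
```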

Download Hadoop binaries (we tested on version 2.8.2)

wget http://<hadoop download link>/hadoop-2.8.2.tar.gz
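
Before extracting, it can be worth checking the download against a published digest. The sketch below assumes you have a SHA-256 digest for the tarball from the download site; `verify_sha256` is a hypothetical helper, not part of Hadoop.

```shell
# Succeed only if the SHA-256 digest of file $1 equals the expected hex digest $2.
verify_sha256() {
  actual="$(sha256sum "$1" | cut -d ' ' -f 1)"
  [ "$actual" = "$2" ]
}

# Usage (replace <published digest> with the value from the download site):
#   verify_sha256 hadoop-2.8.2.tar.gz <published digest> && echo "checksum OK"
```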

Extract and move to programs folder

tar -xzvf hadoop-2.8.2.tar.gz
sudo mv hadoop-2.8.2 /usr/local/hadoop

Set Hadoop environment variables

Open hadoop-env.sh:

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Change...

export JAVA_HOME=${JAVA_HOME}

...to...

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
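
The path above matched our machine and may differ on yours. One way to find a candidate value is to resolve the `java` symlink and strip the trailing `bin/java`. The helper name `java_home_from_path` is hypothetical, and the sketch assumes `java` is reachable at /usr/bin/java.

```shell
# Derive a JAVA_HOME candidate by removing the trailing "bin/java" from a resolved path.
java_home_from_path() {
  printf '%s\n' "$1" | sed 's:bin/java$::'
}

# Usage:
#   java_home_from_path "$(readlink -f /usr/bin/java)"
```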

Place the hdfs-site.xml file into the /usr/local/hadoop/etc/hadoop folder, replacing the default one if it already exists.
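
If you would rather keep the default file around, the copy can be done with a backup. This is a minimal sketch: `install_config` is a hypothetical helper, and writing into /usr/local/hadoop will likely require root privileges.

```shell
# Copy config file $1 into directory $2, keeping a .bak copy of any file it replaces.
install_config() {
  src="$1"
  dest="$2/$(basename "$src")"
  if [ -f "$dest" ]; then
    cp "$dest" "$dest.bak"
  fi
  cp "$src" "$dest"
}

# Usage (run from a root shell or a script started with sudo):
#   install_config hdfs-site.xml /usr/local/hadoop/etc/hadoop
```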

Verify install

/usr/local/hadoop/bin/hadoop

Output should be

Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
	CLASSNAME            run the class named CLASSNAME
	or
	where COMMAND is one of:

	...
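
The same check can be scripted, for example when setting up several machines. The sketch below only looks at the first line of the output; `looks_like_hadoop_usage` is a hypothetical helper name.

```shell
# Succeed if the first line of the given text starts with "Usage: hadoop".
looks_like_hadoop_usage() {
  printf '%s\n' "$1" | head -n 1 | grep -q '^Usage: hadoop'
}

# Usage:
#   looks_like_hadoop_usage "$(/usr/local/hadoop/bin/hadoop 2>&1)" && echo "install looks OK"
```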

Running the WordCount program

Place the shakespeare.txt file into a directory called ~/books (in the home folder). This is the directory that will be used by MapReduce to run the WordCount program.
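
Staging the input could look roughly like this. `stage_input` is a hypothetical helper; it defaults to ~/books as described above and creates the directory if it is missing.

```shell
# Copy input file $1 into directory $2 (default: ~/books), creating the directory if needed.
stage_input() {
  src="$1"
  dest_dir="${2:-$HOME/books}"
  mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}

# Usage:
#   stage_input shakespeare.txt
```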

Compile the WordCount program using Hadoop: run hadoop com.sun.tools.javac.Main WordCount.java, then jar cf WordCount.jar WordCount*.class. Alternatively, place the precompiled WordCount.jar file into the home directory.
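
The compile step can be wrapped in a small function, sketched below. Apache's MapReduce tutorial sets HADOOP_CLASSPATH to the JDK's tools.jar before compiling this way; the tools.jar location used here is an assumption based on the JDK path above, and `compile_wordcount`, `HADOOP_BIN`, and `TOOLS_JAR` are hypothetical names.

```shell
# Compile WordCount.java with Hadoop's classpath and package the classes into a jar.
compile_wordcount() {
  hadoop_bin="${HADOOP_BIN:-/usr/local/hadoop/bin/hadoop}"
  # Assumption: tools.jar sits in the JDK's lib directory, one level above jre/.
  export HADOOP_CLASSPATH="${TOOLS_JAR:-/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar}"
  "$hadoop_bin" com.sun.tools.javac.Main WordCount.java &&
    jar cf WordCount.jar WordCount*.class
}

# Usage (from the directory containing WordCount.java):
#   compile_wordcount
```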

From there, place the run.sh and onRestart_runThis.sh files into the /usr/local/hadoop directory and run sh onRestart_runThis.sh.

After that, you should be able to just run sh run.sh to re-run the job without having to set everything up again.

If you want to change the files that Hadoop examines, or if you reboot the machine, you'll need to re-run sh onRestart_runThis.sh in order to set everything up again.
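
Once a job has finished, the WordCount output consists of lines of the form word, tab, count. A quick way to see the most frequent words is sketched below. `top_words` is a hypothetical helper, and it assumes you have a local copy of the output file; where that file ends up depends on how run.sh is written.

```shell
# Print the $2 (default 10) most frequent words from WordCount output file $1,
# where each line is "word<TAB>count".
top_words() {
  sort -t "$(printf '\t')" -k2,2nr "$1" | head -n "${2:-10}"
}

# Usage (the file name is a placeholder for your local copy of the output):
#   top_words wordcount-output.txt 20
```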
