Skip to content


Subversion checkout URL

You can clone with
Download ZIP
A very simple example of using Hadoop's MapReduce functionality in Java.
Branch: master

Merge pull request #1 from umermansoor/develop

made changes to the README to clarify Map-Reduce example only
latest commit 2f1b2a1f4d
@umermansoor authored

Hadoop Map-Reduce Example in Java

Get up and running in less than 5 minutes


This program demonstrates Hadoop's Map-Reduce concept in Java using a very simple example. The input is raw data files listing earthquakes by region, magnitude and other information.

nc,71920701,1,”Saturday, January 12, 2013 19:43:18 UTC”,38.7865,-122.7630,1.5,1.10,27,“Northern California”

The fields in bold are magnitude of the quake and name of region where the reading was taken, respectively. The goal is to process all input files to find the maximum magnitude quake reading for every region listed. The output is in the form:

    "region_name"      <maximum magnitude of earthquake recorded> 

The raw data files are in the input/ folder. You will notice that there is a compressed file named input.tar.gz. This is to demonstrate the concept that Hadoop MapReduce can automatically uncompress the archive to process the files.

Instructions for Setting Up Hadoop

  1. Download Hadoop 1.1.1 binary. Mirror

  2. Extract it to a folder on your computer:

    $ tar xvfz hadoop-1.1.1.tar.gz
  3. Setup JAVA_HOME environment variable to point to the directory where Java is installed. For my Mac OS X, I did the following:

    $ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home

    Note: If you are running Lion, you may want to update the JAVA_HOME to point to java_home command which outputs Java's home directory, that is,

      $ export JAVA_HOME=$(/usr/libexec/java_home)
  4. Setup HADOOP_INSTALL environment variable to point the directory where you extracted hadoop binary in step 2:

    $ export HADOOP_INSTALL=/Users/umermansoor/Documents/hadoop-1.1.1
  5. Edit the PATH environment variable:

    $ export PATH=$PATH:$HADOOP_INSTALL/bin

Or you can add these variables to your standard shell script. For example, checkout my Mac OSX's ~/.bash_profile

Instructions for Running the Sample

  1. Clone the project:

    $ git clone
  2. Change to the project directory:

    $ cd hadoop-java-example
  3. Build the project:

    $ mvn clean install
  4. Setup the HADOOP_CLASSPATH environment variable to tell Hadoop where to find the java classes for the sample:

    $ export HADOOP_CLASSPATH=target/classes/
  5. Run the sample. The output directory shouldn't exists otherwise this will fail.

    $ hadoop com.umermansoor.App input/ output

Note: the output will go to the output/ folder which Hadoop will create when run. The output will be in a file called part-r-00000.

Common Errors:

  1. Exception: java.lang.NoClassDefFoundError Cause: You didn't setup the HADOOP_CLASSPATH environment variable. You need to tell Hadoop where to find the java classes. Resolution: In this case, execute the following to setup HADOOP_CLASSPATH variable to point to the target/classes/ folder.

    $ export HADOOP_CLASSPATH=target/classes/
  2. Exception: org.apache.hadoop.mapred.FileAlreadyExistsException or 'Output directory output already exists'. Cause: Output directory already exists. Hadoop requires that the output directory doesn't exists when run. Resolution: Change the output directory or remove the existing one:

    $ hadoop com.umermansoor.App input/input.csv output_new 

Note: Hadoop failing if the output folder already exists is a good thing: it ensures that you don't accidentally overwrite your previous output, as typical Hadoop jobs take hours to complete.

Something went wrong with that request. Please try again.