Riak Hadoop Word Count Example

This is a sample project that uses Riak-Hadoop to perform the canonical word count example Map/Reduce job.

Set up

You need to do a lot:

  1. Install Riak
  2. Install Hadoop (I went for pseudo-distributed mode)
  3. Bootstrap your data
  4. Package this project as a jar
  5. Run the M/R job
  6. Look at the results

NOTE: Running Riak and Hadoop on a single machine means you need a lot of open file descriptors; it's best to set your ulimit as high as your OS allows.
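For example, on Linux you could raise the limit for the current shell before starting the Riak and Hadoop daemons (the value here is illustrative; the mechanism varies by OS and distribution):

# raise the open file descriptor limit for this shell session
ulimit -n 65536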

Install Riak

The Basho wiki is the best place to start for this. This demo expects you to run a devrel cluster on localhost (see here for a how-to).
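If you've built a devrel from source, starting and joining the nodes looks roughly like this (a sketch only: the path is a placeholder for wherever you built your devrel, and newer Riak versions use the riak-admin cluster join/plan/commit sequence instead):

cd $RIAK_SRC/dev        # $RIAK_SRC: wherever your Riak source checkout lives
dev1/bin/riak start
dev2/bin/riak start
dev3/bin/riak start
dev2/bin/riak-admin join dev1@127.0.0.1
dev3/bin/riak-admin join dev1@127.0.0.1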

Also, this demo uses Riak's secondary indexes, so you'll need to enable them by switching to the LevelDB backend. To do this, edit the app.config for each Riak node, changing the riak_kv storage_backend to riak_kv_eleveldb_backend (and restart each node afterwards):

{riak_kv, [
            %% Storage_backend specifies the Erlang module defining the storage
            %% mechanism that will be used on this node.
            {storage_backend, riak_kv_eleveldb_backend},
            %% ...and the rest
]},

Install Hadoop

I followed the steps here, opting for the pseudo-distributed configuration.

Update Jackson libraries

BIG CAVEAT Hadoop has some weird classpath issues. There is no hierarchical or scoped classloader, unlike in Servlet/JEE containers. The Riak Java Client depends on Jackson, and so does Hadoop, but on different versions (natch). When the Riak-Hadoop driver is finished it will ship with a custom classloader; until then, you'll need to replace Hadoop's lib/jackson*.jar libraries with the ones in the lib folder of this repo, on your JobTracker/namenode only. On your data/tasknodes, you need only remove the Jackson jars from your hadoop/lib directory, since the classes in the job jar are at least loaded (if not in the right order) on the tasknodes. There is an open bug about this in Hadoop's JIRA; it has been open for 18 months, so I doubt it will be fixed anytime soon. I'm very sorry about this. It will be addressed (by the custom classloader) soon.
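A sketch of the jar swap, assuming a standard Hadoop layout under $HADOOP_HOME and run from the root of this repo (the exact Jackson jar names depend on the versions shipped, so check before deleting):

# on the JobTracker/namenode only: replace Hadoop's Jackson jars
# with the ones from this repo's lib folder
rm $HADOOP_HOME/lib/jackson-*.jar
cp lib/jackson-*.jar $HADOOP_HOME/lib/

# on the data/tasknodes: just remove Hadoop's Jackson jars
rm $HADOOP_HOME/lib/jackson-*.jar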

Bootstrap your data

Clone and build this project

git clone https://github.com/russelldb/riak-hadoop-wordcount
cd riak-hadoop-wordcount
mvn clean install

Then just run the Bootstrap class to load some data. The repo contains a copy of Project Gutenberg's Adventures of Huckleberry Finn; Bootstrap loads each chapter into its own key in Riak, in the wordcount bucket. The easiest way to run it is with Maven:

mvn exec:java -Dexec.mainClass="com.basho.riak.hadoop.Bootstrap" -Dexec.classpathScope=runtime

The class assumes you are using a local devrel Riak cluster. If you're not, you can specify your Riak install's transport and host using exec.args:

mvn exec:java -Dexec.mainClass="com.basho.riak.hadoop.Bootstrap" -Dexec.classpathScope=runtime -Dexec.args="[pb|http PBHOST:PORT|HTTP_URL]"
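For example, to use the protocol buffers interface of the first devrel node (the host and port here match the devrel layout used elsewhere in this README, but check your own configuration):

mvn exec:java -Dexec.mainClass="com.basho.riak.hadoop.Bootstrap" -Dexec.classpathScope=runtime -Dexec.args="pb 127.0.0.1:8081"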

Package this project

As stated earlier, the demo assumes you're running a local devrel Riak cluster. If not, you need to edit the Riak locations in the RiakWordCount class. Using your favourite editor, simply change the locations to those of your actual Riak node(s). E.g.

conf = RiakConfig.addLocation(conf, new RiakPBLocation("127.0.0.1", 8081));
conf = RiakConfig.addLocation(conf, new RiakPBLocation("127.0.0.1", 8082));

Then package the job jar:

mvn clean package

Run the job

Copy the jar from the previous step to your Hadoop install directory and kick off the M/R job:

cp target/riak-hadoop-wordcount-1.0-SNAPSHOT-job.jar $HADOOP_HOME
cd $HADOOP_HOME
bin/hadoop jar riak-hadoop-wordcount-1.0-SNAPSHOT-job.jar 

Look at the results

If it all went well, the results are in your Riak cluster, in the wordcount_out bucket.

curl http://127.0.0.1:8091/riak/wordcount_out?keys=stream

That will show you the keys. Luckily we index the data, too, so you can perform range queries to see the most common words; something like the following will do:

curl 127.0.0.1:8091/buckets/wordcount_out/index/count_int/1000/3000

Or maybe you want to see all the 'f' words Twain used?

curl 127.0.0.1:8091/buckets/wordcount_out/index/word_bin/f/g

And then?

If you try this, please let me know how you get on via the Riak mailing list or GitHub.
