Mathematica package for reading files off of HDFS



HadoopLink provides a framework for delegating the work of a map-reduce job to Mathematica kernels running on your Hadoop cluster, along with a suite of tools for working with that cluster from a Mathematica notebook.


Distributed Filesystem Interaction

Wherever possible, HadoopLink provides an analogue to Mathematica's filesystem interaction functions for use with the Hadoop filesystem API. These functions are compatible with HDFS, the local filesystem, Amazon S3, and any other system that can be accessed through the Hadoop filesystem API.

How to install HadoopLink

Evaluate FileNameJoin[{$UserBaseDirectory, "Applications"}] in Mathematica and unpack the HadoopLink release archive to the indicated directory.
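The unpacking step can also be done from a shell. The following is only a sketch: the function name install_archive is hypothetical, the archive name and format of the actual release are not stated here and may differ, and the destination must be whatever path Mathematica reported for FileNameJoin[{$UserBaseDirectory, "Applications"}].

```shell
# Hypothetical helper: unpack a release archive into a destination directory.
# The extractor is chosen by file extension; other formats are rejected.
install_archive() {
  archive=$1
  dest=$2
  mkdir -p "$dest" || return 1
  case "$archive" in
    *.zip)          unzip -oq "$archive" -d "$dest" ;;
    *.tar.gz|*.tgz) tar -xzf "$archive" -C "$dest" ;;
    *)              echo "unsupported archive: $archive" >&2; return 1 ;;
  esac
}

# Usage (both arguments are placeholders, substitute your own values):
#   install_archive HadoopLink.tar.gz "/path/reported/by/Mathematica/Applications"
```

After unpacking, restart the Mathematica kernel so that the new Applications entry is picked up.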

How to build HadoopLink

Building HadoopLink requires:

  • Apache Ant
  • Mathematica (version 7 or higher)
  • Wolfram Workbench (for building the documentation notebooks)
  • Hadoop version 0.20, patched to include the typed bytes binary data support for Hadoop Streaming.

HadoopLink was developed against the Cloudera Distribution for Hadoop, version 3.

The following properties affect the HadoopLink ant tasks:

  • mathematica.dir: The path to your local Mathematica installation. Can be found by evaluating the $InstallationDirectory symbol in Mathematica.
  • workbench.dir: Optional. The path to your Wolfram Workbench installation. If omitted, skip.docbuild will be set.
  • skip.docbuild: Optional. Set this property to skip building the documentation.

Build the HadoopLink distribution by running:

ant -Dmathematica.dir=$MATHEMATICA_PATH -Dworkbench.dir=$WORKBENCH_PATH build

using appropriate values for your system.
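Because the Workbench path is optional, the invocation differs depending on whether documentation is being built. The sketch below assembles the command from the same MATHEMATICA_PATH and WORKBENCH_PATH placeholders; the function name build_cmd is hypothetical, and passing "true" as the value of skip.docbuild is an assumption (the build may only check that the property is set).

```shell
# Hypothetical helper: print the ant command for a HadoopLink build.
# Uses workbench.dir when WORKBENCH_PATH is set; otherwise passes
# skip.docbuild explicitly (value "true" is an assumption).
# Note: paths containing spaces would need extra quoting before eval.
build_cmd() {
  cmd="ant -Dmathematica.dir=${MATHEMATICA_PATH:-}"
  if [ -n "${WORKBENCH_PATH:-}" ]; then
    cmd="$cmd -Dworkbench.dir=$WORKBENCH_PATH"
  else
    cmd="$cmd -Dskip.docbuild=true"
  fi
  printf '%s build\n' "$cmd"
}

# Usage:
#   build_cmd             # inspect the command
#   eval "$(build_cmd)"   # run it
```

Printing the command first makes it easy to confirm the property values before starting a build.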

To do

There are a number of areas in which HadoopLink could be improved.

  • Break sequence file export from Mathematica into chunked writes to avoid Java out-of-memory (heap space) errors.
  • Make sequence file import compatible with all Writable subclasses in
  • Improve error handling in DFS interaction functions.
  • Switch error reporting from Throw to Message.
  • Add support for shipping package dependencies along with map-reduce jobs.
  • Use MemoryConstrained in map-reduce tasks.
  • Rewrite the reJar function in Java for better performance.
  • Put record queues between the Java map/reduce calls and Mathematica to reduce J/Link overhead.