# HadoopLink

HadoopLink provides a framework for delegating the work of a map-reduce job to Mathematica kernels running on your Hadoop cluster, along with a suite of tools for working with the cluster from a Mathematica notebook.
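For illustration only (the function and argument names below are hypothetical sketches of the pattern described above, not necessarily HadoopLink's actual API), delegating a word-count job to cluster kernels might look along these lines:

```
(* Hypothetical sketch: names are illustrative, not confirmed HadoopLink API. *)
link = OpenHadoopLink["hdfs://namenode:8020"];  (* connect to the cluster *)
HadoopMapReduceJob[link,
  "word-count",            (* job name *)
  "/input/books",          (* DFS input path *)
  "/output/word-count",    (* DFS output path *)
  Function[{key, value},   (* map: emit a {word, 1} pair per word *)
    Sow[{#, 1}] & /@ StringSplit[value]],
  Function[{key, values},  (* reduce: sum the counts for each word *)
    Sow[{key, Total[values]}]]]
```

The map and reduce functions run inside Mathematica kernels on the cluster nodes, so they can use any Wolfram Language functionality available there.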

## Features

### Distributed Filesystem Interaction

Wherever possible, HadoopLink provides analogues of Mathematica's filesystem interaction functions for use with the Hadoop filesystem API. These functions work with HDFS, the local filesystem, Amazon S3, and any other storage system accessible through the Hadoop filesystem API.
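As a sketch of that analogy (the `DFS*` names here are hypothetical, chosen to mirror the built-ins `FileNames`, `Import`, and `Export`, and may differ from the package's actual exports):

```
(* Hypothetical sketch: DFS* names are illustrative analogues of built-ins. *)
link = OpenHadoopLink["hdfs://namenode:8020"];
DFSFileNames[link, "/user/data/*"]              (* analogue of FileNames *)
DFSImport[link, "/user/data/part-00000"]        (* analogue of Import *)
DFSExport[link, "/user/data/notes.txt", "hi"]   (* analogue of Export *)
```

Because the Hadoop filesystem API abstracts over the underlying store, the same calls would apply whether the path lives on HDFS, the local filesystem, or S3.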

## How to install HadoopLink

Evaluate `FileNameJoin[{$UserBaseDirectory, "Applications"}]` in Mathematica and unpack the HadoopLink release archive into the directory it returns.
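On Mathematica 9 or later, the unpacking can even be done from within Mathematica itself using the built-in `ExtractArchive` (unavailable in versions 7 and 8, where your system's archive tool is needed instead); the archive filename below is a placeholder:

```
(* "HadoopLink.zip" is a placeholder for the actual release archive name. *)
ExtractArchive["HadoopLink.zip",
  FileNameJoin[{$UserBaseDirectory, "Applications"}]]
```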

## How to build HadoopLink

Building HadoopLink requires:

- Apache Ant
- Mathematica (version 7 or higher)
- Wolfram Workbench (for building the documentation notebooks)
- Hadoop 0.20, patched to include typed-bytes binary data support for Hadoop Streaming

HadoopLink was developed against the Cloudera Distribution for Hadoop, version 3.

The following properties affect the HadoopLink ant tasks:

- `mathematica.dir`: the path to your local Mathematica installation. It can be found by evaluating `$InstallationDirectory` in Mathematica.
- `workbench.dir`: optional. The path to your Wolfram Workbench installation. If omitted, `skip.docbuild` will be set.
- `skip.docbuild`: optional. Set this property to skip building the documentation.

Build the HadoopLink distribution by running:

```
ant -Dmathematica.dir=$MATHEMATICA_PATH -Dworkbench.dir=$WORKBENCH_PATH build
```

using appropriate values for your system.
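If Wolfram Workbench is not installed, the documentation build can be skipped by setting `skip.docbuild` explicitly (assuming, as is conventional for ant, that the task checks only whether the property is set, any value works):

```
ant -Dmathematica.dir=$MATHEMATICA_PATH -Dskip.docbuild=true build
```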

## To do

There are a number of areas in which HadoopLink could be improved.

- Break sequence file exports from Mathematica into chunks to avoid Java out-of-memory errors.
- Make sequence file import compatible with all `Writable` subclasses in `org.apache.hadoop.io`.
- Improve error handling in the DFS interaction functions.
- Switch error reporting from `Throw` to `Message`.
- Add support for shipping package dependencies along with map-reduce jobs.
- Use `MemoryConstrained` in map-reduce tasks.
- Rewrite the `reJar` function in Java for better performance.
- Put record queues between the Java map/reduce calls and Mathematica to reduce J/Link overhead.