rajeshgk/RHIPE
forked from delta-rho/RHIPE

R and Hadoop Integrated Programming Environment

Introduction

The following are instructions to get Rhipe up and running on a local machine (as opposed to a cluster) for testing and development. The instructions are written for a Mac, although they can be adapted to Linux.

Pre-reqs

Install Hadoop

Download Hadoop from Cloudera (you only need Hadoop itself) and follow the instructions for installing Hadoop in pseudo-distributed mode, located in the tarball under docs/single_node_setup.html.

  • In addition: edit conf/core-site.xml and set the storage directory to something meaningful (where you want Hadoop data stored on the local machine):
    <property>  
        <name>hadoop.tmp.dir</name>  
        <value>[path on local machine]</value>  
    </property>  

If you have already formatted the namenode and then change the tmp dir, you will need to reformat the namenode.
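As a sketch, the property above can be written with a heredoc. Note this writes a standalone core-site.xml to the current directory purely for illustration; in practice you would edit $HADOOP_HOME/conf/core-site.xml in place, preserving its other properties (such as fs.default.name). HADOOP_DATA is a placeholder, not a variable Hadoop itself knows about:

```shell
#!/bin/sh
# HADOOP_DATA is a placeholder for wherever you want Hadoop data stored.
HADOOP_DATA="${HADOOP_DATA:-$HOME/hadoop-data}"
mkdir -p "$HADOOP_DATA"

# Write a minimal core-site.xml setting hadoop.tmp.dir to that path.
cat > core-site.xml <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$HADOOP_DATA</value>
  </property>
</configuration>
EOF

echo "wrote core-site.xml with hadoop.tmp.dir=$HADOOP_DATA"
```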

Install Google protocol buffers 2.4.1

On Mac, use Homebrew; on Linux, use your distribution's package manager.

Environment

Set the following environment variables:

HADOOP_HOME=<where you unzipped the hadoop tar ball>  
HADOOP_BIN=<hadoop bin dir>  
HADOOP_CONF_DIR=<hadoop conf dir>  
PKG_CONFIG_PATH=<protobuf pkgconfig dir>  
LD_LIBRARY_PATH=<protobuf lib dir>  
RHIPE_HADOOP_TMP_FOLDER=<location in HDFS space for Rhipe to write temporary files, defaults to /tmp if not specified>

Mac/Linux example:
export HADOOP_HOME=/Users/perk387/Software/hadoop-0.20.2-cdh3u6
export HADOOP_BIN=$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/Cellar/protobuf241/2.4.1/lib/pkgconfig
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Cellar/protobuf241/2.4.1/lib
export RHIPE_HADOOP_TMP_FOLDER=/tmp
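Since the build fails in confusing ways when these variables are missing, a quick sanity check can help. The script below is a sketch: the default paths are placeholder assumptions, and only the variables the build strictly needs are checked:

```shell
#!/bin/sh
# Sketch: verify the Rhipe build environment before running ant.
# The fallback paths here are placeholders for illustration only.
export HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-0.20.2-cdh3u6}"
export HADOOP_BIN="${HADOOP_BIN:-$HADOOP_HOME/bin}"
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$HADOOP_HOME/conf}"

# Report any variable that is still empty.
missing=0
for v in HADOOP_HOME HADOOP_BIN HADOOP_CONF_DIR; do
  eval "val=\$$v"
  [ -n "$val" ] || { echo "missing: $v"; missing=1; }
done

if [ "$missing" -eq 0 ]; then
  echo "environment ok"
fi
```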

R Development Environment

  • Start R
  • Run:
    install.packages("roxygen2")

Rhipe

Rhipe is built using Ant and Maven. To clean, compile, build, install and test Rhipe on a fully configured system run:

ant build-all

or to skip the R tests run:

ant clean build r-install

For a list of the other main Ant targets, run:
ant -p

Hadoop Distros

  • Rhipe has been successfully built and run on plain Hadoop 1.x from Apache and on Cloudera CDH3/CDH4. However, Rhipe must be built against the dependencies of the distro that will be used.

  • Rhipe is set up to build against CDH3 and CDH4-mr1 by default. There are Maven profiles set up in the POM that build against Apache Hadoop 1.x and 2.x; however, these have not been integrated with the main Ant build and must be enabled by the user.
    To do this, edit the Ant build file, build.xml: copy the Ant target _build-hadoop-1 and customize it for the alternate Maven profiles.

  • YARN: Rhipe successfully builds against YARN, but this configuration has not been tested.
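The second bullet above suggests copying the _build-hadoop-1 target. A rough sketch of what such a copy might look like is shown below; the target name, Maven profile id ("hadoop-2"), and goal are assumptions to be checked against the actual build.xml and pom.xml:

```xml
<!-- Sketch only: a copy of _build-hadoop-1 pointed at an alternate
     Maven profile. Verify the profile id against pom.xml. -->
<target name="_build-hadoop-2">
  <exec executable="mvn" failonerror="true">
    <arg value="-P"/>
    <arg value="hadoop-2"/>
    <arg value="package"/>
  </exec>
</target>
```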

Debug Options

Experimental

An option has been added on the Java side that, if an error occurs in the MapReduce job, saves to HDFS any "last.dump.rda" files created by R.
To enable this option in R, add options(error=dump.frames) to the beginning of any R code that will be evaluated by Hadoop (mapper, reducer, etc.). The last.dump.rda file is stored by job ID under the $RHIPE_HADOOP_TMP_FOLDER/map-reduce-error directory (by default, /tmp/map-reduce-error). For a usage example, see the test-with-errors.R script in the inst/tests directory of the package.
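A minimal sketch of the pattern described above, placing the options() call at the top of a map expression. The mapper body here (taking a log that fails on negative input) is illustrative only; see test-with-errors.R for the real example:

```r
# Sketch: enable post-mortem dumps inside code that Hadoop will evaluate.
# If the map errors, R writes last.dump.rda, which the Java side copies to
# $RHIPE_HADOOP_TMP_FOLDER/map-reduce-error/<job-id> on HDFS.
map <- expression({
  options(error = dump.frames)  # must run before any code that might fail
  lapply(seq_along(map.values), function(i) {
    rhcollect(map.keys[[i]], log(map.values[[i]]))  # errors on bad input
  })
})
```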
