High-performance I/O tools to run distributed R jobs seamlessly on Hadoop and handle chunk-wise data processing

High-performance I/O tools for R

Anyone who works with large data knows that the stock tools in R perform poorly when loading non-binary (text) data into R. This package started as an attempt to provide high-performance parsing tools that minimize copying and avoid the use of strings where possible (see mstrsplit, for example).
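As a rough illustration of the parsing tools mentioned above, the following minimal sketch splits a few delimited records into a matrix with mstrsplit. The sample records and the choice of "|" as separator are assumptions made for this example only; see the mstrsplit documentation for the full set of arguments.

```r
library(iotools)

# Three pipe-delimited records, as they might arrive from a text file
# (made-up data for illustration).
lines <- c("a|1|x", "b|2|y", "c|3|z")

# mstrsplit() splits every record on `sep` and returns a matrix,
# one row per record and one column per field.
m <- mstrsplit(lines, sep = "|")

dim(m)   # number of records and number of fields
m[2, ]   # fields of the second record
```

The same call also accepts a raw vector, which is what the chunk-wise readers in this package hand to user code.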

To allow processing of arbitrarily large files, we have added a way to process input chunk-wise, making it possible to compute on streaming input as well as on very large files (see chunk.reader and chunk.apply).
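Chunk-wise processing can be sketched as follows, assuming a small header-less, comma-separated temporary file created just for this example. The file contents, the column types and the use of `c` as the merge function are illustrative choices, not something prescribed by the package.

```r
library(iotools)

# Write a small comma-separated file without a header (illustrative data only).
path <- tempfile(fileext = ".csv")
writeLines(sprintf("%d,%f", 1:1000, sqrt(1:1000)), path)

# Open the file as a binary connection and process it chunk by chunk.
con <- file(path, "rb")
partial <- chunk.apply(con, function(chunk) {
  # Each chunk arrives as a raw vector holding whole lines;
  # dstrsplit() parses it into a data frame with the given column types.
  d <- dstrsplit(chunk, col_types = c("integer", "numeric"), sep = ",")
  sum(d[[2]])            # reduce the chunk to a partial sum
}, CH.MERGE = c)         # combine the per-chunk results
close(con)

# Combine the partial sums into the final result.
total <- sum(unlist(partial))
```

Because each chunk is reduced before the next one is read, memory use depends on the chunk size rather than on the size of the file. A file this small will most likely fit into a single chunk; the pattern is the same for inputs that do not fit in memory.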

The next natural step was to add support for Hadoop streaming. The major goal was to make it possible to compute with Hadoop MapReduce by writing code that feels natural, much like using lapply on data chunks, without needing to know anything about Hadoop. See the wiki page for the idea and the hmr function for the documentation.

