HBase High Update Throughput

HBaseHUT:
---------
http://github.com/sematext/HBaseHUT

Released under Apache License 2.0.

Mailing List:
-------------
To participate in the discussion, join the group at
https://groups.google.com/group/hbasehut/

Description:
------------
HBaseHUT stands for High Updates Throughput for HBase. It was inspired by
discussions on the HBase mailing lists about the cost of performing a Get
followed by a Put for each update operation, which reduces write throughput
dramatically. Another force behind the approach used in HBaseHUT was the
recent activity around Coprocessor (CP) development. Although usage of CPs
is very limited in the current implementation (see the cp package), HBaseHUT
is designed with broader use of CPs in mind: they add flexibility for
alternative MapReduce data processing approaches and allow the logic to be
integrated seamlessly where the work can be performed most efficiently.

The idea behind HBaseHUT is:
* Don't update existing data on each Put (and hence don't perform a Get
  operation for each Put operation). All Puts are plain Puts with the
  corresponding pure-insert write performance.
* Defer processing of updates to a scheduled job (not necessarily a
  MapReduce job) or perform updates on an as-needed basis.
* Serve updated data in an "online" manner: the user always gets the updated
  record immediately after new data is Put, i.e., the user "sees" updates
  immediately after writing data.
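
The three points above can be sketched with a small in-memory model. This is
only an illustration under stated assumptions, not the HBaseHUT API: the
class DeferredUpdateSketch, its methods, and the choice of "sum the values"
as the update function are all hypothetical, and a TreeMap stands in for an
HBase table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of deferred updates. Hypothetical; not the HBaseHUT API. */
public class DeferredUpdateSketch {
    // Stand-in for a table: each key holds every record written so far.
    private final Map<String, List<Long>> rows = new TreeMap<>();

    /** Plain insert: appends a record without reading existing data first. */
    public void put(String key, long delta) {
        rows.computeIfAbsent(key, k -> new ArrayList<>()).add(delta);
    }

    /** Read path: merges all stored records on the fly, so writes are visible at once. */
    public long get(String key) {
        long sum = 0;
        for (long v : rows.getOrDefault(key, Collections.<Long>emptyList())) {
            sum += v;
        }
        return sum;
    }

    /** Deferred processing: collapses the stored records for a key into one. */
    public void compact(String key) {
        List<Long> pending = rows.get(key);
        if (pending == null || pending.size() < 2) {
            return; // nothing to merge
        }
        long merged = get(key);
        pending.clear();
        pending.add(merged);
    }

    /** Number of records physically stored for a key. */
    public int storedRecords(String key) {
        return rows.getOrDefault(key, Collections.<Long>emptyList()).size();
    }

    public static void main(String[] args) {
        DeferredUpdateSketch table = new DeferredUpdateSketch();
        table.put("page-1", 3);
        table.put("page-1", 5);
        table.put("page-1", 2);
        System.out.println(table.get("page-1"));           // 10
        System.out.println(table.storedRecords("page-1")); // 3
        table.compact("page-1");
        System.out.println(table.get("page-1"));           // 10
        System.out.println(table.storedRecords("page-1")); // 1
    }
}
```

The write path never reads, the read path does the merging, and compact()
plays the role of the scheduled or as-needed processing step. In this sketch
the value returned by get() is the same before and after compaction; only
the number of stored records changes.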

In addition to enabling real-time data processing where it wasn't possible
before (where batch processing was used due to update throughput limitations),
HBaseHUT also adds a major feature: the ability to roll back changes.

For more information, please refer to the GitHub project wiki:
https://github.com/sematext/HBaseHUT/wiki

For introductory posts and talks with a good HBaseHUT use case, read/watch:
http://blog.sematext.com/2010/12/16/deferring-processing-updates-to-increase-hbase-write-performance/
http://blog.sematext.com/2012/04/22/hbase-real-time-analytics-rollbacks-via-append-based-updates/
http://blog.sematext.com/2012/04/27/hbase-real-time-analytics-rollbacks-via-append-based-updates-part-2/
http://vimeo.com/26813019
http://www.slideshare.net/alexbaranau/realtime-analytics-with-hbase-short-version
http://www.slideshare.net/alexbaranau/realtime-analytics-with-hbase-long-version

Build Notes:
------------
Note: unit tests take some time to execute (up to several minutes). To skip
them, use -Dmaven.test.skip=true.

The latest stable version can be linked from your Maven project with:

  <repositories>
    <repository>
      <id>sonatype release</id>
      <url>https://oss.sonatype.org/content/repositories/releases/</url>
    </repository>
  </repositories>

  [...]

  <dependency>
    <groupId>com.sematext.hbasehut</groupId>
    <artifactId>hbasehut</artifactId>
    <version>0.1.0</version>
  </dependency>

To run MapReduce jobs on hadoop-2.0+ (which is part of CDH4.1+), use:

  <dependency>
    <groupId>com.sematext.hbasehut</groupId>
    <artifactId>hbasehut</artifactId>
    <version>0.1.0-hadoop-2.0</version>
  </dependency>

Author:
-------
Alex Baranau