abaranau edited this page May 17, 2012 · 12 revisions

Super Short Introduction

If you want to get the whole idea quickly, please refer to the very short and clear 30 Seconds HBaseHUT Introduction (not a video).


HBaseHUT stands for High Updates Throughput for HBase. It was inspired by discussions on the HBase mailing lists around the problem of performing a Get and a Put for each update operation, which dramatically reduces write throughput.

Another force behind the approach used in HBaseHUT was recent activity on Coprocessors development. Although usage of CPs is very limited in the current implementation (see the CPs branch), HBaseHUT is designed with broader use of CPs in mind: they add more flexibility for alternative (non-MapReduce) data processing approaches, and they allow the logic to be integrated seamlessly in the places where the work can be performed most efficiently.

The idea behind HBaseHUT is:

  • Don't update existing data on each put (and hence don't perform a Get operation for each Put operation). All puts are plain puts, with the corresponding pure-insert write performance.
  • Defer processing of updates to a scheduled job (not necessarily a MapReduce job), or perform updates on an as-needed basis.
  • Serve updated data in an "online" manner: the user always gets the updated record immediately after new data is put, i.e. users "see" updates immediately after they write data.
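The points above can be sketched in plain Java. This is an illustrative model of the idea only, not the actual HBaseHUT implementation, and all names in it are hypothetical: each update is stored as a pure insert under the original key plus a unique suffix, and the stored records are merged only when the data is read.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the deferred-update idea (NOT real HBaseHUT code):
// every put is a plain append under a unique key; merging happens at read time.
public class DeferredUpdates {
  // A sorted map stands in for an HBase table: "rowKey#seq" -> value
  private final TreeMap<String, Long> table = new TreeMap<>();
  private long seq = 0;

  // Write path: no Get, just a pure insert with a unique key suffix
  public void put(String rowKey, long value) {
    table.put(rowKey + "#" + (seq++), value);
  }

  // Read path: scan all records stored for the key and merge them on the fly
  // (here the "update processing" is a simple sum)
  public long get(String rowKey) {
    long sum = 0;
    for (Map.Entry<String, Long> e :
         table.subMap(rowKey + "#", rowKey + "#\uffff").entrySet()) {
      sum += e.getValue();
    }
    return sum;
  }

  public static void main(String[] args) {
    DeferredUpdates t = new DeferredUpdates();
    t.put("counter", 1);
    t.put("counter", 2);
    t.put("counter", 5);
    System.out.println(t.get("counter")); // prints 8: updates merged at read time
  }
}
```

Note how the reader always sees the merged value immediately after a write, even though no record was updated in place.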

In addition to allowing real-time data processing where it wasn't possible before (i.e. where batch processing had to be used due to update throughput limitations), HBaseHUT also adds a major feature: the ability to roll back changes.
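The rollback ability follows from the same append-only design: since every update remains a separate record until it is compacted, rolling back amounts to ignoring (or deleting) the records written after some point. A minimal sketch of that idea in plain Java, with hypothetical names, not the HBaseHUT API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of why append-only updates make rollback cheap (illustrative only):
// each update carries a sequence number, so rolling back is just a filtered read.
public class RollbackSketch {
  record Update(long seq, long value) {}

  private final List<Update> updates = new ArrayList<>();
  private long seq = 0;

  // Append an update; the returned sequence number can serve as a rollback point
  public long put(long value) {
    updates.add(new Update(seq, value));
    return seq++;
  }

  // Merge only the updates written up to (and including) the given point,
  // effectively rolling back everything written after it
  public long readAsOf(long rollbackPoint) {
    long sum = 0;
    for (Update u : updates) {
      if (u.seq <= rollbackPoint) {
        sum += u.value;
      }
    }
    return sum;
  }
}
```

With update-in-place (Get&Put), the old values would already be overwritten and this kind of rollback would be impossible.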

Please refer to the Deferring Processing Updates to Increase HBase Write Performance post for a more detailed explanation of the idea behind this project.


HBaseHUT is a tool packaged as a single jar file which wraps the HBase API (but does not change it!).

HBaseHUT fits well when the system handles a lot of updates to stored data and write performance is the main concern, while read speed requirements are less strict. The following cases may also indicate that you want to use HBaseHUT:

  • updates are well spread over the whole (large) dataset
  • a low rate (among write operations) of "true updates" (i.e. updates of existing data, as opposed to inserts of new, previously non-existing data)
  • a good portion of the data stored/updated may never be accessed
  • the system should be able to handle high write peaks quickly
  • the ability to roll back operations is needed

Please refer to HBaseHUT operations for details about the HBaseHUT features that help address these use-cases, as well as about other operations.

API Overview

Very brief overview (see respective wiki pages for details).

Writing data:

  Put put = new HutPut(Bytes.add(presentationId, slideId));
  // ...

Reading data:

  Scan scan = new Scan(presentationId);
  ResultScanner resultScanner =
      new HutResultScanner(hTable.getScanner(scan), updateProcessor);
  for (Result current : resultScanner) {...}

Update processing logic:

  public abstract class UpdateProcessor {
    public abstract void process(Iterable<Result> records,
                                 UpdateProcessingResult processingResult);
  }

Example implementation:

public class MaxFunction extends UpdateProcessor {
  private byte[] colfam;
  private byte[] qual;

  public MaxFunction(byte[] colfam, byte[] qual) {
    this.colfam = colfam;
    this.qual = qual;
  }

  @Override
  public void process(Iterable<Result> records, UpdateProcessingResult processingResult) {
    Double maxVal = null;
    // Processing records: find the max value among all stored records
    for (Result record : records) {
      byte[] valBytes = record.getValue(colfam, qual);
      if (valBytes != null) {
        double val = Bytes.toDouble(valBytes);
        if (maxVal == null || maxVal < val) {
          maxVal = val;
        }
      }
    }

    // Writing result
    if (maxVal == null) { // nothing to output
      return;
    }
    processingResult.add(colfam, qual, Bytes.toBytes(maxVal));
  }
}

Please refer to Bigger Example for a more complete example.