ElasticSearch storage handler for Hive



Wonderbee is a Hive storage handler for ElasticSearch, based in part on Wonderdog, Infochimps' Hadoop/Pig interface.



Using ElasticSearchStorageHandler for Apache Hive

Wonderbee allows you to both write to and read from Hive tables backed by ElasticSearch.

Create an ElasticSearch-backed table:

Either add the jars to hive.aux.jars.path or manually execute the following in Hive:

ADD JAR /path_to_jars/elasticsearch-0.19.4-SNAPSHOT.jar;
ADD JAR /path_to_jars/jline-0.9.94.jar;
ADD JAR /path_to_jars/log4j-1.2.16.jar;
ADD JAR /path_to_jars/lucene-analyzers-3.6.0.jar;
ADD JAR /path_to_jars/lucene-core-3.6.0.jar;
ADD JAR /path_to_jars/lucene-highlighter-3.6.0.jar;
ADD JAR /path_to_jars/lucene-memory-3.6.0.jar;
ADD JAR /path_to_jars/lucene-queries-3.6.0.jar;
ADD JAR /path_to_jars/json-simple-1.1.jar;
ADD JAR /path_to_jars/wonderdog-1.0.jar;

To create a table named user backed by an index named user_index:

CREATE TABLE user (
  id BIGINT,
  name STRING
)
STORED BY "org.wonderbee.elasticsearch.hive.ElasticSearchStorageHandler"

Here the column names that you define in Hive (e.g. 'name') are used as the field names when creating JSON records for ElasticSearch.
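
As a rough illustration of that mapping, the following Python sketch turns one row of the example table into a JSON record. It is not Wonderbee's actual serializer (the handler itself is Java); the helper name row_to_json and the COLUMNS list are hypothetical and exist only for this example.

```python
import json

# Hypothetical column list, mirroring the id/name columns of the example table.
COLUMNS = ["id", "name"]

def row_to_json(values, columns=COLUMNS):
    """Pair each Hive column name with its value and render a JSON record."""
    if len(values) != len(columns):
        raise ValueError("row has %d values, expected %d" % (len(values), len(columns)))
    # sort_keys keeps the output stable regardless of column order.
    return json.dumps(dict(zip(columns, values)), sort_keys=True)

print(row_to_json([42, "alice"]))  # prints {"id": 42, "name": "alice"}
```

The point is only that the Hive column name becomes the JSON key, so renaming a column in the table definition changes the field name seen in the index.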

Predicate Push Down:

For the Hive query predicates <, <=, >, and >=, Wonderbee converts the predicate into a range query against the underlying index.

TODO: Currently this only works for queries with a single predicate.
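
To make the conversion concrete, here is a minimal Python sketch of how a single comparison predicate might map onto a range query body. The gt/gte/lt/lte keys come from the ElasticSearch query DSL; the function name predicate_to_range_query is hypothetical, and the exact JSON that Wonderbee emits is an assumption, not something confirmed by this README.

```python
# Hive comparison operator -> key used inside an ElasticSearch range query.
RANGE_KEYS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}

def predicate_to_range_query(field, op, value):
    """Build a range query body for one predicate such as `id >= 100`."""
    if op not in RANGE_KEYS:
        # Anything other than the four comparisons is not pushed down here.
        raise ValueError("operator %r cannot be expressed as a range query" % op)
    return {"range": {field: {RANGE_KEYS[op]: value}}}

# Corresponds to: SELECT * FROM user WHERE id >= 100
print(predicate_to_range_query("id", ">=", 100))
```

Handling only one field and one operator per call mirrors the single-predicate limitation noted in the TODO.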

Query Parameters

There are a few query parameters available:

  • json - (STORE only) When 'true', indicates to the StoreFunc that pre-rendered JSON records are being indexed. Default is false.
  • size - When storing, the bulk request size (the number of records to accumulate before indexing to ElasticSearch). When loading, the number of records to fetch per request. Default is 1000.
  • q - (LOAD only) A free text query determining which records to load. If empty, matches all documents in the index.
  • id - (STORE only) The name of the field to use as the document id. If blank (or -1), documents are assumed to have no id and are assigned one by ElasticSearch.
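
The defaults and the meaning of size when storing can be sketched in Python as follows. This is an illustration under stated assumptions: how the parameters are actually passed to Wonderbee is not shown in this README, so the sketch takes them as a plain dict of strings, and the names resolve_params and batches are hypothetical.

```python
def resolve_params(params):
    """Apply the documented defaults to a dict of raw string parameters."""
    raw_id = params.get("id", "")
    return {
        "json": params.get("json", "false").lower() == "true",
        "size": int(params.get("size", "1000")),
        # An empty q means "match all documents"; represented here as None.
        "q": params.get("q", "") or None,
        # Blank or -1 means ElasticSearch assigns the document id.
        "id": None if raw_id in ("", "-1") else raw_id,
    }

def batches(records, size):
    """Yield lists of at most `size` records, one list per bulk request."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch

opts = resolve_params({"size": "2", "id": "id"})
for batch in batches([{"id": 1}, {"id": 2}, {"id": 3}], opts["size"]):
    print(len(batch), "records in bulk request")
```

With size set to 2, the three records above are grouped into one request of two records and a final request of one.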

Note that elasticsearch.yml and the plugins directory are distributed to every machine in the cluster automatically via Hadoop's distributed cache mechanism.