How to use Elephant Bird with Hive
This page documents how to use Elephant-Bird with Hive.
Let's quickly remind ourselves how Hive reads records so we understand how Elephant-Bird fits into the picture.
Hive data is organized as follows:
- Databases contain Tables
- Tables contain Partitions
- Partitions have a StorageDescriptor
When reading data from a partition, Hive uses the storage descriptor to determine which InputFormat and SerDe to use. Each partition has its own storage descriptor, so you can change file and/or serialization formats at any time, and your existing partitions are still usable.
Hive creates an instance of the InputFormat for the partition and uses it to read bytes from disk, splitting the input stream into chunks that each hold one record's worth of bytes.
Hive creates an instance of the SerDe, and initializes it with properties you provided when creating the table (serialization.class & serialization.format). Each record's worth of bytes from the input format is given to the SerDe, which converts it to the correct object type.
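To see which InputFormat and SerDe Hive will use for a given partition, you can inspect its storage descriptor. The sketch below assumes a hypothetical table `web_logs` partitioned by `dt`; the table name and partition value are illustrative only.

```sql
-- `web_logs` and the dt value are hypothetical; substitute your own table and partition.
-- The "Storage Information" part of the output lists the SerDe library,
-- InputFormat and OutputFormat recorded for this particular partition.
describe formatted web_logs partition (dt='2011-06-01');
```

Because the storage descriptor is per partition, running the same command against an older partition may report different formats than a newer one.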
Elephant-Bird typically provides the functionality of both input format and SerDe: reading bytes from disk, splitting them into a record's worth of bytes, and deserializing those bytes into an object of the correct type. To integrate with Hive in the least intrusive way, Elephant-Bird was updated to include an input format that returns the raw record bytes instead of internally deserializing the record.
An alternative implementation of Elephant-Bird support for Hive could be done at the SerDe layer, by providing a SerDe that expects an already-deserialized record. The issue with this approach is that the HiveMetaStore does not store properties for input formats, only for SerDes. Adding metastore support for input format properties is certainly possible, but it is a larger change for little benefit. For this reason we chose to add raw bytes support to Elephant-Bird so it "just works" with Hive.
Example create table statement
You can create Hive tables as follows for your data stored by Elephant-Bird, and partitions you add to the tables will be readable.
    create external table elephantbird_test
      -- no need to specify a schema - it will be discovered at runtime
      -- (myint int, myString string, underscore_int int)
      partitioned by (dt string)
      -- hive provides a thrift deserializer
      row format serde "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer"
        with serdeproperties (
          -- full name of our thrift struct class
          "serialization.class"="org.apache.hadoop.hive.serde2.thrift.test.IntString",
          -- use the binary protocol
          "serialization.format"="org.apache.thrift.protocol.TBinaryProtocol")
      stored as
        -- elephant-bird provides an input format for use with hive
        inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
        -- placeholder as we will not be writing to this table
        outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
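Once the table exists, partitions can be attached and queried in the usual way. The following is a minimal sketch; the HDFS location and the `dt` value are assumptions made for illustration, not paths from the Elephant-Bird project.

```sql
-- Hypothetical location: point this at a directory of Elephant-Bird-written files.
alter table elephantbird_test add partition (dt='2011-06-01')
  location '/data/elephantbird_test/2011-06-01';

-- Columns are discovered from the Thrift class named in "serialization.class".
describe elephantbird_test;

select * from elephantbird_test where dt='2011-06-01' limit 10;
```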
Reading Protocol Buffers
Elephant-Bird provides ProtobufDeserializer for reading records stored as serialized protocol buffers. This deserializer expects BytesWritable records, making it suitable for use with a variety of input formats, such as DeprecatedRawMultiInputFormat, SequenceFile, and other file formats that store binary values.
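    create table users
      row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
        with serdeproperties (
          -- full name of the generated protocol buffer message class
          "serialization.class"="com.example.proto.gen.Storage$User")
      stored as
        inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
        -- placeholder, as in the Thrift example; Hive expects an outputformat alongside inputformat
        outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";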
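The statement above can only succeed if Hive can load both the SerDe class and the generated protocol buffer class. One way to arrange that, sketched here for a Hive CLI session, is to add the relevant jars before creating the table. The jar paths and file names below are hypothetical; use whatever artifacts your build actually produces.

```sql
-- Hypothetical jar locations; run these before the create table statement.
add jar /path/to/elephant-bird-hive.jar;
add jar /path/to/my-generated-protos.jar;

-- After the table is created, the columns should mirror the fields of the message class.
describe users;
select * from users limit 10;
```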
Don't we need to store the write-time schema? Won't things break horribly if we change the schema?
A benefit of storing Thrift structs is that there are well-known rules for evolving them: never remove fields, never reuse field IDs, and add new fields as optional. If you play by those rules, bytes stored on disk can be restored to Thrift structs based on the current schema, with optional fields left null or set to their default values as the case may be.
The following hive-site.xml property must be set for Elephant-Bird to work correctly.
    <property>
      <name>hive.input.format</name>
      <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
      <description>Hive uses a wrapping input format so each partition can be
        handled differently. The default CombineHiveInputFormat does not actually
        ask input formats for their splits, instead looks at the files themselves
        and attempts to optimize based on file size and location. We use
        HiveInputFormat to disable this optimization as we want our loaders to
        report their splits.
      </description>
    </property>
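If editing hive-site.xml is not an option, the same property can be set for a single session instead. This is a sketch of that alternative, not something the Elephant-Bird project prescribes; the setting lasts only for the current session.

```sql
-- Override the wrapping input format for this session only.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Print the current value to confirm the override took effect.
set hive.input.format;
```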