Fast Avro Storage

I got frustrated with the version of AvroStorage bundled with Apache Pig (in Piggybank), so I decided to write my own.

Why god why?

The AvroStorage code is very complicated. It does a lot of unnecessary copying. It doesn't support the latest version of Avro (so it doesn't support Snappy compression). All of these things are bad.

What did you do differently?

I decided on a different approach. In Pig, Tuple is an interface. I realized that I could just wrap Avro objects (GenericData objects) in another object that implements the Tuple interface. That reduced the amount of copying required.
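
To make the wrapping idea concrete, here is a minimal sketch of the pattern using only the Java standard library. It is not the code from this project. MiniTuple is a hypothetical, cut-down stand-in for Pig's Tuple interface (the real one has many more methods), and MiniRecord is a hypothetical stand-in for an Avro generic record. The point is only that the wrapper holds a reference to the record and delegates to it, so field values are never copied into a separate tuple.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, cut-down stand-in for Pig's Tuple interface.
interface MiniTuple {
    int size();
    Object get(int fieldNum);
    void set(int fieldNum, Object val);
    List<Object> getAll();
}

// Hypothetical stand-in for an Avro generic record: values addressed by position.
final class MiniRecord {
    private final String[] fieldNames;
    private final Object[] values;

    MiniRecord(String... fieldNames) {
        this.fieldNames = fieldNames.clone();
        this.values = new Object[fieldNames.length];
    }

    int fieldCount() { return fieldNames.length; }
    Object get(int i) { return values[i]; }
    void put(int i, Object v) { values[i] = v; }
}

// The wrapper: implements the tuple interface by delegating to the record.
// It keeps a reference to the record instead of copying its fields.
final class RecordTupleView implements MiniTuple {
    private final MiniRecord record;

    RecordTupleView(MiniRecord record) { this.record = record; }

    @Override public int size() { return record.fieldCount(); }
    @Override public Object get(int fieldNum) { return record.get(fieldNum); }
    @Override public void set(int fieldNum, Object val) { record.put(fieldNum, val); }

    // Only this method materializes a new list, and only when a caller asks for it.
    @Override public List<Object> getAll() {
        List<Object> out = new ArrayList<>(size());
        for (int i = 0; i < size(); i++) {
            out.add(record.get(i));
        }
        return out;
    }
}

class WrapDemo {
    public static void main(String[] args) {
        MiniRecord rec = new MiniRecord("name", "age");
        rec.put(0, "joe");
        rec.put(1, 30);

        MiniTuple tuple = new RecordTupleView(rec);
        tuple.set(1, 31); // writes through to the underlying record
        System.out.println(tuple.getAll() + " / record age = " + rec.get(1));
    }
}
```

Because the view and the record share the same storage, a write through the tuple is visible on the record and vice versa. A wrapper around real Avro records would also have to implement the rest of Pig's Tuple interface and handle nested types, which this sketch leaves out.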

I used the latest version of Avro (1.7.2) as a starting point, and rewrote AvroStorage from scratch.

I also wrote a function to load and store data in Trevni format from Pig. Trevni is Doug Cutting's new format for column-oriented stores; it's designed to accept Avro objects and return Avro objects.

Why didn't you contribute this to Apache?

I will contribute this to Apache when I have worked out some bugs, done some performance testing and tuning, and added unit tests.

In the meantime, feel free to try this out. It's alpha-quality software; I can make no promises of any kind about it, other than that I wrote it. Hope you find it useful!

-- Joe