Build and Push Jobs for Voldemort Read Only Stores

abh1nay edited this page Oct 24, 2012 · 5 revisions

Introduction

At LinkedIn, we have been using the Build and Push Job to create Voldemort Read-Only stores from data stored on HDFS as sequence files or Avro container files.

The VoldemortBuildAndPushJob performs the following steps:

  1. Build an XML storeDef for your data (based on the key and value metadata in your JsonSequenceFile or Avro file on HDFS).

  2. Connect to push.cluster

  3. Get the storeDefs for all stores in the push.cluster

  4. Look through the storeDefs for a store with the same name as push.store.name. If one is found, validate that the storeDef in the cluster matches the storeDef for your data. If it does not match, the job fails. If no storeDef on the cluster has the name push.store.name, add your storeDef to the cluster.

  5. Build the Voldemort store in Hadoop

  6. Push the Voldemort store to push.cluster
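
Step 4 carries most of the decision logic. The following is a minimal Python sketch of that check, assuming store definitions can be compared as plain dictionaries. The function name `reconcile_store_def` and the dictionary layout are hypothetical and are not part of the Voldemort API; the real job may compare definitions differently.

```python
def reconcile_store_def(cluster_store_defs, new_store_def):
    """Mimic step 4: validate against, or add to, the cluster's store definitions.

    cluster_store_defs: list of dicts, each with a "name" key (hypothetical layout).
    new_store_def: dict built from the input data's key and value metadata.
    Returns "matched" or "added"; raises ValueError on a mismatch.
    """
    for existing in cluster_store_defs:
        if existing["name"] == new_store_def["name"]:
            if existing != new_store_def:
                raise ValueError(
                    "storeDef for %r on the cluster does not match the input data"
                    % new_store_def["name"]
                )
            return "matched"
    # No store with this name exists yet, so register the new definition.
    cluster_store_defs.append(new_store_def)
    return "added"
```

In this sketch, a mismatch stops the job before any build work starts, which mirrors the ordering of the steps above: the build (step 5) only runs once the definition has been validated or added.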

Pushing JSON Data

type=java
job.class=voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob
hadoop.job.ugi=anagpal,hadoop
build.input.path=/tmp/new_op
build.output.dir=/tmp/build_output/
push.store.name=anagpal-test-old
push.cluster=tcp://localhost:6666
push.store.description="test store"
push.store.owners=myemail@myworkplace.com
build.replication.factor=1

Pushing Avro Data

type=java
job.class=voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob
build.input.path=/user/anagpal/avro-data
build.output.dir=/tmp
push.cluster=tcp://localhost:6666
azkaban.should.proxy=true
user.to.proxy=anagpal
build.replication.factor=1
build.type.avro=true
avro.key.field=memberId
avro.value.field=localizedFirstNames
push.store.name=test-avro-store
push.store.description="Testing avro build and push"
push.store.owners=myemail@myworkplace.com

Notice the following properties:

1) build.type.avro=true: specifies that the input data is Avro.

2) avro.key.field=memberId: specifies the field to be used as the key.

3) avro.value.field=localizedFirstNames: specifies the field to be used as the value.
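
To illustrate how the two field properties select data, the sketch below treats a decoded Avro record as a Python dictionary and picks the key and value by field name. The sample record contents and the helper `extract_key_value` are invented for illustration; only the field names come from the job file above, and the actual job performs this selection inside the Hadoop build.

```python
def extract_key_value(record, key_field, value_field):
    """Return the (key, value) pair that would be stored for one record."""
    return record[key_field], record[value_field]


# Hypothetical record: the values and the extra "other" field are made up.
record = {
    "memberId": 42,
    "localizedFirstNames": {"en_US": "Alice"},
    "other": "not stored",
}

# Field names as they would be given in avro.key.field and avro.value.field.
key, value = extract_key_value(record, "memberId", "localizedFirstNames")
```

Fields that are named by neither property, such as `other` here, do not end up in the key or the value in this sketch.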

Property descriptions

The jobs above set all of the required properties along with a few optional ones. The full list of properties you may specify is:

Property                 | Required | Description
push.store.name          | Y        | The name of the store that the job will try to push.
push.cluster             | Y        | The cluster URI to push to. For example, tcp://localhost:6666.
build.input.path         | Y        | The Hadoop data that you wish to push. This data must be in Avro or JsonSequenceFile format.
build.output.dir         | Y        | The Hadoop directory where the Voldemort store will be stored.
push.store.owners        | Y        | The owner email ID.
push.store.description   | Y        | The service name that uses this store and usage in brief.
type                     | Y        | java for our use case.
num.chunks               | N        | Number of chunks per partition.
build.replication.factor | N        | The replication factor you want this store to have.
build.compress.value     | N        | If present, compresses each value using gzip.
build.temp.dir           | N        | The Hadoop directory where temporary data should be stored. Temp data is generated when the store is built on HDFS.
build.chunk.size         | N        | The individual chunk size (advisory).
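
Before submitting a job, it can help to confirm that every property marked Y in the table is present. The following is a small Python sketch of such a check, not a tool that ships with Voldemort or Azkaban. The helper names `parse_job` and `missing_required` are hypothetical, and the parser only handles simple `key=value` lines rather than the full Java properties syntax. It also reports keys that appear more than once, since a repeated key is easy to miss in a long job file.

```python
# Properties marked "Y" in the table above.
REQUIRED = (
    "type",
    "push.store.name",
    "push.cluster",
    "build.input.path",
    "build.output.dir",
    "push.store.owners",
    "push.store.description",
)


def parse_job(text):
    """Parse simple key=value lines; return (props, duplicate_keys)."""
    props, duplicates = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if key in props:
            duplicates.append(key)
        props[key] = value  # the last value wins in this sketch
    return props, duplicates


def missing_required(props):
    """List required properties that are absent or empty, in table order."""
    return [name for name in REQUIRED if not props.get(name)]
```

A typical use would be to read the `.job` file as text, call `parse_job` on it, and refuse to submit the job if `missing_required` returns a non-empty list or if any duplicates are reported.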