Protobuf message extension support #143

Open
wants to merge 21 commits into from

4 participants

@angushe

Hi,

Here is an attempt to add protobuf message extension support to elephant-bird.

Any comments would be appreciated.

Thanks

@dvryaboy
Twitter, Inc. member

Thanks, that looks like quite a bit of work!

Just curious, do you guys actively use the codegen? We've switched to almost 100% dynamic I/O Formats, loaders, etc.

Could you add some high-level docs of what you added and how to use it?

@angushe

Thanks Dmitriy. (BTW, Do I use the correct name?)

Yes, we use the codegen but plan to migrate to dynamic implementation, for it
really produces too many codes sometimes and is not very consistent.

In current implementation of protobuf(at least 2.3.0), an ExtensionRegistry
object must be provided to deserialize a protobuf message serialized with
extensions. But once the extensions are added to protobuf builtin
ExtensionRegistry, it can never be retrieved, which is very important in the
conversion between Pig tuple and protobuf message with extension. As a result,
we introduced a new type named ProtobufExtensionRegistry which augments the
builtin with ability to return the registered extensions back.

In order to make elephant-bird recognize the extensions, user must provided a
ProtobufExtensionRegistry instance populated with the extensions, and only
these registered extensions will be identified. Some test cases are modified
to show the use of message extension, and works are still going on.

By the way, the change is backward compatible, so user don't need to modify
existing source codes after upgrade.

@dvryaboy
Twitter, Inc. member

Thanks (and yes, Dmitriy's the right name). I meant add docs to the readme :)

I'll run a few tests and unless something comes up, merge this week -- seems pretty safe since we don't use extensions yet, so it's unlikely to break our stuff.

@rangadi rangadi commented on an outdated diff Feb 21, 2012
...ird/cascading2/io/protobuf/ProtobufSerialization.java
@@ -2,14 +2,14 @@
import java.util.Comparator;
-import com.google.protobuf.Message;
@rangadi
Twitter, Inc. member
rangadi added a note Feb 21, 2012

could you remove all these 'auto reorder imports' patches?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rangadi rangadi commented on an outdated diff Feb 21, 2012
...tter/elephantbird/mapreduce/io/ProtobufConverter.java
@@ -13,10 +15,12 @@
* {@link BinaryConverter} for Protobufs
*/
public class ProtobufConverter<M extends Message> implements BinaryConverter<M> {
- private static final Logger LOG = LoggerFactory.getLogger(ProtobufConverter.class);
+ private static final Logger LOG = LoggerFactory.getLogger(
@rangadi
Twitter, Inc. member
rangadi added a note Feb 21, 2012

Please remove all the auto-reformat changes... There are lots of these.
while you are at it, I would recommend disabling these auto stuff while saving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rangadi rangadi and 1 other commented on an outdated diff Feb 22, 2012
.../com/twitter/elephantbird/pig/util/ProtobufToPig.java
@@ -63,18 +74,24 @@ public Tuple toTuple(Message msg) {
}
Descriptor msgDescriptor = msg.getDescriptorForType();
- Tuple tuple = tupleFactory_.newTuple(msgDescriptor.getFields().size());
+ Set<FieldDescriptor> fdSet = new LinkedHashSet<FieldDescriptor>();
@rangadi
Twitter, Inc. member
rangadi added a note Feb 22, 2012

why does this need to be a set?

@angushe
angushe added a note Feb 22, 2012

Raghu,

Yes, It don't need to be a set, it is here just because I thought the code is from the base but actually from Harry Chen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@angushe

Dmitriy & Raghu,

Thanks for the comments. I will try to give another commit later.

@angushe

Thanks, guys.

Dmitriy,

I will give a try to update the readme though my poor English. :)

Raghu,

The code has been reformatted according to your suggestion.

@rangadi rangadi and 1 other commented on an outdated diff Feb 22, 2012
...ter/elephantbird/proto/ProtobufExtensionRegistry.java
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import com.google.common.base.Function;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Collections2;
+import com.google.protobuf.Descriptors.Descriptor;
+import com.google.protobuf.Descriptors.FieldDescriptor;
+import com.google.protobuf.ExtensionRegistry;
+import com.google.protobuf.GeneratedMessage.GeneratedExtension;
+import com.google.protobuf.Message;
+
+import com.twitter.elephantbird.util.Protobufs;
+
+public class ProtobufExtensionRegistry {
@rangadi
Twitter, Inc. member
rangadi added a note Feb 22, 2012

What is this for? javaDoc missing?
I would think this would just be a thin wrapper for protobuf.ExtensionRegistry

@angushe
angushe added a note Feb 22, 2012

Yes, it is just a wrapper for protobuf.
In addition, when the extension is a type of protobuf message, with the wrapper, we can get the corresponding Java class, for it is unable to instantiate the extension like other normal fields with the containing type's Builder.newBuilderForField method.

Yes, Javadoc is missing, and a lot of comments need to updated too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rangadi rangadi commented on the diff Feb 22, 2012
.../com/twitter/elephantbird/pig/util/ProtobufToPig.java
@@ -271,7 +304,12 @@ private FieldSchema messageToFieldSchema(FieldDescriptor fieldDescriptor) throws
* @return the Schema for the nested message.
* @throws FrontendException if Pig decides to.
*/
- private FieldSchema singleFieldToFieldSchema(FieldDescriptor fieldDescriptor) throws FrontendException {
+ public FieldSchema singleFieldToFieldSchema(FieldDescriptor fieldDescriptor) throws FrontendException {
+ return singleFieldToFieldSchema(fieldDescriptor, null);
+ }
+
+ public FieldSchema singleFieldToFieldSchema(FieldDescriptor fieldDescriptor,
+ @SuppressWarnings("unused") ProtobufExtensionRegistry extensionRegistry) throws FrontendException {
@rangadi
Twitter, Inc. member
rangadi added a note Feb 22, 2012

if this registry is unused, do we need this method? This has change has many ripple changes.

@angushe
angushe added a note Feb 22, 2012

Actually we don't need this method since we can get all necessary information from fieldDescriptor.
It is here just for the the consistence with the signature of ProtobufToPig.messageToFieldSchema.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rangadi rangadi commented on the diff Feb 22, 2012
.../com/twitter/elephantbird/pig/util/ProtobufToPig.java
@@ -437,7 +491,9 @@ private StringBuffer messageToPigScript(FieldDescriptor fieldDescriptor, int num
* @return the pig script load string for the nested message.
* @throws FrontendException if Pig decides to.
*/
- private StringBuffer singleFieldToPigScript(FieldDescriptor fieldDescriptor, int numTabs, boolean isLast) throws FrontendException {
+ private StringBuffer singleFieldToPigScript(FieldDescriptor fieldDescriptor,
+ int numTabs, boolean isLast,
+ @SuppressWarnings("unused") ProtobufExtensionRegistry extensionRegistry) throws FrontendException {
@rangadi
Twitter, Inc. member
rangadi added a note Feb 22, 2012

Same comment as singleFieldToSchema()

@angushe
angushe added a note Feb 22, 2012

The same comment as singleFieldToFieldSchema too. :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rangadi rangadi commented on the diff Feb 22, 2012
.../com/twitter/elephantbird/pig/util/ProtobufToPig.java
@@ -243,7 +271,12 @@ public Schema toSchema(Descriptor msgDescriptor) {
* @return the Schema for the nested message.
* @throws FrontendException if Pig decides to.
*/
- private FieldSchema messageToFieldSchema(FieldDescriptor fieldDescriptor) throws FrontendException {
+ public FieldSchema messageToFieldSchema(FieldDescriptor fieldDescriptor) throws FrontendException {
+ return messageToFieldSchema(fieldDescriptor, null);
+ }
+
+ public FieldSchema messageToFieldSchema(FieldDescriptor fieldDescriptor,
@rangadi
Twitter, Inc. member
rangadi added a note Feb 22, 2012

how is extensionRegistry used to return appropriate schema?

@angushe
angushe added a note Feb 22, 2012

extensionRegistry is used to get extension field descriptors with the containing message descriptor. Then the returned extension field descriptors are merged with the normal field descriptors to form the complete schema.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rangadi
Twitter, Inc. member

what should the schema for 'x' below should be?

x = load 'x.lzo' using ProtobufPigLoader('com.twitter.data.proto.tutorial.AddressBookProtos.Person', 'com.twitter.elephantbird.proto.ProtobufExtensionRegistry');
describe x;
@angushe

Raghu,

Sorry for the long delay.

The statement

describe x;

will output the same schema as before. It is something like

x: {name: chararray,id: int,email: chararray,phone: {phone_tuple: (number: chararray,type: chararray)}}

for the default com.twitter.elephantbird.proto.ProtobufExtensionRegistry has no extensions.

Given a custom extension registry like the following

public class FooBar extends ProtobufExtensionRegistry {
  public FooBar() {
    addExtension(AddressBookProtos.PersonExt.extInfo);
  }
}

, the following pig snippet

x = load 'x.lzo' using ProtobufPigLoader('com.twitter.data.proto.tutorial.AddressBookProtos.Person', 'com.foobar.FooBar');
describe x;

will output

x: {name: chararray,id: int,email: chararray,phone: {phone_tuple: (number: chararray,type: chararray)},ext_info: (address: chararray)}

The extension

 AddressBookProtos.JustPersonExtExtEnclosingType.extExtInfo

will not be included in the schema even the messages actually has the data of the extension, for the extension is not registered in the FooBar registry.

@rangadi
Twitter, Inc. member

thanks for getting back. will do a bit more testing.

@bjuhn

Any updates on this feature?

@rangadi
Twitter, Inc. member

This is an important feature.. we will merge it into master. Just didn't get enough time to review properly.
Please feel free to add your feedback or code review.

@bjuhn

The ProtobufBytesToTuple is missing a constructor that accepts ProtobufExtensionRegistry as a string. I think this is needed? or can you give a pig example using it's current implementation?

Thanks!

@bjuhn

How do you feel about adding a ProtobufBytesToTuple and ProtobufPigLoader constructor that accepts a list of protobuf extensions in the constructor, and then have it register their extensions using the protobuf registerAllExtensions method internally?

@rangadi
Twitter, Inc. member

agreed. ProtobufBytesToTuple should be in sync with ProtoBufPigLoader. should add that here.
haven't thought about the list of extensions.

@bjuhn

From my perspective the list of extensions would be preferable, as people won't have to create their own ProtobufExtensionRegistry extension in some cases (in all of my cases).

@angushe

Soulfulresidue & Raghu,

New commit has been made to make the consistency.

As for the constructor receiving a list of protobuf extensions, in my opinion, it should not be in the ProtobufPigLoader for you cannot use it directly in pig script.

@bjuhn

It's working for me after fixing issue #208.

I am curious why the code is taking my GeneratedMessageExtension from my ProtobufExtensionRegistry extension and having to parse the outer class name from the file. Can't this be obtained from protobuf generated code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment