
Replace Preconditions.checkNotNull with warning in SequenceFileStorage.putNext #89

Merged

Conversation

sagemintblue (Contributor)

Replaces checkNotNull in SequenceFileStorage.putNext(...) with warnings to skip null inputs

This allows null inputs to be skipped at runtime instead of hitting an NPE
and possibly failing an entire Pig pipeline. Requested by Jake Mannix.

…gs to skip null inputs
K key = keyConverter.toWritable(t.get(0));
V value = valueConverter.toWritable(t.get(1));
if (t == null) {
log.warn("Null tuple found; Skipping tuple");
Contributor

I think it is common practice to use a null key when you are only interested in a sequence of values. That would blow up the logs.

As such, I think you should either limit the logging or not log at all if it is not an error. You could increment a counter instead, which is much more accessible to users.

Contributor

Also, you should not skip the tuple when t is not null, even if the key or value is null.

Contributor Author

Ah! I had thought I couldn't get at the counters from that method. Are they accessible via UDFContext?

Contributor Author

Good point about value-only (or key-only) SequenceFile output. For these cases we could require the user to explicitly request Writable type 'NullWritable'. How does that sound?

Contributor

see PigCounterHelper#incrCounter

Contributor

I think if a user is writing nulls, we should allow it. Null values are explicitly supported by the SequenceFile format, I think.

Does Pig ever do putNext(null)? Otherwise, we could remove all of the null checks.

For counters, see the use of PigCounterHelper in JsonLoader.
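
For reference, here is a minimal sketch of the counter-based handling discussed above. It assumes PigCounterHelper#incrCounter(String, String, long) as used in JsonLoader; the class name, counter names, and the direct SequenceFile.Writer field are illustrative stand-ins, not the actual SequenceFileStorage internals.

```
import java.io.IOException;

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.pig.data.Tuple;

import com.twitter.elephantbird.pig.util.PigCounterHelper;

// Sketch of the null handling discussed in this thread; not the merged implementation.
public class NullSkippingPutNextSketch {
  private final PigCounterHelper counterHelper = new PigCounterHelper();
  private final SequenceFile.Writer writer;

  public NullSkippingPutNextSketch(SequenceFile.Writer writer) {
    this.writer = writer;
  }

  // Analogous to StoreFunc#putNext(Tuple): count and skip bad input instead of throwing an NPE.
  public void putNext(Tuple t) throws IOException {
    if (t == null || t.size() < 2) {
      // A null or short tuple is counted; no per-record log line to blow up the logs.
      counterHelper.incrCounter("SequenceFileStorage", "NullOrShortInputTuples", 1L);
      return;
    }
    Writable key = (Writable) t.get(0);
    Writable value = (Writable) t.get(1);
    if (key == null || value == null) {
      // A non-null tuple with a null key or value is counted separately, not skipped silently.
      counterHelper.incrCounter("SequenceFileStorage", "NullKeyOrValue", 1L);
      return;
    }
    writer.append(key, value);
  }
}
```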

Contributor Author

Null values in SequenceFiles are not supported unless you explicitly use NullWritable for the key or value type. In that case, all keys/values must be null.
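
To illustrate that constraint with the plain Hadoop 1.x API of this era: the key class is fixed per SequenceFile, so a value-only file uses NullWritable for every key. The path and values below are hypothetical.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// A "value-only" SequenceFile: the key class is NullWritable, so every key is null.
public class ValueOnlySequenceFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/value-only.seq"); // example path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, NullWritable.class, Text.class);
    try {
      writer.append(NullWritable.get(), new Text("first value"));
      writer.append(NullWritable.get(), new Text("second value"));
      // writer.append(new Text("key"), new Text("v")); // would fail: wrong key class
    } finally {
      writer.close();
    }
  }
}
```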

Contributor

You are right; then this makes sense. Handling NullWritable might need more changes, so it is up to you :)

…h and support null values

Details:
* Adds NullWritableConverter for explicit conversion to/from NullWritable.

* Modifies logic of SequenceFileLoader#getSchema such that if a configured
  WritableConverter impl returns DataType#NULL, the field will not be included
  in the output schema (see the sketch after this list).

* Adds a number of unit tests to make sure assumptions regarding treatment of
  nulls and use of NullWritable, NullWritableConverter are correct.
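
For illustration only, a rough sketch of the getSchema behavior described in the second bullet above, using Pig's ResourceSchema API; the class and method names here are hypothetical, not the actual SequenceFileLoader code.

```
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.ResourceSchema;
import org.apache.pig.ResourceSchema.ResourceFieldSchema;
import org.apache.pig.data.DataType;

// A field whose converter reports DataType.NULL is simply left out of the schema
// reported to Pig, yielding a value-only (or key-only) schema.
public class NullAwareSchemaSketch {
  public static ResourceSchema schemaFor(byte keyType, byte valueType) {
    List<ResourceFieldSchema> fields = new ArrayList<ResourceFieldSchema>();
    if (keyType != DataType.NULL) {
      ResourceFieldSchema key = new ResourceFieldSchema();
      key.setName("key");
      key.setType(keyType);
      fields.add(key);
    }
    if (valueType != DataType.NULL) {
      ResourceFieldSchema value = new ResourceFieldSchema();
      value.setName("value");
      value.setType(valueType);
      fields.add(value);
    }
    ResourceSchema schema = new ResourceSchema();
    schema.setFields(fields.toArray(new ResourceFieldSchema[fields.size()]));
    return schema;
  }
}
```

With DataType.NULL passed for the key, this yields a value-only schema, matching the DESCRIBE output shown below in this thread.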
@sagemintblue (Contributor Author)

The last commit enables uses like the following, to store value-only SequenceFile data:

A = LOAD '$INPUT' AS (value: chararray);
STORE A INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-t ...NullWritable',
  '-t ...Text -c ...TextConverter'
);

Also, when loading value-only (or key-only) data, the proper schema is reflected by the loader:

A = LOAD '$INPUT' USING ...SequenceFileLoader (
  '-c ...NullWritableConverter',
  '-c ...TextConverter'
);
DESCRIBE A; -- {(value: chararray)}

@rangadi (Contributor) commented Oct 14, 2011

Looks good. I am assuming you have tested all of this on real data.

Returning only the value in the case of NullWritable does not seem more helpful... why not just return null? Does it also mean that, while storing, the relation should have only one column, 'value'?

@sagemintblue (Contributor Author)

> I am assuming you have tested all of this on real data.

Unit tests include tests for explicit null key and value data, as well as unexpected null values. I haven't tested with larger data.

> Does it also mean that, while storing, the relation should have only one column, 'value'?

During STORE eval, if either the key or value type is NullWritable, only the first value in each input tuple is used. I'm not sure I like this, and the impl code is complicated by the fact that the tuple index of key and value data changes based on the type configuration. I'd be happy to rework this so the key index is always 0 and the value index is always 1, no matter the type config. Thoughts?

@rangadi (Contributor) commented Oct 14, 2011

+1 for keeping both key and value irrespective of the type. Looks more consistent to me. Users won't be surprised by nulls.

@sagemintblue (Contributor Author)

I'll make this change later today and push an update. Thanks for the feedback!

…er implementations and client use

This commit adds `DefaultWritableConverter`, capable of choosing another
WritableConverter implementation at runtime which most appropriately supports
the data, both during LOAD and STORE expression evaluation. This simplifies
client use considerably when the underlying data is of type int, long, text, or
null (other basic types are easily supported via creation of more
WritableConverter impls):

```
-- $INPUT is SequenceFile<IntWritable, Text>
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: int, value: chararray)}

-- $INPUT is SequenceFile<NullWritable, IntWritable>
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: null, value: int)}

-- $INPUT is SequenceFile<IntWritable, LongWritable>
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: int, value: long)}

-- $OUTPUT will be SequenceFile<IntWritable, LongWritable>
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage();
```

DefaultWritableConverter is able to determine the runtime data type via a number
of strategies:

During LOAD, if the underlying SequenceFile data already exists (it isn't some
intermediate output generated earlier in the same Pig script), then key and
value Writable classes are pulled directly from the data, and passed on to
`DefaultWritableConverter#initialize(..)`. This allows DefaultWritableConverter
a chance to select and instantiate an appropriate WritableConverter impl for the
given Writable type. As mentioned above, this fails in the case where the
underlying data does not yet exist. If Pig had some mechanism to communicate to
the LoadFunc the expected schema of loaded data, beyond its LoadPushDown API,
then we could do better here.
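
As a rough sketch of that LOAD-side strategy, the Hadoop SequenceFile.Reader header can supply the key and value classes directly. The helper below is illustrative, not the actual DefaultWritableConverter code; a real implementation would also need to pick a part file when the location is a directory.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// Reads the key and value Writable classes from an existing SequenceFile header,
// so an appropriate WritableConverter impl can be chosen on the Pig front end.
public class SequenceFileTypeSniffer {
  public static Class<?>[] keyValueClasses(String location, Configuration conf)
      throws Exception {
    Path path = new Path(location);
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // These classes would be handed to WritableConverter#initialize(..).
      return new Class<?>[] { reader.getKeyClass(), reader.getValueClass() };
    } finally {
      reader.close();
    }
  }
}
```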

During STORE, if the relation being stored is associated with a schema,
SequenceFileStorage will use `WritableConverter#checkStoreSchema(..)` to
validate the input schema. This allows DefaultWritableConverter to select and
instantiate an appropriate WritableConverter impl for the given Pig data
type. This fails when no schema is associated with the relation to be stored. In
this case, the user must manually specify the desired Writable type (if
supported by DefaultWritableConverter), or an appropriate WritableConverter
type.

Besides the addition of DefaultWritableConverter, here are a few more important
changes in this commit:

- SequenceFileLoader and Storage always return/expect tuples of size >= 2, even
  if the schema reports either key or value as `DataType.NULL`. This simplifies
  impl logic and client use.

- SequenceFileStorage now reports counts of unexpected nulls for the input tuple
  itself, as well as for null key or value objects.

This commit removes the bulk of features around `DefaultWritableConverter`, but
still simplifies the way clients use `SequenceFileStorage` via a slight
extension to the `WritableConverter` API: the method
`WritableConverter#getWritableClass()` allows `WritableConverter` impls to
report to their owning `SequenceFileStorage` instance the default `Writable`
type returned from calls to `WritableConverter#toWritable(..)`.
`WritableConverter` implementations for basic types, such as
`IntWritableConverter` and `TextConverter`, may now be used as follows, without
explicit specification of the `Writable` type:

```
pair = LOAD '$INPUT' AS (key: int, value: chararray);
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...TextConverter'
);
```

Only for those `WritableConverter` impls which don't report a default `Writable`
type, such as `GenericWritableConverter`, must the type param be specified:

```
pair = LOAD '$INPUT' AS (key: int, value: bytearray);
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...GenericWritableConverter -t ...MyWritableType'
);
```
@sagemintblue (Contributor Author)

I got myself into a lengthy refactoring session here, unfortunately; the last two commits first add a lot of new stuff, and then pare it down to a more manageable size for this branch. I may post another pull request to get your feedback on the new features.

The bulk of the new stuff is the inclusion of a DefaultWritableConverter impl, along with changes to SequenceFileLoader and SequenceFileStorage which allow DefaultWritableConverter to select and instantiate a WritableConverter which best suits the data type at runtime. This dramatically simplifies client use in cases where the key and value types are simple (i.e. int, long, chararray, null):

-- read SequenceFile<IntWritable, Text> without args or schema
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: int, value: chararray)}

-- write SequenceFile<IntWritable, Text> without args
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage();

After backing out DefaultWritableConverter and related changes, client use of SequenceFileStorage is still simplified from what it was: specification of the Writable type is unnecessary when using a WritableConverter which defines a default Writable type. For example:

pair = LOAD '$INPUT' AS (key: int, value: chararray);

-- store without using '-t MyWritableType' args
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...TextConverter'
);

In the case of a WritableConverter which does not specify a default Writable type, such as GenericWritableConverter, the type must still be specified manually:

pair = LOAD '$INPUT' AS (key: int, value: bytearray);

-- '-t MyWritableType' must still be specified for GenericWritableConverter
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...GenericWritableConverter -t MyWritableType'
);

@rangadi (Contributor) commented Oct 20, 2011

Love the new improvements. When I was reviewing the other pull request, I was thinking along similar lines: it would be better if I didn't need to specify the WritableConverter for Thrift, just the Thrift class, since the converter can be derived.

> After backing out DefaultWritableConverter and related changes, client use of SequenceFileStorage

It is up to you. I don't mind including DefaultWritableConverter here.

> pair = LOAD '$INPUT' USING ...SequenceFileLoader();

How do you figure out it is 'int, chararray'?

@sagemintblue (Contributor Author)

> It is up to you. I don't mind including DefaultWritableConverter here.

I'd rather not include it at this time because I'm not happy with the consistency with which I can properly derive runtime data types during LOAD expression evaluation.

> How do you figure out it is 'int, chararray'?

More details are listed in this commit message. Generally, on LOAD, I can deserialize the key and value classes from the target SequenceFile data, if it exists. Unfortunately, if the data to be loaded is generated earlier in the same script, it doesn't yet exist when the front end queries the LoadFunc (LoadMetadata) for schema info. It'd be great if Pig communicated the LOAD expression's schema clause to the LoadFunc, but I believe this only happens during LoadPushDown evaluation.

@sagemintblue (Contributor Author)

> it would be better if I didn't need to specify the WritableConverter for Thrift, just the Thrift class, since the converter can be derived.

The prior DefaultWritableConverter would need to keep track of any extra arguments passed in, such as the name of a Thrift class, in order to support this. Doable, but as I mentioned, I'd prefer to hold off until I figure out a cleaner way to determine schema info during LOAD.

@rangadi (Contributor) commented Oct 20, 2011

> More details are listed in this commit message. Generally, on LOAD,

I see. PigStorageSchema also does something similar. It does not seem to worry about the recently generated table. It just uses the 'location' passed in getSchema(). Is that affected by this issue too?

BTW, I think most of the commit message belongs in the code itself.

@sagemintblue (Contributor Author)

> It does not seem to worry about the recently generated table. It just uses the 'location' passed in getSchema(). Is that affected by this issue too?

It should be affected, yes: getSchema(..) is only called on the Pig front end during planning. If the target location is created earlier in the same script, it won't exist when getSchema(..) is called. Looking at the source for PigStorageSchema and the util class JsonMetadata, it seems as though errors are logged and getSchema(..) returns null when no schema metadata can be found for the target location.

However, if the input location is created earlier in the script via PigStorageSchema, and [StoreMetadata#storeSchema(..)](http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/StoreMetadata.html#storeSchema%28org.apache.pig.ResourceSchema,%20java.lang.String,%20org.apache.hadoop.mapreduce.Job%29) is also called on the front end during planning (creating metadata for the target location), then a subsequent PigStorageSchema#getSchema(..) would succeed. Unfortunately, I can't tell from the docs when / where StoreMetadata#storeSchema(..) is called.

> BTW, I think most of the commit message belongs in the code itself.

Agreed. When I have time to push DefaultWritableConverter back in, I'll include the commit log description in its javadoc.

@sagemintblue (Contributor Author)

Raghu, is there anything more you'd like to see in this branch? I'm keen on getting this shipped, as I'm now blocked on the EB 2.0.9 release for some other work. Do you keep a regular release schedule for EB? Thanks!

@rangadi (Contributor) commented Oct 20, 2011

Looking at the patch now... will commit soon.

@rangadi (Contributor) commented Oct 20, 2011

> The prior DefaultWritableConverter would need to keep track of any extra arguments passed in, such as the name of a Thrift class, in order to support this. Doable, but as I mentioned, I'd prefer to hold off until I figure out a cleaner way to determine schema info during LOAD.

This has already improved. Thrift classes don't need to specify '-t com.twitter.elephantbird.mapreduce.io.ThriftWritable'.

@sagemintblue (Contributor Author)

> This has already improved. Thrift classes don't need to specify '-t com.twitter.elephantbird.mapreduce.io.ThriftWritable'.

Indeed! Clients should be able to load Thrift data like this:

pair = LOAD '$INPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...ThriftWritableConverter -- ...MyThriftType'
);

No need to additionally specify -t ...ThriftWritable, unless you're using a deprecated derived type like ThriftMyTypeWritable extends ThriftWritable<MyType>.

rangadi added a commit that referenced this pull request Oct 20, 2011
…rning

Better handling of NULLs in the user's tuple while storing. Some more interface improvements and code cleanup.

@rangadi merged commit cd7a6b6 into twitter:eb-dev on Oct 20, 2011
@dvryaboy (Contributor)
I love how you guys snuck in a gigantic refactor under an innocuous-sounding heading of replacing some checks with warnings :)

@sagemintblue (Contributor Author)

My next branch will be named wahffer_thin_sequence_file_storage_update.
