Replace Preconditions.checkNotNull with warning in SequenceFileStorage.putNext #89
Conversation
…gs to skip null inputs

This allows null inputs to be skipped at runtime instead of hitting an NPE and possibly failing an entire Pig pipeline. Requested by Jake Mannix.
```java
K key = keyConverter.toWritable(t.get(0));
V value = valueConverter.toWritable(t.get(1));
if (t == null) {
  log.warn("Null tuple found; Skipping tuple");
```
I think it is common practice to use a null key when you are just interested in a sequence of values. That would blow up the logs.
As such, I think you should either limit the logging or not log at all if it is not an error. You could increment a counter instead, which is much more accessible to users.
Also, you should not skip when t is not null, even if either the key or the value is null.
Ah! I had thought I couldn't get at the counters from that method. Are they accessible via UDFContext?
Good point about value-only (or key-only) SequenceFile output. For these cases we could require the user to explicitly request Writable type 'NullWritable'. How does that sound?
see PigCounterHelper#incrCounter
I think if a user is writing nulls, we should allow it. Null values are explicitly supported by the SequenceFile format, I think.
Does Pig ever do putNext(null)? Otherwise, we could remove all of the null checks.
For counters see use of PigCounterHelper in JsonLoader.
Null values in SequenceFiles are not supported unless you explicitly use NullWritable for key or value type. Then, all keys / values must be null.
You are right; then this makes sense. Handling NullWritable might need more changes. It is up to you :)
…h and support null values

Details:
* Adds NullWritableConverter for explicit conversion to/from NullWritable.
* Modifies the logic of SequenceFileLoader#getSchema such that if a configured WritableConverter impl returns DataType#NULL, the field will not be included in the output schema.
* Adds a number of unit tests to make sure assumptions regarding treatment of nulls and use of NullWritable and NullWritableConverter are correct.
The last commit enables uses like the following, to store value-only SequenceFile data:
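The example itself did not survive in this page; a sketch of such a use might look like the following. The converter option flags and elided package names (`...`) follow the style of examples later in this thread and are illustrative, not verbatim from the commit:

```
-- Hypothetical sketch: store value-only data as SequenceFile<NullWritable, Text>.
-- Converter class names and the '-c' option style are illustrative.
data = LOAD '$INPUT' AS (value: chararray);
STORE data INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...NullWritableConverter',
  '-c ...TextConverter'
);
```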
Also, when loading value (or key) only data, the proper schema is reflected by the loader:
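The loader output referenced here is also missing from the page; based on the DESCRIBE examples shown later in this thread, it presumably looks along these lines (illustrative sketch, not verbatim):

```
-- Hypothetical sketch: loading value-only data.
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair;
-- expected to be along the lines of: {(key: null, value: chararray)}
```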
Looks good. I am assuming you have tested all of this on real data. Returning only the value in the case of NullWritable does not seem more helpful; why not just return null? Does it also mean that, while storing, the relation should have only one column, 'value'?
Unit tests include tests for explicit null key and value data, as well as unexpected null values. I haven't tested with larger data.
During STORE eval, if either key or value type is NullWritable, only the first value in input tuples is used. I'm not sure I like this, and the impl code is complicated by the fact that the tuple index of key and value data changes based on type configuration. I'd be happy to rework so the key index is always 0 and the value index always 1, no matter the type config. Thoughts?
+1 for keeping both key and value irrespective of the type. Looks more consistent to me. Users won't be surprised by nulls.
I'll make this change later today and push an update. Thanks for the feedback!
…er implementations and client use

This commit adds `DefaultWritableConverter`, capable of choosing another WritableConverter implementation at runtime which most appropriately supports the data, both during LOAD and STORE expression evaluation. This simplifies client use considerably when underlying data is of type int, long, text, or null (other basic types are easily supported via creation of more WritableConverter impls):

```
-- $INPUT is SequenceFile<IntWritable, Text>
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: int, value: chararray)}

-- $INPUT is SequenceFile<NullWritable, IntWritable>
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: null, value: int)}

-- $INPUT is SequenceFile<IntWritable, LongWritable>
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: int, value: long)}

-- $OUTPUT will be SequenceFile<IntWritable, LongWritable>
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage();
```

DefaultWritableConverter is able to determine the runtime data type via a number of strategies:

During LOAD, if the underlying SequenceFile data already exists (it isn't some intermediate output generated earlier in the same Pig script), then key and value Writable classes are pulled directly from the data and passed on to `DefaultWritableConverter#initialize(..)`. This gives DefaultWritableConverter a chance to select and instantiate an appropriate WritableConverter impl for the given Writable type. As mentioned above, this fails in the case where the underlying data does not yet exist. If Pig had some mechanism to communicate to the LoadFunc the expected schema of loaded data, beyond its LoadPushDown API, then we could do better here.

During STORE, if the relation being stored is associated with a schema, SequenceFileStorage will use `WritableConverter#checkStoreSchema(..)` to validate the input schema. This allows DefaultWritableConverter to select and instantiate an appropriate WritableConverter impl for the given Pig data type. This fails when no schema is associated with the relation to be stored; in this case, the user must manually specify the desired Writable type (if supported by DefaultWritableConverter), or an appropriate WritableConverter type.

Besides the addition of DefaultWritableConverter, here are a few more important changes in this commit:

- SequenceFileLoader and SequenceFileStorage always return/expect tuples of size >= 2, even if the schema reports either key or value as `DataType.NULL`. This simplifies impl logic and client use.
- SequenceFileStorage now reports counts of unexpected nulls for the input tuple itself, as well as for null key or value objects.
This commit removes the bulk of features around `DefaultWritableConverter`, but still simplifies the way clients use `SequenceFileStorage` via a slight extension to the `WritableConverter` API: method `WritableConverter#getWritableClass()` allows `WritableConverter` impls to report to their owning `SequenceFileStorage` instance the default `Writable` type returned from calls to `WritableConverter#toWritable(..)`.

`WritableConverter` implementations for basic types, such as `IntWritableConverter` and `TextConverter`, may now be used as follows, without explicit specification of `Writable` type:

```
pair = LOAD '$INPUT' AS (key: int, value: chararray);
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...TextConverter'
);
```

Only for those `WritableConverter` impls which don't report default `Writable` types, such as `GenericWritableConverter`, must the type param be specified:

```
pair = LOAD '$INPUT' AS (key: int, value: bytearray);
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...GenericWritableConverter -t ...MyWritableType'
);
```
I got myself into a lengthy refactoring session here, unfortunately; the last two commits first add a lot of new stuff, then pare it down to a more manageable size for this branch. I may post another pull request to get your feedback on the new features. The bulk of the new stuff is the inclusion of a
After backing out
In the case of a
Love the new improvements. When I was reviewing the other pull request, I was thinking along similar lines: it would be better if I didn't need to specify the WritableConverter for Thrift, just the Thrift class, since the converter can be derived.
It is up to you. I don't mind including DefaultWritableConverter here.
How do you figure out it is 'int, chararray'?
I'd rather not include it at this time because I'm not happy with the consistency with which I can properly derive runtime data types during
More details are listed in this commit message. Generally, on
The prior
I see. BTW, I think most of the commit message belongs in the code itself.
It should be affected, yes. However, if the input location is created earlier in the script via
Agreed. When I have time to push
Raghu, anything more you'd like to see in this branch? I'm keen on getting this shipped, as I'm now blocked on the EB 2.0.9 release for some other work. Do you keep a regular release schedule for EB? Thanks!
Looking at the patch now; will commit soon.
This has already improved. Thrift classes don't need to specify '-t com.twitter.elephantbird.mapreduce.io.ThriftWritable'.
Indeed! Clients should be able to load Thrift data like this:
No need to additionally specify
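The example the comment points at is missing from this page; a sketch of what loading Thrift data without an explicit Writable type might look like follows. The converter class and Thrift struct names here are illustrative placeholders, following the option style used earlier in this thread:

```
-- Hypothetical sketch: load Thrift records by naming only the Thrift class.
-- '...ThriftWritableConverter' and 'com.example.thrift.MyStruct' are illustrative.
pair = LOAD '$INPUT' USING ...SequenceFileLoader (
  '-c ...IntWritableConverter',
  '-c ...ThriftWritableConverter com.example.thrift.MyStruct'
);
```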
…rning better handling of NULLs in the user's tuple while storing

Some more interface improvements and code cleanup.
I love how you guys snuck in a gigantic refactor under an innocuous-sounding heading of replacing some checks with warnings :)
My next branch will be named wahffer_thin_sequence_file_storage_update.
Replaces checkNotNull in SequenceFileStorage.putNext(...) with warnings to skip null inputs
This allows null inputs to be skipped at runtime instead of hitting an NPE,
and possibly failing an entire Pig pipeline. Requested by Jake Mannix.