
Replace Preconditions.checkNotNull with warning in SequenceFileStorage.putNext #89

Merged

Conversation

sagemintblue (Contributor)

Replaces checkNotNull in SequenceFileStorage.putNext(...) with warnings to skip null inputs

This allows null inputs to be skipped at runtime instead of hitting an NPE
and possibly failing an entire Pig pipeline. Requested by Jake Mannix.

…gs to skip null inputs
K key = keyConverter.toWritable(t.get(0));
V value = valueConverter.toWritable(t.get(1));
if (t == null) {
log.warn("Null tuple found; Skipping tuple");
Contributor

I think it is common practice to use a null key when you are only interested in a sequence of values. That would blow up the logs.

As such, I think you should either limit the logging or not log at all if it is not an error. You could increment a counter instead, which is much more accessible to users.

Contributor

Also, you should not skip the tuple when t is not null, even if the key or value is null.

Contributor Author

Ah! I had thought I couldn't get at the counters from that method. Are they accessible via UDFContext?

Contributor Author

Good point about value-only (or key-only) SequenceFile output. For these cases we could require the user to explicitly request Writable type 'NullWritable'. How does that sound?

Contributor

see PigCounterHelper#incrCounter

Contributor

I think if a user is writing nulls, we should allow it. Null values are explicitly supported by the SequenceFile format, I think.

Does Pig ever do putNext(null)? Otherwise, we could remove all of the null checks.

For counters, see the use of PigCounterHelper in JsonLoader.
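
For reference, here is a minimal sketch of the counter-based handling discussed above. It assumes PigCounterHelper#incrCounter(String, String, long) as used in JsonLoader; the class name, counter names, and the direct SequenceFile.Writer field are illustrative stand-ins, not the actual SequenceFileStorage internals.

```
import java.io.IOException;

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.pig.data.Tuple;

import com.twitter.elephantbird.pig.util.PigCounterHelper;

// Sketch of the null handling discussed in this thread; not the merged implementation.
public class NullSkippingPutNextSketch {
  private final PigCounterHelper counterHelper = new PigCounterHelper();
  private final SequenceFile.Writer writer;

  public NullSkippingPutNextSketch(SequenceFile.Writer writer) {
    this.writer = writer;
  }

  // Analogous to StoreFunc#putNext(Tuple): count and skip bad input instead of throwing an NPE.
  public void putNext(Tuple t) throws IOException {
    if (t == null || t.size() < 2) {
      // A null or short tuple is counted; no per-record log line to blow up the logs.
      counterHelper.incrCounter("SequenceFileStorage", "NullOrShortInputTuples", 1L);
      return;
    }
    Writable key = (Writable) t.get(0);
    Writable value = (Writable) t.get(1);
    if (key == null || value == null) {
      // A non-null tuple with a null key or value is counted separately, not skipped silently.
      counterHelper.incrCounter("SequenceFileStorage", "NullKeyOrValue", 1L);
      return;
    }
    writer.append(key, value);
  }
}
```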

Contributor Author

Null values in SequenceFiles are not supported unless you explicitly use NullWritable for the key or value type. In that case, all keys/values must be null.
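
To illustrate that constraint with the plain Hadoop 1.x API of this era: the key class is fixed per SequenceFile, so a value-only file uses NullWritable for every key. The path and values below are hypothetical.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// A "value-only" SequenceFile: the key class is NullWritable, so every key is null.
public class ValueOnlySequenceFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/value-only.seq"); // example path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, NullWritable.class, Text.class);
    try {
      writer.append(NullWritable.get(), new Text("first value"));
      writer.append(NullWritable.get(), new Text("second value"));
      // writer.append(new Text("key"), new Text("v")); // would fail: wrong key class
    } finally {
      writer.close();
    }
  }
}
```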

Contributor

You are right; then this makes sense. Handling NullWritable might need more changes, so it is up to you :)

…h and support null values

Details:
* Adds NullWritableConverter for explicit conversion to/from NullWritable.

* Modifies logic of SequenceFileLoader#getSchema such that if a configured
  WritableConverter impl returns DataType#NULL, the field will not be included
  in the output schema (see the sketch after this list).

* Adds a number of unit tests to make sure assumptions regarding treatment of
  nulls and use of NullWritable, NullWritableConverter are correct.
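
For illustration only, a rough sketch of the getSchema behavior described in the second bullet above, using Pig's ResourceSchema API; the class and method names here are hypothetical, not the actual SequenceFileLoader code.

```
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.ResourceSchema;
import org.apache.pig.ResourceSchema.ResourceFieldSchema;
import org.apache.pig.data.DataType;

// A field whose converter reports DataType.NULL is simply left out of the schema
// reported to Pig, yielding a value-only (or key-only) schema.
public class NullAwareSchemaSketch {
  public static ResourceSchema schemaFor(byte keyType, byte valueType) {
    List<ResourceFieldSchema> fields = new ArrayList<ResourceFieldSchema>();
    if (keyType != DataType.NULL) {
      ResourceFieldSchema key = new ResourceFieldSchema();
      key.setName("key");
      key.setType(keyType);
      fields.add(key);
    }
    if (valueType != DataType.NULL) {
      ResourceFieldSchema value = new ResourceFieldSchema();
      value.setName("value");
      value.setType(valueType);
      fields.add(value);
    }
    ResourceSchema schema = new ResourceSchema();
    schema.setFields(fields.toArray(new ResourceFieldSchema[fields.size()]));
    return schema;
  }
}
```

With DataType.NULL passed for the key, this yields a value-only schema, matching the DESCRIBE output shown below in this thread.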
@sagemintblue (Contributor Author)

The last commit enables uses like the following, to store value-only SequenceFile data:

A = LOAD '$INPUT' AS (value: chararray);
STORE A INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-t ...NullWritable',
  '-t ...Text -c ...TextConverter'
);

Also, when loading value-only (or key-only) data, the proper schema is reflected by the loader:

A = LOAD '$INPUT' USING ...SequenceFileLoader (
  '-c ...NullWritableConverter',
  '-c ...TextConverter'
);
DESCRIBE A; -- {(value: chararray)}

@rangadi (Contributor) commented Oct 14, 2011

Looks good. I am assuming you have tested all of this on real data.

Returning only the value in the case of NullWritable does not seem more helpful... why not just return null? Does it also mean that, while storing, the relation should have only one column, 'value'?

@sagemintblue (Contributor Author)

> I am assuming you have tested all of this on real data.

Unit tests include tests for explicit null key and value data, as well as unexpected null values. I haven't tested with larger data.

> Does it also mean that, while storing, the relation should have only one column, 'value'?

During STORE eval, if either the key or value type is NullWritable, only the first value in each input tuple is used. I'm not sure I like this, and the impl code is complicated by the fact that the tuple index of key and value data changes based on the type configuration. I'd be happy to rework this so the key index is always 0 and the value index is always 1, no matter the type config. Thoughts?

@rangadi (Contributor) commented Oct 14, 2011

+1 for keeping both key and value irrespective of the type. Looks more consistent to me. Users won't be surprised by nulls.

@sagemintblue (Contributor Author)

I'll make this change later today and push an update. Thanks for the feedback!

…er implementations and client use

This commit adds `DefaultWritableConverter`, capable of choosing another
WritableConverter implementation at runtime which most appropriately supports
the data, both during LOAD and STORE expression evaluation. This simplifies
client use considerably when the underlying data is of type int, long, text, or
null (other basic types are easily supported via creation of more
WritableConverter impls):

```
-- $INPUT is SequenceFile<IntWritable, Text>
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: int, value: chararray)}

-- $INPUT is SequenceFile<NullWritable, IntWritable>
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: null, value: int)}

-- $INPUT is SequenceFile<IntWritable, LongWritable>
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: int, value: long)}

-- $OUTPUT will be SequenceFile<IntWritable, LongWritable>
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage();
```

DefaultWritableConverter is able to determine the runtime data type via a number
of strategies:

During LOAD, if the underlying SequenceFile data already exists (it isn't some
intermediate output generated earlier in the same Pig script), then key and
value Writable classes are pulled directly from the data, and passed on to
`DefaultWritableConverter#initialize(..)`. This allows DefaultWritableConverter
a chance to select and instantiate an appropriate WritableConverter impl for the
given Writable type. As mentioned above, this fails in the case where the
underlying data does not yet exist. If Pig had some mechanism to communicate to
the LoadFunc the expected schema of loaded data, beyond its LoadPushDown API,
then we could do better here.
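
As a rough sketch of that LOAD-side strategy, the Hadoop SequenceFile.Reader header can supply the key and value classes directly. The helper below is illustrative, not the actual DefaultWritableConverter code; a real implementation would also need to pick a part file when the location is a directory.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// Reads the key and value Writable classes from an existing SequenceFile header,
// so an appropriate WritableConverter impl can be chosen on the Pig front end.
public class SequenceFileTypeSniffer {
  public static Class<?>[] keyValueClasses(String location, Configuration conf)
      throws Exception {
    Path path = new Path(location);
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // These classes would be handed to WritableConverter#initialize(..).
      return new Class<?>[] { reader.getKeyClass(), reader.getValueClass() };
    } finally {
      reader.close();
    }
  }
}
```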

During STORE, if the relation being stored is associated with a schema,
SequenceFileStorage will use `WritableConverter#checkStoreSchema(..)` to
validate the input schema. This allows DefaultWritableConverter to select and
instantiate an appropriate WritableConverter impl for the given Pig data
type. This fails when no schema is associated with the relation to be stored. In
this case, the user must manually specify the desired Writable type (if
supported by DefaultWritableConverter), or an appropriate WritableConverter
type.

Besides the addition of DefaultWritableConverter, here are a few more important
changes in this commit:

- SequenceFileLoader and Storage always return/expect tuples of size >= 2, even
  if the schema reports either key or value as `DataType.NULL`. This simplifies
  impl logic and client use.

- SequenceFileStorage now reports counts of unexpected nulls for the input tuple
  itself, as well as for null key or value objects.

This commit removes the bulk of features around `DefaultWritableConverter`, but
still simplifies the way clients use `SequenceFileStorage` via a slight
extension to the `WritableConverter` API: the method
`WritableConverter#getWritableClass()` allows `WritableConverter` impls to
report to their owning `SequenceFileStorage` instance the default `Writable`
type returned from calls to `WritableConverter#toWritable(..)`.
`WritableConverter` implementations for basic types, such as
`IntWritableConverter` and `TextConverter`, may now be used as follows, without
explicit specification of the `Writable` type:

```
pair = LOAD '$INPUT' AS (key: int, value: chararray);
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...TextConverter'
);
```

Only for those `WritableConverter` impls which don't report a default `Writable`
type, such as `GenericWritableConverter`, must the type param be specified:

```
pair = LOAD '$INPUT' AS (key: int, value: bytearray);
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...GenericWritableConverter -t ...MyWritableType'
);
```
@sagemintblue (Contributor Author)

I got myself into a lengthy refactoring session here, unfortunately; the last two commits first add a lot of new stuff, and then pare it down to a more manageable size for this branch. I may post another pull request to get your feedback on the new features.

The bulk of the new stuff is the inclusion of a DefaultWritableConverter impl, along with changes to SequenceFileLoader and SequenceFileStorage which allow DefaultWritableConverter to select and instantiate a WritableConverter which best suits the data type at runtime. This dramatically simplifies client use in cases where the key and value types are simple (i.e. int, long, chararray, null):

-- read SequenceFile<IntWritable, Text> without args or schema
pair = LOAD '$INPUT' USING ...SequenceFileLoader();
DESCRIBE pair; -- {(key: int, value: chararray)}

-- write SequenceFile<IntWritable, Text> without args
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage();

After backing out DefaultWritableConverter and related changes, client use of SequenceFileStorage is still simplified from what it was: specification of the Writable type is unnecessary when using a WritableConverter which defines a default Writable type. For example:

pair = LOAD '$INPUT' AS (key: int, value: chararray);

-- store without using '-t MyWritableType' args
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...TextConverter'
);

In the case of a WritableConverter which does not specify a default Writable type, such as GenericWritableConverter, the type must still be specified manually:

pair = LOAD '$INPUT' AS (key: int, value: bytearray);

-- '-t MyWritableType' must still be specified for GenericWritableConverter
STORE pair INTO '$OUTPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...GenericWritableConverter -t MyWritableType'
);

@rangadi (Contributor) commented Oct 20, 2011

Love the new improvements. When I was reviewing the other pull request, I was thinking along similar lines: it would be better if I didn't need to specify the WritableConverter for Thrift, just the Thrift class, since the converter can be derived.

> After backing out DefaultWritableConverter and related changes, client use of SequenceFileStorage

It is up to you. I don't mind including DefaultWritableConverter here.

> pair = LOAD '$INPUT' USING ...SequenceFileLoader();

How do you figure out it is 'int, chararray'?

@sagemintblue (Contributor Author)

> It is up to you. I don't mind including DefaultWritableConverter here.

I'd rather not include it at this time because I'm not happy with the consistency with which I can properly derive runtime data types during LOAD expression evaluation.

> How do you figure out it is 'int, chararray'?

More details are listed in this commit message. Generally, on LOAD, I can deserialize the key and value classes from the target SequenceFile data, if it exists. Unfortunately, if the data to be loaded is generated earlier in the same script, it doesn't yet exist when the front end queries the LoadFunc (LoadMetadata) for schema info. It'd be great if Pig communicated the LOAD expression's schema clause to the LoadFunc, but I believe this only happens during LoadPushDown evaluation.

@sagemintblue (Contributor Author)

> it would be better if I didn't need to specify the WritableConverter for Thrift, just the Thrift class, since the converter can be derived.

The prior DefaultWritableConverter would need to keep track of any extra arguments passed in, such as the name of a Thrift class, in order to support this. Doable, but as I mentioned, I'd prefer to hold off until I figure out a cleaner way to determine schema info during LOAD.

@rangadi (Contributor) commented Oct 20, 2011

> More details are listed in this commit message. Generally, on LOAD,

I see. PigStorageSchema also does something similar. It does not seem to worry about the recently generated table. It just uses the 'location' passed in getSchema(). Is that affected by this issue too?

BTW, I think most of the commit message belongs in the code itself.

@sagemintblue (Contributor Author)

> It does not seem to worry about the recently generated table. It just uses the 'location' passed in getSchema(). Is that affected by this issue too?

It should be affected, yes: getSchema(..) is only called on the Pig front end during planning. If the target location is created earlier in the same script, it won't exist when getSchema(..) is called. Looking at the source for PigStorageSchema and the util class JsonMetadata, it seems as though errors are logged and getSchema(..) returns null when no schema metadata can be found for the target location.

However, if the input location is created earlier in the script via PigStorageSchema, and [StoreMetadata#storeSchema(..)](http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/StoreMetadata.html#storeSchema%28org.apache.pig.ResourceSchema,%20java.lang.String,%20org.apache.hadoop.mapreduce.Job%29) is also called on the front end during planning (creating metadata for the target location), then a subsequent PigStorageSchema#getSchema(..) would succeed. Unfortunately, I can't tell from the docs when / where StoreMetadata#storeSchema(..) is called.

> BTW, I think most of the commit message belongs in the code itself.

Agreed. When I have time to push DefaultWritableConverter back in, I'll include the commit log description in its javadoc.

@sagemintblue (Contributor Author)

Raghu, is there anything more you'd like to see in this branch? I'm keen on getting this shipped, as I'm now blocked on the EB 2.0.9 release for some other work. Do you keep a regular release schedule for EB? Thanks!

@rangadi (Contributor) commented Oct 20, 2011

Looking at the patch now... will commit soon.

@rangadi (Contributor) commented Oct 20, 2011

> The prior DefaultWritableConverter would need to keep track of any extra arguments passed in, such as the name of a Thrift class, in order to support this. Doable, but as I mentioned, I'd prefer to hold off until I figure out a cleaner way to determine schema info during LOAD.

This has already improved. Thrift classes don't need to specify '-t com.twitter.elephantbird.mapreduce.io.ThriftWritable'.

@sagemintblue (Contributor Author)

> This has already improved. Thrift classes don't need to specify '-t com.twitter.elephantbird.mapreduce.io.ThriftWritable'.

Indeed! Clients should be able to load Thrift data like this:

pair = LOAD '$INPUT' USING ...SequenceFileStorage (
  '-c ...IntWritableConverter',
  '-c ...ThriftWritableConverter -- ...MyThriftType'
);

No need to additionally specify -t ...ThriftWritable, unless you're using a deprecated derived type like ThriftMyTypeWritable extends ThriftWritable<MyType>.

rangadi added a commit that referenced this pull request Oct 20, 2011
…rning

Better handling of NULLs in the user's tuple while storing. Some more interface improvements and code cleanup.

@rangadi merged commit cd7a6b6 into twitter:eb-dev on Oct 20, 2011
@dvryaboy (Contributor)
I love how you guys snuck in a gigantic refactor under an innocuous-sounding heading of replacing some checks with warnings :)

@sagemintblue (Contributor Author)

My next branch will be named wahffer_thin_sequence_file_storage_update.
