
specify a size to a column builder #910

Merged
merged 1 commit into master from fix_tbb_column_size_with_blob on Sep 8, 2017

Conversation

jingjingwang
Contributor

By default, a column builder reserves space for 10k elements, except BlobColumnBuilder, which reserves space for only 1. As a result, when a list of columns that includes a blob column is allocated, the blob column has capacity 1 while the other columns have capacity 10k. This caused an OOM when we modified ApplyTest.testApply() to add a blob column to its schema: since each TupleBatch now holds only 1 tuple, the test created 20k TupleBatches, each with several columns of capacity 10k (and one blob column of capacity 1).
This PR fixes the bug by passing the desired size to each column builder, instead of having the builder derive its size from its type internally.
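
To make the fix concrete, here is a minimal, self-contained Java sketch of the idea. The names are hypothetical and do not match Myria's actual ColumnBuilder API; the structure just mirrors the change described above, where the batch size is passed into each builder rather than derived from its type:

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSizeSketch {

  enum Type { INT, BLOB }

  interface ColumnBuilder {
    int capacity();
  }

  /** Fixed-capacity builder for int columns; the capacity is passed in, not hard-coded to 10k. */
  static final class IntColumnBuilder implements ColumnBuilder {
    private final int[] data;
    IntColumnBuilder(int capacity) { this.data = new int[capacity]; }
    @Override public int capacity() { return data.length; }
  }

  /** Fixed-capacity builder for blob columns; previously this capacity was effectively fixed at 1. */
  static final class BlobColumnBuilder implements ColumnBuilder {
    private final byte[][] data;
    BlobColumnBuilder(int capacity) { this.data = new byte[capacity][]; }
    @Override public int capacity() { return data.length; }
  }

  /**
   * All builders for one batch share the same caller-supplied capacity, so a blob column
   * no longer ends up with capacity 1 while its sibling columns reserve 10k slots each.
   */
  static List<ColumnBuilder> allocateBuilders(List<Type> schema, int batchSize) {
    List<ColumnBuilder> builders = new ArrayList<>(schema.size());
    for (Type t : schema) {
      builders.add(t == Type.BLOB ? new BlobColumnBuilder(batchSize) : new IntColumnBuilder(batchSize));
    }
    return builders;
  }

  public static void main(String[] args) {
    // Schema with a blob column: every builder is allocated at the batch size (1 here),
    // matching the number of tuples the batch will actually hold.
    for (ColumnBuilder b : allocateBuilders(List.of(Type.INT, Type.BLOB), 1)) {
      System.out.println(b.getClass().getSimpleName() + " capacity=" + b.capacity());
    }
  }
}
```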

@coveralls

Coverage Status

Coverage increased (+0.04%) to 26.834% when pulling 7ecb1f4 on fix_tbb_column_size_with_blob into 5732f48 on master.

@senderista
Contributor

The code looks fine, but I have a few higher-level questions:

  • Why are we hardcoding the batch size of any schema that contains a blob attribute to 1? Surely there must be some use cases with blobs only a few bytes in size (e.g., 32- or 64-bit bitvectors). Are we just optimizing for the particular use case that motivated adding the blob type in the first place and assuming that (extremely large) blob size is typical?
  • How do I override the default batch size of something that outputs or receives tuples (say an operator or a TupleBatchBuffer), given a schema, without constructing all the ColumnBuilders by hand? Should output batch size be an overridable property of the Operator base class? Right now, if I have a schema containing a blob type and I don't want my output batch size to be 1, I'm not sure what I would do.
  • I never had a chance to review the design changes to accommodate blob types, so I'm unfamiliar with the rationale for representing blobs as ByteBuffers instead of byte[] arrays. Was marshaling efficiency the deciding factor? (If so, could you elaborate?)

@jingjingwang
Contributor Author

For 1), yes, the original purpose of introducing blobs was to handle large binary objects such as images; we expected small objects to be encoded using other types (String, for example). That's why we hard-coded the batch size to 1 when a blob column is involved. It's admittedly not a complete design, which leads to 2): I agree that we would need to add interfaces to Operator for specifying the batch size of its output tuple batches, but then each operator would need to handle the case where its input and output batch sizes differ, which means adding some buffering/copying logic. At the time we didn't have a use case to motivate such a large change, so we didn't implement it.
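
To make the current behavior concrete, here is a hedged illustration (a hypothetical helper reusing the Type enum from the sketch in the PR description above, not actual Myria code):

```java
// Hypothetical helper illustrating the behavior described above: the batch size
// is pinned to 1 whenever the schema contains a blob column, otherwise 10k.
static int batchSizeFor(List<Type> schema) {
  return schema.contains(Type.BLOB) ? 1 : 10_000;
}
```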

For 3), it's mainly because byte[] doesn't implement Comparable. We could certainly implement our own Comparable wrapper for byte[], but that seems unnecessary when ByteBuffer already provides one. ByteBuffer may also allow more efficient I/O, though we didn't benchmark anything.
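
As a concrete illustration of the Comparable point (a standalone Java snippet, not Myria code): ByteBuffer has a natural lexicographic ordering, while byte[] inherits only identity-based equals from Object.

```java
import java.nio.ByteBuffer;

public class BlobCompareDemo {
  public static void main(String[] args) {
    ByteBuffer a = ByteBuffer.wrap(new byte[] {1, 2, 3});
    ByteBuffer b = ByteBuffer.wrap(new byte[] {1, 2, 4});

    // ByteBuffer implements Comparable<ByteBuffer>: remaining bytes are compared
    // lexicographically, so blob values can be sorted or used as keys directly.
    System.out.println(a.compareTo(b) < 0);  // true

    // byte[] does not implement Comparable, and equals() is reference equality,
    // so a custom wrapper (or java.util.Arrays utilities) would be needed instead.
    byte[] x = {1, 2, 3};
    byte[] y = {1, 2, 3};
    System.out.println(x.equals(y));         // false
  }
}
```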

@parmitam
Contributor

parmitam commented Sep 5, 2017

looks good!

@senderista
Contributor

OK, I guess we can address the design warts when we have a compelling use case.

@senderista senderista merged commit e5b4b2b into master Sep 8, 2017
@senderista senderista deleted the fix_tbb_column_size_with_blob branch September 8, 2017 22:03