
specify a size to a column builder #910

Merged
merged 1 commit into master from fix_tbb_column_size_with_blob on Sep 8, 2017

Conversation

jingjingwang
Contributor

By default, a column builder reserves space for 10k elements, except BlobColumnBuilder, which reserves space for only 1. As a result, when a list of columns that includes a blob column is allocated, the blob column has capacity 1 while the other columns have capacity 10k. This caused an OOM when we modified ApplyTest.testApply() to add a blob column to its schema: since each TupleBatch now holds only 1 tuple, the test created 20k TupleBatches, each with several columns of capacity 10k (and one blob column of capacity 1).
This PR fixes the bug by passing the desired size to each column builder, instead of having the builder derive its size from its type internally.
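
To make the fix concrete, here is a minimal, self-contained Java sketch of the idea. The names are hypothetical and do not match Myria's actual ColumnBuilder API; the structure just mirrors the change described above, where the batch size is passed into each builder rather than derived from its type:

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSizeSketch {

  enum Type { INT, BLOB }

  interface ColumnBuilder {
    int capacity();
  }

  /** Fixed-capacity builder for int columns; the capacity is passed in, not hard-coded to 10k. */
  static final class IntColumnBuilder implements ColumnBuilder {
    private final int[] data;
    IntColumnBuilder(int capacity) { this.data = new int[capacity]; }
    @Override public int capacity() { return data.length; }
  }

  /** Fixed-capacity builder for blob columns; previously this capacity was effectively fixed at 1. */
  static final class BlobColumnBuilder implements ColumnBuilder {
    private final byte[][] data;
    BlobColumnBuilder(int capacity) { this.data = new byte[capacity][]; }
    @Override public int capacity() { return data.length; }
  }

  /**
   * All builders for one batch share the same caller-supplied capacity, so a blob column
   * no longer ends up with capacity 1 while its sibling columns reserve 10k slots each.
   */
  static List<ColumnBuilder> allocateBuilders(List<Type> schema, int batchSize) {
    List<ColumnBuilder> builders = new ArrayList<>(schema.size());
    for (Type t : schema) {
      builders.add(t == Type.BLOB ? new BlobColumnBuilder(batchSize) : new IntColumnBuilder(batchSize));
    }
    return builders;
  }

  public static void main(String[] args) {
    // Schema with a blob column: every builder is allocated at the batch size (1 here),
    // matching the number of tuples the batch will actually hold.
    for (ColumnBuilder b : allocateBuilders(List.of(Type.INT, Type.BLOB), 1)) {
      System.out.println(b.getClass().getSimpleName() + " capacity=" + b.capacity());
    }
  }
}
```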

@coveralls

Coverage Status

Coverage increased (+0.04%) to 26.834% when pulling 7ecb1f4 on fix_tbb_column_size_with_blob into 5732f48 on master.

@senderista
Contributor

The code looks fine, but I have a few higher-level questions:

  • Why are we hardcoding the batch size of any schema that contains a blob attribute to 1? Surely there must be some use cases with blobs only a few bytes in size (e.g., 32- or 64-bit bitvectors). Are we just optimizing for the particular use case that motivated adding the blob type in the first place and assuming that (extremely large) blob size is typical?
  • How do I override the default batch size of something that outputs or receives tuples (say an operator or a TupleBatchBuffer), given a schema, without constructing all the ColumnBuilders by hand? Should output batch size be an overridable property of the Operator base class? Right now, if I have a schema containing a blob type and I don't want my output batch size to be 1, I'm not sure what I would do.
  • I never had a chance to review the design changes to accommodate blob types, so I'm unfamiliar with the rationale for representing blobs as ByteBuffers instead of byte[] arrays. Was marshaling efficiency the deciding factor? (If so, could you elaborate?)

@jingjingwang
Contributor Author

For 1), yes, the original purpose of introducing blobs was to handle large binary objects such as images; we expected small objects to be encoded using other types (String, for example). That's why we hard-coded the batch size to 1 when a blob column is involved. It's admittedly not a complete design, which leads to 2): I agree that we would need to add interfaces to Operator for specifying the batch size of its output tuple batches, but then each operator would need to handle the case where its input and output batch sizes differ, which means adding some buffering/copying logic. At the time we didn't have a use case to motivate such a large change, so we didn't implement it.
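
To make the current behavior concrete, here is a hedged illustration (a hypothetical helper reusing the Type enum from the sketch in the PR description above, not actual Myria code):

```java
// Hypothetical helper illustrating the behavior described above: the batch size
// is pinned to 1 whenever the schema contains a blob column, otherwise 10k.
static int batchSizeFor(List<Type> schema) {
  return schema.contains(Type.BLOB) ? 1 : 10_000;
}
```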

For 3), it's mainly because byte[] doesn't implement Comparable. We could certainly implement our own Comparable wrapper for byte[], but that seems unnecessary when ByteBuffer already provides one. ByteBuffer may also allow more efficient I/O, though we didn't benchmark anything.
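
As a concrete illustration of the Comparable point (a standalone Java snippet, not Myria code): ByteBuffer has a natural lexicographic ordering, while byte[] inherits only identity-based equals from Object.

```java
import java.nio.ByteBuffer;

public class BlobCompareDemo {
  public static void main(String[] args) {
    ByteBuffer a = ByteBuffer.wrap(new byte[] {1, 2, 3});
    ByteBuffer b = ByteBuffer.wrap(new byte[] {1, 2, 4});

    // ByteBuffer implements Comparable<ByteBuffer>: remaining bytes are compared
    // lexicographically, so blob values can be sorted or used as keys directly.
    System.out.println(a.compareTo(b) < 0);  // true

    // byte[] does not implement Comparable, and equals() is reference equality,
    // so a custom wrapper (or java.util.Arrays utilities) would be needed instead.
    byte[] x = {1, 2, 3};
    byte[] y = {1, 2, 3};
    System.out.println(x.equals(y));         // false
  }
}
```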

@parmitam
Contributor

parmitam commented Sep 5, 2017

looks good!

@senderista
Contributor

OK, I guess we can address the design warts when we have a compelling use case.

@senderista senderista merged commit e5b4b2b into master Sep 8, 2017
@senderista senderista deleted the fix_tbb_column_size_with_blob branch September 8, 2017 22:03