Writing flatbuffers to Ceph. #3

Closed
jlefevre opened this issue Jun 29, 2018 · 1 comment

@jlefevre

This issue covers a client app that writes flatbuffers to Ceph, building on our previous flatbuffers design and results. Using the TPC-H lineitem table data, write a client that repeatedly calls getnextrow(), hashes each row to a bucket, and flushes buckets to Ceph. Each bucket contains a FlatBufferBuilder, and buckets map 1:1 to Ceph objects; assume there are n objects. This code and logic will be used in our foreign data wrapper.

  1. Assume a primary key is always provided and is integral. Create composite keys of 2 columns for testing.

  2. To map rows to buckets, hash the key with the jump consistent hash algorithm from the paper linked below, given the known number of buckets n. The bucket number will also serve as the object_id (oid) for now.
    https://arxiv.org/pdf/1406.2294

  3. A bucket contains a flatbuffer that is built as rows arrive. Since there is 1 bucket per object and there may be millions of objects, keeping a bucket open for every object may be prohibitive in terms of memory on the client machine. Although this adds some overhead, for now buckets should be created on demand, rows added as they arrive, and the entire bucket freed once it is flushed to Ceph. Assume a bucket is flushed when its number of rows exceeds the flush_rows parameter.

  4. The bucket should maintain some statistics. When adding a row to a bucket, update a few stats such as min/max/counts for several columns. Then when flushing the bucket data (flatbuffer) to Ceph, the bucket statistics should be written to omap for the corresponding object.

  5. Once getnextrow() has no more rows to return, iterate over the remaining open buckets and flush them to Ceph.

  6. For now, just append flatbuffers to their corresponding objects when flushing to Ceph.
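
Steps 1 to 5 can be sketched roughly as follows. This is a minimal illustration in Python, not the project's code: `composite_key`, `Bucket`, `write_rows`, and the `flush` callback are hypothetical names, the key packing (high 32 bits from one column, low 32 bits from the other) is an assumption, and the bucket keeps a plain list of rows where the real client would hold a FlatBufferBuilder. Only `jump_hash` follows a published algorithm, the jump consistent hash from the paper linked in step 2.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate unsigned 64-bit arithmetic


def composite_key(col_a: int, col_b: int) -> int:
    """Pack two integral columns into one 64-bit key (assumed layout)."""
    return ((col_a << 32) | (col_b & 0xFFFFFFFF)) & MASK64


def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: map a 64-bit key to [0, num_buckets)."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


class Bucket:
    """Stand-in for a FlatBufferBuilder plus per-bucket statistics."""

    def __init__(self) -> None:
        self.rows: List[bytes] = []
        self.count = 0
        self.min_key: Optional[int] = None
        self.max_key: Optional[int] = None

    def add(self, key: int, row: bytes) -> None:
        self.rows.append(row)
        self.count += 1
        self.min_key = key if self.min_key is None else min(self.min_key, key)
        self.max_key = key if self.max_key is None else max(self.max_key, key)


def write_rows(
    rows: Iterable[Tuple[int, int, bytes]],
    num_objects: int,
    flush_rows: int,
    flush: Callable[[int, Bucket], None],
) -> None:
    """Hash each row to a bucket, flushing buckets as they fill up."""
    open_buckets: Dict[int, Bucket] = {}
    for col_a, col_b, payload in rows:
        key = composite_key(col_a, col_b)
        oid = jump_hash(key, num_objects)  # bucket number doubles as oid
        bucket = open_buckets.get(oid)
        if bucket is None:  # create buckets on demand
            bucket = open_buckets[oid] = Bucket()
        bucket.add(key, payload)
        if bucket.count > flush_rows:  # flush when rows exceed flush_rows
            flush(oid, open_buckets.pop(oid))  # bucket is freed after flush
    # No more rows: flush whatever is still open.
    for oid in sorted(open_buckets):
        flush(oid, open_buckets[oid])
    open_buckets.clear()
```

In a real client the `flush` callback would finish the flatbuffer, append it to the object named by `oid`, and write the bucket statistics to that object's omap; here it is left to the caller so the sketch stays independent of any Ceph API. Passing something like `lambda oid, b: store.setdefault(oid, []).append(b)` with a plain dict `store` is enough to try the logic locally.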

@jlefevre

Closing this since it is completed except for the Ceph write append, which moves to new issue #32. Statistics have also been separated out of this issue, and the stats structure is here.

The bulk of this issue (hashing rows into flatbuffer buckets and writing flatbuffers to local binary files as objects) was resolved via commit
