Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add blob type for arbitrary binary data #3581

Merged
merged 10 commits into from Oct 23, 2023
Merged

Add blob type for arbitrary binary data #3581

merged 10 commits into from Oct 23, 2023

Conversation

jachris
Copy link
Contributor

@jachris jachris commented Oct 17, 2023

We currently use the string type both for printable strings and for arbitrary binary data. However, this violates Arrow's UTF-8 requirements and leads to several other issues down the road (for example: it's not possible to print arbitrary blobs as JSON strings). Thus, this PR introduces a new blob type, that is designed to handle arbitrary binary data.

Internally, this binary data is directly stored as an arbitrary sequence of bytes (using Arrow's BinaryType for table slices). However, when the data enters or leaves Tenzir, we must have some special handling, depending on the format. For example, assume we want to print the blob b"\x00\xFF" as JSON. This blob cannot be represented as an escaped string. Thus, we base64 encode it, and print "AP8=". When parsing JSON, we cannot assume that all base64-valid strings are binary payloads. However, when a schema with a blob type is given, we base64 decode the blob to get b"\x00\xFF" back. Other formats might use different strategies to encode and decode blobs.

@jachris jachris changed the base branch from main to topic/upgrade-remaining-builders October 17, 2023 12:59
@jachris jachris added feature New functionality engine Core pipeline and storage engine labels Oct 17, 2023
@jachris jachris force-pushed the topic/blob-type branch 2 times, most recently from a19a5b8 to 98c3070 Compare October 17, 2023 14:10
@jachris jachris force-pushed the topic/upgrade-remaining-builders branch 5 times, most recently from 2a6ea81 to 84ffd7d Compare October 18, 2023 07:59
@jachris jachris force-pushed the topic/blob-type branch 4 times, most recently from 7360697 to bbad91d Compare October 18, 2023 08:55
@jachris jachris force-pushed the topic/blob-type branch 2 times, most recently from 8b21e51 to 30902aa Compare October 18, 2023 10:16
Base automatically changed from topic/upgrade-remaining-builders to main October 18, 2023 11:58
jachris added a commit that referenced this pull request Oct 18, 2023
This could have still waited a bit, but it will make
#3581 easier.
@jachris jachris force-pushed the topic/blob-type branch 9 times, most recently from c6f708a to e271fd2 Compare October 19, 2023 09:35
@jachris jachris force-pushed the topic/blob-type branch 7 times, most recently from 254fb5a to 31a7a21 Compare October 19, 2023 11:27
@jachris jachris marked this pull request as ready for review October 19, 2023 11:27
@jachris jachris force-pushed the topic/blob-type branch 2 times, most recently from cf297ff to 10c79d0 Compare October 19, 2023 13:00
Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice, thanks for hacking this together so quickly. I have some questions on the code and some answers to questions you left in the code. I'll play around with this in practice some more to test it further and will then approve it afterwards.

libtenzir/builtins/formats/pcap.cpp Outdated Show resolved Hide resolved
libtenzir/builtins/formats/zeek_tsv.cpp Show resolved Hide resolved
libtenzir/include/tenzir/arrow_table_slice.hpp Outdated Show resolved Hide resolved
libtenzir/include/tenzir/concept/printable/tenzir/blob.hpp Outdated Show resolved Hide resolved
libtenzir/include/tenzir/detail/base64.hpp Outdated Show resolved Hide resolved
libtenzir/include/tenzir/type.hpp Outdated Show resolved Hide resolved
libtenzir/include/tenzir/type.hpp Outdated Show resolved Hide resolved
libtenzir/src/format/zeek.cpp Show resolved Hide resolved
tenzir/integration/integration.py Show resolved Hide resolved
Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works well enough in practice to be approved. I think once the outstanding discussions are resolved we can merge this and be done with it.

@netantho
Copy link
Sponsor Contributor

This PR solves an issue of stalled pipeline when importing tshark data in (nd)JSON, thanks!

@jachris jachris merged commit d6f6393 into main Oct 23, 2023
39 of 41 checks passed
@jachris jachris deleted the topic/blob-type branch October 23, 2023 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
engine Core pipeline and storage engine feature New functionality
Projects
None yet
4 participants