Add `blob` type for arbitrary binary data #3581
This could have still waited a bit, but it will make #3581 easier.
This is really nice, thanks for hacking this together so quickly. I have some questions on the code and some answers to questions you left in the code. I'll play around with this in practice some more to test it further and will then approve it afterwards.
This works well enough in practice to be approved. I think once the outstanding discussions are resolved we can merge this and be done with it.
This PR solves an issue with a stalled pipeline when importing tshark data in (nd)JSON, thanks!
We currently use the `string` type both for printable strings and for arbitrary binary data. However, this violates Arrow's UTF-8 requirements and leads to several other issues down the road (for example: it's not possible to print arbitrary blobs as JSON strings). Thus, this PR introduces a new `blob` type that is designed to handle arbitrary binary data.

Internally, this binary data is directly stored as an arbitrary sequence of bytes (using Arrow's `BinaryType` for table slices). However, when the data enters or leaves Tenzir, we must have some special handling, depending on the format. For example, assume we want to print the blob `b"\x00\xFF"` as JSON. This blob cannot be represented as an escaped string. Thus, we base64 encode it, and print `"AP8="`. When parsing JSON, we cannot assume that all base64-valid strings are binary payloads. However, when a schema with a `blob` type is given, we base64 decode the blob to get `b"\x00\xFF"` back. Other formats might use different strategies to encode and decode blobs.
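
To illustrate the JSON round trip described above, here is a minimal Python sketch. It is not Tenzir's implementation: the `SCHEMA` mapping and the `to_json` / `from_json` helpers are hypothetical names made up for this example, and only the base64 behavior mirrors the description. Note that the `note` field holds a base64-valid string but stays a string, because the schema does not mark it as a `blob`.

```python
import base64
import json

# Hypothetical schema for illustration: field name -> type name.
SCHEMA = {"payload": "blob", "note": "string"}


def to_json(record, schema):
    """Print a record as JSON, base64-encoding fields typed as blob."""
    out = {}
    for key, value in record.items():
        if schema.get(key) == "blob":
            # Arbitrary bytes may not be valid UTF-8, so encode them.
            out[key] = base64.b64encode(value).decode("ascii")
        else:
            out[key] = value
    return json.dumps(out)


def from_json(text, schema):
    """Parse JSON, base64-decoding only fields the schema types as blob."""
    out = {}
    for key, value in json.loads(text).items():
        if schema.get(key) == "blob":
            out[key] = base64.b64decode(value, validate=True)
        else:
            # Without a blob type in the schema, the string is kept as-is,
            # even if it happens to be valid base64.
            out[key] = value
    return out


record = {"payload": b"\x00\xff", "note": "AP8="}
encoded = to_json(record, SCHEMA)
decoded = from_json(encoded, SCHEMA)
```

Here `encoded` carries `"AP8="` for both fields, and only `payload` is turned back into bytes on parsing, which is why the schema is needed to disambiguate.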