
Avro: Add decoder #38

Merged · 29 commits · Feb 13, 2022

Conversation

@xentripetal (Contributor) commented Dec 28, 2021

Full avro OCF support. Handles all primitive, complex, and logical types besides decimals.

Able to handle deflate, snappy, and null codecs for blocks.

Requirements for WIP removal:

  • Support common logical types (date, decimal, duration, time, timestamp)
  • Add test case with all avro datatypes
  • Evaluate viability of splitting avro datum into a subdecoder
  • Cleanup OCF header decoding
  • Cleanup schema decoding
  • Finalize design around handling OCF codecs. (Currently only the null codec is decoded; the rest treat the datum as raw bytes and won't decode them.)

@wader (Owner) left a comment

Thanks and nice work, looks good overall I think. I'm sorry the documentation is lacking at the moment. How confusing/frustrating was it to write a decoder? What do you think I should focus on when writing a decoder-writing guide etc?

Review comments on: format/avro/ocf.go, format/avro/codecs/long.go, format/avro/schema/schema.go, format/avro/testdata/twitter.fqtest, format/avro/codecs/codec.go, format/avro/codecs/bytes.go, format/avro/codecs/enum.go
@wader (Owner) commented Dec 28, 2021

BTW you can use make doc to regenerate some .md files (requires some tools, graphviz, ffmpeg) and make lint to run linter.

@xentripetal (Contributor, Author) commented:

> BTW you can use make doc to regenerate some .md files (requires some tools, graphviz, ffmpeg) and make lint to run linter.

Cool! I've updated this PR with the regenerated docs

> How confusing/frustrating was it to write a decoder? What do you think I should focus on when writing a decoder-writing guide etc?

It took a few rewrites to get it working, and constant digging through other decoders and pkg/decode/decode.go.

I feel like the most important areas to cover in early docs would be Scalars, Compounds, and Values. Covering both their relationships to each other and how they are interpreted in the final output.

Additionally, anything taking interface{} could use more docs. But all in all I think the actual design is very well done and has a lot of extensibility. Awesome work!

@wader (Owner) commented Dec 30, 2021

@xentripetal nice fixes and feedback! I'm out traveling atm so will be a bit slow at reviewing, will answer at more length in a bit

Review comments on: format/avro/decoders/array.go, format/avro/decoders/record.go, format/avro/ocf.go, format/avro/testdata/twitter.fqtest
@wader (Owner) commented Dec 30, 2021

Had some time to think and review a bit, looking better and better. Will continue the review later. Nice with the TODO list.

@xentripetal (Contributor, Author) commented Dec 31, 2021

@wader Bringing this discussion out of the code block review and into the comments to make it easier to follow:

> I've been thinking about format decoders that themselves are encoding some generic data. Maybe they should have some helper functions in jq to convert the decoded structure into something that it represents.

> Is this deflate as in deflate compressed? If so you could probably uncompress and add it as a new buffer if that would be useful

Yeah, the data field can be compressed in deflate, snappy, bzip2, xz, and zstandard.

I skipped adding support for that right now as I wasn't certain what the fq-ish way to resolve those would be.

If you just hide the fact you decompressed it and only show the decompressed data, then you're not seeing the real binary mapping. It also means you can't tie a field back to a range in the binary data (might be possible, but would be incredibly complex), which IMO kills a lot of the coolness of fq.

I saw that for some compression formats you decode the compression metadata/fields and add an uncompressed field containing the uncompressed data. Doing that here is completely feasible, but it would mean that common commands someone would use on a certain avro file would differ from another depending on its codec.

I think it ties back to what you were suggesting around the _torepr convention, where you would have a representation/semantic layer that hides the compression details. The default representation would include it and you would have to query the uncompressed field to get the actual data.

However, there are already existing tools for interacting with the meaning of the data. It might be out of scope for this project to also add representation parsing. It raises the question of which formats deserve a representation layer: you could parse out PNGs and try to display them, as that's their representation, though that would be silly to do.

On the design idea of having a jq phrase to go to the representation, e.g.

  ( .blocks
  | map(
      ( .data
      | map(with_entries(select(.key | endswith("_len") | not)))
      )
    )
  )

I don't have enough experience with jq to know if it can handle complex mappings/rules. Is it possible to make a generic phrase that pulls from the data.uncompressed field in all the blocks if avro.codec is defined, and otherwise just pulls the data field? There's also the issue that avro doesn't require a struct/record to be the top-level schema, so the data field can just be an array of ints or an array of complex records. Finally, there's an ambiguity in some data types where you could never know which was which based on just the output.

e.g. Map is just an array of key value pairs. You would want the representation of map to be a map/struct. But if someone manually specified a schema as an array of key value pairs, you would want the representation to be an array of key value pairs.

I believe the only way to solve this is to have the schema as context when parsing; trying to make a transformation ruleset defined entirely on the output would be impossible. Though I might just be overthinking it.

Thoughts?
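To make the map-vs-array ambiguity above concrete: both an Avro map and a user-declared array of key/value records can decode to the same pair list, and only with the schema in hand would a representation layer know whether to fold it. A small illustration (types and names here are hypothetical, purely for the example):

```go
package main

import "fmt"

// kv mirrors how a decoded Avro map (or a user-declared array of
// key/value records) might appear in the decoded output: an ordered
// list of pairs. From the output alone the two cases are identical.
type kv struct {
	Key   string
	Value int
}

// foldToMap builds the map "representation" from the pair list.
// Whether to call this at all is a schema-level decision: fold if the
// schema says "map", keep the list if it says "array of records".
func foldToMap(pairs []kv) map[string]int {
	m := make(map[string]int, len(pairs))
	for _, p := range pairs {
		m[p.Key] = p.Value // later duplicates win, as in a map write
	}
	return m
}

func main() {
	pairs := []kv{{"a", 1}, {"b", 2}}
	fmt.Println(foldToMap(pairs))
}
```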

@wader (Owner) commented Jan 1, 2022

> @wader Bringing this discussion out of the code block review and into the comments to make it easier to follow:

Good idea, sorry about that. Would be nice if github allowed "normal" comments when doing a draft review somehow.

> Yeah, the data field can be compressed in deflate, snappy, bzip2, xz, and zstandard.
>
> I skipped adding support for that right now as I wasn't certain what the fq-ish way to resolve those would be.
>
> If you just hide the fact you decompressed it and only show the decompressed data, then you're not seeing the real binary mapping. It also means you can't tie a field back to a range in the binary data (might be possible, but would be incredibly complex), which IMO kills a lot of the coolness of fq.

Yes, the best fq can currently do for this is that it allows new buffers inside a decode tree. It won't let you map locations back in any way, but it can give some hint of where in the "parent" buffer the data originated. You will notice in a decode value "dump" tree that the address column is indented per "buffer" level to give some hint.

> I saw that for some compression formats you decode the compression metadata/fields and add an uncompressed field containing the uncompressed data. Doing that here is completely feasible, but it would mean that common commands someone would use on a certain avro file would differ from another depending on its codec.
>
> I think it ties back to what you were suggesting around the _torepr convention, where you would have a representation/semantic layer that hides the compression details. The default representation would include it and you would have to query the uncompressed field to get the actual data.
>
> However, there are already existing tools for interacting with the meaning of the data. It might be out of scope for this project to also add representation parsing. It raises the question of which formats deserve a representation layer: you could parse out PNGs and try to display them, as that's their representation, though that would be silly to do.

Yeah I think you're right, a representation layer only makes sense if there is an obvious way to do it and it does not involve too much work. Maybe in some cases it's better to have per-format documentation with some hints and snippets for how to iterate/traverse the structure?

I've actually joked that fq should try to show you everything a normal user would not want to see :)

> On the design idea of having a jq phrase to go to the representation, e.g.
>
>   ( .blocks
>   | map(
>       ( .data
>       | map(with_entries(select(.key | endswith("_len") | not)))
>       )
>     )
>   )

> I don't have enough experience with jq to know if it can handle complex mappings/rules. Is it possible to make a generic phrase that pulls from the data.uncompressed field in all the blocks if avro.codec is defined, and otherwise just pulls the data field?

Ignoring that we should probably not do it: yes I think so, jq the language is very capable so it's probably possible. I should really write an introduction to jq aimed at how to use it with fq. Also I'm using https://github.com/wader/vscode-jq which makes it a lot nicer to write a lot of jq code... should really clean up and package that.

> There's also the issue that avro doesn't require a struct/record to be the top-level schema, so the data field can just be an array of ints or an array of complex records. Finally, there's an ambiguity in some data types where you could never know which was which based on just the output. e.g. Map is just an array of key value pairs. You would want the representation of map to be a map/struct. But if someone manually specified a schema as an array of key value pairs, you would want the representation to be an array of key value pairs.
>
> I believe the only way to solve this is to have the schema as context when parsing; trying to make a transformation ruleset defined entirely on the output would be impossible. Though I might just be overthinking it.
>
> Thoughts?

Sounds like we should skip trying to make this too smart; let's first make a really good basic decoder and then see how it's used or what feels missing? Really appreciate the work and thinking you've put into this!

Hope you had a great new years!

@wader (Owner) commented Jan 8, 2022

Hey, hope you're doing great. Let me know if there was something I missed answering.

@xentripetal (Contributor, Author) commented:

Thanks mate, all good. Just haven't had spare time to finish this PR yet.

@wader (Owner) commented Jan 8, 2022

👍 Good to hear. No stress, and I don't think my messing around in master should cause many more conflicts than it already has if you let it rest for a while

@wader (Owner) commented Jan 13, 2022

Hey, I did some cleanup/sorting related to the list of formats. It probably caused some conflicts, but after resolving them I think they will be less likely to happen

Review comments on: format/avro/decoders/enum.go, format/avro/decoders/fixed.go, format/avro/decoders/logicaltypes.go, format/avro/decoders/long.go, format/avro/decoders/null.go
@wader (Owner) commented Jan 16, 2022

It seems like doc/file.mp4 got removed?

@xentripetal xentripetal changed the title WIP: Avro: Add decoder Avro: Add decoder Feb 8, 2022
@xentripetal (Contributor, Author) commented:

Hey sorry for the delay, got distracted with some other projects. Finished out all the WIP goals I set for myself.

Ready for full review whenever you have the time!

@Doctor-love (Collaborator) commented:

Would be exciting to see this and #51 land in the next release - nice work, @xentripetal !

@wader (Owner) commented Feb 8, 2022

Oh sorry, I noticed now that I was commenting on an old commit

@wader (Owner) commented Feb 8, 2022

> Hey sorry for the delay, got distracted with some other projects. Finished out all the WIP goals I set for myself.
>
> Ready for full review whenever you have the time!

Great stuff!

BTW if you want to add documentation specific to avro_ocf you can add an avro_ocf.md file and then make doc will include it in the formats documentation. For example limitations, things not implemented etc, or just examples of how to use it.


Limitations:
- Schema does not support self-referential types, only built-in types.
- Decimal logical types are not supported for decoding; they will just be treated as their primitive type
@wader (Owner) left a review comment:

👍 This puts the other format documentation to shame. Good idea to have links to the specification etc; if you're using fq for something, it's quite likely you want as much detail and help as possible :)

I've usually collected links and useful tools etc as comments in the go files, but much of that can probably move out to the documentation.

@wader (Owner) commented Feb 10, 2022

> It does but I'm fairly sure it requires that the compressed data be in a different framing than what avro puts them as https://github.com/google/snappy/blob/main/framing_format.txt
>
> I don't know this for certain but every time I tried to use the snappy.NewReader(...) it resulted in it erroring out with invalid data.
>
> Edit:
>
> Yeah the docs for Reader say
>
> // Reader handles the Snappy stream format, not the Snappy block format.
>
> and docs for Decode say
>
> // Decode handles the Snappy block format, not the Snappy stream format.

Aha yeah, that could be it. I've run into a similar issue with bzip2 (there I let the reader read a bit earlier I think). We can leave it as it is for now and maybe revisit it later. Possibly add a comment about it

@wader (Owner) commented Feb 10, 2022

> Ah yeah that makes sense. Swapped to bitio.ReadFull
>
> Missed all those functions since I normally just go off methods for feature discovery.

Yeah, that is probably the best way atm so no worries. I'm sorry I haven't had time to write proper decoder documentation yet.

@wader (Owner) commented Feb 10, 2022

> Nice! Wasn't aware of that feature. Yeah the full dump is over 8k lines. Swapped to truncate

Sorry again :) I should also add more documentation about all the options. Things have been a bit in flux, so I've skipped/forgotten documenting them until I felt it was stable or good enough.

@wader (Owner) commented Feb 10, 2022

Seems the fqtest files need an update after the UTC change

const intMask = byte(127)
const intFlag = byte(128)
@wader (Owner) left a review comment:

Could use 0b111_1111 and 0b1000_0000 to make it even more explicit? Is the byte cast needed btw?

return s, nil
}

// Todo Decimal: https://github.com/linkedin/goavro/blob/master/logical_type.go
@wader (Owner) left a review comment:

There is BigInt support now, so that could be used for this I think? But we can do that later maybe

@xentripetal (Contributor, Author) replied:

I think it would require big.Rat support

@wader (Owner) commented Feb 10, 2022

Have a look at my last comments; after that I think we're good to merge. Do you have anything left you want to fix?

@xentripetal (Contributor, Author) commented:

I think it's good to merge. In the future, adding decimals and user-defined types in schema decoding would be good, but not really required.

@wader wader merged commit db9c2e3 into wader:master Feb 13, 2022
@wader (Owner) commented Feb 13, 2022

🥳 Thanks a lot for this and hope for more contributions in the future!

@wader (Owner) commented Feb 13, 2022

BTW I will hopefully do a release quite soon, trying to keep it to around every 2-3 weeks for now.
