
Avro: Add decoder #38

Merged · 29 commits · Feb 13, 2022

Conversation

@xentripetal (Contributor) commented Dec 28, 2021

Full avro OCF support. Handles all primitive, complex, and logical types besides decimals.

Able to handle deflate, snappy, and null codecs for blocks.

Requirements for WIP removal:

  • Support common logical types (date, decimal, duration, time, timestamp)
  • Add test case with all avro datatypes
  • Evaluate viability of splitting avro datum into a subdecoder
  • Cleanup OCF header decoding
  • Cleanup schema decoding
  • Finalize design around handling OCF codecs. (Currently only the null codec is decoded; the rest treat the datum as raw bytes and won't decode them.)

@wader (Owner) left a comment

Thanks and nice work, looks good overall I think. I'm sorry the documentation is lacking at the moment. How confusing/frustrating was it to write a decoder? What do you think I should focus on when writing a decoder-writing guide etc?

Review comments on: format/avro/ocf.go, format/avro/codecs/long.go, format/avro/schema/schema.go, format/avro/testdata/twitter.fqtest, format/avro/codecs/codec.go, format/avro/codecs/bytes.go, format/avro/codecs/enum.go
@wader (Owner) commented Dec 28, 2021

BTW you can use make doc to regenerate some .md files (requires some tools, graphviz, ffmpeg) and make lint to run linter.

@xentripetal (Contributor, Author) commented:

> BTW you can use make doc to regenerate some .md files (requires some tools, graphviz, ffmpeg) and make lint to run linter.

Cool! I've updated this PR with the regenerated docs

> How confusing/frustrating was it to write a decoder? What do you think I should focus on when writing a decoder-writing guide etc?

It took a few rewrites to get it working, and constant digging through other decoders and pkg/decode/decode.go.

I feel like the most important areas to cover in early docs would be Scalars, Compounds, and Values. Covering both their relationships to each other and how they are interpreted in the final output.

Additionally, anything taking interface{} could use more docs. But all in all I think the actual design is very well done and has a lot of extensibility. Awesome work!

@wader (Owner) commented Dec 30, 2021

@xentripetal nice fixes and feedback! I'm out traveling atm so will be a bit slow at reviewing, will answer at more length in a bit

Review comments on: format/avro/decoders/array.go, format/avro/decoders/record.go, format/avro/ocf.go, format/avro/testdata/twitter.fqtest
@wader (Owner) commented Dec 30, 2021

Had some time to think and review a bit, looking better and better. Will continue the review later. Nice with the TODO list.

@xentripetal (Contributor, Author) commented Dec 31, 2021

@wader Bringing this discussion out of the code block review and into the comments to make it easier to follow:

> I've been thinking about format decoders that themselves are encoding some generic data. Maybe they should have some helper functions in jq to convert the decoded structure into something that it represents.

> Is this deflate as in deflate compressed? If so you could probably uncompress and add it as a new buffer if that would be useful

Yeah, the data field can be compressed in deflate, snappy, bzip2, xz, and zstandard.

I skipped adding support for that right now as I wasn't certain what the fq-ish way to resolve those would be.

If you just hide the fact you decompressed it and only show the decompressed data, then you're not seeing the real binary mapping. It also means you can't tie a field back to a range in the binary data (might be possible, but would be incredibly complex), which IMO kills a lot of the coolness of fq.

I saw that for some compression formats you decode the compression metadata/fields and add an uncompressed field containing the uncompressed data. Doing that here is completely feasible, but it would mean that common commands someone would use on a certain avro file would differ from another depending on its codec.

I think it ties back to what you were suggesting around the _torepr convention, where you would have a representation/semantic layer that hides the compression details. The default representation would include it and you would have to query the uncompressed field to get the actual data.

However, there are already existing tools for interacting with the meaning of the data. It might be out of scope for this project to also add representation parsing. It raises the question of which formats deserve a representation layer: you could parse out PNGs and try to display them, as that's their representation, though that would be silly to do.

On the design idea of having a jq phrase to go to the representation, e.g.

  ( .blocks
  | map(
      ( .data
      | map(with_entries(select(.key | endswith("_len") | not)))
      )
    )
  )

I don't have enough experience with jq to know if it can handle complex mappings/rules. Is it possible to make a generic phrase that pulls from the data.uncompressed field in all the blocks if avro.codec is defined, and otherwise just pulls the data field? There's also the issue that avro doesn't require a struct/record to be the top-level schema, so the data field can just be an array of ints or an array of complex records. Finally, there's an ambiguity in some data types where you could never know which was which based on just the output.

e.g. Map is just an array of key value pairs. You would want the representation of map to be a map/struct. But if someone manually specified a schema as an array of key value pairs, you would want the representation to be an array of key value pairs.

I believe the only way to solve this is to have the schema as context when parsing; trying to make a transformation ruleset defined entirely on the output would be impossible. Though I might just be overthinking it.

Thoughts?
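To make the map-vs-array ambiguity above concrete: both an Avro map and a user-declared array of key/value records can decode to the same pair list, and only with the schema in hand would a representation layer know whether to fold it. A small illustration (types and names here are hypothetical, purely for the example):

```go
package main

import "fmt"

// kv mirrors how a decoded Avro map (or a user-declared array of
// key/value records) might appear in the decoded output: an ordered
// list of pairs. From the output alone the two cases are identical.
type kv struct {
	Key   string
	Value int
}

// foldToMap builds the map "representation" from the pair list.
// Whether to call this at all is a schema-level decision: fold if the
// schema says "map", keep the list if it says "array of records".
func foldToMap(pairs []kv) map[string]int {
	m := make(map[string]int, len(pairs))
	for _, p := range pairs {
		m[p.Key] = p.Value // later duplicates win, as in a map write
	}
	return m
}

func main() {
	pairs := []kv{{"a", 1}, {"b", 2}}
	fmt.Println(foldToMap(pairs))
}
```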

@wader (Owner) commented Jan 1, 2022

> @wader Bringing this discussion out of the code block review and into the comments to make it easier to follow:

Good idea, sorry about that. Would be nice if github allowed "normal" comments when doing a draft review somehow.

> Yeah, the data field can be compressed in deflate, snappy, bzip2, xz, and zstandard.
>
> I skipped adding support for that right now as I wasn't certain what the fq-ish way to resolve those would be.
>
> If you just hide the fact you decompressed it and only show the decompressed data, then you're not seeing the real binary mapping. It also means you can't tie a field back to a range in the binary data (might be possible, but would be incredibly complex), which IMO kills a lot of the coolness of fq.

Yes, the best fq can currently do for this is that it allows new buffers inside a decode tree. It won't let you map locations back in any way, but it can give some hint of where in the "parent" buffer the data originated. You will notice in a decode value "dump" tree that the address column is indented per "buffer" level to give some hint.

> I saw that for some compression formats you decode the compression metadata/fields and add an uncompressed field containing the uncompressed data. Doing that here is completely feasible, but it would mean that common commands someone would use on a certain avro file would differ from another depending on its codec.
>
> I think it ties back to what you were suggesting around the _torepr convention, where you would have a representation/semantic layer that hides the compression details. The default representation would include it and you would have to query the uncompressed field to get the actual data.
>
> However, there are already existing tools for interacting with the meaning of the data. It might be out of scope for this project to also add representation parsing. It raises the question of which formats deserve a representation layer: you could parse out PNGs and try to display them, as that's their representation, though that would be silly to do.

Yeah I think you're right, a representation layer only makes sense if there is an obvious way to do it and it does not involve too much work. Maybe in some cases it's better to have per-format documentation with some hints and snippets for how to iterate/traverse the structure?

I've actually joked that fq should try to show you everything a normal user would not want to see :)

> On the design idea of having a jq phrase to go to the representation, e.g.
>
>   ( .blocks
>   | map(
>       ( .data
>       | map(with_entries(select(.key | endswith("_len") | not)))
>       )
>     )
>   )

> I don't have enough experience with jq to know if it can handle complex mappings/rules. Is it possible to make a generic phrase that pulls from the data.uncompressed field in all the blocks if avro.codec is defined, and otherwise just pulls the data field?

Ignoring that we should probably not do it: yes I think so, jq the language is very capable so it's probably possible. I should really write an introduction to jq aimed at how to use it with fq. Also I'm using https://github.com/wader/vscode-jq which makes it a lot nicer to write a lot of jq code... should really clean up and package that.

> There's also the issue that avro doesn't require a struct/record to be the top-level schema, so the data field can just be an array of ints or an array of complex records. Finally, there's an ambiguity in some data types where you could never know which was which based on just the output. e.g. Map is just an array of key value pairs. You would want the representation of map to be a map/struct. But if someone manually specified a schema as an array of key value pairs, you would want the representation to be an array of key value pairs.
>
> I believe the only way to solve this is to have the schema as context when parsing; trying to make a transformation ruleset defined entirely on the output would be impossible. Though I might just be overthinking it.
>
> Thoughts?

Sounds like we should skip trying to make this too smart; let's first make a really good basic decoder and then see how it's used or what feels missing? Really appreciate the work and thinking you've put into this!

Hope you had a great new years!

@wader (Owner) commented Jan 8, 2022

Hey, hope you're doing great. Let me know if there was something I missed answering.

@xentripetal (Contributor, Author) commented:

Thanks mate, all good. Just haven't had spare time to finish this PR yet.

@wader (Owner) commented Jan 8, 2022

👍 Good to hear. No stress, and I don't think my messing around in master should cause many more conflicts than it already has if you let it rest for a while

@wader (Owner) commented Jan 13, 2022

Hey, I did some cleanup/sorting related to the list of formats. It probably caused some conflicts, but after resolving them I think they will be less likely to happen

Review comments on: format/avro/decoders/enum.go, format/avro/decoders/fixed.go, format/avro/decoders/logicaltypes.go, format/avro/decoders/long.go, format/avro/decoders/null.go
@wader (Owner) commented Jan 16, 2022

It seems like doc/file.mp4 got removed?

@xentripetal xentripetal changed the title WIP: Avro: Add decoder Avro: Add decoder Feb 8, 2022
@xentripetal (Contributor, Author) commented:

Hey sorry for the delay, got distracted with some other projects. Finished out all the WIP goals I set for myself.

Ready for full review whenever you have the time!

@Doctor-love (Collaborator) commented:

Would be exciting to see this and #51 land in the next release - nice work, @xentripetal !

@wader (Owner) commented Feb 8, 2022

Oh sorry, I noticed now that I was commenting on an old commit

@wader (Owner) commented Feb 8, 2022

> Hey sorry for the delay, got distracted with some other projects. Finished out all the WIP goals I set for myself.
>
> Ready for full review whenever you have the time!

Great stuff!

BTW if you want to add documentation specific to avro_ocf you can add an avro_ocf.md file and then make doc will include it in the formats documentation. For example limitations, things not implemented etc, or just examples of how to use it.


Limitations:
- Schema does not support self-referential types, only built-in types.
- Decimal logical types are not supported for decoding; they will just be treated as their primitive type
@wader (Owner) left a review comment:

👍 This puts the other format documentation to shame. Good idea to have links to the specification etc; if you're using fq for something, it's quite likely you want as much detail and help as possible :)

I've usually collected links and useful tools etc as comments in the go files, but much of that can probably move out to the documentation.

@wader (Owner) commented Feb 10, 2022

> It does but I'm fairly sure it requires that the compressed data be in a different framing than what avro puts them as https://github.com/google/snappy/blob/main/framing_format.txt
>
> I don't know this for certain but every time I tried to use the snappy.NewReader(...) it resulted in it erroring out with invalid data.
>
> Edit:
>
> Yeah the docs for Reader say
>
> // Reader handles the Snappy stream format, not the Snappy block format.
>
> and docs for Decode say
>
> // Decode handles the Snappy block format, not the Snappy stream format.

Aha yeah, that could be it. I've run into a similar issue with bzip2 (there I let the reader read a bit earlier I think). We can leave it as it is for now and maybe revisit it later. Possibly add a comment about it

@wader (Owner) commented Feb 10, 2022

> Ah yeah that makes sense. Swapped to bitio.ReadFull
>
> Missed all those functions since I normally just go off methods for feature discovery.

Yeah, that is probably the best way atm so no worries. I'm sorry I haven't had time to write proper decoder documentation yet.

@wader (Owner) commented Feb 10, 2022

> Nice! Wasn't aware of that feature. Yeah the full dump is over 8k lines. Swapped to truncate

Sorry again :) I should also add more documentation about all the options. Things have been a bit in flux, so I've skipped/forgotten documenting them until I felt it was stable or good enough.

@wader (Owner) commented Feb 10, 2022

Seems the fqtest files need an update after the UTC change

const intMask = byte(127)
const intFlag = byte(128)
@wader (Owner) left a review comment:

Could use 0b111_1111 and 0b1000_0000 to make it even more explicit? Is the byte cast needed btw?

return s, nil
}

// Todo Decimal: https://github.com/linkedin/goavro/blob/master/logical_type.go
@wader (Owner) left a review comment:

There is BigInt support now, so that could be used for this I think? But we can do that later maybe

@xentripetal (Contributor, Author) replied:

I think it would require big.Rat support

@wader (Owner) commented Feb 10, 2022

Have a look at my last comments; after that I think we're good to merge. Do you have anything left you want to fix?

@xentripetal (Contributor, Author) commented:

I think it's good to merge. In the future, adding decimals and user-defined types in schema decoding would be good, but not really required.

@wader wader merged commit db9c2e3 into wader:master Feb 13, 2022
@wader (Owner) commented Feb 13, 2022

🥳 Thanks a lot for this and hope for more contributions in the future!

@wader (Owner) commented Feb 13, 2022

BTW I will hopefully do a release quite soon, trying to keep it to around every 2-3 weeks for now.
