Avro: Add decoder #38
Thanks and nice work, looks good overall i think. I'm sorry the documentation is lacking at the moment. How confusing/frustrating was it to write a decoder? What do you think i should focus on when writing a decoder writing guide etc.?
BTW you can use |
Cool! I've updated this PR with the regenerated docs
It took a few rewrites and constant digging through other decoders and pkg/decode/decode.go to get it working. I feel like the most important areas to cover in early docs would be Scalars, Compounds, and Values, covering both their relationships to each other and how they are interpreted in the final output. Additionally, anything taking interface{} could use some more docs. But all in all I think the actual design is very well done and has a lot of extensibility. Awesome work!
@xentripetal nice fixes and feedback! I'm out traveling atm so will be a bit slow at reviewing, will answer longer in a bit
Had some time to think and review a bit, looking better and better. Will continue review later. Nice with the TODO list.
@wader Bringing this discussion out of the code block review and into the comments to make it easier to follow:

Yeah, the data field can be compressed in deflate, snappy, bzip2, xz, and zstandard. I skipped adding support for that right now as I wasn't certain what the fq-ish way to resolve those would be. You could just hide the fact that you decompressed it and only show the decompressed data, but then you're not seeing the real binary mapping. This also means you can't tie a field back to a range in the binary data (might be possible but would be incredibly complex), which IMO kills a lot of the coolness of fq. I saw for some compression formats you decode the compression metadata/fields and add a

I think it ties back to what you were suggesting around the _torepr convention, where you would have a representation/semantic layer that hides the compression details. The default representation would include it and you would have to query the uncompressed field to get the actual data. However, there are already existing tools for interacting with the meaning of the data. It might be out of scope for this project to also add representation parsing to it. I think it raises the question of which formats deserve a representation layer. You could parse out PNGs and try to display them, as that's their representation, though that would be silly to do.

On the design idea of having a jq phrase to go to the representation, e.g.
I don't have enough experience with jq to know if it can handle complex mappings/rules. Is it possible to make a generic phrase that can pull from the e.g. Map is just an array of key value pairs. You would want the representation of map to be a map/struct. But if someone manually specified a schema as an array of key value pairs, you would want the representation to be an array of key value pairs. I believe the only way to solve this is to have the context of the schema when parsing it; trying to make a transformation ruleset defined entirely on the output would be impossible. Though I might just be overthinking it. Thoughts?
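To make the map point above concrete, here is a minimal Go sketch of a schema-aware representation step. It is only an illustration: `Pair`, `represent`, and the string schema type names are hypothetical and are not part of fq or of this PR. The same decoded list of key/value pairs becomes a map only when the schema says it was an Avro map, which is the context an output-only ruleset would not have.

```go
package main

import "fmt"

// Pair is a hypothetical decoded key/value entry, as a decoder might
// emit it for an Avro map block.
type Pair struct {
	Key   string
	Value interface{}
}

// represent converts pairs to a map only when the schema type is "map".
// An array of records that merely looks like key/value pairs is left alone.
func represent(schemaType string, pairs []Pair) interface{} {
	if schemaType != "map" {
		return pairs
	}
	m := make(map[string]interface{}, len(pairs))
	for _, p := range pairs {
		m[p.Key] = p.Value
	}
	return m
}

func main() {
	pairs := []Pair{{"a", 1}, {"b", 2}}
	fmt.Println(represent("map", pairs))   // schema said map: collapse to a map
	fmt.Println(represent("array", pairs)) // schema said array: keep the pairs
}
```

Without the `schemaType` argument the two inputs are indistinguishable, so the decision has to be made where the schema is still available.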
Good idea, sorry about that. Would be nice if github allowed "normal" comments when doing a draft review somehow.
Yes, the best fq can currently do for this is to allow new buffers inside a decode tree. It won't let you map locations back in any way, but it can give some hint of where in the "parent" buffer it originated from. You will notice in a decode value "dump" tree that the address column is indented per "buffer" level to give some hint.
Yeah i think you're right, a representation layer only makes sense if there is an obvious way to do it and it does not involve too much work. Maybe in some cases it's better to have per-format documentation with some hints and snippets for how to iterate/traverse the structure? I've actually joked that fq should try to show you everything a normal user would not want to see :)
Ignoring that we should probably not do it: yes i think so, jq the language is very capable so it's probably possible. I should really write some introduction to jq aimed at how to use it with fq. Also i'm using https://github.com/wader/vscode-jq which makes it a lot nicer to write a lot of jq code... should really clean up and package that.
Sounds like we should skip trying to make this too smart, let's first make a really good basic decoder and then see how it's used or what feels missing? Really appreciate the work and thinking you put into this! Hope you had a great new years!
Hey, hope you're doing great. Let me know if there was something i missed answering or so
Thanks mate, all good. Just haven't had spare time to finish this PR yet.
👍 Good to hear. No stress, and i don't think my messing around in master should cause many more conflicts than it already has if you let it rest for a while
Hey, i did some cleanup/sorting related to the list of formats. It probably caused some collisions, but after resolving them i think they will be less likely to happen
It seems like |
…he first place? Maybe some macos shenanigans.
Hey, sorry for the delay, got distracted with some other projects. Finished all the WIP goals I set for myself. Ready for full review whenever you have the time!
Would be exciting to see this and #51 land in the next release - nice work, @xentripetal !
Oh sorry, i noticed now that i was commenting on an old commit
Great stuff! BTW if you want to add documentation specific to |
Limitations:
- Schema does not support self-referential types, only built-in types.
- Decimal logical types are not supported for decoding, will just be treated as their primitive type
👍 this puts the other format documentation to shame. Good idea to have links to the specification etc. If you're using fq for something, it's quite likely you want to have as much detail and help as possible :)
I usually collect links and useful tools etc. as comments in the go-files, but much of that can probably move out to documentation.
Aha yeah that could be it, i've run into a similar issue with bzip2 (there i let the reader read a bit earlier i think). We can leave it as it is i think, and maybe revisit it later. Possibly add a comment about it
Yeah that is probably the best way atm so no worries. I'm sorry i haven't had time to write proper decoder documentation yet.
Sorry again :) should also add more documentation about all options. Things have been a bit in flux so i have skipped/forgotten to document until i felt it was stable or good enough.
Seems the fqtest files need an update after the UTC change
format/avro/decoders/long.go
Outdated
)

const intMask = byte(127)
const intFlag = byte(128)
Could use 0b111_1111 and 0b1000_0000 to make it even more explicit? Is the byte cast needed btw?
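As an illustration of the suggestion above, here is a minimal sketch of Avro's variable-length zig-zag `long` decoding written with binary literals. It is not the code from this PR: `decodeLong`, `payloadMask`, and `continueFlag` are names chosen for the example. Because the constants are untyped, they can be combined with a `byte` without an explicit cast.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	payloadMask  = 0b0111_1111 // low 7 bits of each byte carry data
	continueFlag = 0b1000_0000 // high bit set means another byte follows
)

// decodeLong reads one zig-zag encoded varint from buf and returns the
// decoded value together with the number of bytes consumed.
func decodeLong(buf []byte) (int64, int, error) {
	var u uint64
	for i, b := range buf {
		if i >= 10 {
			return 0, 0, errors.New("varint longer than 10 bytes")
		}
		// Groups are stored least-significant first.
		u |= uint64(b&payloadMask) << (7 * uint(i))
		if b&continueFlag == 0 {
			// Undo zig-zag: 0, 1, 2, 3 map to 0, -1, 1, -2.
			return int64(u>>1) ^ -int64(u&1), i + 1, nil
		}
	}
	return 0, 0, errors.New("unexpected end of input")
}

func main() {
	for _, in := range [][]byte{{0x00}, {0x01}, {0x02}, {0x80, 0x01}} {
		v, n, err := decodeLong(in)
		fmt.Println(v, n, err)
	}
}
```

The last input, `80 01`, is the two-byte encoding of 64, the first value that no longer fits in a single byte.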
	return s, nil
}

// Todo Decimal: https://github.com/linkedin/goavro/blob/master/logical_type.go
There is BigInt support now so could be used for this i think? but we can do that later maybe
I think it would require big.Rat support
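For context on why a rational type comes up here: an Avro decimal is stored as the big-endian two's-complement bytes of an unscaled integer, with the scale taken from the schema. A minimal sketch using Go's `math/big` follows; `decodeDecimal` is a hypothetical helper for illustration only and says nothing about how fq would expose such a value.

```go
package main

import (
	"fmt"
	"math/big"
)

// decodeDecimal interprets raw as a big-endian two's-complement unscaled
// integer and returns unscaled / 10^scale as an exact rational.
// scale is assumed to be non-negative, as in an Avro schema.
func decodeDecimal(raw []byte, scale int) *big.Rat {
	unscaled := new(big.Int).SetBytes(raw) // SetBytes reads raw as unsigned
	if len(raw) > 0 && raw[0]&0x80 != 0 {
		// Sign bit set: subtract 2^(8*len) to get the negative value.
		mod := new(big.Int).Lsh(big.NewInt(1), uint(8*len(raw)))
		unscaled.Sub(unscaled, mod)
	}
	denom := new(big.Int).Exp(big.NewInt(10), big.NewInt(int64(scale)), nil)
	return new(big.Rat).SetFrac(unscaled, denom)
}

func main() {
	fmt.Println(decodeDecimal([]byte{0x04, 0xD2}, 2).FloatString(2)) // 1234 with scale 2
	fmt.Println(decodeDecimal([]byte{0xFB, 0x2E}, 2).FloatString(2)) // -1234 with scale 2
}
```

A `big.Int` alone can hold the unscaled value; the rational (or a separately carried scale) is what preserves the exact fractional result.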
Have a look at my last comments, after that i think we're good to merge. Do you have anything left you want to fix?
I think it's good to merge. In the future I think adding decimals and user-defined types in schema decoding would be good, but not really required.
🥳 Thanks a lot for this and hope for more contributions in the future!
BTW hopefully will do a release quite soon, trying to keep it around every 2-3 weeks for now.
Full avro OCF support. Handles all primitive, complex, and logical types besides decimals.
Able to handle deflate, snappy, and null codecs for blocks.
Requirements for WIP removal:
- Support common logical types (date, decimal, duration, time, timestamp)
- Add test case with all avro datatypes
- Evaluate viability of splitting avro datum into a subdecoder
- Cleanup OCF header decoding
- Cleanup schema decoding
- Finalize design around handling OCF codecs. (Currently only handles null codec, rest treat the datum as raw bytes and won't decode them)
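For readers unfamiliar with the codecs mentioned in the description: Avro's deflate codec stores each block as a raw DEFLATE stream (RFC 1951, no zlib header or checksum), so Go's `compress/flate` can read it directly. The sketch below is a standalone illustration under that assumption, not the code merged in this PR; `inflateBlock` and `deflateBlock` are hypothetical helpers, and the second exists only to produce test input.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// inflateBlock decompresses a raw DEFLATE payload such as the data
// section of a deflate-coded block.
func inflateBlock(compressed []byte) ([]byte, error) {
	r := flate.NewReader(bytes.NewReader(compressed))
	defer r.Close()
	return io.ReadAll(r)
}

// deflateBlock compresses data so the example can round-trip without a
// real Avro file on disk.
func deflateBlock(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	packed, err := deflateBlock([]byte("avro block data"))
	if err != nil {
		panic(err)
	}
	plain, err := inflateBlock(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d compressed bytes -> %q\n", len(packed), plain)
}
```

As discussed earlier in the thread, the decompressed bytes live in a new buffer, so offsets inside them do not map back to positions in the original file.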