Codecs for variable length items #56
>>> import numcodecs
>>> import numpy as np
>>> x = np.array([b'foo', b'bar', b'baz'], dtype='object')
>>> codec = numcodecs.VLenUTF8()
Just caught this mismatch when reading the docs, added PR ( https://github.com/alimanfoo/numcodecs/pull/58 ) to fix it.
@alimanfoo , it occurs to me that other encodings in parquet-land might be even more efficient, although no parquet files seem to use them: you can put all the string lengths before the strings themselves, which gives you a well-compressible int32 block that could be further reduced with delta encoding or RLE. Just a thought for the future; the performance here is probably already good enough.
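To illustrate the suggestion above, here is a small sketch (with hypothetical sample lengths) of delta-encoding a lengths block. The deltas are mostly small values clustered around zero, which compress well and which RLE can collapse further:

```python
# Hypothetical block of string lengths pulled to the front of the buffer.
lengths = [3, 3, 3, 4, 4, 10, 3, 3]

# Delta encoding: store the first value, then successive differences.
deltas = [lengths[0]] + [b - a for a, b in zip(lengths, lengths[1:])]
# deltas == [3, 0, 0, 1, 0, 6, -7, 0] -- runs of zeros are ideal for RLE.

# Decoding is a running sum over the deltas.
restored = []
acc = 0
for d in deltas:
    acc += d
    restored.append(acc)
assert restored == lengths
```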
Thanks Martin, yes it would be great to explore these at some point; there are lots of good ideas for encoding in the Parquet format. Interestingly, when I was first looking into this I did try pulling all the string lengths into a block at the front. I didn't try any delta encoding or RLE, just a straight comparison between all string lengths at the front versus lengths interleaved with the values. After passing the encoded data through a compressor (Zstd), the interleaved layout gave a better compression ratio, I'm guessing because the string length then becomes part of the pattern for each string. That might change if you first applied RLE or delta encoding to the lengths block, but I thought it was interesting and not what I expected. It suggests that in situations (like Zarr) where you can assume the data will pass through a chain of codecs, with a compressor codec as the final stage, the emphasis for codecs earlier in the chain is less about reducing absolute size and more about creating good patterns in the data for the compressor to pick up on.
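The comparison described above can be sketched as follows. This uses zlib from the standard library as a stand-in for Zstd, and made-up sample strings, so the relative sizes are only illustrative, not a reproduction of the original measurement:

```python
import struct
import zlib

# Hypothetical sample data: many short strings of varied lengths.
values = [f"sample-{i % 97}".encode() for i in range(10_000)]

# Layout A: interleaved -- each item's uint32 length immediately
# precedes its bytes.
interleaved = b"".join(struct.pack("<I", len(v)) + v for v in values)

# Layout B: all uint32 lengths in one block at the front, then the
# raw bytes concatenated after.
lengths_first = (
    b"".join(struct.pack("<I", len(v)) for v in values)
    + b"".join(values)
)

# Compare compressed sizes (zlib here as a stand-in for Zstd).
size_a = len(zlib.compress(interleaved))
size_b = len(zlib.compress(lengths_first))
print(f"interleaved: {size_a} bytes, lengths-first: {size_b} bytes")
```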
That is interesting, thanks for sharing.
This PR implements codecs for variable length unicode strings (VLenUTF8), bytes (VLenBytes) and arrays (VLenArray), using the Parquet-style byte array encoding. Resolves #50, resolves #51, resolves #52.
TODO:
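The byte array encoding these codecs use can be sketched in pure Python. This is a simplified illustration of the general length-prefixed layout, not the exact numcodecs wire format, and the function names are hypothetical:

```python
import struct

def encode_vlen(items):
    """Encode a list of bytes objects as: uint32 item count, then
    (uint32 length, raw bytes) for each item -- a sketch of the
    Parquet-style byte array layout."""
    out = [struct.pack("<I", len(items))]
    for item in items:
        out.append(struct.pack("<I", len(item)))
        out.append(item)
    return b"".join(out)

def decode_vlen(buf):
    """Invert encode_vlen: read the count, then walk the buffer
    reading each length prefix and slicing out the payload."""
    (n,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    items = []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(bytes(buf[offset:offset + length]))
        offset += length
    return items

data = [b'foo', b'bar', b'baz']
assert decode_vlen(encode_vlen(data)) == data
```

Because every item carries its own length prefix, the decoder never needs a delimiter scan, and empty items cost only the four-byte prefix.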