@alimanfoo alimanfoo commented Dec 5, 2017

This PR implements codecs for variable length unicode strings (VLenUTF8), bytes (VLenBytes) and arrays (VLenArray), using the Parquet-style byte array encoding. Resolves #50, resolves #51, resolves #52.
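For readers unfamiliar with the Parquet-style byte array layout, here is a minimal standard-library sketch of the idea: each item is written as a 4-byte little-endian length followed by its raw bytes. The helper names `encode_byte_arrays` and `decode_byte_arrays` are hypothetical and are not part of numcodecs; the actual codecs in this PR may add a header or differ in other details.

```python
import struct


def encode_byte_arrays(items):
    """Hypothetical helper: write each item as <4-byte LE length><bytes>."""
    out = bytearray()
    for item in items:
        out += struct.pack('<I', len(item))  # length prefix
        out += item                          # raw payload
    return bytes(out)


def decode_byte_arrays(buf):
    """Hypothetical helper: inverse of encode_byte_arrays."""
    items = []
    pos = 0
    while pos < len(buf):
        (n,) = struct.unpack_from('<I', buf, pos)
        pos += 4
        items.append(buf[pos:pos + n])
        pos += n
    return items


# Unicode strings are UTF-8 encoded first, then stored as byte arrays.
texts = ['foo', 'bar', 'bazzz']
encoded = encode_byte_arrays([t.encode('utf-8') for t in texts])
decoded = [b.decode('utf-8') for b in decode_byte_arrays(encoded)]
```

Because every item carries its own length, items of any size can share one buffer, at a cost of 4 bytes of overhead per item before compression.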

TODO:

@alimanfoo alimanfoo changed the title from "WIP codecs for variable length items" to "Codecs for variable length items" Dec 5, 2017
@alimanfoo alimanfoo merged commit a2da7a2 into master Dec 5, 2017
@alimanfoo alimanfoo deleted the vlen-20171204 branch December 5, 2017 16:49
>>> import numcodecs
>>> import numpy as np
>>> x = np.array([b'foo', b'bar', b'baz'], dtype='object')
>>> codec = numcodecs.VLenUTF8()

Just caught this mismatch when reading the docs (the example builds an array of bytes values but uses the VLenUTF8 codec); added a PR ( https://github.com/alimanfoo/numcodecs/pull/58 ) to fix it.

@martindurant

@alimanfoo, it occurs to me that other encodings in parquet-land might be even more efficient, although no parquet files seem to use them. You can store all the string lengths before the strings themselves, which gives you a well-compressible int32 block that could be reduced further with delta encoding or RLE. Just a thought for the future; the performance here is probably already good enough.
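A rough sketch of the lengths-first idea, using only the standard library: the item count comes first, then one int32 block holding the delta-encoded lengths, then all the string bytes concatenated. The function names and the exact layout here are assumptions made for illustration; this is not what the PR implements, and a real encoder would presumably bit-pack or compress the length block further.

```python
import struct


def encode_lengths_first(items):
    """Hypothetical layout: <count><int32 length deltas><concatenated bytes>."""
    lengths = [len(item) for item in items]
    # Delta encoding: store each length as the difference from the previous one.
    deltas = [cur - prev for prev, cur in zip([0] + lengths[:-1], lengths)]
    header = struct.pack('<I', len(items))
    length_block = struct.pack('<%di' % len(deltas), *deltas)
    return header + length_block + b''.join(items)


def decode_lengths_first(buf):
    """Hypothetical inverse of encode_lengths_first."""
    (n,) = struct.unpack_from('<I', buf, 0)
    deltas = struct.unpack_from('<%di' % n, buf, 4)
    pos = 4 + 4 * n  # payload starts after the count and the length block
    items = []
    length = 0
    for delta in deltas:
        length += delta  # undo the delta encoding
        items.append(buf[pos:pos + length])
        pos += length
    return items


values = [b'foo', b'bar', b'baz']
packed = encode_lengths_first(values)
unpacked = decode_lengths_first(packed)
```

When the strings have similar lengths, most deltas are zero or small, so the length block is highly repetitive and should compress well with RLE or a general-purpose compressor.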

alimanfoo commented Dec 7, 2017 via email

martindurant commented Dec 7, 2017

That is interesting, thanks for sharing.
