Codecs for variable length items #56
>>> import numcodecs
>>> import numpy as np
>>> x = np.array([b'foo', b'bar', b'baz'], dtype='object')
>>> codec = numcodecs.VLenUTF8()
Just caught this mismatch when reading the docs, added PR ( https://github.com/alimanfoo/numcodecs/pull/58 ) to fix it.
@alimanfoo , it occurs to me that other encodings in parquet-land might be even more efficient, although no parquet files seem to use them: you can put all the string lengths before the strings themselves, which gives you a well-compressible int32 block that could be further reduced with delta encoding or RLE. Just a thought for the future; the performance here is probably already good enough.
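To illustrate the suggestion above, here is a small sketch (with hypothetical sample lengths) of delta-encoding a lengths block. The deltas are mostly small values clustered around zero, which compress well and which RLE can collapse further:

```python
# Hypothetical block of string lengths pulled to the front of the buffer.
lengths = [3, 3, 3, 4, 4, 10, 3, 3]

# Delta encoding: store the first value, then successive differences.
deltas = [lengths[0]] + [b - a for a, b in zip(lengths, lengths[1:])]
# deltas == [3, 0, 0, 1, 0, 6, -7, 0] -- runs of zeros are ideal for RLE.

# Decoding is a running sum over the deltas.
restored = []
acc = 0
for d in deltas:
    acc += d
    restored.append(acc)
assert restored == lengths
```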
Thanks Martin, yes it would be great to explore these at some point; there are lots of good ideas for encoding in the Parquet format. Interestingly, when I was first looking into this I did try pulling all the string lengths into a block at the front. I didn't try any delta encoding or RLE, just a straight comparison between all string lengths at the front versus lengths interleaved with the values. After passing the encoded data through a compressor (Zstd), the interleaved layout gave a better compression ratio, I'm guessing because the string length then becomes part of the pattern for each string. That might change if you first applied RLE or delta encoding to the lengths block, but I thought it was interesting and not what I expected. It suggests that in situations (like Zarr) where you can assume the data will pass through a chain of codecs, with a compressor codec as the final stage, the emphasis for codecs earlier in the chain is less about reducing absolute size and more about creating good patterns in the data for the compressor to pick up on.
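The comparison described above can be sketched as follows. This uses zlib from the standard library as a stand-in for Zstd, and made-up sample strings, so the relative sizes are only illustrative, not a reproduction of the original measurement:

```python
import struct
import zlib

# Hypothetical sample data: many short strings of varied lengths.
values = [f"sample-{i % 97}".encode() for i in range(10_000)]

# Layout A: interleaved -- each item's uint32 length immediately
# precedes its bytes.
interleaved = b"".join(struct.pack("<I", len(v)) + v for v in values)

# Layout B: all uint32 lengths in one block at the front, then the
# raw bytes concatenated after.
lengths_first = (
    b"".join(struct.pack("<I", len(v)) for v in values)
    + b"".join(values)
)

# Compare compressed sizes (zlib here as a stand-in for Zstd).
size_a = len(zlib.compress(interleaved))
size_b = len(zlib.compress(lengths_first))
print(f"interleaved: {size_a} bytes, lengths-first: {size_b} bytes")
```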
That is interesting, thanks for sharing.
This PR implements codecs for variable length unicode strings (VLenUTF8), bytes (VLenBytes) and arrays (VLenArray), using the Parquet-style byte array encoding. Resolves #50, resolves #51, resolves #52.
TODO:
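The byte array encoding these codecs use can be sketched in pure Python. This is a simplified illustration of the general length-prefixed layout, not the exact numcodecs wire format, and the function names are hypothetical:

```python
import struct

def encode_vlen(items):
    """Encode a list of bytes objects as: uint32 item count, then
    (uint32 length, raw bytes) for each item -- a sketch of the
    Parquet-style byte array layout."""
    out = [struct.pack("<I", len(items))]
    for item in items:
        out.append(struct.pack("<I", len(item)))
        out.append(item)
    return b"".join(out)

def decode_vlen(buf):
    """Invert encode_vlen: read the count, then walk the buffer
    reading each length prefix and slicing out the payload."""
    (n,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    items = []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(bytes(buf[offset:offset + length]))
        offset += length
    return items

data = [b'foo', b'bar', b'baz']
assert decode_vlen(encode_vlen(data)) == data
```

Because every item carries its own length prefix, the decoder never needs a delimiter scan, and empty items cost only the four-byte prefix.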