Add support for byte shuffle as a codec so that it can be used as a Zarr filter #260

Closed
pbranson opened this issue Nov 26, 2020 · 7 comments

Comments

@pbranson
Contributor

Enhancement Request

Following on from discussion here:
fsspec/kerchunk#11
It would be great to expose the Blosc library's shuffle operations as a numcodecs Codec so that shuffle could be included as a Zarr filter.

This will assist with efforts to extend the fsspec ReferenceFileSystem to a broader range of datasets stored in HDF format on S3.

The plan would be to expose https://github.com/Blosc/c-blosc/blob/9fae1c9acb659159321aca69aefcdbce663e2374/blosc/shuffle.h as a Cython shuffle.pyx module in numcodecs.

If this sounds like a good enhancement I would be happy to try making a PR for this.
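
For readers unfamiliar with the operation, here is a minimal sketch of what a byte shuffle does and why it can help a downstream compressor. It uses only the Python standard library; `byte_shuffle` is a hypothetical helper written for this illustration, not the c-blosc or numcodecs API, and the integer data is made up.

```python
import struct
import zlib


def byte_shuffle(buf: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on.

    Illustrative helper only; assumes len(buf) is a multiple of itemsize.
    """
    return b"".join(buf[i::itemsize] for i in range(itemsize))


# Two 4-byte elements: [0, 1, 2, 3] and [10, 11, 12, 13].
tiny = bytes([0, 1, 2, 3, 10, 11, 12, 13])
print(list(byte_shuffle(tiny, 4)))  # [0, 10, 1, 11, 2, 12, 3, 13]

# 1000 slowly varying little-endian int32 values (illustrative data only).
raw = struct.pack("<1000i", *range(1000, 2000))
shuffled = byte_shuffle(raw, 4)

# Compare how well zlib compresses the raw and the shuffled bytes.
raw_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))
print(raw_size, shuffled_size)
```

Shuffling puts the rarely changing high-order bytes next to each other, which is why HDF5 and NetCDF4 writers often apply it before compression.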

@alimanfoo
Member

Hi @pbranson, just to say this sounds like a good idea to me, PR welcome.

@rsignell-usgs

@pbranson, this is exciting! I think many NetCDF4 files "in the wild" will have shuffle turned on, as that appears to be the default for the NCO tools.

@pbranson
Contributor Author

pbranson commented Dec 9, 2020

Thanks for the encouragement @rsignell-usgs! I have made a start but am getting hit by a pre-Christmas crunch. Hopefully I will have a PR ready for review by early next week.

@pbranson
Contributor Author

pbranson commented Dec 10, 2020

I have had a go at exporting the internal Blosc shuffle functions. However, the buffers used by those functions are passed as unsigned char* pointers, presumably because of the byte shuffle. This produces warnings at compile time, as the Buffer convenience class uses a char* pointer.

The C function header is:
blosc_internal_shuffle(const size_t bytesoftype, const size_t blocksize, const uint8_t* _src, const uint8_t* _dest)

I have some branches of numcodecs
https://github.com/pbranson/numcodecs/tree/shufflecodec
and c-blosc
https://github.com/pbranson/c-blosc/tree/exportshuffle
Both compile. However, encode just returns a byte array filled with zeros when run with the following test code:

import numpy as np
from numcodecs import blosc
from numcodecs.blosc import Blosc, Shuffle

codec = Shuffle()
arr = np.random.normal(loc=1000, scale=1, size=(10,))
enc = codec.encode(arr)
dec = codec.decode(enc)

print(arr)
print(enc)
print(dec)

which outputs:

[ 999.11031606 1001.00931215  998.04921042  998.75484377  999.72977095
 1000.98084361 1002.04415172 1000.65107064  998.97131428 1000.28149948]
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b"\xe4\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00numcodecs.compat_extd module 'numcodecs' has no attribute 'vlen' (most likely due to a circular "

Clearly the encode output buffer isn't being written to. I tried a few different ways of declaring the Cython pointer as unsigned char*, but that conflicts with the types used in the Buffer convenience class...

I am wondering whether this path is worth pursuing, or whether directly coding a simpler shuffle that doesn't make use of hardware optimisations might be more achievable and sufficient for now?

This is my first foray into CPython/Cython, so I would be grateful for any advice.
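
As a rough sketch of the "simpler shuffle" alternative mentioned above, the plain-loop version below has no hardware optimisations. `shuffle_bytes` and `unshuffle_bytes` are hypothetical names chosen for this example, not functions from numcodecs or c-blosc, and copying trailing bytes unchanged is a choice made for this sketch. The round-trip check at the end is the kind of test that would catch an output buffer that is never written.

```python
import struct


def shuffle_bytes(src: bytes, itemsize: int) -> bytes:
    """Byte-shuffle src, treating it as consecutive itemsize-byte elements."""
    count = len(src) // itemsize
    dest = bytearray(len(src))
    for i in range(itemsize):        # byte position within an element
        for j in range(count):       # element index
            dest[i * count + j] = src[j * itemsize + i]
    # Assumption for this sketch: bytes that do not fill a whole
    # element are copied through unchanged.
    tail = count * itemsize
    dest[tail:] = src[tail:]
    return bytes(dest)


def unshuffle_bytes(src: bytes, itemsize: int) -> bytes:
    """Inverse of shuffle_bytes."""
    count = len(src) // itemsize
    dest = bytearray(len(src))
    for i in range(itemsize):
        for j in range(count):
            dest[j * itemsize + i] = src[i * count + j]
    tail = count * itemsize
    dest[tail:] = src[tail:]
    return bytes(dest)


# Five float64 values (illustrative), packed little-endian.
values = [999.11, 1001.01, 998.05, 998.75, 999.73]
raw = struct.pack("<5d", *values)
enc = shuffle_bytes(raw, 8)
dec = unshuffle_bytes(enc, 8)

assert len(enc) == len(raw)   # shuffle never changes the length
assert any(enc)               # non-zero input must not encode to all zeros
assert dec == raw             # decode(encode(x)) restores the input
```

Pure-Python loops like these are slow on large buffers, so a compiled variant (Cython, Numba, or similar) would be needed for practical use.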

@rabernat
Contributor

For deep C-related issues in numcodecs I nearly always tag @jakirkham! 😉

@pbranson
Contributor Author

pbranson commented Feb 17, 2021

Sorry this has taken a while to get back to!

I abandoned the approach of trying to use c-blosc as I couldn't see a clear path forward, and resorted to using Numba instead. If one of the main devs could take a look at PR #273 and give some guidance on whether this is an acceptable approach, that would be great.

@pbranson
Contributor Author

Closed by #273
