
Data array is flattened in numcodecs, which reduces the compression ratio that ZFP can provide on multi-dimensional arrays #303

Closed
halehawk opened this issue Feb 8, 2022 · 35 comments · Fixed by #307

Comments

@halehawk
Contributor

halehawk commented Feb 8, 2022

Minimal, reproducible code sample, a copy-pastable example if possible

In ensure_contiguous_ndarray (numcodecs/compat.py):

    if arr.flags.c_contiguous or arr.flags.f_contiguous:
        # can flatten without copy
        arr = arr.reshape(-1, order='A')

Problem description

ZFP is a lossy compressor that achieves better compression ratios on multi-dimensional arrays. Currently, in zfpy.encode, the ensure_contiguous_ndarray function flattens the array before feeding it to the ZFP compressor, so ZFP achieves a lower compression ratio than it otherwise could. I tried adding a switch to ensure_contiguous_ndarray that skips the flattening when the codec is ZFP. However, this only works for C-order arrays; it fails for F-order arrays because the ZFP decoder only returns C-order arrays. When the decoder returns a C-order multi-dimensional array for an array that was originally in F order, the ZFP codec cannot pass the F-order encode/decode round-trip test.
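To illustrate the effect, here is a minimal sketch using zfpy's public compress_numpy API (the field and tolerance value are arbitrary choices for demonstration): compressing the 3-D array directly lets zfp exploit correlation along all three dimensions, while compressing a flattened view exposes only 1-D correlation and typically yields a noticeably larger payload.

    import numpy as np
    import zfpy

    grid = np.linspace(0, np.pi, 64)
    x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
    a = np.sin(x) * np.sin(y) * np.sin(z)  # smooth 3-D field

    # zfp sees all three dimensions here...
    nd_size = len(zfpy.compress_numpy(a, tolerance=1e-3))

    # ...but only a 1-D stream here, as numcodecs currently hands it.
    flat_size = len(zfpy.compress_numpy(a.reshape(-1), tolerance=1e-3))

    print(nd_size, flat_size)  # flat_size is typically much larger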

Version and installation information

Please provide the following:

  • numcodecs.__version__: 0.9.0
  • Python version: 3.7, 3.8
  • Operating system: Linux/Windows
  • Installed via pip
@halehawk
Contributor Author

halehawk commented Feb 8, 2022

@lindstro @jakirkham please add your comments.

@lindstro

lindstro commented Feb 8, 2022

I believe there are multiple independent issues here.

First, arrays should not be flattened when used with zfp. I see no need even to make arrays contiguous, as zfp supports strided compression.

Second, it is true that zfp returns decompressed arrays in C order (the default layout in NumPy), regardless of the original layout. zfp does not capture the original layout, nor is it clear that it should. For instance, if the array being compressed is not contiguous (think array-of-struct storage), would we really want to return a decompressed array full of holes? We did consider supporting strided decompression with zfpy.decompress_numpy(), but as I recall there were some issues with this. That said, I believe one could use zfpy._decompress() to decompress into an existing strided array passed as out argument.
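A sketch of the strided-destination idea using only the public API (zfpy._decompress is internal, so its exact signature should be checked against the installed zfpy version; the storage layout below is a hypothetical stand-in for array-of-struct storage):

    import numpy as np
    import zfpy

    a = np.random.rand(16, 16)
    compressed = zfpy.compress_numpy(a, tolerance=1e-6)

    # Public-API route: decompress, then copy into a strided destination.
    # Passing `out=` to zfpy._decompress, as suggested above, would avoid
    # this extra copy.
    storage = np.empty((16, 32))  # hypothetical interleaved storage
    dest = storage[:, ::2]        # strided view with "holes"
    np.copyto(dest, zfpy.decompress_numpy(compressed))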

Third, it is a problem if zfp's tests break when given an array to compress in Fortran order, and I'd be surprised if this were not handled properly. @halehawk, can you provide a simple example that illustrates such a bug? If this is indeed the case, then we need to fix the zfp implementation/tests, not numcodecs.

@rabernat
Contributor

rabernat commented Feb 8, 2022

Sounds like we could fix the flattening pretty easily with a new keyword argument, flatten=True, here:

    def ensure_contiguous_ndarray(buf, max_buffer_size=None):

All the existing codecs would work the same with the default flatten=True, but we could call it from zfp here:

    buf = ensure_contiguous_ndarray(buf)

with flatten=False.
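A minimal sketch of what that could look like (simplified; the real ensure_contiguous_ndarray in numcodecs.compat also normalizes buffers and memoryviews):

    import numpy as np

    def ensure_contiguous_ndarray(buf, max_buffer_size=None, flatten=True):
        arr = np.asarray(buf)
        if not (arr.flags.c_contiguous or arr.flags.f_contiguous):
            arr = np.ascontiguousarray(arr)
        if flatten:
            # flattening a contiguous array returns a no-copy view
            arr = arr.reshape(-1, order='A')
        if max_buffer_size is not None and arr.nbytes > max_buffer_size:
            raise ValueError(f"array too large: {arr.nbytes} > {max_buffer_size}")
        return arr

The zfp codec would then call ensure_contiguous_ndarray(buf, flatten=False) to preserve the array's shape.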

@halehawk
Contributor Author

halehawk commented Feb 8, 2022 via email

@halehawk halehawk closed this as completed Feb 9, 2022
@halehawk halehawk reopened this Feb 9, 2022
@rabernat
Contributor

rabernat commented Feb 9, 2022

So why don't we just disallow F order for the zfp codec? I am becoming more and more convinced that it is unnecessary to support it. See discussion in zarr-developers/community#41.

@halehawk
Contributor Author

halehawk commented Feb 9, 2022

@rabernat I agree with you; it would be great if we could skip the Fortran-array test here. I tested xarray.open_dataset with the netcdf4 backend, and it always opens a netCDF file as a C-order data array, even if the file was originally written in F order. I didn't check the hdf5 or scipy backends. If anyone can confirm that those behave the same way, then I think it is fine to skip the Fortran-array test in test_zfpy.py.
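The check amounts to something like this (file and variable names are placeholders):

    import xarray as xr

    ds = xr.open_dataset("example.nc")  # hypothetical file
    arr = ds["var"].values              # hypothetical variable name
    print(arr.flags["C_CONTIGUOUS"], arr.flags["F_CONTIGUOUS"])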

@rabernat
Contributor

rabernat commented Feb 9, 2022

HDF5 and NetCDF simply do not support F order on disk at all. I think it's fine for a particular codec not to support F-order memory layout; users who happen to create F-order data can always transform it. We just need to raise an informative error when the codec sees F-order input.
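A minimal sketch of such a guard (hypothetical helper; the actual check and message would live in the numcodecs zfpy codec's encode path):

    import numpy as np

    def _reject_fortran_order(arr: np.ndarray) -> None:
        # Raise a clear error instead of silently round-tripping F order to C order.
        if arr.ndim > 1 and arr.flags.f_contiguous and not arr.flags.c_contiguous:
            raise ValueError(
                "zfp codec does not support Fortran-ordered arrays; "
                "convert with numpy.ascontiguousarray() first"
            )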

@lindstro

lindstro commented Feb 9, 2022

> HDF5 and NetCDF simply do not support F order on disk at all.

I partially agree. HDF5 imposes C order and (to my knowledge) does not support strided storage, where strides are recorded in the file. I believe it does not support strides on reads either, e.g., to force F order. But you can always store F order data in HDF5 by simply reversing the order of dimensions.
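For example (a minimal NumPy sketch), an F-ordered array viewed with its dimensions reversed is already C-contiguous, so no copy is needed:

    import numpy as np

    a = np.asfortranarray(np.arange(24).reshape(2, 3, 4))
    b = a.T  # shape (4, 3, 2): same bytes reinterpreted, now C-contiguous
    assert b.flags.c_contiguous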

zfpy actually examines the NumPy array's strides and passes them on to zfp. Because zfp uses F order, zfpy reverses the dimensions in anticipation that in most cases, the NumPy order will be the default C order, which promotes good data locality during compression.

Now, the original memory layout is not recorded by zfp, just as it is absent from HDF5. Therefore, if upon decompression the user wants to recover the array in the same order in which it was compressed, that information would have to be stored separately and the data permuted.

That said, it is not clear to me why it's necessary to ensure that compressed and decompressed data appear in the same order. Both arrays will agree on the dimensions of the array, just not how it is organized in memory. Is there a compelling reason to guarantee that both arrays have the same layout? After all, numcodecs forces both contiguous and flat storage, so the original memory layout is already lost, no?

@jakirkham
Member

Ryan's suggestion seems reasonable.

I forget whether we discussed this, but does zfp work with N-D data? If not, what happens if it is fed an array with N greater than what zfp supports?

Also how would this work for decompression? Does zfp return an array of the right dimensionality?

@lindstro

Only when 1 <= N <= 4. :-) For higher-dimensional arrays, you need to consolidate some dimensions (ideally the ones that are least correlated). This could for instance be done using a reshape operation, though you'd probably want to keep track of the dimensions of the original array.
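For example, a minimal sketch of consolidating a 5-D array down to zfp's 4-D limit (which dimensions to merge is a hypothetical choice here; ideally it would be the least-correlated ones):

    import numpy as np

    a = np.random.rand(10, 3, 64, 64, 64)  # e.g., time, channel, z, y, x
    orig_shape = a.shape                    # record so the array can be restored

    # Merge the two leading (assumed least-correlated) dimensions.
    a4 = a.reshape(-1, *orig_shape[2:])     # shape (30, 64, 64, 64)

    # ... compress/decompress a4 with zfpy ...
    restored = a4.reshape(orig_shape)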

We're planning to support higher-dimensional arrays in future versions of zfp. It is unlikely that the compression scheme would support more than four dimensions (it's unusual to have data that is correlated in that many dimensions), but the API would allow you to describe higher-dimensional arrays and to identify the smoothest dimensions along which to compress (may be less than four).

The current zfp C API does not accommodate N > 4; the zfpy module will raise an exception if asked to compress a NumPy array with N > 4.

@jakirkham
Member

Thanks for the info Peter :)

Follow up question, when the data is decompressed, will it return the expected N-D array or is it 1-D (IOW up to the user to reshape)?

Side note: In microscopy it is common to have 5-D data (TZYXC or time + 3D spatial + channel).

@lindstro

> Thanks for the info Peter :)
>
> Follow up question, when the data is decompressed, will it return the expected N-D array or is it 1-D (IOW up to the user to reshape)?

zfpy will return a decompressed N-D NumPy array of the same shape as the array that was compressed. However, the way zfpy currently interfaces with numcodecs, it is handed a 1D reshaped array to compress, which is the issue at hand. In this case, zfpy will also return a 1D decompressed array. Moreover, compression is severely handicapped by not being able to exploit higher-dimensional correlations.
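For example, the standalone zfpy round trip preserves shape (minimal sketch; the tolerance value is arbitrary):

    import numpy as np
    import zfpy

    a = np.random.rand(8, 16, 32)
    compressed = zfpy.compress_numpy(a, tolerance=1e-6)
    b = zfpy.decompress_numpy(compressed)
    assert b.shape == a.shape  # shape is recovered from the zfp stream header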

> Side note: In microscopy it is common to have 5-D data (TZYXC or time + 3D spatial + channel).

Right, though I suspect there's little correlation across channels. In radiation transport, we deal with 6D data (3 spatial + 2 angular + energy group) that is correlated in all six dimensions, but there are diminishing returns in compressing more than three dimensions (at least as far as zfp is concerned). I agree that zfp should handle such data sets, but probably in the API layer and not in the compression algorithm itself.

@halehawk
Contributor Author

halehawk commented Feb 10, 2022 via email

@halehawk
Contributor Author

I just created PR #307; it built and tested successfully on Windows/Linux, though I believe it still skips the macOS tests for numcodecs.zfpy. Hopefully @rabernat @jakirkham can approve the merge soon.

@lindstro

> Peter, I saw your zfpy pip package has Python 3.9 and 3.10 versions; does your 3.9 macOS version work on Mac? Did you merge my PR for building that pip package?

I'm getting a dlopen() error using the 3.9 macOS package, which I think is something we saw earlier and what your PR addresses.

I wasn't involved in generating these wheels, but it looks like we still have an outstanding PR from you. I'm sorry for dropping the ball on this. I think we ran into issues with the PR not being tested and I may have been waiting on this to get resolved. Since then, we've switched over from Travis to GitHub Actions, though it appears that we're still not testing on Mac.

@da-wad Would you have some time to work with @halehawk on getting her PR tested, ideally by adding testing on macOS? We're very close to the next zfp release and it would be great to have this worked out before we build Mac wheels for the next version.

@da-wad

da-wad commented Feb 14, 2022

Building wheels for Linux and Windows turned out to be way easier in GitHub Actions than Travis+Multibuild (not to mention faster), so for ease of maintenance that should be used going forward. Sorry about this, @halehawk, but switching CI actually invalidates your PR now... However, reading the cibuildwheel docs, I think there is a pretty trivial way to include macOS wheels. All you should have to do is add the corresponding tags from this table to line 27 here: https://github.com/LLNL/zfpy-wheels/blob/master/.github/workflows/wheels.yml#L27

@halehawk
Contributor Author

halehawk commented Feb 15, 2022 via email

@da-wad

da-wad commented Feb 16, 2022

Ah, no, those are no longer used. I have removed them and updated the README. It should be no more than adding those tags!

Wheels for Windows are still made in AppVeyor, but this should be switched over to GitHub Actions too; I'd planned to do that with the ZFP 0.5.6 release.

@halehawk
Contributor Author

halehawk commented Feb 16, 2022 via email

@da-wad

da-wad commented Feb 18, 2022

Ah, yes. I only said "should be"... of course reality is more complicated.

Take a look at this then:
[screenshot]

@halehawk
Contributor Author

halehawk commented Feb 18, 2022 via email

@jakirkham
Member

This may benefit from copying some changes that are being made to the numcodecs wheels ( #309 )

@halehawk
Contributor Author

halehawk commented Feb 18, 2022 via email

@da-wad

da-wad commented Feb 24, 2022

@halehawk Many thanks for your PR. I had to fix it up a bit, as it appears that the PyPI upload needs to run on Linux and so has to be separated out from the wheel building. (See: https://github.com/pypa/gh-action-pypi-publish#non-goals)

Things seem to be working now... at least the job is failing at uploading pre-existing wheels (which it should). I think the existing macOS ones were broken, so we can fix them with 0.5.6. However, I wonder if you could test the 3.10 wheel on Mac?


@halehawk
Contributor Author

halehawk commented Feb 24, 2022 via email

@halehawk
Contributor Author

halehawk commented Feb 24, 2022 via email

@da-wad

da-wad commented Feb 24, 2022

Excellent! Sounds like this is fixed, and all we need is a new commit hash from @lindstro for a 0.5.6 tag :-)

I'll look into migrating the Windows builds out of AppVeyor and into GitHub Actions too...

@halehawk
Contributor Author

halehawk commented Feb 24, 2022 via email

@lindstro

It'll be a little while longer, I'm afraid, as I'm on travel this week and next. It's likely that we will release zfp as version 1.0.0 due to the many C/C++ ABI incompatibilities we've introduced. The Python interface will not change, however. Thanks to both of you for helping out with the macOS wheels.

@halehawk
Contributor Author

halehawk commented Feb 24, 2022 via email

@da-wad

da-wad commented Feb 25, 2022

Yes, they were rebuilt and the upload was attempted; as I said previously, "the job is failing at uploading pre-existing wheels (which it should)". Any file uploaded to PyPI cannot be replaced, so that nothing that does work can be broken... I guess this is a feature of PyPI you weren't aware of. Unfortunately, this means that 0.5.5 is broken forever for Python 3.9 and 3.8 on macOS...

We could get around this by decoupling the zfpy version number from the zfp one, but since a 1.0.0 release is imminent, I don't know if that is necessary or wise.

@halehawk
Contributor Author

halehawk commented Feb 25, 2022 via email

@jakirkham
Member

@da-wad, one could use *.postX for very simple fixes (IOW non-code changes where packages were just broken). It is also possible to yank broken packages on PyPI.

@da-wad

da-wad commented Feb 28, 2022

Yanking the entire 0.5.5 release because of two broken wheels seems a little extreme... @jakirkham!

However, thanks for the nudge to find the proper way to do this. It turns out that there is a build tag in the wheel naming convention for solving exactly this problem: https://packaging.python.org/en/latest/specifications/binary-distribution-format/#file-name-convention
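For illustration (hypothetical filenames): renaming zfpy-0.5.5-cp39-cp39-macosx_10_9_x86_64.whl to zfpy-0.5.5-1-cp39-cp39-macosx_10_9_x86_64.whl inserts the build tag 1, which lets PyPI accept the rebuilt wheel alongside the broken original of the same version.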

Therefore... I was able to download the artifacts from the zfpy-wheels build, rename the two macosx wheels which weren't uploaded and use twine to pop them onto PyPI manually. Do these work for you now?
[screenshot]

@halehawk
Contributor Author

halehawk commented Feb 28, 2022 via email
