Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression failures reading fsspec ReferenceFileSystem file with zarr 2.13.6 #1324

Closed
cgohlke opened this issue Jan 19, 2023 · 9 comments
Closed
Labels
bug Potential issues with the zarr-python library

Comments

@cgohlke
Copy link
Contributor

cgohlke commented Jan 19, 2023

Zarr version

2.13.6

Numcodecs version

0.11.0

Python Version

3.10.9, any version

Operating System

Windows

Installation

pip

Description

Fsspec ReferenceFileSystem files for image file sequences produced by tifffile can no longer be read correctly by zarr 2.13.6. Version 2.13.3 worked fine.

Tifffile's test_tifffile::test_write_fsspec_sequence and earthbigdata.py fail with zarr 2.13.6.

It looks like the chunks are no longer correctly mapped to their indices.

Steps to reproduce

Example

import tifffile
import fsspec
import zarr

from tifffile.numcodecs import register_codec

register_codec()

print(f'tifffile {tifffile.__version__}')
print(f'fsspec {fsspec.__version__}')
print(f'zarr {zarr.__version__}')

# generate file sequence
for i in range(16):
    for j in range(16):
        tifffile.imwrite(f'test_{i}_{j}.tif', [i * j])

expected = tifffile.imread('*.tif', pattern=r'_(\d+)_(\d+)')

# write fsspec ReferenceFileSystem file
with tifffile.imread('*.tif', pattern=r'_(\d+)_(\d+)', aszarr=True) as store:
    store.write_fsspec('test.json', '')

# get zarr array from reference file
mapper = fsspec.get_mapper('reference://', fo='test.json', target_protocol='file')
from_zarr = zarr.open(mapper, mode='r')[:]

# compare images
from matplotlib import pyplot

pyplot.subplot(211)
pyplot.imshow(expected)
pyplot.subplot(212)
pyplot.imshow(from_zarr)
pyplot.show()

Output

tifffile 2022.10.10
fsspec 2022.11.0
zarr 2.13.6

zarr_issue

Additional output

test.json.zip

@cgohlke cgohlke added the bug Potential issues with the zarr-python library label Jan 19, 2023
@jakirkham
Copy link
Member

Thanks Christoph! Sorry for the trouble

Is it possible to run git bisect?

@cgohlke
Copy link
Contributor Author

cgohlke commented Jan 19, 2023

Is it possible to run git bisect?

#1304

4e633ad9aa434304296900790c4c65e0fa0dfa12 is the first bad commit
commit 4e633ad9aa434304296900790c4c65e0fa0dfa12
Author: Rafal Wojdyla <ravwojdyla@gmail.com>
Date:   Thu Dec 22 09:33:45 2022 +0900

    Handle fsspec.FSMap using FSStore store (#1304)

 docs/release.rst              |  2 ++
 zarr/_storage/v3.py           | 10 ++++++++++
 zarr/storage.py               | 19 +++++++++++++++++++
 zarr/tests/test_storage.py    |  5 +++++
 zarr/tests/test_storage_v3.py |  5 +++++
 5 files changed, 41 insertions(+)
bisect found first bad commit

@jakirkham
Copy link
Member

Thanks! 🙏

@ravwojdyla @martindurant, do either of you have thoughts here?

@martindurant
Copy link
Member

I believe the main point of that PR was to provide listdir() on what otherwise would be an unstructured dict-like interface. I don't know why that would cause a difference, though.

@rabernat
Copy link
Contributor

My guess would be that the FSMap created by get_mapper is not being properly converted into an FSStore by this code

zarr-python/zarr/storage.py

Lines 144 to 151 in 385b5d3

if isinstance(store, fsspec.FSMap):
return FSStore(store.root,
fs=store.fs,
mode=mode,
check=store.check,
create=store.create,
missing_exceptions=store.missing_exceptions,
**(storage_options or {}))

This is not tested with the reference filesystem. We hoped it would just work with all fsspec implementations. However, there is a lot of variability and special cases across different implementations. If I had to speculate further, I would guess that the reference filesystem FSMap has certain attributes (beyond check, create, and missing_exceptions) that are not copied over to the FSStore.

@rabernat
Copy link
Contributor

I believe the main point of that PR was to provide listdir() on what otherwise would be an unstructured dict-like interface.

The main point of #1304 was to fix the performance regression reported in #1296, which also resulted from the divergence between FSStore and FSMap.

@joshmoore
Copy link
Member

Thanks, @cgohlke, as always.

On the one hand, happy to help roll out a fix ASAP. Solutions that occur to me (beyond a PR) if anyone wants to raise them:

  • pull 2.13.6
  • revert the mentioned PRs

One things have settled, I'd also vote for having @cgohlke's code in an automated test somewhere. That can be an additional workflow here, or workflows in other repos that run with pip install --pre zarr so as to help catch these issues.

@ap--
Copy link

ap-- commented Jan 19, 2023

Hello everyone,

It seems it's a regression introduced in fsspec==2022.11.0 that was revealed with the new zarr version.

I had a look at this and found that changing:

https://github.com/fsspec/filesystem_spec/blame/b4661bbe85b106f065faa2ef390d6404f8cd6cc4/fsspec/implementations/reference.py#L353

in fsspec.implementations.reference to: sort=False,

Fixes the test code provided above.
Installing fsspec==2022.10.0 does too.

I can dig further later today.

@cgohlke
Copy link
Contributor Author

cgohlke commented Jan 19, 2023

Fixed in fsspec 2023.1.0. Thank you all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

No branches or pull requests

6 participants