Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand (and pass) nested FSStore tests #709

Merged
merged 19 commits into from
May 19, 2021

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Mar 11, 2021

This PR adds tests to FSStore(..., key_separator="/"). The first test checks that writing to a zarr array backed by a nested FSStore succeeds. This test fails without changes to FSStore.getitems -- the original getitems implementation normalizes keys, calls self.map.getitems(normalized_keys), and returns a dict with those normalized keys. It seems that the code calling FSStore.getitems() may not recognize these keys and thus discard all the values, producing an array with only fill value. I modified FSStore.getitems to return a dict with the input keys.

This PR also expands on a second test (reading the same array with a separate store instance) which currently fails. I will try to get this second test passing, but any pointers would be appreciated. cc @martindurant

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

…asses after changes to FSStore. The second test fails.
@pep8speaks
Copy link

pep8speaks commented Mar 11, 2021

Hello @d-v-b! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-04 15:44:03 UTC

@codecov
Copy link

codecov bot commented Mar 11, 2021

Codecov Report

Merging #709 (d25b2b2) into master (ab5d91b) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #709      +/-   ##
==========================================
+ Coverage   99.92%   99.94%   +0.01%     
==========================================
  Files          28       28              
  Lines       10408    10537     +129     
==========================================
+ Hits        10400    10531     +131     
+ Misses          8        6       -2     
Impacted Files Coverage Δ
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_core.py 100.00% <100.00%> (ø)
zarr/tests/test_hierarchy.py 100.00% <100.00%> (ø)
zarr/util.py 100.00% <100.00%> (ø)
zarr/creation.py 100.00% <0.00%> (+1.21%) ⬆️

@martindurant
Copy link
Member

Ping me when I should have a look

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 11, 2021

Don't look too closely at the git history, it reflects my confusion as I try to understand what is getting tested where :)

I would like some input on how best to parameterize test_core.TestArrayWithFSStore over key_separators, since there are multiple test classes that would be affected by this and I would rather not duplicate a ton of code. It seems that classes that inherit from unitttest.testcase, e.g. TestArray, only have methods that take no arguments, so the key_separator cannot be passed as an argument to methods like test_hexdigest. I don't know pytest well enough to see a simple solution that doesn't involve a lot of copy + paste.

cc @martindurant

@joshmoore
Copy link
Member

I thought the fixture-based style in #417 was pretty intuitive.

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 14, 2021

I think this is ready to look at. Instead of making any big changes to the testing architecture I just went with the flow and added two more subclasses of TestArray in test_core.py -- TestArrayWithFSStoreNested and TestArrayWithFSStoreNestedPartialRead. Getting tests to pass for TestArrayWithFSStoreNested required changing FSStore.getitems and FSStore.listdir, and both changes bridge the "."-separated chunk keys used in the core zarr api to the "/" separated keys of the storage implementation.

Additionally, there are a bunch of failing tests for TestArrayWithFSStoreNestedPartialRead, which are due to the same issue -- code is looking for "."-separated keys where there are none. But I don't have a fix for those failing tests at the moment, since the partial read codebase is more opaque to me.

cc @martindurant

@martindurant
Copy link
Member

Is pydata/xarray#5028 related, perhaps?

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 15, 2021

Is pydata/xarray#5028 related, perhaps?

I think that issue is only mildly related, it's due to this line in https://github.com/zarr-developers/zarr-python/blob/master/zarr/storage.py#L1051 in conjunction with normalize_keys=True in the fsstore constructor.

@martindurant
Copy link
Member

I'm not sure what the argument for normalize to be True versus False by default. Are you saying that it ought to be an obvious way to pass the argument down?

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 15, 2021

I'm not sure what the argument for normalize to be True versus False by default. Are you saying that it ought to be an obvious way to pass the argument down?

I don't really have an opinion on that... I think the upper case vs lower case thing is orthogonal to this PR -- the issues I tried to fix in this PR turn on "0.0.0" vs "0/0/0" for chunk keys.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with the changes to FSStore.
I note many "added line not covered" notes in the diff.

zarr/storage.py Show resolved Hide resolved
zarr/storage.py Outdated Show resolved Hide resolved
@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 15, 2021

I'm not sure what to make of the test coverage warnings, since I'm not changing the API at all, and some of the lines of code being flagged are inside the tests themselves. I do note that store.getitems appears untested, should I add tests for that in test_storage?

@martindurant
Copy link
Member

I do note that store.getitems appears untested, should I add tests for that in test_storage

If this is the case, then yes; the intent was definitely to test it, and previously we required ==100% coverage.

@joshmoore
Copy link
Member

Final commit doesn't appear to have triggered. Trying a re-open.

@joshmoore joshmoore closed this Mar 23, 2021
@joshmoore
Copy link
Member

I've rebased and added the tests from #718. Looks like there are test failures from partial reads (#667)

cc: @andrewfulton9

@grlee77
Copy link
Contributor

grlee77 commented Apr 30, 2021

The one test failure in the non-nested FSStore test_hexdigest (see here) seems to be due to the addition of a dimension_separator attribute in .zarray.

The tested hex would match for the .zattr below if the line containing dimension_separator was removed. (For DirectoryStore, there is no dimension_separator entry):

{
    "chunks": [
        100
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dimension_separator": ".",
    "dtype": "<i4",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        1050
    ],
    "zarr_format": 2
}

@grlee77
Copy link
Contributor

grlee77 commented Apr 30, 2021

Note that if the FSStore is initialized without a key_separator kwarg being passed in (as is the case in TestArrayWithFSStorePartialRead), then there is no dimension_separator entry in the .zarray. Is this the expected behavior, or should a dimension_separator entry always be present?

@joshmoore
Copy link
Member

joshmoore commented Apr 30, 2021

Is this the expected behavior, or should a dimension_separator entry always be present?

@grlee77 : yes, due to just the hexdigest issues you're mentioning. The initial commits tried deprecating key_separator and always storing dimension_separator. Always storing dimension_separator, however, led to more changes in the hexdigest() value for existing datasets. I wasn't sure of the overall impact of that, so decided to write out the value as infrequently as possible. Since dimension_separator could then be None, I needed something that could be guaranteed non-None, and that's key_separator. That's a reversible chain that could likely use some more discussion.

@grlee77
Copy link
Contributor

grlee77 commented Apr 30, 2021

The remaining nested FSStore test failures not related to hex digests are resolved if the key_path attribute of zarr.util.PartialReadBuffer is updated to take into account '.' -> '/' in the nested case:

For example, switching the current:

        self.key_path = self.map._key_to_str(store_key)

to

        _key_path = self.map._key_to_str(store_key)
        _key_path = _key_path.split('/')
        _key_path = '/'.join(_key_path[:-1] + [self.chunk_store._normalize_key(_key_path[-1])])
        self.key_path = _key_path

works, but that is just the quick hack I came up with during debugging and I haven't looked too closely to see where this fix should actually live. recommendations?

@joshmoore
Copy link
Member

@grlee77, it doesn't look insane 🙂 and if all current tests (incl. yours) are happy, I'd say it's an improvement. Based on @d-v-b's earlier evaluation, it seems like we want a clearer delineation between keys as used in storage.py and core.py, where the former can use "/" and the latter never users "/" (?). PartialReadBuffer is only used in core.py, but I don't know where partial read perhaps is also a leaky abstraction. Lots of moving parts.

@joshmoore
Copy link
Member

@d-v-b : any thoughts on @grlee77's proposal? I imagine a 2.9.0 could go out soon and it'd be nice to include this (or have it beforehand in a 2.8.2).

@d-v-b
Copy link
Contributor Author

d-v-b commented May 4, 2021

Disclaimer: I don't know anything at all about the partial read functionality. That being said, it looks like the function in question (zarr.util.PartialReadBuffer) is operating at the storage level, i.e. it's working with real keys to objects on a file system. In that case the proposed change doesn't raise any red flags from me.

@joshmoore
Copy link
Member

Ok. Pushed @grlee77's fix as well as hexdigest fixes. Let's see if we can get it all green!

@joshmoore joshmoore changed the title [WIP] Expand (and pass) nested FSStore tests Expand (and pass) nested FSStore tests May 4, 2021
@joshmoore
Copy link
Member

Whew!

@andrewfulton9
Copy link
Contributor

I think what's happening with the PartialReadBuffer is that its using the mapper class to rebuild the full path to the file in the storage system so that it can use fs.read_block, to only read a part of the file, instead of using FSMap to get the file. I think that maybe when I tested that PR, I didn't account for store_keys having a leading '/' which is why @grlee77 had to make the above changes to make it pass.

The code below would also work for fixing the issue, but I haven't tested it, and the above fix works fine too I think.

self.key_path = self.map._key_to_str(self.chunk_store._normalize_key(store_key))

@joshmoore
Copy link
Member

Can confirm that the tests pass with @andrewfulton9's version as well.

return children
else:
if array_meta_key in children:
# special handling of directories containing an array to map nested chunk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant: can you comment if this matches your expectations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant : any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now I have reminded myself what it's doing, and I suppose this is reasonable. As to it's accuracy in all cases, I can only guess it looks right.

It makes you wonder how useful listdir actually if for the case of fsspec. I suppose it remains true that a user might want to start exploring a dataset at some high-level group and descend to a specific array, yet make sure they never list the entire set of files (which could be expensive).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @martindurant. I'm going to take that as a 👍 and get this into a 2.8.1 release. It seems that this may be something that needs a re-evaluation down the line.

@joshmoore
Copy link
Member

joshmoore commented May 7, 2021

@martindurant: this is perhaps a bit out of scope for this PR since it's specifically focused on FSStore but in case we want to re-evaluate at which level dimension_separator needs to be applied, I just realized that this doesn't work:

import zarr as zr
import dask.array as da

z = zr.storage.FSStore(
    "foo.zarr",
    dimension_separator="/",
    auto_mkdir=True,
    mode="w",
)
a = zr.creation.array(store=z, data=[range(4) for x in range(4)])

# Passes
base = da.from_zarr(z, "")
m = base.max().compute()
assert m == 3, m

# Fails
base = da.from_zarr("foo.zarr")
m = base.max().compute()
assert m == 3, m

since FSStore is not used but only FSMap

cc: @will-moore

will-moore added a commit to will-moore/ome-zarr-py that referenced this pull request May 7, 2021
@joshmoore
Copy link
Member

I've migrated the da.from_zarr issue to dask/dask#7673

@joshmoore joshmoore merged commit a756586 into zarr-developers:master May 19, 2021
@d-v-b d-v-b deleted the fsstore_nested_tests branch December 21, 2022 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants