Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FSStore #546

Merged
merged 23 commits into from
Sep 14, 2020
Merged

Add FSStore #546

merged 23 commits into from
Sep 14, 2020

Conversation

martindurant
Copy link
Member

@martindurant martindurant commented Mar 17, 2020

Fixes #540
Ref #373 (comment) (@rabernat )

Introduces a short Store implementation for generic fsspec url+options. Allows both lowercasing of keys and choice between '.' and '/'-based ("nested") keys.

For testing, have hijacked the TestNestedDirectoryStore tests just as an example - this is not how things will remain.

@alimanfoo
Copy link
Member

Thanks Martin, great to see this.

@rabernat
Copy link
Contributor

Thanks a lot @martindurant for getting this started! Let me know when you're ready for a review or some feedback.

It would be really nice to implement getsize and rename as well.

@martindurant
Copy link
Member Author

getsize is already there ( https://github.com/zarr-developers/zarr-python/pull/546/files#diff-31d15042dbeedbf2942ace2ad4b9b2e2R1022 )
Do you have the signature for rename?

@chrisroat
Copy link

@martindurant I was just reloading the underlying issue into my brain's cache this morning, and excited to see this. What is the overall plan, with regard to how this would interact with things like N5Store and NestedDirectoryStore? How can I help?

@martindurant
Copy link
Member Author

The idea is, that nested and consolidated, at least, would be options to this store, no need for extra classes. I don't know enough about the N5 case to know if that is as simple.

As far as I know, the little code that is here "works", but needs a good deal of testing and doubtless some massage to get totally right. At the moment, the PR just hijacks some existing nested tests, which all then pass locally, and almost all on CI.

@chrisroat
Copy link

If this is to fix 540, then I would like to see how I could help update N5Store to use this. The N5Store is a NestedDirectoryStore (also a MutableMapping like FSStore) which lays out the data according the N5 spec.

I do not think N5Store can just inherit from FSStore, since it will lose the Directory/NestedDirectory functionality. It's possible that N5Store could take a FSStore in it's c'tor, but that may not be the cleanest way to combine the N5, FS, and NestedDirectory functionality.

Is FSStore a possible replacement for a DirectoryStore?

Regarding the comment above about the rename method, here is the DirectoryStore
implementation.

@martindurant
Copy link
Member Author

I do not think N5Store can just inherit from FSStore, since it will lose the Directory/NestedDirectory functionality

In this POC, FSStore is tested as if it were a nested store, but it can do both, depending on the arguments passed. That's what I meant bu not needing to have separate classes.

here is the DirectoryStore implementation.

OK, so simple enough

@alimanfoo
Copy link
Member

Just to say FWIW I think the best way forward for N5Store is to modify it so it becomes a transformation layer over another store, as suggested here and in zarr-developers/n5py#9. That way N5Store can be used with any type of underlying storage. I.e., work on N5Store can be orthogonal to this PR.

@chrisroat
Copy link

@martindurant I did a quick swap on N5Store to have it inherit from FSStore (and use the FSStore listdir within N5Store.listdir), and it does seem to correctly read/write a single array to GCS. I didn't test other functionality, like groups or anything related to listdir explicitly.

@alimanfoo OK. I'll pick up the discussion about the correct interaction of N5Store and FSStore on 540.

@martindurant
Copy link
Member Author

Right, there will be problems - so far I only ran the nestedstore tests, and see that they pass. You must of course pass key_separator='/' to the init, which for a subclass would be

    super().__init__(key_separator='/', other-kwargs)

I haven't got very far yet, but the code here should show that it isn't too complicated.

zarr/storage.py Outdated Show resolved Hide resolved
Martin Durant added 3 commits April 16, 2020 09:46
This demonstrates the purpose of this PR!

Now allows, for exmple `zarr.open('http://localhost:8000')`,
although there is no way to pass on further args with this method
yet.
@pep8speaks
Copy link

pep8speaks commented Apr 16, 2020

Hello @martindurant! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-11 14:08:37 UTC

zarr/storage.py Outdated
@@ -949,16 +949,17 @@ def atexit_rmglob(path,
class FSStore(MutableMapping):

def __init__(self, url, normalize_keys=True, key_separator='.',
**storage_options):
mode='r', **storage_options):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, the TestNestedDirectoryStore failure becomes:

_____________________________________________ TestNestedDirectoryStore.test_chunk_nesting _____________________________________________

self = <zarr.tests.test_storage.TestNestedDirectoryStore testMethod=test_chunk_nesting>

    def test_chunk_nesting(self):
        store = self.create_store()
        # any path where last segment looks like a chunk key gets special handling
>       store['0.0'] = b'xxx'

zarr/tests/test_storage.py:773:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <zarr.storage.FSStore object at 0x7fb7501c2c18>, key = '0.0', value = b'xxx'

    def __setitem__(self, key, value):
        if self.mode == 'r':
>           raise PermissionError
E           PermissionError

zarr/storage.py:974: PermissionError

We realized recently that the default mode for a regular DirectoryStore is w meaning that directories are automatically created:

In [18]: z=zarr.open('this/file/does/not/exist')
In [19]: list(z.groups()), list(z.arrays())
Out[19]: ([], [])

I was debating suggesting a change to "r" on a separate issue, but we'll probably need to keep it in mind for this PR as well.

If I force a "w" mode:

@@ -764,7 +764,7 @@ class TestNestedDirectoryStore(TestDirectoryStore, unittest.TestCase):
         path = tempfile.mkdtemp()
         atexit.register(atexit_rmtree, path)
         store = NestedDirectoryStore(path, normalize_keys=normalize_keys,
-                                     key_separator='/', auto_mkdir=True)
+                                     key_separator='/', auto_mkdir=True, mode="w")
         return store

then I again see the previous error:

zarr/tests/test_storage.py:230:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
zarr/storage.py:970: in __getitem__
    return self.map[key]
/opt/anaconda/envs/py36/lib/python3.6/site-packages/fsspec/mapping.py:76: in __getitem__
    result = self.fs.cat(k)
/opt/anaconda/envs/py36/lib/python3.6/site-packages/fsspec/spec.py:587: in cat
    return self.open(path, "rb").read()
/opt/anaconda/envs/py36/lib/python3.6/site-packages/fsspec/spec.py:775: in open
    **kwargs
/opt/anaconda/envs/py36/lib/python3.6/site-packages/fsspec/implementations/local.py:108: in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
/opt/anaconda/envs/py36/lib/python3.6/site-packages/fsspec/implementations/local.py:175: in __init__
    self._open()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <fsspec.implementations.local.LocalFileOpener object at 0x7faa10221908>

    def _open(self):
        if self.f is None or self.f.closed:
            if self.autocommit or "w" not in self.mode:
>               self.f = open(self.path, mode=self.mode)
E               IsADirectoryError: [Errno 21] Is a directory: '/var/folders/z5/txc_jj6x5l5cm81r56ck1n9c0000gn/T/tmpwebfv41i/c'

/opt/anaconda/envs/py36/lib/python3.6/site-packages/fsspec/implementations/local.py:180: IsADirectoryError

which seems related, but I'm unclear how to get started with it, since it seems to be saying that fsspec expects one of the intermediate keys to be a file rather than a directory. I'll keep poking around but suggestions welcome.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably should have started with the non-nested stores for testing...
Exactly when directory trees should be removed and when files should be removed to make way for directories, I tries to reverse-engineer from the existing tests.

The actual error here, is that zarr expects a KeyError when trying to read "c", but fsspec as of fsspec/filesystem_spec#259 only converts FileNotFound into a KeyError. I suppose IsADirectory should be caught too, or the code should call fs.isfile()?

@martindurant
Copy link
Member Author

This appears to pass all tests with fsspec master, where I have substituted FSStore for DirectoryStore and NestedDirectoryStore in test_storage (not the final solution!).

Note that FSStore now directly implements consolidation, also in write mode (which is controversial, since it makes each metadata write take two remote calls, and introduces possible sync problems).

@martindurant
Copy link
Member Author

Would it be useful to have fsspec master in this POC to show it will pass?

@rabernat
Copy link
Contributor

I appreciate all the work on this from @martindurant. At the same time, I do find it confusing to hear things like "FSStore now directly implements consolidation". We should have a clear separation between a storage layer and the zarr layer. It makes me nervous to see more features seeping into the storage layer.

@martindurant
Copy link
Member Author

We should have a clear separation between a storage layer and the zarr layer

In my mind, fsspec is the storage layer, and everything in zarr.storage can be specific/specialised (such as the original conslolidated). You could use the previous consolidated with FSStore if you wanted, but this new version seems to me to remove duplication, as well as adding write-mode.

On of the issues I am trying to solve is "which store do I subclass": anyone coming to zarr and wanting to implement something new now would have to accommodate regular keys, nested keys or consolidated, not any combination.

zarr/storage.py Outdated
self.fs = self.map.fs # for direct operations
self.mode = mode
self.exceptions = exceptions
# TODO: should warn if consolidated and write mode?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO ?

zarr/storage.py Outdated
self.consolidated = consolidated
self.metadata_key = metadata_key
if consolidated:
self.meta = json.loads(self.map.get(metadata_key, b"{}").decode())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a json_loads utils function used in other place of this file:

def json_loads(s):
    """Read JSON in a consistent way."""
    return json.loads(ensure_text(s, 'ascii'))

Do you want to use it ?

@alimanfoo
Copy link
Member

Dropping in comments from today's discussion on the community call. I have no objection to this PR, would be great to see it go in. Two points that come up are:

  • Should this store support consolidated metadata internally? Or can it let that be handled in the layer above? I would have a mild preference to let this be handled in the layer above, but happy to go either way.

  • Should this store support a chunk key separator argument? Or can it let that be handled in a layer above via composable stores? Ultimately composable stores would seem like a good idea, but given they don't exist yet it could be useful for @joshmoore and others to have FSStore support this for now.

@martindurant happy for you to make the call on how to round this one off.

@martindurant
Copy link
Member Author

Has anyone any idea why numcodecs failed to build during the docs stage here ?

@joshmoore
Copy link
Member

Has anyone any idea why numcodecs failed to build during the docs stage here ?

Due to the now yanked https://pypi.org/project/numcodecs/0.7.0/ as we try to get the wheels deployed. (Sorry 'bout that)

@martindurant
Copy link
Member Author

Should the problem go away then?

Also, I have coverage failure even though "169 of 169 new or added lines in 5 files covered. (100.0%)" (because fsspec is not tested on py35?). That's annoying...

@martindurant
Copy link
Member Author

(never mind on the latter, coverage just updated itself)

@joshmoore
Copy link
Member

joshmoore commented Sep 10, 2020

Should the problem go away then?

It should already be gone! Oddly travis is still finding it, perhaps because of some caching??

https://files.pythonhosted.org/packages/a1/b2/9c4fc0e4bc10a59442ced40dafc474f31bdc49699479da2e8912714e88af/numcodecs-0.7.0.tar.gz (3.0MB) ...

All the more reason to get 0.7.1 out ASAP.

@martindurant
Copy link
Member Author

@joshmoore : is that S3 test OK for you?

@martindurant
Copy link
Member Author

(I plan to merge this as is, unless there are further comments)

Copy link
Member

@joshmoore joshmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests pass, thanks @martindurant. I also tried adding a ```@pytest.mark.skipif(have_fsspec is False, reason="needs fsspec")
class TestNestedFSStore2(TestFSStore):

def create_store(self, normalize_keys=False):
    path = tempfile.mkdtemp()
    atexit.register(atexit_rmtree, path)
    store = FSStore(path, normalize_keys=normalize_keys,
                    key_separator='/', auto_mkdir=True)
    return store
locally which equally passed. :+1:

@martindurant martindurant merged commit bb6b905 into zarr-developers:master Sep 14, 2020
@martindurant martindurant deleted the fsspec branch September 14, 2020 13:41
@Carreau Carreau added this to the v2.5 milestone Sep 14, 2020
@shoyer
Copy link
Contributor

shoyer commented May 12, 2021

Quick question -- is there a good reason why FSStore defaults to normalize_keys=True, unlike the other Zarr stores? We were pretty surprised to find different behavior using Zarr directly and GCSFS.

@joshmoore
Copy link
Member

I had the same question. I've need to hard-code the argument in all of my uses. cc: @martindurant

@martindurant
Copy link
Member Author

I certainly don't mind the default changing, if that is the consensus. I would like to claim there was a good reason for the current , but ...

@joshmoore
Copy link
Member

Bubbling this up into a new issue: #739

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

N5Store support of cloud buckets
8 participants