Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node name case sensitivity #57

Open
alimanfoo opened this issue Mar 31, 2020 · 15 comments
Open

Node name case sensitivity #57

alimanfoo opened this issue Mar 31, 2020 · 15 comments
Labels

Comments

@alimanfoo
Copy link
Member

Currently the v3.0 core protocol draft section on node names states the following:

Note that node names are used to form storage keys, and that some storage systems will perform a case-insensitive comparison of storage keys during retrieval. Therefore, within a hierarchy, all nodes within a set of sibling nodes must have a name that is unique under case-insensitive comparison. E.g., the names “foo” and “FOO” are not allowed for sibling nodes.

This constraint, however, is problematic in scenarios where nodes are being created in parallel. This is because requiring that sibling nodes have a unique name under case-insensitive comparison would require that a check is made before node creation to ensure no name collisions, and any such check would require synchronisation of node creation at least within the same group.

Currently a design goal for the spec is to avoid constraints which require operations to be synchronised, thus allowing multiple parallel processes to work simultaneously, including creating nodes within the same hierarchy. This avoids implementation complexity (no need for locks or other synchronisation mechanisms) and also works better on stores with eventual consistency behaviour.

I suggest we remove this constraint from the core protocol. I.e., implementations of the core protocol are not expected to check for case-insensitive node name collisions.

However, instead we add a usage note, which provides some guidance for implementing applications using a zarr protocol implementation, including recommending that applications avoid creating sibling nodes with case-insensitive name collisions wherever that is possible.

I.e., we push this to the application layer.

@jrbourbeau jrbourbeau added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Mar 31, 2020
@Carreau
Copy link
Contributor

Carreau commented May 5, 2020

However, instead we add a usage note, which provides some guidance for implementing applications using a zarr protocol implementation, including recommending that applications avoid creating sibling nodes with case-insensitive name collisions wherever that is possible.

Then do you want to add on the spec that stores must be able to handle case sensitive names ? In the spec it says Note that node names are used to form storage keys it does not say how, and that might push for example filesystems to not just use a key but another scheme.

@alimanfoo
Copy link
Member Author

Hi @Carreau,

The subsection on storage keys describes how keys are constructed from node paths.

It is then up to a store spec to decide how to use these keys and store data.

For example, the file system store spec defines how to translate keys into file paths. I expect this will be a commonly-used store, and so the question about case (in)sensitivity mainly applies to this store. But it is not a fixed choice, alternative store specs could be proposed which takes a different approach.

Hope that makes sense.

@Carreau
Copy link
Contributor

Carreau commented May 6, 2020

Yes, thanks that what I understood, I just want to make sure decisions are not taken on implementations leaking. The fact that the directory store happens to be case (in)sensitive and how to do key retrieval can as far as I can tell be worked around pretty easily).

from hashlib import md5
from base64 import b32encode
import os

def f(path):
    return os.sep.join( 
        x+'-'+md5(x.encode()).hexdigest()[:5] for x in path.split('/')
    ).lower()

print(f('FOO/bar')) #foo-90189/bar-37b51
print(f('foo/bar')) #foo-acbd1/bar-37b51

def g(path):
    return os.sep.join( 
        x+'-'+''.join(['1' if c.isupper() else '0' for c in x]) for x in path.split('/')
    ).lower()

print(g('FoO/bar')) #foo-101/bar-000
print(g('foo/bar')) #foo-000/bar-000


def h(path):
    return os.sep.join( 
        b32encode(x.encode()).decode() for x in path.split('/')
    ).lower()

print(h('FoO/bar')) #izxu6===/mjqxe===
print(h('foo/bar')) #mzxw6===/mjqxe===

Now each of those have different tradeoff on readability/length of path, key enumeration... etc.

Now the problem I see with requiring avoiding if possible, is that you may end up with an code implementation on one system that happened to rely on some case insensitivity of the underlying store and will be broken on another system.

If the choice were mine, I would probably impose case sensitivity for a store spec.

This is also not a complete unrelated discussion from unicode as casefolding for non-ascii and case folding it complex (it may even depends on user locale, i'm not sure) (https://unicode.org/faq/casemap_charprop.html have some crazy things.)

from a performance perspective you may even be able to just do bytes comparison as well, which I'm guessing compiled languages would be happy(ier) about.

@alimanfoo
Copy link
Member Author

Thanks @Carreau, very helpful.

If the choice were mine, I would probably impose case sensitivity for a store spec.

FWIW I think this is the most natural thing to do.

To put it into the terms of the v3 protocol spec, I think this would mean stating in the section on the store interface that we expect stores to perform a case sensitive comparison of keys for the get, set and delete operations.

What then should the file system store spec do? That spec defines a translation from storage keys to file paths that relies on the underlying file system being case sensitive, and so should not be used on a case insensitive file system. Maybe that's OK, and we just add a big red note saying "only use on case sensitive file systems".

I guess we could always define another file system storage spec which is designed to work on case insensitive file systems, and so carries some of the extra overhead of longer and/or more obscure file paths.

The two storage specs could live alongside each other, and people could choose to implement and use which they think best.

Thoughts welcome.

@Carreau
Copy link
Contributor

Carreau commented May 7, 2020

To put it into the terms of the v3 protocol spec, I think this would mean stating in the section on the store interface that we expect stores to perform a case sensitive comparison of keys for the get, set and delete operations.

probably list keys as well.

What then should the file system store spec do? That spec defines a translation from storage keys to file paths that relies on the underlying file system being case sensitive, and so should not be used on a case insensitive file system.

Maybe that's OK, and we just add a big red note saying "only use on case sensitive file systems".

I think that might be tough.

And yes, I would go with a different file storage spec, and is also one of the reasons I opened the issue about detecting the kind of store independently of the zarr version in order to dispatch.

Now I'm' pretty new to the project, I'm trying to give my best advice with what I believe is the direction of the project you are envisaging. I might be completely wrong.

@Carreau
Copy link
Contributor

Carreau commented Oct 9, 2020

Jim Pivarski suggest we escape uppercase characters, and this would keep readability.

@jstriebel
Copy link
Member

The v3 core spec currently states that

Node names are case sensitive, e.g., the names “foo” and “FOO” are not identical

which should resolve that at least on the spec level, right? Should this issue still be kept open? I guess it will become important once stores which are case insensitive are added, then escaping uppercase characters might be the proper solution.

@jstriebel jstriebel added store and removed core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec todo pre-rfc labels Nov 24, 2022
@jbms
Copy link
Contributor

jbms commented Nov 24, 2022

This requirement would preclude use of the filesystem store (without escaping) with normal local filesystems on macos and windows. Instead it should be defined by the store/implementation.

@jstriebel
Copy link
Member

See also #175, where I added notes about case insensitivity for Windows and MacOS. However, escaping might still be an option. Should that be added as an option to the current FileSystemStore? (There's no such thing as to specify configs for a store in the metadata atm, will open an issue about that soon).

@jbms
Copy link
Contributor

jbms commented Nov 24, 2022

Escaping could be a solution, but it would essentially amount to a separate incompatible store, and would sacrifice readability of the keys.

@jstriebel
Copy link
Member

it would essentially amount to a separate incompatible store

Good point, so it could also be added later as an additional store. IMO this is good enough to go forward with v3.

@jstriebel
Copy link
Member

PS: Just noticed that escaping might be done via a storage transformer.

@jstriebel
Copy link
Member

See also #201.

@joshmoore
Copy link
Member

@jstriebel: what was the status here?

@jbms
Copy link
Contributor

jbms commented Jul 15, 2023

https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#node-names

We basically leave it up to the underlying store.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants