
The Amazon S3 limit on the length of keys #77

Closed
DennisHeimbigner opened this issue Jun 7, 2020 · 2 comments · Fixed by #175
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec

Comments

@DennisHeimbigner

I noticed that Amazon S3 (and apparently also Google) define a
limit of 1024 bytes for object keys. This limit apparently
applies to the whole key and not, say, to individual segments of
the key, where a segment is the name between '/' occurrences.
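
For illustration, here is a minimal Python sketch (the function name is hypothetical) showing that the limit applies to the byte length of the entire key, not to any one segment:

```python
S3_MAX_KEY_BYTES = 1024  # Amazon S3's documented limit on object key length

def check_key_length(key: str) -> None:
    """Raise if a key would exceed the S3 key-length limit.

    The limit applies to the UTF-8 encoded length of the whole key,
    not to the individual '/'-separated segments.
    """
    n = len(key.encode("utf-8"))
    if n > S3_MAX_KEY_BYTES:
        raise ValueError(
            f"key is {n} bytes; S3 allows at most {S3_MAX_KEY_BYTES}"
        )

# Each segment here is short, but the full key is what counts:
check_key_length("meta/root/2020-06-07/boulder_colorado/surface/temperature_daily_mean")
```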

I know that for atmospheric-sciences netcdf-4 datasets, variable
names are used to encode a variety of properties such as dates
and locations. This often results in long variable
names. Additionally, deeply nested groups are also used to
classify sets of variables. Bottom line: it is probable that
such datasets will run up against the 1024-byte limit in the
near future.

So my question to the community is: how do we deal with the
1024-byte limit? Or do we ignore it?

One might hope that Amazon will up that limit Real-Soon-Now. My
guess is that a limit of 4096 bytes would be adequate to push
the problem off to a more distant future.

If such a length increase does not happen, then we may need to
rethink the Zarr layout so that this limit is circumvented.
Below are some initial thoughts about this. I hope I am not
overthinking this and that there is some simpler approach that I
have not considered.

One possible proposal is to use a structure where
the long key is replaced with a hash of the long key.
This leads to an inode-like system with a flat space of hash
keys, where the objects stored under those hash keys contain the
metadata and chunk data. In order to represent the group
structure, one would need to extend this so that some "inodes"
are directory-like objects that map each key segment to the hash
key of the inode "contained" under that segment.

I am sure there are other ways to do this. It may also be worth
asking about the purpose of the groups. Right now they serve
as a namespace and as a primitive indexing mechanism for the leaf
content-bearing objects. Perhaps they are superfluous.

In any case, the 1024-byte key-length limit is likely
to be a problem for Zarr in the near future.
The community needs to decide if it wants to ignore this
limitation or address it in some general way.

=Dennis Heimbigner
Unidata

@Carreau
Contributor

Carreau commented Jun 9, 2020

Thanks, I'll try to see if I can add some of that into the spec.

I think that the length-limitation workaround might need to be on a per-store basis. At least in spec v3 there are the data/ and meta/ prefixes, so it would be easy to have the equivalent of "mount points"/references.
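
For illustration, such a "mount point" could act as a key-rewriting layer in front of the store. A rough sketch (hypothetical Python; the v3 spec defines the meta/ and data/ prefixes but no such mechanism):

```python
# alias -> long logical prefix; the short alias is what actually hits the store
mounts: dict[str, str] = {
    "m0": "data/root/atmosphere/northern_hemisphere/daily/surface",
}

def to_store_key(logical_key: str) -> str:
    """Rewrite a logical key using the longest matching mounted prefix."""
    for alias, prefix in sorted(mounts.items(), key=lambda kv: len(kv[1]), reverse=True):
        if logical_key == prefix or logical_key.startswith(prefix + "/"):
            return alias + logical_key[len(prefix):]
    return logical_key

print(to_store_key("data/root/atmosphere/northern_hemisphere/daily/surface/t2m/c0/0"))
# -> "m0/t2m/c0/0"
```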

I'm not a huge fan of the hashing/inode-like approach, as this will likely mean a single place where we store the mapping, which would require locking, and would make listing more difficult.

Note that some Windows path APIs are also [limited to 260 characters in length](https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#:~:text=In%20the%20Windows%20API%20(with,and%20a%20terminating%20null%20character.), and this has been a problem in the JS ecosystem with node_modules.

@jstriebel added the core-protocol-v3.0 label Nov 16, 2022
@jstriebel
Member

I added notes about this in #175.
