Supporting UTF-8 data type #83

Open
jakirkham opened this issue Jul 1, 2020 · 12 comments
Labels
data-type protocol-extension Protocol extension related issue

Comments

@jakirkham
Member

In today's discussion the need for UTF-8 came up. I thought we already had an issue for this, but I'm not finding it.

Would be useful to have UTF-8 support in the spec or as a high priority extension. Raising here to start the discussion about how we want to approach this.

cc @joshmoore @alimanfoo @shoyer @Carreau

@shoyer

shoyer commented Jul 1, 2020

As I noted in the call, I think how HDF5 supports strings (including UTF-8) is pretty sane:

  • String data types always have an explicit encoding, which can be either ascii or utf8.
  • String data types are either fixed width (which refers to the number of bytes used in the encoded representation, not the number of Unicode characters as in Python) or variable width

I'm not sure there's a real use for ascii these days (given that it's a strict subset of utf8), but there are certainly use cases for both fixed width and variable width utf8 strings.
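To illustrate the byte-counting point above (a plain-Python sketch of the semantics, not actual HDF5 API usage): a fixed width of N refers to N encoded bytes, so a string's UTF-8 byte length can exceed its character count, and fixed-width cells are typically padded out to the declared width.

```python
# "Fixed width" in HDF5-style strings counts encoded bytes,
# not Unicode code points (unlike Python's len()).
s = "naïve"            # 5 code points
b = s.encode("utf-8")  # "ï" takes 2 bytes in UTF-8, so 6 bytes total
assert len(s) == 5
assert len(b) == 6

# A hypothetical fixed-width utf8 field of 8 bytes stores the encoded
# bytes padded with NULs; decoding strips the padding back off.
width = 8
cell = b.ljust(width, b"\x00")
assert len(cell) == width
assert cell.rstrip(b"\x00").decode("utf-8") == s
```

A string whose encoded form exceeds the declared width simply cannot be stored losslessly, which is the usual pain point with fixed-width types.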

@alimanfoo alimanfoo changed the title Supporting UTF-8 Supporting UTF-8 data type Jul 15, 2020
@jstriebel jstriebel added core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec protocol-extension Protocol extension related issue data-type labels Nov 16, 2022
@ivirshup

ivirshup commented Dec 6, 2022

I think having a utf8 string type is very important for v3.

I would also be a strong proponent of a variable length utf-8, as most text data is variable length.

I am concerned by the current spec's use of fixed-length UTF-32, since it's an uncommon encoding with little support beyond numpy.

My ideal scenario would be to have the string extension spec essentially use Arrow's string type encoding specification, i.e. a string is a variable-length list of bytes (docs on layout). This means the chunk would include multiple buffers, including an offset buffer and a data buffer. Arrow also includes validity information for null values – which is nice, but I'm not sure it's necessary.
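The two-buffer layout described above can be sketched in plain Python (a simplified illustration of Arrow's variable-size binary layout; the real format also has a validity bitmap and buffer-alignment rules, which are omitted here):

```python
import struct

def encode(strings):
    """Pack strings into an (offsets, data) buffer pair, Arrow-style:
    n+1 little-endian int32 offsets plus one contiguous UTF-8 data buffer."""
    data = b"".join(s.encode("utf-8") for s in strings)
    offsets, pos = [0], 0
    for s in strings:
        pos += len(s.encode("utf-8"))
        offsets.append(pos)
    return struct.pack(f"<{len(offsets)}i", *offsets), data

def decode(offsets_buf, data):
    """Recover the strings: item i spans data[offsets[i]:offsets[i+1]]."""
    n = len(offsets_buf) // 4 - 1
    offsets = struct.unpack(f"<{n + 1}i", offsets_buf)
    return [data[offsets[i]:offsets[i + 1]].decode("utf-8") for i in range(n)]

offs, data = encode(["Hi", "Hey", "héllo"])
assert decode(offs, data) == ["Hi", "Hey", "héllo"]
```

Random access stays O(1) per item because the offsets buffer is fixed width, even though the strings themselves are not.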

For expediency, it could make sense to include fixed length utf8 strings as an extension in zarr v3. I'm not sure I would update the AnnData formats to zarr v3 until variable length strings existed, since I'd rather not go back to the issues we had with fixed length strings. E.g. I would really like to kerchunk together arrays of labels, and labels vary widely in size.


@DennisHeimbigner, we briefly talked about this at the end of the last zarr call, though I hadn't had a chance to read the spec yet. You had mentioned varlength was proposed, but was that in an issue/ PR?

@jbms
Contributor

jbms commented Dec 6, 2022

I agree --- I would also like to see variable length byte sequence and variable length Unicode code point sequence as data types.

I believe the existing fixed length string type extensions are definitely not intended to be part of the core spec. They were added to document the existing zarr v2 behavior, and haven't been reviewed too much. Despite the fact that they don't seem terribly useful, I also don't think they are unreasonable to have as optional extensions.

@ivirshup

ivirshup commented Dec 6, 2022

I agree --- I would also like to see variable length byte sequence and variable length Unicode code point sequence as data types.

A point that is a little confusing to me right now is "core", "extension", or "extension but on zarr-specs.readthedocs.io". Which were you thinking for these types?

I also don't think they are unreasonable to have as optional extensions.

I agree these aren't unreasonable by themselves. I think it might be bad if utf-32 were the only unicode representation for v3 on zarr-specs.

@jbms
Contributor

jbms commented Dec 6, 2022

I think we still have to sort out exactly how extensions and other additions of features in later spec versions will be specified in the metadata.

But I certainly agree that the utf-32 encoding is not very useful.

@jstriebel jstriebel removed the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Dec 7, 2022
@tomwhite

tomwhite commented Jul 9, 2024

I'd like to add my vote for adding support for variable-length strings in v3. We need this for supporting Zarr v3 in sgkit's VCF Zarr support (see sgkit-dev/bio2zarr#254).

The way we are using it currently in v2 is the way that's recommended in the Zarr Tutorial:

>>> import numcodecs
>>> import zarr.v2 as zarr
>>> z = zarr.array(["Hi", "Hey"], dtype=object, object_codec=numcodecs.VLenUTF8())
>>> z
<zarr.v2.core.Array (2,) object>
>>> z[:]
array(['Hi', 'Hey'], dtype=object)

Perhaps Zarr v3 should take advantage of the new NumPy UTF-8 variable-width string dtype for this?

@d-v-b
Contributor

d-v-b commented Jul 9, 2024

I'm not too familiar with numpy string arrays but my impression is that an array of a variable-length type cannot use a contiguous memory buffer for the in-memory representation. As zarr-python v3 internal APIs are very much centered around contiguous memory buffers, this might be a challenge!

@normanrz do you have any insight into how variable length types would fit into the current chunk processing framework in zarr python v3?

@normanrz
Contributor

normanrz commented Jul 9, 2024

I think adding variable-length strings to zarr-python would take some work but is not impossible. The numpy-backed buffers are still quite flexible. We use them for handling the object dtype in v2 arrays as well. Other buffers might need more work.

@rabernat
Contributor

rabernat commented Jul 9, 2024

Perhaps Zarr v3 should take advantage of the new NumPy UTF-8 variable-width string dtype for this?

I don't think this is much help for Zarr, because "string data are stored outside the array buffer" (see https://numpy.org/neps/nep-0055-string_dtype.html#serialization), i.e. the array just stores a pointer to the actual string data.

A much better reference point would be Arrow string encoding, or more generally, Arrow variable sized binary layout. Variable-length types require at least two buffers: one to store the actual data and one to store offsets into the data where the items begin.

We already support all of this in Zarr V2 via numcodecs vlen codecs! https://numcodecs.readthedocs.io/en/stable/vlen.html

Shouldn't it be straightforward to adapt this approach to V3? The key will be not to rely on anything Python-specific (e.g. Python objects). Arrow points the way here.
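As a rough illustration of the vlen idea (a length-prefixed sketch in the spirit of the numcodecs vlen codecs — the actual numcodecs wire format is an assumption here and may differ in detail), encoding and decoding need only byte counts, no Python objects:

```python
import struct

def vlen_encode(strings):
    """Length-prefixed chunk: a uint32 item count, then per item a
    uint32 byte length followed by the UTF-8 bytes."""
    parts = [struct.pack("<I", len(strings))]
    for s in strings:
        b = s.encode("utf-8")
        parts.append(struct.pack("<I", len(b)) + b)
    return b"".join(parts)

def vlen_decode(buf):
    """Walk the buffer item by item, advancing past each length prefix."""
    (n,) = struct.unpack_from("<I", buf, 0)
    pos, items = 4, []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        items.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return items

assert vlen_decode(vlen_encode(["Hi", "Hey"])) == ["Hi", "Hey"]
```

Unlike the offsets-buffer layout, this interleaved form must be scanned sequentially to find item i, but it is trivially language-agnostic.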

@jeromekelleher
Member

jeromekelleher commented Jul 9, 2024

Just adding my +1 to @tomwhite's comment above. Strings are crucial for supporting genetic variation data, of which there is an awful lot, and which Zarr would be amazing for. See our preprint for background and details.

@normanrz
Contributor

I think this issue needs a champion who wants to write a ZEP.

@rabernat
Contributor

Over at zarr-developers/zarr-python#2031 I have a proof-of-concept that we can very easily support UTF-8 and variable length strings by leveraging Arrow encoding of string arrays. Would love some feedback on whether that approach seems promising.
