Appending to zarr with string dtype #2789

davidbrochart · 2019-02-26T14:31:42Z

import xarray as xr

da = xr.DataArray(['foo'])
ds = da.to_dataset(name='da')
ds.to_zarr('ds') # no special encoding specified

ds = xr.open_zarr('ds')
print(ds.da.values)

The code above prints ['foo'] (string type). The encoding chosen by zarr is "dtype": "|S3", which corresponds to bytes, but the values are decoded back to strings on read, which is what we want.

$ cat ds/da/.zarray 
{
    "chunks": [
        1
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "|S3",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        1
    ],
    "zarr_format": 2
}

The problem is that if I want to append to the zarr archive, like so:

import zarr

ds = zarr.open('ds', mode='a')
da_new = xr.DataArray(['barbar'])
ds.da.append(da_new)

ds = xr.open_zarr('ds')
print(ds.da.values)

It prints ['foo' 'bar']. The encoding was kept as "dtype": "|S3", which is fine for a string of 3 characters but truncates a string of 6.
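
The truncation can be reproduced with NumPy alone, without zarr or xarray. This is a minimal sketch that assumes the append step casts incoming values to the array's stored dtype; the variable names are illustrative and not taken from either library.

```python
import numpy as np

# The array was created from ['foo'], so its stored dtype is a 3-byte string.
stored_dtype = np.dtype('S3')

# Casting a longer string to that fixed-width dtype silently drops
# everything past the third byte.
new_values = np.array(['barbar'])
cast = new_values.astype(stored_dtype)

print(cast)  # [b'bar']
```

No error or warning is raised by the cast, which would explain why the append appears to succeed.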

If I specify the encoding with the maximum length, e.g.:

ds.to_zarr('ds', encoding={'da': {'dtype': '|S6'}})

It solves the length problem, but now my strings are kept as bytes: [b'foo' b'barbar']. If I specify a Unicode encoding:

ds.to_zarr('ds', encoding={'da': {'dtype': 'U6'}})

It is ignored: the zarr encoding is still "dtype": "|S3", and I am back to my length problem: ['foo' 'bar'].

The workaround with 'dtype': '|S6' is acceptable, but I then have to encode my strings to bytes whenever I index, which is annoying.
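
For reference, a minimal sketch of what that workaround involves, using NumPy only. Here `values` is a stand-in for what `ds.da.values` returns with the '|S6' encoding; it is constructed by hand rather than read from the store.

```python
import numpy as np

# Stand-in for ds.da.values when the data come back as bytes (dtype S6).
values = np.array([b'foo', b'barbar'])

# Option 1: decode the whole array back to str after reading.
decoded = np.char.decode(values, 'utf-8')
print(decoded)  # ['foo' 'barbar']

# Option 2: keep the bytes and encode the lookup key before comparing.
key = 'foo'.encode('utf-8')
mask = values == key
print(mask)  # [ True False]
```

Either way the conversion has to be done by hand on every read or lookup, which is the inconvenience described above.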

davidbrochart · 2019-02-26T18:16:12Z

An example involving only zarr (not xarray):

import zarr
import numpy as np

z1 = zarr.open('a.zarr', mode='w', shape=(1), chunks=(1), dtype='U6')
z1[:] = 'foo'

z2 = zarr.open('a.zarr', mode='a')
z2.append(np.array(['barbar'], dtype='U6'))

z3 = zarr.open('a.zarr', mode='r')
print(z3[:])

It works fine and prints ['foo' 'barbar'], and the encoding is "dtype": "<U6":

$ cat a.zarr/.zarray 
{
    "chunks": [
        1
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<U6",
    "fill_value": "0",
    "filters": null,
    "order": "C",
    "shape": [
        1
    ],
    "zarr_format": 2
}

shoyer · 2019-02-27T00:25:46Z

I think variable length strings should probably be the default behavior for xarray with zarr -- see #2724

davidbrochart changed the title ~~to_zarr with string dtype~~ Appending to zarr with string dtype Feb 26, 2019

dcherian added the topic-zarr Related to zarr storage library label Apr 9, 2022
