Appending to zarr with string dtype #2789

Open
davidbrochart opened this issue Feb 26, 2019 · 2 comments
Labels
topic-zarr Related to zarr storage library

Comments

Contributor

davidbrochart commented Feb 26, 2019

import xarray as xr

da = xr.DataArray(['foo'])
ds = da.to_dataset(name='da')
ds.to_zarr('ds') # no special encoding specified

ds = xr.open_zarr('ds')
print(ds.da.values)

The code above prints ['foo'] (string type). The encoding chosen by zarr is "dtype": "|S3", which corresponds to bytes, but the values are decoded back to strings on read, which is what we want.

$ cat ds/da/.zarray 
{
    "chunks": [
        1
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "|S3",
    "fill_value": null,
    "filters": null,
    "order": "C",
    "shape": [
        1
    ],
    "zarr_format": 2
}

The problem is that if I want to append to the zarr archive, like so:

import zarr

ds = zarr.open('ds', mode='a')
da_new = xr.DataArray(['barbar'])
ds.da.append(da_new)

ds = xr.open_zarr('ds')
print(ds.da.values)

It prints ['foo' 'bar']. The encoding was kept as "dtype": "|S3", which fits a string of 3 characters but not one of 6, so 'barbar' is truncated to 'bar'.
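The truncation can be reproduced with NumPy alone. This is a minimal sketch, assuming the append casts the new values to the stored fixed-width dtype (which is what the output above suggests); the variable names are illustrative only.

```python
import numpy as np

# The dtype recorded in .zarray: byte strings of at most 3 bytes each.
stored_dtype = np.dtype("|S3")

# The value being appended needs 6 bytes.
new_values = np.array(["barbar"])

# Casting to the narrower fixed-width dtype silently drops the extra bytes.
truncated = new_values.astype(stored_dtype)
print(truncated)  # [b'bar']

# The width that would be needed to hold the longest new string.
needed_width = max(len(s.encode("utf-8")) for s in new_values)
print(needed_width)  # 6
```

Checking the needed width against the stored dtype before appending would at least make the data loss visible.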

If I specify the encoding with the maximum length, e.g.:

ds.to_zarr('ds', encoding={'da': {'dtype': '|S6'}})

It solves the length problem, but now my strings are kept as bytes: [b'foo' b'barbar']. If I specify a Unicode encoding:

ds.to_zarr('ds', encoding={'da': {'dtype': 'U6'}})

It is not taken into account: the zarr encoding is still "dtype": "|S3", and I am back to the length problem: ['foo' 'bar'].

The solution with 'dtype': '|S6' is acceptable, but I need to encode my strings to bytes when indexing, which is annoying.
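For reference, the two workarounds on the reading side look roughly like this. It is a sketch on a plain NumPy array standing in for the values read back from the store; `stored`, `key` and `decoded` are illustrative names, not part of xarray or zarr.

```python
import numpy as np

# Stand-in for the bytes values that come back with 'dtype': '|S6'.
stored = np.array([b"foo", b"barbar"], dtype="|S6")

# Option 1: encode the lookup key to bytes before comparing.
key = "barbar".encode("utf-8")
matches = stored == key
print(matches.tolist())  # [False, True]

# Option 2: decode the whole array once and work with str afterwards.
decoded = np.char.decode(stored, "utf-8")
print(decoded)  # ['foo' 'barbar']
```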

@davidbrochart davidbrochart changed the title from "to_zarr with string dtype" to "Appending to zarr with string dtype" Feb 26, 2019
Contributor Author

davidbrochart commented Feb 26, 2019

An example involving only zarr (not xarray):

import zarr
import numpy as np

z1 = zarr.open('a.zarr', mode='w', shape=(1,), chunks=(1,), dtype='U6')
z1[:] = 'foo'

z2 = zarr.open('a.zarr', mode='a')
z2.append(np.array(['barbar'], dtype='U6'))

z3 = zarr.open('a.zarr', mode='r')
print(z3[:])

It works fine and prints ['foo' 'barbar'], and the encoding is "dtype": "<U6":

$ cat a.zarr/.zarray 
{
    "chunks": [
        1
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<U6",
    "fill_value": "0",
    "filters": null,
    "order": "C",
    "shape": [
        1
    ],
    "zarr_format": 2
}

Member

shoyer commented Feb 27, 2019

I think variable-length strings should probably be the default behavior for xarray with zarr -- see #2724
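To illustrate the difference between fixed-width and variable-length strings, here is a NumPy-only sketch. It uses an object array as a stand-in for variable-length storage and is not a description of how xarray or zarr implement it.

```python
import numpy as np

# Fixed-width: every element is limited to 3 characters.
fixed = np.array(["foo"], dtype="U3")
fixed_appended = np.concatenate(
    [fixed, np.array(["barbar"]).astype(fixed.dtype)]
)
print(fixed_appended.tolist())  # ['foo', 'bar']

# Variable-length (object dtype): each element keeps its own length.
variable = np.array(["foo"], dtype=object)
variable_appended = np.concatenate(
    [variable, np.array(["barbar"], dtype=object)]
)
print(variable_appended.tolist())  # ['foo', 'barbar']
```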

@dcherian dcherian added the topic-zarr Related to zarr storage library label Apr 9, 2022