zarr.copy() doesn't seem to skip attributes when setting without_attrs=True #726

Open
ricardog opened this issue Apr 28, 2021 · 0 comments

The input file is available from the LUH2 website (I use a local copy).

Minimal, reproducible code sample

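import h5py
from sys import stdout
import zarr

# Open the netCDF4 (HDF5) input file and show its hierarchy.
src = h5py.File('states.nc')
zarr.tree(src)
# Create the destination zarr group, overwriting any existing data.
dst = zarr.open('data/luh2.zarr', mode='w')
# Request a dry run (log only) with attribute copying disabled.
zarr.copy_all(src, dst, log=stdout, without_attrs=True, dry_run=True, if_exists='replace')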

Problem description

I want to convert a set of rasters in netCDF format to zarr. Following the tutorial, I used zarr.copy_all(), but the function raises an exception while JSON-encoding the attributes, even when I set without_attrs=True. I expected copy_all() to skip copying the attributes. Is there a way to get copy_all() to skip all attributes, or to pass a specialized JSON encoder?
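To make the second question concrete, here is a minimal sketch of what a specialized encoder for bytes values could look like. It uses only the standard library's json module. `BytesDecodingEncoder` is a hypothetical name, the bytes are assumed to be UTF-8 text, and nothing in this report establishes that zarr accepts a custom encoder, so the sketch only illustrates the encoding step itself.

```python
import json


class BytesDecodingEncoder(json.JSONEncoder):
    """Hypothetical encoder: serialize bytes values as UTF-8 strings."""

    def default(self, o):
        # json calls default() only for objects it cannot serialize natively.
        if isinstance(o, bytes):
            # Assumption: the attribute bytes hold UTF-8 text.
            return o.decode("utf-8")
        return super().default(o)


# Two attributes in the style of the input file.
attrs = {"host": b"UMD College Park", "_nc3_strict": 1}
encoded = json.dumps(attrs, cls=BytesDecodingEncoder, sort_keys=True)
print(encoded)
```

Without `cls=BytesDecodingEncoder`, the same `json.dumps` call raises a `TypeError` for the bytes value.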

The attribute values in the input file are Python bytes.

>>> print('\n'.join([f'{k}: {v}' for k, v in src.attrs.items()]))
_nc3_strict: 1
host: b'UMD College Park'
comment: b'LUH2'
contact: b'gchurtt@umd.edu, lchini@umd.edu, steve.frolking@unh.edu, ritvik@umd.edu'
creation_date: b'2017-11-13T16:10:52Z'
title: b'UofMD LUH2f dataset prepared for input4MIPs'
activity_id: b'input4MIPs'
Conventions: b'CF-1.6'
data_structure: b'grid'
source: b'LUH2 v2.1f: Land-Use Harmonization Data Set'
dataset_version_number: b'2.1f'
dataset_category: b'landState'
variable_id: b'multiple'
grid_label: b'gn'
mip_era: b'CMIP6'
further_info_url: b'http://luh.umd.edu'
frequency: b'yr'
institution_id: b'UofMD'
institution: b'University of Maryland (UofMD), College Park, MD 20742, USA'
realm: b'land'
references: b'Hurtt, Chini et al. 2011'
license: b'Land-Use Harmonization data produced by the University of Maryland is licensed under a Creative Commons Attribution "Share Alike" 4.0 International License (http://creativecommons.org/licenses/by/4.0/). The data producers and data providers make no warranty, either express or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law.'
target_mip: b'[\xe2\x80\x98ScenarioMIP\xe2\x80\x99, \xe2\x80\x98AerChemMIP\xe2\x80\x99, \xe2\x80\x98C4MIP\xe2\x80\x99, \xe2\x80\x98DCPP\xe2\x80\x99, \xe2\x80\x98GeoMIP\xe2\x80\x99, \xe2\x80\x98LS3MIP\xe2\x80\x99, \xe2\x80\x98LUMIP\xe2\x80\x99]'
source_id: b'UofMD-landState-MESSAGE-ssp245-2-1-f'
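Since the values listed above are bytes, one possible approach is to decode them to str before storing them anywhere that requires JSON. The following is a minimal sketch, not a tested fix: `decode_bytes_attrs` is a hypothetical helper, UTF-8 is assumed (the `target_mip` value contains UTF-8 encoded curly quotes), and non-bytes values are passed through unchanged.

```python
import json


def decode_bytes_attrs(attrs, encoding="utf-8"):
    """Return a plain dict with every bytes value decoded to str."""
    decoded = {}
    for key, value in attrs.items():
        if isinstance(value, bytes):
            # Assumption: the bytes hold text in the given encoding.
            decoded[key] = value.decode(encoding)
        else:
            # Other values are left as they are.
            decoded[key] = value
    return decoded


# A small sample in the style of the attributes listed above.
sample = {
    "Conventions": b"CF-1.6",
    "target_mip": b"[\xe2\x80\x98ScenarioMIP\xe2\x80\x99]",
    "_nc3_strict": 1,
}
clean = decode_bytes_attrs(sample)
as_json = json.dumps(clean, sort_keys=True)  # no TypeError for the decoded values
```

With the real file, something like `dst.attrs.update(decode_bytes_attrs(src.attrs))` might then work after the arrays are copied, though that call is untested here and other non-JSON value types may still need converting.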

For completeness, here is the backtrace.

>>> zarr.copy_all(src, dst, log=stdout, without_attrs=True, dry_run=True, if_exists='replace')
copy /bounds (2,) >f4
copy /c3ann (86, 720, 1440) float32
copy /c3nfx (86, 720, 1440) float32
copy /c3per (86, 720, 1440) float32
copy /c4ann (86, 720, 1440) float32
copy /c4per (86, 720, 1440) float32
copy /lat (720,) float64
copy /lat_bounds (720, 2) float64
copy /lon (1440,) float64
copy /lon_bounds (1440, 2) float64
copy /pastr (86, 720, 1440) float32
copy /primf (86, 720, 1440) float32
copy /primn (86, 720, 1440) float32
copy /range (86, 720, 1440) float32
copy /secdf (86, 720, 1440) float32
copy /secdn (86, 720, 1440) float32
copy /secma (86, 720, 1440) float32
copy /secmb (86, 720, 1440) float32
copy /time (86,) float64
copy /time_bnds (86, 2) int32
copy /urban (86, 720, 1440) float32
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ricardog/.pyenv/versions/dask/lib/python3.8/site-packages/zarr/convenience.py", line 1068, in copy_all
    dest.attrs.update(**source.attrs)
  File "/Users/ricardog/.pyenv/versions/dask/lib/python3.8/site-packages/zarr/attrs.py", line 119, in update
    self._write_op(self._update_nosync, *args, **kwargs)
  File "/Users/ricardog/.pyenv/versions/dask/lib/python3.8/site-packages/zarr/attrs.py", line 73, in _write_op
    return f(*args, **kwargs)
  File "/Users/ricardog/.pyenv/versions/dask/lib/python3.8/site-packages/zarr/attrs.py", line 130, in _update_nosync
    self._put_nosync(d)
  File "/Users/ricardog/.pyenv/versions/dask/lib/python3.8/site-packages/zarr/attrs.py", line 112, in _put_nosync
    self.store[self.key] = json_dumps(d)
  File "/Users/ricardog/.pyenv/versions/dask/lib/python3.8/site-packages/zarr/util.py", line 29, in json_dumps
    return json.dumps(o, indent=4, sort_keys=True, ensure_ascii=True,
  File "/Users/ricardog/.pyenv/versions/3.8.0/lib/python3.8/json/__init__.py", line 234, in dumps
    return cls(
  File "/Users/ricardog/.pyenv/versions/3.8.0/lib/python3.8/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/Users/ricardog/.pyenv/versions/3.8.0/lib/python3.8/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Users/ricardog/.pyenv/versions/3.8.0/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/ricardog/.pyenv/versions/3.8.0/lib/python3.8/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/Users/ricardog/.pyenv/versions/3.8.0/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bytes_ is not JSON serializable
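The traceback points at `dest.attrs.update(**source.attrs)` inside `copy_all`, reached after every array has been logged, which suggests that in this version the root attributes are copied regardless of `without_attrs`. A possible workaround, sketched below under that assumption, is to copy each top-level member individually so the root attributes are never touched. `copy_children_without_attrs` is a hypothetical helper; the copy function is passed in as an argument so the sketch runs without h5py or zarr, and the stand-in objects are illustrative only.

```python
def copy_children_without_attrs(source, dest, copy_func, **copy_kwargs):
    """Copy each top-level member of `source` into `dest`, skipping attributes.

    Hypothetical helper: unlike copy_all(), it never reads or writes the
    root attributes of `source`.
    """
    results = []
    for name in source.keys():
        results.append(
            copy_func(source[name], dest, name=name, without_attrs=True, **copy_kwargs)
        )
    return results


# Stand-in objects so the sketch runs without h5py or zarr installed.
fake_source = {"lat": [0.0, 1.0], "lon": [2.0, 3.0]}
fake_dest = {}


def fake_copy(obj, dest, name=None, without_attrs=False, **kwargs):
    # Mimics the calling convention only; it does no real zarr work.
    dest[name] = list(obj)
    return name, without_attrs


results = copy_children_without_attrs(fake_source, fake_dest, fake_copy)
```

With the real objects, the call would presumably be `copy_children_without_attrs(src, dst, zarr.copy, log=stdout, if_exists='replace')`, relying on the `name`, `without_attrs`, `log` and `if_exists` parameters of `zarr.copy`. This has not been run against the LUH2 file.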

Version and installation information

  • zarr: 2.8.0
  • numcodecs: 0.7.3
  • Python: 3.8.0
  • OS: Mac
  • Zarr installed with: pip

I don't know whether it is relevant, but here is the output of pip list:

Package           Version
----------------- ---------
affine            2.3.0
asciitree         0.3.3
attrs             20.3.0
blosc             1.9.2
bokeh             2.3.1
certifi           2020.12.5
cftime            1.4.1
click             7.1.2
click-plugins     1.1.1
cligj             0.7.1
cloudpickle       1.6.0
cycler            0.10.0
dask              2021.4.0
distributed       2021.4.0
fasteners         0.16
Fiona             1.8.19
fsspec            2021.4.0
geopandas         0.9.0
h5netcdf          0.11.0
h5py              3.2.1
HeapDict          1.0.1
Jinja2            2.11.3
kiwisolver        1.3.1
locket            0.2.1
lz4               3.1.1
MarkupSafe        1.1.1
matplotlib        3.4.1
msgpack           1.0.0
munch             2.5.0
netCDF4           1.5.6
numcodecs         0.7.3
numpy             1.18.1
packaging         20.9
pandas            1.2.4
partd             1.2.0
Pillow            8.2.0
pip               21.0.1
psutil            5.8.0
pyflakes          2.3.1
pyparsing         2.4.7
pyproj            3.0.1
python-dateutil   2.8.1
pytz              2021.1
PyYAML            5.4.1
rasterio          1.2.2
rioxarray         0.3.1
scipy             1.6.2
setuptools        56.0.0
Shapely           1.7.1
six               1.15.0
snuggs            1.4.7
sortedcontainers  2.3.0
tblib             1.7.0
toolz             0.11.1
tornado           6.1
typing-extensions 3.7.4.3
wheel             0.36.2
xarray            0.17.0
zarr              2.8.0
zict              2.0.0