Lazy saving to NetCDF4 fails randomly if an array is used multiple times #6300

Open
pnuu opened this issue Feb 24, 2022 · 1 comment

pnuu commented Feb 24, 2022

What happened?

Saving an xr.Dataset lazily to NetCDF4 (dset.to_netcdf(..., compute=False)) fails intermittently when the same array is either used as a coordinate for multiple variables or saved under different names as a standalone variable. The traceback I get is shown in the log section below.

What did you expect to happen?

The saving should work consistently between different runs.

Minimal Complete Verifiable Example

#!/usr/bin/env python

import datetime as dt

import numpy as np
import dask.array as da
import xarray as xr

COMPUTE = False
FNAME = "xr_test.nc"


def main():
    y = np.arange(1000, dtype=np.uint16)
    x = np.arange(2000, dtype=np.uint16)

    # Create a time array that is used as a Y-coordinate for the data
    now = dt.datetime.utcnow()
    time_arr = np.array([now + dt.timedelta(seconds=i) for i in range(y.size)], dtype=np.datetime64)
    times = xr.DataArray(time_arr, coords={'y': y})

    # Write root
    root = xr.Dataset({}, attrs={'global': 'attribute'})
    written = [root.to_netcdf(FNAME, mode='w')]

    # Write first dataset
    data1 = xr.DataArray(da.random.random((y.size, x.size)), dims=['y', 'x'],
                         coords={'y': y, 'x': x, 'time': times})
    dset1 = xr.Dataset({'data1': data1})
    written.append(dset1.to_netcdf(FNAME, mode='a', compute=COMPUTE))

    # Write second dataset using the same time coordinates
    data2 = xr.DataArray(da.random.random((y.size, x.size)), dims=['y', 'x'],
                         coords={'y': y, 'x': x, 'time': times})
    dset2 = xr.Dataset({'data2': data2})
    written.append(dset2.to_netcdf(FNAME, mode='a', compute=COMPUTE))

    if not COMPUTE:
        da.compute(written)


if __name__ == "__main__":
    main()

Relevant log output

Traceback (most recent call last):
  File "/home/lahtinep/bin/test_lazy_netcdf_saving.py", line 43, in <module>
    main()
  File "/home/lahtinep/bin/test_lazy_netcdf_saving.py", line 39, in main
    da.compute(written)
  File "/home/lahtinep/mambaforge/envs/pytroll/lib/python3.9/site-packages/dask/base.py", line 571, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/lahtinep/mambaforge/envs/pytroll/lib/python3.9/site-packages/dask/threaded.py", line 79, in get
    results = get_async(
  File "/home/lahtinep/mambaforge/envs/pytroll/lib/python3.9/site-packages/dask/local.py", line 507, in get_async
    raise_exception(exc, tb)
  File "/home/lahtinep/mambaforge/envs/pytroll/lib/python3.9/site-packages/dask/local.py", line 315, in reraise
    raise exc
  File "/home/lahtinep/mambaforge/envs/pytroll/lib/python3.9/site-packages/dask/local.py", line 220, in execute_task
    result = _execute_task(task, data)
  File "/home/lahtinep/mambaforge/envs/pytroll/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/lahtinep/mambaforge/envs/pytroll/lib/python3.9/site-packages/dask/array/core.py", line 4099, in store_chunk
    return load_store_chunk(x, out, index, lock, return_stored, False)
  File "/home/lahtinep/mambaforge/envs/pytroll/lib/python3.9/site-packages/dask/array/core.py", line 4086, in load_store_chunk
    out[index] = x
  File "/home/lahtinep/mambaforge/envs/pytroll/lib/python3.9/site-packages/xarray/backends/netCDF4_.py", line 69, in __setitem__
    data[key] = value
  File "src/netCDF4/_netCDF4.pyx", line 4903, in netCDF4._netCDF4.Variable.__setitem__
  File "src/netCDF4/_netCDF4.pyx", line 4073, in netCDF4._netCDF4.Variable.shape.__get__
  File "src/netCDF4/_netCDF4.pyx", line 3462, in netCDF4._netCDF4.Dimension.__len__
  File "src/netCDF4/_netCDF4.pyx", line 1927, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID

Anything else we need to know?

The above script fails intermittently, so it needs to be run several times to reproduce the error. Out of ten runs I got the traceback twice. With COMPUTE = True, the script worked every time (in roughly 100 tries, at least).
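Because the failure is intermittent, it can help to count how often it occurs rather than rerunning by hand. The following is a minimal sketch using only the standard library; the helper name `count_failures` is hypothetical and not part of xarray or dask. It treats any non-zero exit status as a failed run, which is an assumption about how the reproducer script terminates when the traceback is raised.

```python
import subprocess
import sys


def count_failures(script_args, runs=10):
    """Run the current Python interpreter `runs` times and count failed runs.

    `script_args` is the argument list passed to the interpreter, for
    example ["test_lazy_netcdf_saving.py"]. A run counts as failed when
    the process exits with a non-zero status (an uncaught exception such
    as the RuntimeError in the log makes Python exit with status 1).
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, *script_args],
            capture_output=True,  # keep tracebacks out of the terminal
            text=True,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

For example, `count_failures(["test_lazy_netcdf_saving.py"], runs=10)` would report how many of ten runs of the reproducer ended with an error.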

The same behaviour is seen if the time coordinates are removed completely and data1 is also used in dset2 in place of data2.
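The variant without time coordinates might look roughly like the sketch below. It is a reconstruction from the description, not the exact script that was run: the function names `build_datasets` and `write_lazily` are hypothetical, the second variable is assumed to be stored under the name `data2`, and `dask.compute` is used in place of `da.compute`.

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr


def build_datasets(ny=1000, nx=2000):
    """Build two datasets that store the same dask array under different names."""
    y = np.arange(ny, dtype=np.uint16)
    x = np.arange(nx, dtype=np.uint16)
    data1 = xr.DataArray(da.random.random((ny, nx)), dims=["y", "x"],
                         coords={"y": y, "x": x})
    dset1 = xr.Dataset({"data1": data1})
    # Assumption: data1 is reused here in place of a separate data2 array
    dset2 = xr.Dataset({"data2": data1})
    return dset1, dset2


def write_lazily(fname):
    """Write the root and both datasets, deferring the data writes."""
    dset1, dset2 = build_datasets()
    xr.Dataset({}, attrs={"global": "attribute"}).to_netcdf(fname, mode="w")
    delayed = [
        dset1.to_netcdf(fname, mode="a", compute=False),
        dset2.to_netcdf(fname, mode="a", compute=False),
    ]
    # This is the step where the intermittent RuntimeError was reported
    dask.compute(delayed)
```

Calling `write_lazily("xr_test.nc")` repeatedly should exercise the same code path as the original script, although whether and how often it fails may depend on the installed versions.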

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:41:03)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-30-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.20.2
pandas: 1.3.5
numpy: 1.22.0
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.13.0
h5py: 3.6.0
Nio: None
zarr: 2.10.3
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2022.01.0
distributed: 2022.01.0
matplotlib: 3.5.1
cartopy: 0.20.2
seaborn: 0.11.2
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
setuptools: 59.8.0
pip: 21.3.1
conda: None
pytest: 6.2.5
IPython: 8.0.0
sphinx: 4.3.2

pnuu added bug, needs triage labels Feb 24, 2022
dcherian added needs triage, topic-backends and removed needs triage labels Feb 24, 2022
dcherian removed the needs triage label Apr 9, 2022

gerritholl (Contributor) commented

I experience the same problem under the same circumstances. My versions:

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-305.12.1.el8_4.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 0.19.0
pandas: 1.5.0
numpy: 1.23.3
scipy: 1.9.1
netCDF4: 1.6.1
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: 2.13.3
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.2
cfgrib: None
iris: None
bottleneck: None
dask: 2021.12.0
distributed: 2022.9.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 65.4.1
pip: 22.2.2
conda: None
pytest: None
IPython: 8.5.0
sphinx: None
