Description
What happened?
It seems netcdf4 does not currently work well with s3fs (the FUSE filesystem layer over S3-compatible storage), with either the default netcdf4 engine or with h5netcdf.
Here is an example:
import numpy as np
import xarray as xr
from datetime import datetime, timedelta

NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals], dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')
On my system this code crashes with NTIMES=48 but completes without error with NTIMES=24. The output with NTIMES=48 is:
There are 1 HDF5 objects open!
Report: open objects on 72057594037927936
Segmentation fault (core dumped)
I have also tried the other engine that handles NETCDF4 in xarray, engine='h5netcdf', and likewise got a segfault.
A quick workaround seems to be to use the local filesystem to write the NetCDF file and then move the complete file to S3.
import shutil

ds.to_netcdf(path='/tmp/test_netcdf.nc', format='NETCDF4', mode='w')
shutil.move('/tmp/test_netcdf.nc', '/my_s3_fs/test_netcdf.nc')
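The workaround can be wrapped in a small reusable helper. This is a hypothetical sketch (the function name and signature are mine, not part of xarray): it accepts any callable that writes a complete file to a local path, writes to a local temp file first, and only then moves the finished file onto the (possibly s3fs-mounted) destination.

```python
import os
import shutil
import tempfile

def write_local_then_move(dest_path, write_fn):
    # Write the file completely on the local filesystem first, then move
    # the finished file onto the destination path (e.g. the s3fs mount).
    # `write_fn` is any callable that writes a complete file to the local
    # path it is given, for example:
    #   lambda p: ds.to_netcdf(path=p, format='NETCDF4', mode='w')
    fd, tmp_path = tempfile.mkstemp(suffix=os.path.splitext(dest_path)[1])
    os.close(fd)
    try:
        write_fn(tmp_path)
        shutil.move(tmp_path, dest_path)
    finally:
        # Clean up the temp file if the write failed before the move.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With this helper, the NetCDF write above becomes `write_local_then_move('/my_s3_fs/test_netcdf.nc', lambda p: ds.to_netcdf(path=p, format='NETCDF4', mode='w'))`, keeping all HDF5 I/O off the FUSE mount.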
There are several pieces of software involved here: xarray (0.16.1), netCDF4 (1.5.4), HDF5 (1.10.6), and s3fs (1.79). If the bug is in the underlying libraries rather than in my code, it is most likely not an xarray bug; but since it fails with both NETCDF4 engines, I decided to report it here.
What did you expect to happen?
With NTIMES=24 I get a file /my_s3_fs/test_netcdf.nc of about 7.8 MB. With NTIMES=36 I get an empty file. I would expect this code to run without a segfault and produce a non-empty file.
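As a sanity check on those sizes (my own back-of-the-envelope arithmetic, not output from the libraries): the v1 variable alone is NTIMES × 201 × 201 float64 values at 8 bytes each, which is consistent with the ~7.8 MB file observed for NTIMES=24.

```python
# Uncompressed size of the float64 data variable v1 (8 bytes per value).
nbytes_24 = 24 * 201 * 201 * 8
nbytes_48 = 48 * 201 * 201 * 8
print(nbytes_24)  # 7756992  (~7.8 MB, matching the NTIMES=24 file)
print(nbytes_48)  # 15513984 (~15.5 MB expected for NTIMES=48)
```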
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr
from datetime import datetime, timedelta

NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals], dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
There are 1 HDF5 objects open!
Report: open objects on 72057594037927936
Segmentation fault (core dumped)
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.8.3 | packaged by conda-forge | (default, Jun 1 2020, 17:43:00)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: 1.0.2
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: None
matplotlib: 3.3.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 50.3.0.post20201006
pip: 20.2.3
conda: 22.9.0
pytest: 6.1.1
IPython: 7.18.1
sphinx: None