Loading datasets of numpy string arrays leads to error and/or segfault #5706

Closed
scottstanie opened this issue Aug 13, 2021 · 8 comments

@scottstanie
Contributor

scottstanie commented Aug 13, 2021

What happened:
NumPy arrays of strings that were saved with h5py cause errors or segfaults when loaded, and the result is not always the same.

What you expected to happen:

This works fine with engine='h5netcdf':

In [3]: ds = xr.load_dataset("test_str_list.h5", engine='h5netcdf', phony_dims='sort')

but will consistently have a segfault with engine='netcdf4'.

I'm assuming this is a netcdf backend issue, but thought I'd raise it here since xarray was how I discovered it.

Minimal Complete Verifiable Example:

import h5py
import numpy as np
import xarray as xr

with h5py.File("test_str_list.h5", "w") as hf:
    hf["pairs"] = np.array([["20200101", "20200201"], ["20200101", "20200301"]]).astype("S")

ds = xr.load_dataset("test_str_list.h5")
*** Error in `/home/scott/miniconda3/envs/mapping/bin/python': munmap_chunk(): invalid pointer: 0x0000559c40956070 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f7c4)[0x7f4a9a6bb7c4]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/h5py/../../../libhdf5.so.103(H5MM_xfree+0xf)[0x7f4a7a93c3ef]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/h5py/../../../libhdf5.so.103(H5C__untag_entry+0xc6)[0x7f4a7a854836]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/h5py/../../../libhdf5.so.103(H5C__flush_single_entry+0x275)[0x7f4a7a846085]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/h5py/../../../libhdf5.so.103(+0x80de3)[0x7f4a7a846de3]
... (few thousand line backtrace)
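With a backtrace this long, it can help to summarize which shared libraries the frames come from. The following is a minimal sketch that assumes the glibc frame format shown above, `path(symbol+offset)[address]`; the helper names `parse_frame` and `count_libraries` are made up for illustration and are not part of any library.

```python
import os
import re
from collections import Counter

# Matches glibc-style frames such as:
#   /lib64/libc.so.6(+0x7f7c4)[0x7f4a9a6bb7c4]
FRAME_RE = re.compile(
    r"^(?P<path>[^(\[]+)\((?P<sym>[^)]*)\)\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Return (library file name, symbol or None), or None for non-frame lines."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    # "H5MM_xfree+0xf" -> "H5MM_xfree"; "+0x7f7c4" -> no symbol name available
    symbol, _, _offset = match.group("sym").partition("+")
    return os.path.basename(match.group("path")), symbol or None


def count_libraries(lines):
    """Count how many frames belong to each shared library."""
    counts = Counter()
    for line in lines:
        frame = parse_frame(line)
        if frame is not None:
            counts[frame[0]] += 1
    return counts
```

Feeding the saved backtrace lines to count_libraries gives a quick view of how many frames sit in libhdf5, libnetcdf, and the Python interpreter itself.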

Anything else we need to know?:

Even stranger, it doesn't seem to be deterministic. After the crash, I tried the same load_dataset:

In [2]: ds = xr.load_dataset("test_str_list.h5")
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-475169bc9c75> in <module>
----> 1 ds = xr.load_dataset("test_str_list.h5")

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/backends/api.py in load_dataset(filename_or_obj, **kwargs)
    242
    243     with open_dataset(filename_or_obj, **kwargs) as ds:
--> 244         return ds.load()
    245
    246

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/core/dataset.py in load(self, **kwargs)
    871         for k, v in self.variables.items():
    872             if k not in lazy_data:
--> 873                 v.load()
    874
    875         return self

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/core/variable.py in load(self, **kwargs)
    449             self._data = as_compatible_data(self._data.compute(**kwargs))
    450         elif not is_duck_array(self._data):
--> 451             self._data = np.asarray(self._data)
    452         return self
    453

~/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84
     85

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/core/indexing.py in __array__(self, dtype)
    546
    547     def __array__(self, dtype=None):
--> 548         self._ensure_cached()
    549         return np.asarray(self.array, dtype=dtype)
    550

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/core/indexing.py in _ensure_cached(self)
    543     def _ensure_cached(self):
    544         if not isinstance(self.array, NumpyIndexingAdapter):
--> 545             self.array = NumpyIndexingAdapter(np.asarray(self.array))
    546
    547     def __array__(self, dtype=None):

~/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84
     85

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/core/indexing.py in __array__(self, dtype)
    516
    517     def __array__(self, dtype=None):
--> 518         return np.asarray(self.array, dtype=dtype)
    519
    520     def __getitem__(self, key):

~/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84
     85

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/core/indexing.py in __array__(self, dtype)
    417     def __array__(self, dtype=None):
    418         array = as_indexable(self.array)
--> 419         return np.asarray(array[self.key], dtype=None)
    420
    421     def transpose(self, order):

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/backends/netCDF4_.py in __getitem__(self, key)
     89
     90     def __getitem__(self, key):
---> 91         return indexing.explicit_indexing_adapter(
     92             key, self.shape, indexing.IndexingSupport.OUTER, self._getitem
     93         )

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/core/indexing.py in explicit_indexing_adapter(key, shape, indexing_support, raw_indexing_method)
    708     """
    709     raw_key, numpy_indices = decompose_indexer(key, shape, indexing_support)
--> 710     result = raw_indexing_method(raw_key.tuple)
    711     if numpy_indices.tuple:
    712         # index the loaded np.ndarray

~/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/backends/netCDF4_.py in _getitem(self, key)
    102             with self.datastore.lock:
    103                 original_array = self.get_array(needs_lock=False)
--> 104                 array = getitem(original_array, key)
    105         except IndexError:
    106             # Catch IndexError in netCDF4 and return a more informative

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__getitem__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._get()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 0: invalid continuation byte

But then, immediately after, another segfault:

In [4]: ds = xr.load_dataset("test_str_list.h5", engine='netcdf4')
*** Error in `/home/scott/miniconda3/envs/mapping/bin/python': corrupted size vs. prev_size: 0x000055f97e7194a0 ***
======= Backtrace: =========
Beginning of segfault stack trace, but goes on
======= Backtrace: =========
/lib64/libc.so.6(+0x7f7c4)[0x7f1ba11a87c4]
/lib64/libc.so.6(+0x818bb)[0x7f1ba11aa8bb]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/h5py/../../../libhdf5.so.103(H5MM_xfree+0xf)[0x7f1b8142d3ef]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/h5py/../../../libhdf5.so.103(H5S_close+0x84)[0x7f1b814a69a4]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/h5py/../../../libhdf5.so.103(H5I_dec_ref+0x77)[0x7f1b8141a407]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/h5py/../../../libhdf5.so.103(H5I_dec_app_ref+0x29)[0x7f1b8141a4d9]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/h5py/../../../libhdf5.so.103(H5Sclose+0x73)[0x7f1b814a7023]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/rasterio/../../.././libnetcdf.so.18(NC4_get_vars+0x5ad)[0x7f1b7bbc46ad]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/rasterio/../../.././libnetcdf.so.18(NC4_get_vara+0x12)[0x7f1b7bbc4e62]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/rasterio/../../.././libnetcdf.so.18(NC_get_vara+0x6f)[0x7f1b7bb6b5df]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/rasterio/../../.././libnetcdf.so.18(nc_get_vara+0x8b)[0x7f1b7bb6c35b]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/netCDF4/_netCDF4.cpython-38-x86_64-linux-gnu.so(+0xccf21)[0x7f1b4d0daf21]
/home/scott/miniconda3/envs/mapping/bin/python(+0x13a77e)[0x55f97aeca77e]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/netCDF4/_netCDF4.cpython-38-x86_64-linux-gnu.so(+0x224fd)[0x7f1b4d0304fd]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/netCDF4/_netCDF4.cpython-38-x86_64-linux-gnu.so(+0x559d9)[0x7f1b4d0639d9]
/home/scott/miniconda3/envs/mapping/bin/python(PyObject_GetItem+0x48)[0x55f97af10aa8]
/home/scott/miniconda3/envs/mapping/bin/python(+0x139acd)[0x55f97aec9acd]
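Because the failure mode varies between runs (a segfault one time, a UnicodeDecodeError the next), one way to probe it repeatedly without losing the interactive session is to run each attempt in a child interpreter. This is only a sketch: `run_isolated` and `LOAD_SNIPPET` are hypothetical names, and the signal handling assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=120):
    """Run `snippet` in a fresh interpreter and describe how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        errors="replace",  # crash output may contain non-UTF-8 bytes
        timeout=timeout,
    )
    if proc.returncode < 0:
        # POSIX: the child died from a signal (e.g. SIGSEGV or SIGABRT)
        return f"killed by signal {-proc.returncode}"
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        return "error: " + (lines[-1] if lines else f"exit code {proc.returncode}")
    return "ok"


# The load from this report, written as a string for the child interpreter.
LOAD_SNIPPET = (
    "import xarray as xr; "
    "xr.load_dataset('test_str_list.h5', engine='netcdf4')"
)
```

Calling run_isolated(LOAD_SNIPPET) several times in a loop should then show how often each outcome occurs on a given installation.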

Environment:

Output of xr.show_versions()

In [1]: xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.5 | packaged by conda-forge | (default, Aug 29 2020, 01:22:49)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1062.4.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.1.0
numpy: 1.19.2
scipy: 1.5.3
netCDF4: 1.5.4
pydap: None
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.8.3
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.5
cfgrib: 0.9.8.5
iris: None
bottleneck: 1.3.2
dask: 2021.01.0
distributed: 2.20.0
matplotlib: 3.3.1
cartopy: 0.17.0
seaborn: None
numbagg: None
pint: 0.17
setuptools: 50.3.2
pip: 21.1.3
conda: 4.8.4
pytest: None
IPython: 7.18.1
sphinx: 4.0.2

@kmuehlbauer
Contributor

@scottstanie Could you please provide the output of h5dump test_str_list.h5? I've a hunch but want to be sure. Also, what is the output with ncdump?

@scottstanie
Contributor Author

scottstanie commented Jan 12, 2022

sure! here it is:

$ h5dump test_str_list.h5
HDF5 "test_str_list.h5" {
GROUP "/" {
   DATASET "pairs" {
      DATATYPE  H5T_STRING {
         STRSIZE 8;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      DATA {
      (0,0): "20200101", "20200201",
      (1,0): "20200101", "20200301"
      }
   }
}
}

(and just to include the specific traceback that happened now, in case my versions are different from what I showed):

In [4]: import h5py
   ...: import xarray as xr
   ...:
   ...: with h5py.File("test_str_list.h5", "w") as hf:
   ...:     hf["pairs"] = np.array([["20200101", "20200201"], ["20200101", "20200301"]]).astype("S")
   ...:
   ...: ds = xr.load_dataset("test_str_list.h5")
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/xarray/backends/plugins.py:68: RuntimeWarning: Engine 'cfgrib' loading failed:
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/gribapi/_bindings.cpython-38-x86_64-linux-gnu.so: undefined symbol: codes_bufr_key_is_header
  warnings.warn(f"Engine {name!r} loading failed:\n{ex}", RuntimeWarning)
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/fsspec/implementations/local.py:29: FutureWarning: The default value of auto_mkdir=True has been deprecated and will be changed to auto_mkdir=False by default in a future release.
  warnings.warn(
*** Error in `/home/scott/miniconda3/envs/mapping/bin/python': free(): invalid next size (fast): 0x00005564b64622a0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81679)[0x7f56e752b679]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/netCDF4/../../../libnetcdf.so.18(nc_free_string+0x25)[0x7f54cf53d1a5]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/netCDF4/_netCDF4.cpython-38-x86_64-linux-gnu.so(+0xcf3c8)[0x7f54cf7313c8]
/home/scott/miniconda3/envs/mapping/bin/python(PyCFunction_Call+0x54)[0x5564b397df44]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/netCDF4/_netCDF4.cpython-38-x86_64-linux-gnu.so(+0x224fd)[0x7f54cf6844fd]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/netCDF4/_netCDF4.cpython-38-x86_64-linux-gnu.so(+0x559d9)[0x7f54cf6b79d9]
/home/scott/miniconda3/envs/mapping/bin/python(PyObject_GetItem+0x45)[0x5564b39d7935]
/home/scott/miniconda3/envs/mapping/bin/python(+0x128e0b)[0x5564b397ae0b]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalFrameDefault+0x947)[0x5564b3a1ec77]
/home/scott/miniconda3/envs/mapping/bin/python(+0x1b0736)[0x5564b3a02736]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalFrameDefault+0x947)[0x5564b3a1ec77]
/home/scott/miniconda3/envs/mapping/bin/python(_PyFunction_Vectorcall+0x1a6)[0x5564b3a01fc6]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalFrameDefault+0x4e03)[0x5564b3a23133]
/home/scott/miniconda3/envs/mapping/bin/python(_PyFunction_Vectorcall+0x1a6)[0x5564b3a01fc6]
/home/scott/miniconda3/envs/mapping/bin/python(+0x1800cd)[0x5564b39d20cd]
/home/scott/miniconda3/envs/mapping/bin/python(PyObject_GetItem+0x45)[0x5564b39d7935]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalFrameDefault+0xd53)[0x5564b3a1f083]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalCodeWithName+0x2c3)[0x5564b3a00db3]
/home/scott/miniconda3/envs/mapping/bin/python(_PyFunction_Vectorcall+0x378)[0x5564b3a02198]
/home/scott/miniconda3/envs/mapping/bin/python(+0x1b0841)[0x5564b3a02841]
/home/scott/miniconda3/envs/mapping/bin/python(+0x12404d)[0x5564b397604d]
/home/scott/miniconda3/envs/mapping/bin/python(_PyObject_CallFunction_SizeT+0x99)[0x5564b39761f9]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa11fd)[0x7f56dddfe1fd]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa54d7)[0x7f56dde024d7]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x8a2d5)[0x7f56ddde72d5]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x8adc4)[0x7f56ddde7dc4]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa559a)[0x7f56dde0259a]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa5ac9)[0x7f56dde02ac9]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x13f2b7)[0x7f56dde9c2b7]
/home/scott/miniconda3/envs/mapping/bin/python(+0x129082)[0x5564b397b082]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalFrameDefault+0x181e)[0x5564b3a1fb4e]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalCodeWithName+0x2c3)[0x5564b3a00db3]
/home/scott/miniconda3/envs/mapping/bin/python(_PyFunction_Vectorcall+0x378)[0x5564b3a02198]
/home/scott/miniconda3/envs/mapping/bin/python(+0x1b0841)[0x5564b3a02841]
/home/scott/miniconda3/envs/mapping/bin/python(+0x12404d)[0x5564b397604d]
/home/scott/miniconda3/envs/mapping/bin/python(_PyObject_CallFunction_SizeT+0x99)[0x5564b39761f9]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa11fd)[0x7f56dddfe1fd]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa54d7)[0x7f56dde024d7]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x8a2d5)[0x7f56ddde72d5]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x8adc4)[0x7f56ddde7dc4]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa559a)[0x7f56dde0259a]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa5ac9)[0x7f56dde02ac9]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x13f2b7)[0x7f56dde9c2b7]
/home/scott/miniconda3/envs/mapping/bin/python(+0x129082)[0x5564b397b082]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalFrameDefault+0x4e03)[0x5564b3a23133]
/home/scott/miniconda3/envs/mapping/bin/python(_PyFunction_Vectorcall+0x1a6)[0x5564b3a01fc6]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalFrameDefault+0xa63)[0x5564b3a1ed93]
/home/scott/miniconda3/envs/mapping/bin/python(_PyEval_EvalCodeWithName+0x2c3)[0x5564b3a00db3]
/home/scott/miniconda3/envs/mapping/bin/python(_PyFunction_Vectorcall+0x378)[0x5564b3a02198]
/home/scott/miniconda3/envs/mapping/bin/python(+0x1b0841)[0x5564b3a02841]
/home/scott/miniconda3/envs/mapping/bin/python(+0x12404d)[0x5564b397604d]
/home/scott/miniconda3/envs/mapping/bin/python(_PyObject_CallFunction_SizeT+0x99)[0x5564b39761f9]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa11fd)[0x7f56dddfe1fd]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xa54d7)[0x7f56dde024d7]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x8a2d5)[0x7f56ddde72d5]
/home/scott/miniconda3/envs/mapping/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
Aborted (core dumped)

xr.show_versions

In [2]: xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1062.4.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.20.2
pandas: 1.1.0
numpy: 1.21.2
scipy: 1.5.3
netCDF4: 1.5.4
pydap: None
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.8.3
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.6
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.01.0
distributed: 2.20.0
matplotlib: 3.3.1
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
fsspec: 0.6.3
cupy: 9.0.0
pint: 0.17
sparse: None
setuptools: 50.3.2
pip: 21.2.4
conda: 4.8.4
pytest: 6.2.4
IPython: 7.18.1
sphinx: 4.0.2

@kmuehlbauer
Contributor

@scottstanie Here is the output of ncdump:

netcdf test_str_list {
dimensions:
	phony_dim_0 = 2 ;
	phony_dim_1 = 2 ;
variables:
	string pairs(phony_dim_0, phony_dim_1) ;
data:

 pairs =
  "2020010120200201�\f\033��U", NIL,
  "2020010120200301 ", NIL ;
}

You see the trailing garbage. This is obviously a problem with netcdf-c/netcdf4-python, as it is not there with pure hdf5 (h5py/h5netcdf).
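To illustrate how trailing garbage of this kind can arise in general, here is a pure-Python sketch. It assumes (this thread does not confirm it as the actual netcdf-c root cause) that a fixed-size NULLPAD buffer is being read as if each string were NUL-terminated. The helper names and the trailing "memory" bytes are made up for illustration.

```python
FIELD_SIZE = 8  # STRSIZE reported by h5dump

values = [b"20200101", b"20200201", b"20200101", b"20200301"]

# NULLPAD storage: every value is padded with NULs up to FIELD_SIZE and the
# fields are stored back to back. These values are exactly 8 bytes long,
# so the buffer contains no NUL byte at all.
packed = b"".join(v.ljust(FIELD_SIZE, b"\x00") for v in values)


def read_fixed(buf, index, size=FIELD_SIZE):
    """Read element `index` by slicing exactly `size` bytes, then drop padding."""
    start = index * size
    return buf[start:start + size].rstrip(b"\x00")


def read_until_nul(buf, offset):
    """Read the way C's strlen() would: continue until the first NUL byte."""
    end = buf.find(b"\x00", offset)
    return buf[offset:] if end == -1 else buf[offset:end]


# Stand-in for whatever bytes happen to follow the buffer in memory (made up).
memory = packed + b"\xd8\x0c\x1b\x00"

print([read_fixed(packed, i) for i in range(4)])  # the four clean values
print(read_until_nul(memory, 0))  # runs past the field and into the stray bytes
```

In this toy model the size-aware reader returns the four dates, while the NUL-scanning reader returns one long string ending in bytes that are not valid UTF-8, which would be consistent with garbage that changes from run to run.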

But there is a difference between attributes and datasets:

import h5py
import netCDF4 as nc
import numpy as np
import xarray as xr

with h5py.File("test_str_list_attr.h5", "w") as hf:
    sid = h5py.h5s.create_simple((2, 2), (2, 2))
    tid1 = h5py.h5t.TypeID.copy(h5py.h5t.C_S1)
    tid1.set_size(8)
    tid1.set_strpad(h5py.h5t.STR_NULLPAD)
    
    tid2 = h5py.h5t.TypeID.copy(h5py.h5t.C_S1)
    tid2.set_size(9)
    tid2.set_strpad(h5py.h5t.STR_NULLTERM)
    
    blob = np.array([["20200101", "20200201"], ["20200101", "20200301"]]).astype("S")
    
    # Attributes
    aid = h5py.h5a.create(hf.id, b"NULLPAD", tid1, sid)
    ret = aid.write(blob)
    
    aid = h5py.h5a.create(hf.id, b"NULLTERM", tid2, sid)
    ret = aid.write(blob)
    
    hf.attrs["numpy_S"] = blob
    hf.attrs["numpy_O"] = blob.astype("O")
    
    
!h5dump test_str_list_attr.h5
!ncdump test_str_list_attr.h5

with xr.load_dataset("test_str_list_attr.h5", engine="h5netcdf", phony_dims="sort") as ds:
    display(ds)
with xr.load_dataset("test_str_list_attr.h5", engine="netcdf4") as ds:
    display(ds)
with nc.Dataset("test_str_list_attr.h5") as ds:
    display(ds)
    display(ds.NULLTERM)
    display(ds.NULLPAD)
    display(ds.numpy_O)
    display(ds.numpy_S)

Output:
HDF5 "test_str_list_attr.h5" {
GROUP "/" {
   ATTRIBUTE "NULLPAD" {
      DATATYPE  H5T_STRING {
         STRSIZE 8;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      DATA {
      (0,0): "20200101", "20200201",
      (1,0): "20200101", "20200301"
      }
   }
   ATTRIBUTE "NULLTERM" {
      DATATYPE  H5T_STRING {
         STRSIZE 9;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      DATA {
      (0,0): "20200101", "20200201",
      (1,0): "20200101", "20200301"
      }
   }
   ATTRIBUTE "numpy_O" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      DATA {
      (0,0): "20200101", "20200201",
      (1,0): "20200101", "20200301"
      }
   }
   ATTRIBUTE "numpy_S" {
      DATATYPE  H5T_STRING {
         STRSIZE 8;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      DATA {
      (0,0): "20200101", "20200201",
      (1,0): "20200101", "20200301"
      }
   }
}
}
netcdf test_str_list_attr {

// global attributes:
		string :NULLPAD = "20200101", "20200201", "20200101", "20200301" ;
		string :NULLTERM = "20200101", "20200201", "20200101", "20200301" ;
		string :numpy_S = "20200101", "20200201@�s}�U", "20200101", "20200301�6t}�U" ;
		string :numpy_O = "20200101", "20200201", "20200101", "20200301" ;
}
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    NULLPAD:   [[b'20200101' b'20200201']\n [b'20200101' b'20200301']]
    NULLTERM:  [[b'20200101' b'20200201']\n [b'20200101' b'20200301']]
    numpy_O:   [['20200101' '20200201']\n ['20200101' '20200301']]
    numpy_S:   [[b'20200101' b'20200201']\n [b'20200101' b'20200301']]
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    NULLPAD:   ['20200101', '20200201', '20200101', '20200301']
    NULLTERM:  ['20200101', '20200201', '20200101', '20200301']
    numpy_S:   ['20200101', '20200201', '20200101p��i�U', '20200301']
    numpy_O:   ['20200101', '20200201', '20200101', '20200301']
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    NULLPAD: ['20200101', '20200201', '20200101', '20200301']
    NULLTERM: ['20200101', '20200201', '20200101', '20200301']
    numpy_S: ['20200101', '20200201', '20200101', '20200301']
    numpy_O: ['20200101', '20200201', '20200101', '20200301']
    dimensions(sizes): 
    variables(dimensions): 
    groups: 
['20200101', '20200201', '20200101', '20200301']
['20200101', '20200201', '20200101', '20200301']
['20200101', '20200201', '20200101', '20200301']
['20200101', '20200201', '20200101', '20200301']

The HDF5 dump clearly shows that the data are stored correctly, but netcdf-c somehow has issues with the NULLPAD/NULLTERM strings. At least there is no segfault with attributes, unlike with datasets/variables:

import h5py
import netCDF4 as nc  # only used by the commented-out block below
import numpy as np
import xarray as xr

with h5py.File("test_str_list_ds.h5", "w") as hf:
    blob = np.array([["20200101", "20200201"], ["20200101", "20200301"]]).astype("S")
    
    # Datasets
    sid = h5py.h5s.create_simple((2, 2), (2, 2))
    
    tid3 = h5py.h5t.TypeID.copy(h5py.h5t.C_S1)
    tid3.set_size(8)
    tid3.set_strpad(h5py.h5t.STR_NULLPAD)
    
    tid4 = h5py.h5t.TypeID.copy(h5py.h5t.C_S1)
    tid4.set_size(9)
    tid4.set_strpad(h5py.h5t.STR_NULLTERM)
    
    aid = h5py.h5d.create(hf.id, b"NULLPAD", tid3, sid)
    ret = aid.write(sid, h5py.h5s.ALL, blob)
    
    aid = h5py.h5d.create(hf.id, b"NULLTERM", tid4, sid)
    ret = aid.write(sid, h5py.h5s.ALL, blob)
    
    hf["numpy_S"] = blob
    hf["numpy_O"] = blob.astype("O")
    
!h5dump test_str_list_ds.h5
!ncdump test_str_list_ds.h5    

with xr.load_dataset("test_str_list_ds.h5", engine="h5netcdf", phony_dims="sort") as ds:
    display(ds)

# with xr.load_dataset("test_str_list_ds.h5", engine="netcdf4") as ds:
#     display(ds["numpy_O"])
    
# with nc.Dataset("test_str_list_ds.h5") as ds:
#     display(ds)
#     #display("NULLTERM:", ds["NULLTERM"][:])
#     #display("NULLPAD:", ds["NULLPAD"][:])
#     display("numpy_O", ds["numpy_O"][:])
#     #display("numpy_S", ds["numpy_S"][:])

Output:
HDF5 "test_str_list_ds.h5" {
GROUP "/" {
   DATASET "NULLPAD" {
      DATATYPE  H5T_STRING {
         STRSIZE 8;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      DATA {
      (0,0): "20200101", "20200201",
      (1,0): "20200101", "20200301"
      }
   }
   DATASET "NULLTERM" {
      DATATYPE  H5T_STRING {
         STRSIZE 9;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      DATA {
      (0,0): "20200101", "20200201",
      (1,0): "20200101", "20200301"
      }
   }
   DATASET "numpy_O" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      DATA {
      (0,0): "20200101", "20200201",
      (1,0): "20200101", "20200301"
      }
   }
   DATASET "numpy_S" {
      DATATYPE  H5T_STRING {
         STRSIZE 8;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 2, 2 ) / ( 2, 2 ) }
      DATA {
      (0,0): "20200101", "20200201",
      (1,0): "20200101", "20200301"
      }
   }
}
}
netcdf test_str_list_ds {
dimensions:
	phony_dim_0 = 2 ;
	phony_dim_1 = 2 ;
variables:
	string NULLPAD(phony_dim_0, phony_dim_1) ;
	string NULLTERM(phony_dim_0, phony_dim_1) ;
	string numpy_O(phony_dim_0, phony_dim_1) ;
	string numpy_S(phony_dim_0, phony_dim_1) ;
data:

 NULLPAD =
  "2020010120200201�4k�U", NIL,
  "2020010120200301 ", NIL ;

 NULLTERM =
  "20200101", NIL,
  "20200101", NIL ;

 numpy_O =
  "20200101", "20200201",
  "20200101", "20200301" ;

 numpy_S =
  "2020010120200201", NIL,
  "2020010120200301 ", NIL ;
}
<xarray.Dataset>
Dimensions:   (phony_dim_0: 2, phony_dim_1: 2)
Dimensions without coordinates: phony_dim_0, phony_dim_1
Data variables:
    NULLPAD   (phony_dim_0, phony_dim_1) |S8 b'20200101' ... b'20200301'
    NULLTERM  (phony_dim_0, phony_dim_1) |S9 b'20200101' ... b'20200301'
    numpy_O   (phony_dim_0, phony_dim_1) object '20200101' ... '20200301'
    numpy_S   (phony_dim_0, phony_dim_1) |S8 b'20200101' ... b'20200301'

So here, netcdf-c/netcdf4-python will segfault for all variables except numpy_O.

It looks like the only option that works for datasets/variables is to use the numpy object dtype (as in numpy_O), which h5py stores as variable-length strings.
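The numpy_O variant above is stored with STRSIZE H5T_VARIABLE, and it is the one netcdf-c reads back correctly. A minimal sketch of writing the original pairs array that way with h5py follows. The file name is made up, and note that h5py.string_dtype() defaults to UTF-8, whereas the dumps above show the ASCII character set.

```python
import os
import tempfile

import h5py
import numpy as np

pairs = np.array([["20200101", "20200201"], ["20200101", "20200301"]])

# Hypothetical output location, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "test_str_list_vlen.h5")

with h5py.File(path, "w") as hf:
    # An object array of str plus a variable-length string dtype
    hf.create_dataset("pairs", data=pairs.astype(object), dtype=h5py.string_dtype())

with h5py.File(path, "r") as hf:
    # check_string_dtype returns length=None for variable-length strings
    info = h5py.check_string_dtype(hf["pairs"].dtype)
    stored = hf["pairs"].asstr()[()]

print(info)
print(stored)
```

Whether a file written this way loads cleanly with engine="netcdf4" on an affected installation still has to be checked there; the sketch only shows how to request variable-length storage.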

@scottstanie
Contributor Author

ah sorry, didn't see the request for ncdump.

$ ncdump test_str_list.h5
netcdf test_str_list {
dimensions:
	phony_dim_0 = 2 ;
	phony_dim_1 = 2 ;
variables:
	string pairs(phony_dim_0, phony_dim_1) ;
data:

 pairs =
  "2020010120200201 ", NIL,
  "2020010120200301 ", NIL ;
}

Interesting that my pairs output differs from yours: it lacks the obvious trailing garbage.
Also, when I run your first code snippet, different areas are garbled, with both NULLPAD and numpy_S displaying garbage:

netcdf test_str_list_attr {

// global attributes:
		string :NULLPAD = "20200101�<T��\007", "20200201", "20200101�=T��\007", "20200301" ;
		string :NULLTERM = "20200101", "20200201", "20200101", "20200301" ;
		string :numpy_S = "20200101", "20200201\1775T��\007", "20200101", "20200301�3T��\007" ;
		string :numpy_O = "20200101", "20200201", "20200101", "20200301" ;
}

@kmuehlbauer
Contributor

@scottstanie I'll check my h5py/hdf5 settings, but I doubt that is the difference. I've seen the trailing garbage change from run to run, and sometimes disappear.

@scottstanie
Contributor Author

Sounds good. It seems you're correct that this is a netcdf/netcdf4-python problem, so I'll defer to others on which changes to the default settings would best avoid the segfaults.

@kmuehlbauer
Contributor

Problem source identified in netcdf-c: Unidata/netcdf-c#2159

@kmuehlbauer
Contributor

This is resolved in recent netcdf-c/netcdf4-python and works with recent Xarray.
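The thread does not state which releases contain the fix, so the sketch below only reports what is installed. It uses the standard library's importlib.metadata; the helper name `installed_versions` is made up for illustration.

```python
from importlib import metadata


def installed_versions(packages=("netCDF4", "h5py", "h5netcdf", "xarray")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

The underlying C libraries are not Python distributions, so their versions do not appear here; xr.show_versions(), as used earlier in this thread, reports libhdf5 and libnetcdf.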
