flexible codecs cannot handle compound dtypes containing objects #333

martindurant · 2022-07-06T14:21:03Z

This functionality is missing, though users could reasonably expect it. If a zarr array has a compound dtype (i.e., records) and any of the fields is of object type (e.g., strings), then the data cannot be round-tripped, even though JSON/msgpack/pickle are capable of converting the array.

Minimal, reproducible code sample, a copy-pastable example if possible

>>> import numpy as np
>>> import numcodecs
>>> import zarr
>>> a = np.array([('aaa', 1, 4.2),
...               ('bbb', 2, 8.4),
...               ('ccc', 3, 12.6)],
...              dtype=[('foo', 'O'), ('bar', 'i4'), ('baz', 'f8')])
>>> z = zarr.array(a, object_codec=numcodecs.JSON(), fill_value=None)
>>> z["foo"]
ValueError: setting an array element with a sequence.

(without fill_value, this errors earlier due to the use of np.zeros to guess a fill; with an appropriate fill_value=("", 0, 0.), it fails at array creation too)

Problem description

JSON and similar codecs store only dtype.str, which is an opaque "Vxx" void type for compound dtypes. The field names and types are lost, so a suitable empty array cannot be constructed at load time.
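
To illustrate the loss of field information, here is a minimal sketch using only numpy and the standard library. The variable names are chosen for illustration and do not come from numcodecs.

```python
import ast

import numpy as np

dt = np.dtype([("foo", "O"), ("bar", "i4"), ("baz", "f8")])

short = dt.str   # opaque void string such as "|V20"; the field layout is gone
full = str(dt)   # list-of-tuples description that keeps names and types

# Rebuilding from dtype.str yields a plain void dtype with no fields.
lossy = np.dtype(short)

# Rebuilding from str(dtype) recovers the original compound dtype.
restored = np.dtype(ast.literal_eval(full))

print(short, lossy.names)
print(full, restored == dt)
```

For a simple packed record dtype like this one, str(dtype) is a Python literal, which is why ast.literal_eval can parse it safely. Dtypes with explicit offsets or alignment print differently and are not covered by this sketch.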

The following fixes this for JSON, but looks ugly.

--- a/numcodecs/json.py
+++ b/numcodecs/json.py
@@ -1,3 +1,4 @@
+import ast
 import json as _json
 import textwrap

@@ -56,14 +57,18 @@ class JSON(Codec):
     def encode(self, buf):
         buf = np.asarray(buf)
         items = buf.tolist()
-        items.append(buf.dtype.str)
+        items.append(str(buf.dtype))
         items.append(buf.shape)
         return self._encoder.encode(items).encode(self._text_encoding)

     def decode(self, buf, out=None):
         items = self._decoder.decode(ensure_text(buf, self._text_encoding))
-        dec = np.empty(items[-1], dtype=items[-2])
-        dec[:] = items[:-2]
+        if "[" in items[-2]:
+            dec = np.empty(items[-1], dtype=ast.literal_eval(items[-2]))
+            dec[:] = [tuple(_) for _ in items[:-2]]
+        else:
+            dec = np.empty(items[-1], dtype=items[-2])
+            dec[:] = items[:-2]
         if out is not None:
             np.copyto(out, dec)
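
The idea behind the diff can be tried outside numcodecs with a standalone sketch. The functions encode_records and decode_records below are hypothetical helpers that mimic the patched logic with the standard json module; they are not the numcodecs API, and they assume a 1-D array whose dtype prints as a list of field tuples.

```python
import ast
import json

import numpy as np


def encode_records(arr):
    """Serialize items, then the full dtype description, then the shape."""
    arr = np.asarray(arr)
    items = arr.tolist()
    items.append(str(arr.dtype))  # keeps field names, unlike dtype.str
    items.append(arr.shape)
    return json.dumps(items).encode("utf-8")


def decode_records(buf):
    """Rebuild the array, restoring a compound dtype when one was stored."""
    items = json.loads(buf.decode("utf-8"))
    shape, dtype_str = items[-1], items[-2]
    if "[" in dtype_str:
        # Compound dtype: parse the field list and turn each row back into
        # a tuple, because JSON has no tuple type and returns lists.
        dec = np.empty(shape, dtype=ast.literal_eval(dtype_str))
        dec[:] = [tuple(row) for row in items[:-2]]
    else:
        dec = np.empty(shape, dtype=dtype_str)
        dec[:] = items[:-2]
    return dec


a = np.array(
    [("aaa", 1, 4.2), ("bbb", 2, 8.4), ("ccc", 3, 12.6)],
    dtype=[("foo", "O"), ("bar", "i4"), ("baz", "f8")],
)
out = decode_records(encode_records(a))
print(out["foo"])
```

The tuple conversion matters: numpy assigns a tuple to a whole record, whereas a list of lists is interpreted as an extra array dimension.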

Version and installation information

Please provide the following:

numcodecs.__version__ 0.10.0
Version of Python interpreter 3.8.8
Operating system: Mac
How NumCodecs was installed: pip from source

bnavigator · 2023-01-12T16:20:57Z

Any update here?

martindurant · 2023-01-12T16:22:51Z

None from me. I have a plausible fix, above, which could be added now, but I was hoping someone would come up with something cleaner.

bnavigator · 2023-01-12T16:23:18Z

I am getting this from the test suite:

[   51s] ____________________________ test_non_numpy_inputs _____________________________
[   51s] 
[   51s]     def test_non_numpy_inputs():
[   51s]         # numpy will infer a range of different shapes and dtypes for these inputs.
[   51s]         # Make sure that round-tripping through encode preserves this.
[   51s]         data = [
[   51s]             [0, 1],
[   51s]             [[0, 1], [2, 3]],
[   51s]             [[0], [1], [2, 3]],
[   51s]             [[[0, 0]], [[1, 1]], [[2, 3]]],
[   51s]             ["1"],
[   51s]             ["11", "11"],
[   51s]             ["11", "1", "1"],
[   51s]             [{}],
[   51s]             [{"key": "value"}, ["list", "of", "strings"]],
[   51s]         ]
[   51s]         for input_data in data:
[   51s]             for codec in codecs:
[   51s] >               output_data = codec.decode(codec.encode(input_data))
[   51s] 
[   51s] ../../BUILDROOT/python-numcodecs-0.11.0-0.x86_64/usr/lib64/python3.8/site-packages/numcodecs/tests/test_json.py:72: 
[   51s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[   51s] 
[   51s] self = JSON(encoding='utf-8', allow_nan=True, check_circular=True, ensure_ascii=True,
[   51s]      indent=None, separators=(',', ':'), skipkeys=False, sort_keys=True,
[   51s]      strict=True)
[   51s] buf = [[0], [1], [2, 3]]
[   51s] 
[   51s]     def encode(self, buf):
[   51s] >       buf = np.asarray(buf)
[   51s] E       ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
[   51s] 
[   51s] ../../BUILDROOT/python-numcodecs-0.11.0-0.x86_64/usr/lib64/python3.8/site-packages/numcodecs/json.py:57: ValueError
[   51s] ____________________________ test_non_numpy_inputs _____________________________
[   51s] 
[   51s]     def test_non_numpy_inputs():
[   51s]         codec = MsgPack()
[   51s]         # numpy will infer a range of different shapes and dtypes for these inputs.
[   51s]         # Make sure that round-tripping through encode preserves this.
[   51s]         data = [
[   51s]             [0, 1],
[   51s]             [[0, 1], [2, 3]],
[   51s]             [[0], [1], [2, 3]],
[   51s]             [[[0, 0]], [[1, 1]], [[2, 3]]],
[   51s]             ["1"],
[   51s]             ["11", "11"],
[   51s]             ["11", "1", "1"],
[   51s]             [{}],
[   51s]             [{"key": "value"}, ["list", "of", "strings"]],
[   51s]             [b"1"],
[   51s]             [b"11", b"11"],
[   51s]             [b"11", b"1", b"1"],
[   51s]             [{b"key": b"value"}, [b"list", b"of", b"strings"]],
[   51s]         ]
[   51s]         for input_data in data:
[   51s] >           actual = codec.decode(codec.encode(input_data))
[   51s] 
[   51s] ../../BUILDROOT/python-numcodecs-0.11.0-0.x86_64/usr/lib64/python3.8/site-packages/numcodecs/tests/test_msgpacks.py:75: 
[   51s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[   51s] 
[   51s] self = MsgPack(raw=False, use_bin_type=True, use_single_float=False)
[   51s] buf = [[0], [1], [2, 3]]
[   51s] 
[   51s]     def encode(self, buf):
[   51s] >       buf = np.asarray(buf)
[   51s] E       ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
[   51s] 
[   51s] ../../BUILDROOT/python-numcodecs-0.11.0-0.x86_64/usr/lib64/python3.8/site-packages/numcodecs/msgpacks.py:55: ValueError

martindurant · 2023-01-12T16:25:20Z

I don't know if my diff would fix that. If yes, let's include it! I wonder what changed that this is showing up now but not before.

bnavigator · 2023-01-12T16:28:51Z

Unfortunately not. I tried to apply the patch just now. The encode part is already changed from your diff, but the failing np.asarray(buf) is before the changed lines.

bnavigator · 2023-01-12T16:29:24Z

The difference is the update to numpy 1.24.1.
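
For context, numpy 1.24 raises a ValueError when asked to build an array from a ragged nested sequence, where earlier releases only warned and produced an object array. One possible way to keep the old behaviour is to fall back to dtype=object, sketched below. The helper name to_array is hypothetical, and this is not necessarily how the linked pull request implements it.

```python
import numpy as np


def to_array(buf):
    """Convert buf to an array, using dtype=object for ragged input."""
    try:
        return np.asarray(buf)
    except ValueError:
        # numpy >= 1.24 refuses to infer a shape for ragged sequences,
        # so request an object array explicitly.
        return np.asarray(buf, dtype=object)


ragged = to_array([[0], [1], [2, 3]])
regular = to_array([[0, 1], [2, 3]])
print(ragged.dtype, ragged.shape)
print(regular.dtype, regular.shape)
```

Regular input still gets a numeric dtype and its full shape; only input that numpy cannot make rectangular ends up as a 1-D object array of lists.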

This was referenced Jan 12, 2023

No ragged sequence for conversion to numpy arrays #416

Closed

Enforce dtype=object for incompatible numpy array conversion #417

Merged

martindurant closed this as completed in #417 Jan 13, 2023

QuLogic mentioned this issue Jun 28, 2023

2 tests fail on =numcodecs-0.11.0 with ValueError: setting an array element with a sequence. #436

Closed

rabernat mentioned this issue Jul 11, 2023

UserWarning / NotImplementedError fsspec/kerchunk#339

Closed
