flexible codecs cannot handle compound dtypes containing objects #333

Closed
martindurant opened this issue Jul 6, 2022 · 6 comments · Fixed by #417

Comments

@martindurant
Member

This functionality is missing, though one could reasonably expect it. If a zarr array has a compound dtype (i.e., records) and any of its fields is of object type (e.g., strings), the data cannot be round-tripped, even though JSON/msgpack/pickle are capable of converting the array.

Minimal, reproducible code sample, a copy-pastable example if possible

>>> import numpy as np
>>> import numcodecs
>>> import zarr
>>> a = np.array([('aaa', 1, 4.2),
...               ('bbb', 2, 8.4),
...               ('ccc', 3, 12.6)],
...              dtype=[('foo', 'O'), ('bar', 'i4'), ('baz', 'f8')])
>>> z = zarr.array(a, object_codec=numcodecs.JSON(), fill_value=None)
>>> z["foo"]
ValueError: setting an array element with a sequence.

(Without fill_value, this errors earlier due to the use of np.zeros to guess a fill value; with an appropriate fill_value=("", 0, 0.), it fails at array creation too.)

Problem description

JSON and similar codecs store only the dtype.str, which is an opaque "Vxx" (void) type in these cases. The field names and field types are lost, so a suitable empty array cannot be made at load time.
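
To illustrate the point, here is a small sketch (not taken from numcodecs itself) comparing `dtype.str` with `str(dtype)` for the compound dtype used in the example above. The variable names are only for illustration, and the exact byte count in the "V" string depends on the platform.

```python
import ast

import numpy as np

# Same compound dtype as in the reproducer above
dt = np.dtype([("foo", "O"), ("bar", "i4"), ("baz", "f8")])

# dtype.str collapses the record into an opaque void type of the same size
short = dt.str
print(short)                    # a "|V<nbytes>" string
print(np.dtype(short).names)    # no field names survive

# str(dtype) keeps the field names and types as a Python-literal list
full = str(dt)
print(full)

# The literal can be parsed back into an equivalent dtype
restored = np.dtype(ast.literal_eval(full))
print(restored == dt)
```

This only covers dtypes whose string form is a plain list of field tuples; dtypes with explicit offsets or alignment print differently and are not considered here.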

The following fixes this for JSON, but looks ugly.

--- a/numcodecs/json.py
+++ b/numcodecs/json.py
@@ -1,3 +1,4 @@
+import ast
 import json as _json
 import textwrap

@@ -56,14 +57,18 @@ class JSON(Codec):
     def encode(self, buf):
         buf = np.asarray(buf)
         items = buf.tolist()
-        items.append(buf.dtype.str)
+        items.append(str(buf.dtype))
         items.append(buf.shape)
         return self._encoder.encode(items).encode(self._text_encoding)

     def decode(self, buf, out=None):
         items = self._decoder.decode(ensure_text(buf, self._text_encoding))
-        dec = np.empty(items[-1], dtype=items[-2])
-        dec[:] = items[:-2]
+        if "[" in items[-2]:
+            dec = np.empty(items[-1], dtype=ast.literal_eval(items[-2]))
+            dec[:] = [tuple(_) for _ in items[:-2]]
+        else:
+            dec = np.empty(items[-1], dtype=items[-2])
+            dec[:] = items[:-2]
         if out is not None:
             np.copyto(out, dec)
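
For anyone who wants to try the idea without patching numcodecs, the following is a standalone sketch of the same approach using only the standard `json` module. The function names `encode_records` and `decode_records` are hypothetical and not part of numcodecs, and the sketch assumes a 1-D array, as in the example above.

```python
import ast
import json

import numpy as np


def encode_records(arr):
    """Serialise a 1-D array to JSON, appending str(dtype) and shape."""
    arr = np.asarray(arr)
    items = arr.tolist()
    items.append(str(arr.dtype))   # full dtype description, not dtype.str
    items.append(arr.shape)
    return json.dumps(items).encode("utf-8")


def decode_records(buf):
    """Rebuild the array; compound dtypes are detected by a '[' in the dtype text."""
    items = json.loads(buf.decode("utf-8"))
    shape, dtype_text = items[-1], items[-2]
    if "[" in dtype_text:
        # Compound dtype: parse the field list and turn each row back into a tuple
        dec = np.empty(shape, dtype=ast.literal_eval(dtype_text))
        dec[:] = [tuple(row) for row in items[:-2]]
    else:
        dec = np.empty(shape, dtype=dtype_text)
        dec[:] = items[:-2]
    return dec


a = np.array([("aaa", 1, 4.2), ("bbb", 2, 8.4), ("ccc", 3, 12.6)],
             dtype=[("foo", "O"), ("bar", "i4"), ("baz", "f8")])
b = decode_records(encode_records(a))
print(b["foo"])
```

The rows have to be converted to tuples because JSON turns them into lists, and NumPy only accepts tuples when assigning records.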

Version and installation information

Please provide the following:

  • numcodecs.__version__ 0.10.0
  • Version of Python interpreter 3.8.8
  • Operating system: Mac
  • How NumCodecs was installed: pip from source

@bnavigator
Contributor

Any update here?

@martindurant
Member Author

None from me. I have a plausible fix, above, which could be added now, but I was hoping someone would come up with something cleaner.

@bnavigator
Contributor

I am getting this from the test suite:

[   51s] ____________________________ test_non_numpy_inputs _____________________________
[   51s] 
[   51s]     def test_non_numpy_inputs():
[   51s]         # numpy will infer a range of different shapes and dtypes for these inputs.
[   51s]         # Make sure that round-tripping through encode preserves this.
[   51s]         data = [
[   51s]             [0, 1],
[   51s]             [[0, 1], [2, 3]],
[   51s]             [[0], [1], [2, 3]],
[   51s]             [[[0, 0]], [[1, 1]], [[2, 3]]],
[   51s]             ["1"],
[   51s]             ["11", "11"],
[   51s]             ["11", "1", "1"],
[   51s]             [{}],
[   51s]             [{"key": "value"}, ["list", "of", "strings"]],
[   51s]         ]
[   51s]         for input_data in data:
[   51s]             for codec in codecs:
[   51s] >               output_data = codec.decode(codec.encode(input_data))
[   51s] 
[   51s] ../../BUILDROOT/python-numcodecs-0.11.0-0.x86_64/usr/lib64/python3.8/site-packages/numcodecs/tests/test_json.py:72: 
[   51s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[   51s] 
[   51s] self = JSON(encoding='utf-8', allow_nan=True, check_circular=True, ensure_ascii=True,
[   51s]      indent=None, separators=(',', ':'), skipkeys=False, sort_keys=True,
[   51s]      strict=True)
[   51s] buf = [[0], [1], [2, 3]]
[   51s] 
[   51s]     def encode(self, buf):
[   51s] >       buf = np.asarray(buf)
[   51s] E       ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
[   51s] 
[   51s] ../../BUILDROOT/python-numcodecs-0.11.0-0.x86_64/usr/lib64/python3.8/site-packages/numcodecs/json.py:57: ValueError
[   51s] ____________________________ test_non_numpy_inputs _____________________________
[   51s] 
[   51s]     def test_non_numpy_inputs():
[   51s]         codec = MsgPack()
[   51s]         # numpy will infer a range of different shapes and dtypes for these inputs.
[   51s]         # Make sure that round-tripping through encode preserves this.
[   51s]         data = [
[   51s]             [0, 1],
[   51s]             [[0, 1], [2, 3]],
[   51s]             [[0], [1], [2, 3]],
[   51s]             [[[0, 0]], [[1, 1]], [[2, 3]]],
[   51s]             ["1"],
[   51s]             ["11", "11"],
[   51s]             ["11", "1", "1"],
[   51s]             [{}],
[   51s]             [{"key": "value"}, ["list", "of", "strings"]],
[   51s]             [b"1"],
[   51s]             [b"11", b"11"],
[   51s]             [b"11", b"1", b"1"],
[   51s]             [{b"key": b"value"}, [b"list", b"of", b"strings"]],
[   51s]         ]
[   51s]         for input_data in data:
[   51s] >           actual = codec.decode(codec.encode(input_data))
[   51s] 
[   51s] ../../BUILDROOT/python-numcodecs-0.11.0-0.x86_64/usr/lib64/python3.8/site-packages/numcodecs/tests/test_msgpacks.py:75: 
[   51s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[   51s] 
[   51s] self = MsgPack(raw=False, use_bin_type=True, use_single_float=False)
[   51s] buf = [[0], [1], [2, 3]]
[   51s] 
[   51s]     def encode(self, buf):
[   51s] >       buf = np.asarray(buf)
[   51s] E       ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
[   51s] 
[   51s] ../../BUILDROOT/python-numcodecs-0.11.0-0.x86_64/usr/lib64/python3.8/site-packages/numcodecs/msgpacks.py:55: ValueError

@martindurant
Member Author

I don't know if my diff would fix that. If yes, let's include it! I wonder what changed that this is showing up now but not before.

@bnavigator
Contributor

Unfortunately not. I tried to apply the patch just now. The encode part is already changed from your diff, but the failing np.asarray(buf) is before the changed lines.

@bnavigator
Contributor

The difference is the update to numpy 1.24.1.
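
For context, the failing call can be reproduced outside the test suite. The sketch below shows one possible workaround, which is an assumption on my part and not necessarily what numcodecs ended up doing: fall back to an explicit object dtype when NumPy refuses to infer a shape. The helper name `to_array` is hypothetical.

```python
import numpy as np


def to_array(buf):
    """Convert buf to an array, falling back to dtype=object for ragged input."""
    try:
        return np.asarray(buf)
    except ValueError:
        # NumPy 1.24+ raises here for ragged nested sequences;
        # asking for dtype=object explicitly is still allowed.
        return np.asarray(buf, dtype=object)


ragged = [[0], [1], [2, 3]]       # the input from the failing tests
arr = to_array(ragged)
print(arr.dtype, arr.shape)

regular = to_array([[0, 1], [2, 3]])
print(regular.dtype, regular.shape)
```

Regular inputs still get an inferred numeric dtype and shape, so only the ragged cases change behaviour.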
