
Object encoding #212

Merged (25 commits, Dec 6, 2017)
Conversation

@alimanfoo (Member) commented Nov 30, 2017

This PR is a minimal attempt to improve safety and usability for object arrays, to resolve #208.

Object arrays without a filter that encodes object data are inherently unsafe and can lead to nasty errors or segfaults. Therefore, we want to make it very hard (if not impossible) for the user to create an object array without the necessary encoding.

This PR attempts to achieve this while maintaining backwards compatibility with previously-created data. I.e., any data previously created with an object dtype and with an appropriate filter such as numcodecs.MsgPack will continue to work with the new code, and no changes to the storage specification are required.

This is achieved by introducing a new object_codec argument to all array creation functions. If the array dtype is object, then a codec must be provided via the object_codec argument, otherwise a ValueError is raised.

To achieve compatibility, the object_codec is treated as a filter and inserted as the first filter in the chain.
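
For example (a minimal sketch, assuming numcodecs.MsgPack as the object codec):

import zarr
from numcodecs import MsgPack, Zlib

z = zarr.empty(100, dtype=object, object_codec=MsgPack(), compressor=Zlib())
# the object codec is stored as the first filter, so previously-created data
# and the storage specification are unaffected
print(z.filters)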

Users could still attempt to subvert this system in various ways, e.g., by providing an inappropriate codec for the object_codec argument, or by manually wiping the filters. To prevent segfaults, a runtime check has been added to prevent object arrays getting passed down to the compressor during data storage.

A notebook provides some examples.

TODO:

  • update tutorial
  • soften errors to warnings
  • release notes
  • migrate codec benchmarks

@alimanfoo added the "in progress" label Nov 30, 2017
@alimanfoo (Member Author)

Note that the scope of this PR doesn't include any new codecs for object arrays. However, it does provide a simple mechanism by which different object codecs could be developed and used. I.e., any new object codec could be used as the value of the object_codec argument, as long as it implements the numcodecs.abc.Codec API (which is the contract for all filters).
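
For illustration, a minimal hypothetical object codec written against that API, here simply piggybacking on JSON from the standard library (the class name and codec_id are made up; registration via numcodecs.registry is optional):

import json

import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class MyJSON(Codec):
    """Hypothetical object codec: serialises an object array as JSON text.

    Assumes the array items are JSON-serialisable (strings, numbers, lists, dicts).
    """

    codec_id = 'my-json'

    def encode(self, buf):
        # buf is an object array; serialise its items to UTF-8 bytes
        items = np.asarray(buf, dtype=object).tolist()
        return json.dumps(items).encode('utf-8')

    def decode(self, buf, out=None):
        items = json.loads(bytes(buf).decode('utf-8'))
        dec = np.empty(len(items), dtype=object)
        for i, item in enumerate(items):
            dec[i] = item
        if out is not None:
            out[...] = dec
            return out
        return dec

    def get_config(self):
        return {'id': self.codec_id}

    @classmethod
    def from_config(cls, config):
        return cls()


register_codec(MyJSON)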

@alimanfoo added this to the v2.2 milestone Nov 30, 2017
@jakirkham (Member)

Looks like the failure is due to object_codec not being set for some things in docs/tutorial.rst.

@alimanfoo (Member Author) commented Dec 1, 2017 via email

@jakirkham (Member)

No worries. Wasn't sure if that was why or not.

Generally I like the idea.

Was debating whether there was a sensible default we could use here, but I think requiring it to be user-specified in the near term is a good idea. It will help us collect data about what people expect from a default and thus allow us to update it later, should we come to any new conclusions.

Will give it a more detailed pass in a moment.

@alimanfoo (Member Author) commented Dec 1, 2017 via email

@jakirkham (Member)

Would be interesting to see what sort of data was used. JSON data being heterogeneous and all. ;)

Somehow I'm not surprised Pickle does worst.

Also this seems relevant.

    # setting a single item
    pass
elif is_scalar(value, self._dtype):
    # setting a scalar value
Member:

Just to be clear, this is more of a general change to how selection works and not something that is specific to object type data (though it likely will be used for that case), correct?

Member Author:

This is necessary to allow storing a list as an item in an object array. I.e.,

import zarr

z = zarr.empty(10, dtype=object)
z[0] = ['foo', 'bar', 'baz']

...needs this change to work.

Member:

Sorry. Meant to put that comment near the () piece above. The scalar part is clear. :)

Member Author:

No worries. Yes it is a general thing, although it is really an optimisation for the case of setting a single item, and shouldn't change behaviour for anything other than allowing more flexibility in what kinds of items can be set in object arrays.

zarr/storage.py Outdated
    else:
        filters_config.insert(0, object_codec.get_config())
elif object_codec is not None:
    raise ValueError('an object_codec is only needed for object arrays')
Member:

Maybe this should be softened to a warning? While it certainly isn't sensible for a user to set object_codec when there is no object dtype, there isn't any problem caused by adding it; we just ignore it in this case anyway.

We could also explicitly set object_codec to None if we make this a warning, in case it is accidentally used later.
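
A rough sketch of that suggestion (a hypothetical helper for illustration only, not the actual zarr code):

import warnings


def normalize_object_codec(dtype, object_codec):
    # warn and drop the codec when it is provided for a non-object dtype,
    # so it cannot accidentally be used later
    if object_codec is not None and dtype != object:
        warnings.warn('an object_codec is only needed for object arrays')
        object_codec = None
    return object_codec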

Member Author:

Yeah, I did wonder about that. In the end I figured it was six of one, half a dozen of the other. Also I was being slightly lazy; I hate testing warnings, having had awful PY2/PY3 compatibility issues.

Member:

So, as a friendly tip about testing warnings: normally I have the warnings module convert them to errors and then test them in exactly the same way as exceptions. The last half of this SO comment is useful for showing this. When we switch to pytest, it has a nicer way of doing this.
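
A minimal sketch of both approaches (the do_something function here is hypothetical and just emits a warning):

import warnings

import pytest


def do_something():
    # hypothetical function under test that issues a UserWarning
    warnings.warn('an object_codec is only needed for object arrays')


def test_warning_stdlib():
    # escalate warnings to errors so they can be asserted like exceptions
    with warnings.catch_warnings():
        warnings.simplefilter('error')
        try:
            do_something()
        except UserWarning:
            pass
        else:
            raise AssertionError('expected a UserWarning')


def test_warning_pytest():
    # pytest's context manager asserts the warning directly
    with pytest.warns(UserWarning):
        do_something()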

Member Author:

Thanks, I will soften this to a warning. I have some other tests that test warnings by converting them to errors, but I could never get the catch_warnings context manager to work in PY2, so there are a bunch of ugly calls to warnings.resetwarnings(). Maybe it's time to start using pytest, if that solution works in PY2 and PY3.

@jakirkham (Member)

Looks pretty good.

Left one comment asking for clarification (no change needed) and had a question about softening an error to a warning.

This is definitely an improvement. Though would note that if users were storing object arrays before in a different (but valid) way, this does break things for them. That said, there is an equally valid argument that storing object data was pretty broken before and this is necessary to fix it (not to mention the change of behavior is quite trivial to adopt). If we take the cautious route, we would want to soften all errors related to a lack of an object_codec to warnings, but make them DeprecationWarnings or FutureWarnings with some specific version when things will break/raise errors.
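
For example, the cautious route might look something like this (an illustrative sketch only; the message and warning category are not taken from the actual code):

import warnings


def check_object_codec(dtype, object_codec):
    # warn rather than raise, signalling that this will become an error later
    if dtype == object and object_codec is None:
        warnings.warn('missing object_codec for an object array; this will '
                      'raise an error in a future release',
                      FutureWarning, stacklevel=2)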

@alimanfoo (Member Author)

I've made a couple of further changes to guard against possible segfaults during read. Updated notebook has examples.

Notebook also has some benchmarks for different object codecs. I compared MsgPack, JSON, Pickle, Categorize, and fastparquet's UTF8 encoding. I looked at encode speed, decode speed, uncompressed size and compressed size. Dataset is a million short unicode strings, chosen randomly from a relatively short list. MsgPack is the best all-round performer. On this dataset pickle is pretty good too. JSON is quick to encode but a bit slower to decode. FastParquet is very quick to encode but not so fast to decode. Categorize is fastest to decode, and also gives smallest size.

@alimanfoo (Member Author)

I've softened the errors to warnings, so API backwards compatibility should now be preserved, which I think is the right thing to do. Note that an error is still raised if there is no object_codec and no filters, i.e., when we can be sure an object codec has not been provided. The notebook has been updated.
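
Roughly, the behaviour described is as follows (a sketch assuming MsgPack; exact messages may differ):

import zarr
from numcodecs import MsgPack

# recommended: explicit object_codec, no warning
z1 = zarr.empty(10, dtype=object, object_codec=MsgPack())

# previously-created style: an object filter but no object_codec -> warning,
# since we cannot verify that the filters include an object codec
z2 = zarr.empty(10, dtype=object, filters=[MsgPack()])

# no object_codec and no filters -> error, since we can be sure no object
# codec has been provided
z3 = zarr.empty(10, dtype=object)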

@alimanfoo mentioned this pull request Dec 4, 2017
@jakirkham (Member)

Thanks for updating this.

Have given it a cursory glance and it LGTM. May try to give it a closer look in the morning. That said, no need to hold up merging on my account.

Agree that switching to warnings seems safe. Also raising when there are no filters makes sense to me as well. Saw that Numcodecs has started raising in cases where object arrays were being passed, which is good. All of this seems pretty sensible and should encourage users to make better choices with object data.

@alimanfoo (Member Author)

Thanks @jakirkham


Not all codecs support encoding of all object types. The
:class:`numcodecs.Pickle` codec is the most flexible, supporting encoding any type
of Python object. However, if you are sharing data with anyone other than yourself then
Pickle is not recommended as it is a potential security risk, because malicious code can
be embedded within pickled data. The JSON and MsgPack codecs support encoding of unicode
strings, lists and dictionaries, with MsgPack usually faster for both encoding and

Member:

nit: yourself then -> yourself, then

nit: risk, because -> risk. This is because

nit: dictionaries, with MsgPack -> dictionaries. MsgPack is
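
For example, usage along these lines (a sketch assuming MsgPack; JSON would work similarly for strings, lists and dictionaries):

import zarr
from numcodecs import MsgPack

z = zarr.empty(10, dtype=object, object_codec=MsgPack())
z[0] = 'a unicode string'
z[1] = ['a', 'list', 'of', 'strings']
z[2] = {'a': 'dictionary'}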

@@ -15,7 +15,7 @@ mccabe==0.6.1
 monotonic==1.3
 msgpack-python==0.4.8
 nose==1.3.7
-numcodecs==0.2.1
+numcodecs==0.4.1
Member:

Should we bump the requirement in setup.py as well?

Member Author:

Yes, good catch, done.
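
Presumably the setup.py change is along these lines (hypothetical; the actual dependency specification may differ):

from setuptools import setup

setup(
    name='zarr',
    install_requires=[
        'numcodecs>=0.4.1',  # bumped to match the requirements file
    ],
)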

# needed for PY2/PY3 consistent behaviour
if PY2:  # pragma: py3 no cover
    warnings.resetwarnings()
    warnings.simplefilter('always')
Member:

Just out of curiosity, is this still necessary with pytest or was this a holdover from before switching to pytest?

Member Author:

Yes, I find this is still necessary; some tests fail under PY2 without it.

@jakirkham (Member)

Added some minor comments above, primarily about wording in the docs, plus some questions related to the requirement changes. All just suggestions; nothing crucial.

@alimanfoo (Member Author)

Thanks @jakirkham for the review, much appreciated. I've pushed a few minor changes; I think this is ready to go pending CI. Codec benchmarks have been migrated to numcodecs.

@jakirkham (Member)

np. Agreed. LGTM.

Development

Successfully merging this pull request may close these issues.

Filters for object dtype
2 participants