
Object encoding #212

Merged (25 commits, Dec 6, 2017)
Conversation

@alimanfoo (Member) commented Nov 30, 2017

This PR is a minimal attempt to improve safety and usability for object arrays, to resolve #208.

Object arrays without a filter that encodes object data are inherently unsafe and can lead to nasty errors or segfaults. Therefore, we want to make it very hard (if not impossible) for the user to create an object array without the necessary encoding.

This PR attempts to achieve this while maintaining backwards compatibility with previously-created data. I.e., any data previously created with an object dtype and with an appropriate filter such as numcodecs.MsgPack will continue to work with the new code, and no changes to the storage specification are required.

This is achieved by introducing a new object_codec argument to all array creation functions. If the array dtype is object, then a codec must be provided via the object_codec argument, otherwise a ValueError is raised.

To achieve compatibility, the object_codec is treated as a filter and inserted as the first filter in the chain.
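
For example (a minimal sketch, assuming numcodecs.MsgPack as the object codec):

import zarr
from numcodecs import MsgPack, Zlib

z = zarr.empty(100, dtype=object, object_codec=MsgPack(), compressor=Zlib())
# the object codec is stored as the first filter, so previously-created data
# and the storage specification are unaffected
print(z.filters)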

Users could still attempt to subvert this system in various ways, e.g., by providing an inappropriate codec for the object_codec argument, or by manually wiping the filters. To prevent segfaults, a runtime check has been added to prevent object arrays getting passed down to the compressor during data storage.

A notebook provides some examples.

TODO:

  • update tutorial
  • soften errors to warnings
  • release notes
  • migrate codec benchmarks

@alimanfoo added the "in progress" label Nov 30, 2017
@alimanfoo (Member Author)

Note that the scope of this PR doesn't include any new codecs for object arrays. However, it does provide a simple mechanism by which different object codecs could be developed and used. I.e., any new object codec could be used as the value of the object_codec argument, as long as it implements the numcodecs.abc.Codec API (which is the contract for all filters).
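
For illustration, a minimal hypothetical object codec written against that API, here simply piggybacking on JSON from the standard library (the class name and codec_id are made up; registration via numcodecs.registry is optional):

import json

import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class MyJSON(Codec):
    """Hypothetical object codec: serialises an object array as JSON text.

    Assumes the array items are JSON-serialisable (strings, numbers, lists, dicts).
    """

    codec_id = 'my-json'

    def encode(self, buf):
        # buf is an object array; serialise its items to UTF-8 bytes
        items = np.asarray(buf, dtype=object).tolist()
        return json.dumps(items).encode('utf-8')

    def decode(self, buf, out=None):
        items = json.loads(bytes(buf).decode('utf-8'))
        dec = np.empty(len(items), dtype=object)
        for i, item in enumerate(items):
            dec[i] = item
        if out is not None:
            out[...] = dec
            return out
        return dec

    def get_config(self):
        return {'id': self.codec_id}

    @classmethod
    def from_config(cls, config):
        return cls()


register_codec(MyJSON)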

@alimanfoo added this to the v2.2 milestone Nov 30, 2017
@jakirkham (Member)

Looks like the failure is due to object_codec not being set for some things in docs/tutorial.rst.

@alimanfoo (Member Author) commented Dec 1, 2017 via email

@jakirkham (Member)

No worries. Wasn't sure if that was why or not.

Generally I like the idea.

Was debating whether there was a sensible default we could use here, but I think requiring it to be user-specified in the near term is a good idea. It will help us collect data about what people expect from a default and thus allow us to update it later, should we come to any new conclusions.

Will give it a more detailed pass in a moment.

@alimanfoo (Member Author) commented Dec 1, 2017 via email

@jakirkham (Member)

Would be interesting to see what sort of data was used. JSON data being heterogeneous and all. ;)

Somehow I'm not surprised Pickle does worst.

Also this seems relevant.

    # setting a single item
    pass
elif is_scalar(value, self._dtype):
    # setting a scalar value
Member:

Just to be clear, this is more of a general change to how selection works and not something that is specific to object type data (though it likely will be used for that case), correct?

Member Author:

This is necessary to allow storing a list as an item in an object array. I.e.,

import zarr

z = zarr.empty(10, dtype=object)
z[0] = ['foo', 'bar', 'baz']

...needs this change to work.

Member:

Sorry. Meant to put that comment near the () piece above. The scalar part is clear. :)

Member Author:

No worries. Yes it is a general thing, although it is really an optimisation for the case of setting a single item, and shouldn't change behaviour for anything other than allowing more flexibility in what kinds of items can be set in object arrays.

zarr/storage.py Outdated
    else:
        filters_config.insert(0, object_codec.get_config())
elif object_codec is not None:
    raise ValueError('an object_codec is only needed for object arrays')
Member:

Maybe this should be softened to a warning? While it certainly isn't sensible for a user to set object_codec when there is no object dtype, there isn't any problem caused by adding it; we just ignore it in this case anyway.

We could also explicitly set object_codec to None if we make this a warning, in case it is accidentally used later.
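
A rough sketch of that suggestion (a hypothetical helper for illustration only, not the actual zarr code):

import warnings


def normalize_object_codec(dtype, object_codec):
    # warn and drop the codec when it is provided for a non-object dtype,
    # so it cannot accidentally be used later
    if object_codec is not None and dtype != object:
        warnings.warn('an object_codec is only needed for object arrays')
        object_codec = None
    return object_codec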

Member Author:

Yeah, I did wonder about that. In the end I figured it was six of one, half a dozen of the other. Also I was being slightly lazy; I hate testing warnings, having had awful PY2/PY3 compatibility issues.

Member:

So, as a friendly tip about testing warnings: normally I have the warnings module convert them to errors and then test them in exactly the same way as exceptions. The last half of this SO comment is useful for showing this. When we switch to pytest, it has a nicer way of doing this.
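
A minimal sketch of both approaches (the do_something function here is hypothetical and just emits a warning):

import warnings

import pytest


def do_something():
    # hypothetical function under test that issues a UserWarning
    warnings.warn('an object_codec is only needed for object arrays')


def test_warning_stdlib():
    # escalate warnings to errors so they can be asserted like exceptions
    with warnings.catch_warnings():
        warnings.simplefilter('error')
        try:
            do_something()
        except UserWarning:
            pass
        else:
            raise AssertionError('expected a UserWarning')


def test_warning_pytest():
    # pytest's context manager asserts the warning directly
    with pytest.warns(UserWarning):
        do_something()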

Member Author:

Thanks, I will soften this to a warning. I have some other tests that test warnings by converting them to errors, but I could never get the catch_warnings context manager to work in PY2, so there are a bunch of ugly calls to warnings.resetwarnings(). Maybe it's time to start using pytest, if that solution works in PY2 and PY3.

@jakirkham (Member)

Looks pretty good.

Left one comment asking for clarification (no change needed) and had a question about softening an error to a warning.

This is definitely an improvement. Though would note that if users were storing object arrays before in a different (but valid) way, this does break things for them. That said, there is an equally valid argument that storing object data was pretty broken before and this is necessary to fix it (not to mention the change of behavior is quite trivial to adopt). If we take the cautious route, we would want to soften all errors related to a lack of an object_codec to warnings, but make them DeprecationWarnings or FutureWarnings with some specific version when things will break/raise errors.
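
For example, the cautious route might look something like this (an illustrative sketch only; the message and warning category are not taken from the actual code):

import warnings


def check_object_codec(dtype, object_codec):
    # warn rather than raise, signalling that this will become an error later
    if dtype == object and object_codec is None:
        warnings.warn('missing object_codec for an object array; this will '
                      'raise an error in a future release',
                      FutureWarning, stacklevel=2)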

@alimanfoo (Member Author)

I've made a couple of further changes to guard against possible segfaults during read. Updated notebook has examples.

Notebook also has some benchmarks for different object codecs. I compared MsgPack, JSON, Pickle, Categorize, and fastparquet's UTF8 encoding. I looked at encode speed, decode speed, uncompressed size and compressed size. Dataset is a million short unicode strings, chosen randomly from a relatively short list. MsgPack is the best all-round performer. On this dataset pickle is pretty good too. JSON is quick to encode but a bit slower to decode. FastParquet is very quick to encode but not so fast to decode. Categorize is fastest to decode, and also gives smallest size.

@alimanfoo (Member Author)

I've softened the errors to warnings, so API backwards compatibility should now be preserved, which I think is the right thing to do. Note that an error is still raised if there is no object_codec and no filters, i.e., when we can be sure an object codec has not been provided. The notebook has been updated.
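
Roughly, the behaviour described is as follows (a sketch assuming MsgPack; exact messages may differ):

import zarr
from numcodecs import MsgPack

# recommended: explicit object_codec, no warning
z1 = zarr.empty(10, dtype=object, object_codec=MsgPack())

# previously-created style: an object filter but no object_codec -> warning,
# since we cannot verify that the filters include an object codec
z2 = zarr.empty(10, dtype=object, filters=[MsgPack()])

# no object_codec and no filters -> error, since we can be sure no object
# codec has been provided
z3 = zarr.empty(10, dtype=object)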

@alimanfoo mentioned this pull request Dec 4, 2017
@jakirkham (Member)

Thanks for updating this.

Have given it a cursory glance and it LGTM. May try to give it a closer look in the morning. That said, no need to hold up merging on my account.

Agree that switching to warnings seems safe. Also raising when there are no filters makes sense to me as well. Saw that Numcodecs has started raising in cases where object arrays were being passed, which is good. All of this seems pretty sensible and should encourage users to make better choices with object data.

@alimanfoo (Member Author)

Thanks @jakirkham


Not all codecs support encoding of all object types. The
:class:`numcodecs.Pickle` codec is the most flexible, supporting encoding any type
of Python object. However, if you are sharing data with anyone other than yourself then
Pickle is not recommended as it is a potential security risk, because malicious code can
be embedded within pickled data. The JSON and MsgPack codecs support encoding of unicode
strings, lists and dictionaries, with MsgPack usually faster for both encoding and

Member:

nit: yourself then -> yourself, then

nit: risk, because -> risk. This is because

nit: dictionaries, with MsgPack -> dictionaries. MsgPack is
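
For example, usage along these lines (a sketch assuming MsgPack; JSON would work similarly for strings, lists and dictionaries):

import zarr
from numcodecs import MsgPack

z = zarr.empty(10, dtype=object, object_codec=MsgPack())
z[0] = 'a unicode string'
z[1] = ['a', 'list', 'of', 'strings']
z[2] = {'a': 'dictionary'}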

@@ -15,7 +15,7 @@ mccabe==0.6.1
 monotonic==1.3
 msgpack-python==0.4.8
 nose==1.3.7
-numcodecs==0.2.1
+numcodecs==0.4.1
Member:

Should we bump the requirement in setup.py as well?

Member Author:

Yes, good catch, done.
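
Presumably the setup.py change is along these lines (hypothetical; the actual dependency specification may differ):

from setuptools import setup

setup(
    name='zarr',
    install_requires=[
        'numcodecs>=0.4.1',  # bumped to match the requirements file
    ],
)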

# needed for PY2/PY3 consistent behaviour
if PY2:  # pragma: py3 no cover
    warnings.resetwarnings()
    warnings.simplefilter('always')
Member:

Just out of curiosity, is this still necessary with pytest or was this a holdover from before switching to pytest?

Member Author:

Yes, I find this is still necessary; some tests fail under PY2 without it.

@jakirkham (Member)

Added some minor comments above, primarily about wording in the docs, plus some questions related to the requirement changes. All just suggestions; nothing crucial.

@alimanfoo (Member Author)

Thanks @jakirkham for the review, much appreciated. I've pushed a few minor changes; I think this is ready to go pending CI. Codec benchmarks have been migrated to numcodecs.

@jakirkham (Member)

np. Agreed. LGTM.

Development

Successfully merging this pull request may close these issues.

Filters for object dtype
2 participants