Object encoding #212
Conversation
Note that the scope of this PR doesn't include any new codecs for object arrays. However, it does provide a simple mechanism by which different object codecs could be developed and used. I.e., any new object codec could be used as the value of the object_codec argument.
Looks like the failure is due to object_codec not being set for some things in docs/tutorial.rst.
Yep, I need to update the tutorial. I was just holding off updating all the docs for the moment in case anyone had any strong reactions against this proposal.
No worries. Wasn't sure if that was why or not. Generally I like the idea. Was debating about whether there was a sensible default we could use here, but I think requiring it to be user specified near term is a good idea. Will help us collect data about what sort of things people expect from a default and thus allow us to update it later should we come to any new conclusions. Will give it a more detailed pass in a moment.
Thanks. Yeah, I toyed with adding a default for convenience, but then just thought in this case it doesn't hurt to make the user aware that there are choices here, and this is an area of active development, so new object codecs may come along and we're not yet sure which is best for performance etc.

BTW I've just added a JSON codec to numcodecs, which gives a third option for encoding object arrays in addition to Pickle and MsgPack. Interestingly, performance is actually quite good, on a par with MsgPack, which I was surprised by. Pickle is worst. I'll post a bit more benchmark data later.
Would be interesting to see what sort of data was used. JSON data being heterogeneous and all. ;) Somehow I'm not surprised Pickle does worst. Also this seems relevant.
    # setting a single item
    pass
elif is_scalar(value, self._dtype):
    # setting a scalar value
Just to be clear, this is more of a general change to how selection works and not something that is specific to object-type data (though it likely will be used for that case), correct?
This is necessary to allow storing a list as an item in an object array. I.e.,
z = zarr.empty(10, dtype=object)
z[0] = ['foo', 'bar', 'baz']
...needs this change to work.
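For comparison, plain NumPy object arrays already accept this kind of single-item assignment, which is the behaviour the change brings to Zarr's setitem path (a minimal sketch, assuming NumPy is available):

```python
import numpy as np

# NumPy object arrays store arbitrary Python objects per element;
# assigning a list to a single item stores the list itself rather
# than attempting to broadcast it.
a = np.empty(3, dtype=object)
a[0] = ['foo', 'bar', 'baz']  # stored as one element
a[1] = {'k': 1}
print(a[0])  # prints ['foo', 'bar', 'baz']
```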
Sorry. Meant to put that comment near the () piece above. The scalar part is clear. :)
No worries. Yes it is a general thing, although it is really an optimisation for the case of setting a single item, and shouldn't change behaviour for anything other than allowing more flexibility in what kinds of items can be set in object arrays.
zarr/storage.py
Outdated
    else:
        filters_config.insert(0, object_codec.get_config())
elif object_codec is not None:
    raise ValueError('an object_codec is only needed for object arrays')
Maybe this should be softened to a warning? While it certainly isn't sensible for a user to set object_codec when there is no object dtype, there isn't any problem caused by adding it. We just ignore it in this case anyway. We could also explicitly set object_codec to None, if we make this a warning, should it accidentally be used later.
Yeah, I did wonder about that. In the end I figured six of one, half a dozen of the other. Also I was being slightly lazy; I hate testing warnings, having had awful PY2/PY3 compatibility issues.
So, as a friendly tip about testing warnings: normally I have the warnings module convert them to errors and then test them in exactly the same way. The last half of this SO comment is useful for showing this. When we switch to pytest, they have a nicer way of doing this.
Thanks, I will soften to a warning. I have some other tests that test warnings by converting to errors, but I could never get the catch_warnings context manager to work in PY2, so there is a bunch of ugly calls to warnings.resetwarnings(). Maybe this is the time to start using pytest, if that solution works in PY2 and PY3.
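The convert-warnings-to-errors technique discussed above can be sketched with only the standard library (the create_array stand-in below is hypothetical, not Zarr's actual function):

```python
import warnings

def create_array(dtype, object_codec=None):
    # hypothetical stand-in for the real creation function, which warns
    # when an object_codec is supplied for a non-object dtype
    if dtype is not object and object_codec is not None:
        warnings.warn('an object_codec is only needed for object arrays')

# Promote warnings to errors so a test can assert on them like exceptions.
with warnings.catch_warnings():
    warnings.simplefilter('error')
    try:
        create_array(int, object_codec='msgpack')
        warned = False
    except UserWarning:
        warned = True

print(warned)  # prints True
```

pytest's pytest.warns context manager offers a tidier way to express the same assertion.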
Looks pretty good. Asked one clarification comment (no change needed) and had a question about softening an error to a warning. This is definitely an improvement, though I would note that if users were storing object arrays before in a different (but valid) way, this does break things for them. That said, there is an equally valid argument that storing
I've made a couple of further changes to guard against possible segfaults during read. Updated notebook has examples.

Notebook also has some benchmarks for different object codecs. I compared MsgPack, JSON, Pickle, Categorize, and fastparquet's UTF8 encoding. I looked at encode speed, decode speed, uncompressed size and compressed size. Dataset is a million short unicode strings, chosen randomly from a relatively short list. MsgPack is the best all-round performer. On this dataset pickle is pretty good too. JSON is quick to encode but a bit slower to decode. FastParquet is very quick to encode but not so fast to decode. Categorize is fastest to decode, and also gives smallest size.
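The shape of such a benchmark can be sketched with stdlib encoders only (pickle and json stand in for the numcodecs codecs; the dataset mimics the one described, many short strings drawn from a small vocabulary):

```python
import json
import pickle
import random
import time

vocab = ['foo', 'bar', 'baz', 'quux']
data = [random.choice(vocab) for _ in range(100_000)]

def bench(name, encode, decode):
    # time encode and decode separately, and report encoded size
    t0 = time.perf_counter()
    buf = encode(data)
    t1 = time.perf_counter()
    out = decode(buf)
    t2 = time.perf_counter()
    assert out == data  # round-trip fidelity check
    print('%s: encode %.3fs, decode %.3fs, %d bytes'
          % (name, t1 - t0, t2 - t1, len(buf)))

bench('pickle', pickle.dumps, pickle.loads)
bench('json',
      lambda d: json.dumps(d).encode('utf-8'),
      lambda b: json.loads(b.decode('utf-8')))
```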
I've softened the errors to warnings, so API backwards compatibility should now be preserved, which I think is the right thing to do. Note that an error is still raised if there is no filter that encodes object data.
Thanks for updating this. Have given it a cursory glance and it LGTM. May try to give it a closer look in the morning. That said, no need to hold up merging on my account. Agree that switching to warnings seems safe. Also raising when there are no filters makes sense to me as well. Saw that Numcodecs has started raising in cases where object arrays were being passed, which is good. All of this seems pretty sensible and should encourage users to make better choices with object data.
Thanks @jakirkham
docs/tutorial.rst
Outdated
Not all codecs support encoding of all object types. The :class:`numcodecs.Pickle` codec is the most flexible, supporting encoding any type of Python object. However, if you are sharing data with anyone other than yourself then
nit: "yourself then" -> "yourself, then"
docs/tutorial.rst
Outdated
Not all codecs support encoding of all object types. The :class:`numcodecs.Pickle` codec is the most flexible, supporting encoding any type of Python object. However, if you are sharing data with anyone other than yourself then Pickle is not recommended as it is a potential security risk, because malicious code can
nit: "risk, because" -> "risk. This is because"
docs/tutorial.rst
Outdated
of Python object. However, if you are sharing data with anyone other than yourself then Pickle is not recommended as it is a potential security risk, because malicious code can be embedded within pickled data. The JSON and MsgPack codecs support encoding of unicode strings, lists and dictionaries, with MsgPack usually faster for both encoding and
nit: "dictionaries, with MsgPack" -> "dictionaries. MsgPack is"
@@ -15,7 +15,7 @@ mccabe==0.6.1
 monotonic==1.3
 msgpack-python==0.4.8
 nose==1.3.7
-numcodecs==0.2.1
+numcodecs==0.4.1
Should we bump the requirement in setup.py as well?
Yes, good catch, done.
# needed for PY2/PY3 consistent behaviour
if PY2:  # pragma: py3 no cover
    warnings.resetwarnings()
    warnings.simplefilter('always')
Just out of curiosity, is this still necessary with pytest, or was this a holdover from before switching to pytest?
Yes, I find this is still necessary; some tests fail under PY2 without it.
Added some minor comments above, primarily about wording in the docs. Also some questions related to requirement changes. All just suggestions. Nothing crucial.
Thanks @jakirkham for the review, much appreciated. I've pushed a few minor changes; I think this is ready to go pending CI. Codec benchmarks have been migrated to numcodecs.
np. Agreed. LGTM.
This PR is a minimal attempt to improve safety and usability for object arrays, to resolve #208.

Object arrays without a filter that encodes object data are inherently unsafe and can lead to nasty errors or segfaults. Therefore, we want to make it very hard (if not impossible) for the user to create an object array without the necessary encoding.

This PR attempts to achieve this while maintaining backwards compatibility with previously-created data. I.e., any data previously created with an object dtype and with an appropriate filter such as numcodecs.MsgPack will continue to work with the new code, and no changes to the storage specification are required.

This is achieved by introducing a new object_codec argument to all array creation functions. If the array dtype is object, then a codec must be provided via the object_codec argument, otherwise a ValueError is raised.

To achieve compatibility, the object_codec is treated as a filter and inserted as the first filter in the chain.

Users could still attempt to subvert this system in various ways, e.g., by providing an inappropriate codec for the object_codec argument, or by manually wiping the filters. To prevent segfaults, a runtime check has been added to prevent object arrays getting passed down to the compressor during data storage.

A notebook provides some examples.
TODO: