Serializable #96
Conversation
numpy-storage: choose one of ['auto', 'ascii', 'base64'] (default: auto)

Use the 'nmupy_storage' argument to select whether numpy arrays
s/nmupy/numpy
This looks great! Will definitely make working with transforms and sources much easier. Some minor comments:
Agreed, this looks totally sweet. I mainly wanted to just say thanks for adding the motivation, usage info, and links out to the relevant blog posts in the PR and code comments - in about five minutes these took me from "huh?" to "oh!". I'll get on this branch and see whether I can find any loose ends, but overall this is looking like a great addition.
…yle guidelines and added an additional unit test.
Alrighty, I addressed your comments for the most part, @poolio. The code now conforms to the style guide and I added the additional unit test. One other thing that came up as I was making these changes is a question about the style guide regarding numpy imports: it recommends always importing specific symbols rather than the whole namespace. On a more practical note, for mathematically intensive code that uses lots and lots of numpy symbols, I find it a little inconvenient to hop back and forth between my code and the import statement as I use new symbols. I guess this last concern can be mitigated by importing the namespace during development, and then "cleaning up" the code (as I have done here) before checking it in. So maybe that's a fine approach. Anyway, I guess I would propose that the guideline about numpy imports be revisited if it proves too onerous in practice.
Alright, so a couple comments / questions:
@serializable
class Foo(object):
    __slots__ = ['bar']

foo = Foo()
foo.bar = 'a'
foo.serialize()  # boom

I think this could be addressed pretty simply on the serialization side by first checking whether the object actually has a __dict__ before assuming one. Anyway, I think it would be reasonable to not support __slots__ at first. So, in short, I'd like to either support serializing / deserializing objects with slots, or else to say up front that it isn't supported (yet). :)
Overall again this is looking really nice!
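The check suggested above could look something like the following sketch (the helper name is hypothetical, not code from the PR): prefer the instance __dict__ when it exists, and fall back to __slots__ otherwise.

```python
def gather_state(obj):
    # Prefer the instance __dict__ when it exists; otherwise fall back to
    # __slots__. (A sketch of the idea discussed above, not the PR's code.)
    if hasattr(obj, "__dict__"):
        return dict(obj.__dict__)
    if hasattr(obj, "__slots__"):
        return {name: getattr(obj, name)
                for name in obj.__slots__ if hasattr(obj, name)}
    raise TypeError("object has neither __dict__ nor __slots__")

class Slotted(object):
    __slots__ = ['bar']

s = Slotted()
s.bar = 'a'
print(gather_state(s))  # {'bar': 'a'}
```

The hasattr guard on each slot name handles slots that were declared but never assigned.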
Also there appears to be some weirdness with namedtuples:

FooTuple = serializable(namedtuple('FooTuple', 'bar'))
foo = FooTuple(bar="baz")
foo.serialize()  # returns: {'py/collections.OrderedDict': [['bar', 'baz']]} ... OrderedDict?
foo2 = FooTuple.deserialize(foo.serialize())  # boom: TypeError: __new__() takes exactly 2 arguments (1 given)

My syntax may be off here, but this was the first seemingly sensible usage with a namedtuple that I could come up with. Looking at the original blog posts, this looks like some kind of unexpected interaction between the serialization code and the annotation wrapper?
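For what it's worth, the usual heuristic for spotting a namedtuple instance (a tuple that also carries an _asdict method, the same check that appears later in this diff) can be sketched like this:

```python
from collections import namedtuple

def is_namedtuple(obj):
    # namedtuple instances are tuples that also expose an _asdict() method
    return (isinstance(obj, tuple)
            and hasattr(obj, "_asdict")
            and callable(obj._asdict))

Foo = namedtuple('Foo', 'bar')
print(is_namedtuple(Foo(bar='baz')))  # True
print(is_namedtuple(('baz',)))        # False
```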
…r when encountering an unrecognized encoding data label. Also added support for complex datatype.
Thanks for the excellent comments, @industrial-sloth !! I just checked in two more code changes. One adds support for the complex datatype, and the other raises an error when an unrecognized encoding data label is encountered. Adding support for __slots__ looks trickier, and I'm still digging into the namedtuple issue.
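One plausible shape for the complex encoding, written in the spirit of the PR's "py/" key convention (the key name and helper functions here are guesses for illustration, not necessarily what the code emits):

```python
def encode_complex(z):
    # store real and imaginary parts as a JSON-able list
    return {"py/complex": [z.real, z.imag]}

def decode_complex(d):
    real, imag = d["py/complex"]
    return complex(real, imag)

print(decode_complex(encode_complex(3 + 4j)))  # (3+4j)
```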
Hey @broxtronix - ya, sure thing. Your fix looks good. Any thoughts on my namedtuple example?

from collections import namedtuple
from thunder.utils.decorators import serializable

FooTuple = serializable(namedtuple('FooTuple', 'bar'))
foo = FooTuple(bar="baz")
foo.serialize()  # now returns {}

Again, I'm not 100% certain whether this particular syntax would in fact be expected to do what I want it to do (give me back a JSON-serializable namedtuple), but it seems like a reasonable attempt at least. :)
… both. Added unit test for namedtuple.
Ok, @industrial-sloth, I did find a bug in the handling of namedtuple and OrderedDict. Those were something of a separate issue, though. It turns out that the reason your code example is not working is a more subtle issue... the namedtuple appears to have both slots and a dict for some reason, and therefore also seems to have a new constructor that takes two arguments instead of one. I'm not sure exactly what is going on here, but clearly this is a more complicated Python object than the norm. I suspect we could probably figure out how to handle this case, but for now I have added a comment to the docstring and an exception if you try to wrap such a class. And, although you cannot wrap a namedtuple directly, you can still use a namedtuple inside of an @serializable class.
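The workaround described above can be sketched like this, using a no-op stand-in for the decorator so the snippet is self-contained (the real @serializable lives in thunder.utils.decorators, per the examples earlier in the thread):

```python
from collections import namedtuple

def serializable(cls):
    # no-op stand-in for thunder's @serializable, just to keep this runnable
    return cls

Point = namedtuple('Point', 'x y')

@serializable
class Model(object):
    def __init__(self):
        # holding a namedtuple as instance state is fine, even though
        # wrapping the namedtuple class itself now raises an exception
        self.location = Point(x=1, y=2)

print(Model().location)  # Point(x=1, y=2)
```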
So, although @serializable is not fully general (it could be made more so with time), it does work in the current intended use case, where we have an object containing a bunch of different types of state information. Hopefully the error checking and documentation are enough to steer people in the right usage direction!
I agree that we should only serialize things that are a particular type, and not attempt to serialize subclasses of serializable types. I think type(foo) == bar is the safer check here.
…(foo) = bar to get more specificity when identifying specific classes rather than accidentally pickling a sub-class of a serializable base class.
You are totally right, @poolio. I had it backwards. Using type(foo) == bar gives the specificity we want, so I've switched to that.
Serialize this object to a python dictionary that can easily be converted
to/from JSON using Python's standard JSON library.

Arguments
We've been using a slightly different format for arguments (it's the same format used by the numpy and scipy documentation); it should look like this:
Parameters
----------
numpyStorage : {'auto', 'ascii', 'base64' }, optional, default 'auto'
Use to select whether numpy arrays...
Returns
-------
The object encoded as a...
Will add this to the style guide!
and hasattr(obj, "_asdict") \
    and callable(obj._asdict)

def serializable(cls):
Two blank lines before def
@broxtronix thanks for putting together an awesome patch! And thanks @industrial-sloth and @poolio for the detailed review. I left a bunch of comments but I think all are about formatting and many are small nits (some of which are things that should be in the style guide but aren't, so glad to have caught those). I think the coverage is fine for this PR, agree with @industrial-sloth that there are some additional types we could add support for as we move forward. Lastly, I had a question about the type names used in the encoded output.
(BTW, a lot of the style nits just come from opening this up in PyCharm with PEP8 checking turned on, we generally stick to all those conventions with the exception of camelCasing)
…ests out into test_decorators.py
Alrighty, I made the changes you suggested, @freeman-lab. I must say that this process of code review has definitely produced much clearer code, some nice new features, and we even caught a few critical bugs. Thanks to all for taking the time to give it a thorough look-through! Regarding your comment about "types" above, @freeman-lab, are you thinking we might want to change the encoding names to match the underlying python type names? And thanks for your thoughts on the numpy style guidelines. I'll bring it up again if that ends up feeling too onerous, but I suspect I can get in the habit of cleaning up wholesale namespace imports before checking code in. Onward and upward!
Awesome, that's great to hear! Regarding the types, I initially liked the idea of the names being the exact proper name as defined in python, but then realized we'd be stuck as soon as an encoding needs extra information (like the 'ascii' and 'base64' storage variants).
Well, I went in and changed how the encoding format information is stored for numpy arrays. It was relatively easy to make this a part of the dictionary rather than part of the key. I think this is probably a bit more elegant this way. I then went to try to remove all of the "py/" prefixes to harmonize the encoding keys with the actual underlying python types. However, I got halfway through and then undid this change because of this little bit of the code starting at line 207
This is a bit of heuristic code to see if this particular portion of the serialized object was encoded using one of our special encoding methods, or if it was one of the more basic types that didn't require any encoding. It uses the "py/" prefix as a marker to make the determination about whether to proceed with any special decoding logic. How about this: with my latest commits, all of the names after the "py/" prefix match their python counterparts (now that we have done away with the 'ascii' and 'base64' suffixes). It's easy enough to strip away the "py/" prefix to get back the full classname if needed. I can think of a couple of other ways we could store (1) the fact that this object has special encoding, and (2) the original class that was encoded that don't use the "py/" prefix and instead place the class name in the encoded dictionary. However, I think I prefer the JSON that is produced using the current code to other things I can think of off the top of my head. Totally open to suggestions, though. Let me know what you think. (We could also change the prefix from "py/" to something else if you want!) |
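The heuristic described above might look roughly like the following sketch (an illustration of the idea, not the actual code at line 207):

```python
def is_special_encoding(value):
    # Specially-encoded objects appear as single-key dicts whose key
    # starts with the "py/" marker; everything else is a plain JSON type.
    return (isinstance(value, dict)
            and len(value) == 1
            and next(iter(value)).startswith("py/"))

print(is_special_encoding({"py/set": [1, 2, 3]}))  # True
print(is_special_encoding({"bar": "baz"}))         # False
```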
Yeah the "py/" magic string prefix makes sense to me. The concern of course is whether this sort of string would happen to come up in some regular data, and then be misinterpreted as a deserialization data type specification, but "py/" seems sufficiently unusual that it'd be unlikely to happen often in practice. That being said, the lambda checking Now it may be that owing to the particular way in which keys and values get nested we could guarantee that at levels of the hierarchy where we might expect to see a "py/" datatype, all other values might be variable names, where "/" isn't a legal character, and this whole point about the possibility of accidental collisions might be moot. But that's not immediately obvious to me one way or the other. Finally (I promise!), might be worth considering at some point in the future, not in this PR, making the "py/" prefix configurable or set to a constant or something, so that if we do end up having worries about "py/" showing up in data, we could change this magic string to "&^RKL&^:KG^R" or whatever and lower our collision odd still further. :) |
Thanks @broxtronix ! I agree with you and @industrial-sloth, the "py/" prefix seems sufficiently unusual for now; defining it as a constant, as suggested, would make it easy to change later.
Parameters
----------
numpy-storage: {'auto', 'ascii', 'base64'}, optional, default 'auto'
s/numpy-storage/numpyStorage
Tiny typo, but otherwise LGTM. Excited to put this to use!
Your suggestion to use a constant for the prefix sounds good. And regarding a configurable prefix, that is definitely something we could add in a short ways down the road, along with serialization support for a few new object types. I have a suspicion that some other feature requests will pop up once this decorator starts getting some use in the code base. There is a pretty wide variety of "data model" objects we might want to be able to serialize into JSON, so definitely keep the discussion going as you all discover ways in which this object is and is not working out for you. I think that addresses everything, but let me know if you guys want any final tweaks. Thanks again for all the superb feedback!!
Awesome! LGTM, merging it in. Great patch @broxtronix! And thanks all for the review. |
Note: This pull request came out of a face-to-face discussion between @freeman-lab , @poolio , @logang, and @broxtronix.
This pull request introduces a new @serializable decorator that can decorate any class to make it easy to store that class in a human readable JSON format and then recall it and recover the original object instance. Class instances that are wrapped in this decorator gain a serialize() method, and the class also gains a deserialize() static method; together these can automatically "pickle" and "unpickle" a wide variety of objects.
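As a self-contained illustration of the API shape being added, here is a toy stand-in that only handles plain JSON-able attributes (the real decorator in the PR covers far more types):

```python
import json

def serializable(cls):
    # toy stand-in sketching the serialize()/deserialize() API from the PR
    def serialize(self):
        return dict(self.__dict__)
    def deserialize(klass, d):
        obj = klass.__new__(klass)
        obj.__dict__.update(d)
        return obj
    cls.serialize = serialize
    cls.deserialize = classmethod(deserialize)
    return cls

@serializable
class Model(object):
    def __init__(self, name, score):
        self.name = name
        self.score = score

m = Model("foo", 0.5)
text = json.dumps(m.serialize())           # human-readable JSON
m2 = Model.deserialize(json.loads(text))   # recovered instance
print(m2.name, m2.score)
```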
Note that this decorator is NOT designed to provide generalized pickling capabilities. Rather, it is designed to make it very easy to convert small classes containing model properties to a human and machine parsable format for later analysis or visualization. A few classes under consideration for such decorating include the Transformation class for image alignment and the Source classes for source extraction.
A key feature of the @serializable decorator is that it can "pickle" data types that are not normally supported by Python's stock JSON dump() and load() methods. Supported datatypes include: list, set, tuple, namedtuple, OrderedDict, datetime objects, numpy ndarrays, and dicts with non-string (but still data) keys. Serialization is performed recursively, and descends into the standard python container types (list, dict, tuple, set).
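To see why dicts with non-string keys need special handling, note how Python's stock json module behaves on its own:

```python
import json

# int keys are silently coerced to strings on a round trip...
print(json.loads(json.dumps({1: "one"})))  # {'1': 'one'}

# ...and keys that can't be coerced are rejected outright
try:
    json.dumps({(1, 2): "pair"})
except TypeError:
    print("tuple keys raise TypeError")
```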