
Serializable #96

Merged
merged 15 commits into thunder-project:master from serializable on Feb 6, 2015

Conversation

broxtronix
Contributor

Note: This pull request came out of a face-to-face discussion between @freeman-lab, @poolio, @logang, and @broxtronix.

This pull request introduces a new @serializable decorator that can decorate any class to make it easy to store that class in a human-readable JSON format and then recall it and recover the original object instance. Class instances that are wrapped in this decorator gain the serialize() method, and the class also gains a deserialize() static method that can automatically "pickle" and "unpickle" a wide variety of objects like so:

import datetime

@serializable
class Visitor(object):
    def __init__(self, ip_addr=None, agent=None, referrer=None):
        self.ip = ip_addr
        self.ua = agent
        self.referrer = referrer
        self.time = datetime.datetime.now()

orig_visitor = Visitor('192.168', 'UA-1', 'http://www.google.com')

# serialize the object
pickled_visitor = orig_visitor.serialize()

# restore the object
recov_visitor = Visitor.deserialize(pickled_visitor)

Note that this decorator is NOT designed to provide generalized pickling capabilities. Rather, it is designed to make it very easy to convert small classes containing model properties to a human- and machine-parsable format for later analysis or visualization. A few classes under consideration for such decoration include the Transformation class for image alignment and the Source classes for source extraction.

A key feature of the @serializable decorator is that it can "pickle" data types that are not normally supported by Python's stock JSON dump() and load() methods. Supported datatypes include: list, set, tuple, namedtuple, OrderedDict, datetime objects, numpy ndarrays, and dicts with non-string (but still data-like) keys. Serialization is performed recursively, and descends into the standard Python container types (list, dict, tuple, set).
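For example, here is an illustrative round trip through the standard json module (the Model class and its fields are made up for this sketch, not taken from the PR):

import datetime
import json

from numpy import array
from thunder.utils.decorators import serializable

@serializable
class Model(object):
    def __init__(self):
        self.weights = array([1.0, 2.0, 3.0])       # numpy ndarray
        self.created = datetime.datetime.now()      # datetime object
        self.labels = {0: 'background', 1: 'cell'}  # dict with non-string keys

model = Model()
asDict = model.serialize()             # plain dict of JSON-compatible values
asJson = json.dumps(asDict)            # the stock json module handles the rest
restored = Model.deserialize(json.loads(asJson))
assert (restored.weights == model.weights).all()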


numpy-storage: choose one of ['auto', 'ascii', 'base64'] (default: auto)

Use the 'nmupy_storage' argument to select whether numpy arrays
Contributor

s/nmupy/numpy

@poolio
Contributor

poolio commented Feb 3, 2015

This looks great! Will definitely make working with transforms and sources much easier. Some minor comments:

  • This code violates a lot of the style guide, like the lack of camelCase and use of np
  • I wonder whether it would be better to use type(obj) == foo instead of isinstance? If someone passes in a subclass of, say, tuple or np.array, then it would still try to pickle it.
  • Can you add tests for the case when you try to serialize an object with an unsupported datatype?

@industrial-sloth
Contributor

Agreed, this looks totally sweet. I mainly wanted to just say thanks for adding the motivation, usage info, and links out to the relevant blog posts in the PR and code comments - in about five minutes these took me from "huh?" to "oh!".

I'll get on this branch and see whether I can find any loose ends, but overall this is looking like a great addition.

…yle guidelines and added an additional unit test.
@broxtronix
Contributor Author

Alrighty, I addressed your comments for the most part, @poolio. The code now conforms to the style guide and I added the additional unit test.

Regarding the type(obj) == foo vs isinstance question: after thinking about it, I think that isinstance is the right thing here. If users were to subclass tuple or np.array, then we may very well need to add special serialization logic for that new subtype. I think it's actually a good thing that we are checking specifically to make sure that we are serializing an object instance of a particular type, rather than serializing anything that has a particular class somewhere as a base class. Does that seem reasonable to you?

One other thing that came up as I was making these changes is a question about the style guide regarding numpy imports. It recommends always using from numpy import foo, bar, baz rather than import numpy or import numpy as np to access those symbols directly via the namespace. I imagine that the thinking behind this guideline was to reduce the proliferation of np.<whatever> prefixes throughout the codebase. However, the numpy namespace is particularly full of short, rather generic symbol names that sometimes conflict with built-in Python symbols like sum, bool, or abs, so there are times when referring to the symbol via its namespace is an important disambiguation. Even for symbols that don't collide with Python internal object names, I find it a little more readable/clear to specify np.uint8 rather than uint8, since there are many potential software libraries that could be offering up such a symbol, and it's nice to know at a glance which uint8 you are using!
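To make the collision hazard concrete (a standalone illustration, not from the PR):

nested = [[1, 2], [3, 4]]
print(sum(nested, []))    # builtin sum concatenates the lists: [1, 2, 3, 4]

from numpy import sum     # silently shadows the builtin
print(sum(nested))        # numpy's sum adds all the elements: 10

import numpy as np
print(np.sum(nested))     # 10 -- and it's clear at a glance which sum this is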

On a more practical note, for mathematically intensive code that uses lots and lots of numpy symbols, I find it a little inconvenient to hop back and forth between my code and the import statement as I use new symbols. I guess this last concern can be mitigated by importing the namespace during development, and then "cleaning up" the code (as I have done here) before checking it in. So maybe that's a fine approach.

Anyway, I guess I would propose that the guideline about import numpy be relaxed a tiny bit, since explicitly specifying the namespace can be useful (and even necessary) in some cases. But no big deal. Just thought I'd bring it up for a quick discussion! I'm curious what you guys all think.

@industrial-sloth
Contributor

Alright, so a couple comments / questions:

  • one limitation right now is that this doesn't handle objects defined with __slots__, just the regular __dict__ attributes:
@serializable
class Foo(object):
    __slots__ = ['bar']
foo = Foo()
foo.bar = 'a'
foo.serialize()  # boom

I think this could be addressed pretty simply on the serialization side by first checking whether hasattr(self.wrapped, "__slots__") around line 155, and recursively serializing the slots instead of the dict if so. On the deserialization side, one might be able to do something similar by checking for slots on the cls variable? I'm less sure of how the deserialization end would look...
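Something like this rough standalone sketch of the serialization side (the gatherState helper name is made up, not the PR's actual code):

def gatherState(wrapped):
    # Pull attribute values from the declared slots when present,
    # falling back to the usual __dict__ otherwise.
    if hasattr(wrapped, "__slots__"):
        return dict((name, getattr(wrapped, name))
                    for name in wrapped.__slots__
                    if hasattr(wrapped, name))
    return dict(wrapped.__dict__)

class Point(object):
    __slots__ = ['x', 'y']

p = Point()
p.x, p.y = 1, 2
print(gatherState(p))  # {'x': 1, 'y': 2} (key order may vary)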

Anyway, I think it would be reasonable to not support __slots__ right out of the gate on this, but in that case it'd be nice to have a comment to that effect somewhere in the docstring.

So, in short, I'd like to either support serializing / deserializing objects with slots, or else to say up front that it isn't supported (yet). :)

  • Various data types that aren't supported:
  • frozenset. This is probably my top ask, since supporting it should be at least as easy as the current support for regular sets, and it's a little weird to have an immutable, hashable type that can't be serialized while its mutable cousin is completely serializable. Also, I do actually use frozensets reasonably often (see the sketch after this list).
  • slice. Understood that this isn't the major intended use case, but the new Blocks type uses slices all over the place, and if we'd ever want to be able to serialize those objects using this framework slice support would have to happen. Not clear that this has to be added right now though. Again, a comment saying that we can't support serializing slices is probably fine.
  • complex. Now I'm just getting pedantic. :) Notably, serializing complex-valued numpy arrays seems to work just fine. If this is any more difficult than adding complex as a supported type on line 117, it's probably not worth it, and a comment to that effect would suffice.
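For reference, the frozenset handler mentioned above would presumably mirror the existing set logic almost exactly. A rough sketch, with assumed key and helper names rather than the PR's actual code:

def encodeFrozenset(obj, serializeRecursively=lambda v: v):
    # Tag the value with a "py/frozenset" key, just as the other container
    # types are tagged with their "py/"-prefixed names.
    if type(obj) == frozenset:
        return {"py/frozenset": [serializeRecursively(v) for v in obj]}

def decodeFrozenset(dct, deserializeRecursively=lambda v: v):
    if "py/frozenset" in dct:
        return frozenset(deserializeRecursively(v) for v in dct["py/frozenset"])

print(encodeFrozenset(frozenset([1, 2, 3])))  # {'py/frozenset': [1, 2, 3]} (order may vary)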

Overall again this is looking really nice!

@industrial-sloth
Contributor

Also there appears to be some weirdness with the namedtuple support:

FooTuple = serializable(namedtuple('FooTuple', 'bar'))
foo = FooTuple(bar="baz")
foo.serialize() # returns: {'py/collections.OrderedDict': [['bar', 'baz']]} ... OrderedDict?
foo2 = FooTuple.deserialize(foo.serialize())  # boom: TypeError: __new__() takes exactly 2 arguments (1 given)

My syntax may be off here, but this was the first seemingly sensible usage with a namedtuple that I could come up with. Looking at the original blog posts, this looks like some kind of unexpected interaction between the serialization code and the decorator wrapper?

@broxtronix
Contributor Author

Thanks for the excellent comments, @industrial-sloth !! I just checked in two more code changes. One adds support for complex (easy enough!), and the other one adds support for classes that use __slots__ instead of a __dict__. This seems like a very important use case, since I imagine we will often be serializing numerous small objects (e.g. one Transformation per image) where we will want to capitalize on the space-savings offered by __slots__.

Adding support for frozenset and slice should also be straightforward, but I have not done so yet. Would it be ok to add those a little down the road, once those use cases arise? (Sounds like they may arise soon, but it should be easy to add those capabilities when they do!)

@industrial-sloth
Contributor

Hey @broxtronix - ya, sure thing. Your fix for __slots__ certainly appears sensible - and the new test you added is passing, so that's nice. :) I have no immediate need to convert frozensets or slices to JSON, so I'm fine with leaving that out for now. I may go ahead and add support for those two myself once this PR goes in.

Any thoughts on my namedtuple issue? Worth noting that with these most recent changes, I'm now seeing different behavior on my previous test:

from collections import namedtuple
from thunder.utils.decorators import serializable
FooTuple = serializable(namedtuple('FooTuple', 'bar'))
foo = FooTuple(bar="baz")
foo.serialize()  # now returns {}

Again, I'm not 100% certain whether this particular syntax would in fact be expected to do what I want it to do (give me back a JSON-serializable namedtuple), but it seems like a reasonable attempt at least. :)

@broxtronix
Contributor Author

Ok, @industrial-sloth, I did find a bug in the namedtuple and OrderedDict handling. Those were something of a separate issue, though.

It turns out that the reason your code example is not working is a more subtle issue... the namedtuple appears to have both slots and a dict for some reason, and therefore also seems to have a __new__ constructor that takes two arguments instead of one. I'm not sure exactly what is going on here, but clearly this is a more complicated Python object than the norm.
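A quick inspection (on the Python 2.7 series, which is what applies here) shows what's going on, and why the earlier serialize() call produced an OrderedDict:

from collections import namedtuple

FooTuple = namedtuple('FooTuple', 'bar')
foo = FooTuple(bar="baz")

print(FooTuple.__slots__)  # () -- namedtuples declare empty __slots__
print(foo.__dict__)        # OrderedDict([('bar', 'baz')]) -- Python 2.7's
                           # namedtuple defines __dict__ as a property over
                           # _asdict(), so serializing __dict__ yields an
                           # OrderedDict rather than the tuple itself
try:
    FooTuple()
except TypeError as e:
    print(e)               # __new__() takes exactly 2 arguments (1 given):
                           # __new__ requires the field values up front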

I suspect we could probably figure out how to handle this case, but for now I have added a comment to the docstring and an exception if you try to wrap such a class. And, although you cannot wrap a namedtuple directly, you can still use a namedtuple inside of an @serializable class like this:

from collections import namedtuple

@serializable
class Foo(object):
    def __init__(self):
        self.nt = namedtuple('FooTuple', 'bar')

foo = Foo()
foo.nt.bar = "baz"

testJson = foo.serialize()
foo2 = Foo.deserialize(testJson)
assert(foo.nt.bar == foo2.nt.bar)

So, @serializable is not fully general (although it could be made more so with time), but it does work in the current intended use case, where we have an object containing a bunch of different types of state information. Hopefully the error checking and documentation are enough to steer people in the right usage direction!

@poolio
Contributor

poolio commented Feb 4, 2015

I agree that we should only serialize things that are a particular type, and not attempt to serialize subclasses of serializable types. I think type(obj) == Foo is only true if obj is an instance of Foo, and is false if obj is an instance of a subclass of Foo. Whereas isinstance will be true for all subclasses. That's normally the desired behavior, but here I think we really want type instead of isinstance.
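A quick standalone illustration of the difference:

class MyTuple(tuple):
    pass

t = MyTuple((1, 2))
print(isinstance(t, tuple))  # True  -- isinstance matches subclasses too
print(type(t) == tuple)      # False -- the exact type check excludes subclasses
print(type(t) == MyTuple)    # True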

…(foo) = bar to get more specificity when identifying specific classes rather than accidentally pickling a sub-class of a serializable base class.
@broxtronix
Contributor Author

You are totally right, @poolio. I had it backwards. Using type(foo) == bar is the more specific check: it ensures you are dealing with exactly that class, and not a subclass. I switched it over!

Serialize this object to a python dictionary that can easily be converted
to/from JSON using Python's standard JSON library.

Arguments
Member

We've been using a slightly different format for arguments (it's the same format used by numpy and scipy documentation), should look like this:

Parameters
----------
numpyStorage : {'auto', 'ascii', 'base64' }, optional, default 'auto'
    Use to select whether numpy arrays...

Returns
-------
The object encoded as a...

Will add this to the style guide!

and hasattr(obj, "_asdict") \
and callable(obj._asdict)

def serializable(cls):
Member

Two lines before def

@freeman-lab
Member

@broxtronix thanks for putting together an awesome patch! And thanks @industrial-sloth and @poolio for the detailed review. I left a bunch of comments but I think all are about formatting and many are small nits (some of which are things that should be in the style guide but aren't, so glad to have caught those).

I think the coverage is fine for this PR, agree with @industrial-sloth that there are some additional types we could add support for as we move forward.

Lastly, regarding np vs numpy, I'm with you on the potential inconvenience during development, the reason is exactly what you said: making the codebase itself more readable. I think many of us use the np. format while developing and then clean up, which is fine, if not ideal. I'm less worried about namespace issues because we rely quite a lot on local namespacing anyway, but it's a fair point, and can certainly keep the discussion open as we move forward.

@freeman-lab
Member

(BTW, a lot of the style nits just come from opening this up in PyCharm with PEP8 checking turned on, we generally stick to all those conventions with the exception of camelCasing)

@broxtronix
Contributor Author

Alrighty, I made the changes you suggested, @freeman-lab. I must say that this process of code review has definitely produced much clearer code, some nice new features, and we even caught a few critical bugs. Thanks to all for taking the time to give it a thorough look-through!

Regarding your comment about "types" above, @freeman-lab, are you thinking we might want to change py/collections.namedtuple to collections.namedtuple? I was thinking about that myself. I could go either way, but I figured it probably wasn't too big a deal, and the additional specificity might be nice to help make it clear that this JSON data originally came from Python. With the longer class names it is clear in either case, but it felt a little funny to use shorter names like dict, tuple, set, etc. But totally happy to change that if you guys think it would be better the other way around.

And thanks for your thoughts on the numpy style guidelines. I'll bring it up again if that ends up feeling too onerous, but I suspect I can get in the habit of eliminating the np.* detritus from my code. :)

Onward and upward!

@freeman-lab
Member

Awesome, that's great to hear! Regarding the types, I initially liked the idea of the names being the exact proper name as defined in python, but then realized we'd be stuck on the ascii versus base64 case, because those are not actual python types. If there was some alternate way to embed that choice for that particular case, then we'd be using proper class names for everything, but if that's too tricky, then I say we keep it as you have it.

@broxtronix
Contributor Author

Well, I went in and changed how the encoding format information is stored for numpy arrays. It was relatively easy to make this a part of the dictionary rather than part of the key, and I think it is probably a bit more elegant this way.
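To illustrate the change (the key and field names here are assumptions, not the PR's exact output), the storage choice moves out of the type key and into the value dictionary:

# Before (illustrative): the encoding was baked into the key itself
before = {"py/numpy.ndarray.base64": "...encoded bytes..."}

# After (illustrative): the key names only the python type, and the
# encoding choice lives inside the dictionary alongside the data
after = {"py/numpy.ndarray": {"encoding": "base64",
                              "dtype": "float64",
                              "data": "...encoded bytes..."}}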

I then went to try to remove all of the "py/" prefixes to harmonize the encoding keys with the actual underlying python types. However, I got halfway through and then undid this change because of this little bit of code starting at line 207:

                # First, check to see if this is an encoded entry
                dataKey = None
                if type(dct) == dict:
                    filteredKeys = filter(lambda x: "py/" in x, dct.keys())

                    # If there is just one key with a "py/" prefix, that is the dataKey!
                    if len(filteredKeys) == 1:
                        dataKey = filteredKeys[0]

                # If no data key is found, we assume the data needs no further decoding.
                if dataKey is None:
                    return dct

This is a bit of heuristic code to see if this particular portion of the serialized object was encoded using one of our special encoding methods, or if it was one of the more basic types that didn't require any encoding. It uses the "py/" prefix as a marker to make the determination about whether to proceed with any special decoding logic.

How about this: with my latest commits, all of the names after the "py/" prefix match their python counterparts (now that we have done away with the 'ascii' and 'base64' suffixes). It's easy enough to strip away the "py/" prefix to get back the full classname if needed. I can think of a couple of other ways to record (1) the fact that this object has special encoding, and (2) the original class that was encoded, which don't use the "py/" prefix and instead place the class name in the encoded dictionary. However, I think I prefer the JSON produced by the current code to other things I can think of off the top of my head. Totally open to suggestions, though. Let me know what you think.

(We could also change the prefix from "py/" to something else if you want!)

@industrial-sloth
Contributor

Yeah the "py/" magic string prefix makes sense to me. The concern of course is whether this sort of string would happen to come up in some regular data, and then be misinterpreted as a deserialization data type specification, but "py/" seems sufficiently unusual that it'd be unlikely to happen often in practice.

That being said, the lambda checking "py/" in x (line 210) should probably be x.startswith("py/"), reducing the likelihood that we'd have an accidental collision with a string that happened to have "py/" somewhere inside it.
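The substring check really can misfire on innocent-looking keys; a standalone illustration:

print("py/" in "happy/days")           # True  -- matches in the middle
print("happy/days".startswith("py/"))  # False -- the prefix check is safer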

Now it may be that owing to the particular way in which keys and values get nested we could guarantee that at levels of the hierarchy where we might expect to see a "py/" datatype, all other values might be variable names, where "/" isn't a legal character, and this whole point about the possibility of accidental collisions might be moot. But that's not immediately obvious to me one way or the other.

Finally (I promise!), it might be worth considering at some point in the future, not in this PR, making the "py/" prefix configurable or set to a constant or something, so that if we do end up having worries about "py/" showing up in data, we could change this magic string to "&^RKL&^:KG^R" or whatever and lower our collision odds still further. :)

@freeman-lab
Member

Thanks @broxtronix ! I agree with you and @industrial-sloth , the py prefix is useful and reasonable, and it's great that now everything coming after it is a proper python class, that was my main concern with the earlier version. With these last two commits everything LGTM and I'll merge it in unless anyone else has further requests / suggestions.

Parameters
----------

numpy-storage: {'auto', 'ascii', 'base64' }, optional, default 'auto'
Contributor

s/numpy-storage/numpyStorage

@poolio
Contributor

poolio commented Feb 5, 2015

Tiny typo, but otherwise LGTM. Excited to put this to use!

@broxtronix
Contributor Author

Your suggestion to use startswith() is a great idea, @industrial-sloth. I made that change and fixed Ben's typo.

And regarding a configurable prefix, that is definitely something we could add a short ways down the road, along with serialization support for a few new object types. I have a suspicion that some other feature requests will pop up once this decorator starts getting some use in the code base. There is a pretty wide variety of "data model" objects we might want to be able to serialize into JSON, so definitely keep the discussion going as you all discover ways in which this decorator is and is not working out for you.

I think that addresses everything, but let me know if you guys want any final tweaks. Thanks again for all the superb feedback!!

@freeman-lab
Member

Awesome! LGTM, merging it in. Great patch @broxtronix! And thanks all for the review.

freeman-lab added a commit that referenced this pull request Feb 6, 2015
@freeman-lab freeman-lab merged commit 94c3978 into thunder-project:master Feb 6, 2015
@broxtronix broxtronix deleted the serializable branch April 6, 2015 20:46