
Serializable #96

Merged
merged 15 commits into thunder-project:master from serializable on Feb 6, 2015

Conversation

broxtronix
Contributor

Note: This pull request came out of a face-to-face discussion between @freeman-lab, @poolio, @logang, and @broxtronix.

This pull request introduces a new @serializable decorator that can decorate any class to make it easy to store that class in a human-readable JSON format and then recall it and recover the original object instance. Class instances that are wrapped in this decorator gain the serialize() method, and the class also gains a deserialize() static method that can automatically "pickle" and "unpickle" a wide variety of objects like so:

import datetime

@serializable
class Visitor(object):
    def __init__(self, ip_addr=None, agent=None, referrer=None):
        self.ip = ip_addr
        self.ua = agent
        self.referrer = referrer
        self.time = datetime.datetime.now()

orig_visitor = Visitor('192.168', 'UA-1', 'http://www.google.com')

# serialize the object
pickled_visitor = orig_visitor.serialize()

# restore the object
recov_visitor = Visitor.deserialize(pickled_visitor)

Note that this decorator is NOT designed to provide generalized pickling capabilities. Rather, it is designed to make it very easy to convert small classes containing model properties to a human- and machine-parsable format for later analysis or visualization. A few classes under consideration for such decoration include the Transformation class for image alignment and the Source classes for source extraction.

A key feature of the @serializable decorator is that it can "pickle" data types that are not normally supported by Python's stock JSON dump() and load() methods. Supported datatypes include: list, set, tuple, namedtuple, OrderedDict, datetime objects, numpy ndarrays, and dicts with non-string (but still data-like) keys. Serialization is performed recursively, and descends into the standard Python container types (list, dict, tuple, set).
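For example, here is an illustrative round trip through the standard json module (the Model class and its fields are made up for this sketch, not taken from the PR):

import datetime
import json

from numpy import array
from thunder.utils.decorators import serializable

@serializable
class Model(object):
    def __init__(self):
        self.weights = array([1.0, 2.0, 3.0])       # numpy ndarray
        self.created = datetime.datetime.now()      # datetime object
        self.labels = {0: 'background', 1: 'cell'}  # dict with non-string keys

model = Model()
asDict = model.serialize()             # plain dict of JSON-compatible values
asJson = json.dumps(asDict)            # the stock json module handles the rest
restored = Model.deserialize(json.loads(asJson))
assert (restored.weights == model.weights).all()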


numpy-storage: choose one of ['auto', 'ascii', 'base64'] (default: auto)

Use the 'nmupy_storage' argument to select whether numpy arrays
Contributor

s/nmupy/numpy

@poolio
Contributor

poolio commented Feb 3, 2015

This looks great! Will definitely make working with transforms and sources much easier. Some minor comments:

  • This code violates a lot of the style guide, like the lack of camelCase and use of np
  • I wonder whether it would be better to use type(obj) == foo instead of isinstance? If someone passes in a subclass of, say, tuple or np.array, then it would still try to pickle it.
  • Can you add tests for the case when you try to serialize an object with an unsupported datatype?

@industrial-sloth
Contributor

Agreed, this looks totally sweet. I mainly wanted to just say thanks for adding the motivation, usage info, and links out to the relevant blog posts in the PR and code comments - in about five minutes these took me from "huh?" to "oh!".

I'll get on this branch and see whether I can find any loose ends, but overall this is looking like a great addition.

…yle guidelines and added an additional unit test.
@broxtronix
Contributor Author

Alrighty, I addressed your comments for the most part, @poolio. The code now conforms to the style guide and I added the additional unit test.

Regarding the type(obj) == foo vs isinstance question: after thinking about it, I think that isinstance is the right thing here. If users were to subclass tuple or np.array, then we may very well need to add special serialization logic for that new subtype. I think it's actually a good thing that we are checking specifically to make sure that we are serializing an object instance of a particular type, rather than serializing anything that has a particular class somewhere as a base class. Does that seem reasonable to you?

One other thing that came up as I was making these changes is a question about the style guide regarding numpy imports. It recommends always using from numpy import foo, bar, baz rather than import numpy or import numpy as np to access those symbols directly via the namespace. I imagine that the thinking behind this guideline was to reduce the proliferation of np.<whatever> prefixes throughout the codebase. However, the numpy namespace is particularly full of short, rather generic symbol names that sometimes conflict with built-in Python symbols like sum, bool, or abs, so there are times when referring to the symbol via its namespace is an important disambiguation. Even for symbols that don't collide with Python internal object names, I find it a little more readable/clear to specify np.uint8 rather than uint8, since there are many potential software libraries that could be offering up such a symbol, and it's nice to know at a glance which uint8 you are using!
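To make the collision hazard concrete (a standalone illustration, not from the PR):

nested = [[1, 2], [3, 4]]
print(sum(nested, []))    # builtin sum concatenates the lists: [1, 2, 3, 4]

from numpy import sum     # silently shadows the builtin
print(sum(nested))        # numpy's sum adds all the elements: 10

import numpy as np
print(np.sum(nested))     # 10 -- and it's clear at a glance which sum this is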

On a more practical note, for mathematically intensive code that uses lots and lots of numpy symbols, I find it a little inconvenient to hop back and forth between my code and the import statement as I use new symbols. I guess this last concern can be mitigated by importing the namespace during development, and then "cleaning up" the code (as I have done here) before checking it in. So maybe that's a fine approach.

Anyway, I guess I would propose that the guideline about import numpy be relaxed a tiny bit, since explicitly specifying the namespace can be useful (and even necessary) in some cases. But no big deal. Just thought I'd bring it up for a quick discussion! I'm curious what you guys all think.

@industrial-sloth
Contributor

Alright, so a couple comments / questions:

  • one limitation right now is that this doesn't handle objects defined with __slots__, just the regular __dict__ attributes:
@serializable
class Foo(object):
    __slots__ = ['bar']
foo = Foo()
foo.bar = 'a'
foo.serialize()  # boom

I think this could be addressed pretty simply on the serialization side by first checking whether hasattr(self.wrapped, "__slots__") around line 155, and recursively serializing the slots instead of the dict if so. On the deserialization side, one might be able to do something similar by checking for slots on the cls variable? I'm less sure of how the deserialization end would look...
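Something like this rough standalone sketch of the serialization side (the gatherState helper name is made up, not the PR's actual code):

def gatherState(wrapped):
    # Pull attribute values from the declared slots when present,
    # falling back to the usual __dict__ otherwise.
    if hasattr(wrapped, "__slots__"):
        return dict((name, getattr(wrapped, name))
                    for name in wrapped.__slots__
                    if hasattr(wrapped, name))
    return dict(wrapped.__dict__)

class Point(object):
    __slots__ = ['x', 'y']

p = Point()
p.x, p.y = 1, 2
print(gatherState(p))  # {'x': 1, 'y': 2} (key order may vary)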

Anyway, I think it would be reasonable to not support __slots__ right out of the gate on this, but in that case it'd be nice to have a comment to that effect somewhere in the docstring.

So, in short, I'd like to either support serializing / deserializing objects with slots, or else to say up front that it isn't supported (yet). :)

  • Various data types that aren't supported:
  • frozenset. This is probably my top ask, since supporting it should be at least as easy as the current support for regular sets, and it's a little weird to have an immutable, hashable type that can't be serialized while its mutable cousin is completely serializable. Also, I do actually use frozensets reasonably often (see the sketch after this list).
  • slice. Understood that this isn't the major intended use case, but the new Blocks type uses slices all over the place, and if we'd ever want to be able to serialize those objects using this framework slice support would have to happen. Not clear that this has to be added right now though. Again, a comment saying that we can't support serializing slices is probably fine.
  • complex. Now I'm just getting pedantic. :) Notably, serializing complex-valued numpy arrays seems to work just fine. If this is any more difficult than adding complex as a supported type on line 117, it's probably not worth it, and a comment to that effect would suffice.
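For reference, the frozenset handler mentioned above would presumably mirror the existing set logic almost exactly. A rough sketch, with assumed key and helper names rather than the PR's actual code:

def encodeFrozenset(obj, serializeRecursively=lambda v: v):
    # Tag the value with a "py/frozenset" key, just as the other container
    # types are tagged with their "py/"-prefixed names.
    if type(obj) == frozenset:
        return {"py/frozenset": [serializeRecursively(v) for v in obj]}

def decodeFrozenset(dct, deserializeRecursively=lambda v: v):
    if "py/frozenset" in dct:
        return frozenset(deserializeRecursively(v) for v in dct["py/frozenset"])

print(encodeFrozenset(frozenset([1, 2, 3])))  # {'py/frozenset': [1, 2, 3]} (order may vary)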

Overall again this is looking really nice!

@industrial-sloth
Contributor

Also there appears to be some weirdness with the namedtuple support:

FooTuple = serializable(namedtuple('FooTuple', 'bar'))
foo = FooTuple(bar="baz")
foo.serialize() # returns: {'py/collections.OrderedDict': [['bar', 'baz']]} ... OrderedDict?
foo2 = FooTuple.deserialize(foo.serialize())  # boom: TypeError: __new__() takes exactly 2 arguments (1 given)

My syntax may be off here, but this was the first seemingly sensible usage with a namedtuple that I could come up with. Looking at the original blog posts, this looks like some kind of unexpected interaction between the serialization code and the decorator wrapper?

@broxtronix
Contributor Author

Thanks for the excellent comments, @industrial-sloth !! I just checked in two more code changes. One adds support for complex (easy enough!), and the other one adds support for classes that use __slots__ instead of a __dict__. This seems like a very important use case, since I imagine we will often be serializing numerous small objects (e.g. one Transformation per image) where we will want to capitalize on the space-savings offered by __slots__.

Adding support for frozenset and slice should also be straightforward, but I have not done so yet. Would it be ok to add those a little down the road, once those use cases arise? (Sounds like they may arise soon, but it should be easy to add those capabilities when they do!)

@industrial-sloth
Contributor

Hey @broxtronix - ya, sure thing. Your fix for __slots__ certainly appears sensible - and the new test you added is passing, so that's nice. :) I have no immediate need to convert frozensets or slices to JSON, so I'm fine with leaving that out for now. I may go ahead and add support for those two myself once this PR goes in.

Any thoughts on my namedtuple issue? Worth noting that with these most recent changes, I'm now seeing different behavior on my previous test:

from collections import namedtuple
from thunder.utils.decorators import serializable
FooTuple = serializable(namedtuple('FooTuple', 'bar'))
foo = FooTuple(bar="baz")
foo.serialize()  # now returns {}

Again, I'm not 100% certain whether this particular syntax would in fact be expected to do what I want it to do (give me back a JSON-serializable namedtuple), but it seems like a reasonable attempt at least. :)

@broxtronix
Contributor Author

Ok, @industrial-sloth, I did find a bug in the namedtuple and OrderedDict handling. Those were something of a separate issue, though.

It turns out that the reason your code example is not working is a more subtle issue... the namedtuple appears to have both slots and a dict for some reason, and therefore also seems to have a __new__ constructor that takes two arguments instead of one. I'm not sure exactly what is going on here, but clearly this is a more complicated Python object than the norm.
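A quick inspection (on the Python 2.7 series, which is what applies here) shows what's going on, and why the earlier serialize() call produced an OrderedDict:

from collections import namedtuple

FooTuple = namedtuple('FooTuple', 'bar')
foo = FooTuple(bar="baz")

print(FooTuple.__slots__)  # () -- namedtuples declare empty __slots__
print(foo.__dict__)        # OrderedDict([('bar', 'baz')]) -- Python 2.7's
                           # namedtuple defines __dict__ as a property over
                           # _asdict(), so serializing __dict__ yields an
                           # OrderedDict rather than the tuple itself
try:
    FooTuple()
except TypeError as e:
    print(e)               # __new__() takes exactly 2 arguments (1 given):
                           # __new__ requires the field values up front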

I suspect we could probably figure out how to handle this case, but for now I have added a comment to the docstring and an exception if you try to wrap such a class. And, although you cannot wrap a namedtuple directly, you can still use a namedtuple inside of an @serializable class like this:

from collections import namedtuple

@serializable
class Foo(object):
    def __init__(self):
        self.nt = namedtuple('FooTuple', 'bar')

foo = Foo()
foo.nt.bar = "baz"

testJson = foo.serialize()
foo2 = Foo.deserialize(testJson)
assert(foo.nt.bar == foo2.nt.bar)

So, @serializable is not fully general (although it could be made more so with time), but it does work in the current intended use case, where we have an object containing a bunch of different types of state information. Hopefully the error checking and documentation are enough to steer people in the right usage direction!

@poolio
Contributor

poolio commented Feb 4, 2015

I agree that we should only serialize things that are a particular type, and not attempt to serialize subclasses of serializable types. I think type(obj) == Foo is only true if obj is an instance of Foo, and is false if obj is an instance of a subclass of Foo. Whereas isinstance will be true for all subclasses. That's normally the desired behavior, but here I think we really want type instead of isinstance.
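A quick standalone illustration of the difference:

class MyTuple(tuple):
    pass

t = MyTuple((1, 2))
print(isinstance(t, tuple))  # True  -- isinstance matches subclasses too
print(type(t) == tuple)      # False -- the exact type check excludes subclasses
print(type(t) == MyTuple)    # True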

…(foo) = bar to get more specificity when identifying specific classes rather than accidentally pickling a sub-class of a serializable base class.
@broxtronix
Contributor Author

You are totally right, @poolio. I had it backwards. Using type(foo) == bar is the more specific check: it ensures you are dealing with exactly that class, and not a subclass. I switched it over!

Serialize this object to a python dictionary that can easily be converted
to/from JSON using Python's standard JSON library.

Arguments
Member

We've been using a slightly different format for arguments (it's the same format used by numpy and scipy documentation), should look like this:

Parameters
----------
numpyStorage : {'auto', 'ascii', 'base64' }, optional, default 'auto'
    Use to select whether numpy arrays...

Returns
-------
The object encoded as a...

Will add this to the style guide!

and hasattr(obj, "_asdict") \
and callable(obj._asdict)

def serializable(cls):
Member

Two lines before def

@freeman-lab
Member

@broxtronix thanks for putting together an awesome patch! And thanks @industrial-sloth and @poolio for the detailed review. I left a bunch of comments but I think all are about formatting and many are small nits (some of which are things that should be in the style guide but aren't, so glad to have caught those).

I think the coverage is fine for this PR, agree with @industrial-sloth that there are some additional types we could add support for as we move forward.

Lastly, regarding np vs numpy, I'm with you on the potential inconvenience during development, the reason is exactly what you said: making the codebase itself more readable. I think many of us use the np. format while developing and then clean up, which is fine, if not ideal. I'm less worried about namespace issues because we rely quite a lot on local namespacing anyway, but it's a fair point, and can certainly keep the discussion open as we move forward.

@freeman-lab
Member

(BTW, a lot of the style nits just come from opening this up in PyCharm with PEP8 checking turned on, we generally stick to all those conventions with the exception of camelCasing)

@broxtronix
Contributor Author

Alrighty, I made the changes you suggested, @freeman-lab. I must say that this process of code review has definitely produced much clearer code, some nice new features, and we even caught a few critical bugs. Thanks to all for taking the time to give it a thorough look-through!

Regarding your comment about "types" above, @freeman-lab, are you thinking we might want to change py/collections.namedtuple to collections.namedtuple? I was thinking about that myself. I could go either way, but I figured it probably wasn't too big a deal, and the additional specificity might be nice to help make it clear that this JSON data originally came from Python. With the longer class names it is clear in either case, but it felt a little funny to use shorter names like dict, tuple, set, etc. But totally happy to change that if you guys think it would be better the other way around.

And thanks for your thoughts on the numpy style guidelines. I'll bring it up again if that ends up feeling too onerous, but I suspect I can get in the habit of eliminating the np.* detritus from my code. :)

Onward and upward!

@freeman-lab
Member

Awesome, that's great to hear! Regarding the types, I initially liked the idea of the names being the exact proper name as defined in python, but then realized we'd be stuck on the ascii versus base64 case, because those are not actual python types. If there was some alternate way to embed that choice for that particular case, then we'd be using proper class names for everything, but if that's too tricky, then I say we keep it as you have it.

@broxtronix
Contributor Author

Well, I went in and changed how the encoding format information is stored for numpy arrays. It was relatively easy to make this a part of the dictionary rather than part of the key, and I think it is probably a bit more elegant this way.
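To illustrate the change (the key and field names here are assumptions, not the PR's exact output), the storage choice moves out of the type key and into the value dictionary:

# Before (illustrative): the encoding was baked into the key itself
before = {"py/numpy.ndarray.base64": "...encoded bytes..."}

# After (illustrative): the key names only the python type, and the
# encoding choice lives inside the dictionary alongside the data
after = {"py/numpy.ndarray": {"encoding": "base64",
                              "dtype": "float64",
                              "data": "...encoded bytes..."}}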

I then went to try to remove all of the "py/" prefixes to harmonize the encoding keys with the actual underlying python types. However, I got halfway through and then undid this change because of this little bit of code starting at line 207:

                # First, check to see if this is an encoded entry
                dataKey = None
                if type(dct) == dict:
                    filteredKeys = filter(lambda x: "py/" in x, dct.keys())

                    # If there is just one key with a "py/" prefix, that is the dataKey!
                    if len(filteredKeys) == 1:
                        dataKey = filteredKeys[0]

                # If no data key is found, we assume the data needs no further decoding.
                if dataKey is None:
                    return dct

This is a bit of heuristic code to see if this particular portion of the serialized object was encoded using one of our special encoding methods, or if it was one of the more basic types that didn't require any encoding. It uses the "py/" prefix as a marker to make the determination about whether to proceed with any special decoding logic.

How about this: with my latest commits, all of the names after the "py/" prefix match their python counterparts (now that we have done away with the 'ascii' and 'base64' suffixes). It's easy enough to strip away the "py/" prefix to get back the full classname if needed. I can think of a couple of other ways to record (1) the fact that this object has special encoding, and (2) the original class that was encoded, which don't use the "py/" prefix and instead place the class name in the encoded dictionary. However, I think I prefer the JSON produced by the current code to other things I can think of off the top of my head. Totally open to suggestions, though. Let me know what you think.

(We could also change the prefix from "py/" to something else if you want!)

@industrial-sloth
Contributor

Yeah the "py/" magic string prefix makes sense to me. The concern of course is whether this sort of string would happen to come up in some regular data, and then be misinterpreted as a deserialization data type specification, but "py/" seems sufficiently unusual that it'd be unlikely to happen often in practice.

That being said, the lambda checking "py/" in x (line 210) should probably be x.startswith("py/"), reducing the likelihood that we'd have an accidental collision with a string that happened to have "py/" somewhere inside it.
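The substring check really can misfire on innocent-looking keys; a standalone illustration:

print("py/" in "happy/days")           # True  -- matches in the middle
print("happy/days".startswith("py/"))  # False -- the prefix check is safer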

Now it may be that owing to the particular way in which keys and values get nested we could guarantee that at levels of the hierarchy where we might expect to see a "py/" datatype, all other values might be variable names, where "/" isn't a legal character, and this whole point about the possibility of accidental collisions might be moot. But that's not immediately obvious to me one way or the other.

Finally (I promise!), it might be worth considering at some point in the future, not in this PR, making the "py/" prefix configurable or set to a constant or something, so that if we do end up having worries about "py/" showing up in data, we could change this magic string to "&^RKL&^:KG^R" or whatever and lower our collision odds still further. :)

@freeman-lab
Member

Thanks @broxtronix ! I agree with you and @industrial-sloth , the py prefix is useful and reasonable, and it's great that now everything coming after it is a proper python class, that was my main concern with the earlier version. With these last two commits everything LGTM and I'll merge it in unless anyone else has further requests / suggestions.

Parameters
----------

numpy-storage: {'auto', 'ascii', 'base64' }, optional, default 'auto'
Contributor

s/numpy-storage/numpyStorage

@poolio
Contributor

poolio commented Feb 5, 2015

Tiny typo, but otherwise LGTM. Excited to put this to use!

@broxtronix
Contributor Author

Your suggestion to use startswith() is a great idea, @industrial-sloth. I made that change and fixed Ben's typo.

And regarding a configurable prefix, that is definitely something we could add a short ways down the road, along with serialization support for a few new object types. I have a suspicion that some other feature requests will pop up once this decorator starts getting some use in the code base. There is a pretty wide variety of "data model" objects we might want to be able to serialize into JSON, so definitely keep the discussion going as you all discover ways in which this decorator is and is not working out for you.

I think that addresses everything, but let me know if you guys want any final tweaks. Thanks again for all the superb feedback!!

@freeman-lab
Member

Awesome! LGTM, merging it in. Great patch @broxtronix! And thanks all for the review.

freeman-lab added a commit that referenced this pull request Feb 6, 2015
@freeman-lab freeman-lab merged commit 94c3978 into thunder-project:master Feb 6, 2015
@broxtronix broxtronix deleted the serializable branch April 6, 2015 20:46