Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON serializable registration model #106

Merged

Conversation

industrial-sloth
Copy link
Contributor

This PR modifies the existing JSON serialization code quite heavily, with the end goal of having this be usable to serialize RegistrationModel objects from the imgprocessing image registration code.

This gets around a couple issues with the previous serialization code:

  • RegistrationModels have nested within them Displacement objects. However the previous serialization code didn't handle custom classes. The current code can handle nested custom classes, so long as those nested classes are themselves serializable.
  • The previous decorator-based code produced objects that were not pickleable, since their type (ThunderSerializableObjectWrapper) was defined inside a function rather than at the top level of a module, and thus pickle could not dynamically instantiate them. RegistrationModels need to be pickleable, since they are broadcast by pyspark, which uses pickle to do so. This PR moves the serialization logic into an abstract base class rather than a decorator, so serializable classes must now extend ThunderSerializable (can be multiple inheritance) rather than being wrapped by the @serializable decorator.

At present this is still a little messy. I'm opening this PR right now for visibility and comment, but I don't yet consider it ready to be merged in.

This adds some level of support for putting serializable
objects inside each other. There are unit tests that make
sensible assertions and do not fail, but beyond that I am
not yet super confident that this is totally solid.
Can now use the at-serializable decorator to serialize nested
custom objects; however it turns out objects with this decorator
can no longer be pickled. Applying this to the RegistrationModel
currently breaks lots of tests, since transformations have to be
picklable in order to be distributed by Spark.
The former at-serializable decorator is now an abstract
ThunderSerializable class. Classes wrapped by the previous decorator
are not pickleable, since their class is not exposed at the top
level of a module.

Also fixes recursive serialization of non-basic types nested
within lists, tuples, and so on.
RegistrationModels have a transformations dictionary where all
values are instances of Displacements. This special case allows
the name of the Displacement subclass to be serialized once for the
dict, rather than once per value.
@freeman-lab
Copy link
Member

cc @broxtronix @poolio

no special encoding is needed for plain lists
Only reason to have this at all is to make clear that Serializable
is intended as a mixin, and not a "real" base class.
Previously, the serialization special case handling of homogeneously
typed lists and dicts would only be triggered for contained objects
with regular __dict__s, not with __slots__. The homogenous container
logic now supports both types of objects.
Conflicts:
	python/test/test_imgprocessing.py
	python/thunder/imgprocessing/regmethods/utils.py
@industrial-sloth
Copy link
Contributor Author

Ok, I think this should be ready for a potential merge now. Here's an example of a JSON-serialized RegistrationModel (taken from the unit tests, and manually reformatted for legibility):

{
"regMethod": "CrossCorr", 
"transClass": "Displacement", 
"transformations": 
  {"py/hmgDict": 
    {"data": 
      [ [0, {"delta": [2, 0]} ], 
        [1, {"delta": [0, 2]} ]], 
    "type": "Displacement", 
    "module": "thunder.imgprocessing.transformation"
    }
  }
}

Here regMethod, transClass, and transformations were the actual object attributes. The "py/hmgDict" tag indicates that transformations is a dictionary where all values are Serializable subclasses and have the same type (homogeneous, or "hmg"). Every value in transformations is a Displacement, which now extends Serializable, so we're good. Displacement instances themselves have just one attribute, named delta, which are here lists of int. Since we've flagged this as a homogeneously valued dictionary, we write out the Displacement class name and thunder.imgprocessing.transformation module name just once for the whole dict, rather than once per object, which would get crazy in a hurry.

If we wanted to (say) convert Displacement from a standard __dict__ based class to one based on __slots__ we should be able to do that at this point without impacting the serialization. In this particular case there's probably no need, since we're talking about one Displacement per image, which is maybe thousands but not millions, and so the memory savings probably isn't worth it. But for future reference, serializing slots should work fine. :)

@industrial-sloth industrial-sloth changed the title WIP - Serializable regmodel JSON serializable registration model Feb 12, 2015
@broxtronix
Copy link
Contributor

Wow, this is a fantastic overhaul, @industrial-sloth! I suppose we should have tested the decorator approach more carefully before merging it in, but now that we have switched from container classes over to base classes, I cannot really think of any real drawbacks to this new approach. It did indeed feel like I was going to great lengths to present the inner class's true identity (via complicated overriding of attribute access methods, __str__, __repr__, etc) in order to get the decorator approach to work correctly. This seems like a much cleaner approach. And what's amazing is that the end-user API is basically the same, modulo the use of base class rather than decorator when declaring the class. In all respects, this seems like a big improvement!

Just a few quick comments:

My recommendation would be to rename py/hmgList and py/hmgDict to something without abbreviations like py/homogeneousList and py/homogeneousDict. Sure, it's more verbose, but I was totally left scratching my head at first until I read your pull request commentary, and I think there is definite value in making the JSON generated by this class self-expository since these JSON files will often exist on their own where such abbreviations aren't necessarily obvious.

Your tests for checking whether the list or dict is homogeneous are a little verbose, especially since it needs to be repeated twice for list and for dict. Here are some helper functions you could place at the top of the file to streamline that code in __serializeRecursively() and make it more readable:

def hasHomogeneousValues(container):
    if isinstance(container, list):
        return ( len(set([type(elem) for elem in container])) == 1 )
    elif isinstance(container, dict):
        return ( len(set([type(elem) for elem in container.iteritems() ])) == 1 )

def hasSerializableValues(container):
    from numpy import all
    if isinstance(container, list):
        return all( [isinstance(elem, Serializable) for elem in container] )
    elif isinstance(container, dict):
        return all( [isinstance(elem, Serializable) for elem in container.iteritems() ] )

With these methods so defined, you could simply check for homogeneous serializability using hasHomogeneousValues(container) and hasSerializableValues(container), and then go along your way.

Everything else is looking fantastic. Totally awesome! :)

@industrial-sloth
Copy link
Contributor Author

Brilliant, thanks @broxtronix ! Agreed re: the homogeneity testing, I'l take those functions and merge those in, looks like a clear win.

Funny about the homogenous container tags - I usually end up erring on the side of verbosity, the original tag strings were something like "py/homogenousThunderSerializableList" or something equally ridiculous. So I cut them way way back, and may have gone too far. :) There's a bit of a tradeoff between human-readability and efficient encoding here that's hard for me to judge, but if you think the more legible strings are worth the extra characters, I'm happy to put that in. Thanks again for the review!

@broxtronix
Copy link
Contributor

Sounds great! Thanks for the major overhaul of the code!!

One other quick thing occurred to me... in addition to the save() and load() pair (to save directly to disk), I think it might make sense to add a toJson() and fromJson() method that allow you to encode/decode directly to/from raw JSON strings. This might be useful when sending JSON snippets over a network connection (to Lightning, for example), rather than working directly with files on disk.

@freeman-lab
Copy link
Member

Awesome work guys! I agree with @broxtronix about changing the name to homogenousDict, homogenousList etc. And I also agree about the toJSON() and fromJSON() helper functions (all caps for JSON? looks slightly better to me, but would be good to check the convention in other libraries). It's slightly more modularity, and the ability to integrate with networky things like Lightning will be really useful.

@@ -138,8 +139,11 @@ def fitandtransform(im, reg):
return Images(newrdd).__finalize__(images)


class RegistrationModel(object):

class RegistrationModel(Serializable, object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put object first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments on the next 4 lines. :) Basically, you can't put object first, much as one might want to. See for instance: http://stackoverflow.com/questions/3003053/metaclass-multiple-inheritance-inconsistency

was "py/hmgList" and "py/hmgDict", now "py/homogeneousList" and
"py/homogeneousDict".
checking for homogeneous serializable data values is now pulled out
into a separate method, as is building a dictionary equivalent to an
instance's __slots__.
also fix issue where strings were decoded as Unicode. Unicode
strings are not currently supported, only plain strings.
@industrial-sloth
Copy link
Contributor Author

Ok, I think I have all those comments addressed. I ended up going with a __isHomogeneousSerializable helper function that is not as pretty as @broxtronix 's isHomogeneous / isSerializable pair, but that only traverses the list once. The type codes now spell out "homogeneous" (which as it happens I'd been spelling incorrectly) rather than just "hmg". And I've added toJSON / fromJSON methods.

This last was unexpectedly useful, as it exposed an issue where the full roundtrip through the JSON module ends up converting plain strings to Unicode strings - arguably exactly what it should be doing but not I think what we want in Thunder, which generally speaking is not big on internationalization. Have now added a couple of hooks to the JSON string decoding logic that convert unicode strings back to plain strings, as suggested by Stack Overflow.

@freeman-lab
Copy link
Member

LGTM! The unicode / string thing is a little icky, it looks like we coupld also get around by using a different json library (it's handled more uniformly in simplejson it seems?) But we can keep it this way for now, simpler to add these helper functions than to switch json libraries.

freeman-lab added a commit that referenced this pull request Feb 14, 2015
@freeman-lab freeman-lab merged commit 6767a18 into thunder-project:master Feb 14, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants