JSON serializable registration model #106

industrial-sloth · 2015-02-11T23:10:36Z

This PR modifies the existing JSON serialization code quite heavily, with the end goal of having this be usable to serialize RegistrationModel objects from the imgprocessing image registration code.

This gets around a couple issues with the previous serialization code:

RegistrationModels have nested within them Displacement objects. However the previous serialization code didn't handle custom classes. The current code can handle nested custom classes, so long as those nested classes are themselves serializable.
The previous decorator-based code produced objects that were not pickleable, since their type (ThunderSerializableObjectWrapper) was defined inside a function rather than at the top level of a module, and thus pickle could not dynamically instantiate them. RegistrationModels need to be pickleable, since they are broadcast by pyspark, which uses pickle to do so. This PR moves the serialization logic into an abstract base class rather than a decorator, so serializable classes must now extend ThunderSerializable (can be multiple inheritance) rather than being wrapped by the @serializable decorator.

At present this is still a little messy. I'm opening this PR right now for visibility and comment, but I don't yet consider it ready to be merged in.

This adds some level of support for putting serializable objects inside each other. There are unit tests that make sensible assertions and do not fail, but beyond that I am not yet super confident that this is totally solid.

Can now use the at-serializable decorator to serialize nested custom objects; however it turns out objects with this decorator can no longer be pickled. Applying this to the RegistrationModel currently breaks lots of tests, since transformations have to be picklable in order to be distributed by Spark.

The former at-serializable decorator is now an abstract ThunderSerializable class. Classes wrapped by the previous decorator are not pickleable, since their class is not exposed at the top level of a module. Also fixes recursive serialization of non-basic types nested within lists, tuples, and so on.

RegistrationModels have a transformations dictionary where all values are instances of Displacements. This special case allows the name of the Displacement subclass to be serialized once for the dict, rather than once per value.

freeman-lab · 2015-02-12T06:57:11Z

cc @broxtronix @poolio

no special encoding is needed for plain lists

Only reason to have this at all is to make clear that Serializable is intended as a mixin, and not a "real" base class.

Previously, the serialization special case handling of homogeneously typed lists and dicts would only be triggered for contained objects with regular __dict__s, not with __slots__. The homogenous container logic now supports both types of objects.

Conflicts: python/test/test_imgprocessing.py python/thunder/imgprocessing/regmethods/utils.py

industrial-sloth · 2015-02-12T21:44:26Z

Ok, I think this should be ready for a potential merge now. Here's an example of a JSON-serialized RegistrationModel (taken from the unit tests, and manually reformatted for legibility):

{
"regMethod": "CrossCorr", 
"transClass": "Displacement", 
"transformations": 
  {"py/hmgDict": 
    {"data": 
      [ [0, {"delta": [2, 0]} ], 
        [1, {"delta": [0, 2]} ]], 
    "type": "Displacement", 
    "module": "thunder.imgprocessing.transformation"
    }
  }
}

Here regMethod, transClass, and transformations were the actual object attributes. The "py/hmgDict" tag indicates that transformations is a dictionary where all values are Serializable subclasses and have the same type (homogeneous, or "hmg"). Every value in transformations is a Displacement, which now extends Serializable, so we're good. Displacement instances themselves have just one attribute, named delta, which are here lists of int. Since we've flagged this as a homogeneously valued dictionary, we write out the Displacement class name and thunder.imgprocessing.transformation module name just once for the whole dict, rather than once per object, which would get crazy in a hurry.

If we wanted to (say) convert Displacement from a standard __dict__ based class to one based on __slots__ we should be able to do that at this point without impacting the serialization. In this particular case there's probably no need, since we're talking about one Displacement per image, which is maybe thousands but not millions, and so the memory savings probably isn't worth it. But for future reference, serializing slots should work fine. :)

broxtronix · 2015-02-13T04:22:59Z

Wow, this is a fantastic overhaul, @industrial-sloth! I suppose we should have tested the decorator approach more carefully before merging it in, but now that we have switched from container classes over to base classes, I cannot really think of any real drawbacks to this new approach. It did indeed feel like I was going to great lengths to present the inner class's true identity (via complicated overriding of attribute access methods, __str__, __repr__, etc) in order to get the decorator approach to work correctly. This seems like a much cleaner approach. And what's amazing is that the end-user API is basically the same, modulo the use of base class rather than decorator when declaring the class. In all respects, this seems like a big improvement!

Just a few quick comments:

My recommendation would be to rename py/hmgList and py/hmgDict to something without abbreviations like py/homogeneousList and py/homogeneousDict. Sure, it's more verbose, but I was totally left scratching my head at first until I read your pull request commentary, and I think there is definite value in making the JSON generated by this class self-expository since these JSON files will often exist on their own where such abbreviations aren't necessarily obvious.

Your tests for checking whether the list or dict is homogeneous are a little verbose, especially since it needs to be repeated twice for list and for dict. Here are some helper functions you could place at the top of the file to streamline that code in __serializeRecursively() and make it more readable:

def hasHomogeneousValues(container):
    if isinstance(container, list):
        return ( len(set([type(elem) for elem in container])) == 1 )
    elif isinstance(container, dict):
        return ( len(set([type(elem) for elem in container.iteritems() ])) == 1 )

def hasSerializableValues(container):
    from numpy import all
    if isinstance(container, list):
        return all( [isinstance(elem, Serializable) for elem in container] )
    elif isinstance(container, dict):
        return all( [isinstance(elem, Serializable) for elem in container.iteritems() ] )

With these methods so defined, you could simply check for homogeneous serializability using hasHomogeneousValues(container) and hasSerializableValues(container), and then go along your way.

Everything else is looking fantastic. Totally awesome! :)

industrial-sloth · 2015-02-13T15:31:24Z

Brilliant, thanks @broxtronix ! Agreed re: the homogeneity testing, I'l take those functions and merge those in, looks like a clear win.

Funny about the homogenous container tags - I usually end up erring on the side of verbosity, the original tag strings were something like "py/homogenousThunderSerializableList" or something equally ridiculous. So I cut them way way back, and may have gone too far. :) There's a bit of a tradeoff between human-readability and efficient encoding here that's hard for me to judge, but if you think the more legible strings are worth the extra characters, I'm happy to put that in. Thanks again for the review!

broxtronix · 2015-02-13T16:55:44Z

Sounds great! Thanks for the major overhaul of the code!!

One other quick thing occurred to me... in addition to the save() and load() pair (to save directly to disk), I think it might make sense to add a toJson() and fromJson() method that allow you to encode/decode directly to/from raw JSON strings. This might be useful when sending JSON snippets over a network connection (to Lightning, for example), rather than working directly with files on disk.

freeman-lab · 2015-02-13T17:36:16Z

Awesome work guys! I agree with @broxtronix about changing the name to homogenousDict, homogenousList etc. And I also agree about the toJSON() and fromJSON() helper functions (all caps for JSON? looks slightly better to me, but would be good to check the convention in other libraries). It's slightly more modularity, and the ability to integrate with networky things like Lightning will be really useful.

freeman-lab · 2015-02-13T17:37:33Z

python/thunder/imgprocessing/registration.py

@@ -138,8 +139,11 @@ def fitandtransform(im, reg):
        return Images(newrdd).__finalize__(images)


-class RegistrationModel(object):
-
+class RegistrationModel(Serializable, object):


put object first?

See comments on the next 4 lines. :) Basically, you can't put object first, much as one might want to. See for instance: http://stackoverflow.com/questions/3003053/metaclass-multiple-inheritance-inconsistency

was "py/hmgList" and "py/hmgDict", now "py/homogeneousList" and "py/homogeneousDict".

checking for homogeneous serializable data values is now pulled out into a separate method, as is building a dictionary equivalent to an instance's __slots__.

also fix issue where strings were decoded as Unicode. Unicode strings are not currently supported, only plain strings.

industrial-sloth · 2015-02-13T20:49:38Z

Ok, I think I have all those comments addressed. I ended up going with a __isHomogeneousSerializable helper function that is not as pretty as @broxtronix 's isHomogeneous / isSerializable pair, but that only traverses the list once. The type codes now spell out "homogeneous" (which as it happens I'd been spelling incorrectly) rather than just "hmg". And I've added toJSON / fromJSON methods.

This last was unexpectedly useful, as it exposed an issue where the full roundtrip through the JSON module ends up converting plain strings to Unicode strings - arguably exactly what it should be doing but not I think what we want in Thunder, which generally speaking is not big on internationalization. Have now added a couple of hooks to the JSON string decoding logic that convert unicode strings back to plain strings, as suggested by Stack Overflow.

freeman-lab · 2015-02-14T00:55:27Z

LGTM! The unicode / string thing is a little icky, it looks like we coupld also get around by using a different json library (it's handled more uniformly in simplejson it seems?) But we can keep it this way for now, simpler to add these helper functions than to switch json libraries.

JSON serializable registration model

industrial-sloth added 6 commits February 6, 2015 16:40

WIP serializing RegistrationModel

6ff706d

shaky support for nesting serializable objects

ceffc8c

This adds some level of support for putting serializable objects inside each other. There are unit tests that make sensible assertions and do not fail, but beyond that I am not yet super confident that this is totally solid.

rename decorators module to serializable.py

b313341

freeman-lab added the in progress label Feb 11, 2015

industrial-sloth added 10 commits February 12, 2015 11:28

cleanup of serialization code and tests

95beb92

more serialization code and comment cleanup

687cbdf

rename ThunderSerializable to Serializable

1b7f213

add test for homogenous-valued dict serialization

31e49b8

remove 'py/list' encoding in serialization

8291c84

no special encoding is needed for plain lists

shorten type strings for homogenous containers in serialization

4033233

explicitly inherit object and Serializable in RegistrationModel

f448fa9

fix Serializable, object multiple inheritance order

e9cdf76

Only reason to have this at all is to make clear that Serializable is intended as a mixin, and not a "real" base class.

Merge branch 'master' into serializable_regmodel

9c9a7f5

Conflicts: python/test/test_imgprocessing.py python/thunder/imgprocessing/regmethods/utils.py

industrial-sloth changed the title ~~WIP - Serializable regmodel~~ JSON serializable registration model Feb 12, 2015

freeman-lab reviewed Feb 13, 2015
View reviewed changes

industrial-sloth added 3 commits February 13, 2015 14:13

change homogeneous type codes in serialization

dd7cb4d

was "py/hmgList" and "py/hmgDict", now "py/homogeneousList" and "py/homogeneousDict".

refactor homogeneity test in serialization

40f1360

checking for homogeneous serializable data values is now pulled out into a separate method, as is building a dictionary equivalent to an instance's __slots__.

add toJSON and fromJSON string serialization methods

2f72228

also fix issue where strings were decoded as Unicode. Unicode strings are not currently supported, only plain strings.

industrial-sloth added 2 commits February 13, 2015 15:41

serialization docstring reformatting

62a7ed6

Merge branch 'master' into serializable_regmodel

39109bb

freeman-lab added a commit that referenced this pull request Feb 14, 2015

Merge pull request #106 from industrial-sloth/serializable_regmodel

6767a18

JSON serializable registration model

freeman-lab merged commit 6767a18 into thunder-project:master Feb 14, 2015

freeman-lab removed the in progress label Feb 14, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON serializable registration model #106

JSON serializable registration model #106

industrial-sloth commented Feb 11, 2015

freeman-lab commented Feb 12, 2015

industrial-sloth commented Feb 12, 2015

broxtronix commented Feb 13, 2015

industrial-sloth commented Feb 13, 2015

broxtronix commented Feb 13, 2015

freeman-lab commented Feb 13, 2015

freeman-lab Feb 13, 2015

industrial-sloth Feb 13, 2015

industrial-sloth commented Feb 13, 2015

freeman-lab commented Feb 14, 2015

JSON serializable registration model #106

JSON serializable registration model #106

Conversation

industrial-sloth commented Feb 11, 2015

freeman-lab commented Feb 12, 2015

industrial-sloth commented Feb 12, 2015

broxtronix commented Feb 13, 2015

industrial-sloth commented Feb 13, 2015

broxtronix commented Feb 13, 2015

freeman-lab commented Feb 13, 2015

freeman-lab Feb 13, 2015

Choose a reason for hiding this comment

industrial-sloth Feb 13, 2015

Choose a reason for hiding this comment

industrial-sloth commented Feb 13, 2015

freeman-lab commented Feb 14, 2015