Serializable #96

Merged
merged 15 commits into from
Feb 6, 2015
Changes from 3 commits
47 changes: 47 additions & 0 deletions python/test/test_utils.py
@@ -51,3 +51,50 @@ def elementwiseStdev(arys):
    combined = vstack([ary.ravel() for ary in arys])
    stdAry = std(combined, axis=0)
    return stdAry.reshape(arys[0].shape)


class TestSerializableDecorator(PySparkTestCase):
Member

Let's move these tests into a separate test module test_decorators.py


    def test_serializable_decorator(self):

        from thunder.utils.decorators import serializable
        import numpy as np
        import datetime

        @serializable
        class Visitor(object):
            def __init__(self, ip_addr = None, agent = None, referrer = None):
                self.ip = ip_addr
                self.ua = agent
                self.referrer= referrer
                self.test_dict = {'a': 10, 'b': "string", 'c': [1, 2, 3]}
                self.test_vec = np.array([1,2,3])
                self.test_array = np.array([[1,2,3],[4,5,6.]])
                self.time = datetime.datetime.now()

            def __str__(self):
                return str(self.ip) + " " + str(self.ua) + " " + str(self.referrer) + " " + str(self.time)

            def test_method(self):
                return True

        # Run the test. Build an object, serialize it, and recover it.

        # Create a new object
        orig_visitor = Visitor('192.168', 'UA-1', 'http://www.google.com')

        # Serialize the object
        pickled_visitor = orig_visitor.serialize(numpy_storage='ascii')

        # Restore object
        recov_visitor = Visitor.deserialize(pickled_visitor)

        # Check that the object was reconstructed successfully
        assert(orig_visitor.ip == recov_visitor.ip)
        assert(orig_visitor.ua == recov_visitor.ua)
        assert(orig_visitor.referrer == recov_visitor.referrer)
        for key in orig_visitor.test_dict.keys():
            assert(orig_visitor.test_dict[key] == recov_visitor.test_dict[key])

        assert(np.all(orig_visitor.test_vec == recov_visitor.test_vec))
        assert(np.all(orig_visitor.test_array == recov_visitor.test_array))
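
A minimal sketch of why the attributes in this test need special handling: several of the types stored on `Visitor` are rejected or silently altered by the stock `json` module. The helper name `is_json_safe` is hypothetical and not part of the PR; the sketch uses only the standard library and assumes Python 3.

```python
import datetime
import json


def is_json_safe(value):
    """Return True if the stock JSON encoder accepts `value` (hypothetical helper)."""
    try:
        json.dumps(value)
    except TypeError:
        return False
    return True


# Plain dicts of strings, numbers and lists are accepted as they are.
plain_ok = is_json_safe({'a': 10, 'b': "string", 'c': [1, 2, 3]})

# Sets and datetime objects are rejected outright.
set_ok = is_json_safe({1, 2, 3})
datetime_ok = is_json_safe(datetime.datetime(2015, 2, 6))

# Tuples are accepted, but come back from a round trip as lists.
round_tripped = json.loads(json.dumps({"shape": (2, 3)}))
```

The last case is the subtle one: nothing fails, but the type information is lost unless the encoder records it explicitly.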
235 changes: 235 additions & 0 deletions python/thunder/utils/decorators.py
@@ -0,0 +1,235 @@
""" Useful decorators that are used throughout the library """

def _isnamedtuple(obj):
    """Heuristic check if an object is a namedtuple."""
    return isinstance(obj, tuple) \
           and hasattr(obj, "_fields") \
           and hasattr(obj, "_asdict") \
           and callable(obj._asdict)

def serializable(cls):
Member

Two lines before def

    '''The @serializable decorator can decorate any class to make it easy to store
Member

Use triple double quotes

    that class in a human readable JSON format and then recall it and recover
    the original object instance. Class instances that are wrapped in this
    decorator gain the serialize() method, and the class also gains a
    deserialize() static method that can automatically "pickle" and "unpickle" a
    wide variety of objects like so:

        @serializable
Member

Would be best to reformat a bit so this shows up as an example when building the docs, should be:

Examples
--------
>> @serializable
class Visitor():...

as in the numpy documentation (e.g. here). I think this might be the first proper "example" anywhere in the codebase, so might need to do some trial and error building the docs to get the formatting right, I'm happy to try that myself after this is merged =)
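
As a hedged illustration of the layout suggested here, the sketch below puts an `Examples` section in a docstring and checks it with the standard `doctest` module. The function `double` is hypothetical and stands in for the decorator; note the three-chevron `>>>` prompt, which is the form `doctest` recognises.

```python
import doctest


def double(x):
    """
    Return twice the input (hypothetical function, for illustration only).

    Examples
    --------
    >>> double(21)
    42
    """
    return 2 * x


# Parse the docstring into a doctest and run it; nothing is printed on success.
parser = doctest.DocTestParser()
example_test = parser.get_doctest(double.__doc__, {"double": double}, "double", None, 0)
results = doctest.DocTestRunner(verbose=False).run(example_test)
```

Running the example as part of the test suite is one way to catch a docstring example that does not run as written.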

        class Visitor():
Member

Need an object here so it's Visitor(object), otherwise this doesn't run as written =)

            def __init__(self, ip_addr = None, agent = None, referrer = None):
                self.ip = ip_addr
                self.ua = agent
                self.referrer= referrer
                self.time = datetime.datetime.now()

        orig_visitor = Visitor('192.168', 'UA-1', 'http://www.google.com')

        #serialize the object
        pickled_visitor = orig_visitor.serialize()

        #restore object
        recov_visitor = Visitor.deserialize(pickled_visitor)

    Note that this decorator is NOT designed to provide generalized pickling
    capabilities. Rather, it is designed to make it very easy to convert small
    classes containing model properties to a human and machine parsable format
    for later analysis or visualization. A few classes under consideration for
Member

Maybe drop this last sentence, great for our own planning, but will (hopefully!) be only a transient description of the state of things and thus not really for the docs.

    such decorating include the Transformation class for image alignment and the
    Source classes for source extraction.

    A key feature of the @serializable decorator is that it can "pickle" data
    types that are not normally supported by Python's stock JSON dump() and
    load() methods. Supported datatypes include: list, set, tuple, namedtuple,
    OrderedDict, datetime objects, numpy ndarrays, and dicts with non-string
    (but still data) keys. Serialization is performed recursively, and descends
    into the standard python container types (list, dict, tuple, set).

    Some of this code was adapted from these fantastic blog posts by Chris
    Wagner and Sunil Arora:

        http://robotfantastic.org/serializing-python-data-to-json-some-edge-cases.html
        http://sunilarora.org/serializable-decorator-for-python-class/

    '''

    class ThunderSerializeableObjectWrapper(object):

        def __init__(self, *args, **kwargs):
            self.wrapped = cls(*args, **kwargs)

        # Allows transparent access to the attributes of the wrapped class
        def __getattr__(self, *args):
            if args[0] != 'wrapped':
                return getattr(self.wrapped, *args)
            else:
                return self.__dict__['wrapped']

        # Allows transparent access to the attributes of the wrapped class
        def __setattr__(self, *args):
            if args[0] != 'wrapped':
                return setattr(self.wrapped, *args)
            else:
                self.__dict__['wrapped'] = args[1]

        # Delegate to wrapped class for special python object-->string methods
        def __str__(self):
            return self.wrapped.__str__()
        def __repr__(self):
Member

Add a single blank line before def

            return self.wrapped.__repr__()
        def __unicode__(self):
Member

Add a single blank line

            return self.wrapped.__unicode__()

        # Delegate to wrapped class for special python methods
        def __call__(self, *args, **kwargs):
            return self.wrapped.__call__(*args, **kwargs)

        # ------------------------------------------------------------------------------
        # SERIALIZE()
Member

minor style nit, but I'd drop this heading and the one below


        def serialize(self, numpy_storage='auto'):
            '''
            Serialize this object to a python dictionary that can easily be converted
            to/from JSON using Python's standard JSON library.

            Arguments
Member

We've been using a slightly different format for arguments (it's the same format used by numpy and scipy documentation), should look like this:

Parameters
----------
numpyStorage : {'auto', 'ascii', 'base64' }, optional, default 'auto'
    Use to select whether numpy arrays...

Returns
-------
The object encoded as a...

Will add this to the style guide!
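
A sketch of how the suggested docstring layout might look on a small function. The function `chooseEncoding` is hypothetical and not part of the PR; it mirrors the size rule in `serialize()` (ASCII below 1000 elements, Base64 otherwise), and the `ValueError` for unknown modes is an added assumption.

```python
def chooseEncoding(numElements, numpyStorage='auto'):
    """
    Pick the encoding used for an array (hypothetical helper, not in the PR).

    Parameters
    ----------
    numElements : int
        Number of elements in the array.

    numpyStorage : {'auto', 'ascii', 'base64'}, optional, default 'auto'
        Requested storage mode; 'auto' decides from the array size.

    Returns
    -------
    str
        Either 'ascii' or 'base64'.
    """
    if numpyStorage not in ('auto', 'ascii', 'base64'):
        # Assumption: reject unknown modes rather than falling through to Base64.
        raise ValueError("unknown numpyStorage: %r" % (numpyStorage,))
    if numpyStorage == 'ascii' or (numpyStorage == 'auto' and numElements < 1000):
        return 'ascii'
    return 'base64'
```

For example, `chooseEncoding(999)` selects ASCII, while `chooseEncoding(1000)` selects Base64.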


            numpy-storage: choose one of ['auto', 'ascii', 'base64'] (default: auto)
Contributor

should be numpyStorage to follow the camelCase guidelines


            Use the 'nmupy_storage' argument to select whether numpy arrays
Contributor

s/nmupy/numpy

            will be encoded in ASCII (as a list of lists), in Base64 (i.e.
            space efficient binary), or to select automatically (the default)
            depending on the size of the array. Currently the Base64 encoding
            is selected if the array has 1000 or more elements.

            Returns

            The object encoded as a python dictionary with "JSON-safe" datatypes that is ready to
            be converted to a string using Python's standard JSON library (or another library of
            your choice.
Contributor

)

            '''
            from collections import namedtuple, Iterable, OrderedDict
            import numpy as np

            def serialize_recursively(data):
                import datetime

                if data is None or isinstance(data, (bool, int, long, float, basestring)):
                    return data
                if isinstance(data, list):
                    return [serialize_recursively(val) for val in data] # Recurse into lists
                if isinstance(data, OrderedDict):
                    return {"py/collections.OrderedDict":
                            [[serialize_recursively(k), serialize_recursively(v)] for k, v in data.iteritems()]}
                if _isnamedtuple(data):
                    return {"py/collections.namedtuple": {
                        "type": type(data).__name__,
                        "fields": list(data._fields),
                        "values": [serialize_recursively(getattr(data, f)) for f in data._fields]}}
                if isinstance(data, dict):
                    if all(isinstance(k, basestring) for k in data): # Recurse into dicts
                        return {k: serialize_recursively(v) for k, v in data.iteritems()}
                    else:
                        return {"py/dict": [[serialize_recursively(k), serialize_recursively(v)] for k, v in data.iteritems()]}
                if isinstance(data, tuple): # Recurse into tuples
                    return {"py/tuple": [serialize_recursively(val) for val in data]}
                if isinstance(data, set): # Recurse into sets
                    return {"py/set": [serialize_recursively(val) for val in data]}
                if isinstance(data, datetime.datetime):
                    return {"py/datetime": str(data)}
                if isinstance(data, np.ndarray):
                    if numpy_storage == 'ascii' or (numpy_storage == 'auto' and data.size < 1000):
                        return {"py/numpy.ndarray.ascii": {
Member

Any concern that this "type" is of our own invention and not a real python or numpy type, unlike say collections.namedtuple? I don't have a better suggestion and think I'm fine with it, just throwing it out there.
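
One way to look at the question: the tag only has to be agreed between the encoder and the decoder. A minimal sketch of that idea follows, using the standard-library `array` module as a stand-in for numpy. The tag names `py/floats.ascii` and `py/floats.base64` and both function names are hypothetical, and the sketch assumes Python 3 and a flat sequence of floats.

```python
import base64
from array import array


def encode_floats(values, storage="ascii"):
    """Encode a flat sequence of floats under a self-chosen tag (illustrative names)."""
    if storage == "ascii":
        return {"py/floats.ascii": {"values": list(values)}}
    raw = array("d", values).tobytes()  # 8-byte doubles, native byte order
    return {"py/floats.base64": {"values": base64.b64encode(raw).decode("ascii")}}


def decode_floats(tagged):
    """Dispatch on whichever tag the encoder wrote."""
    if "py/floats.ascii" in tagged:
        return list(tagged["py/floats.ascii"]["values"])
    if "py/floats.base64" in tagged:
        raw = base64.b64decode(tagged["py/floats.base64"]["values"])
        out = array("d")
        out.frombytes(raw)
        return out.tolist()
    raise ValueError("unrecognised tag")


samples = [1.0, 2.5, -3.25]
from_ascii = decode_floats(encode_floats(samples, "ascii"))
from_base64 = decode_floats(encode_floats(samples, "base64"))
```

Both tags decode to the same values; the tag records only how the payload was stored, not a Python type.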

                            "shape": data.shape,
                            "values": data.tolist(),
                            "dtype": str(data.dtype)}}
                    else:
                        import base64
                        return {"py/numpy.ndarray.base64": {
                            "shape": data.shape,
                            "values": base64.b64encode(data),
                            "dtype": str(data.dtype)}}

                raise TypeError("Type %s not data-serializable" % type(data))

            # Start serializing from the top level object dictionary
            return serialize_recursively(self.wrapped.__dict__)

        # ------------------------------------------------------------------------------
        # DESERIALIZE()

        @staticmethod
        def deserialize(serialized_dict):
            '''
            Restore the object that has been converted to a python dictionary using an @serializable
            class's serialize() method.

            Arguments

            serialized_dict: a python dictionary returned by serialize()

            Returns:

            A reconstituted class instance
            '''

            def restore_recursively(dct):
                '''
                This object hook helps to deserialize objects encoded using the
                serialize() method above.
                '''
                from collections import namedtuple, OrderedDict
                import numpy as np
                import base64

                # Lists may hold encoded values, so descend into them. Any other
                # non-dict value (number, string, None) needs no decoding.
                if isinstance(dct, list):
                    return [restore_recursively(val) for val in dct]
                if not isinstance(dct, dict):
                    return dct

                if "py/dict" in dct:
                    return dict(restore_recursively(dct["py/dict"]))
                if "py/tuple" in dct:
                    return tuple(restore_recursively(dct["py/tuple"]))
                if "py/set" in dct:
                    return set(restore_recursively(dct["py/set"]))
                if "py/collections.namedtuple" in dct:
                    data = restore_recursively(dct["py/collections.namedtuple"])
                    return namedtuple(data["type"], data["fields"])(*data["values"])
                if "py/collections.OrderedDict" in dct:
                    return OrderedDict(restore_recursively(dct["py/collections.OrderedDict"]))
                if "py/datetime" in dct:
                    from dateutil import parser
                    return parser.parse(dct["py/datetime"])
                if "py/numpy.ndarray.ascii" in dct:
                    data = dct["py/numpy.ndarray.ascii"]
                    return np.array(data["values"], dtype=data["dtype"])
                if "py/numpy.ndarray.base64" in dct:
                    data = dct["py/numpy.ndarray.base64"]
                    arr = np.frombuffer(base64.decodestring(data["values"]), np.dtype(data["dtype"]))
                    return arr.reshape(data["shape"])

                # Base case: a plain dict. Its values may still be encoded, so decode them.
                return dict((k, restore_recursively(v)) for k, v in dct.iteritems())

            # First we must restore the object's dictionary entries. These are decoded recursively
            # using the helper function above.
            restored_dict = {}
            for k in serialized_dict.keys():
                restored_dict[k] = restore_recursively(serialized_dict[k])

            # Next we recreate the object. Calling the __new__() function here creates
            # an empty object without calling __init__(). We then take this empty
            # shell of an object, and set its dictionary to the reconstructed
            # dictionary we pulled from the JSON file.
            thawed_object = cls.__new__(cls)
            thawed_object.__dict__ = restored_dict

            # Finally, we would like this re-hydrated object to also be @serializable, so we re-wrap it
            # in the ThunderSerializeableObjectWrapper using the same trick with __new__().
            rewrapped_object = ThunderSerializeableObjectWrapper.__new__(ThunderSerializeableObjectWrapper)
            rewrapped_object.__dict__['wrapped'] = thawed_object

            # Return the re-constituted class
            return rewrapped_object

    # End of decorator. Return the wrapper class from inside this closure.
    return ThunderSerializeableObjectWrapper
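
The overall technique in this file (tag non-JSON types on the way out, undo the tags on the way back in, then rebuild the instance with `__new__()`) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's implementation: it targets Python 3, omits numpy, namedtuple and OrderedDict, stores datetimes with `isoformat()` rather than `str()`, and the names `to_jsonable` and `from_jsonable` are hypothetical.

```python
import datetime
import json


def to_jsonable(data):
    """Recursively replace non-JSON types with single-key tagged dicts."""
    if data is None or isinstance(data, (bool, int, float, str)):
        return data
    if isinstance(data, list):
        return [to_jsonable(v) for v in data]
    if isinstance(data, tuple):
        return {"py/tuple": [to_jsonable(v) for v in data]}
    if isinstance(data, set):
        return {"py/set": [to_jsonable(v) for v in data]}
    if isinstance(data, datetime.datetime):
        return {"py/datetime": data.isoformat()}
    if isinstance(data, dict):
        if all(isinstance(k, str) for k in data):
            return {k: to_jsonable(v) for k, v in data.items()}
        # Non-string keys are stored as a list of [key, value] pairs.
        return {"py/dict": [[to_jsonable(k), to_jsonable(v)] for k, v in data.items()]}
    raise TypeError("Type %s not data-serializable" % type(data))


def from_jsonable(data):
    """Inverse of to_jsonable(): descend into containers and undo each tag."""
    if isinstance(data, list):
        return [from_jsonable(v) for v in data]
    if not isinstance(data, dict):
        return data
    if "py/tuple" in data:
        return tuple(from_jsonable(data["py/tuple"]))
    if "py/set" in data:
        return set(from_jsonable(data["py/set"]))
    if "py/datetime" in data:
        return datetime.datetime.fromisoformat(data["py/datetime"])
    if "py/dict" in data:
        return {from_jsonable(k): from_jsonable(v) for k, v in data["py/dict"]}
    return {k: from_jsonable(v) for k, v in data.items()}


class Visitor(object):
    def __init__(self, ip_addr=None):
        self.ip = ip_addr
        self.seen = {("a", 1): datetime.datetime(2015, 2, 6, 12, 0)}
        self.tags = {"x", "y"}


original = Visitor("192.168")
text = json.dumps(to_jsonable(original.__dict__))

# Rebuild the instance without running __init__(), then attach the decoded state.
restored = Visitor.__new__(Visitor)
restored.__dict__ = from_jsonable(json.loads(text))
```

One limitation of any single-key tagging scheme like this: a user dictionary that happens to contain a key such as `"py/tuple"` would be decoded as if it were a tag.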