H4EP002: Memoisation scheme #139

Closed
hernot opened this issue Jul 27, 2020 · 1 comment

hernot commented Jul 27, 2020

Abstract:

With the proposed extension implemented it will be assured that objects are dumped only once. If an object is referred to multiple times within a complex object structure, any further reference is converted into a dataset holding an hdf5 reference to the corresponding hdf5 group or dataset.
This avoids dumping the very same object several hundred to thousand times and unnecessarily increasing the file size. On load the reference is again replaced by a reference to the Python object represented by the hdf5 group or dataset. This ensures that the loaded object structure resembles a one-to-one copy of the dumped one.

Motivation

In H4EP001 (#135) an extension of the loader based design introduced by hickle 4 was proposed: container based loaders for h5py.Group objects and mixed loaders which in addition handle h5py.Dataset objects. A possible implementation is suggested in pull-request #138. With this in place a large variety of Python objects can now be dumped and loaded without the need to convert them to pickle strings.

Objects which are referred to multiple times within the dumped object structure are written multiple times to the resulting hdf5 file, which causes the following effects:

  1. File size increases quickly, especially if the dumped objects are complex or their representation includes long strings, byte strings, pickle strings created for their 'type' attribute, or class and function objects attached to them.
  2. On load a distinct copy of the object is created for each occurrence. As a consequence the restored object structure is no longer a one-to-one representation of the dumped one.

Memoisation is a way to keep track of objects which are referenced at least once within an object structure and to link their references to the corresponding hdf5 file objects. In case a specific object occurs multiple times within the object structure, a dataset with the special dtype=ref_dtype referring to the very same hdf5 group or dataset is created instead of dumping the referred object again. On load, when a dataset with dtype=ref_dtype is encountered, its content is replaced by an appropriate reference to the actual object referred to.

Specification

Basic Memoisation

Basic memoisation is implemented using mechanisms and structures comparable to those used by the Python copy protocol. At its core there are two dictionaries, one for dumping and one for loading.

py_obj_id = id(py_obj)
dump_memo[py_obj_id] = h_subnode

The dict utilized by the _dump function uses the ids of the already dumped objects as its keys and the references to the created hdf5 groups or datasets as the corresponding values. The id is the value returned by the id() function provided by the Python builtins module.

py_obj = load_fn(...) or py_subcontainer.convert()
load_memo[h_node.id] = py_obj

The dict utilized by the _load function uses the ids of the loaded hdf5 file nodes as its keys and the objects restored from these nodes as its values.

For every py_obj to be dumped, _dump first checks whether an entry for id(py_obj) exists in the dump_memo dictionary. If the entry exists, _dump creates a new dataset in the current h_group and assigns to it the reference of the previously created h5py.Group or h5py.Dataset object instead of dumping the py_obj again. Using a dataset exposing the special dtype=ref_dtype has the advantage that this dataset can have attributes which differ from the ones stored along with the referred to dataset or group, whereas with hard links the attributes would be shared amongst all occurrences. Whenever a py_obj is stored to the file, a new entry in the dump_memo dictionary is created for later use.
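
A minimal sketch of this dump-side check is shown below. It assumes Python with h5py; the helper create_node_for and the exact _dump signature are illustrative placeholders for this proposal, not the final hickle API.

import h5py

ref_dtype = h5py.special_dtype(ref=h5py.Reference)  # dtype for hdf5 object references

def _dump(py_obj, h_group, name, dump_memo, **kwargs):
    node = dump_memo.get(id(py_obj), None)
    if node is not None:
        # py_obj was already dumped: store a reference dataset instead of a second copy
        ref_dataset = h_group.create_dataset(name, shape=(), dtype=ref_dtype)
        ref_dataset[()] = node.ref
        return ref_dataset
    node = create_node_for(py_obj, h_group, name, **kwargs)  # hypothetical loader dispatch
    dump_memo[id(py_obj)] = node
    return node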

Whenever a py_obj is restored from its corresponding hdf5 file object, the _load function creates an entry in the load_memo dictionary. When _load encounters a h5py.Dataset with dtype=ref_dtype, it checks whether the referred to node is already recorded in the load_memo dictionary. If so, it loads the corresponding py_obj from load_memo. If the referred to h5py.Dataset or h5py.Group has not yet been loaded, the _load function loads it and stores the reference to the resulting object in the load_memo dictionary before resolving the reference. When the corresponding h5py.Dataset or h5py.Group is encountered later on, the corresponding py_obj can be loaded directly from the load_memo dictionary.
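
The load-side counterpart could look roughly as follows; restore_from_node stands in for the actual loader dispatch and is, like the _load signature itself, an assumption made for illustration only.

def _load(h_node, h5f, load_memo):
    # datasets exposing ref_dtype only point at an already dumped node
    if isinstance(h_node, h5py.Dataset) and h5py.check_dtype(ref=h_node.dtype) is h5py.Reference:
        referred_node = h5f[h_node[()]]          # dereference the stored h5py.Reference
        py_obj = load_memo.get(referred_node.id, None)
        if py_obj is None:
            py_obj = _load(referred_node, h5f, load_memo)
        return py_obj
    py_obj = restore_from_node(h_node)           # hypothetical loader dispatch
    load_memo[h_node.id] = py_obj
    return py_obj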

Extended Memoisation

Beyond the basic memoisation specified above, the following extended memoisation related to the handling and storage of the 'type' attributes of dumped hdf5 datasets and groups is proposed. In the current hickle file format, besides the 'base_type' attribute storing a short string identifying the loader to be used for properly restoring the object represented by the data, the 'type' attribute contains a pickle string describing the actual Python class or type object of the restored object. This can be a subclass of the class object the _dump method used to select the loader providing the appropriate dump method.

The h5py documentation says about attributes:
"[...] Attributes have the following properties:

They may be created from any scalar or NumPy array
Each attribute should be small (generally < 64k)
There is no partial I/O (i.e. slicing); the entire attribute must be read.

[...]"

This means that especially pickle strings, which can become quite long, may cause problems or have to be handled specially by the hdf5 file. Even worse, each pickle string can occur multiple times, as various distinct py_obj objects within the object structure to be dumped may be of the same class or type.

Therefore it is proposed to introduce additional dictionary-like classes dump_type_memo and load_type_memo which manage the py_obj_types_table group within the hickle file, hosting the datasets representing the pickle strings of all relevant types.

py_obj_type_id = id(py_obj_type)
next_py_obj_type = dump_type_memo.get(py_obj_type_id,None)
if next_py_obj_type is None:
    pickle_string = bytearray(pickle.dumps(py_obj_type))
    next_py_obj_type = py_obj_types_table.create_dataset(
         str(len(py_obj_types_table)), data=pickle_string, **kwargs
    )
    dump_type_memo[py_obj_type_id] = next_py_obj_type
h_subnode.attrs['type'] = next_py_obj_type.ref

The dump_type_memo dictionary uses the id of the py_obj_type as the key to access the dataset representing the pickle string corresponding to the py_obj_type to be dumped. A h5py.Reference to the corresponding next_py_obj_type dataset is assigned to the 'type' attribute of the h_subnode representing the py_obj instead of storing the pickle string of its py_obj_type directly.

load_type_memo = _load(..., h5f["py_obj_types_table"], ...)
py_obj_type = load_type_memo[h_node.attrs['type']]

On load the load_type_memo would be linked to the py_obj_types_table group before any object is restored from the file.
Instead of unpickling the py_obj_type from the 'type' attribute of the hdf5 file node, the attribute would just hold a h5py.Reference referring to the table entry representing the appropriate py_obj_type. This h5py.Reference would be used by the _load function to load the actual py_obj_type from the load_type_memo table, unpickling it first from the corresponding dataset if not yet loaded, instead of unpickling it from a pickle string stored directly in the 'type' attribute.
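
A rough sketch of that resolution step, keying the memo by the dereferenced table entry analogous to load_memo above (the function name and memo layout are assumptions of this sketch):

import pickle

def resolve_py_obj_type(h_node, h5f, load_type_memo):
    type_dataset = h5f[h_node.attrs['type']]     # dereference into py_obj_types_table
    py_obj_type = load_type_memo.get(type_dataset.id, None)
    if py_obj_type is None:
        # unpickle the type only on first encounter and memoise it
        py_obj_type = pickle.loads(bytes(type_dataset[()]))
        load_type_memo[type_dataset.id] = py_obj_type
    return py_obj_type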

A special case in this context are class and function objects which are dumped as part of the Python copy protocol representation of objects. The corresponding datasets store the pickle string representing the class object to be restored, or the function to be called for properly restoring the basic structure of the py_obj. If extended memoisation is activated, these strings would also be moved to the dump_type_memo structure and the corresponding datasets with dtype=ref_dtype would refer to the appropriate py_obj_types_table entry instead of being handled by basic memoisation. This would be taken care of by the create_pickled_dataset and load_pickled_dataset methods provided through the special b'pickle' base_type.

Rationale

The proposed extension allows dumping complex object structures which contain objects referred to multiple times without duplicating their representation within the resulting hdf5 file, and it reduces the amount of space used by repeatedly storing the pickle string of the py_obj_type of Python objects of the same type. At the same time, in contrast to using hard links, each occurrence can have its own set of special attributes local to the position of the referred to object within the object structure to be dumped.

Open Issues

  • Shall extended memoisation always be activated? Shall it be activated when any of the
    compression filters is activated? Or shall it be activated by a dedicated extendedmemo flag?

    -> compression should not be a problem

    -> deactivation at loader level by an additional memoise flag for each class_register table entry, which can either be True or False, seems to be the more sensible approach than global activation and deactivation.

References

[1] H4EP001 #135
[2] Python copy https://docs.python.org/3.6/library/copy.html
[3] Python pickle https://docs.python.org/3.6/library/pickle.html#pickling-class-instances
[4] h5py Object References https://docs.h5py.org/en/stable/refs.html

Precondition

H4EP001 #138 implemented and merged into dev

telegraphic (Owner) commented:

Short comment to a long proposal: I agree this is well motivated and am supportive of its implementation!

hernot added a commit to hernot/hickle that referenced this issue Jan 18, 2021
…and type)

Basic Memoisation:
==================
Both types of memoisation are handled by the ReferenceManager dictionary
type object. For storing object instance references it is used as a
python dict object which stores the references to the py_obj and related
node using id(py_obj) as key when dumping. On load the id of the h_node
is used as key for storing the shared reference of the restored
object.

Additional references to the same object are represented by
h5py.Datasets with their dtype set to ref_dtype. They are created
by assigning a h5py.Reference object as returned by the h5py.Dataset.ref
or h5py.Group.ref attribute. These datasets are resolved by the filter
iterator method of the ExpandReferenceContainer class and returned as
sub_item of the reference dataset.

Type Memoisation
================
The 'type' attribute of all nodes, except datasets which contain pickle
strings or expose a ref_dtype as their dtype, now contains a reference
to the appropriate py_obj_type entry in the global 'hickle_types_table'.
This table hosts the datasets representing all py_obj_types and
base_types encountered by hickle.dump, each stored once.

Each py_obj_type is represented by a numbered dataset containing the
corresponding pickle string. The base_types are represented by empty
datasets the name of which is the name of the base_type as defined
by the class_register table of the loaders. No entry is stored for
object, b'pickle' as well as hickle.ReferenceType, b'!node-reference'
as these can be resolved implicitly on load.
The 'base_type' attribute of a py_obj_type entry refers to the
base_type used to encode it and required to properly restore it again
from the hickle file.

The entries in the 'hickle_types_table' are managed by the
ReferenceManager.store_type and ReferenceManager.resolve_type methods.
The latter also takes care of properly distinguishing pickle
datasets from reference datasets and of resolving hickle 4.0.X dict_item
groups.

The ReferenceManager is implemented as a context manager and thus can and
shall be used within a with statement to ensure proper cleanup. Each
file has its own ReferenceManager instance, therefore different data can
be dumped to distinct files which are open in parallel. The basic
management of managers is provided by the BaseManager base class which can
be used to build further managers, for example to allow loaders to be
activated only when specific feature flags are passed to the hickle.dump
method or encountered by hickle.load in the file attributes. The
BaseManager class has to be subclassed.

Other changes:
==============
 - lookup.register_class and class_register tables have an additional
   memoise flag indicating whether py_obj shall be remembered for
   representing and resolving multiple references to it or if it shall
   be dumped and restored every time it is encountered

 - lookup.hickle_types table entries include the memoise flag as their third entry

 - lookup.load_loader: the tuple returned, in addition to py_obj_type,
   includes the memoise flag

 - hickle.load: whether to use the load_fn stored in
   lookup.hkl_types_table or a PyContainer object stored in
   hkl_container_dict is decided based on the is_container flag returned
   by ReferenceManager.resolve_type instead of checking whether the
   processed node is of type h5py.Group

 - dtype of string and bytes datasets is now set to 'S1' instead of 'u8'
   and shape is set to (1,len)
hernot added a commit to hernot/hickle that referenced this issue Feb 17, 2021
hernot added a commit to hernot/hickle that referenced this issue Feb 19, 2021