H4EP002: Memoisation scheme #139
Short comment to a long proposal: I agree this is well motivated and am supportive of its implementation!
hernot added a commit to hernot/hickle that referenced this issue on Jan 18, 2021:
…and type)

Basic Memoisation:
==================
Both types of memoisation are handled by the ReferenceManager dictionary-type object. For storing object instance references it is used as a Python dict which stores the references to the py_obj and the related node, using id(py_obj) as key when dumping. On load the id of the h_node is used as key for storing the to-be-shared reference of the restored object. Additional references to the same object are represented by h5py.Datasets with their dtype set to ref_dtype. They are created by assigning a h5py.Reference object, as returned by the h5py.Dataset.ref or h5py.Group.ref attribute. These datasets are resolved by the filter iterator method of the ExpandReferenceContainer class and returned as sub_item of the reference dataset.

Type Memoisation
================
The 'type' attribute of all nodes, except datasets which contain pickle strings or expose a ref_dtype as their dtype, now contains a reference to the appropriate py_obj_type entry in the global 'hickle_types_table'. This table hosts, once each, the datasets representing all py_obj_types and base_types encountered by hickle.dump. Each py_obj_type is represented by a numbered dataset containing the corresponding pickle string. The base_types are represented by empty datasets whose name is the name of the base_type as defined by the class_register table of the loaders. No entry is stored for (object, b'pickle') or (hickle.ReferenceType, b'!node-reference'), as these can be resolved implicitly on load. The 'base_type' attribute of a py_obj_type entry refers to the base_type used to encode it and required to properly restore it again from the hickle file. The entries in the 'hickle_types_table' are managed by the ReferenceManager.store_type and ReferenceManager.resolve_type methods. The latter also takes care of properly distinguishing pickle datasets from reference datasets and of resolving hickle 4.0.X dict_item groups.

The ReferenceManager is implemented as a context manager and thus can and shall be used within a with statement to ensure proper cleanup. Each file has its own ReferenceManager instance, therefore different data can be dumped to distinct files which are open in parallel. The basic management of managers is provided by the BaseManager base class, which can be used to build further managers, for example to allow loaders to be activated only when specific feature flags are passed to the hickle.dump method or encountered by hickle.load in the file attributes. The BaseManager class has to be subclassed.

Other changes:
==============
- lookup.register_class and the class_register tables have an additional memoise flag indicating whether a py_obj shall be remembered for representing and resolving multiple references to it, or whether it shall be dumped and restored every time it is encountered
- lookup.hickle_types table entries include the memoise flag as third entry
- lookup.load_loader: the tuple returned in addition to py_obj_type includes the memoise flag
- hickle.load: whether to use the load_fn stored in lookup.hkl_types_table or a PyContainer object stored in hkl_container_dict is decided upon the is_container flag returned by ReferenceManager.resolve_type, instead of checking whether the processed node is of type h5py.Group
- dtype of string and bytes datasets is now set to 'S1' instead of 'u8' and their shape is set to (1, len)
hernot added a commit to hernot/hickle that referenced this issue on Feb 17, 2021, with the same commit message.
hernot added a commit to hernot/hickle that referenced this issue on Feb 19, 2021, with the same commit message.
Abstract:
With the proposed extension implemented it is assured that objects are dumped only once. If an object is referred to multiple times within a complex object structure, then any further reference is converted into an appropriate hdf5 hard link referring to the corresponding hdf5 group or dataset.
This avoids the very same object being dumped several hundred to thousand times, unnecessarily increasing the file size. On load the hard link will again be replaced by a reference to the Python object represented by the hdf5 group or dataset. This ensures that the loaded object structure is a one-to-one copy of the dumped one.
Motivation
In H4EP001 (#135) the extension of the loader-based design introduced by hickle 4 by container-based loaders for `h5py.Group` objects, and mixed loaders additionally handling `h5py.Dataset` objects, was proposed. A possible implementation is suggested in pull-request #138. With this in place a large variety of Python objects can now be dumped and loaded without the need to convert them to a pickle string. Objects which are referred to multiple times within the dumped object structure, however, are written multiple times to the resulting hdf5 file. This unnecessarily inflates the file and, on load, breaks the identity between the restored copies of what used to be a single shared object.
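The effect can be sketched with a toy example (illustrative code only, not hickle's actual implementation): a recursive dump without any memo table serialises a shared object once per reference.

```python
# Illustrative sketch of the duplication problem: a memo-less recursive
# dump has no memory of what it already wrote, so a shared object is
# serialised once per reference instead of once per object.

shared = list(range(3))
data = {"first": shared, "second": shared, "third": shared}

def naive_dump(obj, written):
    """Record which objects a memo-less recursive dump would write."""
    if isinstance(obj, dict):
        for value in obj.values():
            naive_dump(value, written)
    else:
        written.append(id(obj))  # the same id shows up repeatedly

written = []
naive_dump(data, written)
assert written.count(id(shared)) == 3  # written three times instead of once
```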
Memoisation is a way to keep track of objects which are referenced at least once within an object structure and to link their references to the corresponding hdf5 file objects. In case a specific object occurs multiple times within the object structure, a dataset with the special `dtype=ref_dtype` referring to the very same hdf5 group or dataset will be created instead of dumping the referred object once again. On load, when a dataset with `dtype=ref_dtype` is encountered, its content will be replaced by an appropriate reference to the actual object referred to.

Specification
Basic Memoisation
Basic memoisation is implemented using mechanisms and structures comparable to those used by the Python copy protocol. At its core are two dictionaries, one for dumping and one for loading.
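This two-dictionary scheme can be sketched in simplified form (all names here are illustrative, not hickle's actual internals; nodes are modelled as a plain list instead of an hdf5 file):

```python
# Simplified sketch of the two memo dictionaries; names are illustrative
# and the "file" is a plain list of nodes rather than an hdf5 file.

dump_memo = {}  # id(py_obj) -> node index already created for that object

def dump(py_obj, nodes):
    """Dump py_obj, emitting a reference node for repeated objects."""
    if id(py_obj) in dump_memo:
        # second and further occurrences become a reference node
        nodes.append(("ref", dump_memo[id(py_obj)]))
        return
    node_id = len(nodes)
    nodes.append(("data", py_obj))
    dump_memo[id(py_obj)] = node_id

load_memo = {}  # node index -> restored Python object

def load(node_id, nodes):
    """Restore a node, sharing the same object for reference nodes."""
    if node_id in load_memo:
        return load_memo[node_id]
    kind, payload = nodes[node_id]
    obj = load(payload, nodes) if kind == "ref" else list(payload)
    load_memo[node_id] = obj
    return obj

shared = [1, 2, 3]
nodes = []
for item in (shared, shared):
    dump(item, nodes)
restored = [load(i, nodes) for i in range(len(nodes))]
assert restored[0] is restored[1]  # object identity preserved on load
```

The key point is that the dump side is keyed by `id(py_obj)` while the load side is keyed by the node, which is exactly the asymmetry described below.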
The dict utilized by the `_dump` function uses the ids of the already dumped objects as its keys and the references to the created hdf5 groups or datasets as the corresponding values. The id is the value returned by the `id()` function provided by Python's `builtins` module.

The dict utilized by the `_load` function uses the ids of the loaded hdf5 file nodes as its keys and the objects restored from these nodes as its values.

For every `py_obj` to be dumped, `_dump` will first check whether an entry for `id(py_obj)` exists within the `dump_memo` dictionary. In case the entry exists, `_dump` will create a new dataset in the current `h_group` and assign the reference of the previously created `h5py.Group` or `h5py.Dataset` object to it instead of dumping the `py_obj` again. Using a dataset exposing the special `dtype=ref_dtype` has the advantage that this dataset can have attributes which differ from the ones stored along with the referred-to dataset or group, whereas with hard links the attributes would be shared amongst all of its occurrences. When a `py_obj` is stored to the file, a new entry in the `dump_memo` dictionary is created for later use.

Whenever a `py_obj` is restored from its corresponding hdf5 file object, the `_load` function creates an entry in the `load_memo` dictionary. In case the `_load` function encounters a `h5py.Dataset` with `dtype=ref_dtype` to be loaded, it checks whether the referred-to node is already recorded in the `load_memo` dictionary. In that case it will load the corresponding `py_obj` from the `load_memo`. If the referred-to `h5py.Dataset` or `h5py.Group` has not yet been loaded, the `_load` function will load it and store the reference to the resulting object in the `load_memo` dictionary before resolving the reference. If the corresponding `h5py.Dataset` or `h5py.Group` is encountered again later on, the corresponding `py_obj` can be loaded directly from the `load_memo` dictionary.

Extended Memoisation
Beyond the basic memoisation specified above, the following extended memoisation, related to the handling and storage of the `'type'` attributes of dumped hdf5 datasets and groups, is proposed. In the current hickle file format, besides the `'base_type'` attribute storing a short string identifying the loader to be used for properly restoring the object represented by the data, the `'type'` attribute contains a pickle string describing the actual Python class or type object of the restored object. This can be a subclass of the class object the `_dump` method used to select the loader providing the appropriate dump method.

The h5py documentation says about attributes:
"[...] Attributes have the following properties:
[...]"
This means that especially pickle strings, which can become quite long, may cause problems or have to be handled specially by the hdf5 file. Even worse, each pickle string can occur multiple times, as various distinct `py_obj` objects within the object structure to be dumped may be of the same class or type.

Therefore it is proposed to introduce additional dictionary-like classes `dump_type_memo` and `load_type_memo` which manage the `py_obj_types_table` group within the hickle file, hosting the datasets representing the pickle strings of all relevant types.

The `dump_type_memo` dictionary uses the id of the `py_obj_type` as the key to access the dataset representing the pickle string corresponding to the `py_obj_type` to be dumped. A `h5py.Reference` to the corresponding `next_py_obj_type` entry will be assigned to the `'type'` attribute of the `h_subnode` representing the `py_obj`, instead of storing the pickle string of its `py_obj_type` directly.

On load the
`load_type_memo` list would be linked to the `py_obj_types_table` group before any object is restored from the file.

Instead of unpacking the `py_obj_type` from the `'type'` attribute of the hdf5 file node, the attribute would just denote a `h5py.Reference` referring to the table entry representing the appropriate `py_obj_type`. This `h5py.Reference` would be used by the `_load` function to load the actual `py_obj_type` from the `load_type_memo` table, unpacking it first from the corresponding dataset if not yet loaded, instead of unpacking it from a pickle string stored in the `'type'` attribute directly.

A special case in this context are class and function objects which are dumped within a Python copy protocol representation of objects. The corresponding datasets store the pickle string representing the class object to be restored, or the function to be called for properly restoring the basic structure of the `py_obj`. If extended memoisation is activated, these strings would also be moved to the `dump_type_memo` structure, and the corresponding datasets with `dtype=ref_dtype` would refer to the appropriate `py_obj_types_table` entry instead of being handled by basic memoisation. This would be taken care of by the `create_pickled_dataset` and `load_pickled_dataset` methods provided through the special `b'pickle'` base_type.
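A minimal sketch of such a type table (illustrative names only, not the proposed API itself): the pickle string of each type is stored once, nodes carry only a reference to the table entry, and on load each entry is unpickled at most once.

```python
# Illustrative sketch of type memoisation: each py_obj_type's pickle
# string lives once in a table; nodes carry only a reference (here an
# index) to the table entry. Names do not match hickle's internals.
import pickle

type_table = []      # pickled type strings, stored once each
dump_type_memo = {}  # id(py_obj_type) -> index into type_table

def store_type(py_obj_type):
    """Return a reference (index) to the table entry for py_obj_type."""
    key = id(py_obj_type)
    if key not in dump_type_memo:
        dump_type_memo[key] = len(type_table)
        type_table.append(pickle.dumps(py_obj_type))
    return dump_type_memo[key]

load_type_memo = {}  # index -> unpickled type, filled lazily on load

def resolve_type(index):
    """Unpickle a table entry at most once, then reuse it."""
    if index not in load_type_memo:
        load_type_memo[index] = pickle.loads(type_table[index])
    return load_type_memo[index]

# dumping many objects of the same type adds only one table entry
refs = [store_type(type(obj)) for obj in (dict(), dict(), list(), dict())]
assert refs == [0, 0, 1, 0]
assert len(type_table) == 2
assert resolve_type(0) is dict and resolve_type(1) is list
```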
Rationale

The proposed extension will allow complex object structures which contain objects referred to multiple times to be dumped without duplicating their representation within the resulting hdf5 file, and will reduce the amount of space used by repeatedly storing the pickle string of the `py_obj_type` of Python objects of the same type. At the same time, in contrast to using hard links, each occurrence can have its own set of special attributes local to the position of the referred-to object within the dumped object structure.

Open Issues
Shall extended memoisation always be activated? Shall it be activated when any of the `compression` filters is activated? Or shall it be activated by a dedicated `extendedmemo` flag?

-> compression should not be the problem
-> deactivation on loader level, by an additional `memoise` flag for each class_register table entry which can either be `True` or `False`, seems to be the more sensible approach than global activation and deactivation.
References

[1] H4EP001 #135
[2] Python copy: https://docs.python.org/3.6/library/copy.html
[3] Python pickle: https://docs.python.org/3.6/library/pickle.html#pickling-class-instances
[4] h5py Object References: https://docs.h5py.org/en/stable/refs.html
Precondition
H4EP001 #138 implemented and merged into dev