H4EP002: Memoisation scheme #139

Closed
hernot opened this issue Jul 27, 2020 · 1 comment

hernot commented Jul 27, 2020

Abstract:

With the proposed extension implemented it will be assured that objects are dumped only once. If an object is referred to multiple times within a complex object structure, any further reference is converted into a dataset holding an hdf5 reference to the corresponding hdf5 group or dataset.
This avoids dumping the very same object several hundred to thousand times and unnecessarily increasing the file size. On load the reference is again replaced by a reference to the Python object represented by the hdf5 group or dataset. This ensures that the loaded object structure resembles a one-to-one copy of the dumped one.

Motivation

In H4EP001 (#135) an extension of the loader based design introduced by hickle 4 was proposed: container based loaders for h5py.Group objects and mixed loaders which in addition handle h5py.Dataset objects. A possible implementation is suggested in pull-request #138. With this in place a large variety of Python objects can now be dumped and loaded without the need to convert them to pickle strings.

Objects which are referred to multiple times within the dumped object structure are written multiple times to the resulting hdf5 file, which causes the following effects:

  1. File size increases quickly, especially if the dumped objects are complex or their representation includes long strings, byte strings, pickle strings created for their 'type' attribute, or class and function objects attached to them.
  2. On load a distinct copy of the object is created for each occurrence. As a consequence the restored object structure is no longer a one-to-one representation of the dumped one.

Memoisation is a way to keep track of objects which are referenced at least once within an object structure and to link their references to the corresponding hdf5 file objects. In case a specific object occurs multiple times within the object structure, a dataset with the special dtype=ref_dtype referring to the very same hdf5 group or dataset is created instead of dumping the referred object again. On load, when a dataset with dtype=ref_dtype is encountered, its content is replaced by an appropriate reference to the actual object referred to.

Specification

Basic Memoisation

Basic memoisation is implemented using mechanisms and structures comparable to those used by the Python copy protocol. At its core there are two dictionaries, one for dumping and one for loading.

py_obj_id = id(py_obj)
dump_memo[py_obj_id] = h_subnode

The dict utilized by the _dump function uses the ids of the already dumped objects as its keys and the references to the created hdf5 groups or datasets as the corresponding values. The id is the value returned by the id() function provided by the Python builtins module.

py_obj = load_fn(...) or py_subcontainer.convert()
load_memo[h_node.id] = py_obj

The dict utilized by the _load function uses the ids of the loaded hdf5 file nodes as its keys and the objects restored from these nodes as its values.

For every py_obj to be dumped, _dump first checks whether an entry for id(py_obj) exists in the dump_memo dictionary. If the entry exists, _dump creates a new dataset in the current h_group and assigns to it the reference of the previously created h5py.Group or h5py.Dataset object instead of dumping the py_obj again. Using a dataset exposing the special dtype=ref_dtype has the advantage that this dataset can have attributes which differ from the ones stored along with the referred to dataset or group, whereas with hard links the attributes would be shared amongst all occurrences. Whenever a py_obj is stored to the file, a new entry in the dump_memo dictionary is created for later use.
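
A minimal sketch of this dump-side check is shown below. It assumes Python with h5py; the helper create_node_for and the exact _dump signature are illustrative placeholders for this proposal, not the final hickle API.

import h5py

ref_dtype = h5py.special_dtype(ref=h5py.Reference)  # dtype for hdf5 object references

def _dump(py_obj, h_group, name, dump_memo, **kwargs):
    node = dump_memo.get(id(py_obj), None)
    if node is not None:
        # py_obj was already dumped: store a reference dataset instead of a second copy
        ref_dataset = h_group.create_dataset(name, shape=(), dtype=ref_dtype)
        ref_dataset[()] = node.ref
        return ref_dataset
    node = create_node_for(py_obj, h_group, name, **kwargs)  # hypothetical loader dispatch
    dump_memo[id(py_obj)] = node
    return node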

Whenever a py_obj is restored from its corresponding hdf5 file object, the _load function creates an entry in the load_memo dictionary. When _load encounters a h5py.Dataset with dtype=ref_dtype, it checks whether the referred to node is already recorded in the load_memo dictionary. If so, it loads the corresponding py_obj from load_memo. If the referred to h5py.Dataset or h5py.Group has not yet been loaded, the _load function loads it and stores the reference to the resulting object in the load_memo dictionary before resolving the reference. When the corresponding h5py.Dataset or h5py.Group is encountered later on, the corresponding py_obj can be loaded directly from the load_memo dictionary.
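
The load-side counterpart could look roughly as follows; restore_from_node stands in for the actual loader dispatch and is, like the _load signature itself, an assumption made for illustration only.

def _load(h_node, h5f, load_memo):
    # datasets exposing ref_dtype only point at an already dumped node
    if isinstance(h_node, h5py.Dataset) and h5py.check_dtype(ref=h_node.dtype) is h5py.Reference:
        referred_node = h5f[h_node[()]]          # dereference the stored h5py.Reference
        py_obj = load_memo.get(referred_node.id, None)
        if py_obj is None:
            py_obj = _load(referred_node, h5f, load_memo)
        return py_obj
    py_obj = restore_from_node(h_node)           # hypothetical loader dispatch
    load_memo[h_node.id] = py_obj
    return py_obj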

Extended Memoisation

Beyond the basic memoisation specified above, the following extended memoisation related to the handling and storage of the 'type' attributes of dumped hdf5 datasets and groups is proposed. In the current hickle file format, besides the 'base_type' attribute storing a short string identifying the loader to be used for properly restoring the object represented by the data, the 'type' attribute contains a pickle string describing the actual Python class or type object of the restored object. This can be a subclass of the class object the _dump method used to select the loader providing the appropriate dump method.

The h5py documentation says about attributes:
"[...] Attributes have the following properties:

They may be created from any scalar or NumPy array
Each attribute should be small (generally < 64k)
There is no partial I/O (i.e. slicing); the entire attribute must be read.

[...]"

This means that especially pickle strings, which can become quite long, may cause problems or have to be handled specially by the hdf5 file. Even worse, each pickle string can occur multiple times, as various distinct py_obj objects within the object structure to be dumped may be of the same class or type.

Therefore it is proposed to introduce additional dictionary-like classes dump_type_memo and load_type_memo which manage the py_obj_types_table group within the hickle file, hosting the datasets representing the pickle strings of all relevant types.

py_obj_type_id = id(py_obj_type)
next_py_obj_type = dump_type_memo.get(py_obj_type_id,None)
if next_py_obj_type is None:
    pickle_string = bytearray(pickle.dumps(py_obj_type))
    next_py_obj_type = py_obj_types_table.create_dataset(
         str(len(py_obj_types_table)), data=pickle_string, **kwargs
    )
    dump_type_memo[py_obj_type_id] = next_py_obj_type
h_subnode.attrs['type'] = next_py_obj_type.ref

The dump_type_memo dictionary uses the id of the py_obj_type as the key to access the dataset representing the pickle string corresponding to the py_obj_type to be dumped. A h5py.Reference to the corresponding next_py_obj_type dataset is assigned to the 'type' attribute of the h_subnode representing the py_obj instead of storing the pickle string of its py_obj_type directly.

load_type_memo = _load(..., h5f["py_obj_types_table"], ...)
py_obj_type = load_type_memo[h_node.attrs['type']]

On load the load_type_memo would be linked to the py_obj_types_table group before any object is restored from the file.
Instead of unpickling the py_obj_type from the 'type' attribute of the hdf5 file node, the attribute would just hold a h5py.Reference referring to the table entry representing the appropriate py_obj_type. This h5py.Reference would be used by the _load function to load the actual py_obj_type from the load_type_memo table, unpickling it first from the corresponding dataset if not yet loaded, instead of unpickling it from a pickle string stored directly in the 'type' attribute.
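
A rough sketch of that resolution step, keying the memo by the dereferenced table entry analogous to load_memo above (the function name and memo layout are assumptions of this sketch):

import pickle

def resolve_py_obj_type(h_node, h5f, load_type_memo):
    type_dataset = h5f[h_node.attrs['type']]     # dereference into py_obj_types_table
    py_obj_type = load_type_memo.get(type_dataset.id, None)
    if py_obj_type is None:
        # unpickle the type only on first encounter and memoise it
        py_obj_type = pickle.loads(bytes(type_dataset[()]))
        load_type_memo[type_dataset.id] = py_obj_type
    return py_obj_type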

A special case in this context are class and function objects which are dumped as part of the Python copy protocol representation of objects. The corresponding datasets store the pickle string representing the class object to be restored, or the function to be called for properly restoring the basic structure of the py_obj. If extended memoisation is activated, these strings would also be moved to the dump_type_memo structure and the corresponding datasets with dtype=ref_dtype would refer to the appropriate py_obj_types_table entry instead of being handled by basic memoisation. This would be taken care of by the create_pickled_dataset and load_pickled_dataset methods provided through the special b'pickle' base_type.

Rationale

The proposed extension allows dumping complex object structures which contain objects referred to multiple times without duplicating their representation within the resulting hdf5 file, and it reduces the amount of space used by repeatedly storing the pickle string of the py_obj_type of Python objects of the same type. At the same time, in contrast to using hard links, each occurrence can have its own set of special attributes local to the position of the referred to object within the object structure to be dumped.

Open Issues

  • Shall extended memoisation always be activated? Shall it be activated when any of the
    compression filters is activated? Or shall it be activated by a dedicated extendedmemo flag?

    -> compression should not be a problem

    -> deactivation at loader level by an additional memoise flag for each class_register table entry, which can either be True or False, seems to be the more sensible approach than global activation and deactivation.

References

[1] H4EP001 #135
[2] Python copy https://docs.python.org/3.6/library/copy.html
[3] Python pickle https://docs.python.org/3.6/library/pickle.html#pickling-class-instances
[4] h5py Object References https://docs.h5py.org/en/stable/refs.html

Precondition

H4EP001 #138 implemented and merged into dev

telegraphic (Owner) commented:

Short comment to a long proposal: I agree this is well motivated and am supportive of its implementation!

hernot added a commit to hernot/hickle that referenced this issue Jan 18, 2021
…and type)

Basic Memoisation:
==================
Both types of memoisation are handled by the ReferenceManager dictionary
type object. For storing object instance references it is used as a
python dict object which stores the references to the py_obj and related
node using id(py_obj) as key when dumping. On load the id of the h_node
is used as key for storing the shared reference of the restored
object.

Additional references to the same object are represented by
h5py.Datasets with their dtype set to ref_dtype. They are created
by assigning a h5py.Reference object as returned by the h5py.Dataset.ref
or h5py.Group.ref attribute. These datasets are resolved by the filter
iterator method of the ExpandReferenceContainer class and returned as
sub_item of the reference dataset.

Type Memoisation
================
The 'type' attribute of all nodes, except datasets which contain pickle
strings or expose a ref_dtype as their dtype, now contains a reference
to the appropriate py_obj_type entry in the global 'hickle_types_table'.
This table hosts the datasets representing all py_obj_types and
base_types encountered by hickle.dump, each stored once.

Each py_obj_type is represented by a numbered dataset containing the
corresponding pickle string. The base_types are represented by empty
datasets the name of which is the name of the base_type as defined
by the class_register table of the loaders. No entry is stored for
object, b'pickle' as well as hickle.ReferenceType, b'!node-reference'
as these can be resolved implicitly on load.
The 'base_type' attribute of a py_obj_type entry refers to the
base_type used to encode it and required to properly restore it again
from the hickle file.

The entries in the 'hickle_types_table' are managed by the
ReferenceManager.store_type and ReferenceManager.resolve_type methods.
The latter also takes care of properly distinguishing pickle
datasets from reference datasets and of resolving hickle 4.0.X dict_item
groups.

The ReferenceManager is implemented as a context manager and thus can and
shall be used within a with statement to ensure proper cleanup. Each
file has its own ReferenceManager instance, therefore different data can
be dumped to distinct files which are open in parallel. The basic
management of managers is provided by the BaseManager base class which can
be used to build further managers, for example to allow loaders to be
activated only when specific feature flags are passed to the hickle.dump
method or encountered by hickle.load in the file attributes. The
BaseManager class has to be subclassed.

Other changes:
==============
 - lookup.register_class and class_register tables have an additional
   memoise flag indicating whether py_obj shall be remembered for
   representing and resolving multiple references to it or if it shall
   be dumped and restored every time it is encountered

 - lookup.hickle_types table entries include the memoise flag as their third entry

 - lookup.load_loader: the tuple returned, in addition to py_obj_type,
   includes the memoise flag

 - hickle.load: whether to use the load_fn stored in
   lookup.hkl_types_table or a PyContainer object stored in
   hkl_container_dict is decided based on the is_container flag returned
   by ReferenceManager.resolve_type instead of checking whether the
   processed node is of type h5py.Group

 - dtype of string and bytes datasets is now set to 'S1' instead of 'u8'
   and shape is set to (1,len)
hernot added a commit to hernot/hickle that referenced this issue Feb 17, 2021
hernot added a commit to hernot/hickle that referenced this issue Feb 19, 2021