Hickle v5.0.0 #153
Conversation
With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling the hickle core machinery from object-specific code covered all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementation of hickle extension proposal H4EP001 (#135), which specifies the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer-based and mixed loaders. In addition to the proposed extension, this implementation includes the following changes to hickle 4.0.0 and H4EP001.

H4EP001:
========
The PyContainer interface includes a filter method which allows loaders, when data is loaded, to adjust, suppress, or insert additional data subitems of h5py.Group objects. In order to accomplish the temporary modification of h5py.Group and h5py.Dataset objects when the file is opened in read-only mode, the H5NodeFilterProxy class is provided. This class stores all temporary modifications while the original h5py.Group and h5py.Dataset objects stay unchanged.

hickle 4.0.0 / 4.0.1:
=====================
Strings and arrays of bytes are stored as Python bytearrays and not as variable-sized strings and bytes. The benefit is that hdf5 filters and hdf5 compression filters can be applied to Python bytearrays. The downside is that the data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings.

numpy masked arrays are now stored as an h5py.Group containing a dedicated dataset each for data and mask.

scipy.sparse matrices are now stored as an h5py.Group containing the datasets data, indices, indptr and shape.

Dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NoneType keys are converted to name strings; for all other keys a key-value-pair group is created containing the key and value as its subitems. String and bytes keys which contain slashes are converted into key-value pairs instead of converting slashes to backslashes. The distinction from hickle 4.0.0 string and byte keys with converted slashes is made by enclosing the string value within double quotes instead of the single quotes produced by the Python repr function or the !r and %r string format specifiers. Consequently, on load, all string keys which are enclosed in single quotes are subjected to slash conversion, while all others are used as-is.

h5py.Group and h5py.Dataset objects whose 'base_type' refers to 'pickle' automatically get object assigned as their py_obj_type on load; the related 'type' attribute is ignored. h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to contain a pickle string and thus implicitly get assigned the 'pickle' base type. Consequently, on dump, the 'base_type' and 'type' attributes are omitted for all h5py.Dataset objects which contain pickle strings, as their values would be 'pickle' and object respectively.

Other stuff:
============
Full separation between hickle core and loaders.
Distinct unit tests for individual loaders and hickle core.
Cleanup of functions and classes which are no longer required.
Simplification of recursion on dump and load through a self-contained loader interface.
Capable of loading hickle 4.0.x files, which do not yet support the PyContainer concept beyond list, tuple, dict and set; includes extended tests for loading hickle 4.0.x files.
Contains a fix for the lambda py_obj_type issue on numpy arrays with a single non-list/tuple object content: Python 3.8 refuses to unpickle the lambda function string, which was observed while finalizing the pull request. These fixes are only activated when a 4.0.x file is to be loaded.
Exceptions thrown by load now include the exception triggering them, including its stack trace, for better localization of the error during debugging and error reporting.
h5py version limited to <3.x according to issue #143.
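To make the PyContainer concept above concrete, here is a minimal sketch of a container-based loader. The filter/append/convert interface follows the description in this proposal; the MySequence class, the 'item_index' attribute and the group layout are illustrative assumptions, not hickle's actual code.

```python
# Minimal sketch of a PyContainer-based loader as described above.
# MySequence and the 'item_index' attribute are hypothetical; only the
# filter/append/convert interface follows the proposal.

class MySequencePyContainer:
    """Collects the subitems of an h5py.Group and rebuilds a MySequence."""

    def __init__(self, h5_attrs, base_type, object_type):
        self._h5_attrs = h5_attrs        # attrs of the group being restored
        self._object_type = object_type  # Python type to rebuild
        self._content = []

    def filter(self, h_parent):
        # The filter method may adjust, suppress, or insert subitems
        # before they are restored; this one merely yields them in a
        # deterministic order based on a hypothetical 'item_index' attribute.
        for name, item in sorted(h_parent.items(),
                                 key=lambda kv: kv[1].attrs.get('item_index', 0)):
            yield name, item

    def append(self, name, item, h5_attrs):
        # Called once for every restored subitem.
        self._content.append(item)

    def convert(self):
        # Called after all subitems have been appended; returns the
        # fully restored Python object.
        return self._object_type(self._content)
```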
Basic Memoisation:
==================
Both types of memoisation are handled by the ReferenceManager, a dictionary-type object. For storing object instance references it is used as a Python dict object which, when dumping, stores the references to the py_obj and the related node using id(py_obj) as key. On load, the id of the h_node is used as key for storing the to-be-shared reference of the restored object. Additional references to the same object are represented by h5py.Datasets with their dtype set to ref_dtype. They are created by assigning an h5py.Reference object, as returned by the h5py.Dataset.ref or h5py.Group.ref attribute. These datasets are resolved by the filter iterator method of the ExpandReferenceContainer class and returned as a sub_item of the reference dataset.

Type Memoisation:
=================
The 'type' attribute of all nodes, except datasets which contain pickle strings or expose ref_dtype as their dtype, now contains a reference to the appropriate py_obj_type entry in the global 'hickle_types_table'. This table hosts, exactly once, the datasets representing all py_obj_types and base_types encountered by hickle.dump. Each py_obj_type is represented by a numbered dataset containing the corresponding pickle string. The base_types are represented by empty datasets whose names are the names of the base_types as defined by the class_register tables of the loaders. No entry is stored for (object, b'pickle') or (hickle.ReferenceType, b'!node-reference'), as these can be resolved implicitly on load. The 'base_type' attribute of a py_obj_type entry refers to the base_type used to encode it and required to properly restore it again from the hickle file.

The entries in the 'hickle_types_table' are managed by the ReferenceManager.store_type and ReferenceManager.resolve_type methods. The latter also takes care of properly distinguishing pickle datasets from reference datasets and of resolving hickle 4.0.X dict_item groups. The ReferenceManager is implemented as a context manager and thus can and shall be used within a with statement to ensure proper cleanup. Each file has its own ReferenceManager instance, therefore different data can be dumped to distinct files which are open in parallel. The basic management of managers is provided by the BaseManager base class, which can be used to build further managers, for example to allow loaders to be activated only when specific feature flags are passed to the hickle.dump method or encountered by hickle.load in the file attributes. The BaseManager class has to be subclassed.

Other changes:
==============
- lookup.register_class and class_register tables have an additional memoise flag indicating whether a py_obj shall be remembered for representing and resolving multiple references to it, or whether it shall be dumped and restored every time it is encountered
- lookup.hickle_types table entries include the memoise flag as third entry
- lookup.load_loader: the tuple returned in addition to py_obj_type includes the memoise flag
- hickle.load: whether to use the load_fn stored in lookup.hkl_types_table or a PyContainer object stored in hkl_container_dict is decided upon the is_container flag returned by ReferenceManager.resolve_type, instead of checking whether the processed node is of type h5py.Group
- dtype of string and bytes datasets is now set to 'S1' instead of 'u8', and shape is set to (1, len)
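As a rough illustration of the two memoisation directions described above, here is a sketch in plain Python. The dict layouts are assumptions derived from this description, not the actual ReferenceManager internals.

```python
# Sketch of the two memoisation directions described above. The dict
# layouts are assumptions based on this description, not the exact
# ReferenceManager internals.

# On dump: remember every py_obj by id(py_obj) so a second occurrence
# can be stored as a reference dataset instead of a second full copy.
dump_memo = {}

def remember_on_dump(py_obj, h_node):
    dump_memo[id(py_obj)] = (py_obj, h_node)

def node_for(py_obj):
    entry = dump_memo.get(id(py_obj))
    return entry[1] if entry is not None else None  # node to reference

# On load: remember every restored object by the id of its h_node so a
# reference dataset can resolve to the already restored instance.
load_memo = {}

def remember_on_load(h_node, py_obj):
    # "the id of the h_node" per the description; the exact key type
    # used by hickle is an implementation detail.
    load_memo[h_node.id] = py_obj
```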
In the current release, custom objects are simply converted into a binary pickle string. This commit implements HEP003 (issue #145), a register_class-based alternative featuring custom loader modules. In hickle 4.x, custom loader functions can be added to hickle by explicitly calling hickle.lookup.register_class before calling hickle.dump and hickle.load. In HEP003 an alternative approach was proposed: the hickle-specific compact_expand protocol, mimicking the Python copy protocol. It was found that this mimicry does not provide any benefit compared to the hickle.lookup.register_class-based approach. Even worse, hickle users would have to litter their class definitions with hickle-only loader methods called __compact__ and __expand__, and in addition would have to register their class for compact_expand and activate the compact_expand loader option to enable the use of these two methods.

This commit implements hickle.lookup.register_class-based package and program loader module support. In addition to the hickle.loaders directory, load_<package>.py modules may also be stored along with the Python base package, module, or program main script. In order to keep the directory structure clean, load_<package>.py modules must be stored within a special hickle_loaders subdirectory at that level. A sketch of such a loader module follows the examples below.

Example package loader:
-----------------------
dist_packages/
+- ...
+- <my_package>/
|  +- __init__.py
|  +- sub_module1.py
|  +- ...
|  +- sub_moduleN.py
|  +- hickle_loaders/
|     +- load_<my_package>.py
+- ...

Example single module loader:
-----------------------------
dist_packages/
+- ...
+- <my_module>.py
+- ...
+- hickle_loaders/
|  +- ...
|  +- load_<my_module>.py
|  +- ...
+- ...

Example program main (package) loader:
--------------------------------------
bin/
+- ...
+- <my_single_file_program>.py
+- hickle_loaders/
|  +- ...
|  +- load_<my_single_file_program>.py
|  +- ...
+- ...
+- <my_program>/
|  +- ...
|  +- <main>.py
|  +- ...
|  +- hickle_loaders/
|  |  +- ...
|  |  +- load_<main>.py
|  |  +- ...
|  +- ...
+- ...

Fallback loader recovering data:
--------------------------------
Implements special AttemptRecoverCustom types used as replacements when Python objects are missing or are incompatible with the data stored. The affected data is loaded as RecoveredGroup (dict type) or RecoveredDataset (numpy.ndarray type) objects. Attached to either is the attrs attribute as found on the corresponding h5py.Group or h5py.Dataset in the hickle file.

LoaderManager:
==============
The LoaderManager-based approach allows adding further optional loader sets. For example, when loading a hickle 4.0.X file, the corresponding loader set is implicitly added to ensure 'DictItem' and other helper types specific to hickle 4.0.X are properly recognized and the corresponding data is properly restored. Only optional loaders, except legacy loaders provided by hickle core (currently 'hickle-4.0'), are considered valid; these are listed by the 'optional_loaders' exported by hickle.loaders.__init__.py. A class_register table entry can be assigned to a specific optional loader by specifying the loader name as its 7th item. Any other entry which has fewer than 7 items, or whose 7th item reads None, is included in the set of global loaders.
@hernot
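For illustration, a hedged sketch of what such a load_<my_package>.py module might contain. MyClass and every name below are illustrative assumptions; only the overall create/load/class_register convention follows this PR's description.

```python
# hickle_loaders/load_my_package.py -- hypothetical loader module sketch.
# MyClass and all names below are illustrative; only the overall
# create/load/class_register convention follows this PR's description.

from my_package import MyClass  # hypothetical custom class


def create_myclass_dataset(py_obj, h_group, name, **kwargs):
    """Dump MyClass; **kwargs forwards h5py options such as compression."""
    ds = h_group.create_dataset(name, data=py_obj.values, **kwargs)
    return ds, ()  # the dataset plus an empty tuple of subitems


def load_myclass(h_node, base_type, py_obj_type):
    """Restore MyClass from its dataset."""
    return py_obj_type(h_node[()])


# Table scanned by hickle when this loader module is discovered:
# (py_obj_type, base_type, dump_function, load_function,
#  container_class, memoise flag[, optional loader set name])
class_register = [
    (MyClass, b'MyClass', create_myclass_dataset, load_myclass, None, True),
]
exclude_register = []
```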
Hickle 4.0.x accepts file-like objects but uses them only as a model for creating a new file, discarding the file handle passed in. If the file is later accessed using the same file handle, this results in an IOError or alike being raised. The proposed fix does not close the file, but checks that it is opened at least readable if mode is 'r', or readable and writable in case mode reads 'w', 'r+', 'w+', 'x', 'x+' or 'a'. Note: the passed-in file or file-like object has to be closed by the caller.
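A minimal sketch of the mode check described above, assuming a plain Python file-like object exposing readable()/writable(); hickle's actual checks may differ in detail.

```python
# Sketch of the file-handle check described above (assumption: a plain
# Python file-like object exposing readable()/writable()).

def check_file_handle(f, mode):
    if mode == 'r':
        if not f.readable():
            raise OSError("file must be opened readable for mode 'r'")
    elif mode in ('w', 'r+', 'w+', 'x', 'x+', 'a'):
        if not (f.readable() and f.writable()):
            raise OSError(
                f"file must be opened readable and writable for mode {mode!r}")
    else:
        raise ValueError(f"unknown file mode {mode!r}")
    # Note: the caller remains responsible for closing f.
```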
Compression keyword safety tests:
=================================
In issue #140 it was reported that some loaders crash when 'compression', 'chunks' and related h5py keyword arguments are specified. By running pytest a second time, thereby specifying the custom --enable-compression parameter, all tests are rerun with kwargs={'compression': 'gzip', 'compression_opts': 6}. All compression-sensitive tests, especially all 'test_XX_*.py::*' unit test functions, must include the 'compression_kwargs' parameter in their signature to receive the actual kwargs list to be passed to all 'create_fcn' functions defined by loader modules. In case a test function misses passing on 'compression_kwargs' as keyword arguments ('**kwargs') to 'hickle.dump', 'hickle._dump', or any dump_method listed in the 'class_register' table of a loader module or specified directly in 'LoaderManager.register_class', an AssertionError exception is thrown indicating the name of the test function, the line in which the affected function is called, and any function which it calls. Tests which either test compression-related issues explicitly or do not call any of the dump functions may be marked accordingly using the 'pytest.mark.no_compression' marker to explicitly exclude the test function from compression testing.

Tox virtual env manager support:
================================
Adds support for the virtualenv manager tox. Tox simplifies local testing of compatibility with multiple Python versions before pushing to GitHub and creating a pull request. Travis and Appveyor integration still has to be tested and verified.
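A hedged sketch of a test following the compression_kwargs convention described above; the 'compression_kwargs' fixture name is taken from the description, while the dumped object and file handling are illustrative.

```python
# Sketch of a compression-aware test per the convention described above.
# The 'compression_kwargs' fixture name is taken from the description;
# the dumped object and file handling are illustrative.

import pytest
import hickle


def test_dump_roundtrip(tmp_path, compression_kwargs):
    # compression_kwargs is empty on the plain pytest run and reads
    # {'compression': 'gzip', 'compression_opts': 6} on the
    # --enable-compression rerun; it must be forwarded as **kwargs.
    filename = str(tmp_path / 'data.hkl')
    hickle.dump([1, 2, 3], filename, **compression_kwargs)
    assert hickle.load(filename) == [1, 2, 3]


@pytest.mark.no_compression  # opts out: no dump function is exercised here
def test_pure_helper():
    assert 1 + 1 == 2
```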
The current release causes problems when trying to dump and load data when h5py 3.X or newer is installed. In that case loading will fail with errors. This is caused by:
- a change in how strings are stored in attributes. Before 3.X they were stored as byte strings; encoding had to happen manually before storing to and after reading from the attribute. In h5py 3.X, encoding is handled by attrs.__setitem__ and attrs.__getitem__. The latter simply assumes that any byte string stored as an attribute value must represent a Python 'str' object, which either was written by Python 2 (which used plain C strings) or is the encoded version of a Python 3 utf8 string, and thus has to be decoded as utf8. Only in case the byte string contains characters which are not valid in a utf8-encoded string is the content returned as a 'bytes' object.
- a change in how utf8 'str' objects are stored. They are now consistently stored as var_dtype objects where the metadata of the dtype indicates the resulting Python type. In h5py 2 this didn't matter; therefore strings stored in datasets using h5py 2 are returned with an opaque object dtype, for which decoding to utf8 has to be enforced if the py_obj_type, as indicated by the 'type' attribute, is 'str'.
- a change in how datasets for which no explicit link exists in the file, but to which a strong Python reference via the h5py.Dataset or h5py.Group object exists, are handled when a reference to them is resolved. In h5py 2 a ValueError is raised; in h5py 3.X an anonymous Dataset or Group is returned. Consequently, when simulating stale references encountered by hickle.load during test runs, all Python references to the corresponding datasets and groups have to be dropped to see that the reference is stale. The alternative would be to check the name of the referred-to dataset: if it is None, the dataset is anonymous. Resolving the name on non-anonymous datasets and groups, which always represents the full path name and not just the local name, is a quite expensive and costly process in terms of runtime and resources and shall be avoided.

This commit provides fixes for these changes severely affecting proper operation of hickle. String values of attributes are compared against 'bytes' and 'str' type variants of constant strings. Lookup tables like key_base_type, hkl_types and hkl_container_types include the key for an entry as both 'str' and 'bytes' type keys. A featured optional load_fcn is provided fixing the difference in how the dtype of utf8 strings is computed by h5py 2 and h5py 3. Test functions are updated accordingly to properly mock stale reference conditions and changes in how strings are stored.
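A minimal sketch of the dual bytes/str comparison approach described above; the attribute name and the normalisation direction are illustrative.

```python
# Sketch of the bytes/str-tolerant attribute handling described above.
# Depending on the h5py version (and which version wrote the file) an
# attribute value may come back as bytes or as str.

def base_type_of(h_node):
    value = h_node.attrs.get('base_type')
    if value in (b'pickle', 'pickle'):   # compare against both variants
        return b'pickle'
    if isinstance(value, str):
        return value.encode('utf8')      # normalise to bytes internally
    return value
```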
Removed commits included in the branch by imperfect branch rebasing when assembling the hickle-5-RC branch.

Numpy introduced with version 1.20 a disruptive change which prevents numpy.dtype type objects like np.int16 from being pickled. Trying to pickle numpy dtypes causes a pickling error in Python 3.7 and Python 3.8, which utilize numpy >= 1.20. According to numpy issue numpy/numpy#16692, this will not be fixed before numpy 1.21 is released. To fix for now, the numpy requirement is set to less than 1.20. A recheck is required when numpy 1.21 is released; if fixed, update the requirement such that all numpy 1.20 versions are properly excluded from being supported by hickle.

The latest cryptography packages required by tox changed their dependency upon the rust compiler. This change makes installation of virtualenv fail unless, like for pip, its latest version is used. Therefore virtualenv is now included in the pip --update line.

Deviating from former plans, the missing indexer on the regex match class in Python 3.5 was fixed. The latest HDF5 versions (>= 1.12) seem to no longer support the Windows 32 bit platform. This causes all Appveyor builds on Windows 32 bit to fail except for Python 3.5; h5py 2.10 seems to be the last version supporting Python 3.5. Added a TOXENV entry to the Appveyor yaml, as tox_appveyor seems to no longer limit the environments to be tested to the selected testing environment.
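Expressed as a setup.py fragment, the temporary pin described above would look roughly like this (a sketch; the exact file and the surrounding entries are assumptions).

```python
# Sketch of the temporary numpy pin described above, as a setup.py
# fragment (assumption: requirements are declared via install_requires).
from setuptools import setup

setup(
    name='hickle',
    install_requires=[
        'numpy<1.20',  # numpy/numpy#16692: dtype objects unpicklable in 1.20
        'h5py>=2.10',
    ],
)
```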
Recovering data:
----------------
All data for which no loader can or could be loaded is attempted to be recovered.

Windows 32 bit fix:
-------------------
Binary prebuilt wheels for h5py are available on PyPI up to version 2.10 and Python 3.8. Any later version only provides prebuilt wheels for 64 bit Windows, and must be built manually if it is to be installed on 32 bit Windows. The situation is similar for libhdf5, which is required by h5py: the last version for which prebuilt binaries for Windows 7 32 bit are available is 1.10.4 (https://portal.hdfgroup.org/display/support/HDF5+1.10.4). Any later version must be built manually before building h5py manually. Consequently hickle can only run on 32 bit Windows as long as no h5py features are used which are provided by h5py >= 3.0, and when run on Python <= 3.8. Any other version is 64 bit only. On Appveyor the variable TOX_H5PY_REQIREMENTS is used to select the requirements32.txt file, which specifies all requirements in their latest version providing prebuilt wheels for 32 bit Windows. On any other system this variable is not set and thus tox will select requirements.txt instead.

Other:
------
Small fixes, documentation etc.
Switched deployment and testing away from Appveyor and Travis, as requested by @1313e. Added a description on how to enable custom loaders and set up loader modules for custom Python packages, modules and applications. Requirement for h5py set to >= 2.10. Hickle version set to 5.0.0.dev0. Added a dummy test for the not yet existing loaders for the pandas package. Documentation cleanup. Revision of unit tests; removed `# pragma: no cover` from branches which are covered by existing tests.
a) drop eval for b'NoneType' dict key_base_type
b) switch back to literal_eval for b'tuple' dict key_base_type
c) limit the GitHub Actions workflow to running for push and pull_request events on the telegraphic/hickle repo only
d) limit the GitHub Actions workflow to running for the following pull_request types only: opened, synchronize, reopened, edited

NOTE: c) and d) needed some extra learning on how GitHub Actions work.
NOTE: for c) and d), when debugging the workflow (and workflows in general) from a forked repo, comment out the topmost if condition of the edited job. This ensures its edited steps are executed on the forked repo. Until the edits are merged into the upstream repo, pull_request.synchronize will use the unchanged workflow. When done with editing, do not forget to uncomment the if condition to prevent the workflow from being executed unconditionally on forked repos and the upstream repo, even though it will still appear in the actions section of the forked repo when configured so on the upstream repo.
…py_obj has to be pickled due to a missing loader. Fixes: RuntimeError emitted for a properly registered hickle-internal loader with a None-type dump_fcn
… adding dill weed
…cleanup. Without package, module and script loader support is broken
…properly and stable under any circumstances?
… collector, try forcing collection after file flushing and before closing
…ves the problem too
On GitHub Actions Windows runners it was observed for Python 3.7 32 bit and 64 bit and Python 3.8 64 bit that h5py.copy aborted with a PermissionError exception when copying hdf5 files which are opened for writing. On the remaining three Windows runners, as well as the Linux and Mac runners, copying was possible without any issue. Adapted the test_ReferenceManager_context and test_ReferenceManager tests to close the file accessed through the h5_data fixture before calling shutil.copy.

Fixed a spurious "Unrecognized typecode" ValueError exception occurring during h5py.File.close. These errors are caused by the Python garbage collector kicking in while h5py.close is in a critical section. By disabling garbage collection before the call to h5py.File.close and re-enabling it afterwards, this can be solved for all Python versions. This error could be observed most consistently on Windows 32 bit running Python 3.7 with h5py 2.10.0.
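A minimal sketch of the garbage-collection workaround described above; the trailing collect() call is an assumption, not part of the described fix.

```python
import gc
import h5py

# Sketch of the workaround described above: keep the garbage collector
# out of h5py.File.close's critical section.

def close_without_gc(h5_file: h5py.File):
    gc.disable()      # prevent the GC from kicking in mid-close
    try:
        h5_file.close()
    finally:
        gc.enable()   # restore normal GC behaviour afterwards
        gc.collect()  # assumption: catch up on any deferred collection
```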
…cleanup. Without package, module and script loader support is broken
…irements updated to allow any time removal from requirements.txt and requirements32.txt while it stays requirement for testing. Removed explicit six requirement as no loader yet available and likely meanwhile installed by other packages anyway in appropriate version.
Hickle 5 rc
Any reason to keep appveyor now we have github actions? (I think not?)
No, that link should be removed at AppVeyor's side.
I've updated the […]. The version is based on that in […]. We should change this to 5.0.0 once we're satisfied that it is working for end users.
@telegraphic Cool, thank you very much. By the way, I guess you should update the list of major changes in README.md, and possibly one day, if relevant at all, redo the speed comparisons to fit hickle 5.
We have merged @hernot's Hickle 5 rc branch into our dev-v5 branch; what else needs to be done before releasing v5.0.0?