Hickle v5.0.0 #153

Merged: 32 commits from dev-v5 into master on Dec 17, 2021

Conversation

telegraphic (Owner)

We have merged @hernot's Hickle 5 RC branch into our dev-v5 branch; what else needs to be done before releasing v5.0.0?

With hickle 4.0.0 the code for dumping and loading dedicated objects
like scalar values or numpy arrays was moved to dedicated loader
modules. This first step of disentangling the hickle core machinery from
object-specific code covered all objects and structures which were mappable
to h5py.Dataset objects.

This commit provides an implementation of hickle extension proposal
H4EP001 (#135). In this
proposal the extension of the loader concept introduced by hickle 4.0.0
towards generic PyContainer-based and mixed loaders is specified.

In addition to the proposed extension, this implementation includes
the following extensions to hickle 4.0.0 and H4EP001:

H4EP001:
========
    The PyContainer interface includes a filter method which allows loaders,
    when data is loaded, to adjust, suppress, or insert additional data subitems
    of h5py.Group objects. In order to accomplish the temporary modification
    of h5py.Group and h5py.Dataset objects when the file is opened in read
    only mode, the H5NodeFilterProxy class is provided. This class
    stores all temporary modifications while the original h5py.Group
    and h5py.Dataset objects stay unchanged.
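
    A minimal sketch of a loader container using this filter method, with
    illustrative names (the PyContainer internals shown here are an
    assumption, not hickle's actual code):

        from hickle.helpers import PyContainer

        class MyGroupContainer(PyContainer):

            def filter(self, h_parent):
                # adjust, suppress, or insert sub-items while iterating
                # over the group; H5NodeFilterProxy could be used here to
                # temporarily modify nodes of a read-only file
                for name, item in h_parent.items():
                    if name == "ignore_me":
                        continue        # suppress an unwanted sub-item
                    yield name, item    # pass everything else through

            def convert(self):
                # build the final Python object from the appended items;
                # the _content attribute is illustrative
                return list(self._content)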

hickle 4.0.0 / 4.0.1:
=====================
    Strings and arrays of bytes are stored as Python bytearrays and not as
    variable sized strings and bytes. The benefit is that hdf5 filters
    and hdf5 compression filters can be applied to Python bytearrays.
    The downside is that the data is stored as bytes of int8 datatype.
    This change affects native Python string scalars as well as numpy
    arrays containing strings.

    numpy masked arrays are now stored as an h5py.Group containing a
    dedicated dataset each for data and mask.

    scipy.sparse matrices are now stored as an h5py.Group containing
    the datasets data, indices, indptr and shape.

    dictionary keys are now used as names for h5py.Dataset and
    h5py.Group objects.

    Only string, bytes, int, float, complex, bool and NoneType keys are
    converted to name strings; for all other keys a key-value-pair group
    is created containing the key and value as its subitems.

    String and bytes keys which contain slashes are converted into key
    value pairs instead of converting slashes to backslashes. The
    distinction from hickle 4.0.0 string and bytes keys with converted
    slashes is made by enclosing the string value within double quotes
    instead of single quotes as done by the Python repr function or the
    !r or %r string format specifiers. Consequently, on load all string
    keys which are enclosed in single quotes will be subjected to slash
    conversion while any others will be used as is.
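
    A hypothetical illustration of this key handling (the exact on-disk
    node names are an assumption):

        import hickle

        data = {
            "plain": 1,   # str key -> node named '"plain"'
            42: 2,        # int key -> node named '42'
            "a/b": 3,     # slash in key -> key-value-pair group
            (1, 2): 4,    # unsupported key type -> key-value-pair group
        }
        hickle.dump(data, "keys_demo.hkl")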

    h5py.Group and h5py.Dataset objects whose 'base_type' refers to 'pickle'
    automatically get assigned object as their py_obj_type on load.
    The related 'type' attribute is ignored. h5py.Dataset objects which do
    not expose a 'base_type' attribute are assumed to contain a pickle string
    and thus get implicitly assigned the 'pickle' base type. Thus on dump, for
    all h5py.Dataset objects which contain pickle strings the 'base_type' and
    'type' attributes are omitted as their values are 'pickle' and object
    respectively.

Other stuff:
============
    Full separation between hickle core and loaders

    Distinct unit tests for individual loaders and hickle core

    Cleanup of no longer required functions and classes

    Simplification of recursion on dump and load through a self-contained
    loader interface.

    Hickle is capable of loading hickle 4.0.x files, which do not yet
    support the PyContainer concept beyond list, tuple, dict and set;
    includes extended tests for loading hickle 4.0.x files.

    Contains a fix for the lambda py_obj_type issue on numpy arrays with
    single non-list/tuple object content. Python 3.8 refuses to
    unpickle the lambda function string. This was observed while finalizing
    the pull request. The fixes are only activated when a 4.0.x file is to
    be loaded.

    Exceptions thrown by load now include the exception triggering them,
    including the stack trace, for better localization of errors in
    debugging and error reporting.

    h5py version limited to <3.x according to issue #143

Basic Memoisation:
==================
Both types of memoisation are handled by the ReferenceManager dictionary
type object. For storing object instance references it is used as a
Python dict object which stores the references to the py_obj and related
node using id(py_obj) as key when dumping. On load the id of the h_node
is used as key for storing the to-be-shared reference of the restored
object.

Additional references to the same object are represented by
h5py.Datasets with their dtype set to ref_dtype. They are created
by assigning an h5py.Reference object as returned by the h5py.Dataset.ref
or h5py.Group.ref attribute. These datasets are resolved by the filter
iterator method of the ExpandReferenceContainer class and returned as
sub_items of the reference dataset.
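
A sketch of the underlying h5py mechanism (illustrative, not hickle's
internal code): a second occurrence of an object is stored as a dataset
holding an h5py.Reference to the node that was written first.

    import h5py

    with h5py.File("refs_demo.h5", "w") as f:
        first = f.create_dataset("shared_data", data=[1, 2, 3])
        ref_dtype = h5py.special_dtype(ref=h5py.Reference)
        # additional reference to the same object:
        again = f.create_dataset("again", shape=(), dtype=ref_dtype)
        again[()] = first.ref

    with h5py.File("refs_demo.h5", "r") as f:
        node = f[f["again"][()]]   # dereferences back to 'shared_data'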

Type Memoisation
================
The 'type' attribute of all nodes, except datasets which contain pickle
strings or expose ref_dtype as their dtype, now contains a reference
to the appropriate py_obj_type entry in the global 'hickle_types_table'.
This table hosts the datasets representing all py_obj_types and
base_types encountered by hickle.dump, each stored once.

Each py_obj_type is represented by a numbered dataset containing the
corresponding pickle string. The base_types are represented by empty
datasets the name of which is the name of the base_type as defined
by the class_register table of the loaders. No entry is stored for object,
b'pickle' as well as hickle.ReferenceType, b'!node-reference' as these
can be resolved implicitly on load.
The 'base_type' attribute of a py_obj_type entry refers to the
base_type used to encode it and required to properly restore it again
from the hickle file.

The entries in the 'hickle_types_table' are managed by the
ReferenceManager.store_type and ReferenceManager.resolve_type methods.
The latter also takes care of properly distinguishing pickle
datasets from reference datasets and of resolving hickle 4.0.X dict_item
groups.

The ReferenceManager is implemented as a context manager and thus can and
shall be used within a with statement, to ensure proper cleanup. Each
file has its own ReferenceManager instance, therefore different data can
be dumped to distinct files which are open in parallel. The basic
management of managers is provided by the BaseManager base class, which
can be used to build further managers, for example to allow loaders to be
activated only when specific feature flags are passed to the hickle.dump
method or encountered by hickle.load in the file attributes. The
BaseManager class has to be subclassed.
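
A hypothetical usage sketch of this pattern (the factory method name and
signature are assumptions and may differ from hickle's actual API):

    import h5py
    from hickle.lookup import ReferenceManager

    with h5py.File("data.hkl", "r") as hkl_file:
        root = hkl_file["data"]
        with ReferenceManager.create_manager(root) as memo:
            # memo maps node ids to restored objects while loading;
            # leaving the with block triggers proper cleanup
            ...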

Other changes:
==============
 - lookup.register_class and the class_register tables have an additional
   memoise flag indicating whether a py_obj shall be remembered for
   representing and resolving multiple references to it, or whether it
   shall be dumped and restored every time it is encountered

 - lookup.hickle_types table entries include the memoise flag as third entry

 - lookup.load_loader: the tuple returned in addition to py_obj_type
   includes the memoise flag

 - hickle.load: whether to use a load_fn stored in
   lookup.hkl_types_table or a PyContainer object stored in
   hkl_container_dict is decided upon the is_container flag returned
   by ReferenceManager.resolve_type instead of checking whether the
   processed node is of type h5py.Group

 - dtype of string and bytes datasets is now set to 'S1' instead of 'u8'
   and shape is set to (1,len)
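
A sketch of the resulting storage layout (illustrative file and dataset
names):

    import h5py
    import numpy as np

    raw = "hello".encode("utf8")
    arr = np.frombuffer(raw, dtype="S1").reshape(1, len(raw))
    with h5py.File("demo.h5", "w") as f:
        f.create_dataset("a_string", data=arr)  # dtype 'S1', shape (1, 5)
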
In the current release custom objects are simply converted into a binary
pickle string. This commit implements the H4EP003 (issue #145)
register_class based alternative featuring custom loader modules.

In hickle 4.x custom loader functions can be added to hickle by
explicitly calling hickle.lookup.register_class before calling
hickle.dump and hickle.load. In H4EP003 an alternative approach, the
hickle specific compact_expand protocol mimicking the Python copy
protocol, is proposed. It was found that this mimicry does not provide
any benefit compared to the hickle.lookup.register_class based approach.
Even worse, hickle users have to litter their class definitions with
hickle-only loader methods called __compact__ and __expand__ and
in addition have to register their class for compact expand and activate
the compact_expand loader option to activate the use of these two
methods.
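
A hedged sketch of the register_class based approach (the exact signature
has changed between hickle versions, so the argument list shown here is
illustrative only):

    import hickle
    import hickle.lookup

    class MyType:
        def __init__(self, value):
            self.value = value

    def create_mytype_dataset(py_obj, h_group, name, **kwargs):
        # dump MyType as a plain dataset holding its value
        ds = h_group.create_dataset(name, data=py_obj.value, **kwargs)
        return ds, ()   # no sub-items to recurse into

    def load_mytype(h_node, base_type, py_obj_type):
        return py_obj_type(h_node[()])

    hickle.lookup.register_class(
        MyType, b"mytype", create_mytype_dataset, load_mytype
    )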

This commit implements hickle.lookup.register_class package and program
loader module support. In addition, load_<package>.py modules may,
besides the hickle.loaders directory, also be stored along with the
Python base package, module or program main script. In order to keep the
directory structure clean, load_<package>.py modules must be stored
within the special hickle_loaders subdirectory at that level.

Example package loader:
-----------------------

dist_packages/
  +- ...
  +- <my_package>/
  |   +- __init__.py
  |   +- sub_module1.py
  |   +- ...
  |   +- sub_moduleN.py
  |   +- hickle_loaders/
  |   |   +- load_<my_package>.py
  +- ...

Example single module loader:
-----------------------------

dist_packages/
  +- ...
  +- <my_module>.py
  +- ...
  +- hickle_loaders/
  |   +- ...
  |   +- load_<my_module>.py
  |   +- ...
  +- ...

Example program main (package) loader:
--------------------------------------

bin/
  +- ...
  +- <my_single_file_program>.py
  +- hickle_loaders/
  |   +- ...
  |   +- load_<my_single_file_program>.py
  |   +- ...
  +- ...
  +- <my_program>/
  |   +- ...
  |   +- <main>.py
  |   +- ...
  |   +- hickle_loaders/
  |   |   +- ...
  |   |   +- load_<main>.py
  |   |   +- ...
  |   +- ...
  +- ...
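
A minimal, illustrative load_<my_package>.py skeleton following the
class_register layout described in this PR (the exact column order of the
table entries is an assumption):

    from my_package import MyClass

    def create_myclass_dataset(py_obj, h_group, name, **kwargs):
        # dump MyClass as a dataset holding its payload
        ds = h_group.create_dataset(name, data=py_obj.payload, **kwargs)
        return ds, ()

    def load_myclass(h_node, base_type, py_obj_type):
        return py_obj_type(h_node[()])

    class_register = [
        # (type, base_type, dump_fcn, load_fcn, container, memoise)
        (MyClass, b"MyClass", create_myclass_dataset, load_myclass,
         None, True),
    ]
    exclude_register = []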

Fallback Loader recovering data:
--------------------------------
Implements special AttemptsRecoverCustom types used as replacements for
Python objects which are missing or are incompatible with the stored
data. The affected data is loaded as RecoveredGroup (dict type) or
RecoveredDataset (numpy.ndarray type) objects. Attached to either is the
attrs attribute as found on the corresponding h5py.Group or h5py.Dataset
in the hickle file.

LoaderManager:
==============

The LoaderManager based approach allows adding further optional loader
sets. For example, when loading a hickle 4.0.X file the corresponding
loader set is implicitly added to ensure 'DictItem' and other
helper types specific to hickle 4.0.X are properly recognized and the
corresponding data is properly restored. Only optional loaders, except
legacy loaders provided by hickle core (currently 'hickle-4.0'),
are considered valid; they are listed by the 'optional_loaders' exported
by hickle.loaders.__init__.py.

A class_register table entry can be assigned to a specific optional
loader by specifying the loader name as its 7th item. Any other entry
which has fewer than 7 items, or whose 7th item reads None, is included
in the set of global loaders.
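
An illustrative pair of entries (the 'my-option' loader name is
hypothetical):

    class_register = [
        # global loader: no 7th item
        (MyClass, b"MyClass", dump_myclass, load_myclass, None, True),
        # only active when the 'my-option' optional loader set is enabled
        (MyClass, b"MyClass-alt", dump_alt, load_alt, None, True,
         "my-option"),
    ]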

@hernot
Hickle 4.0.x accepts file like objects and uses them as a model for
creating a new file, discarding the file handler passed in. If the file
should be accessed later on using the same file handler, this results
in an IOError or alike being raised.

The proposed fix does not close the file but checks that it is opened at
least readable if mode is 'r', or readable and writable in case mode
reads 'w', 'r+', 'w+', 'x', 'x+' or 'a'.

Note: the passed in file or file like object has to be closed by the caller.
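
A sketch of the resulting behaviour (file name illustrative):

    import h5py
    import hickle

    f = h5py.File("shared.hkl", "w")   # caller-managed handle
    hickle.dump([1, 2, 3], f)          # hickle writes into this handle
    f.close()                          # closing remains the caller's job
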
Compression keywords safety tests:
==================================

In issue #140 it was reported that some loaders crash
when 'compression', 'chunks' and related h5py keyword arguments are
specified. By running pytest a second time and thereby specifying
the custom --enable-compression parameter, all tests are rerun with

   kwargs={'compression':'gzip', 'compression_opts':6}

All compression sensitive tests, especially all 'test_XX_*.py::*' unit
test functions, must include the 'compression_kwargs' parameter in their
signature to receive the actual kwargs list to be passed to all
'create_fcn' functions defined by loader modules. In case a test function
fails to pass on the 'compression_kwargs' as keyword arguments
('**kwargs') to
   'hickle.dump',
   'hickle._dump',
or any dump_method listed in the 'class_register' table of a loader
module or specified directly in a 'LoaderManager.register_class' call, an
AssertionError exception is thrown indicating the name of the test
function, the line in which the affected function is called and any
function which it calls.
Tests which either test compression related issues explicitly or do not
call any of the dump functions may be marked accordingly using the
   'pytest.mark.no_compression'
marker to explicitly exclude the test function from compression testing.
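
An illustrative test following this convention (fixture names assumed to
match the test suite's conventions):

    import pytest
    import hickle

    def test_dump_list(tmp_path, compression_kwargs):
        # compression_kwargs is forwarded so the --enable-compression
        # rerun exercises the dump with gzip settings
        filename = str(tmp_path / "test.hkl")
        hickle.dump([1, 2, 3], filename, **compression_kwargs)
        assert hickle.load(filename) == [1, 2, 3]

    @pytest.mark.no_compression
    def test_helper_only():
        # calls no dump function, so it is excluded from the rerun
        assert True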

Tox virtual env manager support:
================================
Adds support for the virtualenv manager tox. Tox simplifies local testing
of compatibility with multiple Python versions before pushing to GitHub
and creating a pull request. Travis and Appveyor integration still has to
be tested and verified.
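
A minimal illustrative tox.ini for such local testing (the environment
list and dependencies in the repository may differ):

    [tox]
    envlist = py35,py36,py37,py38

    [testenv]
    deps =
        -rrequirements.txt
        pytest
    commands = pytest hickle/tests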

The current release causes problems when trying to dump and load data
when the h5py 3.X library or newer is installed. In that case loading
will fail with errors. This is caused by
 - a change in how strings are stored in attributes.
   Before 3.x they were stored as byte strings. Encoding had to happen
   manually before storing to and after reading from the attribute. In
   h5py 3 encoding is handled by attrs.__setitem__ and attrs.__getitem__.
   The latter simply assumes that any byte string stored as an attribute
   value must represent a Python 'str' object which either was written
   by Python 2, which used plain C strings, or is the encoded version of
   a Python 3 utf8 string and thus has to be decoded as utf8. Only in
   case the byte string contains characters which are not valid in a
   utf8 encoded string is the content returned as a 'bytes' object.
 - a change in how utf8 'str' objects are stored. They are now
   consistently stored as var_dtype objects where the metadata of the
   dtype indicates the resulting Python type. In h5py 2 this didn't
   matter, therefore strings stored in datasets using h5py 2 are returned
   with an opaque object dtype for which decoding to utf8 has to be
   enforced if the py_obj_type as indicated by the 'type' attribute is
   'str'.
 - a change in how Datasets for which no explicit link exists in the
   file, but to which a strong Python reference to the h5py.Dataset or
   h5py.Group object exists, are handled when a reference to them is
   resolved. In h5py 2 a ValueError is raised; in h5py 3.X an anonymous
   Dataset or Group is returned. Consequently, when simulating stale
   references encountered by hickle.load during test runs, all Python
   references to the corresponding datasets and groups have to be
   dropped to see that the reference is stale. The alternative would be
   to check the name of the referred-to Dataset: if it is None the
   dataset would be anonymous. Resolving the name on non-anonymous
   Datasets and Groups, which always represents the full path name and
   not just the local name, is quite expensive and costly in terms of
   runtime and resources and shall be avoided.

This commit provides fixes for these changes severely affecting proper
operation of hickle. String values of attributes are compared against
'bytes' and 'str' type variants of constant strings. Lookup tables like
key_base_type, hkl_types and hkl_container_types include the key for
an entry as both 'str' and 'bytes' type keys. A featured optional
load_fcn is provided fixing the difference in how the dtype of utf8
strings is computed by h5py 2 and h5py 3. Test functions are updated
accordingly to properly mock stale reference conditions and changes in
how strings are stored.
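
A short sketch of the bytes/str tolerant comparison described above (not
hickle's actual code):

    def is_pickle_node(h_node):
        # h5py 2 returns attribute strings as bytes, h5py 3 as str,
        # so both variants have to be accepted
        base_type = h_node.attrs.get("base_type", b"pickle")
        return base_type in (b"pickle", "pickle")
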
Removed commits that were included in the branch by imperfect rebasing
while assembling the hickle-5-RC branch.

Numpy introduced with version 1.20 a disruptive change which prevents
numpy.dtype type objects like np.int16 from being pickled. Trying
to pickle numpy dtypes causes a pickling error on Python 3.7 and Python
3.8 when they utilize numpy >= 1.20. According to numpy issue
numpy/numpy#16692 this will not be fixed before numpy 1.21 is released.
To fix this for now, the numpy requirement is set to less than 1.20.
A recheck is required when numpy 1.21 is released. If fixed, update the
requirement so that all numpy 1.20 versions will be properly excluded
from being supported by hickle.

The latest cryptography packages required by tox changed their dependency
upon the rust compiler. This change makes installation of virtualenv fail
unless, like for pip, its latest version is used. Therefore virtualenv is
now included in the pip --upgrade line.

Deviating from former plans, the missing indexer on the regex match class
in Python 3.5 was fixed.

The latest HDF5 versions (>= 1.12) seem to no longer support the Windows
32 bit platform. This causes all appveyor builds on Windows 32 bit to
fail except for Python 3.5; h5py 2.10 seems to be the last version
supporting Python 3.5.

Added a TOXENV entry to the appveyor yaml as tox_appveyor seems to no
longer limit the environments to be tested to the selected testing
environment.

Recovering data:
----------------
An attempt is made to recover all data for which no loader can or could
be loaded.

Windows 32 bit fix:
-------------------
Binary prebuilt wheels for h5py are available on PyPI up to version
2.10 and Python 3.8. Any later version only provides prebuilt wheels for
64 bit Windows.

Any later version must be built manually if it is to be installed on 32
bit Windows. The situation is similar for libhdf5, which is required by
h5py. The last version for which prebuilt binaries for Windows 7 32 bit
are available is 1.10.4
(https://portal.hdfgroup.org/display/support/HDF5+1.10.4).
Any later version must be built manually before building h5py manually.

Consequently hickle can only run on 32 bit Windows as long as no h5py
features are used which are provided by h5py >= 3.0 and when run on
Python <= 3.8. Any other version is 64 bit only.

On Appveyor the variable TOX_H5PY_REQIREMENTS is used to select the
requirements32.txt file specifying all requirements in their latest
versions providing prebuilt wheels for 32 bit Windows. On any other
system this variable is not set and thus tox will select
requirements.txt instead.

Other:
------
Small fixes, documentation etc.
Switched deployment and testing from appveyor and travis to GitHub
Actions as requested by @1313e.

Added a description of how to enable custom loaders and set up loader
modules for custom Python packages, modules and applications.

Requirement for h5py set to >= 2.10

Hickle version set to 5.0.0.dev0

Added a dummy test for the not yet existing loaders for the pandas package

Documentation cleanup.

Revision of unit tests and removed `# pragma: no cover` from branches
which are covered by existing tests.
a) drop eval for the b'NoneType' dict key_base_type
b) switch back to literal_eval for the b'tuple' dict key_base_type
c) limit the github action workflow to running for push and pull_request
   events on the telegraphic/hickle repo only
d) limit the github action workflow to running for the following
   pull_request types only: opened, synchronize, reopened, edited

NOTE: c) and d) needed some extra learning on how github actions work.
NOTE: for c) and d), to debug the workflow (and workflows in general)
   from a forked repo, comment out the topmost if condition of the edited
   job. This ensures its edited steps are executed on the forked repo.
   Until the edits are merged into the upstream repo the
   pull_request.synchronize event will use the unchanged workflow. When
   done with editing do not forget to uncomment the if condition, to
   prevent the workflow from being executed on forked repos and the
   upstream repo, even though it will still appear in the actions section
   of the forked repo when configured so on the upstream repo.
…py_obj has to be pickled due to missing loader.

fixes: RuntimeError emitted for a properly registered loader internal to hickle with a None type dump_fcn
…cleanup. Without package, module and script loader support is broken
…properly and stable under any circumstances?
…properly and stable under any circumstances?
… collector, try forcing collection after file flushing and before closing
On GitHub actions windows runners it was observed for Python 3.7 32 bit
and 64 bit and Python 3.8 64 bit that shutil.copy aborted with a
PermissionError exception when copying hdf5 files which are opened for
writing. On the remaining three windows runners as well as the linux and
mac runners copying was possible without any issue.

Adapted the test_ReferenceManager_context and test_ReferenceManager tests
to close the file accessed through the h5_data fixture before calling
shutil.copy.

Fixed a spurious "Unrecognized typecode" ValueError exception occurring
during h5py.File.close. It is caused by the Python garbage collector
kicking in while h5py.close is in a critical section. By disabling
garbage collection before the call to h5py.File.close and re-enabling it
afterwards this can be solved for all Python versions. This error could
be observed most consistently on Windows 32 bit running Python 3.7 with
h5py 2.10.0.
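
A sketch of the workaround (keeping the garbage collector out of
h5py.File.close's critical section):

    import gc

    def close_file_safely(h5_file):
        gc.disable()
        try:
            h5_file.close()
        finally:
            gc.enable()
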
…cleanup. Without package, module and script loader support is broken
…irements updated to allow any time removal from requirements.txt and requirements32.txt while it stays requirement for testing. Removed explicit six requirement as no loader yet available and likely meanwhile installed by other packages anyway in appropriate version.
hernot and others added 2 commits October 20, 2021 20:31
…irements updated to allow any time removal from requirements.txt and requirements32.txt while it stays requirement for testing. Removed explicit six requirement as no loader yet available and likely meanwhile installed by other packages anyway in appropriate version.
@telegraphic (Owner, Author)

Any reason to keep appveyor now we have github actions? (I think not?)

@1313e (Collaborator) commented Dec 16, 2021

Any reason to keep appveyor now we have github actions? (I think not?)

No, that link should be removed at AppVeyor's side.

telegraphic merged commit 118cc44 into master on Dec 17, 2021
@telegraphic (Owner, Author)

@hernot and @1313e 🎉 🎉 🎉

Thanks for your perseverance and patience!

@telegraphic (Owner, Author)

I've updated the PYPY_* secrets so that the automated upload to pypi works.

The version is based on that in __version__.py, which created:
https://pypi.org/project/hickle/5.0.0.dev0/

We should change this to 5.0.0 once we're satisfied that it is working for end users.

@hernot (Contributor) commented Dec 18, 2021

@telegraphic Cool, thank you very much.
By the way, when officially released to pypi, the issues #135, #139 and #143
can be closed. For #150 @1313e should suggest whether it should stay open
as a reminder or can be closed.

By the way, I guess you should update the list of major changes in README.md
and possibly, one day if relevant at all, redo the speed comparisons to fit
hickle 5.
