support for python copy protocol __setstate__ __getstate__ if present in object #125
@hernot Thanks for the suggestion. However, I am not sure how exactly you would like this class to be hickled.
Like pickle does: check if the class implements the __getstate__ and __setstate__ methods.
with cls being the class for which the object has to be unhickled, and state being the already unhickled dictionary which, on hickling, was returned by the call to __getstate__.
The main difference between hickling a plain Python dictionary and a class implementing the copy protocol: on unhickling an HDF5 group representing a dictionary, it has to be checked whether the metadata/attributes identify it as the state dictionary of an instance of the specified class provided by the named module. If so, the module has to be imported and the object brought back to life using the lines above. At least that would be the rough way I would do it. The tricky stuff, as always, is hidden in the details. One last thought. The
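The reconstruction step the comment describes (the code lines themselves did not survive in this copy of the thread) would, following pickle's own behaviour, look roughly like this; Example, cls and state are illustrative stand-ins:

```python
# Minimal sketch of rebuilding an object from cls and its state dict,
# the way pickle does on unpickling. Example is a hypothetical class.
class Example:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __getstate__(self):
        return self.__dict__.copy()

    def __setstate__(self, state):
        self.__dict__.update(state)


cls = Example
state = {"a": 1, "b": 2}          # as returned earlier by __getstate__

# Create an empty instance without calling __init__, then restore state.
obj = cls.__new__(cls)
obj.__setstate__(state)
print(obj.a, obj.b)  # -> 1 2
```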
Yes, from the outside it is an iterable, but its inner workings are different; trying to hickle it as a plain iterable would therefore fail, as it would for
Alright, I will have a look this weekend (hopefully).
Not sure what the code of conduct is in your team, but if it would help I could make a fork and come up with a suggestion submitted through a pull request from my personal GH account. For this I would need to know how this is expected to happen, and which manuals, design documents, etc. to read and understand first before touching anything.
Normally, a PR would be highly appreciated; it would basically consist of adding the functionality and writing tests for it.
OK, if there is something in this process I could help with, let me know.
@hernot Looks like @1313e has this under control, but FYI we do have a code of conduct and a basic contribution guide. If there are more specifics you think would be helpful for contributing, please let us know and we can improve the contribution guide.
Hm, not sure if I can be helpful beyond this, because I expected nothing else, given that I consider myself rather experienced. The only thing I would add to the contributors' guide for beginners is to read the existing code and tests as more or less good best-practice examples for structuring and formatting the code and formulating the unit tests. But concerning me, I think I'm fine.
@hernot Can you check if you can properly use the
Sure. Any special things to check, or just general verification?
Check if you can use the version on the
And, there is no hurry, as @telegraphic has to go through all the changes I made for v4 in #117 before it will be merged.
Hm, I get the following error with a simple custom pytest trying to load a file created with the current release of hickle:
dill version '0.3.1.1'. When I try to inspect the root object by simply iterating through it, I get
Not sure whether I'm missing some additional library besides dill, h5py and libhdf5, or if there is still some bug in there.
EDIT: see the attached images about how HDF5 compass reads the file. If you like I can provide the file to you; there is nothing really fancy or top secret inside.
NOTE:
Before, the data was stored inside a numpy .npz file, where one can pass all variables as parameters to
EDIT: On debugging
Loading files from previous versions should also be possible in version 4, at least for testing purposes, because my files are produced by production code, where I do not like to change the production environment.
ps.: Why are you using try/except for checking whether the version is a single number or in Major.Minor.Patch format? Why are you not simply using
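As a sketch of the alternative the postscript hints at, here are both styles of version-format check side by side; the helper names are hypothetical and not hickle's actual code:

```python
# Two ways to decide whether a version string is a bare number or in
# "Major.Minor.Patch" format (illustrative helpers only).

def is_full_version_split(version):
    # The "simple" check the comment hints at: count dot-separated parts.
    parts = version.split(".")
    return len(parts) == 3 and all(p.isdigit() for p in parts)

def is_full_version_try(version):
    # The try/except style being questioned: a bare number parses as int.
    try:
        int(version)
        return False
    except ValueError:
        return True

print(is_full_version_split("4.0.0"))  # -> True
print(is_full_version_split("3"))      # -> False
print(is_full_version_try("4.0.0"))    # -> True
```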
@hernot As I mentioned earlier, release v4 will be incompatible with v3. And yes, loading files of previous versions will be possible in v4. EDIT: I see now that I may have been a bit confusing here.
You are kidding. Refusing to read V3 files with V4, so as to at least be able to recode/rehickle old V3 data to the new V4 format so that it stays readable and usable, is a no-go for me, especially as my data is science data and I have the constraint that data stored in any format must still be readable years later. In other words: the only reason I would switch to hickle instead of digging into h5py directly is that I can use it like pickle and unpickle, but with the advantage that I can export the content, without any further change to the file, to C/C++/C#, Perl and other languages, as bindings for the HDF5 library exist for all of them. And this is also my main reason for requesting support for
If reading V3 is not the long-term goal, then initially issue a deprecation warning, or a warning that support for V3 and older hickle formats will be removed in the future and that rehickling to the V4 format is recommended. That I consider the more sensible solution.
I think you misunderstood what I meant, or I am just horribly confusing. However, any file made with v3 that was made incorrectly, like one that attempted to store objects of classes with the signature you posted here, will obviously not be readable with v4. It is this last point that I was asking you to check, to make sure I implemented it correctly: can the
Sorry for the confusion.
Hm, not sure if I'm getting you right. Which measures would you suggest taking in order to avoid unnecessary problems in the migration from hickle 3.X to 4.X in the future? EDIT:
@hernot Any file that was made with v3 and can be properly loaded with v3 will be loaded properly with v4 as well, once v4 gets released.
What I was asking was whether you could check that you can both dump and load objects of classes that have the structure you described in the first post, using v4 (the version that is currently on
Just as a reminder: v4.0.0 is not released yet, which is why it is on the
Hm, OK; what I can do is make artificial test data instead of testing with the real data I initially intended to use. Will try to do that tonight, possibly.
EDIT: After fixing my outdated copies of my sources to best match production, everything seems OK.
Thus another minor issue for now, especially as the hickled data set is about 400 MB in HDF5 file size. But when dumping, the resulting V4 file still contains objects as pickle strings and not as HDF5 groups representing the object state dictionary. Looking at the code, neither
Or is it simply that adding code for this request is still pending?
@hernot Thanks for checking that for me. Now that you have confirmed that it indeed works properly, I can take a look at storing the actual state in the HDF5-file and write a routine that can recreate the object properly. When I have a solution for it, I will let you know, so you can maybe rerun the tests for me. PS: Keep in mind that the
Unless you still plan major changes to how things are handled, I would suggest simply keeping the order of the tests, which are currently
I would naively just insert the following test between steps 2 and 3:
On loading I would check whether a 'loadstate' or similar attribute is present, and if so would
At least that would have been my first naive approach. Whether it would have led me anywhere, I can't tell.
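The naive dump/load steps sketched above could look roughly like this. All names, including the 'loadstate' marker, follow the description and are purely illustrative; note that on Python 3.11+ every object inherits object.__getstate__, so a real check would need to be stricter.

```python
import pickle

def naive_dump(obj):
    # The proposed check between steps 2 and 3: copy-protocol support?
    if hasattr(obj, "__getstate__") and hasattr(obj, "__setstate__"):
        return {
            "loadstate": True,                  # the marker attribute
            "module": type(obj).__module__,     # enough to re-import later
            "class": type(obj).__qualname__,
            "state": obj.__getstate__(),
        }
    return pickle.dumps(obj)                    # fall back to a pickle string

def naive_load(node, cls):
    # If the marker is present, rebuild the instance from the state dict.
    # (A real loader would import cls from node["module"]/node["class"].)
    if isinstance(node, dict) and node.get("loadstate"):
        obj = cls.__new__(cls)
        obj.__setstate__(node["state"])
        return obj
    return pickle.loads(node)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getstate__(self):
        return self.__dict__.copy()
    def __setstate__(self, state):
        self.__dict__.update(state)

p = naive_load(naive_dump(Point(1, 2)), Point)
print(p.x, p.y)  # -> 1 2
```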
Alright, so I have looked into all of the specifics and such, and I am finding it quite hard to figure out how this could be useful in more than a select few cases.
So, the way that
The problem with this is that
Having said all of that, I think your request is the following: if a class instance has the
If so, then here is my problem with it: it basically means reimplementing about everything that
The point here is that saving this attributes dict (or the state, if it can be hickled properly without serializing it) as an actual dict in the HDF5-file is almost never going to be useful.

>>> import numpy as np
>>> array = np.array([1, 2, 3])
>>> array.__reduce__()
(<function numpy.core.multiarray._reconstruct>,
(numpy.ndarray, (0,), b'b'),
(1,
(3,),
dtype('int32'),
False,
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'))

The first two items state what function must be called, and with what arguments, to create the initial array; the third item is the state that is handed to __setstate__. I hope you can agree with me that hickling the state of this very simple NumPy array (or its entire reduced state) is in no way more readable than just saving the individual values.
So, awaiting @telegraphic's opinion on this matter, I suggest you do the following:
See my comment above. And from my experience, pickle, or at least some of its look-alikes such as jsonpickle, does not pickle the class declaration itself at all (afaik). I too would not go that far and overdo things. I would start with a subset which can be handled by hickle with the fewest changes. In other words, I consider this request way more than fulfilled if hickle would call
The only two cases I would handle in addition would be
@hernot I fully understand that that is what you would like to see, but the problem is that that is not expected behavior for the average user. As I stated above, if you want what you are asking for, you can do this quite simply using a few lines of code.
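The "few lines of code" are not quoted in this copy of the thread, but a workaround along those lines might look like this; dump and load stand in for hickle.dump and hickle.load so the sketch stays self-contained, and Sensor is a hypothetical class:

```python
# Hand the dumper the state dict instead of the object itself, and
# rebuild the instance from the loaded dict afterwards.

def dump_state(obj, dump):
    dump(obj.__getstate__())      # e.g. dump = lambda d: hickle.dump(d, fname)

def load_state(cls, load):
    obj = cls.__new__(cls)        # empty instance, __init__ not called
    obj.__setstate__(load())      # e.g. load = lambda: hickle.load(fname)
    return obj

class Sensor:
    def __init__(self, gain):
        self.gain = gain
    def __getstate__(self):
        return {"gain": self.gain}
    def __setstate__(self, state):
        self.gain = state["gain"]

# Usage, with an in-memory dict standing in for the HDF5 file:
store = {}
dump_state(Sensor(3.5), lambda d: store.update(d))
restored = load_state(Sensor, lambda: store)
print(restored.gain)  # -> 3.5
```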
That is likely.
Step 1 (in a first release): add support for dumping class states and recreating instances by loading the state into class instances created from classes imported from external third-party modules.
Step 2 (in one of the first minor releases): add support for dumping class definitions as such and restoring them as such, independent of anything else, thereby allowing users to call hickle.dump(class, filename) and class = hickle.load(file).
Step 3 (in one of the minor releases following the Step 2 release): try to restore the class definition and declaration from the dumped description and initialize its instance with the stored state, with a fallback to importing from the module stored elsewhere, as introduced in Step 1.
That would IMHO render things a lot cleaner and better manageable. But that is the approach I would choose. And calling
(rotfl) (facepalm) (rotfl) (nirg) Why am I thinking so complicated? Why reinvent the wheel? Let the built-in machinery set up by copy, pickle and copyreg.dispatch_table do most of the work, and just walk the resulting reduce tuples. They contain all the information required to reconstruct what the user requested and expects back. Step by step, starting from use cases which are available either because they are built into Python or because real-life (or close to real-life) examples are at hand, using the following approach (most of it is already implemented in hickle; some will need fine-tuning and extension over time):
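Walking the reduce tuple, as proposed here, can be illustrated by replaying the numpy example from the earlier comment by hand:

```python
import numpy as np

arr = np.array([1, 2, 3])

# __reduce__ yields (reconstruct_function, creation_args, state).
func, args, state = arr.__reduce__()

restored = func(*args)        # create the bare placeholder array
restored.__setstate__(state)  # hand the state tuple back, as pickle would

print(restored)  # -> [1 2 3]
```

A loader could walk such tuples recursively, hickling each element it knows how to map to HDF5 instead of serializing the whole tuple.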
Most of this is anyway already implemented in hickle: instead of manually figuring out how to convert content into something serializable, use the existing copy and pickle machinery encoded in
And yes, there remain special, or rather new, cases; they can be handled when appropriate use cases are provided by users. I would meanwhile introduce some memoization of already hickled objects (like
How could I miss that when reading the description of the pickle module, and not challenge it by simply calling
And concerning the pickling of class objects, function objects etc., roughly the last three paragraphs under the following link indicate how pickle handles them: they are basically non-picklable and are just stored as named, or fully qualified named, references.
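A minimal sketch of the memoization idea, modelled on pickle's memo; purely illustrative, not hickle's implementation:

```python
def dump_with_memo(obj, memo=None):
    # Remember every object by id() on first encounter; afterwards emit a
    # reference node instead of dumping the same object again.
    if memo is None:
        memo = {}
    key = id(obj)
    if key in memo:
        return {"ref": memo[key]}          # back-reference to earlier node
    memo[key] = len(memo)                  # assign the next memo slot
    if isinstance(obj, list):
        return {"id": memo[key],
                "items": [dump_with_memo(item, memo) for item in obj]}
    return {"id": memo[key], "value": obj}

shared = [1, 2]
dumped = dump_with_memo([shared, shared])
print(dumped["items"][1])  # -> {'ref': 1}
```

This is how pickle avoids writing shared or self-referential objects twice; an HDF5 analogue could emit a hard or soft link instead of the {'ref': ...} node.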
@hernot Although I appreciate all the thought you have put into it, my decision wasn't based on the difficulty of the implementation.
@1313e and @hernot thanks, I'll need to find some free time to devote to this; clearly there are some subtleties that need to be considered. I will say that this project started off with a limited set of supported types and followed a pretty simple rule: "if this specific type, dump it to file using this function". At first glance this request seems in essence to ask for some level of duck typing (if it has
I think the two of us had a pretty thorough discussion that you can read. :)
Hi, just to catch up and not miss anything.
Ah, yes, this was closed automatically when doing the merge; I will reopen. Given the amount of discussion above, I'd say this is certainly outside the scope of v4.0.0, which already has major changes. To summarise my understanding of the issue: custom classes are currently pickled and stored as a serialised string. This is hard to read in other languages via the HDF5 API. I see three possible solutions:
I prefer the second or third approach (which may not be mutually exclusive).
I linked it, as I did solve the issue of classes implementing this structure not being able to be dumped and loaded properly. @telegraphic My opinion on this matter is that something like this should not be added to
@1313e @telegraphic I followed the release process of hickle 4.0.0 a bit and played with how this could be added within the clean and lean concept of containers and datasets, exporters and loaders. I came to the conclusion that I agree with you, @1313e: adding this by brute force would just re-add the highly specialized and difficult-to-maintain functions and classes to hickle which release 4 just got rid of. But by pushing this special issue aside for the moment and asking how it could be implemented sticking to containers and datasets, exporters and loaders only, and what would be required from the hickle side to achieve this, I believe I stumbled over something which could be more general than just this topic, and could for example also open a route to issue #133 and others. Therefore I suggest to
Till then, I'm perfectly OK if this issue stays on hold or closed.
@hernot Sounds good, looking forward to it.
With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling the hickle core machinery from object-specific code included all objects and structures which were mappable to h5py.Dataset objects.

This commit provides an implementation of hickle extension proposal H4EP001 (telegraphic#135). This proposal specifies the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer-based and mixed loaders. In addition to the proposed extension, this implementation includes the following extensions to hickle 4.0.0 and H4EP001.

H4EP001:
========
The PyContainer interface includes a filter method which allows loaders, when data is loaded, to adjust, suppress, or insert additional data subitems of h5py.Group objects. In order to accomplish the temporary modification of h5py.Group and h5py.Dataset objects when the file is opened in read-only mode, the H5NodeFilterProxy class is provided. This class stores all temporary modifications while the original h5py.Group and h5py.Dataset objects stay unchanged.

hickle 4.0.0 / 4.0.1:
=====================
Strings and arrays of bytes are stored as Python bytearrays and not as variable-sized strings and bytes. The benefit is that hdf5 filters and hdf5 compression filters can be applied to Python bytearrays. The downside is that the data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings.

Extends the pickle loader create_pickled_dataset function to support the Python copy protocol as proposed by issue telegraphic#125. For this, a dedicated PickledContainer is implemented to handle all objects which have been stored using the Python copy protocol.

numpy masked arrays are now stored as an h5py.Group containing a dedicated dataset each for data and mask.

scipy.sparse matrices are now stored as an h5py.Group containing the datasets data, indices, indptr and shape.

Dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NoneType keys are converted to name strings; for all other keys a key-value-pair group is created containing the key and value as its subitems.

String and bytes keys which contain slashes are converted into key-value pairs instead of converting slashes to backslashes. Distinction from hickle 4.0.0 string and bytes keys with converted slashes is made by enclosing the string value within double quotes instead of the single quotes produced by the Python repr function or the !r and %r string format specifiers. Consequently, on load, all string keys which are enclosed in single quotes will be subjected to slash conversion, while any others will be used as they are.

h5py.Group and h5py.Dataset objects whose 'base_type' refers to 'pickle' automatically get assigned object as their py_object_type on load. The related 'type' attribute is ignored. h5py.Group and h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to either contain a pickle string or conform to the copy protocol and thus implicitly get assigned the 'pickle' base type. Thus, on dump, the 'base_type' and 'type' attributes are omitted for all h5py.Group and h5py.Dataset objects which contain pickle strings or conform to the Python copy protocol, as their values are 'pickle' and object respectively.

Other stuff:
============
Full separation between hickle core and loaders.
Distinct unit tests for individual loaders and hickle core.
Cleanup of functions and classes that are no longer required.
Simplification of recursion on dump and load through a self-contained loader interface.
Is capable of loading hickle 4.0.x files, which do not yet support the PyContainer concept beyond list, tuple, dict and set.
Includes extended tests of loading hickle 4.0.x files.
Contains a fix for the lambda py_obj_type issue on numpy arrays with single non-list/tuple object content: Python 3.8 refuses to unpickle the lambda function string. This was observed while finalizing the pull request. The fixes are only activated when a 4.0.x file is to be loaded.
Exceptions thrown by load now include the triggering exception, including its stack trace, for better localization of errors in debugging and error reporting.
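A pure-Python sketch of the dictionary-key handling described in the commit message; illustrative only, it mimics the described rules rather than hickle's actual code:

```python
def key_to_name(key):
    # Return a name string for the key, or None when a key-value-pair
    # group would be needed instead (per the rules described above).
    if isinstance(key, (str, bytes)):
        text = key if isinstance(key, str) else key.decode("latin-1")
        if "/" in text:
            return None            # slashes would split the HDF5 path
        # Double quotes mark plain string keys so they can be told apart
        # from single-quoted repr output written by older files.
        return '"{}"'.format(key) if isinstance(key, str) else repr(key)
    if isinstance(key, (int, float, complex, bool)) or key is None:
        return repr(key)
    return None                    # tuples, objects, ... -> key-value pair

print(key_to_name("plain"))  # -> "plain"
print(key_to_name(42))       # -> 42
print(key_to_name("a/b"))    # -> None
```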
With hickle 4.0.0 the code for dumping and loading dedicated objects such as scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling the hickle core machinery from object-specific code covered all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementation of hickle extension proposal H4EP001 (telegraphic#135), which specifies the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer-based and mixed loaders. In addition to the proposed extension, this implementation includes the following extensions to hickle 4.0.0 and H4EP001.

H4EP001:
========
The PyContainer interface includes a filter method which allows loaders, when data is loaded, to adjust, suppress, or insert additional data subitems of h5py.Group objects. In order to accomplish the temporary modification of h5py.Group and h5py.Dataset objects while the file is opened in read-only mode, the H5NodeFilterProxy class is provided. This class stores all temporary modifications while the original h5py.Group and h5py.Dataset objects stay unchanged.

hickle 4.0.0 / 4.0.1:
=====================
Strings and arrays of bytes are stored as Python bytearrays and not as variable-sized strings and bytes. The benefit is that hdf5 filters and hdf5 compression filters can be applied to Python bytearrays. The downside is that the data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings.

The pickle loader's create_pickled_dataset function is extended to support the Python copy protocol as proposed by issue telegraphic#125. For this a dedicated PickledContainer is implemented to handle all objects which have been stored using the Python copy protocol.

numpy masked arrays are now stored as an h5py.Group containing a dedicated dataset each for data and mask.

scipy.sparse matrices are now stored as an h5py.Group containing the datasets data, indices, indptr and shape.

Dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NoneType keys are converted to name strings; for all other keys a key-value-pair group is created containing the key and the value as its subitems. String and bytes keys which contain slashes are converted into key-value pairs instead of converting the slashes to backslashes. The distinction from hickle 4.0.0 string and bytes keys with converted slashes is made by enclosing the string value within double quotes instead of the single quotes produced by the Python repr function or the !r and %r string format specifiers. Consequently, on load all string keys enclosed in single quotes are subjected to slash conversion, while any others are used as-is.

h5py.Group and h5py.Dataset objects whose 'base_type' attribute refers to 'pickle' automatically get assigned object as their py_obj_type on load; the related 'type' attribute is ignored. h5py.Group and h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to either contain a pickle string or conform to the copy protocol and thus implicitly get assigned the 'pickle' base type. Consequently, on dump the 'base_type' and 'type' attributes are omitted for all h5py.Group and h5py.Dataset objects which contain pickle strings or conform to the Python copy protocol, as their values would be 'pickle' and object respectively.

Other stuff:
============
Full separation between hickle core and loaders.
Distinct unit tests for individual loaders and hickle core.
Cleanup of no-longer-required functions and classes.
Simplification of recursion on dump and load through a self-contained loader interface.

The loader is capable of loading hickle 4.0.x files, which do not yet support the PyContainer concept beyond list, tuple, dict and set, and includes extended tests for loading hickle 4.0.x files. It contains a fix for the lambda py_obj_type issue on numpy arrays with a single non-list/tuple object content: Python 3.8 refuses to unpickle the lambda function string. This was observed while finalizing the pull request; the fixes are only activated when a 4.0.x file is to be loaded. Exceptions thrown by load now include the triggering exception with its stack trace, for better localization of the error during debugging and error reporting.
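The dict-key naming rule described above can be sketched roughly as follows. This is a hypothetical helper for illustration only, not hickle's actual function: simple scalar keys become group/dataset name strings (string keys wrapped in double quotes to distinguish them from 4.0.0-style single-quoted repr output), while slash-containing strings and all other key types signal that a key-value-pair group is needed instead.

```python
def key_to_name(key):
    """Sketch of the dict-key naming rule (hypothetical helper).

    Returns a name string for scalar keys, or None when the key
    must be stored as a key-value-pair group instead.
    """
    if isinstance(key, (str, bytes)):
        text = key.decode("utf-8", "backslashreplace") if isinstance(key, bytes) else key
        if "/" in text:
            return None  # slash in key: store as key-value-pair group
        # double quotes mark the new-style string keys; old 4.0.0 keys
        # used single quotes as produced by repr()
        return '"%s"' % text
    if isinstance(key, (int, float, complex, bool)) or key is None:
        return repr(key)
    return None  # non-scalar key: key-value-pair group
```

For example, `key_to_name("a/b")` yields no name string, so such a key would be stored as a key-value-pair group rather than by mangling the slash.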
While working on pull request #138 and some preliminary proof-of-concept work for issue #139, I had to accept that the HDF5 data format is not really designed for storing dict, list and tuple structures containing vast amounts of heterogeneous data, especially when they are organized in a rather chaotic, huge tree-like structure. I summarized my findings in pull request #138. Consequently, as any other solution yields more appropriate results than a naive implementation of copy protocol support within hickle, I hereby close this issue in favour of H4EP003 (issue #145).
* no copy when loading astropy
* typo
* Bumping to 3.4.4
* Package license and requirements files
* Bump to 3.4.5
* Added some tests to Travis CI and AppVeyor that check if hickle can be properly packaged up, distributed and installed.
* And maybe I should not forget twine in the requirements_test.txt.
* Moved the tests directory from root to ./hickle.
* Added missing tests folder to the MANIFEST.in file.
* Add dict-type permanently to types_dict.
* Subclasses of supported types can now be pickled as well (although not yet with their proper type).
* Removed all cases of saving dtypes as a single element list.
* Renamed the 'type' attribute to 'base_type' in preparation for adding subclass support.
* Also make sure that strings are saved as single elements.
* The types_dict now uses tuples with the create-functions and the hkl_dtype key.
* All create-functions now take an extra 'base_type' string that describes what the hkl_type is that will be saved to HDF5.
* Groups also obtain their base_type from the create_dataset_lookup() function now.
* The actual type of a hickled object is now saved as well (in pickled form).
* Finalized implementing support for subclasses.
* Coveralls -> codecov.io
* Add codecov.io badge
* The order of the dict item keys is now saved as well. Any dict that is loaded will be initialized with the items sorted in that order. For all types that derive from dict, the dict will be initialized using its true type directly (apparently, I had written it that way before already for some reason). This fixes telegraphic#65.
* Hickle's required HDF5 attributes are now solely applied to the data group that contains the hickled Python object, instead of the entire file (this allows hickled objects to be more easily added to already existing HDF5 files, without messing up the root of that file).
* Datasets and groups now solely use 'group_xxx' if there is more than a single item at that level. All 'create_xxx' functions are now passed a string containing the name of the group/dataset that they should create.
* Added forgotten base_key_type attribute to PyContainer.
* Reverted working tree back to before using 'track_order'. Added missing 'six' requirement. Added a test for testing the dumping and loading of an OrderedDict.
* The root of a hickled group is no longer read now, as it is not necessary. Removed the auxiliary attributes that were required for reading it. The true type of a dict key is now saved as well, even though it is not used for anything (simply saving it now in case we want to use it later).
* The version is now stored using a single Python file, whose strings are read using regex.
* HDF5 groups can now be given as a file_obj as well when dumping and loading. Providing a path that does not start with '/' will automatically add it to it. Added tests for these functionalities.
* Arbitrary-precision integers can now be dumped and loaded properly.
* Also make sure that 'long' dtypes are taken into account on Python 2.
* Make hickle work with pathlib.Path. Basically, any package/module that saves to file supports this too (including h5py).
* Make Python 2 compatible.
* Changed wording.
* Added six requirement, and added minimum versions for all requirements.
* Now 'dill' is always used for dumping and loading serialized objects. Added a test for dumping and loading a local session function.
* Add support for actually loading serialized data in Python 2.
* Add new test to main as well.
* Make sure new changes are also used for Python 2.
* Update file_opener re telegraphic#123.
* Fixed documentation of dump and load to be NumPy doc style (and look a bit better). Replaced broken pickle documentation link with a proper one.
* Only lists and tuples are now recognized as acceptable iterables. All other iterables are either handled separately, or simply pickled.
* Changed the lookup system to use object typing for more consistency.
* Added test for detecting the problem raised in telegraphic#125.
* Added support for hickling dicts with slashes in their dict keys.
* Make sure that functional backslashes still work properly in dict keys.
* Loaders are now only loaded when they are required for dumping or loading a specific object.
* Make sure to do proper future import.
* Raise an error if a dict item key contains a double backslash.
* Only filter out import errors due to the loader not being found.
* As Python 2 apparently only reports the last part of the name of a non-importable module, search for something a bit more specific.
* Some small QoL changes.
* The py_type of a pickled object is no longer saved to HDF5, as it is not necessary to restore the object's original state.
* Removed legacy support for v1 and v2. Added start of legacy support for v3. v4 now stores its HDF5 attributes using a 'HICKLE_' prefix, to allow users to add attributes to the group without interference.
* Objects can now be hickled to existing HDF5 groups, as long as they don't contain any datasets or groups themselves.
* Made sure that v3 always uses relative imports, to make it easier to port any functionality change from v3 to v4. (Even though I am not a fan of relative imports.)
* The version is now stored using a single Python file, whose strings are read using regex.
* Backported change to version store location to v3 as well. Bumped version to 3.4.7 to include the latest changes.
* Removed support for Python 2. Added legacy support for hickle v3 (currently uses v3.4.7).
* Remove testing for Python 2.7 as well.
* Always specify the mode with which to open an HDF5 file.
* Test requirements updates.
* And make sure to always execute 'pytest'.
* Removed basically everything that has to do with Python 2.7 from v4. As legacy_v3 is expected to be able to load files made with Python 2.7, these are not changed.
* Many, many QoL changes. Converted all v4 files to be PEP8 compliant. Rewritten 'load_python3' into 'load_builtins'. The 'load_numpy' and 'load_builtins' modules are only loaded when they are required, like all other loaders. Removed the 'containers_type_dict' as the true container type is already saved anyway. Astropy's classes are now reinitialized using their true type instantly. Astropy constants can now be properly dumped and loaded.
* Save types of dict keys as a normal string as well.
* Some minor improvements.
* Added test for opening binary files, and make sure that any 'b' is removed from the file mode. (telegraphic#131)
* Added pytests for all uncovered lines. Removed lines that are never used (and thus cannot be covered). Added 'NoneType' to the dict of acceptable dict key types, as Nones can be used as keys.
* Replaced all instances of 'a HDF' with 'an HDF'.
* Removed the index.md file in the docs, as it is now simply pointing to the README.md file instead.
* Badges!
* Added few classifiers to the setup.py file.
* Update requirement for pytest.
* Removed use of 'track_times'.
* NumPy arrays with unicode strings can now be properly dumped and loaded.
* NumPy arrays containing non-NumPy objects can now be properly dumped and loaded.
* Added all missing tests for full 100% coverage!!!
* Make sure that kwargs is always passed to 'create_dataset'.
* If hickle fails to save an object to HDF5 using its standard methods, it will fall back to pickling it (and emits a warning saying why that was necessary). Simplified the way in which Python scalars are hickled.
* Actually mention that the line in parentheses is the reason for serializing.
* Use proper development status identifier.
* Make sure to preserve the subclass type of a NumPy array when dumping.
* Make sure that a SkyCoord object will be properly saved and retrieved when it's a scalar or N-D array.
* Revert the change to legacy_v3/load_astropy.py.
* Updated legacy v3 to v3.4.8.

Co-authored-by: Kewei Li <kewl@microsoft.com>
Co-authored-by: Danny Price <dan@thetelegraphic.com>
Co-authored-by: Isuru Fernando <isuruf@gmail.com>
Co-authored-by: Ellert van der Velden <ellert_vandervelden@outlook.com>
Co-authored-by: Bas Nijholt <basnijholt@gmail.com>
Co-authored-by: Rui Xue <rx.astro@gmail.com>
[Suggestion]
I have several complex classes which, in order to be pickled by different pickle replacements such as jsonpickle and others, implement the
__getstate__
and
__setstate__
methods. Besides being copyable for free using copy.copy and copy.deepcopy, pickling is quite straightforward. The above example is very simplified, removing anything unnecessary.
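A minimal example of such a class (the Sensor class and its attributes are invented for illustration, not taken from my actual code). __getstate__ returns a plain dict describing the instance, __setstate__ restores it, and both copy.deepcopy and pickle pick the two methods up automatically:

```python
import copy
import pickle

class Sensor:
    """Toy class implementing the copy protocol."""
    def __init__(self, name, samples):
        self.name = name
        self.samples = list(samples)
        self._cache = None  # derived data, not worth persisting

    def __getstate__(self):
        # Only the reconstructible state goes into the dict.
        return {"name": self.name, "samples": self.samples}

    def __setstate__(self, state):
        self.name = state["name"]
        self.samples = state["samples"]
        self._cache = None  # rebuilt lazily after restore

s = Sensor("probe-1", [1, 2, 3])
clone = copy.deepcopy(s)                    # goes through __getstate__/__setstate__
restored = pickle.loads(pickle.dumps(s))    # so does pickling
```

Note that the state dict returned by __getstate__ is exactly the kind of structure hickle already knows how to map onto an h5py.Group.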
Currently these classes are hickled with the warning that the object is not understood, as a test for whether
__setstate__
and
__getstate__
are implemented is missing. Admittedly, both are handled by the pickle fallback, but the class ends up as a pickled string instead of a dataset, making it quite tedious to extract from the HDF5 file on a non-Python end such as C# or other languages. Therefore I suggest adding a test for both methods being defined, and storing the class as a class state dictionary instead of a pickled string. This would need some flag or other means of indicating that the dict represents the result of
<class>.__getstate__
and not a plain Python dictionary. The test should be run after testing for numpy data and before testing for Python iterables, as the above class appears to be iterable but isn't.

ADDENDUM:
If somebody guides me through it, I would attempt to add the appropriate test and conversion function. But I would at least need guidance on which existing methods would be best suited as template and inspiration, and which parts and sections of the h5py manual, the HDF5 spec and the other hickle contributor documentation to read and understand carefully before touching anything.
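The reconstruction step proposed above can be sketched as follows. This is a hypothetical helper, not existing hickle code, and the Point class is an invented demo: given the module and class names stored as metadata on the state-dict group, the class is imported, a bare instance is created without calling __init__, and the unhickled state dict is handed to __setstate__:

```python
import importlib

class Point:
    """Demo class: restored from a state dict without calling __init__."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getstate__(self):
        return {"x": self.x, "y": self.y}

    def __setstate__(self, state):
        self.x, self.y = state["x"], state["y"]

def restore_from_state(module_name, class_name, state):
    """Hypothetical unhickling step for a copy-protocol state dict."""
    cls = getattr(importlib.import_module(module_name), class_name)
    obj = cls.__new__(cls)   # bare instance, __init__ is not called
    obj.__setstate__(state)  # state dict was unhickled from the group
    return obj

p = restore_from_state(__name__, "Point", {"x": 1, "y": 2})
```

As always, the tricky part hides in the details: the metadata flag marking the group as a state dict, and the import of the named module, both need to happen before this step.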