support for python copy protocol __setstate__ __getstate__ if present in object #125
@hernot Thanks for the suggestion. However, I am not sure how exactly you would like this class to be hickled.
Like pickle does: check if the class implements the __getstate__ and __setstate__ methods.
with cls being the class for which the object has to be unhickled, and state being the already unhickled dictionary which, on hickling, was returned by the call to __getstate__.
The main difference between hickling a plain Python dictionary and a class implementing the copy protocol: on unhickling an HDF5 group representing a dictionary, it has to be checked whether the metadata/attributes identify it as the state dictionary of an instance of the specified class provided by the named module. If so, the module has to be imported and the object brought back to life using the lines above. At least that would be the rough way I would do it. The tricky stuff, as always, is hidden in the details. One last thought. The
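The reconstruction step the comment describes (the code lines themselves did not survive in this copy of the thread) would, following pickle's own behaviour, look roughly like this; Example, cls and state are illustrative stand-ins:

```python
# Minimal sketch of rebuilding an object from cls and its state dict,
# the way pickle does on unpickling. Example is a hypothetical class.
class Example:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __getstate__(self):
        return self.__dict__.copy()

    def __setstate__(self, state):
        self.__dict__.update(state)


cls = Example
state = {"a": 1, "b": 2}          # as returned earlier by __getstate__

# Create an empty instance without calling __init__, then restore state.
obj = cls.__new__(cls)
obj.__setstate__(state)
print(obj.a, obj.b)  # -> 1 2
```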
Yes, from the outside it is an iterable, but its inner workings are different; trying to hickle it as a plain iterable would therefore fail, as it would for
Alright, I will have a look this weekend (hopefully).
Not sure what the code of conduct is in your team, but if it would help I could make a fork and come up with a suggestion submitted through a pull request from my personal GH account. For this I would need to know how this is expected to happen, and which manuals, design documents, etc. to read and understand first before touching anything.
Normally, a PR would be highly appreciated; it would basically consist of adding the functionality and writing tests for it.
OK, if there is something in this process I could help with, let me know.
@hernot Looks like @1313e has this under control, but FYI we do have a code of conduct and a basic contribution guide. If there are more specifics you think would be helpful for contributing, please let us know and we can improve the contribution guide.
Hm, not sure if I can be helpful beyond this, because I expected nothing else, given that I consider myself rather experienced. The only thing I would add to the contributors' guide for beginners is to read the existing code and tests as more or less good best-practice examples for structuring and formatting the code and formulating the unit tests. But concerning me, I think I'm fine.
@hernot Can you check if you can properly use the
Sure. Any special things to check, or just general verification?
Check if you can use the version on the
And, there is no hurry, as @telegraphic has to go through all the changes I made for v4 in #117 before it will be merged.
Hm, I get the following error with a simple custom pytest trying to load a file created with the current release of hickle:
dill version '0.3.1.1'. When I try to inspect the root object by simply iterating through it, I get
Not sure whether I'm missing some additional library besides dill, h5py and libhdf5, or if there is still some bug in there.
EDIT: see the attached images about how HDF5 compass reads the file. If you like I can provide the file to you; there is nothing really fancy or top secret inside.
NOTE:
Before, the data was stored inside a numpy .npz file, where one can pass all variables as parameters to
EDIT: On debugging
Loading files from previous versions should also be possible in version 4, at least for testing purposes, because my files are produced by production code, where I do not like to change the production environment.
ps.: Why are you using try/except for checking whether the version is a single number or in Major.Minor.Patch format? Why are you not simply using
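As a sketch of the alternative the postscript hints at, here are both styles of version-format check side by side; the helper names are hypothetical and not hickle's actual code:

```python
# Two ways to decide whether a version string is a bare number or in
# "Major.Minor.Patch" format (illustrative helpers only).

def is_full_version_split(version):
    # The "simple" check the comment hints at: count dot-separated parts.
    parts = version.split(".")
    return len(parts) == 3 and all(p.isdigit() for p in parts)

def is_full_version_try(version):
    # The try/except style being questioned: a bare number parses as int.
    try:
        int(version)
        return False
    except ValueError:
        return True

print(is_full_version_split("4.0.0"))  # -> True
print(is_full_version_split("3"))      # -> False
print(is_full_version_try("4.0.0"))    # -> True
```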
@hernot As I mentioned earlier, release v4 will be incompatible with v3. And yes, loading files of previous versions will be possible in v4. EDIT: I see now that I may have been a bit confusing here.
You are kidding. Refusing to read V3 files with V4, so as to at least be able to recode/rehickle old V3 data to the new V4 format so that it stays readable and usable, is a no-go for me, especially as my data is science data and I have the constraint that data stored in any format must still be readable years later. In other words: the only reason I would switch to hickle instead of digging into h5py directly is that I can use it like pickle and unpickle, but with the advantage that I can export the content, without any further change to the file, to C/C++/C#, Perl and other languages, as bindings for the HDF5 library exist for all of them. And this is also my main reason for requesting support for
If reading V3 is not the long-term goal, then initially issue a deprecation warning, or a warning that support for V3 and older hickle formats will be removed in the future and that rehickling to the V4 format is recommended. That I consider the more sensible solution.
I think you misunderstood what I meant, or I am just horribly confusing. However, any file made with v3 that was made incorrectly, like one that attempted to store objects of classes with the signature you posted here, will obviously not be readable with v4. It is this last point that I was asking you to check, to make sure I implemented it correctly: can the
Sorry for the confusion.
Hm, not sure if I'm getting you right. Which measures would you suggest taking in order to avoid unnecessary problems in the migration from hickle 3.X to 4.X in the future? EDIT:
@hernot Any file that was made with v3 and can be properly loaded with v3 will be loaded properly with v4 as well, once v4 gets released.
What I was asking was whether you could check that you can both dump and load objects of classes that have the structure you described in the first post, using v4 (the version that is currently on
Just as a reminder: v4.0.0 is not released yet, which is why it is on the
Hm, OK; what I can do is make artificial test data instead of testing with the real data I initially intended to use. Will try to do that tonight, possibly.
EDIT: After fixing my outdated copies of my sources to best match production, everything seems OK.
Thus another minor issue for now, especially as the hickled data set is about 400 MB in HDF5 file size. But when dumping, the resulting V4 file still contains objects as pickle strings and not as HDF5 groups representing the object state dictionary. Looking at the code, neither
Or is it simply that adding code for this request is still pending?
@hernot Thanks for checking that for me. Now that you have confirmed that it indeed works properly, I can take a look at storing the actual state in the HDF5-file and write a routine that can recreate the object properly. When I have a solution for it, I will let you know, so you can maybe rerun the tests for me. PS: Keep in mind that the
Unless you still plan major changes to how things are handled, I would suggest simply keeping the order of the tests, which are currently
I would naively just insert the following test between steps 2 and 3:
On loading I would check whether a 'loadstate' or similar attribute is present, and if so would
At least that would have been my first naive approach. Whether it would have led me anywhere, I can't tell.
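The naive dump/load steps sketched above could look roughly like this. All names, including the 'loadstate' marker, follow the description and are purely illustrative; note that on Python 3.11+ every object inherits object.__getstate__, so a real check would need to be stricter.

```python
import pickle

def naive_dump(obj):
    # The proposed check between steps 2 and 3: copy-protocol support?
    if hasattr(obj, "__getstate__") and hasattr(obj, "__setstate__"):
        return {
            "loadstate": True,                  # the marker attribute
            "module": type(obj).__module__,     # enough to re-import later
            "class": type(obj).__qualname__,
            "state": obj.__getstate__(),
        }
    return pickle.dumps(obj)                    # fall back to a pickle string

def naive_load(node, cls):
    # If the marker is present, rebuild the instance from the state dict.
    # (A real loader would import cls from node["module"]/node["class"].)
    if isinstance(node, dict) and node.get("loadstate"):
        obj = cls.__new__(cls)
        obj.__setstate__(node["state"])
        return obj
    return pickle.loads(node)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getstate__(self):
        return self.__dict__.copy()
    def __setstate__(self, state):
        self.__dict__.update(state)

p = naive_load(naive_dump(Point(1, 2)), Point)
print(p.x, p.y)  # -> 1 2
```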
Alright, so I have looked into all of the specifics and such, and I am finding it quite hard to figure out how this could be useful in more than a select few cases.
So, the way that
The problem with this is that
Having said all of that, I think your request is the following: if a class instance has the
If so, then here is my problem with it: it basically means reimplementing about everything that
The point here is that saving this attributes dict (or the state, if it can be hickled properly without serializing it) as an actual dict in the HDF5-file is almost never going to be useful.

>>> import numpy as np
>>> array = np.array([1, 2, 3])
>>> array.__reduce__()
(<function numpy.core.multiarray._reconstruct>,
(numpy.ndarray, (0,), b'b'),
(1,
(3,),
dtype('int32'),
False,
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'))

The first two items state what function must be called, and with what arguments, to create the initial array; the third item is the state that is handed to __setstate__. I hope you can agree with me that hickling the state of this very simple NumPy array (or its entire reduced state) is in no way more readable than just saving the individual values.
So, awaiting @telegraphic's opinion on this matter, I suggest you do the following:
See my comment above. And from my experience, pickle, or at least some of its look-alikes such as jsonpickle, does not pickle the class declaration itself at all (afaik). I too would not go that far and overdo things. I would start with a subset which can be handled by hickle with the fewest changes. In other words, I consider this request way more than fulfilled if hickle would call
The only two cases I would handle in addition would be
@hernot I fully understand that that is what you would like to see, but the problem is that that is not expected behavior for the average user. As I stated above, if you want what you are asking for, you can do this quite simply using a few lines of code.
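The "few lines of code" are not quoted in this copy of the thread, but a workaround along those lines might look like this; dump and load stand in for hickle.dump and hickle.load so the sketch stays self-contained, and Sensor is a hypothetical class:

```python
# Hand the dumper the state dict instead of the object itself, and
# rebuild the instance from the loaded dict afterwards.

def dump_state(obj, dump):
    dump(obj.__getstate__())      # e.g. dump = lambda d: hickle.dump(d, fname)

def load_state(cls, load):
    obj = cls.__new__(cls)        # empty instance, __init__ not called
    obj.__setstate__(load())      # e.g. load = lambda: hickle.load(fname)
    return obj

class Sensor:
    def __init__(self, gain):
        self.gain = gain
    def __getstate__(self):
        return {"gain": self.gain}
    def __setstate__(self, state):
        self.gain = state["gain"]

# Usage, with an in-memory dict standing in for the HDF5 file:
store = {}
dump_state(Sensor(3.5), lambda d: store.update(d))
restored = load_state(Sensor, lambda: store)
print(restored.gain)  # -> 3.5
```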
That is likely.
Step 1 (in a first release): add support for dumping class states and recreating instances by loading the state into class instances created from classes imported from external third-party modules.
Step 2 (in one of the first minor releases): add support for dumping class definitions as such and restoring them as such, independent of anything else, thereby allowing users to call hickle.dump(class, filename) and class = hickle.load(file).
Step 3 (in one of the minor releases following the Step 2 release): try to restore the class definition and declaration from the dumped description and initialize its instance with the stored state, with a fallback to importing from the module stored elsewhere, as introduced in Step 1.
That would IMHO render things a lot cleaner and better manageable. But that is the approach I would choose. And calling
(rotfl) (facepalm) (rotfl) (nirg) Why am I thinking so complicated? Why reinvent the wheel? Let the built-in machinery set up by copy, pickle and copyreg.dispatch_table do most of the work, and just walk the resulting reduce tuples. They contain all the information required to reconstruct what the user requested and expects back. Step by step, starting from use cases which are available either because they are built into Python or because real-life (or close to real-life) examples are at hand, using the following approach (most of it is already implemented in hickle; some will need fine-tuning and extension over time):
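Walking the reduce tuple, as proposed here, can be illustrated by replaying the numpy example from the earlier comment by hand:

```python
import numpy as np

arr = np.array([1, 2, 3])

# __reduce__ yields (reconstruct_function, creation_args, state).
func, args, state = arr.__reduce__()

restored = func(*args)        # create the bare placeholder array
restored.__setstate__(state)  # hand the state tuple back, as pickle would

print(restored)  # -> [1 2 3]
```

A loader could walk such tuples recursively, hickling each element it knows how to map to HDF5 instead of serializing the whole tuple.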
Most of this is anyway already implemented in hickle: instead of manually figuring out how to convert content into something serializable, use the existing copy and pickle machinery encoded in
And yes, there remain special, or rather new, cases; they can be handled when appropriate use cases are provided by users. I would meanwhile introduce some memoization of already hickled objects (like
How could I miss that when reading the description of the pickle module, and not challenge it by simply calling
And concerning the pickling of class objects, function objects etc., roughly the last three paragraphs under the following link indicate how pickle handles them: they are basically non-picklable and are just stored as named, or fully qualified named, references.
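A minimal sketch of the memoization idea, modelled on pickle's memo; purely illustrative, not hickle's implementation:

```python
def dump_with_memo(obj, memo=None):
    # Remember every object by id() on first encounter; afterwards emit a
    # reference node instead of dumping the same object again.
    if memo is None:
        memo = {}
    key = id(obj)
    if key in memo:
        return {"ref": memo[key]}          # back-reference to earlier node
    memo[key] = len(memo)                  # assign the next memo slot
    if isinstance(obj, list):
        return {"id": memo[key],
                "items": [dump_with_memo(item, memo) for item in obj]}
    return {"id": memo[key], "value": obj}

shared = [1, 2]
dumped = dump_with_memo([shared, shared])
print(dumped["items"][1])  # -> {'ref': 1}
```

This is how pickle avoids writing shared or self-referential objects twice; an HDF5 analogue could emit a hard or soft link instead of the {'ref': ...} node.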
@hernot Although I appreciate all the thought you have put into it, my decision wasn't based on the difficulty of the implementation.
@1313e and @hernot thanks, I'll need to find some free time to devote to this; clearly there are some subtleties that need to be considered. I will say that this project started off with a limited set of supported types and followed a pretty simple rule: "if this specific type, dump it to file using this function". At first glance this request seems in essence to ask for some level of duck typing (if it has
I think the two of us had a pretty thorough discussion that you can read. :)
Hi, just to catch up and not miss anything.
Ah, yes, this was closed automatically when doing the merge; I will reopen. Given the amount of discussion above, I'd say this is certainly outside the scope of v4.0.0, which already has major changes. To summarise my understanding of the issue: custom classes are currently pickled and stored as a serialised string. This is hard to read in other languages via the HDF5 API. I see three possible solutions:
I prefer the second or third approach (which may not be mutually exclusive).
I linked it, as I did solve the issue of classes implementing this structure not being able to be dumped and loaded properly. @telegraphic My opinion on this matter is that something like this should not be added to
@1313e @telegraphic I followed the release process of hickle 4.0.0 a bit and played with how this could be added within the clean and lean concept of containers and datasets, exporters and loaders. I came to the conclusion that I agree with you, @1313e: adding this by brute force would just re-add the highly specialized and difficult-to-maintain functions and classes to hickle which release 4 just got rid of. But by pushing this special issue aside for the moment and asking how it could be implemented sticking to containers and datasets, exporters and loaders only, and what would be required from the hickle side to achieve this, I believe I stumbled over something which could be more general than just this topic, and could for example also open a route to issue #133 and others. Therefore I suggest to
Till then, I'm perfectly OK if this issue stays on hold or closed.
@hernot Sounds good, looking forward to it.
With hickle 4.0.0 the code for dumping and loading dedicated objects like scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling the hickle core machinery from object-specific code included all objects and structures which were mappable to h5py.Dataset objects.

This commit provides an implementation of hickle extension proposal H4EP001 (telegraphic#135). This proposal specifies the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer-based and mixed loaders. In addition to the proposed extension, this implementation includes the following extensions to hickle 4.0.0 and H4EP001.

H4EP001:
========
The PyContainer interface includes a filter method which allows loaders, when data is loaded, to adjust, suppress, or insert additional data subitems of h5py.Group objects. In order to accomplish the temporary modification of h5py.Group and h5py.Dataset objects when the file is opened in read-only mode, the H5NodeFilterProxy class is provided. This class stores all temporary modifications while the original h5py.Group and h5py.Dataset objects stay unchanged.

hickle 4.0.0 / 4.0.1:
=====================
Strings and arrays of bytes are stored as Python bytearrays and not as variable-sized strings and bytes. The benefit is that hdf5 filters and hdf5 compression filters can be applied to Python bytearrays. The downside is that the data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings.

Extends the pickle loader create_pickled_dataset function to support the Python copy protocol as proposed by issue telegraphic#125. For this, a dedicated PickledContainer is implemented to handle all objects which have been stored using the Python copy protocol.

numpy masked arrays are now stored as an h5py.Group containing a dedicated dataset each for data and mask.

scipy.sparse matrices are now stored as an h5py.Group containing the datasets data, indices, indptr and shape.

Dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NoneType keys are converted to name strings; for all other keys a key-value-pair group is created containing the key and value as its subitems.

String and bytes keys which contain slashes are converted into key-value pairs instead of converting slashes to backslashes. Distinction from hickle 4.0.0 string and bytes keys with converted slashes is made by enclosing the string value within double quotes instead of the single quotes produced by the Python repr function or the !r and %r string format specifiers. Consequently, on load, all string keys which are enclosed in single quotes will be subjected to slash conversion, while any others will be used as they are.

h5py.Group and h5py.Dataset objects whose 'base_type' refers to 'pickle' automatically get assigned object as their py_object_type on load. The related 'type' attribute is ignored. h5py.Group and h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to either contain a pickle string or conform to the copy protocol and thus implicitly get assigned the 'pickle' base type. Thus, on dump, the 'base_type' and 'type' attributes are omitted for all h5py.Group and h5py.Dataset objects which contain pickle strings or conform to the Python copy protocol, as their values are 'pickle' and object respectively.

Other stuff:
============
Full separation between hickle core and loaders.
Distinct unit tests for individual loaders and hickle core.
Cleanup of functions and classes that are no longer required.
Simplification of recursion on dump and load through a self-contained loader interface.
Is capable of loading hickle 4.0.x files, which do not yet support the PyContainer concept beyond list, tuple, dict and set.
Includes extended tests of loading hickle 4.0.x files.
Contains a fix for the lambda py_obj_type issue on numpy arrays with single non-list/tuple object content: Python 3.8 refuses to unpickle the lambda function string. This was observed while finalizing the pull request. The fixes are only activated when a 4.0.x file is to be loaded.
Exceptions thrown by load now include the triggering exception, including its stack trace, for better localization of errors in debugging and error reporting.
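A pure-Python sketch of the dictionary-key handling described in the commit message; illustrative only, it mimics the described rules rather than hickle's actual code:

```python
def key_to_name(key):
    # Return a name string for the key, or None when a key-value-pair
    # group would be needed instead (per the rules described above).
    if isinstance(key, (str, bytes)):
        text = key if isinstance(key, str) else key.decode("latin-1")
        if "/" in text:
            return None            # slashes would split the HDF5 path
        # Double quotes mark plain string keys so they can be told apart
        # from single-quoted repr output written by older files.
        return '"{}"'.format(key) if isinstance(key, str) else repr(key)
    if isinstance(key, (int, float, complex, bool)) or key is None:
        return repr(key)
    return None                    # tuples, objects, ... -> key-value pair

print(key_to_name("plain"))  # -> "plain"
print(key_to_name(42))       # -> 42
print(key_to_name("a/b"))    # -> None
```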
With hickle 4.0.0 the code for dumping and loading dedicated objects such as scalar values or numpy arrays was moved to dedicated loader modules. This first step of disentangling the hickle core machinery from object-specific code covered all objects and structures which were mappable to h5py.Dataset objects. This commit provides an implementation of hickle extension proposal H4EP001 (telegraphic#135), which specifies the extension of the loader concept introduced by hickle 4.0.0 towards generic PyContainer-based and mixed loaders. In addition to the proposed extension, this implementation includes the following extensions to hickle 4.0.0 and H4EP001.

H4EP001:
========
The PyContainer interface includes a filter method which allows loaders, when data is loaded, to adjust, suppress, or insert additional data subitems of h5py.Group objects. In order to accomplish the temporary modification of h5py.Group and h5py.Dataset objects while the file is opened in read-only mode, the H5NodeFilterProxy class is provided. This class stores all temporary modifications while the original h5py.Group and h5py.Dataset objects stay unchanged.

hickle 4.0.0 / 4.0.1:
=====================
Strings and arrays of bytes are stored as Python bytearrays and not as variable-sized strings and bytes. The benefit is that hdf5 filters and hdf5 compression filters can be applied to Python bytearrays. The downside is that the data is stored as bytes of int8 datatype. This change affects native Python string scalars as well as numpy arrays containing strings.

The pickle loader's create_pickled_dataset function is extended to support the Python copy protocol as proposed by issue telegraphic#125. For this a dedicated PickledContainer is implemented to handle all objects which have been stored using the Python copy protocol.

numpy masked arrays are now stored as an h5py.Group containing a dedicated dataset each for data and mask.

scipy.sparse matrices are now stored as an h5py.Group containing the datasets data, indices, indptr and shape.

Dictionary keys are now used as names for h5py.Dataset and h5py.Group objects. Only string, bytes, int, float, complex, bool and NoneType keys are converted to name strings; for all other keys a key-value-pair group is created containing the key and the value as its subitems. String and bytes keys which contain slashes are converted into key-value pairs instead of converting the slashes to backslashes. The distinction from hickle 4.0.0 string and bytes keys with converted slashes is made by enclosing the string value within double quotes instead of the single quotes produced by the Python repr function or the !r and %r string format specifiers. Consequently, on load all string keys enclosed in single quotes are subjected to slash conversion, while any others are used as-is.

h5py.Group and h5py.Dataset objects whose 'base_type' attribute refers to 'pickle' automatically get assigned object as their py_obj_type on load; the related 'type' attribute is ignored. h5py.Group and h5py.Dataset objects which do not expose a 'base_type' attribute are assumed to either contain a pickle string or conform to the copy protocol and thus implicitly get assigned the 'pickle' base type. Consequently, on dump the 'base_type' and 'type' attributes are omitted for all h5py.Group and h5py.Dataset objects which contain pickle strings or conform to the Python copy protocol, as their values would be 'pickle' and object respectively.

Other stuff:
============
Full separation between hickle core and loaders.
Distinct unit tests for individual loaders and hickle core.
Cleanup of no-longer-required functions and classes.
Simplification of recursion on dump and load through a self-contained loader interface.

The loader is capable of loading hickle 4.0.x files, which do not yet support the PyContainer concept beyond list, tuple, dict and set, and includes extended tests for loading hickle 4.0.x files. It contains a fix for the lambda py_obj_type issue on numpy arrays with a single non-list/tuple object content: Python 3.8 refuses to unpickle the lambda function string. This was observed while finalizing the pull request; the fixes are only activated when a 4.0.x file is to be loaded. Exceptions thrown by load now include the triggering exception with its stack trace, for better localization of the error during debugging and error reporting.
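The dict-key naming rule described above can be sketched roughly as follows. This is a hypothetical helper for illustration only, not hickle's actual function: simple scalar keys become group/dataset name strings (string keys wrapped in double quotes to distinguish them from 4.0.0-style single-quoted repr output), while slash-containing strings and all other key types signal that a key-value-pair group is needed instead.

```python
def key_to_name(key):
    """Sketch of the dict-key naming rule (hypothetical helper).

    Returns a name string for scalar keys, or None when the key
    must be stored as a key-value-pair group instead.
    """
    if isinstance(key, (str, bytes)):
        text = key.decode("utf-8", "backslashreplace") if isinstance(key, bytes) else key
        if "/" in text:
            return None  # slash in key: store as key-value-pair group
        # double quotes mark the new-style string keys; old 4.0.0 keys
        # used single quotes as produced by repr()
        return '"%s"' % text
    if isinstance(key, (int, float, complex, bool)) or key is None:
        return repr(key)
    return None  # non-scalar key: key-value-pair group
```

For example, `key_to_name("a/b")` yields no name string, so such a key would be stored as a key-value-pair group rather than by mangling the slash.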
While working on pull request #138 and some preliminary proof-of-concept work for issue #139, I had to accept that the HDF5 data format is not really designed for storing dict, list and tuple structures containing vast amounts of heterogeneous data, especially when they are organized in a rather chaotic, huge tree-like structure. I summarized my findings in pull request #138. Consequently, as any other solution yields more appropriate results than a naive implementation of copy protocol support within hickle, I hereby close this issue in favour of H4EP003 (issue #145).
* no copy when loading astropy
* typo
* Bumping to 3.4.4
* Package license and requirements files
* Bump to 3.4.5
* Added some tests to Travis CI and AppVeyor that check if hickle can be properly packaged up, distributed and installed.
* And maybe I should not forget twine in the requirements_test.txt.
* Moved the tests directory from root to ./hickle.
* Added missing tests folder to the MANIFEST.in file.
* Add dict-type permanently to types_dict.
* Subclasses of supported types can now be pickled as well (although not yet with their proper type).
* Removed all cases of saving dtypes as a single element list.
* Renamed the 'type' attribute to 'base_type' in preparation for adding subclass support.
* Also make sure that strings are saved as single elements.
* The types_dict now uses tuples with the create-functions and the hkl_dtype key.
* All create-functions now take an extra 'base_type' string that describes what the hkl_type is that will be saved to HDF5.
* Groups also obtain their base_type from the create_dataset_lookup() function now.
* The actual type of a hickled object is now saved as well (in pickled form).
* Finalized implementing support for subclasses.
* Coveralls -> codecov.io
* Add codecov.io badge
* The order of the dict item keys is now saved as well. Any dict that is loaded will be initialized with the items sorted in that order. For all types that derive from dict, the dict will be initialized using its true type directly (apparently, I had written it that way before already for some reason). This fixes telegraphic#65.
* Hickle's required HDF5 attributes are now solely applied to the data group that contains the hickled Python object, instead of the entire file (this allows hickled objects to be more easily added to already existing HDF5 files, without messing up the root of that file).
* Datasets and groups now solely use 'group_xxx' if there is more than a single item at that level. All 'create_xxx' functions are now passed a string containing the name of the group/dataset that they should create.
* Added forgotten base_key_type attribute to PyContainer.
* Reverted working tree back to before using 'track_order'. Added missing 'six' requirement. Added a test for testing the dumping and loading of an OrderedDict.
* The root of a hickled group is no longer read now, as it is not necessary. Removed the auxiliary attributes that were required for reading it. The true type of a dict key is now saved as well, even though it is not used for anything (simply saving it now in case we want to use it later).
* The version is now stored using a single Python file, whose strings are read using regex.
* HDF5 groups can now be given as a file_obj as well when dumping and loading. Providing a path that does not start with '/' will automatically add it to it. Added tests for these functionalities.
* Arbitrary-precision integers can now be dumped and loaded properly.
* Also make sure that 'long' dtypes are taken into account on Python 2.
* Make hickle work with pathlib.Path. Basically, any package/module that saves to file supports this too (including h5py).
* Make Python 2 compatible.
* Changed wording.
* Added six requirement, and added minimum versions for all requirements.
* Now 'dill' is always used for dumping and loading serialized objects. Added a test for dumping and loading a local session function.
* Add support for actually loading serialized data in Python 2.
* Add new test to main as well.
* Make sure new changes are also used for Python 2.
* Update file_opener re telegraphic#123.
* Fixed documentation of dump and load to be NumPy doc style (and look a bit better). Replaced broken pickle documentation link with a proper one.
* Only lists and tuples are now recognized as acceptable iterables. All other iterables are either handled separately, or simply pickled.
* Changed the lookup system to use object typing for more consistency.
* Added test for detecting the problem raised in telegraphic#125.
* Added support for hickling dicts with slashes in their dict keys.
* Make sure that functional backslashes still work properly in dict keys.
* Loaders are now only loaded when they are required for dumping or loading a specific object.
* Make sure to do proper future import.
* Raise an error if a dict item key contains a double backslash.
* Only filter out import errors due to the loader not being found.
* As Python 2 apparently only reports the last part of the name of a non-importable module, search for something a bit more specific.
* Some small QoL changes.
* The py_type of a pickled object is no longer saved to HDF5, as it is not necessary to restore the object's original state.
* Removed legacy support for v1 and v2. Added start of legacy support for v3. v4 now stores its HDF5 attributes using a 'HICKLE_' prefix, to allow users to add attributes to the group without interference.
* Objects can now be hickled to existing HDF5 groups, as long as they don't contain any datasets or groups themselves.
* Made sure that v3 always uses relative imports, to make it easier to port any functionality change from v3 to v4. (Even though I am not a fan of relative imports.)
* The version is now stored using a single Python file, whose strings are read using regex.
* Backported change to version store location to v3 as well. Bumped version to 3.4.7 to include the latest changes.
* Removed support for Python 2. Added legacy support for hickle v3 (currently uses v3.4.7).
* Remove testing for Python 2.7 as well.
* Always specify the mode with which to open an HDF5 file.
* Test requirements updates.
* And make sure to always execute 'pytest'.
* Removed basically everything that has to do with Python 2.7 from v4. As legacy_v3 is expected to be able to load files made with Python 2.7, these are not changed.
* Many, many QoL changes. Converted all v4 files to be PEP8 compliant. Rewritten 'load_python3' into 'load_builtins'. The 'load_numpy' and 'load_builtins' modules are only loaded when they are required, like all other loaders. Removed the 'containers_type_dict' as the true container type is already saved anyway. Astropy's classes are now reinitialized using their true type instantly. Astropy constants can now be properly dumped and loaded.
* Save types of dict keys as a normal string as well.
* Some minor improvements.
* Added test for opening binary files, and make sure that any 'b' is removed from the file mode. (telegraphic#131)
* Added pytests for all uncovered lines. Removed lines that are never used (and thus cannot be covered). Added 'NoneType' to the dict of acceptable dict key types, as Nones can be used as keys.
* Replaced all instances of 'a HDF' with 'an HDF'.
* Removed the index.md file in the docs, as it is now simply pointing to the README.md file instead.
* Badges!
* Added few classifiers to the setup.py file.
* Update requirement for pytest.
* Removed use of 'track_times'.
* NumPy arrays with unicode strings can now be properly dumped and loaded.
* NumPy arrays containing non-NumPy objects can now be properly dumped and loaded.
* Added all missing tests for full 100% coverage!!!
* Make sure that kwargs is always passed to 'create_dataset'.
* If hickle fails to save an object to HDF5 using its standard methods, it will fall back to pickling it (and emits a warning saying why that was necessary). Simplified the way in which Python scalars are hickled.
* Actually mention that the line in parentheses is the reason for serializing.
* Use proper development status identifier.
* Make sure to preserve the subclass type of a NumPy array when dumping.
* Make sure that a SkyCoord object will be properly saved and retrieved when it's a scalar or N-D array.
* Revert the change to legacy_v3/load_astropy.py.
* Updated legacy v3 to v3.4.8.

Co-authored-by: Kewei Li <kewl@microsoft.com>
Co-authored-by: Danny Price <dan@thetelegraphic.com>
Co-authored-by: Isuru Fernando <isuruf@gmail.com>
Co-authored-by: Ellert van der Velden <ellert_vandervelden@outlook.com>
Co-authored-by: Bas Nijholt <basnijholt@gmail.com>
Co-authored-by: Rui Xue <rx.astro@gmail.com>
[Suggestion]
I have several complex classes which, in order to be pickled by different pickle replacements such as jsonpickle and others, implement the
__getstate__
and
__setstate__
methods. Besides being copyable for free using copy.copy and copy.deepcopy, pickling is quite straightforward. The above example is very simplified, removing anything unnecessary.
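A minimal example of such a class (the Sensor class and its attributes are invented for illustration, not taken from my actual code). __getstate__ returns a plain dict describing the instance, __setstate__ restores it, and both copy.deepcopy and pickle pick the two methods up automatically:

```python
import copy
import pickle

class Sensor:
    """Toy class implementing the copy protocol."""
    def __init__(self, name, samples):
        self.name = name
        self.samples = list(samples)
        self._cache = None  # derived data, not worth persisting

    def __getstate__(self):
        # Only the reconstructible state goes into the dict.
        return {"name": self.name, "samples": self.samples}

    def __setstate__(self, state):
        self.name = state["name"]
        self.samples = state["samples"]
        self._cache = None  # rebuilt lazily after restore

s = Sensor("probe-1", [1, 2, 3])
clone = copy.deepcopy(s)                    # goes through __getstate__/__setstate__
restored = pickle.loads(pickle.dumps(s))    # so does pickling
```

Note that the state dict returned by __getstate__ is exactly the kind of structure hickle already knows how to map onto an h5py.Group.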
Currently these classes are hickled with the warning that the object is not understood, as a test for whether
__setstate__
and
__getstate__
are implemented is missing. Admittedly, both are handled by the pickle fallback, but the class ends up as a pickled string instead of a dataset, making it quite tedious to extract from the HDF5 file on a non-Python end such as C# or other languages. Therefore I suggest adding a test for both methods being defined, and storing the class as a class state dictionary instead of a pickled string. This would need some flag or other means of indicating that the dict represents the result of
<class>.__getstate__
and not a plain Python dictionary. The test should be run after testing for numpy data and before testing for Python iterables, as the above class appears to be iterable but isn't.

ADDENDUM:
If somebody guides me through it, I would attempt to add the appropriate test and conversion function. But I would at least need guidance on which existing methods would be best suited as template and inspiration, and which parts and sections of the h5py manual, the HDF5 spec and the other hickle contributor documentation to read and understand carefully before touching anything.
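The reconstruction step proposed above can be sketched as follows. This is a hypothetical helper, not existing hickle code, and the Point class is an invented demo: given the module and class names stored as metadata on the state-dict group, the class is imported, a bare instance is created without calling __init__, and the unhickled state dict is handed to __setstate__:

```python
import importlib

class Point:
    """Demo class: restored from a state dict without calling __init__."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __getstate__(self):
        return {"x": self.x, "y": self.y}

    def __setstate__(self, state):
        self.x, self.y = state["x"], state["y"]

def restore_from_state(module_name, class_name, state):
    """Hypothetical unhickling step for a copy-protocol state dict."""
    cls = getattr(importlib.import_module(module_name), class_name)
    obj = cls.__new__(cls)   # bare instance, __init__ is not called
    obj.__setstate__(state)  # state dict was unhickled from the group
    return obj

p = restore_from_state(__name__, "Point", {"x": 1, "y": 2})
```

As always, the tricky part hides in the details: the metadata flag marking the group as a state dict, and the import of the named module, both need to happen before this step.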