Loading the same session after dump creates a different file #195

Closed
sb2nov opened this issue Dec 2, 2016 · 5 comments

@sb2nov

sb2nov commented Dec 2, 2016

I have some trouble understanding why the file c.pkl would be different from b.pkl as we're loading the session we just dumped. Thank you for the help.

import dill
import os
import tempfile

tmpdir = tempfile.mkdtemp()

class A(object):
  def __init__(self, a=5):
    self.a = a

  def printa(self):
    print self.a

print 'Directory Path', tmpdir
dill.dump_session(os.path.join(tmpdir, 'b.pkl'))
dill.load_session(os.path.join(tmpdir, 'b.pkl'))
dill.dump_session(os.path.join(tmpdir, 'c.pkl'))
dill.load_session(os.path.join(tmpdir, 'c.pkl'))
dill.dump_session(os.path.join(tmpdir, 'd.pkl'))
dill.load_session(os.path.join(tmpdir, 'd.pkl'))
dill.dump_session(os.path.join(tmpdir, 'e.pkl'))
dill.load_session(os.path.join(tmpdir, 'e.pkl'))
dill.dump_session(os.path.join(tmpdir, 'f.pkl'))
dill.load_session(os.path.join(tmpdir, 'f.pkl'))

from subprocess import check_output
print 'Hash of all the files: '
print check_output("sha1sum " + os.path.join(tmpdir, '*'), shell=True)

Sample output

Hash of all the files: 
d03af7a9fda8f141cb75ef787c2bc274609edd43  /tmp/sourabhbajaj/tmp7rsR3h/b.pkl
c57c16d3ddaaaf2954aa550f699607baa74e6edd  /tmp/sourabhbajaj/tmp7rsR3h/c.pkl
c57c16d3ddaaaf2954aa550f699607baa74e6edd  /tmp/sourabhbajaj/tmp7rsR3h/d.pkl
c57c16d3ddaaaf2954aa550f699607baa74e6edd  /tmp/sourabhbajaj/tmp7rsR3h/e.pkl
c57c16d3ddaaaf2954aa550f699607baa74e6edd  /tmp/sourabhbajaj/tmp7rsR3h/f.pkl

cc: @mmckerns

@matsjoyce
Contributor

You can use pickletools.dis(open("c.pkl", "rb")) to get a closer look. So much happens in dump_session/load_session that it's not surprising something changed. In this case it seems that in the first dump pickle reuses some of the strings it defines, while in the later dumps it redefines them. If you print the id of A.__name__, it changes after each load, so I guess the first time "A" is shared with something else, say the module __dict__, but after dump+load it is no longer shared, so pickle serializes it twice.

Pickle has tons of oddities like this (dict randomisation has similar effects, though only between runs), so I wouldn't rely on hashes to check for differences. If you want more stable storage, use a hacked version of json which sorts dicts before dumping (what I did in the past). A bonus is that the output is readable.
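
The sharing effect described above can be reproduced without dill. This is a minimal sketch, assuming Python 3 and the standard pickle module. The variable names are made up for illustration, and it shows the general memoization behaviour, not a confirmed trace of what dump_session does. Pickle memoizes objects by identity, so a string referenced twice is written once, while two equal but distinct strings are written twice.

```python
import pickle

# Build two equal strings at runtime so they are distinct objects
# (string literals could be merged into one object by the compiler).
shared = "".join(["sess", "ion"])
copy = "".join(["sess", "ion"])

# Same object twice: the second reference becomes a short memo lookup.
same_object = pickle.dumps([shared, shared], protocol=2)

# Equal but distinct objects: the string data is written out twice.
equal_objects = pickle.dumps([shared, copy], protocol=2)

# The byte streams differ even though both load back to equal lists.
print(len(same_object), len(equal_objects))
print(same_object == equal_objects)
print(pickle.loads(same_object) == pickle.loads(equal_objects))
```

Running pickletools.dis(same_object) and pickletools.dis(equal_objects) side by side shows where the second stream repeats the string instead of referring back to the memo.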

@sb2nov
Author

sb2nov commented Dec 5, 2016

@matsjoyce thanks for the explanation. I tried using pickletools but wasn't able to figure much out from it. I do see the id of A change across each load which I think is consistent with your hypothesis.

Regarding the ordered json serialization: did you mean using that instead of pickle?

@matsjoyce
Contributor

If you really need consistency, then yes, use json (or msgpack, which is faster and more compact). The only problem you would then have to deal with is dict order randomisation, which can be solved reasonably easily. Of course this won't work if you need to pickle any python object, as json and friends are limited in the number of types they can handle.
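
As a rough illustration of the json route, here is a minimal sketch assuming Python 3 and state made only of JSON-compatible types. The helper name stable_digest is hypothetical and not part of dill or json. Passing sort_keys=True to json.dumps is one way to remove the dependence on dict ordering, so the hash depends only on the content.

```python
import hashlib
import json


def stable_digest(state):
    """Hash a JSON-compatible object independently of dict key order."""
    # sort_keys fixes the key order; fixed separators avoid whitespace drift.
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Same content, different insertion order.
first = {"a": 5, "name": "A"}
second = {"name": "A", "a": 5}

print(stable_digest(first) == stable_digest(second))
```

This only covers values json can encode (dicts with string keys, lists, strings, numbers, booleans, None), which is the limitation mentioned above.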

@mmckerns
Member

@sb2nov: I don't see the need to do anything here. Is there any reason to not close this issue?

@sb2nov
Author

sb2nov commented Jan 23, 2017

@mmckerns closing it now

@sb2nov sb2nov closed this as completed Jan 23, 2017
@mmckerns mmckerns modified the milestone: dill-0.2.6 Feb 1, 2017