dumping functions does not store dependencies #52

Closed
hhuuggoo opened this issue Jul 9, 2014 · 12 comments

@hhuuggoo

hhuuggoo commented Jul 9, 2014

If I use this script to serialize something:

import pandas as pd
import dill

def func(x):
    return pd.DataFrame({'a': x})

def func2(x):
    return func(x) + func(x)

with open("out.dill", "wb") as f:  # binary mode: pickle data is bytes
    dill.dump(func2, f)

And load it with:

import dill

with open("out.dill", "rb") as f:  # binary mode, matching the dump
    func2 = dill.load(f)

print func2([1,2,3,4,5])

I get

Traceback (most recent call last):
  File "read_test.py", line 6, in <module>
    print func2([1,2,3,4,5])
  File "write_test.py", line 8, in func2
    return func(x) + func(x)
NameError: global name 'func' is not defined

What is the intention for how the user should handle this?

Thanks

@mmckerns
Member

Hi Hugo.

The way dill treats that case is exactly how pickle handles it.

Interestingly, dill often does treat global references differently than pickle… for example, when serializing a class instance, you can choose whether to serialize by reference or to serialize the class definition. For closures, dill tries to find the minimal set of dependencies from the globals that it needs to serialize… but it doesn't do this for a standard function.

Is this answer sufficient for what you were looking for? Or is this a subtle feature request, or otherwise?

It's an interesting and feasible idea to try to find the global dependencies and serialize them with the function object. It's not been on my radar for functions.

@hhuuggoo
Author

@mmckerns, I make all my feature requests explicitly =) and submit PRs if I can. Right now I'm just exploring the various ways one could do remote execution of functions, and how general it could be. I'm also entertaining ideas like maintaining a directory that is rsynced with remotes, and only allowing function execution from things in the source code.

How would you find the feasible global deps? Would you parse the function code to find out what it uses?

@mmckerns
Member

I do it already in certain cases... and have a few options. There are a few examples of this in dill right now… but which is better depends on the context, I think. For example, dill.pointers has code that tracks down references to parents through the gc module. There's another route, especially for closures: dill augments the inspect module with line caching, etc.
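One way to approximate the "find out what it uses" idea without parsing source is to read the names recorded in a function's code object. The following is a minimal sketch using only the standard library, not dill's actual implementation; `global_dependencies`, `helper`, and `caller` are hypothetical names introduced for illustration.

```python
import types

def global_dependencies(func, _seen=None):
    """Return {name: object} for the module globals `func` refers to, recursively.

    Approximation: co_names also lists attribute names, so a global that
    happens to share a name with an attribute can be picked up by mistake.
    Nested code objects (lambdas, inner functions) are not inspected.
    """
    seen = {} if _seen is None else _seen
    for name in func.__code__.co_names:
        if name in seen or name not in func.__globals__:
            continue  # already handled, or a builtin/attribute name
        value = func.__globals__[name]
        seen[name] = value
        if isinstance(value, types.FunctionType):
            # Follow plain Python functions to collect their globals too.
            global_dependencies(value, seen)
    return seen

def helper(x):
    return [x, x]

def caller(x):
    return helper(x) + helper(x)

deps = global_dependencies(caller)
print(sorted(deps))  # prints ['helper']
```

A serializer could store the objects in `deps` alongside the function, which is the behavior asked about in this issue.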

@matsjoyce
Contributor

It will work if func2 and func are in a module. It only doesn't work because the old __main__ is set to the new __main__, without its globals being saved or restored in any way.
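The module case works because functions defined in an importable module are pickled by reference: the payload names the module and the attribute, and the loading side imports them. A small sketch with the standard pickle module, using `json.dumps` purely as a stand-in for a module-level function:

```python
import json
import pickle

payload = pickle.dumps(json.dumps)

# The payload stores only the module and attribute names, not the bytecode.
assert b"json" in payload and b"dumps" in payload
assert len(payload) < 100

# Loading re-imports the module and looks the name up again.
restored = pickle.loads(payload)
print(restored is json.dumps)  # True
```

Since nothing but the names travels, the function and everything it depends on must already be importable on the receiving end.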

@mmckerns
Member

True. So the question is, should there be an asymmetry between "in-module" behavior and "in-interpreter" behavior? I tend to say no. It's just not a case I'd considered before. @matsjoyce: what do you think?

@matsjoyce
Contributor

It all gets very complicated when we question the design of pickle, doesn't it? I think it depends on what dill is trying to do. I can think of several options:

  1. Save everything, restoring the environment to its exact state when saved
  2. Save a "minimal complete" version of the environment, with just the objects that the saved object references, so that the saved object can function as before, but affecting the rest of the program only when needed
  3. Save just the object, presuming that everything else is available on the "other end"
  4. There are probably more

I think that 1 fits with the dill.dump_session idea, while 2 fits with the dill.dump idea. Pickle follows 3 more?
Back to the question: both 1 and 2 mean that __main__ should be saved. Looking through the dill source code, it seems that __main__ is not saved only because the __main__ module is never encountered, so that might be something we missed in #41. It should be easy to fix though, perhaps by adding a pik.dump(_main_module) in dill.dump, and any corresponding changes in dill.load.
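To make option 2 concrete, here is a rough sketch of a "minimal complete" bundle built with the standard library only. It is not how dill works; `pack` and `unpack` are hypothetical helpers, and the sketch assumes plain functions with no default arguments or closures, loaded by the same Python version that wrote the bundle (marshal data is version-specific).

```python
import builtins
import marshal
import types

def pack(func):
    """Bundle func's bytecode with that of the plain functions it calls by name."""
    bundle = {}
    todo = [(func.__name__, func)]
    while todo:
        name, f = todo.pop()
        if name in bundle:
            continue
        bundle[name] = marshal.dumps(f.__code__)
        for dep_name in f.__code__.co_names:
            dep = f.__globals__.get(dep_name)
            if isinstance(dep, types.FunctionType):
                todo.append((dep_name, dep))
    return func.__name__, bundle

def unpack(packed):
    """Rebuild the bundled functions in a fresh namespace that they all share."""
    entry, bundle = packed
    namespace = {"__builtins__": builtins}
    for name, blob in bundle.items():
        namespace[name] = types.FunctionType(marshal.loads(blob), namespace, name)
    return namespace[entry]

def func(x):
    return [v * 2 for v in x]

def func2(x):
    return func(x) + func(x)

packed = pack(func2)      # a (str, {str: bytes}) pair, itself picklable
rebuilt = unpack(packed)
print(rebuilt([1, 2, 3]))  # prints [2, 4, 6, 2, 4, 6]
```

The rebuilt function resolves `func` from its own small namespace rather than from `__main__`, so it touches the rest of the program only through what was bundled with it.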

@mmckerns
Member

I think dill tends to (2), as you indicated. Getting dependencies of a function would be a strong break from pickle behavior, as it would say that (3) is not sufficient in this extremely common case. Do we then have to reexamine all the other objects dill pickles in the interpreter, and make them consistent too?

I think that pickle gets dependencies for objects inside a module because it can serialize a module. I think that dill right now maintains the pickle philosophy of not treating __main__ as a module, and follows (2) instead of (3). Does dill ever say, I can save __main__ as a module object? I think this only happens in the special case of dump_session… but I'll have to check.

@mmckerns mmckerns reopened this Jul 13, 2014
@mmckerns
Member

#1 is asking for changes that tend toward treating __main__ as a module, and it's partly why I've left it in limbo so long. Of course, that ticket relates to dump_session a bit more directly.

@matsjoyce
Contributor

Well, part of me (the part which likes the fact that a method is just a function with one bound argument) likes the idea that __main__ is treated as just a module, but the other part realises that that may complicate some things, and introduce some confusing behaviour for people new to dill.

@mmckerns
Member

I agree, and like it for the same reason. However, I think it seriously breaks what is currently a fairly consistent paradigm in dill... and that would likely mean a serious impact on others' existing code. So, I'm for not making the change, unless there's a really, really compelling reason to do so. This might be worth exploring in a branch, just to see the impact, but I'm not going to do so right now with no motivating need.

This should then re-close this issue. Agreed?

@matsjoyce
Contributor

The problem with treating __main__ as a module is that then, either __main__ objects would have to be saved like objects in other modules (refs), which is useless, or the objects in other modules would have to be completely pickled, like objects in __main__, which would lead to larger pickles and more "cannot pickle X" problems.

@mmckerns
Member

Closing this again. For now, I'm filing it away as a nice idea.
