Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some errors with dump_module and load_module #525

Open
mmckerns opened this issue Jul 10, 2022 · 11 comments · Fixed by #526 · May be fixed by #527 or #475
Open

some errors with dump_module and load_module #525

mmckerns opened this issue Jul 10, 2022 · 11 comments · Fixed by #526 · May be fixed by #527 or #475
Labels
Milestone

Comments

@mmckerns
Copy link
Member

mmckerns commented Jul 10, 2022

In trying out dump_module and load_module a bit more extensively after #507, and I ran across some interesting cases:

  1. dump an imported builtin module that contains unpicklable objects
>>> import dill
>>> import math
>>> dill.dump_module('math.pkl', main=math)
>>> _math = dill.load_module('math.pkl')
>>> _math.sin(0)
0.0
>>> _math = dill.load_module('math.pkl', main='math')
>>> _math.sin(0)
0.0
>>> import abc
>>> dill.dump_module('abc.pkl', main=abc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 488, in dump_module
    pickler.dump(main)
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 880, in dump
    StockPickler.dump(self, obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 2099, in save_module
    state=_main_dict)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 662, in save_reduce
    save(state)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 1667, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 859, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 885, in _batch_setitems
    save(v)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 2172, in save_type
    )), obj=obj, postproc_list=postproc_list)
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 1551, in _save_with_postproc
    pickler.save_reduce(*reduction, obj=obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 1667, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 859, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 885, in _batch_setitems
    save(v)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pickle.py", line 524, in save
    rv = reduce(self.proto)
TypeError: can't pickle _abc_data objects
  1. dump an imported module by name that fails otherwise, then fails to load
>>> import dill
>>> dill.dump_module('abc.pkl', main='abc')
>>> abc = dill.load_module('abc.pkl')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 640, in load_module
    pickle_main = _identify_module(file, main)
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 551, in _identify_module
    raise UnpicklingError("reached STOP without finding main module")
_pickle.UnpicklingError: reached STOP without finding main module
>>> abc = dill.load_module('abc.pkl', main='abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 640, in load_module
    pickle_main = _identify_module(file, main)
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 551, in _identify_module
    raise UnpicklingError("reached STOP without finding main module")
_pickle.UnpicklingError: reached STOP without finding main module
>>> import abc
>>> dill.load_module('abc.pkl', main=abc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 640, in load_module
    pickle_main = _identify_module(file, main)
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 551, in _identify_module
    raise UnpicklingError("reached STOP without finding main module")
_pickle.UnpicklingError: reached STOP without finding main module
  1. interestingly, math also fails to load when saved by name
>>> import dill
>>> dill.dump_module('math.pkl', main='math')
>>> dill.load_module('math.pkl')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 640, in load_module
    pickle_main = _identify_module(file, main)
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 551, in _identify_module
    raise UnpicklingError("reached STOP without finding main module")
_pickle.UnpicklingError: reached STOP without finding main module
>>> dill.load_module('math.pkl', main='math')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 640, in load_module
    pickle_main = _identify_module(file, main)
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 551, in _identify_module
    raise UnpicklingError("reached STOP without finding main module")
_pickle.UnpicklingError: reached STOP without finding main module
>>> import math
>>> dill.load_module('math.pkl', main=math)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 640, in load_module
    pickle_main = _identify_module(file, main)
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 551, in _identify_module
    raise UnpicklingError("reached STOP without finding main module")
_pickle.UnpicklingError: reached STOP without finding main module
>>> 
>>> dill.dump_module('math.pkl', main=math)
>>> dill.load_module('math.pkl')
<module 'math' from '/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/math.cpython-37m-darwin.so'>
>>> dill.load_module('math.pkl', main='math')
<module 'math' from '/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/math.cpython-37m-darwin.so'>
  1. same-named module-type object and imported module; apparently correct behavior, but not obvious
>>> import dill
>>> import types
>>> fail = types.ModuleType('fail')
>>> fail.x = 1
>>> dill.dump_module('fail.pkl', main=fail)

~ new session ~

>>> import dill
>>> fail = dill.load_module('fail.pkl')
>>> fail.x
1
>>> dill.load_module('fail.pkl', main='fail')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 667, in load_module
    % pickle_main
ValueError: can't restore non-imported module 'fail' into an imported one
>>> import fail  # module contains ```y = 0```
>>> dill.load_module('fail.pkl', main=fail)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 667, in load_module
    % pickle_main
ValueError: can't restore non-imported module 'fail' into an imported one
>>> 
>>> dill.dump_module('fail.pkl', main=fail)
>>> 
>>> _fail = dill.load_module('fail.pkl')
>>> _fail.y
0
>>> import types
>>> fail = types.ModuleType('fail')
>>> _fail = dill.load_module('fail.pkl', main=fail)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mmckerns/lib/python3.7/site-packages/dill/_dill.py", line 672, in load_module
    % pickle_main
ValueError: can't restore imported module 'fail' into a non-imported one
>>> _fail = dill.load_module('fail.pkl', main='fail')
>>> _fail.y
0
  1. Note that a .pkl file is left even upon failure of dump_module (artifact from case 1 above).
@mmckerns
Copy link
Member Author

@leogama: I also noticed that it might be more natural to replace the argument main with module.

@mmckerns mmckerns added this to the dill-0.3.6 milestone Jul 10, 2022
@leogama
Copy link
Contributor

leogama commented Jul 10, 2022

@leogama: I also noticed that it might be more natural to replace the argument main with module.

I was about to propose this yesterday... Do you think it suffices to change the parameter name, or would you change the local variable names too?

I'll take a look at the errors you pointed to.

@mmckerns
Copy link
Member Author

Changing the name of the argument is probably the easiest route. Then you can use main = module for the rest of it.

@mmckerns mmckerns added the bug label Jul 11, 2022
@leogama leogama linked a pull request Jul 13, 2022 that will close this issue
15 tasks
@leogama
Copy link
Contributor

leogama commented Jul 13, 2022

O crap... I think I added the statement if isinstance(main, str): main = _import_module(main) in one branch (filtering) and changed the function signature in the other (load_session bug). The first errors are the result of dump_module(main="name") be simply saving the string "name"... 🤦🏼‍♂️
I've merged this with the filtering branch and the error vanished.
Will see the other errors now.

@mmckerns
Copy link
Member Author

Can you open a PR to address the above errors separately, as opposed to mixing them in with #475? It's fine if they are in both. I'd like to be able to close this up before having to deal with new features in #475.

@leogama
Copy link
Contributor

leogama commented Jul 13, 2022

  1. same-named module-type object and imported module; apparently correct behavior, but not obvious

The behavior is exactly the expected, as load_session() docstring states:

main: a module object or the name of an importable module.

The error message can be made more explicit however. Instead of this and variations:

  • can't restore non-imported module 'fail' into an imported one

It could be:

  • can't restore the non-imported module named 'fail' from session file into the imported module 'fail'

  1. Note that a .pkl file is left even upon failure of dump_module (artifact from case 1 above).

This behavior is consistent with pickle.dump() and dill.dump() when pickling fails.

leogama added a commit to leogama/dill that referenced this issue Jul 13, 2022
…qfoundation#525)

New phrasing of mismatching modules error messages in load_session():

```python
>>> import dill
>>> dill.dump_module()
>>> dill.load_module(module='math')
ValueError: can't update module 'math' with the saved state of module '__main__'

>>> import types
>>> main = types.ModuleType('__main__')
>>> dill.load_module(module=main)
ValueError: can't update module-type object '__main__' with the saved state of imported module '__main__'

>>> dill.dump_module(module=main)
>>> dill.load_module(module='__main__')
ValueError: can't update imported module '__main__' with the saved state of module-type object '__main__'
```
@mmckerns
Copy link
Member Author

Re: (4,5) The behavior is exactly the expected

That's good. I wanted to confirm this was the intended behavior. We need to be very clear in the docs what will happen, and the errors should also be clear so people who ignore the docs should know why the error was thrown.

mmckerns pushed a commit that referenced this issue Jul 15, 2022
* fix dump_module() bugs and rename parameter 'main' to 'module' (fixes #525)

New phrasing of mismatching modules error messages in load_session():

```python
>>> import dill
>>> dill.dump_module()
>>> dill.load_module(module='math')
ValueError: can't update module 'math' with the saved state of module '__main__'

>>> import types
>>> main = types.ModuleType('__main__')
>>> dill.load_module(module=main)
ValueError: can't update module-type object '__main__' with the saved state of imported module '__main__'

>>> dill.dump_module(module=main)
>>> dill.load_module(module='__main__')
ValueError: can't update imported module '__main__' with the saved state of module-type object '__main__'
```

* dump_module: clarify refimport description

* improvements to 'refimported' handling and extra checks in *_module() functions

* load_session(): clarify that the 'module' argument must match the session file's module
@mmckerns
Copy link
Member Author

mmckerns commented Jul 15, 2022

@leogama: Can you address the following?

  1. While it's fine that case (1) above throws a TypeError due to unpicklable objects in the module, it's still a bit unexpected for one builtin module (e.g. abc) not to pickle when another (e.g. math) does. Is this what you were envisioning, or does it make sense to be able to pickle a (non-altered) imported builtin module (at least, by reference) regardless?
  2. Also, how is the user supposed to know whether the module saved in the "pkl" file is an imported module as opposed to a non-imported module-type object? It would seem that if one is passing a module to be updated (by using the module keyword) you either have to know what type was stored, or you have to guess and see whether it throws a ValueError.
  3. Also, is the following expected behavior?
>>> import sys
>>> import test
>>> test.x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'test' has no attribute 'x'
>>> test.x = 1
>>> 
>>> import dill
>>> dill.dump_module('test.pkl', module=test)
>>> del sys.modules['test']
>>> del test
>>> 
>>> import test
>>> test.x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'test' has no attribute 'x'
>>> _test = dill.load_module('test.pkl')
>>> _test.x
1
>>> test.x  # imported module is also updated
1

@mmckerns mmckerns reopened this Jul 15, 2022
@leogama
Copy link
Contributor

leogama commented Jul 16, 2022

@leogama: Can you address the following?

  1. While it's fine that case (1) above throws a TypeError due to unpicklable objects in the module, it's still a bit unexpected for one builtin module (e.g. abc) not to pickle when another (e.g. math) does. Is this what you were envisioning, or does it make sense to be able to pickle a (non-altered) imported builtin module (at least, by reference) regardless?

Sure. This is my current understanding:

First, the premise of dill.dump_module() is to "pickle the current state of __main__ or another module to a file", so that the user can completely restore it, even in a different session. Unpickleable objects in a module may or may not contain relevant state to the module's inner working, and it's likely impossible to know it programmatically. Therefore, leaving such objects out of the session file in the interest of being able to pickle any module breaks that promise. So I think it is the correct, albeit maybe counterintuitive is some cases (but not in the __main__ case), to not do this by default and to raise an exception when any unpickleable object is found. This is a purely conceptual reasoning.

Pragmatically, we go back to that discussion in #475 about a filter for unpickleable objects, which could be used to allow, for example, any built-in module to be saved seamlessly. I didn't answer about that until now because I was pondering what kind of implementation would be more reasonable for this particular problem, which (as we see here) is broader in scope than simple objects to save selection. I'm reaching the conclusion that it's better to have a dedicated option for it in dump_module(), or even in Pickler.dump(), to save all unpickleable objects by reference (instead of excluding them from the namespace as would happen with a filter).

I just didn't get the last part of your question. What do you mean by pickling a built-in module by reference? The whole module, like dump_module() simply calling pickler.save_global(module)? This would be a no-op...


  1. Also, how is the user supposed to know whether the module saved in the "pkl" file is an imported module as opposed to a non-imported module-type object? It would seem that if one is passing a module to be updated (by using the module keyword) you either have to know what type was stored, or you have to guess and see whether it throws a ValueError.

I added a "__runtime__" prefix to the module name argument for _import_module() when pickling a runtime-created module (or module-type object, as you called it in a more intuitive way). It's as plain and simple as that. The relevant bits of logic are scattered though, in save_module(), _import_module() and load_module() (plus _is_imported_module() of course).

Again, didn't get the last bit. What would throw a ValueError beyond the checks in load_module() that explicitly use this line's result?

is_runtime_mod = pickle_main.startswith('__runtime__.')

  1. Also, is the following expected behavior?
>>> import sys
>>> import test
>>> test.x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'test' has no attribute 'x'
>>> test.x = 1
>>> 
>>> import dill
>>> dill.dump_module('test.pkl', module=test)
>>> del sys.modules['test']
>>> del test
>>> 
>>> import test
>>> test.x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'test' has no attribute 'x'
>>> _test = dill.load_module('test.pkl')
>>> _test.x
1
>>> test.x  # imported module is also updated
1

It's correct. For importable/imported modules, returning the module is just a convenience for not having to explicitly import the restored module to be able to access it. What is restored is always the sys.modules['module.name'] value. The name the user want's to associate to it is irrelevant.

So these three sequences of statements are equivalent:

>>> import dill
>>> import test # load module 'test' and assign it to name 'test'
>>> dill.load_module('test.pkl') # update it
<module 'test' from '/usr/lib/python3.8/test/__init__.py'>
>>> test.x
1
# in another session:
>>> import dill
>>> dill.load_module('test.pkl')  # load module 'test' and update it
<module 'test' from '/usr/lib/python3.8/test/__init__.py'>
>>> import test  # assign the name 'test' to the existent sys.modules['test'] object
>>> test.x
1
# in yet another session:
>>> import dill
>>> test = dill.load_module('test.pkl')  # load, update and assign in a single statement
>>> test.x
1

Note that, in the first case, if the parameter module was used:

  • with the string "test", it would be checked against the module name in the file session (a kind of sanity check);
  • with the module test, it would be checked, as above, and nothing would be returned (because presumably the user already have it associated with a variable and doesn't need to make an assignment, plus they know the loaded module is the same if there are no errors).

What you did is not much different than doing:

>>> import test
>>> test.x
AttributeError: module 'test' has no attribute 'x'
>>> test.x = 1
>>> del test
>>> import test as _test
>>> _test.x
1

As the module argument to load_module() is not required anymore, I find it nice that, when the returned module is not assigned to a variable, it is printed in the interactive console so that the user can identify which module was imported/restored. But if you think that something in the behavior you exemplified violates the principle of least surprise, it's OK to change the function to only return newly created module-type objects, and not imported modules.

@mmckerns
Copy link
Member Author

mmckerns commented Jul 16, 2022

First, the premise of dill.dump_module() is to "pickle the current state of main or another module to a file", so that the user can completely restore it, even in a different session. Unpickleable objects in a module may or may not contain relevant state to the module's inner working, and it's likely impossible to know it programmatically. Therefore, leaving such objects out of the session file in the interest of being able to pickle any module breaks that promise. So I think it is the correct, albeit maybe counterintuitive is some cases (but not in the main case), to not do this by default and to raise an exception when any unpickleable object is found. This is a purely conceptual reasoning.
...
I just didn't get the last part of your question. What do you mean by pickling a built-in module by reference? The whole module, like dump_module() simply calling pickler.save_global(module)? This would be a no-op...

For modules, generically, dill assumes that they are installed in both the location where dump and load is called. When there's a module dependency for a pickled object, the module is pickled by reference (i.e. an import statement is pickled). This assumes two things: (1) the module is installed (i.e. importable from the user's PATH at both locations), and (2) is not modified. dump_module and load_module give the opportunity to transfer a module when: (1) the module is not installed, and (2) is modified. So, these two approaches to serializing a module compliment each other, with the latter being the more advanced case.

So, there might be some expectations that dump_module(abc) works, as dump(abc) works:

>>> import dill
>>> import abc
>>> dill.dumps(abc)
b'\x80\x03cdill._dill\n_import_module\nq\x00X\x03\x00\x00\x00abcq\x01\x85q\x02Rq\x03.'

Thus, we need to (A) make the conceptual distinction clear (as I did above) in the docs, and (B) think about what should happen when dump_module currently fails on a module that would succeed when called with dump.

I added a "runtime" prefix to the module name argument for _import_module() when pickling a runtime-created module (or module-type object, as you called it in a more intuitive way). It's as plain and simple as that. The relevant bits of logic are scattered though, in save_module(), _import_module() and load_module() (plus _is_imported_module() of course).

Again, didn't get the last bit. What would throw a ValueError beyond the checks in load_module() that explicitly use this line's result?

What I meant is that, imagine you are given a .pkl file knowing nothing else about it. The natural assumption is to try dill.load to open it. If it's named session.pkl, then one might use dill.load_session. This has been the approach thus far. Now we have dill.load_module (replacing dill.load_session) but with two flavors (i.e. an imported or non-imported module). If the module keyword is not used during dill.load_module, then no problem, the module will be loaded regardless of the flavor of module stored. However, if someone wants the module to load into (i.e. update) a module container they just created... how do they know they are creating the right kind of module from inspection of the .pkl file? Are they just expected to try it until something works? Are they expected to just know what flavor of module is stored (maybe by naming convention)? Is there a function that can be called that lets them know what flavor of module is stored? I understand that __session__ and __runtime__ hold the above two pieces of information, and can be used to determine what is going on... but (A) this should be exposed to the user via function call (i.e. is_session(?) and is_importable or is_runtime or similar), and (B) whatever is the best route for the user should be documented in load_module (at the least).

It's correct. For importable/imported modules, returning the module is just a convenience for not having to explicitly import the restored module to be able to access it. What is restored is always the sys.modules['module.name'] value. The name the user want's to associate to it is irrelevant.

So these three sequences of statements are equivalent:
...

Right. So the three sequences you gave should be part of the documentation, so it's explicitly understood what is going on. One might assume that because you are assigning a name to a module that is returned from load_module (_test in my example) that test is not going to be affected, when it's imported. However, it is... because test is actually imported in sys.modules, then altered, then a different name is assigned to it locally. This should be clear from the documentation, especially since/if load_module_asdict has any behavior that is different (other than the target module being loaded as a dict instead of a module).

Do each of the above points make sense?

@leogama
Copy link
Contributor

leogama commented Jul 16, 2022

Yes, pretty much everything you wrote makes sense. I think part of this extended documentation should go into the specific functions' docstrings and part to the package's general documentation. How can we write these with four hands? I could open a PR and you could suggest changes to my additions and write entirely new paragraphs using the "suggested changes" feature in the code review page. (If you don't find the "Add a suggestion" button, take a look here: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/commenting-on-a-pull-request#adding-line-comments-to-a-pull-request)


Edit: just opened the PR (#527)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment