Deepcopy-based "safe" contexts can explode on rich objects #421

Closed
bitprophet opened this Issue Jan 18, 2017 · 7 comments

@bitprophet
Member
bitprophet commented Jan 18, 2017 edited

Scenario

Pretty simple: chuck things like compiled regexen (often found on their own, or within rich objects like boto2 clients) or thread locks (often found within database client connections, for example) into your config somewhere, then eagerly await some frustrating TypeErrors bubbling up from Python core's copy.py.

E.g.:

  File "/Users/jforcier/Code/oss/invoke/invoke/config.py", line 534, in clone
    my_config = copy.deepcopy(self.config)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 261, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 261, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 194, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 338, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 261, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 186, in deepcopy
    rv = reductor(2)
TypeError: can't pickle thread.lock objects

or (slightly different traceback and object type, though the reasoning is nearly identical):

  File "/Users/jforcier/Code/oss/invoke/invoke/config.py", line 534, in clone
    my_config = copy.deepcopy(self.config)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 265, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 265, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 198, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
[... a good number more of this same repetition ...]
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 342, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 265, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 306, in _deepcopy_inst
    state = deepcopy(state, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 265, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/copy.py", line 175, in deepcopy
    y = copier(memo)
TypeError: cannot deepcopy this pattern object
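
For reference, the same class of failure reproduces outside Invoke entirely; the nested dict below just stands in for a config tree (a minimal sketch, not Invoke code):

  import copy
  import re
  import threading

  # Config-shaped nested dict holding "rich" leaf values, similar to what ends
  # up in an Invoke config when users stash clients, locks, or regexes in it.
  config = {
      "db": {"connection_lock": threading.Lock()},
      "patterns": {"hostname": re.compile(r"^web\d+$")},
  }

  try:
      copy.deepcopy(config)
  except TypeError as e:
      # Python 2.7: "can't pickle thread.lock objects" or
      #             "cannot deepcopy this pattern object"
      # Python 3:   locks still refuse to deepcopy; compiled patterns
      #             became copyable in 3.7.
      print("deepcopy blew up: {0}".format(e))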

Solutions / discussion

At its heart this is due to how we clone both configs and contexts in a few spots in Invoke's task execution stuff in order to prevent hard-to-debug unintentional state bleed between tasks, subroutines, etc. Cloning uses copy.deepcopy heavily since there's no other good way to copy arbitrary nested objects with unknown contents.

So what could we do to prevent/minimize these issues?

  • Straight up stop deepcopying (which would make #309 happy).
    • I still believe arbitrary mutation will lead to nasty hard-to-track bugs...
    • We legitimately need at least some cloning for things like parameterized tasks (think fab -H h1,h2,h3)
      • Though IIRC it's feasible to expect we could limit the cloning in that case to be far more specific; I forget the details.
    • But it would certainly fix this issue.
  • Attempt to deepcopy, and fall back to plain shallow copying (or no copying at all) if we encounter a TypeError during the deepcopy.
    • This feels really bad because it'd be silently inconsistent (even if it logs, it'll still confuse the hell out of people).
  • Attempt to write our own, "better" deepcopy (probably by extending config.merge_dicts) that crawls the entire tree, does a "regular" copy of leaf values instead of a wholesale deepcopy, and refuses to copy anything it doesn't understand (rough sketch after this list).
    • Feels like reinventing deepcopy - we'd need to do everything it does (including attempting to call __deepcopy__ and such) with the sole difference being an attempt at "granular" TypeError exception handling.
    • But it would let us basically say "anything that isn't throwing a TypeError on deepcopy is getting copied, the rest are up to you", which is at least better than "there was a TypeError somewhere so you're now not getting any copying whatsoever"
      • Just...not actually a whole lot better, as it's still inconsistent.
  • Allow users to opt-out of the deepcopying/cloning when they encounter this issue & are willing to take on the burden of worrying about state bleed between tasks
    • Would also largely solve #309
    • I worry it would mean those corner cases where we really need cloning would then always break if this was turned on; or conversely that if we made them exempt, they'd still be enough to trigger the core issue here
  • Allow users to selectively opt-out of deepcopying specific keys (i.e. a blacklist of sorts)
    • Sidesteps the all-or-nothing nature of the other solutions
    • EDIT: this does require the "reimplement deepcopy, kinda" approach from above; clone() is currently implemented as "deepcopy all the cached source configs as well as the core config", it's merge() that uses merge_dicts. So it's still a moderate amount of work.
    • Less elegant / looks like an API/functionality wart
      • But it's not like there is an elegant solution to this besides "just never deepcopy, have fun debugging why nested task invocations are causing changes in behavior higher up the stack"
  • Ye olde "educate users" approach: instruct them to find other ways of passing around non-deepcopyable objects, or to (ugh) monkeypatch the objects in question so they either deepcopy happily or skip deepcopying explicitly.
    • Not great, obviously
    • But much less work, and less extra code on our end to develop its own bugs
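
To make the "crawl the tree ourselves" option above a bit more concrete, here's a rough sketch of what such a TypeError-tolerant copy might look like; the function name and the exact fallback behavior are illustrative only, not a proposed Invoke API:

  import copy

  def tolerant_copy(value):
      # Recreate nested dicts/lists, deepcopy leaves where possible, and fall
      # back to sharing the original reference when deepcopy raises TypeError.
      # Purely illustrative; not how Invoke actually implements anything.
      if isinstance(value, dict):
          return dict((key, tolerant_copy(val)) for key, val in value.items())
      if isinstance(value, list):
          return [tolerant_copy(item) for item in value]
      try:
          return copy.deepcopy(value)
      except TypeError:
          # e.g. thread locks or db clients: keep the original reference and
          # leave the "please don't mutate this" burden on the user.
          return value

As noted above, the inconsistency complaint still applies: leaves that fail to deepcopy silently stay shared.
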
@bitprophet
Member

Looking deeper at the fab -H use case - which is actually the only spot where I see us needing any sort of cloning in order for operations not to be buggy:

  • It's contained within Fabric 2 itself, which subclasses Executor, which may affect how much I need to do right away for this issue (re: making it trivial to turn off cloning by default)
  • The object that truly needs cloning is the invoke.tasks.Call (because it's being parameterized, unlike the average Invoke task call; though this is still a legit Invoke-level use case eventually, see #228)
  • That cloning must happen, but this is only because it must be a distinct Python object for purposes of modifying its Context (which is a Connection, a subclass of Context).
  • At no point is the parameterization affecting the inner Config object itself; it's just host (etc) state, which is part of Connection itself, and not its .config
    • I have far too many objects in my system named with a capital C...
  • Conceptually, this makes sense; even if you're doing "things" to various and sundry remote hosts, the actual background/context/config for them - at least at the start of your task - won't need to be any different.
  • However, if we fail to clone those configs, the moment users modify them, that will bleed across the parameterization, unless the parameterization is updated to use threading (& threadlocals), multiprocessing, or similar.
    • But that's the entire point of this particular half-solution: let users who are being careful (or not using parameterization at all) say "yea please take the safeties off, they're a pain". Big shrug.

So I think I can simply (famous last words) update Config.clone so it checks a config setting (since, why not) determining whether it actually clones or just short-circuits and returns itself?
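
Something along these lines, where the setting name and the toy class are invented for illustration (the real Config.clone signature and behavior differ):

  import copy

  class Config(dict):
      # Toy stand-in for invoke.config.Config; purely illustrative.
      def clone(self):
          # Hypothetical opt-out flag (name made up here): users who accept
          # the state-bleed tradeoff can short-circuit cloning entirely.
          if not self.get("clone_configs", True):
              return self
          # Otherwise behave as before and hand back a deep-copied instance.
          return copy.deepcopy(self)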

@bitprophet
Member

Grump grump grump. Problem: one other use of clone is to 'upgrade' config objects to a Config subclass (e.g. to grant them additional global defaults, but possibly arbitrary other methods/attributes as well).

So...if we make clone just a no-op, that use case has nowhere to go. I guess it's acceptable to still do the clone but skip all the deepcopying? (Maybe that's how it should work anyways; if we're neutering deepcopying, it doesn't matter a ton how it happens, stuff's gonna be dirty regardless of object identity.)
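
In other words, the 'upgrade' path could still build an instance of the target class but simply re-point it at the existing data instead of deepcopying it; a rough sketch under those assumptions (the class names and the shallow hand-off are made up, not Invoke's real code):

  class Config(dict):
      # Toy stand-in for invoke.config.Config.
      def clone(self, into=None):
          # Build an instance of the requested class (the 'upgrade' use case)
          # but hand it the same underlying data rather than deepcopying it.
          klass = into if into is not None else self.__class__
          new = klass()
          new.update(self)  # shallow: nested values keep their identity
          return new

  class FabricConfig(Config):
      # Hypothetical subclass adding extra defaults, a la fabric.Config.
      pass

  # upgraded = some_config.clone(into=FabricConfig)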

@bitprophet
Member
bitprophet commented Jan 18, 2017 edited

A side issue I touched on above: objects which can't be deepcopied will also resist most forms of concurrency that might underlie parameterization - they won't like being pickled into any sort of threading or multiprocessing queue, for example.

So even if this is solved temporarily it might still bite people in the Fabric 2 widely-parameterized use case of "do thing on many hosts (while, necessarily or no, having e.g. a cloud or db client in your config state)".

Makes me wonder: can we make it easy to defer creation of such objects until after any copying/queueing/etc has been performed? (Rough sketch after the list below.)

  • Right now I'm hitting this because my code reads auth creds from config files at startup (basically, at module load time; it's in a task decorator) and uses them to instantiate some boto and db clients necessary for most/all tasks in the local tree.
  • But actually - this should mean that they're being added after the main/core/Invoke-level deepcopying is done. So the only real deepcopy in play for me is that of the Config.clone(into=xxx) being called at the Fabric level (the 'upgrade').
  • That said, it's absolutely within scope for users to do identical things in, say, their task-collection-level config. However...currently that also happens after the last clone call.
  • However, there is one spot that happens pre-clone that is still reasonable, if not super common - creating rich objects in .py-format config files. These currently load at Config instantiation time.
    • But I've actually got a branch that moves the loading much closer to task exec time, to give lib users more control over IO (touching the fs at init time always makes me feel dirty).
    • While that has its own issues (loading those conf files multiple times - once per task if not once per parameterization of a task - which is non-ideal if anyone expects individual tasks to run super-fast) it would also mean that practically all user-driven config values - and certainly all the ones that are liable to be rich and un-deepcopyable - become loaded up post-clone.
  • If I go ahead with that, it means we could dial clone way back (or even remove it in favor of a minor overhaul of __init__?) so the only data deepcopied are the defaults, or perhaps there is no deepcopying after all.
    • Though this is all still about the path from startup to single task exec; multi-task exec and the spooky-action-at-a-distance problem still exist, so we may want to keep regular clone around. Harrumph. Boy I wish deepcopy was less of a pitfall.
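
Rough sketch of the deferral idea mentioned above (the factory key and helper are made up; this isn't an existing Invoke feature): store a plain callable in config and only call it inside the task, after all cloning/copying has happened.

  import copy
  import threading

  def make_db_client():
      # Hypothetical stand-in for building an expensive, un-deepcopy-able
      # client (boto connection, db driver, etc.).
      return threading.Lock()

  # Store the factory, not the client: functions deepcopy as themselves, so
  # the config tree stays safely copyable until a task actually needs one.
  config_data = {"db": {"client_factory": make_db_client}}

  cloned = copy.deepcopy(config_data)  # fine - no rich object exists yet

  def my_task(config):
      # Instantiate the rich object only now, post-clone / post-queueing.
      client = config["db"]["client_factory"]()
      return client
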
@bitprophet
Member

Back to the side issue of "using regular Invoke tasks/contexts, then wanting to use Fabric 2 Connection objects inside those tasks":

  • I'm doing it in a decorator, which is effectively the same as doing it inside the task body
  • This is 'late' enough that the config I was given doesn't actually need cloning for protection purposes, especially since Fabric itself is unlikely to modify the config (all Fabric-level things that aren't reading config settings are modifying attributes of the Connection object itself, not its Config).
  • So...a non-deepcopying clone(xxx, into=fabric.Config) should still be okay here. I think.
@bitprophet
Member

Another random thought as I poke at actually doing this: we already almost have a "reimplemented deepcopy" in merge_dicts, and it's already being used for a semi-deepcopy-esque case (preventing config subtrees from being shared references that would let later config sources pollute earlier ones, as in #420).

I could interpret this two ways:

  • If we're already eating some of the cost of a redone deepcopy, let's just keep expanding it so it looks into iterables and such too, and then no more deepcopy.
  • Isn't most of the worry about state bleed exactly situations like #420, and if all actual nested-dict values (which are by definition sub-configs) are being correctly recreated w/ distinct identity under merge_dicts, do we even really have to worry? The only objects which could still preserve references would be iterables and the like, but that's entirely up to the user anyways / is a common enough footgun on its own.

The latter is what I like, since it lets me try removing deepcopy (and replacing it with Even More merge_dicts) while expecting the fallout to not be super great. Further, the other state bleed problems - such as a subroutine tweaking ctx.run.hide and affecting calls past its own - aren't being solved by our deepcopying anyways.
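
To spell out that second interpretation: something merge_dicts-shaped only has to recreate the dict skeleton with fresh identities, leaving leaf values (rich objects included) as shared references. A sketch of the idea, not merge_dicts' actual code:

  def copy_dict_skeleton(source):
      # Recreate every nested dict with a fresh identity, but share leaf
      # values; sketch of the merge_dicts-style idea, not Invoke's real code.
      result = {}
      for key, value in source.items():
          if isinstance(value, dict):
              result[key] = copy_dict_skeleton(value)  # new sub-config dict
          else:
              result[key] = value  # leaf: shared reference, never deepcopied
      return result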

@bitprophet
Member

Have pared back most of the deepcopy use; there's still some in other bits I haven't touched in ages (mostly around parser and task/call related objects), but that shouldn't matter nearly as much since those objects won't contain as much user-generated data.

I'm now back to dealing with my real-world code instead of constantly bouncing off random unpicklable objects inside boto and db drivers.

@bitprophet bitprophet closed this Jan 18, 2017
@bitprophet
Member

Of note, this does not (AFAICT) solve #309 because while we're not calling deepcopy, we're still giving configs and their subdicts distinct identities, meaning task A saying ctx.foo.bar = 17 will not affect the value of ctx.foo.bar in task B.
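
A plain-dict illustration of why the distinct-identity copy covers this case (and where mutation of shared leaves still bleeds, per the iterable caveat above); purely illustrative:

  original = {"foo": {"bar": 1}, "numbers": [1, 2]}

  # Recreate the top-level and nested dicts, but share leaf values (same idea
  # as the merge_dicts-style sketch above).
  view = dict(
      (k, dict(v) if isinstance(v, dict) else v) for k, v in original.items()
  )

  view["foo"]["bar"] = 17              # rebinding in the copy...
  assert original["foo"]["bar"] == 1   # ...doesn't touch the original

  view["numbers"].append(3)            # mutating a shared leaf still bleeds
  assert original["numbers"] == [1, 2, 3]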

@bitprophet bitprophet added a commit that referenced this issue Jan 18, 2017
@bitprophet bitprophet Changelog re #421 aaa8d0b