
asyncio: Use strong references for free-flying tasks #91887

Open
alexhartl opened this issue Apr 24, 2022 · 52 comments
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@alexhartl

alexhartl commented Apr 24, 2022

In #88831 @vincentbernat pointed out that CPython only keeps weak references in _all_tasks, so a reference to a Task returned by loop.create_task has to be kept to ensure the task is not killed with a "Task was destroyed but it is pending!" error at some random point in time.

When shielding a task from cancellation with await shield(something()), something() continues to run when the containing coroutine is cancelled. As soon as that happens, something() is free-flying, i.e. there is no reference to it from user code anymore. shield itself holds a bunch of circular strong references, but these shouldn't keep CPython from garbage-collecting the task. Hence, the same problem occurs here and the task might be killed unpredictably. Additionally, when running coroutines in parallel with gather and return_exceptions=False, an exception in one of the coroutines will leave the remaining tasks free-flying. In this case, too, the remaining tasks might be killed unpredictably.

Hence, a warning in the documentation for create_task unfortunately does not suffice to solve the problem. Additionally, it has been brought up in #88831 that an API for fire-and-forget tasks (i.e. when the user doesn't want to keep a reference) would be nice.
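
For reference, the workaround the create_task documentation currently recommends is to keep the strong reference yourself and drop it from a done callback; a minimal sketch:

import asyncio

background_tasks = set()

async def worker():
    await asyncio.sleep(1)

async def main():
    task = asyncio.create_task(worker())
    background_tasks.add(task)                        # strong reference while pending
    task.add_done_callback(background_tasks.discard)  # drop it once done
    await asyncio.sleep(2)

asyncio.run(main())

This works, but every call site has to remember to do it, which is part of the motivation for moving the bookkeeping into asyncio itself.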

As a solution, I suggest we either

(1) introduce a further _pending_tasks set to keep strong references to all pending tasks. This would be the simplest solution, also with respect to the API. In fact, a lot of discussions on Stack Overflow (e.g., here, here, here) already rely on this behavior (throwing away the reference returned by create_task), even though it is currently incorrect. Since the behavior of free-flying tasks is currently unpredictable, making it predictable by preventing them from being garbage-collected should not introduce any compatibility issues.

(2) make sure there's always a chain of strong references from the most basic futures to the running tasks awaiting something. A quick grep turned up potential problems, e.g., here and here. This does not seem like a very robust approach, though.

(3) introduce the concept of background tasks, i.e., tasks the user does not want to hold references to. The interface could look as suggested in #88831 (comment). Tasks from shield and gather could be automatically converted to such background tasks. Clearly, it would add complexity to the API, but the distinction between normal tasks and background tasks might also be beneficial for other purposes. E.g., one might add an API call that waits for all background tasks to complete.

My preferred solution would be (1).


@kumaraditya303
Contributor

You can use the new asyncio.TaskGroup to avoid this in 3.11+

@bast0006

It looks to me like it would be a trivial code change (keeping all active tasks in a separate, non-weak set) to gain a large benefit here.

This behavior is inherited from Futures (where it makes sense). If a future is not being kept track of, it's a bug (the developer forgot to await). However, tasks start executing on their own, if not immediately, so not awaiting a task is not as obviously a mistake.

Interrupting actively running code because someone didn't maintain a reference (on the GC's terms and timing) is a very different beast from declining to run code whose desired behavior and timing are unclear (and for which an exception can be triggered fairly promptly).

I'll have time this evening to open a PR, if nobody else wants to get to it.

@bast0006

bast0006 commented Feb 11, 2023

For impact perspective, discord.py drops the task after calling create_task on it.

https://github.com/Rapptz/discord.py/blob/742630f1441d4b0b12a5fd9a751ab5cd1b39a5c6/discord/client.py#L499

This means that (effectively) every Python-based Discord bot is affected by this, and every event dispatch appears to be technically at the whims of the GC.

@gvanrossum
Member

Rather than rhetoric about impact or how trivial the fix would be, we need a discussion on why we designed tasks this way in the first place. It doesn't look to me like it was inherited from Futures -- the variable is called _all_tasks, not _all_futures. Without understanding this, we risk making things worse with a hasty "fix".

That said, I'm not sure why we designed it this way, and (from private email) @1st1 doesn't seem to recall either. But it was definitely designed with some purpose in mind -- the code is quite complex and was updated every time an issue with it was discovered, and we've had many opportunities to change this behavior but instead chose to update the docs (gh-29163), even when asyncio itself suffered (gh-90467).

I also don't understand why the dict of all tasks is a global, since the only API that uses this (all_tasks()) filters out tasks belonging to a different event loop. Again, I assume there's a reason for the complexity, but I don't know what it is. Is it 3rd party event loops like uvloop? Or frameworks like aiohttp?

FWIW if we simply make _all_tasks a plain dict, we would have to arrange for tasks to remove themselves from the list when they exit, possibly in a done-callback (though there's a cost to that). There's already an API to do that (_unregister_task) but it doesn't seem to be called.

@alexhartl
Author

I agree that this should be approved by someone who is familiar with the code's design objectives, which is why I brought up two alternatives and did not create a PR right away.
I would not turn _all_tasks into a plain dict (or set), since this might break code that assumes that finished tasks (that have a user reference somewhere) are still returned by all_tasks(). The suggested change is to have an additional _pending_tasks set. Yes, we would need a done-callback, which would introduce a slight performance penalty.
In any case, IMHO the current behavior is somewhat annoying and misses some fundamental functionality.
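
As a rough user-level approximation of suggestion (1), one can already install a task factory that keeps the strong references; a minimal sketch (names like keepalive_task_factory and _demo_pending_tasks are illustrative, not proposed API):

import asyncio

def keepalive_task_factory(loop, coro):
    # Illustrative only: approximates the proposed _pending_tasks set in
    # user code. The loop owns the strong set; a done callback clears it.
    task = asyncio.Task(coro, loop=loop)
    pending = getattr(loop, "_demo_pending_tasks", None)
    if pending is None:
        pending = loop._demo_pending_tasks = set()
    pending.add(task)
    task.add_done_callback(pending.discard)
    return task

async def main():
    asyncio.get_running_loop().set_task_factory(keepalive_task_factory)
    asyncio.create_task(asyncio.sleep(1))  # safe even without a user reference
    await asyncio.sleep(2)

asyncio.run(main())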

@bast0006

bast0006 commented Feb 13, 2023

Sorry, since writing that comment I've been reading more about the current state and related issues, and there is absolutely a lot more detail behind the scenes (like this comment, which I only got to today: #80788 (comment)), which mirrors some of what you've said here.

I was working with the idea that Tasks inherit from Futures because that was how they were introduced in the original PEP 3156.

I think I have found the origin of why it was made a weakset in the first place: https://groups.google.com/g/python-tulip/c/13hfgbKrIyY/m/HpwWPGHKT6IJ

Specifically, this patch:
https://codereview.appspot.com/14295043/diff/1/tulip/tasks.py

So it was originally intended as a registry to help recover stuck tasks and retrieve their stack frames?

@gvanrossum
Member

(Sorry, I hit some wrong buttons.)

So it was originally intended as a registry to help recover stuck tasks and retrieve their stack frames?

But apparently at that point it was already the case that tasks had to be kept alive by their owner -- ISTM a weakset was used specifically to avoid keeping tasks alive longer than necessary.

So this has always been part of the design, it just wasn't made explicit in the docs. I'm still curious why we originally designed it that way. It's possible that we never consciously realized this constraint. It's also possible that, since we did make Task a subclass of Future, we assumed that tasks would always be awaited.

I am still not convinced, despite this being a common stumbling point, that we can fix this without consequences for user code.

@bast0006

bast0006 commented Feb 17, 2023

#65362 made this make a little more sense to me.

Tasks can't be dropped if they're actively executing or scheduled to execute (_ready), because the event loop holds strong references to them; a comment at the start of the Task class notes a bit about this, although I'm not sure whether it's intended to apply only to execution or to reference management as well.

Tasks only appear to be GCable if they're blocking on another future which ends up being GCable itself. If the blocking future is _ready/executing, then it too is kept alive through the above invariant, and there's no chance of losing the dependent task.

It makes sense to error and garbage collect if the chain is broken, because we've got a "proven" memory leak (the task cannot be woken again if its dependent future is unreachable).

@alexhartl
Author

Yes, it is possible to fix this by making sure there's always a chain of strong references from the most basic futures to the running tasks awaiting something. But I have a feeling this would be a somewhat fragile approach, both with respect to future changes in the stdlib and to futures that a user creates. A year ago I grepped asyncio's source and found two spots where weak references were used that might cause problems. I don't know whether those have been fixed or modified in the meantime.

I hadn't thought of it as causing a provable memory leak, though. Good point.

@alexhartl
Author

So how about adding a keyword argument to create_task that allows adding the task to a _pending_tasks set? We could add a note in the documentation warning about potential memory leaks.

@bdarnell
Contributor

bdarnell commented May 4, 2023

So it was originally intended as a registry to help recover stuck tasks and retrieve their stack frames?
But apparently at that point it was already the case that tasks had to be kept alive by their owner -- ISTM a weakset was used specifically to avoid keeping tasks alive longer than necessary.

It was already the case that tasks had to be kept alive by their owner, but I'm not sure this was understood to be the case. If you look back at the diff that @bast0006 found, there were additional methods all_done_tasks and all_failed_tasks that were present in Tulip but didn't make the transition to asyncio. One reason this may have been done was to diagnose reference cycle complexities with the coroutine runner (I saw a lot of this in Tornado's coroutines). If that was the intention, a weak set would have been necessary to make the "done" and "failed" tasks iterable while they're waiting for GC without keeping them alive unnecessarily.

I also don't understand why the dict of all tasks is a global, since the only API that uses this (all_tasks()) filters out tasks belonging to a different event loop. Again, I assume there's a reason for the complexity, but I don't know what it is. Is it 3rd party event loops like uvloop? Or frameworks like aiohttp?

It could just be a historical accident. IIRC, early on the event loop didn't have any knowledge of tasks, so all_tasks was tracked in class attributes of Task instead of introducing such a link.

I am still not convinced, despite this being a common stumbling point, that we can fix this without consequences for user code.

I appreciate your caution here, but I'll point out that the docs for create_task encourage users to create strong references from global variables to all of their Tasks, so if there are risks to doing this in create_task itself, the docs should reflect the same considerations we're discussing here. (That's how I came to this issue. I was thinking about making a global background_task set as documented, but realized that create_task could be doing that for me but it's not, and I don't understand why not)

Arguments for having create_task manage a set of strong references to all active tasks:

  • Tasks are analogous to threads, and running threads are GC roots
  • The vast majority of tasks are strongly-referenced by the event loop anyway, when they are awaiting network activity, timers, etc.
  • We're already paying the price to maintain a weak set, so the additional cost of making it a strong set (and cleaning up completed tasks) should be minimal. (I'd be loath to introduce a strong set here if we didn't already have the weak one as precedent)

Arguments against:

  • If someone is somehow running a lot of tasks that become eligible for GC but never complete, they'd have a memory leak
  • The cost may be minimal, but it's not free
  • Not all coroutine implementations have this problem (to the best of my knowledge, this was never a problem for generator-based coroutines in Tornado until we integrated with asyncio for native coroutines. But we also never had logging in destructors so it would be much harder to tell if it were a problem). Are there other solutions we could consider copying?

The worst-case scenario that I can see if we make this change would be in a program that calls asyncio.run in a loop, and each time it leaves some tasks running (maybe waiting on a sleep) without cancelling them (or they catch the asyncio.CancelledError that is thrown into them when the loop shuts down). That's a case where the weak set really would save us from a memory leak. But the risk goes away if the task set were moved from a global to an attribute of the event loop, since the loop and all pending tasks could go away at once. That feels like a fairly safe solution to me, except for the fact that it's now two changes instead of one, to code that is already poorly understood :)

@itamaro
Contributor

itamaro commented May 5, 2023

I would be interested in seeing "all_tasks" going from a state-global weakset of "alive tasks" to a per-loop strong-set of pending tasks.

interestingly (maybe), my motivation is not enabling free-flying tasks - it is performance!
adding tasks to the all_tasks weakset is pretty slow (because WeakSet is pretty slow) - using a regular set instead makes it faster. (10-20% faster across the async benchmarks based on recent experimentation by @jbower-fb and reproduced by me)

some observations from trying to dig through the evolution of all_tasks from tulip to today:

  • it seems that there was an intentional effort over the years to keep the task and the loop decoupled. maybe this helps third party loop providers and task providers? moving the registry to the loop would introduce such coupling (maybe that's ok - I don't really know).
  • the register_task and unregister_task APIs are relatively recent additions, and asyncio itself doesn't really use unregister_task to keep the registry tidy - it relies on the weakness of the set.
    • it should be straightforward to add an explicit unregister_task in asyncio, and it's probably a reasonable expectation that third-party loop/task providers use these APIs properly.
    • the extra cost of calling unregister_task would be more than offset by the gain from a set being much more efficient than a WeakSet.
    • I'm not sure though it would be easy (or possible) to solve the problem of "stuck tasks" being kept alive forever by the strong reference. I think the test_log_destroyed_pending_task test demonstrates this. maybe this is also ok?
  • the all_tasks() API itself (in CPython asyncio, not tulip) never really returned all tasks (only briefly when introduced, but it was changed to pending tasks before it was released, in 3.7 I think).

so... considering there's some convoluted history behind all_tasks in general, and it being a WeakSet in particular, and changing this would make some users happy (free-flying tasks! perf!) and others unhappy (backwards incompatible change!) -- is there a realistic path to changing this?

@gvanrossum
Member

so... considering there's some convoluted history behind all_tasks in general, and it being a WeakSet in particular, and changing this would make some users happy (free-flying tasks! perf!) and others unhappy (backwards incompatible change!) -- is there a realistic path to changing this?

Give it a try, but aim for 3.13. (Just sitting on the PR for 2.5 weeks should do it.)

@jbower-fb
Contributor

jbower-fb commented May 6, 2023

I'm not sure though it would be easy (or possible) to solve the problem of "stuck tasks" being kept alive forever by the strong reference. I think the test_log_destroyed_pending_task test demonstrates this. maybe this is also ok?

The example in that test can be reduced to:

async def c():
    fut = asyncio.get_running_loop().create_future()  # a bare future that never resolves
    await fut

With a strong-set this will be kept alive, but with a weak-set it will be GC'd.

I find this pretty unintuitive, but it happens because there are no references to the Task/coroutine in the event loop. What is supposed to happen is that someone calls set_result() or set_exception() on the Future. This in turn would cause the Future to schedule a callback onto the event loop that re-schedules the Task. That will of course never happen here, because the coroutine is blocked and is the only thing holding a reference to the Future. With all_tasks as a weak-set, this free-standing reference cycle is GC'able.
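
For reference, the intact chain looks like this when someone does call set_result() (a minimal sketch):

import asyncio

async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()

    async def waiter():
        await fut            # suspends; fut now holds a callback that wakes the task
        print("woken")

    task = asyncio.create_task(waiter())  # strong reference held here, so no GC
    await asyncio.sleep(0)   # let waiter() start and block on fut
    fut.set_result(None)     # schedules the done callback, which re-schedules the task
    await task

asyncio.run(main())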

In my mind, anyone writing code like this deserves a memory leak. They might even want a leak as this might be one way of spotting a programming error here.

@kumaraditya303
Contributor

@itamaro Your comment seems similar to what I have planned in #80788 (comment) to do in 3.13.

@bdarnell
Contributor

bdarnell commented May 7, 2023

it seems that there was an intentional effort over the years to keep the task and the loop decoupled. maybe this helps third party loop providers and task providers? moving the registry to the loop would introduce such coupling (maybe that's ok - I don't really know).

One way to avoid expanding the AbstractEventLoop interface would be to make all_tasks become a WeakKeyDictionary[EventLoop, Set[Task]], so that everything still becomes eligible for GC when the event loop is discarded, but the event loop does not need to explicitly know what's going on. But I think it would be cleaner to make it a responsibility of the event loop if we can tolerate the backwards incompatibility (I suspect uvloop might actually prefer it this way - I'm sure they'd rather avoid an indirection through a WeakKeyDictionary).
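
A sketch of that shape (illustrative module-level names, not asyncio's actual internals):

import weakref

# Strong per-loop buckets inside a weak-keyed outer mapping: tasks stay
# alive while their loop does, and a whole bucket dies with its loop.
_tasks_by_loop = weakref.WeakKeyDictionary()

def _register_task(task):
    _tasks_by_loop.setdefault(task.get_loop(), set()).add(task)

def _unregister_task(task):
    _tasks_by_loop.get(task.get_loop(), set()).discard(task)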

This will of course never happen as the coroutine is blocked and is the only thing with a reference to the Future.

The Future will never resolve, but there is still one way the Task can be rescheduled: if someone cancels the task via the reference in all_tasks (which is done automatically by asyncio.run()). This was suggested in tornadoweb/tornado#3173 as a kind of hacky "shutdown hook" to allow for some cleanup operations. One of the surprising quirks is that it works with an arbitrarily long await asyncio.sleep (because the event loop's timer queue keeps a strong reference) but it doesn't work to await a plain Future that never resolves (or equivalently an asyncio.Event, etc).
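
A sketch of that quirk (exact behavior depends on GC timing and Python version):

import asyncio
import gc

async def sleeper():
    try:
        await asyncio.sleep(3600)       # the timer queue holds a strong reference
    except asyncio.CancelledError:
        print("sleeper cancelled at shutdown")   # the hacky "shutdown hook" fires
        raise

async def future_waiter():
    await asyncio.get_running_loop().create_future()  # nothing strong refers to this

async def main():
    asyncio.create_task(sleeper())
    asyncio.create_task(future_waiter())
    await asyncio.sleep(0.1)
    gc.collect()   # future_waiter may be destroyed here; sleeper survives

asyncio.run(main())   # run() cancels whatever all_tasks() still sees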

In my mind, anyone writing code like this deserves a memory leak. They might even want a leak as this might be one way of spotting a programming error here.

I've made the same argument in the other direction: you might want the "task was destroyed but it is pending" log message because it's a more legible indication of a problem than a memory leak (which needs to either be large or occur frequently to be noticeable). But I think that overall, if someone writes a task that waits forever, they should get a task that lives forever (with the memory consumption that implies).

@jbower-fb
Contributor

you might want the "task was destroyed but it is pending" log message because it's a more legible indication of a problem than a memory leak

Ah! But this is less likely to happen than you might think. Consider this runnable example:

import asyncio

async def x():
    await asyncio.get_running_loop().create_future()  # never resolves

async def y():
    asyncio.create_task(x())  # the returned reference is dropped immediately

asyncio.run(y())

This actually completes silently because the manually created task ends up being cancelled.

@jbower-fb
Contributor

One way to avoid expanding the AbstractEventLoop interface would be to make all_tasks become a WeakKeyDictionary[EventLoop, Set[Task]]

But this will have the downside @itamaro and I want to avoid: WeakKeyDictionary (implemented in Python) is likely slower than native strong-sets.

The Future will never resolve, but there is still one way the Task can be rescheduled: if someone cancels the task via the reference in all_tasks (which is done automatically by asyncio.run()). This was suggested in tornadoweb/tornado#3173 as a kind of hacky "shutdown hook" to allow for some cleanup operations. One of the surprising quirks is that it works with an arbitrarily long await asyncio.sleep (because the event loop's timer queue keeps a strong reference) but it doesn't work to await a plain Future that never resolves (or equivalently an asyncio.Event, etc).

Interesting, I hadn't thought of cancellation as a way to wake up the task. Although overall this sounds like an even stronger argument for a strong-set, as it would at least make things consistent.

@itamaro Your comment seems similar to what I have planned in #80788 (comment) to do in 3.13.

@kumaraditya303 what you describe there sounds great if you're okay with using a strong-set. Is that the case?

bdarnell added a commit to bdarnell/tornado that referenced this issue May 15, 2023
Per the warning in the asyncio documentation, we need to hold a strong
reference to all asyncio Tasks to prevent premature GC. Following
discussions in cpython (python/cpython#91887),
we hold these references on the IOLoop instance to ensure that they are
strongly held but do not cause leaks if the event loop itself is
discarded.

This is expected to fix all of the various "task was destroyed but
it is pending" warnings that have been reported. The
IOLoop._pending_tasks set is expected to become obsolete if
corresponding changes are made to asyncio in Python 3.13.

Fixes tornadoweb#3209
Fixes tornadoweb#3047
Fixes tornadoweb#2763

Some issues involve this warning as their most visible symptom,
but have an underlying cause that should still be addressed.
Updates tornadoweb#2914
Updates tornadoweb#2356
@bdarnell
Contributor

FYI I've implemented a Tornado-specific version of the change proposed here (a strong set of references to pending tasks as an attribute of the event loop, cleared via Future.add_done_callback) in tornadoweb/tornado#3269. Assuming some version of this idea makes it into a future version of python, I'll revert tornado's version.

SomberNight added a commit to SomberNight/electrum that referenced this issue Jul 17, 2023
This clears up log spam for regtest tests.

related:
- https://bugs.python.org/issue44665
- python/cpython#88831
- https://textual.textualize.io/blog/2023/02/11/the-heisenbug-lurking-in-your-async-code/
- python/cpython#91887 (comment)
- "Task was destroyed but it is pending!"

Perhaps we should inspect all our usages of
- asyncio.create_task
- loop.create_task
- asyncio.ensure_future
- asyncio.run_coroutine_threadsafe
?

Example log for running a regtest test:
```
$ python3 -m unittest electrum.tests.regtest.TestLightningAB.test_collaborative_close
***** test_collaborative_close ******
initializing alice
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/home/user/wspace/electrum/electrum/logging.py", line 44, in format
    record = copy.copy(record)  # avoid mutating arg
  File "/usr/lib/python3.10/copy.py", line 92, in copy
    rv = reductor(4)
ImportError: sys.meta_path is None, Python is likely shutting down
Call stack:
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1781, in call_exception_handler
    self._exception_handler(self, context)
  File "/home/user/wspace/electrum/electrum/util.py", line 1535, in on_exception
    loop.default_exception_handler(context)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1744, in default_exception_handler
    logger.error('\n'.join(log_lines), exc_info=exc_info)
Message: "Task was destroyed but it is pending!\ntask: <Task pending name='Task-2' coro=<Abstract_Wallet.on_event_adb_set_up_to_date() running at /home/user/wspace/electrum/electrum/wallet.py:485> wait_for=<Future finished result=0> cb=[_chain_future.<locals>._call_set_state() at /usr/lib/python3.10/asyncio/futures.py:392]>"
Arguments: ()

[--- SNIP --- more of the same --- SNIP ---]

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/home/user/wspace/electrum/electrum/logging.py", line 44, in format
    record = copy.copy(record)  # avoid mutating arg
  File "/usr/lib/python3.10/copy.py", line 92, in copy
    rv = reductor(4)
ImportError: sys.meta_path is None, Python is likely shutting down
Call stack:
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1781, in call_exception_handler
    self._exception_handler(self, context)
  File "/home/user/wspace/electrum/electrum/util.py", line 1535, in on_exception
    loop.default_exception_handler(context)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1744, in default_exception_handler
    logger.error('\n'.join(log_lines), exc_info=exc_info)
Message: "Task was destroyed but it is pending!\ntask: <Task pending name='Task-31' coro=<Abstract_Wallet.on_event_adb_set_up_to_date() running at /home/user/wspace/electrum/electrum/wallet.py:485> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.10/asyncio/futures.py:385, Task.task_wakeup()]> cb=[_chain_future.<locals>._call_set_state() at /usr/lib/python3.10/asyncio/futures.py:392]>"
Arguments: ()
true
true
true
true
funding alice
```
SomberNight added a commit to spesmilo/electrum that referenced this issue Jul 17, 2023
@CendioOssman
Contributor

Okay, I found the issue with gh-90467. It was caused by the fix in gh-78819. The weak link between StreamReaderProtocol and StreamReader breaks the chain between the event loop and the coroutine waiting for the StreamReader object.

The chain, as far as I understand it:

BaseSelectorEventLoop => EPollSelector => SelectorKey => Handle => _SelectorSocketTransport => StreamReaderProtocol =×=> StreamReader => Future => Task

If I convert that weak reference to a strong one, then the bug goes away.

@bdarnell
Contributor

bdarnell commented Nov 2, 2023

Yes, that's a familiar pattern from Tornado too - see a circular reference, try to break it by making one direction a weak reference, but it turns out that reference was the one holding the whole structure together. This is one of the consequences of my comment about the direction of the object graph being the opposite of what you expect - it's easy to put a weak reference in the wrong place.

It's possible that this is the whole problem - if we never used weak references and just allowed the strong reference cycles to exist, maybe there wouldn't be any way for "free-flying tasks" to get prematurely garbage collected. But then we'd have a different problem, of the memory and CPU cost of relying more heavily on the full GC to free up these object cycles.

The global set of all pending tasks avoids this cycle GC problem by clearing the map entry when the Future resolves. Maybe there's a way to use callbacks on the future to break cycles manually instead of using weak references? But that's difficult to apply systematically; the nice thing about the global set is that even though it's inelegant, it does solve the problem for all tasks without having to fix all the individual instances of the problem.
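
For illustration, manual cycle-breaking might look like this (a sketch under the assumption that the protocol owns the future; not the actual streams code):

import asyncio

class CycleBreakingProtocol:
    # Sketch: keep the strong reference cycle while the future is pending,
    # and cut it from a done callback once the future resolves, so the
    # cycle collector never has to discover it.
    def __init__(self, loop):
        self.waiter = loop.create_future()
        self.waiter.add_done_callback(self._break_cycle)

    def _break_cycle(self, fut):
        self.waiter = None   # drop the strong reference once resolved

The streams fix in gh-78819 went the other way (a weakref), which is exactly the link that broke the chain described above.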

@alexhartl
Author

Okay, I found the issue with gh-90467. It was caused by the fix in gh-78819. The weak link between StreamReaderProtocol and StreamReader breaks the chain between the event loop and the coroutine waiting for the StreamReader object.

The chain, as far as I understand it:

BaseSelectorEventLoop => EPollSelector => SelectorKey => Handle => _SelectorSocketTransport => StreamReaderProtocol =×=> StreamReader => Future => Task

If I convert that weak reference to a strong one, then the bug goes away.

This was also the issue that I encountered, AFAIR. In my top post I mentioned another weak reference that might cause issues (not tested, though). It might be possible to avoid such bugs by making these references strong ones. But then this behavior has to be documented somewhere to make future developers aware of the problem. Also, as with _all_tasks, someone presumably had reasons for choosing weak references there (as @CendioOssman found).

All in all, I would prefer solving the problem on the task level (i.e. introducing a _pending_tasks set or keeping strong references in _all_tasks), as this seems to be a much more robust solution and less convoluted modification than changing those weak references.

@CendioOssman
Contributor

I can understand the concern, and I'm growing more open to the conclusion that some extra references to tasks are the lesser evil.

I am a bit concerned that this is a broader problem than just tasks, though, as it fundamentally has to do with how the event loop works rather than with tasks specifically. Perhaps weak references should be viewed as something very dangerous, and the barrier for accepting their usage should be much higher?

What I'd like to avoid is also throwing out the good effects the current system has. Specifically, that the test test_tasks.py:BaseTaskTests.test_log_destroyed_pending_task() continues to work.

@CendioOssman
Contributor

The discussion here so far has been very long-term and focused on future-proofing. However, this test case is currently broken:

import asyncio
import gc

async def background():
    remote_read, remote_write = await asyncio.open_connection("example.com", 443, ssl=False)
    await remote_read.read()

async def cleanup():
    while True:
        gc.collect()
        await asyncio.sleep(3)

async def main():
    asyncio.create_task(background())
    await cleanup()

asyncio.run(main())

Should I open a new issue for that, or do we deal with it here?

@gvanrossum
Member

@CendioOssman If you have a solution in mind, you can submit a PR and link it to this issue in the PR title.

@CendioOssman
Contributor

I have an idea at least. I'll open a PR once I have something that seems to work.

However, after looking more at this, the active task list unfortunately doesn't completely solve the problem. So I'm back to preferring some ban on weak references. You can hit this issue even without coroutines:

import asyncio
import functools
import gc
import weakref

class Proto:
    def __init__(self, reader):
        self.reader_wr = weakref.ref(reader)
        asyncio.get_event_loop().call_later(20, self._wake_reader)

    def _wake_reader(self):
        self.reader_wr().wake()

class Reader:
    def __init__(self):
        self.proto = None

    def set_proto(self, proto):
        self.proto = proto

    def wait(self):
        self.waiter = asyncio.get_event_loop().create_future()
        return self.waiter

    def wake(self):
        self.waiter.set_result(None)

class LogFunc:
    def __init__(self):
        self._done = False
    def __del__(self):
        if not self._done:
            context = {
                'message': 'Callback was destroyed but it is pending!',
            }
            asyncio.get_event_loop().call_exception_handler(context)
    def __call__(self, *args, **kwargs):
        self._done = True
        print("Foo", args, kwargs)

def start():
    reader = Reader()
    proto = Proto(reader)
    reader.set_proto(proto)
    fut = reader.wait()
    fut.add_done_callback(functools.partial(LogFunc(), reader, proto))

async def cleanup():
    while True:
        gc.collect()
        await asyncio.sleep(3)

async def main():
    start()
    await cleanup()

asyncio.run(main())

The above example constructs the same weak reference structure as the streams API, but there are no tasks involved that could be used to anchor things.

@gvanrossum
Copy link
Member

That example just shows what weakrefs do: they don't keep an object alive. That's a feature of weakrefs. If you want the object to stay alive, don't use a weakref. So what's the point of your example?

@CendioOssman
Contributor

Indeed. But the example uses the same design as StreamReader and StreamReaderProtocol currently uses (from gh-78819).

So this is the type of code we are trying to make "safe" by providing a strong reference elsewhere. The point of the example was to show that the suggested fix of having a list of tasks will be insufficient in that goal. So something more would be needed.

@gvanrossum
Member

@CendioOssman I'm afraid I've lost track of what you're arguing. Likely you are in violent agreement with the other participants here. You speak of a ban on weak refs. Do you mean everywhere, or everywhere in asyncio, or in a specific place in asyncio? Or do you mean this as a general recommendation to asyncio users? And you claim that a list of active tasks isn't sufficient. I'd think that would be sufficient for preventing active tasks from being lost -- are there other things in asyncio that aren't kept alive? (Handles or callbacks?)

@bdarnell
Contributor

@CendioOssman Thank you for the example in #91887 (comment), this is the simple non-Tornado example we're looking for. No weakrefs here (unless there are some in asyncio itself). I'm a little confused by your most recent example, but the first one works to illustrate the problem we're discussing in this issue.

For those following along who haven't run the code, the example I linked will log a "Task was destroyed but it is pending" message for the background() coroutine. In this particular example the coroutine doesn't actually do anything upon completion, but you could add a print() call to the end of the function and see that it never gets called. Instead, a GeneratorExit exception is raised by one of the awaits when GC runs.

So this is an illustration of the guideline in the docs to save a reference to the result of create_task. If create_task added all pending tasks to a global set, this code would work as expected. Separately, you say that the second example shows that there are unexpected results even without create_task, but given the use of weakref I'm not sure whether that's a real issue or not. If you want to argue for the relevance of the second example, could you say more about why this particular use of weakref is expected and the GC behavior is at fault?

@alexhartl
Author

I agree with @gvanrossum that just because it's possible to construct a scenario where a weak reference's object is destroyed, that doesn't necessarily mean that that's a problem.

As I see it, the entire idea behind garbage collection is that objects that are no longer reachable by user code can be destroyed. Since user code is executed by tasks in asyncio, the natural assumption is that strong references are held from the side that executes user code, i.e. by tasks. Apparently the asyncio standard library was also designed with this intuition in mind. Introducing a _pending_tasks set would live up to this intuition and fix the issue. I'm not sure whether anyone is working on this at the moment; if not, I can suggest a PR.

@gvanrossum
Member

Is there still an action item here? If not, I can close it (either as "won't fix" or "fixed" depending on the mood).

@bdarnell
Contributor

bdarnell commented Feb 1, 2024

Yes, there's still an issue here. There's some discussion of weak references and whether we should care, but that's kind of a distraction; we have examples that involve no weak references and use only core asyncio code.

This minimal example was provided by @CendioOssman in #91887 (comment) and I'll repeat it here:

import asyncio
import gc

async def background():
    remote_read, remote_write = await asyncio.open_connection("example.com", 443, ssl=False)
    await remote_read.read()

async def cleanup():
    while True:
        gc.collect()
        await asyncio.sleep(3)

async def main():
    asyncio.create_task(background())
    await cleanup()

asyncio.run(main())

This is a background task which does not follow the docs' guidance to hold a strong reference to the result of create_task. The task gets garbage collected and its side effects (if any) won't happen.

If this were the only instance, perhaps we could say that asyncio is working as documented and nothing needs to change. But going back to the original message in this issue, asyncio.shield() and asyncio.gather(..., return_exceptions=False) create very similar situations, and in those cases there is no natural (non-global) place to store any strong references.
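
To make the shield() case concrete, here is a sketch (whether and when the inner task is actually collected depends on version and GC timing):

import asyncio
import gc

async def shielded_work():
    await asyncio.get_running_loop().create_future()  # stands in for real work

async def caller():
    await asyncio.shield(shielded_work())

async def main():
    t = asyncio.create_task(caller())
    await asyncio.sleep(0.1)
    t.cancel()               # caller() stops, but the inner task keeps running...
    await asyncio.sleep(0.1)
    gc.collect()             # ...now free-flying, so it may be collected here

asyncio.run(main())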

I continue to think that a global _pending_tasks set is the best solution here (this is what I implemented in Tornado where we see a similar problem). An alternative would be to introduce an explicit notion of "background tasks" and only store global references to those, but this would remain error-prone.

@CendioOssman
Contributor

I've been meaning to come back to this issue and provide some tangible change suggestions. So I would appreciate it if things could be kept open for a while longer. :)

@gvanrossum
Member

Let's call the example with classes Proto, Reader and LogFunc a red herring. Yes, it shows that such a structure is possible without tasks, and I believe that asyncio's stream reader design has a similar structure, but I believe that if someone is using a stream reader without tasks they have plenty of opportunity to put in hard references, and they should.

So let's focus on the shorter example with coroutines background() and cleanup(), which demonstrates that there's a more general issue when tasks are not kept alive by the user (regardless of whether they are using stream readers).

Like Ben, I don't immediately see a downside to having a global set of pending tasks either, as long as tasks are guaranteed to remove themselves from it when they complete (and without the need for __del__).

That doesn't mean there isn't a downside, but we'll never find out until we try. We'll probably have to ask @1st1 to think about this -- there might be a reason that involves uvloop or Edgestore. Possibly the fact that uvloop has its own task factory makes things more complicated -- we should definitely think about that some more. Perhaps we can use a done callback that unlinks the task when it completes?

@CendioOssman, are you interested in coming up with a PR? Or do you continue to feel that the other example must also be fixed? In that case we may have to agree to disagree, and someone else can create a PR.

@Falmarri

Here's a concrete example of a hack to workaround this issue https://github.com/Falmarri/podman-compose/blob/8d8fa54855ce7eb73e802ef06e9f48645a30e2ac/podman_compose.py#L1229-L1237

It's possible there's a better way of structuring this, but IMO this should just be the default: I shouldn't have to care about it. I should be able to just start these tasks and not worry.

@gvanrossum
Member

I have a feeling we need a new champion to drive a PR here. I think a global or per-loop (non-weak) set of active tasks should solve the issue, but there are a bunch of details that need sorting through -- notably how we ensure that 3rd party tasks (e.g. from uvloop) are properly inserted into and removed from the set, without requiring changes to the 3rd party library.

@itamaro
Contributor

itamaro commented Feb 26, 2024

I have a feeling we need a new champion to drive a PR here. I think a global or per-loop (non-weak) set of active tasks should solve the issue

cc'ing @kumaraditya303, in relation to gh-104787 and #80788 (comment) - are you still planning to tackle this for 3.13?

@willingc
Contributor

@itamaro @kumaraditya303 Unless one of you is actively working on a PR for this, I would recommend we move this item back to TO DO. Thoughts?

@itamaro
Contributor

itamaro commented Jun 19, 2024

@itamaro @kumaraditya303 Unless one of you is actively working on a PR for this, I would recommend we move this item back to TO DO. Thoughts?

Agreed. I haven't worked on this.

@willingc
Contributor

Moving back to To Do for now.

@willingc
Contributor

Flagging this issue for discussion at the core dev sprint unless there is a champion before the sprint.

@alexhartl
Author

I have created a draft PR at #121264. In this PR, I have not made this feature optional. I'm open to adding the task to _pending_tasks only if an optional keyword argument is set.

Potential for Memory Leaks

Whenever a future's state transitions out of the _PENDING state (due to finishing, cancellation, or an exception), _finish_execution will be triggered and the task will be removed from _pending_tasks. I've implemented the _pending_tasks set as an attribute of the event loop to ensure that no memory leaks are possible when asyncio is deinitialized. I.e., if you drop all references to the loop while a task is still pending, you will still get the "Task was destroyed but it is pending!" error. I think this is much more predictable and robust than the current behavior.

Use of _unregister_task

On calling _unregister_task, asyncio currently removes the task from _scheduled_tasks. asyncio's documentation for _unregister_task says "The function should be called when a task is about to finish." Together, this is inconsistent with asyncio's main implementation, which does not remove tasks from _scheduled_tasks when they finish, but only when they're deleted.
I've changed _unregister_task to remove the task only from _pending_tasks but not from _scheduled_tasks, which makes it consistent with the documentation. This might, of course, break old code that relies on the old behavior. The only code I could find online that uses the _unregister_task interface is Tornado. Tornado is consistent with the documented behavior, i.e. the new implementation.

uvloop

As far as I can see, uvloop uses asyncio's Task. Therefore, tasks will be registered and unregistered correctly in _pending_tasks within Task.__init__ and Task._finish_execution.

alexhartl added a commit to alexhartl/cpython that referenced this issue Jul 12, 2024