Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should Nursery.start() be error-interruptible by cancellation? #2258

Open
goodboy opened this issue Feb 26, 2022 · 3 comments
Open

Should Nursery.start() be error-interruptible by cancellation? #2258

goodboy opened this issue Feb 26, 2022 · 3 comments

Comments

@goodboy
Copy link
Member

goodboy commented Feb 26, 2022

Synopsis

A somewhat convoluted/complex case I discovered while refining a "service manager" style system in tractor is where a so called "service nursery" experiences the following situation:

  • a first task is requested to start async in the background via Nursery.start_soon()
  • that same nursery's .start() is called to start another 2nd task which can execute (in theory async) up to its required .started() where it unblocks the caller but, between the parent:.start() -> child:.started() sequence may need synchronization with the previous first bg task (the syncing implemented with a trio.Event for example).
  • the first task errors and normally would be processed on Nursery exit

For further background on what the real-world multi-actor daemon service system looks like see this test. The TL;DR is that having a service nursery in the global state of a process (in tractor terms we call this actor local) is fairly useful for spawning long lived daemon service-actors which can be started and stopped from some (other) remote program (not in the process tree) where the lifetime of the daemons (async tasks running in processes) are still lifetime-managed using persistent trio nurseries.


Example of the issue

Here is a small pytest ready module which demonstrates how this scenario causes a hang:

import pytest
import trio
from trio_typing import TaskStatus


def test_stashed_child_nursery():

    _child_nursery = None

    async def waits_on_signal(
        ev: trio.Event(),
        task_status: TaskStatus[trio.Nursery] = trio.TASK_STATUS_IGNORED,
    ):
        '''
        Do some stuf, then signal other tasks, then yield back to "starter".

        '''
        await ev.wait()
        task_status.started()

    async def mk_child_nursery(
        task_status: TaskStatus = trio.TASK_STATUS_IGNORED,
    ):
        '''
        Allocate a child sub-nursery and stash it as a global.

        '''
        nonlocal _child_nursery

        async with trio.open_nursery() as cn:
            _child_nursery = cn
            task_status.started(cn)

            # block until cancelled by parent.
            await trio.sleep_forever()

    async def sleep_and_err(ev: trio.Event):
        await trio.sleep(0.5)
        doggy()
        ev.set()

    async def main():

        async with (
            trio.open_nursery() as pn,
        ):
            cn = await pn.start(mk_child_nursery)
            assert cn

            ev = trio.Event()
            cn.start_soon(sleep_and_err, ev)

            # this causes inf hang
            await cn.start(waits_on_signal, ev)

            # this does not.
            # cn.start_soon(waits_on_signal, ev)

    with pytest.raises(NameError):
        trio.run(main)

Analysis and discussion

@njsmith mentioned in chat

so what ends up happening is: sleep_and_err crashes with an error -> cn cancels everything inside it and tries to propagate the NameError -> but cn can't exit, because waits_on_signal is in a weird state where it's still executing in the parent, but cn knows that a start call is in progress. So cn waits to see what waits_on_signal is going to do -- if it exited, or if it called started, then either way cn can continue and propagate the NameError. But because start does neither, cn gets stuck forever, so the NameError can't propagate, so it can't reach pn and cause the start call to be cancelled.

I'm not sure if this suggests a problem in trio or not!

right now started always succeeds -- if start makes sure that the nursery is open and that it will stay open until the child calls started.

what I could imagine doing instead, is to say that the nursery closes while start is running, then started raises a RuntimeError, same as if you do start_soon into a closed nursery

The main question being which of the following 2 approaches should trio take:

  • keep .start() -> .started() never nursery-error-interruptible
  • the opposite such that this example test would not hang (my personal preference)
@GalaxySnail
Copy link
Contributor

I agree that this example test would not hang. Smith said:

when you use nursery.start, the function runs "under" the call to start until it calls started, and only then does it move into nursery

so if you're calling nursery.start from some other task outside of the nursery, and you want to cancel ev.wait(), then you have to cancel the nursery.start call, not the nursery itself

That's really suprising. In my opinion, in this test, nursery.start should raise an exception just like MemoryReceiveChannel.receive if the other side is closed; And a TaskStatus.started(something) acts like a MemorySendChannel.send(something), which can be cancelled easily.

@goodboy
Copy link
Member Author

goodboy commented May 12, 2022

I agree that this example test would not hang.

@GalaxySnail i presume s/would/should ?

The test definitely hangs 😂

@GalaxySnail
Copy link
Contributor

Thank you for correcting me! 😂 I'm not a native English speeker. I meant it really hangs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants