coroutine semantics are unsound. proposal to fix them #1363

Open
andrewrk opened this Issue Aug 10, 2018 · 13 comments

Comments

@andrewrk
Member

andrewrk commented Aug 10, 2018

Currently there is a race condition when using multithreaded coroutines. Consider the following scenario:

  1. Thread 1 does epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.
  2. Thread 2 is executing the coroutine, which reaches a suspend point, and then reads its atomic state and learns that it is scheduled to be canceled (destroyed). It runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. (2) proceeds to destroy the coroutine frame. The memory is gone.
  3. Thread 1 does resume coroutine_handle which now points to invalid memory. Boom.

In order to fix this, we need to introduce new syntax and semantics. Current semantics are:

  • async creates a promise which must be consumed with cancel or await.
  • suspend must be consumed with a resume.

The problem described above is resume and cancel racing with each other. When a suspended coroutine is canceled, the memory must not be destroyed until the suspend itself is canceled or resumed. Proposal for new semantics:

  • async creates a promise which must be consumed with cancelasync or await.
  • suspend must be consumed with a cancelsuspend or resume.

With these semantics, each coroutine essentially has 0, 1, or 2 references, and when the reference count reaches 0 it is destroyed. There is the "async reference", which is the main one, and the "suspend reference", which might never exist.

Therefore it is crucial that a coroutine using suspend (a low-level feature intended mainly for library implementations of concurrency primitives) ensures the suspend will be consumed with either resume or cancelsuspend.

defer and errdefer will both run when a coroutine's cancellation process begins. This means that the first cancelasync or cancelsuspend will run the defers of the coroutine. When the other reference drops, the memory will be destroyed.

@andrewrk andrewrk added this to the 0.3.0 milestone Aug 10, 2018

@andrewrk andrewrk referenced this issue Aug 10, 2018

Open

re-enable stage2 tests #1364

@shawnl

Contributor

shawnl commented Sep 1, 2018

What you describe has the same race condition, but between cancelsuspend and resume.

The way I see to fix this is to wrap epoll_wait() with a read-write mutex plus a semaphore. The epoll_wait() thread does:

  1. acquire the mutex
  2. wait until the semaphore is 0
  3. epoll_wait()
  4. resume co-routine
  5. release mutex
  6. goto 1

While the cancel thread does:

  1. increment the semaphore. If the mutex is acquired, goto 4; else goto 2.
  2. deliver an event that will wake up the epoll_wait() thread <= prevents starvation
  3. wait for the epoll thread to set the mutex to 0
  4. cancel the co-routine
  5. decrement the semaphore

But this introduces a deadlock when the epoll_wait() thread is trying to resume the thread that is trying to cancel itself. It can't drop this event (because it is using EPOLLET), so after it detects a deadlock it has to let itself be resumed (while holding the lock that prevents epoll_wait() from being called) and cancel the coroutine on the next suspend.

But that introduces a stall in the main loop while the resume thread is running, so a second lock has to be introduced, individual to each promise [1]. The cancelling thread:

  1. detect deadlock; if there is none, destroy the thread and return
  2. lock promise
  3. remove fd from epoll
  4. release epoll lock
  5. suspend
  6. on resume (from epoll thread) unlock promise
  7. destroy thread

Deadlocks are detected with two locks taken in opposite orders.

@shawnl

Contributor

shawnl commented Sep 2, 2018

My proposal can be done entirely in user-space, except that to keep cancel thread-safe it needs a new suspendnotfinal keyword.

Speaking of which, I don't really like cancel. Why can't it just be a method on the promise, e.g. .scheduleCancellation()?

@kristate

Contributor

kristate commented Sep 2, 2018

@shawnl check-out #1307

shawnl added a commit to shawnl/zig that referenced this issue Sep 2, 2018

add "suspendnotfinal"
This is necessary to keep "cancel" thread-safe. See ziglang#1363
@kristate

Contributor

kristate commented Sep 2, 2018

So I'm guessing that one of the better things to do here is to keep the current keyword semantics, except to hold off on defer & delete until a resume is called, instead of doing it after a suspend.

The epoll loop will always call resume on its promises, so if we are in cancel state then we defer and delete and then return control flow back to the loop. If not in a cancel state, we simply resume.

And hey, I made some ASCII art:

        +-[PROMISE / INSIDE COROUTINE]--+                                      
        |'''''''''''''''''''''''''''''''|                                      
        |''''''+------------------------+--+--------------+                    
        |''''''v''''''''''''''''''''''''|  |              |                    
+-----+ |''+-------+''''+-----------+'''|  |              |                    
|async|-+->|suspend|-+->|  return   |'''|  |              |                    
+-----+ |''+-------+'|''+-----------+'''|  |              |                    
        |''''''''''''|''+-----------+'''|  |              |                    
        |''''''''''''+->|cancel(noop|'''|  |              |                    
        |''''''''''''|''+-----------+'''|  |              |                    
        +------------+------------------+  |              |                    
                     |  +---------------+  |              |                    
                     +->|     await     |--+              |                    
                     |  +---------------+   +------+      |                    
                     |  +---------------+   | set  |      |                    
                     +->|    cancel     |-->|cancel|      |NO                  
                     |  +---------------+   | bit  |  .-------.      +--------+
                     |  +---------------+   +------+ /  TO BE  \ YES | defer& |
                     +->|    resume     |---------->( CANCELED? )--->| delete |
                        +---------------+            `.       ,'     +--------+
                                                       `-----'                 
@shawnl

Contributor

shawnl commented Sep 2, 2018

The epoll loop will always call resume on its promises,

No. If no events come in, they never get resumed. The cancel part can be taken out of the promise type and become part of the loop: .scheduleCancellation() (which works according to the locking semantics above). Then, before suspending, a coroutine has to check whether it was scheduled to be cancelled while it was running, and if so it destroys itself.

I opened an LLVM bug on whether coroutines are allowed to destroy themselves (as zig currently does in suspend):

https://bugs.llvm.org/show_bug.cgi?id=38805

@shawnl shawnl referenced this issue Sep 3, 2018

Merged

initial arm64 support #1429

@andrewrk

Member

andrewrk commented Sep 3, 2018

I think the scenario I outlined above is not actually a problem, because a coroutine couldn't be in the epoll set and executing at the same time. But here's a problem scenario:

  1. Thread 1 does epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.
  2. Thread 2 does a cancel on the coroutine. Since it's at a suspend point, it runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. (2) proceeds to destroy the coroutine frame. The memory is gone.
  3. Thread 1 does resume coroutine_handle which now points to invalid memory. Boom.

With the proposed semantics:

  1. Thread 1 does epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.
  2. Thread 2 does a cancelasync on the coroutine. Since it's at a suspend point, it runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. However (2) does not destroy the coroutine frame because the suspend reference is present. The memory is still there.
  3. Thread 1 does resume coroutine_handle which finds out that the thread has been canceled. So instead of resuming, it frees the memory.
@ghost

ghost commented Sep 5, 2018

I think it would be best if Andrew were to implement his proposal so we could test it / fuzz it.
I tried to think through this discussion, and while Andrew's explanation makes sense to me, I've made just enough concurrency bugs myself to know that this does not prove correctness.

Testing the implementation does not give a proof either, of course, but I think it's the best we can currently do until zig maybe attracts some (other) domain expert who thinks it through.

I'd certainly be willing to run some test suite for an extended time; compared to people's time, hardware is cheap 😆

@kristate

Contributor

kristate commented Sep 5, 2018

When I get some time later tonight, I will try to make another ASCII chart to clarify things. It's an area we should be documenting.

@ghost

ghost commented Sep 5, 2018

Rather than an ASCII chart, we'd need a formal state diagram / automaton with a formal proof, but I've seen a professor attempt the same and still fail (a false positive, actually), so... brute force is the best we have, although it's still no good.

@kristate

Contributor

kristate commented Sep 5, 2018

@monouser7dig yes, I agree. An ASCII chart is no replacement for a formal proof. Inch by inch.

@andrewrk andrewrk modified the milestones: 0.3.0, 0.4.0 Sep 11, 2018

@ghost

ghost commented Oct 1, 2018

so after 0.3.0, is this the first major thing on the list?
Like getting async working, because the self-hosted compiler needs it, networking needs it (sort of?), and thus:

  • async (rewrite coroutines)
  • self hosted + comptime allocator/ bugs + tooling
  • docs
  • ??? lots of things

@andrewrk is this roughly accurate or what is to be expected?

@andrewrk

Member

andrewrk commented Oct 1, 2018

Yes, this is accurate. Reworking coroutines is the first major thing on my list. That, plus something like one bug fix + one documentation improvement per day. roadmap

@Matthias247

Matthias247 commented Oct 13, 2018

Maybe related to this: I've written a few lines in Rust's futures repository about why synchronous cancellation might not work out well for all use-cases: rust-lang-nursery/futures-rs#1278

I guess similar things would apply to zig's coroutines as well, and it might in some cases be favorable to force them to run to completion (or at least to a safe exit point).

andrewrk added a commit that referenced this issue Nov 15, 2018
