
Dispatcher: check for outstanding actions before release #3510

Merged: 8 commits merged into typelevel:series/3.4.x from sam/3506/dispatcher-await on Apr 15, 2023

Conversation

@samspills (Contributor) commented Mar 22, 2023

resolves #3506

c0ed15a adds a test demonstrating that a Dispatcher with await = true can finish without completing a submitted effect (so the new test fails before the fix).

Ultimately this is caused by the dispatcher's finalizer running and releasing even though new actions may have been enqueued since the worker last checked.

I paired with @djspiewak and we came to the solution in 0b6eaa4: we now call step one last time, after the dispatcher is done but before releasing, so those last-minute actions are forked appropriately. With this change the new test passes 🎉
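
For illustration, this is roughly the shape of the scenario (an illustrative sketch, not the exact test code; the names are made up):

```scala
import cats.effect.IO
import cats.effect.std.Dispatcher

// Illustrative sketch only: with await = true, an action submitted just before
// `use` completes should still run before the dispatcher's finalizer lets `use`
// return, so `result` should end up true.
val awaitsOutstandingAction: IO[Boolean] =
  for {
    resultR <- IO.ref(false)
    _ <- Dispatcher.sequential[IO](await = true).use { runner =>
           // enqueue an action and immediately fall off the end of `use`,
           // which triggers the dispatcher's release
           IO(runner.unsafeRunAndForget(resultR.set(true)))
         }
    result <- resultR.get // was false before the fix, when the action got dropped
  } yield result
```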

samspills and others added 2 commits March 22, 2023 08:52
Co-authored-by: Daniel Spiewak <djspiewak@gmail.com>
@samspills samspills self-assigned this Mar 22, 2023
@armanbilge (Member) left a comment


Andrew, Sam, and I tripled up to review this PR 😂 and added some notes below.

We also produced a new test in 4374e98 that verifies that dispatchers "reject new tasks while shutting down". It is currently failing for some permutations, but may be fixable by setting alive to false before doing the final step(...) in the release process.

This issue is relevant to this PR, since we want to be sure that the outstanding actions can't keep growing once we start the release process.

Comment on lines +316 to +317
_ <- dispatcher.use { runner => IO(runner.unsafeRunAndForget(resultR.set(true))) }
result <- resultR.get
Member

Because of the race condition, this test does not fail reliably for the parallel dispatcher. We might want to run it for several iterations or use the ticked runtime.

Contributor Author

I reran the test (without the fix) a bunch of times but couldn't reproduce the flake. I also rewrote it using ticked and didn't see any flakes there either, so hopefully that's enough without running it multiple times?
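
(For context, ticked is the repo-internal helper that drives a test on a deterministic TestContext scheduler; a rough public analogue is the testkit's TestControl, sketched here purely for illustration.)

```scala
import cats.effect.IO
import cats.effect.testkit.TestControl

// Illustrative only: run a timing-sensitive program on a deterministic, mocked
// scheduler so the assertion cannot flake. `awaitsOutstandingAction` is the
// hypothetical program sketched in the PR description above.
val deterministicRun: IO[Boolean] =
  TestControl.executeEmbed(awaitsOutstandingAction)
```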

Comment on lines 306 to 307
// published by release
- F.delay(doneR.lazySet(true)) *> release
+ F.delay(doneR.lazySet(true)) *> step(states(n), F.unit) *> release
Member

I'm not sure if lazySet makes sense anymore, since it's not immediately published by release.

Contributor Author

I agree, I don't think it makes sense anymore. Now that we're doing work before the release, I think we want the doneR value to be immediately available.
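
In AtomicBoolean terms, the distinction being discussed (a generic illustration, not the Dispatcher internals):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// lazySet is an ordered, non-volatile write: it is not guaranteed to be visible
// to other threads right away (the old code relied on the release that followed
// it to publish the write). set is a full volatile write, visible immediately.
val doneR = new AtomicBoolean(false)
doneR.lazySet(true) // cheap, but visibility piggybacks on a later publishing operation
doneR.set(true)     // the worker's doneR.get() observes this at once
```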

Comment on lines 295 to +297
// if we're marked as done, yield immediately to give other fibers a chance to shut us down
// we might loop on this a few times since we're marked as done before the supervisor is canceled
- F.delay(doneR.get()).ifM(F.cede, step)
+ F.delay(doneR.get()).ifM(F.cede, step(state, await))
Member

Tangential, but I wonder if we can complete this done loop by returning a sentinel F[Unit] value that the supervisor can use as an indication that it should no longer restart this fiber.

Contributor Author

Oh I forgot to answer here :) When we talked about this it made sense to me, so I do want to try experimenting with that as a follow-up!
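
For the record, a purely hypothetical sketch of the general shape of that idea (nothing in this PR or in Dispatcher today):

```scala
import cats.effect.IO

// Hypothetical sketch: the worker loop returns a distinguished value once it
// observes the done flag, and whatever restarts the loop stops resubmitting
// when it sees that value.
sealed trait WorkerSignal
case object KeepGoing extends WorkerSignal
case object Halt extends WorkerSignal

def superviseWorker(step: IO[WorkerSignal]): IO[Unit] =
  step.flatMap {
    case KeepGoing => superviseWorker(step) // restart the worker
    case Halt      => IO.unit               // done flag observed: stop restarting
  }
```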

@samspills (Contributor Author)

Heads up: I have a draft PR up (#3521) with the alive test and a potential fix. It includes the commits from this PR, but I kept it separate so the concerns don't get muddled :)

@samspills (Contributor Author) commented Apr 5, 2023

The work on alive turned out to be pretty modular and not terribly hairy, so I've merged it into this PR 🎉

More details are in the writeup for #3521, but the tl;dr:
I added a test that submits a 1.second task to a dispatcher, and then concurrently releases the dispatcher while submitting a second (rogue) task (roughly sketched after the list below). This test always failed for sequential dispatchers, but highlighted a race condition with alive for parallel dispatchers:

  • wait long enough before submitting the rogue task (100 milliseconds): the dispatcher is not alive and the rogue task isn't submitted at all
  • wait not at all: the dispatcher is alive and kicking and the rogue task is submitted and completed
  • wait a goldilocks period of 500-900ish nanos: the rogue task can be submitted but not completed. This case shouldn't be allowed to happen.
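
Roughly the shape of that test (illustrative names, await mode, and timings; not the exact code merged here):

```scala
import scala.concurrent.duration._
import cats.effect.IO
import cats.effect.std.Dispatcher
import cats.syntax.all._

// Illustrative only: release the dispatcher while a long task is in flight and,
// after a tunable delay, try to sneak in a rogue task. The rogue submission
// should either be rejected outright or run to completion; "accepted but never
// run" is the broken middle case described in the list above.
def rogueTaskRace(delay: FiniteDuration): IO[Unit] =
  Dispatcher.parallel[IO].allocated.flatMap {
    case (dispatcher, release) =>
      for {
        _ <- IO(dispatcher.unsafeRunAndForget(IO.sleep(1.second)))
        _ <- (release, IO.sleep(delay) *> IO(dispatcher.unsafeRunAndForget(IO.unit)).attempt).parTupled.void
      } yield ()
  }
```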

The fix is moving the alive resource so that it is the last thing initialized. This means it will be the first to release, setting alive to false even before the workers start releasing. I think this is good: it means the dispatcher stops accepting new tasks before the workers start releasing, so no new tasks can be enqueued while the workers are running their final steps.
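
The ordering argument in miniature (a minimal sketch, not the actual Dispatcher internals):

```scala
import cats.effect.{IO, Resource}

// Resource finalizers run in reverse acquisition order, so acquiring the alive
// flag last means it is flipped back to false before the workers release.
val orderingSketch: Resource[IO, Unit] =
  for {
    _        <- Resource.make(IO.println("workers started"))(_ =>
                  IO.println("workers releasing (final step(...) runs here)"))
    aliveRef <- Resource.eval(IO.ref(false))
    _        <- Resource.make(aliveRef.set(true))(_ =>
                  aliveRef.set(false) *> IO.println("alive = false (first finalizer to run)"))
  } yield ()
```

Running orderingSketch.use(_ => IO.unit) prints the alive = false line before the worker-release line, mirroring the release order described above.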

@armanbilge (Member)

Sam points out that the alive changes appear to have fixed #3501 as well 🎉 We should capture that in a test; I'm happy to do that in a follow-up PR as well.

@djspiewak (Member) left a comment


Super sorry for taking so long to get back to this! Trying to climb out of the hole I fell into…

Thank you for taking this on!

@djspiewak djspiewak merged commit 5516608 into typelevel:series/3.4.x Apr 15, 2023
51 checks passed
@samspills samspills deleted the sam/3506/dispatcher-await branch April 30, 2023 16:29