Standardize work waiting pattern #2854

Merged
merged 1 commit into stellar:master from marta-lokhova:work_improvements on Jan 11, 2021

Conversation

marta-lokhova (Contributor):

Resolves #2519

self->wakeUp();
}
});
exit->async_wait(wakeSelfUpCallback());

Contributor:

This seems to be a separate change, and it shows that the abstractions are not quite right: we're not using any "waiting timer" in this situation, yet we call a function that talks about timers in its name...
We could add the concept of an error code to "wakeUp", but that doesn't seem to fit well from a generic point of view, so I would keep RunCommand untouched.
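
For reference, the quoted `exit->async_wait(wakeSelfUpCallback())` call suggests a helper shaped roughly like the sketch below. The helper body is not part of the quoted hunk, so this is only a guess at its shape; the asio `error_code` signature and the `weak_ptr` guard are assumptions.

```cpp
// Guess at a wakeSelfUpCallback-style helper: a weak_ptr-guarded handler
// that wakes the Work up when the awaited event (here, process exit)
// completes. Signature and member names are assumptions, not the real API.
std::function<void(asio::error_code const&)>
BasicWork::wakeSelfUpCallback()
{
    std::weak_ptr<BasicWork> weak = shared_from_this();
    return [weak](asio::error_code const&) {
        if (auto self = weak.lock())
        {
            self->wakeUp();
        }
    };
}
```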

marta-lokhova (Contributor Author):

Agreed, I think it's easier to reason about this without RunCommandWork.

@@ -27,7 +26,7 @@ BasicWork::State
 ResolveSnapshotWork::onRun()
 {
     ZoneScoped;
-    if (mEc)
+    if (waitingTimerFailed())

Contributor:

I think the code was (and still is) confusing: for this to fail, it would have to be a cancelled timer, which can only happen if we delete the Work object while it's waiting for the timer to trigger.

In fact it's so bad that with 3 classes, we have 3 slightly different behaviors!

  • TxSimGenerateBucketsWork was ignoring this case (no-op on failure), so it would never wake up the work in that situation (now we do?!).
  • ConditionalWork always reschedules, which is also wrong (was and still is probably wrong).

To me it looks like we should instead take control away from implementors over what happens when we want to "yield" (to avoid busy waits); a rough sketch follows the list:

  • add a new return code for onRun, WORK_SLEEPING, that tells BasicWork to set up the timer and callback in a consistent way (the internal state would still be WAITING, but BasicWork::crankWork() would make sure to do the setup to wake things up later and keep track of some internal state to handle edge cases).
  • probably make the sleep duration a property (default to 1 second) so that callers can make it shorter if they want to.
  • harden the behavior so that we can actually reason about what will happen if a Work gets aborted/deleted while waiting like this.
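
A rough sketch of how this could look. This is hypothetical only: `WORK_SLEEPING`, `mWaitingTimer`, `getSleepDuration`, and `setState` are illustrative names, not the existing API.

```cpp
// Hypothetical only: onRun() returns WORK_SLEEPING, and BasicWork arms the
// timer and callback itself so every implementor gets identical behavior.
void
BasicWork::crankWork()
{
    auto next = onRun();
    if (next == State::WORK_SLEEPING)
    {
        // Arm the waiting timer with a per-work duration (default 1 second).
        mWaitingTimer.expires_from_now(getSleepDuration());
        std::weak_ptr<BasicWork> weak = shared_from_this();
        mWaitingTimer.async_wait([weak](asio::error_code const& ec) {
            // No-op if the work is gone or the wait was cancelled
            // (work aborted/deleted while sleeping).
            if (auto self = weak.lock(); self && !ec)
            {
                self->wakeUp();
            }
        });
        setState(State::WORK_WAITING); // externally this is still WAITING
        return;
    }
    setState(next);
}
```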

marta-lokhova (Contributor Author):

Your comment made me go back to the implementors and understand the use cases better. What I realized is that the three works don't need to check on the timer status (e.g. ResolveSnapshotWork), and timer cancellation/expiration should be handled within BasicWork itself. With that in mind, I think the change is simpler than what you propose (it is only a slight variation of the current diff, and no new state is needed). The list of features is as follows:

  • Child works are only allowed to call the timer setup function (which is idempotent) and then transition to the WAITING state (no direct access to the timer).
  • All callbacks no-op on failure. I believe this is the right thing to do, because the timer only gets cancelled if: 1. the work is abruptly deleted (this already violates existing work safeguards, as work must be destructed in a finished state, so I don't see a reason to wake up the work here); 2. the work is abruptly reset (this also violates the allowed transitions, as work is only reset in the finished state); or 3. the work is aborting (I added timer cancellation when shutdown is issued; I don't think it makes sense to fire the callback and wake up the work, because once in the ABORTING state, onRun is never scheduled).

I implemented the changes and am working on adding a test for this behavior; in the meantime, let me know if you don't agree with the proposal.
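
For concreteness, the implementor side then looks roughly like the sketch below; `setupWaitingCallback` and `snapshotReady` are placeholder names for the idempotent setup helper and the work-specific condition, not necessarily the real identifiers.

```cpp
// Sketch of a child work under this pattern: no direct timer access, just
// an idempotent setup call followed by a transition to WAITING.
BasicWork::State
ResolveSnapshotWork::onRun()
{
    ZoneScoped;
    if (!snapshotReady())
    {
        // Idempotent: arms the waiting timer only if it is not already armed;
        // the callback no-ops if the timer was cancelled (delete/reset/abort).
        setupWaitingCallback(std::chrono::seconds(1));
        return State::WORK_WAITING;
    }
    return State::WORK_SUCCESS;
}
```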

Contributor:

Yes, this makes sense.

marta-lokhova (Contributor Author):

Updated, and added a test case as well.

@MonsieurNicolas added this to "In progress" in v15.2.0 via automation on Jan 5, 2021
@marta-lokhova force-pushed the work_improvements branch 2 times, most recently from 3bffe84 to 41b76ab on January 8, 2021

@MonsieurNicolas (Contributor):

r+ dc76982

@latobarita merged commit 3db5591 into stellar:master on Jan 11, 2021
v15.2.0 automation moved this from "In progress" to "Done" on Jan 11, 2021
Development

Successfully merging this pull request may close these issues.

Generalize "Work yields every N ms while waiting" pattern
3 participants