feat(workflow): Handle unhandled rejections in workflow code #415
Conversation
The solution here is a bit complex and might not cover enough cases in the wild.
v8 has …

    })();

    await p1;
    await p2;
Does this not just rethrow here? Before this PR, is this treated any differently than if I just replaced this line with throw new Error('whatever')?
The problem is that the second async function throws without anything catching it.
The way v8 deals with this is by providing a hook for unhandled rejections: https://v8.github.io/api/head/classv8_1_1Isolate.html#a702f0ba4e5dee8a98aeb92239d58784e.
This is exposed in Node with process.on('unhandledRejection').
But since that promise is awaited on here, it throws here, right? If you removed await p2 I think it'd be unhandled, but the promise is explicitly awaited on.
It's awaited too late, only after the activity resolves, which causes an unhandled rejection.
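This timing problem can be reproduced outside the SDK. A minimal sketch, assuming nothing beyond plain Node: a promise that rejects while we are awaiting something else is reported as unhandled before the late await ever attaches a handler.

```typescript
const events: string[] = [];
process.on('unhandledRejection', () => events.push('unhandledRejection'));

async function main() {
  // p2 rejects immediately, but nothing handles the rejection yet.
  const p2 = Promise.reject(new Error('late'));
  // Stand-in for "await the activity": p2's rejection is reported as
  // unhandled at the end of this turn, before we reach the await below.
  await new Promise<void>((resolve) => setTimeout(resolve, 20));
  await p2.catch(() => events.push('caught')); // too late for the hook
}
main();
```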
    this.requestIdToCompletion = new Map();
    for (const completion of completions) {
      completion.reject(
        new UnexpectedError(
Sorry, I am unfamiliar with how the threads work here. Is there no way we can capture an error thrown from an async workflow instead of having it terminate a worker thread? Can the workflow function not be wrapped in a try+await+catch inside the thread? If concerned about top-level code throwing, can the entire require be wrapped in a try+await+catch?
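The wrapping being proposed would look roughly like this. This is a sketch of the suggestion only, not the SDK's actual code, and, per the reply that follows, it cannot intercept rejections whose originating context is unknown:

```typescript
// Hypothetical wrapper: run a workflow function and turn any awaited
// throw into a failed activation instead of killing the thread. This
// catches synchronous throws and awaited rejections, but NOT rejections
// from promises the workflow never awaits.
async function runWorkflowSafely(
  workflowFn: () => Promise<unknown>
): Promise<'ok' | 'failed'> {
  try {
    await workflowFn();
    return 'ok';
  } catch {
    return 'failed';
  }
}
```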
We terminate the thread if we can't determine the context that threw the unhandled error, to avoid accidentally completing the activation successfully.
There's no way to catch these errors, unfortunately.
We terminate the thread if we can't determine the context that threw the unhandled error, to avoid accidentally completing the activation successfully.
I am not following, sorry. In my head, I don't see why we'd ever allow user code to terminate a thread unless it's a very fatal error. Can't we wrap everything a user may do in a recoverable scenario? Maybe have one error path that completes the activation if you can determine the context, and another that logs/swallows or whatever without doing something that could fail the whole worker.
Any workflow that can fail the worker or mess with the worker thread pool in any SDK, especially for something as trivial as a thrown exception in a promise that isn't awaited, seems bad. Maybe I'm misunderstanding the use here and overthinking "fail the worker".
Force-pushed 135c1ec to 14f386c
packages/test/src/run-a-worker.ts (Outdated)
    import * as activities from './activities';

    async function main() {
      if (['1', 'y', 'yes', 't', 'true'].includes((process.env.DEBUG ?? '').toLowerCase())) {
IMO just checking for existence is much better than this dance. It's easy enough to clear an env var if you want to set it to false.
Also, just DEBUG is too generic I think. Something more Temporal-specific might be good.
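Sketched as a tiny helper, this is the existence check being suggested (the variable name in the usage comment is illustrative only, not the name the SDK settled on):

```typescript
// A flag counts as "on" if the variable exists at all, even when empty;
// turning it off means unsetting it, not assigning a falsy-looking string.
function isFlagSet(name: string): boolean {
  return process.env[name] !== undefined;
}

// Usage (hypothetical flag name): isFlagSet('TEMPORAL_TS_WORKER_DEBUG')
```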
FWIW, don't use TEMPORAL_DEBUG. That env var is used in the Java and Go SDKs to remove the workflow deadlock timer so code can be stepped through slowly (and maybe we will have the same here if/when such a time check comes about, if it's not already there).
    const runId = match[1];
    const workflow = workflowByRunId.get(runId);
    if (workflow !== undefined) {
      console.log('found workflow', runId);
Leftover console log?
    // Apparently nextTick does not trigger it, so we use setTimeout here.
    await new Promise((resolve) => setTimeout(resolve, 0));
Seems odd. Maybe we could set a flag in the handler, and run next tick until we see the flag set (then unset it again) ?
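A rough sketch of that flag-based alternative, with hypothetical names; it yields to the event loop for a bounded number of turns instead of relying on a single setTimeout(0):

```typescript
let sawRejection = false;
process.on('unhandledRejection', () => {
  sawRejection = true;
});

// Yield to the event loop until the unhandledRejection handler has set
// the flag (or we give up), then reset the flag for the next check.
async function waitForRejectionFlag(maxTurns = 10): Promise<boolean> {
  for (let turn = 0; turn < maxTurns && !sawRejection; turn++) {
    await new Promise<void>((resolve) => setImmediate(resolve));
  }
  const seen = sawRejection;
  sawRejection = false; // unset it again, as suggested
  return seen;
}
```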
I think it might not be enough.
At this point my proposed solution seems both incomplete and flaky. For now I think the best thing to do is crash the worker on Node 14 so the behavior is at least consistent with Node 16, and we don't run into issues where workflows are stuck and cannot be replayed.
Force-pushed 14f386c to bd41dd8
Force-pushed bd41dd8 to 7d7671e
    const runId = ctor('return __TEMPORAL__.runId')();
    if (runId !== undefined) {
      const workflow = workflowByRunId.get(runId);
      if (workflow !== undefined) {
The else branch here seems like a weird situation that we might want to log or something.
I logged the runId below
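For context, the association strategy discussed in this thread can be sketched roughly like this (names and shapes are simplified illustrations, not the SDK's actual types):

```typescript
interface WorkflowHandle {
  runId: string;
  failure?: unknown;
}

const workflowByRunId = new Map<string, WorkflowHandle>();

// Route an unhandled rejection to the owning workflow when the runId
// could be determined; otherwise signal that the worker must terminate,
// since completing the activation successfully would be incorrect.
function routeRejection(
  runId: string | undefined,
  err: unknown
): 'routed' | 'terminate' {
  if (runId !== undefined) {
    const workflow = workflowByRunId.get(runId);
    if (workflow !== undefined) {
      workflow.failure = err;
      return 'routed';
    }
    // The "weird situation" from the comment above: a runId with no
    // registered workflow, worth logging.
    console.warn('unhandled rejection for unknown runId', runId);
  }
  return 'terminate';
}
```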
Force-pushed 7d7671e to d678c37
Force-pushed d57b439 to 7d99d81
Force-pushed 7d99d81 to 08e5d14
This PR makes a best effort to associate unhandled rejections from workflow code with a specific runId.
It also makes the unhandled rejection behavior consistent between Node 14 and Node 16 and propagates the failure back to the user.
Previously, on Node 16 the process would crash, and on Node 14 we would incorrectly ignore rejections, leading to unexpected workflow behavior (see the correct behavior in the added tests).