
feat(workflow): Handle unhandled rejections in workflow code #415

Merged: 6 commits merged into main on Dec 11, 2021

Conversation

bergundy (Member)

This PR makes a best-effort attempt to associate unhandled rejections from workflow code with a specific runId.
It also makes the unhandled rejection behavior consistent between Node 14 and Node 16 and propagates the failure back to the user.
Previously, in Node 16 the process would crash, and in Node 14 we would incorrectly ignore rejections, leading to unexpected workflow behavior (see the correct behavior in the added tests).

@bergundy bergundy self-assigned this Nov 30, 2021
bergundy (Member Author):

The solution here is a bit complex and might not cover enough cases in the wild.
I'm looking into alternatives; this is the best I could come up with for now.

bergundy (Member Author) commented Dec 1, 2021:

V8 has Object::GetCreationContext(), but I don't see it exposed in Node; we might need native code to access this method.

})();

await p1;
await p2;
Member:

Does this not just rethrow here? Before this PR, is this treated any differently than if I just replaced this line with throw new Error('whatever')?

bergundy (Member Author):

The problem is that the second async function throws without anything catching it.
The way V8 deals with this is by providing a hook for unhandled rejections: https://v8.github.io/api/head/classv8_1_1Isolate.html#a702f0ba4e5dee8a98aeb92239d58784e.
This is exposed in Node with process.on('unhandledRejection').
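For reference, a minimal sketch (not code from this PR) of that hook as seen from a plain Node process:

// Node surfaces V8's unhandled-rejection callback as a process event.
// The listener receives both the rejection reason and the Promise that
// rejected; the Promise is what can later be used to try to locate the
// owning workflow context.
process.on('unhandledRejection', (reason: unknown, promise: Promise<unknown>) => {
  console.error('unhandled rejection', reason, promise);
});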

Member:

But since that promise is awaited on here, it throws here, right? If you removed await p2 I think it'd be unhandled, but the promise is explicitly awaited on.

bergundy (Member Author):

It's awaited too late, only after the activity resolves, which causes an unhandled rejection.
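To make the timing concrete, here is a rough sketch of the shape under discussion (my own example, not the test from this PR; someActivity is a hypothetical stand-in for a scheduled activity):

// Hypothetical activity stub; in the real test this would be a Temporal activity call.
declare function someActivity(): Promise<void>;

async function example(): Promise<void> {
  const p1 = (async () => {
    await someActivity(); // settles much later
  })();
  const p2 = (async () => {
    throw new Error('whatever'); // rejects immediately, with no handler attached yet
  })();

  // The function suspends here waiting for the activity. Since nothing is
  // attached to p2 before the microtask queue drains, V8 flags p2 as an
  // unhandled rejection long before `await p2` is reached.
  await p1;
  await p2;
}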

this.requestIdToCompletion = new Map();
for (const completion of completions) {
  completion.reject(
    new UnexpectedError(
Member:

Sorry I am unfamiliar with how the threads work here. Is there no way we can capture an error thrown from an async workflow instead of having it terminate a worker thread? Can the workflow function not be wrapped in a try+await+catch inside the thread? If concerned about top-level code throwing, can the entire require be wrapped in a try+await+catch?

bergundy (Member Author):

We terminate the thread if we can't determine the context that threw the unhandled error, to avoid accidentally completing the activation successfully.
There's no way to catch these errors, unfortunately.

Member:

> We terminate the thread if we can't determine the context that threw the unhandled error to avoid accidentally completing the activation successfully.

I am not following, sorry. In my head, I don't see why we'd ever allow user code to terminate a thread unless it's a very fatal error. Can't we wrap everything a user may do in a recoverable scenario? Maybe have one error path that completes the activation if you can determine the context, and another that logs/swallows or whatever without doing something that could fail the whole worker.

Any workflow that can fail the worker or mess with the worker thread pool or whatever in any SDK, especially for something as trivial as a thrown exception in a promise not awaited on, seems bad. Maybe I'm misunderstanding the use here and overthinking "fail the worker".
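To illustrate the point being debated (my own example, not code from this PR): a rejection from a promise that is never awaited bypasses any try/catch around the call and only surfaces through the unhandledRejection event.

async function workflowLikeCode(): Promise<void> {
  // A promise created but never awaited or .catch()-ed.
  void (async () => {
    throw new Error('boom');
  })();
}

async function caller(): Promise<void> {
  try {
    await workflowLikeCode(); // resolves normally
  } catch (err) {
    // Never reached: the inner rejection does not propagate here; it is only
    // observable via process.on('unhandledRejection').
    console.error('caught', err);
  }
}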

import * as activities from './activities';

async function main() {
  if (['1', 'y', 'yes', 't', 'true'].includes((process.env.DEBUG ?? '').toLowerCase())) {
Member:

IMO just checking for existence is much better than this dance. It's easy enough to clear an env var if you want to set it to false.

Also, just DEBUG is too generic, I think. Something more Temporal-specific might be good.

cretz (Member) commented Dec 10, 2021:

FWIW, don't use TEMPORAL_DEBUG. That env var is used in the Java and Go SDKs to remove the workflow deadlock timer so code can be stepped through slowly (and maybe we will have the same here if/when such a time check comes about, if it's not already there).
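A tiny sketch of the existence-only check being suggested (the variable name here is a placeholder, not an agreed-upon name):

// Treat the variable as a flag: present means debug logging on, absent means off.
const debugWorkflowLogging = process.env.TEMPORAL_TS_WORKFLOW_DEBUG !== undefined;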

const runId = match[1];
const workflow = workflowByRunId.get(runId);
if (workflow !== undefined) {
  console.log('found workflow', runId);
Member:

Leftover console log?

// Apparently nextTick does not trigger it, so we use setTimeout here.
await new Promise((resolve) => setTimeout(resolve, 0));
Member:

Seems odd. Maybe we could set a flag in the handler, and run next tick until we see the flag set (then unset it again)?

bergundy (Member Author):

I think it might not be enough.
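For context, a rough sketch of the flag idea suggested above (my reading of it, with an assumed cap on the number of turns since a rejection may never arrive at all):

let sawUnhandledRejection = false;

process.on('unhandledRejection', () => {
  sawUnhandledRejection = true;
});

// Yield to the event loop for a bounded number of turns so the
// unhandledRejection handler gets a chance to run before the activation
// is completed.
async function drainPendingRejections(maxTurns = 10): Promise<void> {
  for (let turn = 0; turn < maxTurns && !sawUnhandledRejection; turn++) {
    await new Promise((resolve) => setImmediate(resolve));
  }
  sawUnhandledRejection = false; // reset for the next activation
}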

bergundy (Member Author) commented Dec 7, 2021:

At this point my proposed solution seems both incomplete and flaky.
I'm going to bench this for now because I can't find a reliable way to handle unhandled rejections in time.
It looks like even setTimeout doesn't work if many workflows are run concurrently.

For now I think the best thing to do is crash the worker in Node 14 so the behavior is at least consistent with Node 16 and we don't run into issues where workflows are stuck and cannot be replayed.
That would be the case if errors were turned into workflow task failures instead of failing the entire workflow (which is how they are treated in the other SDKs).

const runId = ctor('return __TEMPORAL__.runId')();
if (runId !== undefined) {
  const workflow = workflowByRunId.get(runId);
  if (workflow !== undefined) {
Member:

The else here seems like a weird situation that we might want to log or something.

bergundy (Member Author):

I logged the runId below.
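Putting the thread together, a hedged sketch of the association trick (assuming every workflow context exposes a global __TEMPORAL__ object carrying the runId; failActivation is a placeholder, not an actual SDK method):

const workflowByRunId = new Map<string, { failActivation(reason: unknown): void }>();

process.on('unhandledRejection', (reason, promise) => {
  try {
    // promise.constructor is the Promise constructor of the realm that created
    // the promise; its .constructor is that realm's Function constructor, so a
    // function built from it sees that realm's globals.
    const ctor = promise.constructor.constructor as (code: string) => () => unknown;
    const runId = ctor('return __TEMPORAL__.runId')() as string | undefined;
    const workflow = runId !== undefined ? workflowByRunId.get(runId) : undefined;
    if (workflow !== undefined) {
      // Fail only this run's activation instead of the whole worker.
      workflow.failActivation(reason);
    } else {
      // Could not associate the rejection with a run; treat as fatal.
      console.error('unhandled rejection from unknown workflow context', runId, reason);
    }
  } catch (err) {
    console.error('failed to associate unhandled rejection with a workflow', err);
  }
});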

@bergundy bergundy enabled auto-merge (squash) December 10, 2021 21:51
@bergundy bergundy enabled auto-merge (squash) December 10, 2021 21:53
@bergundy bergundy merged commit 27e9fcd into main Dec 11, 2021
@bergundy bergundy deleted the unhandled-rejections branch December 11, 2021 00:47