[🐛 BUG]: Worker after SIGTERM not wait for finish activity #276

Closed
1 task done
cv65kr opened this issue Sep 30, 2022 · 4 comments
@cv65kr
Contributor

cv65kr commented Sep 30, 2022

No duplicates 🥲.

  • I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

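Workflow:

use Temporal\Activity\ActivityOptions;
use Temporal\Common\RetryOptions;
use Temporal\Workflow;

#[Workflow\WorkflowInterface]
class SimpleWorkflow
{
    #[Workflow\WorkflowMethod(name: 'Workflow')]
    public function run()
    {
        // 60-second start-to-close timeout and a single attempt (no retries).
        $simple = Workflow::newActivityStub(
            SimpleActivity::class,
            ActivityOptions::new()
                ->withStartToCloseTimeout(60)
                ->withRetryOptions(RetryOptions::new()->withMaximumAttempts(1))
        );

        yield $simple->sendRequest('request');
    }
}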

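Activity:

use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;

#[ActivityInterface(prefix: "SimpleActivity.")]
class SimpleActivity
{
    #[ActivityMethod]
    public function sendRequest(
        string $input
    ): string {
        // Simulates a long-running request that must not be sent twice.
        sleep(40);

        return strtolower($input);
    }
}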

Imagine an activity that sends a POST request which may only be sent once and takes around 30 seconds.
Fifteen seconds into the request, a developer starts a deployment, which sends SIGTERM to stop the container.

After SIGTERM is sent, I see this in the logs:

2022-09-30T11:03:44.501Z	WARN	temporal    	Failed to poll for task.	{"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:37c72875-bdf1-49a8-9730-b686e60194b0", "WorkerType": "WorkflowWorker", "Error": "worker stopping"}

The activity was interrupted and is no longer processed, so as a result Temporal shows:

{
  "message": "activity StartToClose timeout",
  "source": "Server",
  "stackTrace": "",
  "cause": null,
  "timeoutFailureInfo": {
    "timeoutType": "StartToClose",
    "lastHeartbeatDetails": {
      "payloads": [
        {
          "metadata": {
            "encoding": "anNvbi9wbGFpbg=="
          },
          "data": "eyJ2YWx1ZSI6MTh9"
        }
      ]
    }
  }
}
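The `encoding` and `data` fields in this failure are base64 strings. The following is a minimal sketch for reading them, assuming that `encoding` is base64-encoded text and that a `json/plain` payload carries base64-encoded JSON in `data`. The helper name `decodePayload` is hypothetical and not part of any SDK.

```php
<?php
declare(strict_types=1);

/**
 * Decodes a single payload entry as it appears in the failure JSON.
 * Hypothetical helper for illustration only.
 */
function decodePayload(array $payload): array
{
    $encoding = base64_decode($payload['metadata']['encoding'], true);
    $raw = base64_decode($payload['data'], true);

    if ($encoding === false || $raw === false) {
        throw new InvalidArgumentException('Payload is not valid base64.');
    }

    // Only json/plain is parsed; any other encoding is returned as raw bytes.
    $value = $encoding === 'json/plain'
        ? json_decode($raw, true, 512, JSON_THROW_ON_ERROR)
        : $raw;

    return ['encoding' => $encoding, 'value' => $value];
}

// Values copied from lastHeartbeatDetails in the failure.
$decoded = decodePayload([
    'metadata' => ['encoding' => 'anNvbi9wbGFpbg=='],
    'data' => 'eyJ2YWx1ZSI6MTh9',
]);

echo $decoded['encoding'], PHP_EOL;           // json/plain
echo json_encode($decoded['value']), PHP_EOL; // {"value":18}
```

In other words, the last heartbeat detail recorded before the timeout was the JSON object `{"value":18}`.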

From my perspective, the worker should wait for this activity to finish, as the safe option. Only SIGKILL should kill the process immediately.

This functionality is available in the Java SDK: https://www.javadoc.io/static/io.temporal/temporal-sdk/1.0.0/io/temporal/worker/WorkerFactory.html#awaitTermination-long-java.util.concurrent.TimeUnit-

Or maybe I am doing something wrong.

Version

v2.11.3

Relevant log output

No response

@cv65kr cv65kr added B-bug Bug: bug, unhandled exception F-need-verification labels Sep 30, 2022
@rustatian
Collaborator

Hey @cv65kr 👋🏻

> From my perspective, the worker should wait for this activity to finish, as the safe option. Only SIGKILL should kill the process immediately.

The worker does wait for the workflow. To change how long it waits, set the grace_period option: https://github.com/roadrunner-server/roadrunner/blob/master/.rr.yaml#L1921

But the worker can't wait indefinitely. This functionality exists in any SDK. For the Java SDK you mentioned (shutdown):

> This method does not wait for previously received tasks to complete execution.

If you want to wait longer, just pass a larger value, e.g. grace_period: 9999h.
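For reference, a minimal sketch of how that option might look in `.rr.yaml`. It assumes `grace_period` sits under the `endure` section, as in the RoadRunner 2.x reference config. The `60s` value is illustrative only, picked to cover the 40-second activity from the reproduction, so check the linked file for your version before relying on it.

```yaml
# .rr.yaml (fragment)
endure:
  # Assumption: how long RoadRunner waits for plugins to stop after SIGTERM.
  grace_period: 60s
```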

@rustatian
Collaborator

> Only SIGKILL should kill the process immediately.

SIGKILL kills the process, not the workflow. After the restart, the workflow will be restarted. Even if you kill the Temporal server, the history journal will still be present in the DB and will be recovered.

@rustatian
Collaborator

> The activity was interrupted and is no longer processed, so as a result Temporal shows...

->withRetryOptions(RetryOptions::new()->withMaximumAttempts(1))

This is the reason why the activity was started only once.

@cv65kr
Contributor Author

cv65kr commented Sep 30, 2022

Hi @rustatian

I already checked grace_period, and it looks like it can solve the issue. I need to test it again very carefully. Thanks for the fast response 👍
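A rough way to reason about that test, as a sketch only: assuming the worker lets an in-flight activity keep running until the grace period expires, the activity survives a deployment when its remaining run time fits inside the grace period. The function name and the grace values below are hypothetical. The 40-second duration comes from the `sleep(40)` in the reproduction and the 15-second mark from the scenario described above.

```php
<?php
declare(strict_types=1);

/**
 * Returns true when an activity that is still running at SIGTERM time
 * can finish before the grace period runs out.
 */
function finishesWithinGrace(int $activitySeconds, int $sigtermAtSecond, int $gracePeriodSeconds): bool
{
    // Time the activity still needs once SIGTERM arrives (never negative).
    $remaining = max(0, $activitySeconds - $sigtermAtSecond);

    return $remaining <= $gracePeriodSeconds;
}

var_dump(finishesWithinGrace(40, 15, 30)); // bool(true): 25 s left, 30 s grace
var_dump(finishesWithinGrace(40, 15, 10)); // bool(false): 25 s left, 10 s grace
```

The container runtime's own stop timeout would need to be at least as long, otherwise SIGKILL arrives before the grace period ends.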

> Only SIGKILL should kill the process immediately.
>
> SIGKILL kills the process, not the workflow. After the restart, the workflow will be restarted. Even if you kill the Temporal server, the history journal will still be present in the DB and will be recovered.

Yes, that's my point, but of course it depends on various factors, e.g. if the activity was in the middle of processing, it cannot be recovered after a restart.

> The activity was interrupted and is no longer processed, so as a result Temporal shows...
>
> ->withRetryOptions(RetryOptions::new()->withMaximumAttempts(1))
>
> This is the reason why the activity was started only once.

I just gave you code to reproduce the problem. The point is that sometimes an activity cannot be retried, so it needs to be safe during redeployments.

I will close the issue for now, and if something pops up I will reopen it. Once again, big thanks!

@cv65kr cv65kr closed this as completed Sep 30, 2022