[🐛 BUG]: Worker after SIGTERM not wait for finish activity #276

cv65kr · 2022-09-30T13:26:36Z

No duplicates 🥲.

I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

Workflow:

use Temporal\Activity\ActivityOptions;
use Temporal\Common\RetryOptions;
use Temporal\Workflow;

#[Workflow\WorkflowInterface]
class SimpleWorkflow
{
    #[Workflow\WorkflowMethod(name: 'Workflow')]
    public function run()
    {
        // One attempt only, so an interrupted activity is never retried.
        $simple = Workflow::newActivityStub(
            SimpleActivity::class,
            ActivityOptions::new()
                ->withStartToCloseTimeout(60)
                ->withRetryOptions(RetryOptions::new()->withMaximumAttempts(1))
        );

        yield $simple->sendRequest('request');
    }
}

Activity:

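use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;

#[ActivityInterface(prefix: "SimpleActivity.")]
class SimpleActivity
{
    #[ActivityMethod]
    public function sendRequest(
        string $input
    ): string {
        // Simulate a long-running request (shorter than the 60 s timeout).
        sleep(40);

        return strtolower($input);
    }
}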

Let's imagine a situation where one activity is responsible for sending a POST request (the request can be sent only once) and it takes around 30 seconds.
In the 15th second of that request, a developer makes a deployment, which sends a SIGTERM to stop the container.

After the SIGTERM is sent, I see this in the logs:

2022-09-30T11:03:44.501Z	WARN	temporal    	Failed to poll for task.	{"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:37c72875-bdf1-49a8-9730-b686e60194b0", "WorkerType": "WorkflowWorker", "Error": "worker stopping"}

The activity was interrupted and is no longer processed, so as a result I see this in Temporal:

{
  "message": "activity StartToClose timeout",
  "source": "Server",
  "stackTrace": "",
  "cause": null,
  "timeoutFailureInfo": {
    "timeoutType": "StartToClose",
    "lastHeartbeatDetails": {
      "payloads": [
        {
          "metadata": {
            "encoding": "anNvbi9wbGFpbg=="
          },
          "data": "eyJ2YWx1ZSI6MTh9"
        }
      ]
    }
  }
}
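
As a side note, the `metadata.encoding` and `data` fields in the failure above are base64-encoded. A minimal sketch for reading them with plain PHP (no Temporal SDK calls; the variable names are only illustrative):

```php
<?php
// Values copied from the failure payload above; both are base64-encoded.
$encoding = base64_decode('anNvbi9wbGFpbg==', true);
$data     = base64_decode('eyJ2YWx1ZSI6MTh9', true);

// The encoding says the data is plain JSON, so parse it as JSON.
$details = json_decode($data, true, 512, JSON_THROW_ON_ERROR);

echo $encoding, PHP_EOL; // json/plain
echo $data, PHP_EOL;     // {"value":18}
```

So the last heartbeat recorded before the timeout carried the details `{"value":18}`.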

From my perspective, the worker should wait for this activity to finish, as the safe behavior. Only SIGKILL should kill the process immediately.

This functionality is available in the Java SDK: https://www.javadoc.io/static/io.temporal/temporal-sdk/1.0.0/io/temporal/worker/WorkerFactory.html#awaitTermination-long-java.util.concurrent.TimeUnit-

Or maybe I am doing something wrong.

Version

v2.11.3

Relevant log output

No response

rustatian · 2022-09-30T13:36:34Z

Hey @cv65kr 👋🏻

> From my perspective, the worker should wait for this activity to finish, as the safe behavior. Only SIGKILL should kill the process immediately.

The worker does wait for the workflow. To change how long it waits, set the grace_period option: https://github.com/roadrunner-server/roadrunner/blob/master/.rr.yaml#L1921

However, the worker can't wait indefinitely. The same limitation exists in every SDK. For the Java SDK you mentioned (shutdown):

> This method does not wait for previously received tasks to complete execution.

If you want to wait longer, just set grace_period: 9999h.
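
A minimal sketch of that setting, assuming it sits under the `endure` section as in the RoadRunner 2.x reference `.rr.yaml` (check the linked file for your version). The `5m` value is only an illustration; pick something longer than the slowest activity that must not be interrupted:

```yaml
# .rr.yaml (fragment; placement under `endure` is an assumption
# based on the RoadRunner 2.x reference configuration)
endure:
  # How long RoadRunner waits for plugins and workers to stop after SIGTERM.
  grace_period: 5m
```

Whatever sends the SIGTERM (for example, a container orchestrator) usually has its own stop timeout as well, and that timeout has to be at least as long, otherwise the process is killed before the grace period ends.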

rustatian · 2022-09-30T13:39:10Z

> Only SIGKILL should kill the process immediately.

SIGKILL kills the process, not the workflow. After the restart, the workflow will be restarted. Even if you kill the Temporal server, the history journal will still be present in the DB and will be recovered.

rustatian · 2022-09-30T13:49:33Z

> The activity was interrupted and is no longer processed, so as a result I see this in Temporal...

->withRetryOptions(RetryOptions::new()->withMaximumAttempts(1))

This is the reason why the activity was started only once.

cv65kr · 2022-09-30T19:02:24Z

Hi @rustatian

I already checked grace_period and it looks like it can solve the issue. I need to test it once again very carefully. Thanks for the fast response 👍

> > Only SIGKILL should kill the process immediately.
>
> SIGKILL kills the process, not the workflow. After the restart, the workflow will be restarted. Even if you kill the Temporal server, the history journal will still be present in the DB and will be recovered.

Yes, that's my point, but of course it depends on various factors, e.g. if the activity was in the middle of processing, it cannot be recovered after a restart.

> > The activity was interrupted and is no longer processed, so as a result I see this in Temporal...
>
> ->withRetryOptions(RetryOptions::new()->withMaximumAttempts(1))
>
> This is the reason why the activity was started only once.

I just gave you code to reproduce the issue, and that is exactly the case: sometimes an activity cannot be reattempted, so it needs to be safe during redeployments.

I will close the topic for now, and if something pops up I will reopen it. Once again, big thanks!

cv65kr added the B-bug (Bug: bug, unhandled exception) and F-need-verification labels Sep 30, 2022

cv65kr assigned rustatian Sep 30, 2022

cv65kr closed this as completed Sep 30, 2022
