Retry ECS RunTask on AGENT error which can happen as its normal operation #1723

Merged

Contributor

@builtinnya builtinnya commented Mar 10, 2022

Problem

As described in "Amazon ECS events" in the Amazon Elastic Container Service documentation, the ECS container agent disconnects and reconnects several times per hour as part of its normal operation:

The Amazon ECS container agent disconnects and reconnects several times per hour as a part of its normal operation, so agent connection events should be expected. These events are not an indication that there is an issue with the container agent or your container instance.

As a result, the ECS RunTask API can fail with the reason AGENT when it is called while the agent is reconnecting. Callers of the RunTask API should handle this situation.

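Currently, EcsCommandExecutor, which relies on the ECS RunTask API, does not account for the AGENT error. When RunTask fails this way, the response contains no task matching the task definition, so findTask simply throws a RuntimeException with the message "Submitted task could not be found":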

protected Task submitTask(
        final CommandContext commandContext,
        final CommandRequest commandRequest,
        final EcsClient client,
        final TaskDefinition td)
        throws ConfigException
{
    final EcsClientConfig clientConfig = client.getConfig();
    final RunTaskRequest runTaskRequest = buildRunTaskRequest(commandContext, commandRequest, clientConfig, td); // RuntimeException,ConfigException
    logger.debug("Submit task request:" + dumpTaskRequest(runTaskRequest));
    final RunTaskResult runTaskResult = client.submitTask(runTaskRequest); // RuntimeException, ConfigException
    logger.debug("Submit task response:" + dumpTaskResult(runTaskResult));
    return findTask(td.getTaskDefinitionArn(), runTaskResult); // RuntimeException
}

protected Task findTask(final String taskDefinitionArn, final RunTaskResult result)
{
    for (final Task t : result.getTasks()) {
        if (t.getTaskDefinitionArn().equals(taskDefinitionArn)) {
            return t;
        }
    }
    throw new RuntimeException("Submitted task could not be found"); // TODO the message should be improved more understandably.
}
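
To illustrate why the current message hides the real cause, here is a minimal sketch of surfacing the failure reasons when no task is found. It is not code from this PR. The `Failure` class below is a hypothetical stand-in for the entries of the failure list that a RunTask response carries (in the AWS SDK for Java this is, to my understanding, `RunTaskResult.getFailures()` with `Failure.getReason()`), and `describe` is an illustrative helper name.

```java
import java.util.List;
import java.util.stream.Collectors;

final class RunTaskFailureSummary {
    // Hypothetical stand-in for one failure entry of a RunTask response.
    static final class Failure {
        final String arn;
        final String reason;

        Failure(String arn, String reason) {
            this.arn = arn;
            this.reason = reason;
        }
    }

    // Builds an error message that includes each failure reason (for example AGENT)
    // instead of only reporting that the task could not be found.
    static String describe(List<Failure> failures) {
        if (failures.isEmpty()) {
            return "Submitted task could not be found (no failures reported)";
        }
        String reasons = failures.stream()
                .map(f -> f.reason + " (" + f.arn + ")")
                .collect(Collectors.joining(", "));
        return "Submitted task could not be found; RunTask failures: " + reasons;
    }

    public static void main(String[] args) {
        // The ARN is a made-up placeholder.
        List<Failure> failures = List.of(new Failure("arn:aws:ecs:example-instance", "AGENT"));
        System.out.println(describe(failures));
    }
}
```

With a message like this, an AGENT failure would be visible in the logs rather than being reported only as a missing task.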

What this PR does

This PR handles the AGENT error by simply retrying the RunTask API call after a few seconds.
This approach has proven effective at handling the random agent reconnections in our production environment.
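
The retry idea can be sketched roughly as follows. This is a simplified illustration and not the actual diff: `Result`, `isAgentFailure`, and `runWithRetry` are hypothetical names, the `Supplier` stands in for the real RunTask call, and the retry count and wait time are assumed parameters.

```java
import java.util.List;
import java.util.function.Supplier;

final class AgentRetrySketch {
    // Hypothetical stand-in for a RunTask response: started task ARNs and failure reasons.
    static final class Result {
        final List<String> taskArns;
        final List<String> failureReasons;

        Result(List<String> taskArns, List<String> failureReasons) {
            this.taskArns = taskArns;
            this.failureReasons = failureReasons;
        }
    }

    // True when no task was started and at least one failure has the reason AGENT.
    static boolean isAgentFailure(Result r) {
        return r.taskArns.isEmpty() && r.failureReasons.stream().anyMatch("AGENT"::equals);
    }

    // Calls runTask once, then retries up to maxRetries times while the result is an
    // AGENT failure, sleeping waitMillis before each retry. Returns the last result.
    static Result runWithRetry(Supplier<Result> runTask, int maxRetries, long waitMillis) {
        Result result = runTask.get();
        for (int attempt = 0; attempt < maxRetries && isAgentFailure(result); attempt++) {
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting to retry", e);
            }
            result = runTask.get();
        }
        return result;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated RunTask call: fails with AGENT twice, then succeeds.
        Supplier<Result> flaky = () -> ++calls[0] < 3
                ? new Result(List.of(), List.of("AGENT"))
                : new Result(List.of("arn:aws:ecs:example-task"), List.of());
        Result r = runWithRetry(flaky, 5, 10L);
        System.out.println(r.taskArns + " after " + calls[0] + " calls");
    }
}
```

Only the AGENT reason triggers a retry in this sketch. Other failure reasons are returned to the caller immediately, since they are not expected to resolve on their own.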

On reproduction

Because the issue occurs randomly, it is difficult to establish stable steps to reproduce it. However, it should be easy to see what happens when RunTask fails with the AGENT error and why that case should be handled.

Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com>
@myui myui self-requested a review April 23, 2022 19:09
@szyn szyn added this to the v0.10.5 milestone Feb 9, 2023
Member

@szyn szyn left a comment


Thank you for creating this PR. This sounds reasonable and looks good to me.

@szyn szyn merged commit a270498 into treasure-data:master Feb 9, 2023