Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix polling (ECS) command executors fail to run tasks on retry #1620

Merged

Conversation

builtinnya
Copy link
Contributor

This PR fixes a bug that ECS command executor fails to submit ECS tasks on retry.

Reproducible steps

digdag version: v0.10.1 (and some earlier versions probably)

Assuming that ECS command executor is set up correctly:

  1. Add and run the following workflow

    +ecs-retry-test:
      sh>: |
        echo "Task has been executed!"
        exit 1
      _retry: 1
    
  2. See task logs to confirm that the command executor actually prints Task has been executed! once.

Expected behavior

In the above steps, the command executor should print Task has been executed! twice.

Cause

Each operator runs a command (= submits an ECS task) only when "commandStatus" doesn't exist on state params.
However, ECS command executor's polling mechanism persists "commandStatus" even on retry, which tries to poll an ECS task that has already exited.

Approach in this PR

This PR takes minimal change approach and simply removes "commandStatus" from state params on retry.

Considerations

  • Maybe we should also use TaskExecutionException for commands' failure to propagate state params.
  • Currently, each operator tries to remove "commandStatus" from state params on failure (e.g. sh> op) but it has no effect on the state params on retry because it just throws RuntimeException right after that. Should we remove these confusing lines?

Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com>
@szyn
Copy link
Member

szyn commented Aug 17, 2021

Thank you for creating this PR!
First of all, we haven't been able to reproduce this issue in our environment yet. It seems that task retry works appropriately. Thus let me confirm, is this issue happening for another executor type such as Simple/DockerCommandExecutor as well?
I understand this approach is quite simple, but I'm worried that this change for BaseOperator affects all command executors. If this problem is happening only with EcsCommandExecutor, we would better take a different approach to fix it.

@builtinnya
Copy link
Contributor Author

builtinnya commented Aug 17, 2021

@szyn
This only applies for EcsCommandExecutor. Retry works correctly for other command executors because it doesn't rely on "commandStatus" and can't be affected by the previous execution anyway.

I'm worried that this change for BaseOperator affects all command executors. If this problem is happening only with EcsCommandExecutor, we would better take a different approach to fix it.

I understand your concern but we probably need to change codes other than EcsCommandExecutor and could affect all other command executors anyway because EcsCommandExecutor's polling behavior itself is achieved by coordination of BaseOperator, ShOperatorFactory, and so on.

At least the current executors can't be affected by this change as it can't rely on "commandStatus" presence.

if (!state.has("commandStatus")) {
// Run the code since command state doesn't exist
status = runCommand(params, commandContext);
}
else {
// Check the status of the running command
final ObjectNode previousStatusJson = state.get("commandStatus", ObjectNode.class);
status = exec.poll(commandContext, previousStatusJson);
}

Operators run commands only if "commandStatus" doesn't exist in state params and the else branch is never executed on other executors (poll() is currently implemented only on EcsCommandExecutor.)

To summarize my points:

  • The current EcsCommandExecutor's polling behavior is already relying on the implementation of BaseOperator, ShOperatorFactory, PyOperatorFactory, and so on. So changing some of these codes would be inevitable.
  • Command executors other than EcsCommandExecutor aren't affected by this PR's changes.

Please let me know if some of my changes aren't clear to you and any suggestion for better changes.

@myui myui requested review from myui and removed request for myui September 2, 2021 02:39
Copy link
Member

@myui myui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could reproduce it with ECS command executor.
Since commandStatus is not used by other command executors, no drawbacks in this PR. LGTM.

While sh and py operator tried to remove commandStatus for non-zero exits, I guess it's different objects and no effect for retrying.
https://github.com/treasure-data/digdag/blob/master/digdag-standards/src/main/java/io/digdag/standards/operator/ShOperatorFactory.java#L120

@szyn szyn added the bug label Sep 2, 2021
@szyn szyn added this to the v0.10.3 milestone Sep 2, 2021
@myui myui merged commit 34f1f56 into treasure-data:master Sep 2, 2021
@myui
Copy link
Member

myui commented Sep 2, 2021

Sho agreed to merge this PR. Thank you for contributing!

@szyn
Copy link
Member

szyn commented Sep 2, 2021

Yes, finally, I could reproduce this issue... and this approach looks good to me 👍 Thank you for contributing!

myui pushed a commit that referenced this pull request Sep 2, 2021
Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com>
myui added a commit that referenced this pull request Sep 3, 2021
#1632)

Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com>

Co-authored-by: Naoto Yokoyama <builtinnya@gmail.com>
szyn pushed a commit that referenced this pull request Nov 25, 2021
Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants