Fix polling (ECS) command executors fail to run tasks on retry #1620

builtinnya · 2021-08-01T06:30:17Z

This PR fixes a bug where the ECS command executor fails to submit ECS tasks on retry.

Steps to reproduce

digdag version: v0.10.1 (and probably some earlier versions)

Assuming that ECS command executor is set up correctly:

Add and run the following workflow:

+ecs-retry-test:
  sh>: |
    echo "Task has been executed!"
    exit 1
  _retry: 1

Check the task logs to confirm that the command executor actually prints "Task has been executed!" only once.

Expected behavior

In the above steps, the command executor should print "Task has been executed!" twice: once for the initial attempt and once for the retry.

Cause

Each operator runs a command (that is, submits an ECS task) only when "commandStatus" is absent from the state params.
However, the ECS command executor's polling mechanism leaves "commandStatus" in the state params across retries, so a retry polls the ECS task that has already exited instead of submitting a new one.
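
The cause described above can be illustrated with a minimal, self-contained sketch. It is not digdag code: the class PersistedStatusSketch, the methods runAttempt and countSubmissions, and the plain Map standing in for the state params are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class PersistedStatusSketch {
    // Hypothetical stand-in for one operator run. It submits a task only when
    // "commandStatus" is absent; otherwise it polls the stored task.
    // Returns true when a new task was submitted.
    static boolean runAttempt(Map<String, String> state) {
        if (!state.containsKey("commandStatus")) {
            // Remember the submitted task so that later runs can poll it.
            state.put("commandStatus", "exited with code 1");
            return true;
        }
        // The status is still present, so this run only polls the old task.
        return false;
    }

    // Runs the given number of attempts against the same state params and
    // counts how many of them actually submitted a task.
    static int countSubmissions(int attempts) {
        Map<String, String> state = new HashMap<>();
        int submissions = 0;
        for (int i = 0; i < attempts; i++) {
            if (runAttempt(state)) {
                submissions++;
            }
        }
        return submissions;
    }

    public static void main(String[] args) {
        // One initial attempt plus one retry, as with "_retry: 1".
        System.out.println(countSubmissions(2)); // prints 1
    }
}
```

Under these assumptions, two attempts produce only one submission, which matches the single "Task has been executed!" line seen in the task logs.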

Approach in this PR

This PR takes a minimal-change approach and simply removes "commandStatus" from the state params on retry.
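
As a rough sketch of the idea (not the actual diff in this PR), clearing the stale status before each retry lets the retry submit a fresh task. The class ClearOnRetrySketch, the hook prepareRetry, and the Map used as state params are hypothetical names for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class ClearOnRetrySketch {
    // Hypothetical stand-in for one operator run: submit when "commandStatus"
    // is absent, otherwise poll. Returns true when a new task was submitted.
    static boolean runAttempt(Map<String, String> state) {
        if (!state.containsKey("commandStatus")) {
            state.put("commandStatus", "exited with code 1");
            return true;
        }
        return false;
    }

    // Hypothetical retry hook: drop the stale status before the next attempt.
    static void prepareRetry(Map<String, String> state) {
        state.remove("commandStatus");
    }

    // Runs the given number of attempts, clearing the status before every
    // retry, and counts how many attempts actually submitted a task.
    static int countSubmissions(int attempts) {
        Map<String, String> state = new HashMap<>();
        int submissions = 0;
        for (int i = 0; i < attempts; i++) {
            if (i > 0) {
                prepareRetry(state);
            }
            if (runAttempt(state)) {
                submissions++;
            }
        }
        return submissions;
    }

    public static void main(String[] args) {
        // One initial attempt plus one retry, as with "_retry: 1".
        System.out.println(countSubmissions(2)); // prints 2
    }
}
```

Executors that never store "commandStatus" see no difference, because removing a key that is not there leaves the state params unchanged.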

Considerations

Maybe we should also use TaskExecutionException for command failures so that the state params are propagated.
Currently, each operator tries to remove "commandStatus" from the state params on failure (e.g. the sh> operator), but this has no effect on the state params seen on retry because the operator throws RuntimeException right after that. Should we remove these confusing lines?
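
The following sketch illustrates, under stated assumptions, why removing the key just before throwing a plain RuntimeException is lost, while an exception that carries state could propagate it. It assumes the operator works on its own copy of the state params. StateCarryingException is a hypothetical class and does not reflect the real signature of TaskExecutionException; failingOperator and stateSeenByRetry are likewise made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class LostStateSketch {
    // Hypothetical exception that hands updated state params back to the caller.
    static class StateCarryingException extends RuntimeException {
        final Map<String, String> state;

        StateCarryingException(String message, Map<String, String> state) {
            super(message);
            this.state = new HashMap<>(state);
        }
    }

    // Hypothetical operator: works on its own copy of the state params,
    // removes the status on failure, then fails.
    static void failingOperator(Map<String, String> persisted, boolean carryState) {
        Map<String, String> local = new HashMap<>(persisted);
        local.remove("commandStatus");
        if (carryState) {
            throw new StateCarryingException("command failed", local);
        }
        // The local change is dropped here because nothing carries it out.
        throw new RuntimeException("command failed");
    }

    // Hypothetical engine step: returns the state params a retry would see.
    static Map<String, String> stateSeenByRetry(boolean carryState) {
        Map<String, String> persisted = new HashMap<>();
        persisted.put("commandStatus", "exited with code 1");
        try {
            failingOperator(persisted, carryState);
        } catch (StateCarryingException e) {
            persisted = e.state; // the exception hands the new state back
        } catch (RuntimeException e) {
            // Nothing to apply: the persisted state stays as it was.
        }
        return persisted;
    }

    public static void main(String[] args) {
        System.out.println(stateSeenByRetry(false).containsKey("commandStatus")); // prints true
        System.out.println(stateSeenByRetry(true).containsKey("commandStatus"));  // prints false
    }
}
```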

Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com>

szyn · 2021-08-17T09:58:48Z

Thank you for creating this PR!
First of all, we haven't been able to reproduce this issue in our environment yet; task retry seems to work appropriately. So let me confirm: is this issue also happening with other executor types such as Simple/DockerCommandExecutor?
I understand this approach is quite simple, but I'm worried that this change to BaseOperator affects all command executors. If this problem is happening only with EcsCommandExecutor, we had better take a different approach to fix it.

builtinnya · 2021-08-17T10:41:58Z

@szyn
This only applies to EcsCommandExecutor. Retry works correctly for the other command executors because they don't rely on "commandStatus" and so can't be affected by the previous execution.

> I'm worried that this change to BaseOperator affects all command executors. If this problem is happening only with EcsCommandExecutor, we had better take a different approach to fix it.

I understand your concern, but we would probably need to change code outside EcsCommandExecutor anyway, which could affect all the other command executors, because EcsCommandExecutor's polling behavior is itself achieved by coordination among BaseOperator, ShOperatorFactory, and so on.

At least the current executors can't be affected by this change, as they can't rely on the presence of "commandStatus".

digdag/digdag-standards/src/main/java/io/digdag/standards/operator/ShOperatorFactory.java

Lines 105 to 113 in 0c6e58a

    
    if (!state.has("commandStatus")) {
        // Run the code since command state doesn't exist
        status = runCommand(params, commandContext);
    }
    else {
        // Check the status of the running command
        final ObjectNode previousStatusJson = state.get("commandStatus", ObjectNode.class);
        status = exec.poll(commandContext, previousStatusJson);
    }

Operators run commands only if "commandStatus" doesn't exist in the state params, and the else branch is never executed with other executors (poll() is currently implemented only in EcsCommandExecutor).
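
To make the branch selection concrete, here is a simplified sketch of why only a polling executor ever reaches the else branch. It assumes that a blocking executor finishes inside its run call and therefore never stores a status. The Executor interface and both implementations are hypothetical and deliberately much smaller than digdag's real command executor API.

```java
import java.util.HashMap;
import java.util.Map;

public class ExecutorBranchSketch {
    // Hypothetical, simplified executor contract; not digdag's actual interface.
    interface Executor {
        // Returns true when the command finished during this call.
        boolean run(Map<String, String> state);
    }

    // Blocking executor: finishes inside run() and never stores a status.
    static class BlockingExecutor implements Executor {
        @Override
        public boolean run(Map<String, String> state) {
            return true;
        }
    }

    // Polling executor: submits, stores a status, and reports "not finished yet".
    static class PollingExecutor implements Executor {
        @Override
        public boolean run(Map<String, String> state) {
            state.put("commandStatus", "submitted");
            return false;
        }
    }

    // Returns which branch of the operator's if/else the next run would take
    // after one call to the given executor.
    static String nextBranch(Executor exec) {
        Map<String, String> state = new HashMap<>();
        exec.run(state);
        return state.containsKey("commandStatus") ? "poll" : "run";
    }

    public static void main(String[] args) {
        System.out.println(nextBranch(new BlockingExecutor())); // prints run
        System.out.println(nextBranch(new PollingExecutor()));  // prints poll
    }
}
```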

To summarize my points:

The current polling behavior of EcsCommandExecutor already relies on the implementation of BaseOperator, ShOperatorFactory, PyOperatorFactory, and so on, so changing some of that code is inevitable.
Command executors other than EcsCommandExecutor aren't affected by this PR's changes.

Please let me know if any of my changes aren't clear to you, or if you have suggestions for better changes.

myui

I could reproduce it with the ECS command executor.
Since commandStatus is not used by other command executors, this PR has no drawbacks. LGTM.

While the sh and py operators try to remove commandStatus on non-zero exits, I guess they operate on a different object, so it has no effect on retry.
https://github.com/treasure-data/digdag/blob/master/digdag-standards/src/main/java/io/digdag/standards/operator/ShOperatorFactory.java#L120

myui · 2021-09-02T08:27:48Z

Sho agreed to merge this PR. Thank you for contributing!

szyn · 2021-09-02T08:30:04Z

Yes, finally, I could reproduce this issue... and this approach looks good to me 👍 Thank you for contributing!


Fix polling (ECS) command executors fail to run tasks on retry

9715520

Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com>

myui requested review from myui and removed request for myui September 2, 2021 02:39

myui approved these changes Sep 2, 2021

szyn added the bug label Sep 2, 2021

szyn added this to the v0.10.3 milestone Sep 2, 2021

myui merged commit 34f1f56 into treasure-data:master Sep 2, 2021

myui pushed a commit that referenced this pull request Sep 2, 2021

Fix polling (ECS) command executors fail to run tasks on retry (#1620)

021c77e

Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com>

myui mentioned this pull request Sep 2, 2021

Fix polling (ECS) command executors fail to run tasks on retry (#1620) #1632

Merged

myui added a commit that referenced this pull request Sep 3, 2021

Fix polling (ECS) command executors fail to run tasks on retry (#1620) (#1632)

93da2fc

Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com> Co-authored-by: Naoto Yokoyama <builtinnya@gmail.com>

szyn pushed a commit that referenced this pull request Nov 25, 2021

Fix polling (ECS) command executors fail to run tasks on retry (#1620)

23569c3

Signed-off-by: Naoto Yokoyama <builtinnya@gmail.com>
