Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Parallel orchestration needs improvement #59959

Open
gabrielgt opened this issue Apr 2, 2021 · 5 comments
Open

[BUG] Parallel orchestration needs improvement #59959

gabrielgt opened this issue Apr 2, 2021 · 5 comments
Assignees
Labels
Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around
Milestone

Comments

@gabrielgt
Copy link

Description
Execution of orchestration steps with parallel: True and dependencies among them using require does not work as expected.

Setup
I am trying to execute states B1 and B2 in one minion and C in another one using parallel: true, that is:

A __ B1 __ B2 __ D
  \_ C _________/

Dependencies among steps are indicated by require. The dependencies among steps could be more complicated and the results are still more unpredictable. But this example is simple enough to observe the problem.

Salt files can be generated with the following script: https://gist.github.com/gabrielgt/e6c5337e2bd69dfefc61cee2c06cc12c

Steps to Reproduce the behavior

  • Download script https://gist.github.com/gabrielgt/e6c5337e2bd69dfefc61cee2c06cc12c
  • Modify the script to use three different minion nodes: 'mn001', 'mn002' and 'mn003'
  • Run the script ./make_parallel_test_structure_2.sh
  • Run the orchestration: salt-run state.orch delme.service_orch -l debug 2>&1 | egrep 'Running state|Completed state'

Expected behavior
The expected behavior is B1 and C starting approximately at the same time. But what happens is that C starts about the same time as B2.

The duration of A, B1, B2 and D is 30 seconds, and the duration of C is 60 seconds, so it should last 120 seconds in total, but instead it lasts 150 seconds.

30 __ 30 __ 30 __________30
         \_ 60 _________/

Result:

$time sudo salt-run state.orch delme.service_orch -l debug 2>&1 | egrep 'Running state|Completed state'
[INFO    ] Running state [state.orchestrate] at time 18:32:14.612501
[INFO    ] Completed state [state.orchestrate] at time 18:32:14.620257 (duration_in_ms=7.748)
[INFO    ] Running state [orch.serviceA] at time 18:32:17.827567
[INFO    ] Completed state [orch.serviceA] at time 18:32:49.561936 (duration_in_ms=31734.369)
[INFO    ] Running state [state.orchestrate] at time 18:32:49.584701
[INFO    ] Completed state [state.orchestrate] at time 18:32:49.600972 (duration_in_ms=16.271)
[INFO    ] Running state [orch.serviceB1] at time 18:32:53.885257
[INFO    ] Completed state [orch.serviceB1] at time 18:33:25.768963 (duration_in_ms=31883.705)
[INFO    ] Running state [state.orchestrate] at time 18:33:25.789756
[INFO    ] Completed state [state.orchestrate] at time 18:33:25.804843 (duration_in_ms=15.088)
[INFO    ] Running state [state.orchestrate] at time 18:33:25.807449
[INFO    ] Completed state [state.orchestrate] at time 18:33:25.831102 (duration_in_ms=23.653)
[INFO    ] Running state [orch.serviceC] at time 18:33:30.419105
[INFO    ] Running state [orch.serviceB2] at time 18:33:31.251604
[INFO    ] Completed state [orch.serviceB2] at time 18:34:03.667342 (duration_in_ms=32415.737)
[INFO    ] Completed state [orch.serviceC] at time 18:34:32.920972 (duration_in_ms=62501.866)
[INFO    ] Running state [state.orchestrate] at time 18:34:32.943566
[INFO    ] Completed state [state.orchestrate] at time 18:34:32.956901 (duration_in_ms=13.336)
[INFO    ] Running state [orch.serviceD] at time 18:34:36.844001
[INFO    ] Completed state [orch.serviceD] at time 18:35:08.826227 (duration_in_ms=31982.226)

real    2m59.788s

Versions Report

Salt Version:
           Salt: 3003rc1

Additional context
This issue is a continuation of issue #55121, partially solved by issue #58976.

@gabrielgt gabrielgt added Bug broken, incorrect, or confusing behavior needs-triage labels Apr 2, 2021
@welcome
Copy link

welcome bot commented Apr 2, 2021

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
Please be sure to review our Code of Conduct. Also, check out some of our community resources including:

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
If you have additional questions, email us at core@saltstack.com. We’re glad you’ve joined our community and look forward to doing awesome things with you!

@garethgreenaway
Copy link
Contributor

@gabrielgt Thanks for the report. Still digging into this one but looks like (at the moment) the order of the orchestration states matter. Running the script provided and having them in the order you had I am able to reproduce this issue. Changing the order to looks like this (excluding serviceD) it appears to work as expected:

orch.runner.serviceA:
  salt.runner:
      - name: state.orchestrate
      - parallel: True
      - kwarg:
          mods: {{ sls_path }}/serviceA
orch.runner.serviceB1:
  salt.runner:
      - name: state.orchestrate
      - parallel: True
      - kwarg:
          mods: {{ sls_path }}/serviceB1
      - require:
          - salt: orch.runner.serviceA
orch.runner.serviceC:
  salt.runner:
      - name: state.orchestrate
      - parallel: True
      - kwarg:
          mods: {{ sls_path }}/serviceC
      - require:
          - salt: orch.runner.serviceA
orch.runner.serviceB2:
  salt.runner:
      - name: state.orchestrate
      - parallel: True
      - kwarg:
          mods: {{ sls_path }}/serviceB2
      - require:
          - salt: orch.runner.serviceB1

I suspect what is happens is that serviceA is running, then serviceB1 (which requires serviceA), then the state system is trying to run serviceB2 (which requires serviceB1 that has already run) and runs serviceC at the same time which only requires serviceA, which has already run. There is likely a bug somewhere in the logic that is ordering when the states should run.

@sagetherage sagetherage added severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around and removed needs-triage labels May 27, 2021
@sagetherage sagetherage added this to the Approved milestone May 27, 2021
@Rudd-O
Copy link

Rudd-O commented Dec 16, 2021

I have the same issue, parallel execution of orch SLSes with salt-run does not parallelize. One thing worth noting in my case is that all my states run with ssh=True (as well as with parallel=True) and yet, all the states marked parallel execute serially one by one.

Small sample of tasks that should execute in parallel, but don't:

    Configure test on financial.laptop:
        ----------
        salt:
            |_
              ----------
              parallel:
                  True
            |_
              ----------
              pillar:
                  ----------
                  skip_package_installs:
            |_
              ----------
              sls:
                  test
            |_
              ----------
              ssh:
                  True
            |_
              ----------
              tgt:
                  financial.laptop
            |_
              ----------
              tgt_type:
                  list
            - state
            |_
              ----------
              order:
                  10005
        __sls__:
            orch.test
        __env__:
            base
    Configure test on social.laptop:
        ----------
        salt:
            |_
              ----------
              parallel:
                  True
            |_
              ----------
              pillar:
                  ----------
                  skip_package_installs:
            |_
              ----------
              sls:
                  test
            |_
              ----------
              ssh:
                  True
            |_
              ----------
              tgt:
                  social.laptop
            |_
              ----------
              tgt_type:
                  list
            - state
            |_
              ----------
              order:
                  10003
        __sls__:
            orch.test
        __env__:
            base

@garethgreenaway garethgreenaway self-assigned this Mar 1, 2022
@Oloremo
Copy link
Contributor

Oloremo commented Jun 27, 2022

@garethgreenaway Any updates on this?

@uspen
Copy link

uspen commented Jun 28, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around
Projects
None yet
Development

No branches or pull requests

6 participants