
[BUG] Parallel salt.state with ssh fails if same minion is targeted #62612

Open · 808brinks opened this issue Sep 2, 2022 · 3 comments
Labels: Bug (broken, incorrect, or confusing behavior), needs-triage, Salt-SSH
Description
Parallel salt.state with ssh fails if the same minion is targeted. One of the runs succeeds; the other fails without a clear error.

Setup
Salt 3005

parallel-same-minion.sls:

sleep-one:
  salt.state:
    - parallel: True
    - tgt: 'app1'
    - tgt_type: pcre
    - ssh: True
    - sls:
        - sleep

sleep-two:
  salt.state:
    - parallel: True
    - tgt: 'app1'
    - tgt_type: pcre
    - ssh: True
    - sls:
        - sleep

sleep.sls:

sleep:
  cmd.run:
    - name: sleep 2
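
For comparison, a serialized variant (my own sketch, not part of the original report; the filename is hypothetical): dropping parallel: True makes the orchestration run the two states one after another, which should avoid any race between concurrent salt-ssh sessions on the same host.

parallel-same-minion-serial.sls:

sleep-one:
  salt.state:
    # parallel omitted (defaults to False): states run in order
    - tgt: 'app1'
    - tgt_type: pcre
    - ssh: True
    - sls:
        - sleep

sleep-two:
  salt.state:
    - tgt: 'app1'
    - tgt_type: pcre
    - ssh: True
    - sls:
        - sleep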

Steps to Reproduce the behavior
Add the two SLS files above and run: sudo salt-run state.orchestrate parallel-same-minion.

See the following output:

----------
          ID: sleep-one
    Function: salt.state
      Result: False
     Comment: Run failed on minions: app1
     Started: 16:01:55.617138
    Duration: 3500.82 ms
     Changes:   
              app1:
              
              Summary for app1
              -----------
              Succeeded: 0
              Failed:   0
              -----------
              Total states run:    0
              Total run time:  0.000 ms
----------
          ID: sleep-two
    Function: salt.state
      Result: True
     Comment: States ran successfully. Updating app1.
     Started: 16:01:55.621592
    Duration: 5599.432 ms
     Changes:   
              app1:
              ----------
                        ID: sleep
                  Function: cmd.run
                      Name: sleep 2
                    Result: True
                   Comment: Command "sleep 2" run
                   Started: 16:01:58.902050
                  Duration: 2009.44 ms
                   Changes:   
                            ----------
                            pid:
                                1089286
                            retcode:
                                0
                            stderr:
                            stdout:
              
              Summary for app1
              ------------
              Succeeded: 1 (changed=1)
              Failed:    0
              ------------
              Total states run:     1
              Total run time:   2.009 s

Expected behavior
Either both runs should succeed, or one of them should fail with a clear error.
Instead, one of the results just shows Total states run: 0.


Versions Report

salt --versions-report:
Salt Version:
          Salt: 3005
 
Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: 4.0.5
     gitpython: 3.1.14
        Jinja2: 2.11.3
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.0
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: 3.9.7
        pygit2: Not Installed
        Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
  python-gnupg: Not Installed
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
         smmap: 4.0.0
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.10.0-16-amd64
        system: Linux
       version: Debian GNU/Linux 11 bullseye


808brinks added the Bug and needs-triage labels on Sep 2, 2022

welcome bot commented Sep 2, 2022

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
Please be sure to review our Code of Conduct, and check out some of our community resources.

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
If you have additional questions, email us at saltproject@vmware.com. We’re glad you’ve joined our community and look forward to doing awesome things with you!


Rudd-O commented Sep 23, 2022

I can confirm the same issue happens to me. It's super easy to replicate: just run an orchestration SLS with parallel: True in three or more states targeting the same machine. Boom, most of them don't complete.

There seems to be a race condition when deploying the Salt thin directory and tarball. There are no logs on the target machine indicating problems, nor does the salt-run command offer any logs.

(To start with, the thin directory shouldn't need to be regenerated or redeployed on every execution. This suggests to me that there is a race condition in the way Salt generates the tarball. Alternatively or additionally, there must be a file on the target machine that is a shared resource and gets overwritten during execution, so one of the parallel state-application processes "wins" and the others die.)


Rudd-O commented Sep 23, 2022

Good stuff to report. With no thin dir options in my Saltfile, an orchestration run that targets the same machine across different states doesn't work (only one of the runs "wins" and actually executes something). Watching the temporary directory very clearly shows that Salt has attempted to deploy the thin dir multiple times.

With a fixed thin dir, same result.

Even with rand_thin_dir in the Saltfile, same result. At no point does Salt attempt to select a different thin dir for the different parallel salt-ssh runs (which I can see in my process list!) started by salt-run.
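
For reference, a minimal Saltfile sketch of the two configurations described above; the exact path and key placement are my assumption, not taken from the comment (thin_dir can also be set per host in the salt-ssh roster):

salt-ssh:
  # tried one at a time, per the comment above:
  thin_dir: /var/tmp/salt-thin   # a fixed thin dir (assumed path)
  rand_thin_dir: True            # randomize the thin dir per deployment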
