Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] /usr/bin/salt-call state.apply hangs at end on Xubuntu 20.04 salt 3001 #57856

Closed
mgdotson opened this issue Jul 2, 2020 · 37 comments · Fixed by #58364
Closed

[BUG] /usr/bin/salt-call state.apply hangs at end on Xubuntu 20.04 salt 3001 #57856

mgdotson opened this issue Jul 2, 2020 · 37 comments · Fixed by #58364
Assignees
Labels
Bug broken, incorrect, or confusing behavior Confirmed Salt engineer has confirmed bug/feature - often including a MCVE Duplicate Duplicate of another issue or PR - will be closed fixed-pls-verify fix is linked, bug author to confirm fix Magnesium Mg release after Na prior to Al salt-call severity-high 2nd top severity, seen by most users, causes major problems ZMQ
Projects
Milestone

Comments

@mgdotson
Copy link

mgdotson commented Jul 2, 2020

Description
Running /usr/bin/salt-call state.apply on Xubuntu 20.04 minion finishes but never completes.

Summary for local
--------------
Succeeded: 153
Failed:      0
--------------
Total states run:     153
Total run time:    16.030 s

But command never returns back to terminal.

ps -ef |grep salt-call
root     2928868 2914767  0 08:12 pts/28   00:00:00 sudo /usr/bin/salt-call state.apply
root     2928869 2928868  0 08:12 pts/28   00:00:05 /usr/bin/python3 /usr/bin/salt-call state.apply
root     3026694 3026554  0 09:37 pts/27   00:00:00 grep --color=auto salt-call

Notice the command has been running over an hour but only took 16 seconds for the states to be applied.

Setup

Steps to Reproduce the behavior
Xubuntu 20.04 install with ZFS root

more /etc/apt/sources.list.d/saltstack.list
deb [arch=amd64] http://repo.saltstack.com/py3/ubuntu/20.04/amd64/latest focal main

Salt master is Ubuntu 18.04 version 3001 recently upgraded from nigrogen.
All other minions on network are 18.04 and working as expected.

Expected behavior
salt-call state.apply to finish and return to terminal.

Screenshots

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Master:
salt --versions-report
Salt Version:
           Salt: 3001
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: unknown
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: 2.0.3
      gitpython: 2.1.8
         Jinja2: 2.10
        libgit2: 0.26.0
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: 3.4.7
         pygit2: 0.26.2
         Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 17.1.2
          smmap: 2.0.3
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.2.5
 
System Versions:
           dist: ubuntu 18.04 Bionic Beaver
         locale: UTF-8
        machine: x86_64
        release: 4.15.0-101-generic
         system: Linux
        version: Ubuntu 18.04 Bionic Beaver

Minion With issue:
Salt Version:
           Salt: 3001
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.7.3
      docker-py: 4.1.0
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: 1.1.0
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: 3.6.1
         pygit2: Not Installed
         Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
   python-gnupg: 0.4.5
         PyYAML: 5.3.1
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2
 
System Versions:
           dist: ubuntu 20.04 focal
         locale: utf-8
        machine: x86_64
        release: 5.4.0-39-generic
         system: Linux
        version: Ubuntu 20.04 focal


Working Minon:

salt-call --versions-report
Salt Version:
           Salt: 2017.7.4
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 1.0.7
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.2.5
 
System Versions:
           dist: Ubuntu 18.04 bionic
         locale: UTF-8
        machine: x86_64
        release: 4.15.0-101-generic
         system: Linux
        version: Ubuntu 18.04 bionic

Additional context
Add any other context about the problem here.

@mgdotson mgdotson added the Bug broken, incorrect, or confusing behavior label Jul 2, 2020
@max-arnold
Copy link
Contributor

Could be duplicate of #57456

@couilllard45682
Copy link

couilllard45682 commented Jul 3, 2020

I've got the same issue under Ubuntu 20.04 as well. If run from the master the state succeed and exit right after like normal. When executed directly on minion it hangs at the end until CTRL + C is used.

Salt Version:
           Salt: 3001
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.7.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: 3.6.1
         pygit2: Not Installed
         Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
   python-gnupg: 0.4.5
         PyYAML: 5.3.1
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2
 
System Versions:
           dist: ubuntu 20.04 focal
         locale: utf-8
        machine: x86_64
        release: 5.4.0-1018-aws
         system: Linux
        version: Ubuntu 20.04 focal

I don't think its a duplicate of #57456 because this issue is when an SLS render to nothing and the executed state from the master will exit but with and error status code and its not that case here.

@mgdotson
Copy link
Author

mgdotson commented Jul 3, 2020

I'm hesitant on #57356 as well as I keep most of my states without jinja references and filter via grains in the top.sls file.

It also fails if I run just a single state file. For example, my vscode state:

salt-call state.apply vscode

cat vscode/init.sls 
vscode_repo:
  pkgrepo.managed:
    - name: deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main
    - file: /etc/apt/sources.list.d/vscode.list
    - key_url: https://packages.microsoft.com/keys/microsoft.asc
    - refresh_db: true

vscode:
  pkg.installed:
    - name: code
    - require:
        - pkgrepo: vscode_repo

Output is similar to other fails:

local:
----------
          ID: vscode_repo
    Function: pkgrepo.managed
        Name: deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main
      Result: True
     Comment: Package repo 'deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main' already configured
     Started: 13:02:08.425043
    Duration: 107.504 ms
     Changes:   
----------
          ID: vscode
    Function: pkg.installed
        Name: code
      Result: True
     Comment: All specified packages are already installed
     Started: 13:02:09.400992
    Duration: 73.534 ms
     Changes:   

Summary for local
------------
Succeeded: 2
Failed:    0
------------
Total states run:     2
Total run time: 181.038 ms
^C

CTRL-C after waiting a couple of minutes

@gvengel
Copy link
Contributor

gvengel commented Jul 3, 2020

This is either not a duplicate of #57456, or that issue's title is misleading. As far as I can tell, applying a highstate via salt-call which contains pkg.installed will prevent salt-call from shutting down cleanly. Here is a vagrant file with the minimum steps required to reproduce the bug:

Vagrantfile

Vagrant.configure("2") do |config|
  config.vm.box = "bento/ubuntu-20.04"
  config.vm.provision "shell", inline: <<-SHELL

wget -qO - https://repo.saltstack.com/py3/ubuntu/20.04/amd64/latest/SALTSTACK-GPG-KEY.pub | apt-key add -
echo 'deb http://repo.saltstack.com/py3/ubuntu/20.04/amd64/latest focal main' > /etc/apt/sources.list.d/saltstack.list
echo '127.0.1.1 salt' >> /etc/hosts
apt-get update
apt-get install -y salt-master
echo 'autosign_file: /etc/salt/autosign.conf' > /etc/salt/master
echo '*' > /etc/salt/autosign.conf
chmod 600 /etc/salt/autosign.conf
systemctl restart salt-master
apt-get install -y salt-minion
mkdir /srv/salt
cat << EOT > /srv/salt/top.sls
base:
  "*":
    - bug
EOT
     cat << EOT > /srv/salt/bug.sls
this will block salt-call from exiting:
  pkg.installed:
    - name: cowsay
EOT
salt-call state.apply -l debug

  SHELL
end

Expected behavior:

Running vagrant up should install salt, highstate, install the cowsay package, then exit.

Actual behavior:

Package is installed as expected, but salt-call never exits, causing vagrant up to also hang.

@OrangeDog
Copy link
Collaborator

OrangeDog commented Jul 3, 2020

that issue's title is misleading

The title describes the issue in 3000.1 (empty sls / fileserver? hang) that shifted into this one (pkg state? hang) in 3001.

@gvengel
Copy link
Contributor

gvengel commented Jul 3, 2020

Sorry, I just meant that it doesn't appear to be related to an empty formula. Applying a highstate with an empty formula works just fine in my tests.

@cmcmarrow cmcmarrow added this to the Approved milestone Jul 4, 2020
@cmcmarrow cmcmarrow added the Confirmed Salt engineer has confirmed bug/feature - often including a MCVE label Jul 4, 2020
@cmcmarrow
Copy link
Contributor

@mgdotson thanks for the bug report. I also got this on a 20.04 box as well.

@sagetherage sagetherage added severity-critical top severity, seen by most users, serious issues v3001.1 vulnerable version severity-high 2nd top severity, seen by most users, causes major problems salt-call and removed severity-critical top severity, seen by most users, serious issues labels Jul 6, 2020
@cmcmarrow
Copy link
Contributor

cmcmarrow commented Jul 6, 2020

salt-call --local state.apply does not hang!
salt-call state.apply hangs!

@VietTPham
Copy link

I am seeing this on fedora 32 as well.

@DmitryKuzmenko
Copy link
Contributor

Reproduced in 20.04 Docker container. Working on that.

@jpmenil
Copy link

jpmenil commented Jul 10, 2020

Got a lot of "Message timed out" on 3000.3 & 3001.
using --async help.

@DmitryKuzmenko
Copy link
Contributor

@jpmenil I don't think it could be related to this one. Please try to find a similar issue or create a new one with detailed description.

@sagetherage sagetherage moved this from To do to In progress in 3001.1 Bugfix release Jul 16, 2020
@jwolfe-ns
Copy link

jwolfe-ns commented Jul 20, 2020

Confirmed this is related to pkg.installed, and also confirmed if you use --local against the same set of states it doesn't hang. Running 20.04 with current pkgs as of today.

Anyone running anything other than python3-apt 2.0.0?

Edit: Note that highstate/states executed via the salt-minion schedule do not appear affected, this is only salt-call when --local is not used.

@JaseFace
Copy link
Contributor

JaseFace commented Jul 20, 2020

All the procs left are ZMQ showing an epoll_wait on an fd. So likely related to a pipe or local socket used by ZMQ.

---/usr/bin/python(146114)-+-{ZMQbg/IO/0}(146194)
                           |-{ZMQbg/IO/0}(146425)
                           |-{ZMQbg/Reaper}(146193)
                           `-{ZMQbg/Reaper}(146424)

epoll_wait(14,
epoll_wait(35,
epoll_wait(12,
epoll_wait(33,

# ls -l /proc/147046/fd/14 /proc/147279/fd/35 /proc/147045/fd/12 /proc/147278/fd/33
lrwx------ 1 root root 64 Jul 20 22:34 /proc/147046/fd/14 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jul 20 22:34 /proc/147279/fd/35 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jul 20 22:35 /proc/147045/fd/12 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jul 20 22:35 /proc/147278/fd/33 -> 'anon_inode:[eventpoll]'
ii  libzmq5:amd64           4.3.2-2ubuntu1
ii  python3-zmq             18.1.1-3

@DmitryKuzmenko
Copy link
Contributor

@JaseFace yep and I'm working in this direction. Thank you for analyze.

@max-arnold
Copy link
Contributor

Does transport: tcp help? The issue seems to be related to ZMQ: #57456 (comment)

@mgdotson
Copy link
Author

yes, returns instantly with transport set to tcp.

@sagetherage sagetherage removed this from In progress in 3001.1 Bugfix release Jul 27, 2020
@sagetherage sagetherage removed the v3001.1 vulnerable version label Jul 27, 2020
@disaster123
Copy link

this means salt isn't usable for ubuntu 20.04? As you cannot run mixed clients tcp and not tcp you cannot even upgrade your infra

@gvengel
Copy link
Contributor

gvengel commented Aug 11, 2020

@disaster123 salt is still mostly usable and I've personally started upgrading infrastructure. It appears the bug only affects CLI use of salt-call. Both salt on the master and scheduled states on the minion still work, so when salt does hang, you can just Ctrl-C once it's finished to exit.

That being said, the bug is quite annoying, and there are some legitimately broken workflows, such as scripted use of salt-call. It's disappointing that we didn't get a fix for the 3001 version release.

@banholzer
Copy link

Very disappointing...!

@cmcmarrow cmcmarrow added the Duplicate Duplicate of another issue or PR - will be closed label Aug 11, 2020
@cmcmarrow
Copy link
Contributor

#57456 (comment)

@disaster123
Copy link

@sagetherage how to solve this?

With transport: tcp salt-call itself is working BUT now i'm running into the next issue

    [ERROR   ] Exception parsing response
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/salt/transport/tcp.py", line 1214, in _stream_return
        for framed_msg in unpacker:
      File "msgpack/_unpacker.pyx", line 562, in msgpack._cmsgpack.Unpacker.__next__
      File "msgpack/_unpacker.pyx", line 493, in msgpack._cmsgpack.Unpacker._unpack
    ValueError: 1048720 exceeds max_str_len(1048576)

it seems unpossible to run salt on ubuntu currently. tcp isn't working either.

@disaster123
Copy link

I tried to work around this by downgrading zmq + pyzmq but it still does not work see: #57456 (comment)

@disaster123
Copy link

Was anybody able to fix this? Even downgrading zmq did not work for me.

@sagetherage
Copy link
Contributor

@disaster123 both @cmcmarrow and @krionbsd are working on these issues that are not exactly the same but seem related or similar. Not sure they have a fix for all, but I will let them answer on specifics.

@ocfmatt
Copy link

ocfmatt commented Aug 13, 2020

Hi, I am having the same behaviour on CentOS 8.2 where every state hangs at closing MQ. States run perfectly fine in a CentOS 7 based environment with an older version of PyZMQ. I am on Salt 2019.2.5 and gather between this bug report and #57456 the fixes are reduce security or downgrade PyZMQ to 18.0.0 - are these the only options until an official fix is released?

[root@controller ~]# salt-call --versions-report
Salt Version:
           Salt: 2019.2.5

Dependency Versions:
           cffi: 1.11.5
       cherrypy: Not Installed
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: Not Installed
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Apr 16 2020, 01:36:27)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 19.0.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.2
            ZMQ: 4.3.2

System Versions:
           dist: centos 8.2.2004 Core
         locale: UTF-8
        machine: x86_64
        release: 4.18.0-193.14.2.el8_2.x86_64
         system: Linux
        version: CentOS Linux 8.2.2004 Core
local:
  Name: vm-pkgs - Function: pkg.latest - Result: Clean Started: - 13:08:17.543186 Duration: 1730.06 ms
  Name: mgmt-pkgs - Function: pkg.latest - Result: Clean Started: - 13:08:19.273469 Duration: 392.715 ms

Summary for local
------------
Succeeded: 2
Failed:    0
------------
Total states run:     2
Total run time:   2.123 s
[DEBUG   ] Closing AsyncZeroMQReqChannel instance

Regards,
Matt.

@sagetherage
Copy link
Contributor

@ocfmatt we are working on a long term solution, but since we do not maintain this library it is taking some time. It was one suggestion to try to downgrade, but we realize that isn't really a good idea, either, but looking for a real fix to this, still.

@cmcmarrow cmcmarrow mentioned this issue Sep 2, 2020
3 tasks
@cmcmarrow
Copy link
Contributor

Sorry for the lack of communication. The zmq hang should be merged soon. The reason why salt-call hanged was because we left the clean up to garbage collector on exit. The zmq library sometimes closes its objects out of order when left to its own. But if closed by salt this can be worked around. I tried to patch this bug for zmq but I could not figure out a good way to do it. So we just put in a patch for salt.

I would be grateful if you guys check if my fix zmq hang branch stops your environment from hanging. I would hate to see you guys wait for another release because I missed something.

@sagetherage sagetherage linked a pull request Sep 25, 2020 that will close this issue
3 tasks
@jwolfe-ns
Copy link

Hey @cmcmarrow, thanks for this patch. I've tested it on ~70 machines that exhibited the ZMQ hang, and it has resolved the problem. I'll be leaving them with the patch to burn in further, but it's pretty straightforward, I doubt anything will come up.

@ocfmatt
Copy link

ocfmatt commented Sep 28, 2020

Hi @cmcmarrow, will these commits work in 2019 varients or are they just for 300X versions?

@cmcmarrow cmcmarrow added the fixed-pls-verify fix is linked, bug author to confirm fix label Sep 28, 2020
Magnesium automation moved this from In progress to Done Sep 30, 2020
@DaveQB
Copy link
Contributor

DaveQB commented May 12, 2021

#57856 (comment)

I disagree. My start up scripts for new servers hang as a result. This is causing production issues at my company now. I am even finding on the salt master, when calling a highstate on minons, is hanging too.

So this is fixed in 3002?

Thank you.

@jwolfe-ns
Copy link

#57856 (comment)
This is causing production issues at my company now. I am even finding on the salt master, when calling a highstate on minons, is hanging too.

So this is fixed in 3002?

Assuming you're using state.apply from the Salt master, this bug would be unrelated, you have something else going on. This only affects direct salt-call runs.

Yeah 3002 is gtg.

@adambarnett52
Copy link
Contributor

Any update on this issue. I am seeing this issues still in 3001 and 3002

@DaveQB
Copy link
Contributor

DaveQB commented Jun 18, 2021 via email

@adambarnett52
Copy link
Contributor

Thanks, let me try 3003

@morph027
Copy link

3002 works for me too, 3001 does not

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior Confirmed Salt engineer has confirmed bug/feature - often including a MCVE Duplicate Duplicate of another issue or PR - will be closed fixed-pls-verify fix is linked, bug author to confirm fix Magnesium Mg release after Na prior to Al salt-call severity-high 2nd top severity, seen by most users, causes major problems ZMQ
Projects
No open projects
Magnesium
  
Done
Development

Successfully merging a pull request may close this issue.