Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Empty formula makes salt-call to hang forever. #57456

Closed
b-a-t opened this issue May 25, 2020 · 48 comments
Closed

[BUG] Empty formula makes salt-call to hang forever. #57456

b-a-t opened this issue May 25, 2020 · 48 comments
Assignees
Labels
Magnesium Mg release after Na prior to Al Regression The issue is a bug that breaks functionality known to work in previous releases. salt-call severity-high 2nd top severity, seen by most users, causes major problems State-Compiler Upstream-Bug is a result of an upstream issue, not in salt ZMQ
Projects
Milestone

Comments

@b-a-t
Copy link

b-a-t commented May 25, 2020

Description

Empty formula makes salt-call to hang forever.

Setup

We have a simple formula, that allows to add some custom states during the highstate, based on the minion's FQDN:
custom_minion_state/init.sls:

{%- set include_dir = grains['fqdn'] | replace('.', '_') %}
{#- test: slspath + '/' + include_dir #}
{%- if salt['cp.cache_dir']("salt://{0}/{1}".format(slspath, include_dir)) %}
include:
  - .{{ include_dir }}
{%- endif %}

In 99% of the cases that renders to the empty SLS file, as there is no customization necessary.

Steps to Reproduce the behavior

Applying that formula to the host, which doesn't require any customization produces:

# salt-call state.apply custom_minion_state
local:

Summary for local
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms

For Salt 2018.3.3 salt-call immediately returns to the shell. But for Salt 3000.3 it hangs after this output forever(?). Well, long enough to lose the temper and kill it with Ctrl-C.

Expected behavior
Application of the empty state should exit immediately after the state has been evaluated and applied.

Versions Report

salt --versions-report Salt Version: Salt: 3000.3

Dependency Versions:
cffi: 1.11.5
cherrypy: Not Installed
dateutil: 2.6.1
docker-py: 4.2.0
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.10.1
libgit2: Not Installed
M2Crypto: 0.35.2
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.6.2
mysql-python: Not Installed
pycparser: 2.14
pycrypto: Not Installed
pycryptodome: 3.9.7
pygit2: Not Installed
Python: 3.6.8 (default, Nov 21 2019, 19:31:34)
python-gnupg: Not Installed
PyYAML: 3.12
PyZMQ: 19.0.0
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.2

System Versions:
dist: centos 8.1.1911 Core
locale: UTF-8
machine: x86_64
release: 4.18.0-147.8.1.el8_1.x86_64
system: Linux
version: CentOS Linux 8.1.1911 Core

Additional context
Running the same command with -l debug gives a bit more information:

...
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001::13b7]:4506
[DEBUG   ] Gathering pillar data for state run
[DEBUG   ] Finished gathering pillar data for state run
[INFO    ] Loading fresh modules for state activity
[DEBUG   ] LazyLoaded jinja.render
[DEBUG   ] LazyLoaded yaml.render
[DEBUG   ] Could not find file 'salt://custom_minion_state.sls' in saltenv 'base'
[DEBUG   ] In saltenv 'base', looking at rel_path 'custom_minion_state/init.sls' to resolve 'salt://custom_minion_state/init.sls'
[DEBUG   ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/custom_minion_state/init.sls' to resolve 'salt://custom_minion_state/init.sls'
[DEBUG   ] compile template: /var/cache/salt/minion/files/base/custom_minion_state/init.sls
[DEBUG   ] Jinja search path: ['/var/cache/salt/minion/files/base']
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001::13b7]:4506
[DEBUG   ] LazyLoaded cp.cache_dir
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001:67c:2e8:11::c10:13b7]:4506
[INFO    ] Caching directory 'custom_minion_state/prometheus/' for environment 'base'
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/custom_minion_state/init.sls' using 'jinja' renderer: 0.04874920845031738
[DEBUG   ] Rendered data from file: /var/cache/salt/minion/files/base/custom_minion_state/init.sls:
# vim: sts=2 ts=2 sw=2 et ai ft=sls
#
# This state is loaded from top.sls
#

[DEBUG   ] Results of YAML rendering:
{}
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/custom_minion_state/init.sls' using 'yaml' renderer: 0.00016641616821289062
[DEBUG   ] File /var/cache/salt/minion/accumulator/140045102329928 does not exist, no need to cleanup
[DEBUG   ] LazyLoaded state.check_result
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001::13b7]:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] LazyLoaded highstate.output
local:

Summary for local
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms
[DEBUG   ] Closing AsyncZeroMQReqChannel instance

Running through strace shows:

getpid()                                = 100123
getpid()                                = 100123
write(1, "\33[0;32mlocal:\33[0;0m\n\33[0;36m\nSumm"..., 201) = 201
write(1, "\n", 1)                       = 1
getpid()                                = 100123
epoll_ctl(20, EPOLL_CTL_DEL, 28, 0x7ffda432bbcc) = 0
getpid()                                = 100123
getpid()                                = 100123
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
getpid()                                = 100123
getpid()                                = 100123
getpid()                                = 100123
poll([{fd=23, events=POLLIN}], 1, -1)   = 1 ([{fd=23, revents=POLLIN}])
getpid()                                = 100123
read(23, "\1\0\0\0\0\0\0\0", 8)         = 8
getpid()                                = 100123
write(26, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f89908ed9d0, FUTEX_WAIT, 100201, NULL) = 0
close(27)                               = 0
close(26)                               = 0
close(25)                               = 0
close(24)                               = 0
close(23)                               = 0
epoll_ctl(20, EPOLL_CTL_DEL, 21, 0x7ffda432c0dc) = 0
close(21)                               = 0
close(22)                               = 0
close(20)                               = 0
close(5)                                = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f89a266fdc0}, {sa_handler=0x7f89a296d610, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f89a266fdc0}, 8) = 0
munmap(0x7f8992c15000, 262144)          = 0
munmap(0x7f8992c55000, 262144)          = 0
close(8)                                = 0
close(9)                                = 0
close(31)                               = 0
close(32)                               = 0
close(7)                                = 0
munmap(0x7f8993522000, 262144)          = 0
munmap(0x7f89934e2000, 262144)          = 0
close(30)                               = 0
getpid()                                = 100123
getpid()                                = 100123
getpid()                                = 100123
write(38, "\1\0\0\0\0\0\0\0", 8)        = 8
getpid()                                = 100123
poll([{fd=33, events=POLLIN}], 1, -1)   = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---

Applying the same state from the master completes immediately, but with the note that state completed with the ERROR.

master# salt 'prometheus*' state.apply custom_minion_state -l debug
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[::]:4506
[DEBUG   ] Trying to connect to: tcp://[::]:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] Reading minion list from /var/cache/salt/master/jobs/02/a26ac9e846136c8148eb8e74c4e1507421289c141e915aba2f429e146a8134/.minions.p
[DEBUG   ] get_iter_returns for jid 20200525163928599898 sent to {'prometheus'} will timeout at 18:39:33.605811
[DEBUG   ] jid 20200525163928599898 return from prometheus
[DEBUG   ] return event: {'prometheus': {'ret': {}, 'out': 'highstate', 'retcode': 2, 'jid': '20200525163928599898'}}
[DEBUG   ] LazyLoaded highstate.output
prometheus:

Summary for prometheus
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms
[DEBUG   ] jid 20200525163928599898 found all minions {'prometheus'}
[DEBUG   ] Closing IPCMessageSubscriber instance
ERROR: Minions returned with non-zero exit code
@b-a-t b-a-t added the Bug broken, incorrect, or confusing behavior label May 25, 2020
@ometol
Copy link

ometol commented May 26, 2020

I ran into this as well. Indeed if you add any bit of random data to the sls outside of the if statement, so it will never be empty, will make it work.

@Akm0d Akm0d added P3 Priority 3 Regression The issue is a bug that breaks functionality known to work in previous releases. State-Compiler and removed Bug broken, incorrect, or confusing behavior labels May 27, 2020
@Akm0d Akm0d added this to the Approved milestone May 27, 2020
@sagetherage sagetherage added the Magnesium Mg release after Na prior to Al label May 27, 2020
@sagetherage sagetherage added this to Planning in Magnesium May 27, 2020
@sagetherage sagetherage added severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around severity-high 2nd top severity, seen by most users, causes major problems P3 Priority 3 and removed P3 Priority 3 severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around labels May 29, 2020
@sagetherage
Copy link
Contributor

apologies! I can't seem to get the labels correct, today, my mistake in removing P3, merely wanted to add the severity with the definition that includes "hanging" which ATM is Critical

@edgan
Copy link
Contributor

edgan commented Jun 2, 2020

I can seeing the same thing on Ubuntu 20.04. This also smells like #56332 .

It seems anything that touches salt:// hangs.

Hangs, where testfile is an empty file:

 test_file:
   file.managed:
     - name: /tmp/testfile
     - source: salt://test/files/testfile

Doesn't hang:

volume_dir:
  file.directory:
    - name: /opt
    - user: root
    - group: root
    - mode: 0755

Doesn't hang:

 test_file:
   file.managed:
     - name: /tmp/testfile
     - contents: ''

Doesn't hang:

{%- set include_dir = grains['fqdn'] | replace('.', '_') %}

Hangs:

{%- set include_dir = grains['fqdn'] | replace('.', '_') %}
{#- test: slspath + '/' + include_dir #}
{%- if salt['cp.cache_dir']("salt://{0}/{1}".format(slspath, include_dir)) %}
include:
  - .{{ include_dir }}
{%- endif %}

Hangs:

{%- if salt['cp.cache_dir']("salt://") %}
{%- endif %}

Doesn't hang:

{%- if '' == '' %}
{%- endif %}

Hangs and salt://testtest doesn't exist:

{%- if salt['cp.cache_dir']("salt://testtest") %}
{%- endif %}

Hangs and salt://saltstack does exist:

{%- if salt['cp.cache_dir']("salt://saltstack") %}
{%- endif %}
Salt Version:
           Salt: 3000.3
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.7.3
      docker-py: 4.1.0
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: 1.4.4
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: 3.6.1
         pygit2: Not Installed
         Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
   python-gnupg: 0.4.5
         PyYAML: 5.3.1
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2
 
System Versions:
           dist: ubuntu 20.04 focal
         locale: utf-8
        machine: x86_64
        release: 5.4.0-1009-aws
         system: Linux
        version: Ubuntu 20.04 focal
 

@edgan
Copy link
Contributor

edgan commented Jun 2, 2020

I thought this might relate to gitfs, but a simple git clone with file_roots does the same thing. I also tried stripping it down. No pillars, no custom grains, a super basic top.sls, and an example state from above. Even the most basic of setups hangs with 3000.3, py3, and salt-call.

@sagetherage sagetherage removed the P3 Priority 3 label Jun 3, 2020
@OrangeDog
Copy link
Collaborator

Ubuntu 18.04, salt 3000.3 - no hang.
Ubuntu 20.04 salt 3001rc1 - hangs.

It's only a state.apply that does this, other functions I've used have been fine.

@b-a-t
Copy link
Author

b-a-t commented Jun 5, 2020

Ubuntu 18.04, salt 3000.3 - no hang.
Ubuntu 20.04 salt 3001rc1 - hangs.

Can you provide, at minimum, verbose versions output for both platforms? I feel that this is PY2 vs PY3 difference that makes this difference.

@b-a-t
Copy link
Author

b-a-t commented Jun 5, 2020

I thought this might relate to gitfs, but a simple git clone with file_roots does the same thing. I also tried stripping it down. No pillars, no custom grains, a super basic top.sls, and an example state from above. Even the most basic of setups hangs with 3000.3, py3, and salt-call.

I feel that it's somewhere on the low level, in the way how event loop is handled.

It looks like that the final poll() call doesn't have any timeout setting and waits only for any activity on the fds. As there is nothing to send back due empty state, it hangs forever.

Indeed:
int poll(struct pollfd *fds, nfds_t nfds, int timeout);

poll([{fd=33, events=POLLIN}], 1, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)

The timeout is set to -1, which is infinity, which we observe in this case.

Next question where is that poll() happens in the Python code...

@OrangeDog
Copy link
Collaborator

PY2 vs PY3 difference

Nope

Works fine
Salt Version:
           Salt: 3000.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10
        libgit2: 0.26.0
       M2Crypto: 0.32.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: 0.26.2
         Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 16.0.2
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.2.5

System Versions:
           dist: Ubuntu 18.04 bionic
         locale: ISO-8859-1
        machine: x86_64
        release: 4.15.0-101-generic
         system: Linux
        version: Ubuntu 18.04 bionic
Hangs
Salt Version:
           Salt: 3001rc1

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.7.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10.1
        libgit2: 0.28.3
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: 3.6.1
         pygit2: 1.0.3
         Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
   python-gnupg: 0.4.5
         PyYAML: 5.3.1
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: ubuntu 20.04 focal
         locale: iso8859-1
        machine: x86_64
        release: 5.4.0-33-generic
         system: Linux
        version: Ubuntu 20.04 focal

@OrangeDog
Copy link
Collaborator

In case it's relevant, with debug logging the end is

Total states run:     140
Total run time:   107.656 s
[DEBUG] Closing AsyncZeroMQReqChannel instance

@piterpunk
Copy link
Collaborator

Is this still a thing?

I tried to reproduce with code from 2020-06-09 git and looks OK:

root@marvin:/srv/salt/empty_formula# cat init.sls 
{%- set include_dir = grains['fqdn'] | replace('.', '_') %}
{#- test: slspath + '/' + include_dir #}
{%- if salt['cp.cache_dir']("salt://{0}/{1}".format(slspath, include_dir)) %}
include:
  - .{{ include_dir }}
{%- endif %}
root@marvin:/srv/salt/empty_formula# salt-call state.apply empty_formula
local:

Summary for local                                                                                                                   
-----------                                                                                                                         
Succeeded: 0
Failed:   0
-----------
Total states run:    0                                                                                                              
Total run time:  0.000 ms

The versions being used are:

salt --versions-report
Salt Version:
           Salt: 3001
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.2
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: 1.1.3
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: 3.9.7
         pygit2: Not Installed
         Python: 3.8.3 (default, May 15 2020, 05:51:00)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2
 
System Versions:
           dist: slackware 14.2 current
         locale: utf-8
        machine: i686
        release: 5.4.45
         system: Linux
        version: Slackware 14.2 current

@max-arnold
Copy link
Contributor

Did a couple of tests with different python versions (using pyenv and pip-installed Salt 3001) on Ubuntu 20.04:

  • 3.6.11 - ok
  • 3.7.8 - ok
  • 3.8.0, 3.8.2 (default on 20.04), 3.8.4 - hangs!
  • 3.9.0b4 - fails with AttributeError: _scope_id inside of ipaddress.py

When I place the ipdb breakpoint inside of salt-call:

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    import ipdb; ipdb.set_trace()
    sys.exit(
        load_entry_point('salt==3001', 'console_scripts', 'salt-call')()
    )

and step over the load_entry_point with n a couple of times, it hangs here:

/usr/lib/python3.8/threading.py:_shutdown()

    while True:
        with _shutdown_locks_lock:
            locks = list(_shutdown_locks)   # <---
            _shutdown_locks.clear()

but it is a debugger-induced problem and it hangs somewhere else without it. The interesting thing is that print(len(_shutdown_locks)) keeps growing (~1000) while program is stopped in the debugger session.

The hanging issue could be related to https://bugs.python.org/issue6721 (a pure guess).

@defanator
Copy link
Contributor

Just found this one while digging with the similar issue (Ubuntu 20.04, salt 3001). Also checked #53319 and #57954, but this one seems to be the most relevant.

My state is pretty short:

unbound package:
  pkg.latest:
    - name: unbound

unbound service:
  service.running:
    - name: unbound
    - enable: True
    - full_restart: True
    - reload_grains: True
  require:
    - pkg: unbound package

Minion output:

[INFO    ] Loading fresh modules for state activity
[INFO    ] Running state [unbound] at time 16:25:02.058393
[INFO    ] Executing state pkg.latest for [unbound]
[INFO    ] Executing command ['dpkg-query', '--showformat', '${Status} ${Package} ${Version} ${Architecture}', '-W'] in directory '/root'
[INFO    ] Executing command ['apt-get', '-q', 'update'] in directory '/root'
[INFO    ] Executing command ['apt-cache', '-q', 'policy', 'unbound'] in directory '/root'
[INFO    ] Package unbound is already up-to-date
[INFO    ] Completed state [unbound] at time 16:25:04.260411 (duration_in_ms=2202.018)
[INFO    ] Running state [unbound] at time 16:25:04.262321
[INFO    ] Executing state service.running for [unbound]
[INFO    ] Executing command ['systemctl', 'status', 'unbound.service', '-n', '0'] in directory '/root'
[INFO    ] Executing command ['systemctl', 'is-active', 'unbound.service'] in directory '/root'
[INFO    ] Executing command ['systemctl', 'is-enabled', 'unbound.service'] in directory '/root'
[INFO    ] The service unbound is already running
[INFO    ] Completed state [unbound] at time 16:25:04.510928 (duration_in_ms=248.606)
local:

Summary for local
------------
Succeeded: 2
Failed:    0
------------
Total states run:     2
Total run time:   2.451 s
^C

Last line in debug:
2020-07-22 16:25:04,526 [salt.transport.zeromq:285 ][DEBUG   ][168205] Closing AsyncZeroMQReqChannel instance

This is 100% reproducible in my current setup.

@sagetherage
Copy link
Contributor

Thank you @defanator, this is a critical issue and we attempted to fix it the 3001.1 point release, but other competing priorities and some core team members not being able to reproduce it, caused us to delay.

The issue and the fix is moved to be committed in the Magnesium release. The engineer assigned has reproduced it and will provide a fix. We tried here and didn't succeed.

@defanator
Copy link
Contributor

@sagetherage thanks!

@cmcmarrow please let me know if there's anything I could provide in order to help address this one as soon as possible. I'd also be happy to run a few tests once a fix is available. TIA!

@max-arnold
Copy link
Contributor

It hangs at the garbage collection stage:

(gdb) bt
#0  0x00007ffff7ee596f in __GI___poll (fds=0x7fffffffde60, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007ffff661756b in ?? () from /lib/x86_64-linux-gnu/libzmq.so.5
#2  0x00007ffff65f22be in ?? () from /lib/x86_64-linux-gnu/libzmq.so.5
#3  0x00007ffff65e015c in ?? () from /lib/x86_64-linux-gnu/libzmq.so.5
#4  0x00007ffff66387ce in zmq_ctx_term () from /lib/x86_64-linux-gnu/libzmq.so.5
#5  0x00007ffff600b36b in ?? () from /usr/lib/python3/dist-packages/zmq/backend/cython/context.cpython-38-x86_64-linux-gnu.so
#6  0x00000000005a59b9 in subtype_dealloc (self=<optimized out>) at ../Objects/typeobject.c:1289
#7  0x00000000005cb374 in _Py_Dealloc (op=<optimized out>) at ../Objects/object.c:2215
#8  _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at ../Include/object.h:478
#9  _Py_XDECREF (op=<optimized out>) at ../Include/object.h:541
#10 dict_dealloc (mp=0x7ffff4acf640) at ../Objects/dictobject.c:1990
#11 0x00000000005a1bae in _Py_Dealloc (op=<optimized out>) at ../Objects/object.c:2215
#12 _Py_DECREF (filename=0x87a8e0 "../Objects/typeobject.c", lineno=1140, op=<optimized out>) at ../Include/object.h:478
#13 subtype_clear (self=<AsyncReqMessageClient at remote 0x7ffff4ab6b20>) at ../Objects/typeobject.c:1140
#14 0x00000000004ec8b9 in delete_garbage (state=<optimized out>, state=<optimized out>, old=0x934840 <_PyRuntime+416>, collectable=0x7fffffffe1c0) at ../Modules/gcmodule.c:948
#15 collect.constprop.0 (generation=generation@entry=2, n_collected=n_collected@entry=0x0, n_uncollectable=n_uncollectable@entry=0x0, nofail=nofail@entry=1,
    state=0x9347f8 <_PyRuntime+344>) at ../Modules/gcmodule.c:1123
#16 0x000000000066babb in _PyGC_CollectNoFail () at ../Modules/gcmodule.c:1866
#17 0x000000000067dd60 in PyImport_Cleanup () at ../Python/import.c:598
#18 0x0000000000678450 in Py_FinalizeEx () at ../Python/pylifecycle.c:1240
#19 0x0000000000677138 in Py_Exit (sts=0) at ../Python/pylifecycle.c:2295
#20 0x000000000067716b in handle_system_exit () at ../Python/pythonrun.c:658
#21 0x0000000000677392 in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x95a270) at ../Python/pythonrun.c:763
#22 PyErr_PrintEx (set_sys_last_vars=1) at ../Python/pythonrun.c:763
#23 0x00000000006775b4 in PyErr_Print () at ../Python/pythonrun.c:769
#24 PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffffe3e8) at ../Python/pythonrun.c:434
#25 0x00000000006ae99e in pymain_run_file (cf=0x7fffffffe3e8, config=0x959620) at ../Modules/main.c:381
#26 pymain_run_python (exitcode=0x7fffffffe3e0) at ../Modules/main.c:565
#27 Py_RunMain () at ../Modules/main.c:644
#28 0x00000000006aed29 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at ../Modules/main.c:698
#29 0x00007ffff7df70b3 in __libc_start_main (main=0x4ebd20 <main>, argc=4, argv=0x7fffffffe5c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffe5b8) at ../csu/libc-start.c:308
#30 0x00000000005f62ee in _start () at ../Objects/unicodeobject.c:5388

Potential references (links are intentionally broken to avoid spamming upstream issues):

https://github.com/zeromq/ pyzmq/issues/1167
https://github.com/zeromq/ pyzmq/issues/1003

@max-arnold
Copy link
Contributor

Workaround: set transport: tcp in both master and minion config files.

@defanator
Copy link
Contributor

@max-arnold yeah, same here:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fefa16cf96f in __GI___poll (fds=fds@entry=0x7fffa1c23160, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29	../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x00007fefa16cf96f in __GI___poll (fds=fds@entry=0x7fffa1c23160, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fefa004a56b in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fffa1c23160) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  zmq::signaler_t::wait (this=this@entry=0x2534658, timeout_=<optimized out>) at src/signaler.cpp:242
#3  0x00007fefa00252be in zmq::mailbox_t::recv (this=this@entry=0x25345f0, cmd_=cmd_@entry=0x7fffa1c23240, timeout_=timeout_@entry=-1) at src/mailbox.cpp:81
#4  0x00007fefa001315c in zmq::ctx_t::terminate (this=this@entry=0x25344c0) at src/ctx.cpp:181
#5  0x00007fefa006b7ce in zmq_ctx_term (ctx_=0x25344c0) at src/zmq.cpp:156
#6  0x00007fef9fa3e36b in ?? () from /usr/lib/python3/dist-packages/zmq/backend/cython/context.cpython-38-x86_64-linux-gnu.so
#7  0x00000000005a59b9 in subtype_dealloc (self=<optimized out>) at ../Objects/typeobject.c:1289
#8  0x00000000005cb374 in _Py_Dealloc (op=<optimized out>) at ../Objects/object.c:2215
#9  _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at ../Include/object.h:478
#10 _Py_XDECREF (op=<optimized out>) at ../Include/object.h:541
#11 dict_dealloc (mp=0x7fef9c4e9c00) at ../Objects/dictobject.c:1990
#12 0x00000000005a1bae in _Py_Dealloc (op=<optimized out>) at ../Objects/object.c:2215
#13 _Py_DECREF (filename=0x87a8e0 "../Objects/typeobject.c", lineno=1140, op=<optimized out>) at ../Include/object.h:478
#14 subtype_clear (self=<AsyncReqMessageClient at remote 0x7fef9c4d1880>) at ../Objects/typeobject.c:1140
#15 0x00000000004ec8b9 in delete_garbage (state=<optimized out>, state=<optimized out>, old=0x934840 <_PyRuntime+416>, collectable=0x7fffa1c234d0) at ../Modules/gcmodule.c:948
#16 collect.constprop.0 (generation=generation@entry=2, n_collected=n_collected@entry=0x0, n_uncollectable=n_uncollectable@entry=0x0, nofail=nofail@entry=1, state=0x9347f8 <_PyRuntime+344>)
    at ../Modules/gcmodule.c:1123
#17 0x000000000066babb in _PyGC_CollectNoFail () at ../Modules/gcmodule.c:1866
#18 0x000000000067dd60 in PyImport_Cleanup () at ../Python/import.c:598
#19 0x0000000000678450 in Py_FinalizeEx () at ../Python/pylifecycle.c:1240
#20 0x0000000000677138 in Py_Exit (sts=0) at ../Python/pylifecycle.c:2295
#21 0x000000000067716b in handle_system_exit () at ../Python/pythonrun.c:658
#22 0x0000000000677392 in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x1a77200) at ../Python/pythonrun.c:763
#23 PyErr_PrintEx (set_sys_last_vars=1) at ../Python/pythonrun.c:763
#24 0x00000000006775b4 in PyErr_Print () at ../Python/pythonrun.c:769
#25 PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffa1c236f8) at ../Python/pythonrun.c:434
#26 0x00000000006ae99e in pymain_run_file (cf=0x7fffa1c236f8, config=0x1a765b0) at ../Modules/main.c:381
#27 pymain_run_python (exitcode=0x7fffa1c236f0) at ../Modules/main.c:565
#28 Py_RunMain () at ../Modules/main.c:644
#29 0x00000000006aed29 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at ../Modules/main.c:698
#30 0x00007fefa15e10b3 in __libc_start_main (main=0x4ebd20 <main>, argc=4, argv=0x7fffa1c238d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffa1c238c8) at ../csu/libc-start.c:308
#31 0x00000000005f62ee in _start () at ../Objects/unicodeobject.c:5388
(gdb) 

zmq is waiting for something. forever.

@waynew
Copy link
Contributor

waynew commented Jul 24, 2020

I'm not sure about this particular case, but I've been hunting down the one where an incorrect state is provided, e.g.

do something:
    test.ping

vs

do something:
    test.nop

The first one stalls out, at a poll(10<eventfd. The second works fine. I was able to track things down into the bowels of the loader - if, within _inner_load something checks for a non-existent key it will trigger the failure case. In any module. So it's not necessarily that test.ping doesn't exist - you can change the code around to stop failing on that issue and it will "fail" properly. But the _inner_load will try and load every Salt module looking for a matching state name, and if any of the __virtual__ functions do 'something.nonexistent' in __salt__ it will then stall at exit on the same poll call.

I'm pretty sure that it is something within zeromq that's calling it, because if I simply manually do an os.close(10) (or whatever the file descriptor is) I get a segfault within ZMQ code, heh.

I don't know if the loader is a red herring or not, though.

I was also able to workaround it at the end of def sls by just doing a st_.client.destroy. That's probably not an ideal workaround, but it did stop the freezing.

🤷 There are a lot of things that I know at this point, but I think there are actually more things that I don't know than when I started 😝

@DmitryKuzmenko
Copy link
Contributor

@waynew @max-arnold Works for me after downgrading pyzmq to 18.0.1 and started hanging with pyzmq>=18.1.0.

@defanator
Copy link
Contributor

@DmitryKuzmenko @waynew @max-arnold @cmcmarrow unfortunately I can't invest much time into further debugging of this one, but a number of quick experiments shows that downgrading pyzmq to 18.0.1 also resolves an issue for us (Ubuntu 20.04 / Python 3.8.2 / libzmq 4.3.2). We've switched to TCP transport until there's no clean fix.

@cmcmarrow
Copy link
Contributor

cmcmarrow commented Aug 5, 2020

same here

[Thread 0x7fffeb7fe700 (LWP 22345) exited]
[Thread 0x7fffeafbd700 (LWP 22346) exited]
^C--Type <RET> for more, q to quit, c to continue without paging--Quit
(gdb) info threads
  Id   Target Id                                   Frame
* 1    Thread 0x7ffff7bf9740 (LWP 22285) "python3" 0x00007ffff7eda96f in __GI___poll (fds=0x7fffffffd930, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  20   Thread 0x7fffcaffd700 (LWP 22336) "python3" 0x00007ffff7ee743e in epoll_wait (epfd=13, events=0x7fffcaffc1d0, maxevents=256, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  21   Thread 0x7ffff0f06700 (LWP 22337) "python3" 0x00007ffff7ee743e in epoll_wait (epfd=16, events=0x7ffff0f051d0, maxevents=256, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
(gdb) bt
#0  0x00007ffff7eda96f in __GI___poll (fds=0x7fffffffd930, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007ffff5c79945 in ?? () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#2  0x00007ffff5c4fc21 in ?? () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#3  0x00007ffff5c2f163 in ?? () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#4  0x00007ffff5ca5cfa in zmq_ctx_term () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#5  0x00007ffff5ca5eb7 in zmq_ctx_destroy () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#6  0x00007ffff507d0fb in ?? () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/context.cpython-38-x86_64-linux-gnu.so
#7  0x00000000005a59b9 in ?? ()
#8  0x00000000005cb374 in ?? ()
#9  0x00000000005a1bae in ?? ()
#10 0x00000000004ec8b9 in ?? ()
#11 0x000000000066babb in _PyGC_CollectNoFail ()
#12 0x000000000067dd60 in PyImport_Cleanup ()
#13 0x0000000000678450 in Py_FinalizeEx ()
#14 0x00000000006aeab9 in Py_RunMain ()
#15 0x00000000006aed29 in Py_BytesMain ()
#16 0x00007ffff7dec0b3 in __libc_start_main (main=0x4ebd20 <main>, argc=4, argv=0x7fffffffe018, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe008)
    at ../csu/libc-start.c:308
#17 0x00000000005f62ee in _start ()
(gdb)

@cmcmarrow
Copy link
Contributor

This hack will stop the hang but will skip cleanup and may leak resources.
salt/scripts/salt-call.py

from salt.scripts import salt_call
from os import _exit as e
if __name__ == "__main__":
    salt_call()
    e(0)

@cmcmarrow
Copy link
Contributor

The bug does not sit in pyzmq but rather libzmq. I built pyzmq 18.0.0 with zmq 4.3.2 and got the hang! The break happened when they went from zmq 4.3.1 to 4.3.2. Note: When I ran libzmq tests I go no fails.

@cmcmarrow cmcmarrow added the Upstream-Bug is a result of an upstream issue, not in salt label Aug 7, 2020
@cmcmarrow
Copy link
Contributor

cmcmarrow commented Aug 7, 2020

Even though this is an upstream bug we are working on and patch and will open a pull request with libzmq

@disaster123
Copy link

The bug does not sit in pyzmq but rather libzmq. I built pyzmq 18.0.0 with zmq 4.3.2 and got the hang! The break happened when they went from zmq 4.3.1 to 4.3.2. Note: When I ran libzmq tests I go no fails.

i'm sorry but i downgraded from libzmq 4.3.2 to libzmq 4.3.1 and it still hangs.

@OrangeDog
Copy link
Collaborator

@disaster123 I think you need to rebuild pyzmq against the downgraded version.

@disaster123
Copy link

may be #57856 is still a different bug.

I did:

  • downgrade libzmq5
  • downgrade libzmq3-dev
  • rebuild pyzmq
  • verified python3 reports ZMQ 40301 as zmq version

but salt still hangs at the end of execution.

@disaster123
Copy link

OK working fix is just to install pyzmq=18.0.1 no matter which version of libzmq is installed it's working with 4.3.2 and 4.3.1 for me. Important is that you use:

pip3 install --no-binary=:all: pyzmq=18.0.1

@cmcmarrow
Copy link
Contributor

cmcmarrow commented Aug 12, 2020

@disaster123 thanks for info. What state are you using to test this? I want to see if your state cause the hang on my box?

@disaster123
Copy link

disaster123 commented Aug 13, 2020

@cmcmarrow i'm coming from #57856 where salt-call hangs on EVERY state. So for me it didn't matter what kind of state it is.

@OrangeDog
Copy link
Collaborator

@cmcmarrow where is the upstream issue being tracked?

@fepitre
Copy link
Contributor

fepitre commented Sep 15, 2020

* 3.9.0b4 - fails with `AttributeError: _scope_id` inside of `ipaddress.py`

I solved this issue (fepitre@4c5e18b) by defining the attribute for compatibility with newer python3.9 python/cpython@21da76d.

@cmcmarrow
Copy link
Contributor

#58364

Magnesium automation moved this from In progress to Done Oct 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Magnesium Mg release after Na prior to Al Regression The issue is a bug that breaks functionality known to work in previous releases. salt-call severity-high 2nd top severity, seen by most users, causes major problems State-Compiler Upstream-Bug is a result of an upstream issue, not in salt ZMQ
Projects
No open projects
Magnesium
  
Done
Development

No branches or pull requests