[BUG] Salt-master process fails and exits under high load #66407

Open
jtraub91 opened this issue Apr 19, 2024 · 1 comment
Labels
Bug (broken, incorrect, or confusing behavior), needs-triage

Comments

@jtraub91
Contributor

Description

Possibly a regression in onedir Salt: high load, which traditionally just resulted in timeouts and "master is unresponsive" messages, now also causes the salt-master process to fail and exit.

Setup

Install salt-master and salt-minion on the same machine. In my test case I used a machine with a single CPU; the number of CPUs may have a big effect on the results shown below.

The following configuration was used.

# /etc/salt/master.d/master.conf
user: root

# /etc/salt/minion.d/minion.conf
master: localhost

The following states were used to demonstrate this behavior.

# /srv/salt/orch.sls
run_state:
  salt.state:
    - tgt: {{ grains.id.split("_")[0] }}
    - sls: state
    - concurrent: True

# /srv/salt/state.sls
sleep 30:
  cmd.run

Steps to Reproduce the behavior

The goal is to put the master under heavy load. The following bash loop launches 10 orchestrations concurrently:

for i in $(seq 1 10); do
  salt-run state.orch orch --async
done
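
Because the aim is to stress the master, it can help to record how much memory the master processes hold while the loop runs. The following is a minimal sketch and not part of the original report: the helper names sum_rss_kib and sample_rss are hypothetical, and matching on the string salt-master assumes that is how the master processes appear in the ps command line on your system.

```shell
#!/usr/bin/env bash

# Read `ps -e -o rss= -o args=` style lines on stdin and print the summed RSS
# (KiB) of every line whose command line contains the given substring.
# The pattern is passed through the environment so that it does not appear in
# awk's own command line and get counted by accident.
sum_rss_kib() {
  PATTERN="$1" awk '
    index($0, ENVIRON["PATTERN"]) { total += $1 }
    END { print total + 0 }'
}

# Print one "HH:MM:SS <KiB>" line per interval for the given number of samples.
sample_rss() {
  local pattern="$1" interval="$2" count="$3" i
  for i in $(seq 1 "$count"); do
    printf '%s %s\n' "$(date +%T)" \
      "$(ps -e -o rss= -o args= | sum_rss_kib "$pattern")"
    sleep "$interval"
  done
}

# Usage (in a second terminal while the orchestration loop runs):
#   sample_rss salt-master 1 120
```

Summing RSS over-counts pages shared between forked workers, so the figure is best read as a trend over time rather than an exact total.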

The following error messages then appear in the master logs:

2024-04-19 19:57:24,903 [salt.loaded.int.states.saltmod:384 ][WARNING ][11443] Output from salt state not highstate                                                                               
2024-04-19 19:57:25,973 [salt.state       :323 ][ERROR   ][11443] {'out': 'highstate', 'ret': {'salt-3006': False}}                                                                               
2024-04-19 19:57:50,409 [salt.client      :1912][ERROR   ][11888] Message timed out                                                                                                               
2024-04-19 19:57:54,875 [salt.client      :1912][ERROR   ][11496] Message timed out                                                                                                               
2024-04-19 19:58:07,149 [salt.client      :1912][ERROR   ][11593] Message timed out                                                                                                               
2024-04-19 19:58:07,149 [salt.state       :323 ][ERROR   ][11888] An exception occurred in this state: Traceback (most recent call last):                                                         
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1910, in pub                                                                                              
    payload = channel.send(payload_kwargs, timeout=timeout)                                                                                                                                       
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 125, in wrap                                                                                           
    raise exc_info[1].with_traceback(exc_info[2])                                                                                                                                                 
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 131, in _target                                                                                        
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))                                                                                                                    
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync                                                                                       
    return future_cell[0].result()                                                                                                                                                                
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)                                                                                                                                                                
  File "<string>", line 4, in raise_exc_info                                                                                                                                                      
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run                                                                                              
    yielded = self.gen.throw(*exc_info)                                                                                                                                                           
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 338, in send                                                                                               
    ret = yield self._uncrypted_transfer(load, timeout=timeout)                                                                                                                                   
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run                                                                                              
    value = future.result()                                                                                                                                                                       
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result                                                                                     
    raise_exc_info(self._exc_info)                                                                                                                                                                
  File "<string>", line 4, in raise_exc_info                                                                                                                                                      
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run                                                                                              
    yielded = self.gen.throw(*exc_info)                                                                                                                                                           
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 309, in _uncrypted_transfer                                                                                
    ret = yield self.transport.send(                                                                                                                                                              
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run 
    value = future.result()                                                                                                                                                                       
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result                                                                                     
    raise_exc_info(self._exc_info)                                                                                                                                                                
  File "<string>", line 4, in raise_exc_info                                                                                                                                                      
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run                                                                                              
    yielded = self.gen.throw(*exc_info)                                                                                                                                                           
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 915, in send                                                                                             
    ret = yield self.message_client.send(load, timeout=timeout)                                                                                                                                   
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()                                                                                                                                                                       
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)                                                               
  File "<string>", line 4, in raise_exc_info                                                     
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)                                                          
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 594, in send
    recv = yield future                         
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()                                                                      
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)                                                               
  File "<string>", line 4, in raise_exc_info                                                     
salt.exceptions.SaltReqTimeoutError: Message timed out

During handling of the above exception, another exception occurred:                                                                                                                               

Traceback (most recent call last):                                                               
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 387, in run_job
    pub_data = self.pub(                                                                         
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1913, in pub
    raise SaltReqTimeoutError(                                                                   
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

After these errors, the salt-master process has failed and exited:

# systemctl status salt-master
× salt-master.service - The Salt Master Server
     Loaded: loaded (/lib/systemd/system/salt-master.service; enabled; vendor preset: enabled)
     Active: failed (Result: oom-kill) since Fri 2024-04-19 19:58:08 UTC; 4min 59s ago
       Docs: man:salt-master(1)
             file:///usr/share/doc/salt/html/contents.html
             https://docs.saltproject.io/en/latest/contents.html
    Process: 9044 ExecStart=/usr/bin/salt-master (code=exited, status=0/SUCCESS)
   Main PID: 9044 (code=exited, status=0/SUCCESS)
        CPU: 54.873s
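
The status output reports Result: oom-kill even though the main PID exited with status 0, so it can be useful to read those fields directly rather than from the formatted summary. The following is a minimal sketch: the helper names get_prop and was_oom_killed are hypothetical, and listing OOMPolicy assumes the installed systemd exposes that property. One reading, offered as an assumption rather than something the report confirms, is that a worker process in the unit was killed rather than the main process.

```shell
#!/usr/bin/env bash

# Read KEY=VALUE lines (as printed by `systemctl show`) on stdin and print the
# value of the first line whose key matches exactly.
get_prop() {
  awk -v key="$1" '
    index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# Succeed if the unit properties on stdin report an OOM kill.
was_oom_killed() {
  [ "$(get_prop Result)" = "oom-kill" ]
}

# Usage:
#   systemctl show salt-master.service -p Result -p ExecMainStatus -p OOMPolicy
#   systemctl show salt-master.service -p Result | was_oom_killed \
#     && echo "unit was stopped after an OOM kill"
```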

Expected behavior

The master should keep running and, ideally, let all minions respond without timeouts once CPU capacity frees up. The master failing and exiting under these circumstances was not observed on classic installations of Salt.

Versions Report

Salt Version:
          Salt: 3006.7
 
Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]
 
Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: ubuntu 22.04.2 jammy
        locale: utf-8
       machine: x86_64
       release: 5.15.0-67-generic
        system: Linux
       version: Ubuntu 22.04.2 jammy
jtraub91 added the Bug (broken, incorrect, or confusing behavior) and needs-triage labels Apr 19, 2024
@whytewolf
Contributor

Looking at the results, you ran out of memory, which caused the Linux OOM killer to kill Salt. That is what Result: oom-kill means.

Can you check dmesg to validate this?

Given that you were spinning up multiple processes concurrently, each one added to the memory usage, so this isn't a surprising result.
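
One way to run the dmesg check is to filter the kernel log for OOM-killer messages. This is a minimal sketch: the function names find_oom_kills and oom_victims are hypothetical, and the patterns assume the message wording used by recent Linux kernels, which varies between versions.

```shell
#!/usr/bin/env bash

# Keep only kernel log lines on stdin that look like OOM-killer activity.
find_oom_kills() {
  grep -i -E 'out of memory|oom-kill|oom_reaper'
}

# Print the process name from each "Killed process <pid> (<name>)" line.
oom_victims() {
  sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p'
}

# Usage (typically as root). The time window is taken from the report's
# timestamps; adjust it to the time zone your journal uses.
#   dmesg -T | find_oom_kills
#   journalctl -k --since "2024-04-19 19:55" --until "2024-04-19 20:00" \
#     | find_oom_kills | oom_victims
```

If the victim list names salt-master or one of its Python workers, that would support the out-of-memory explanation; an empty result would point elsewhere.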
