Salt Mine memory leak #31454
Salt version is 2015.8.7 running on Ubuntu 14.04.3. |
@johje349, thanks for reporting. I have tried to reproduce this with a single minion without success. I will try more minions later.
jmoney-main ~ master # cat /etc/salt/minion
master: localhost
log_fmt_console: '%(colorlevel)s %(colormsg)s'
mine_functions:
test.fib:
- 19
mine_interval: 3
jmoney-main ~ master # cat /srv/salt/test.sls
{% for host, ip in salt['mine.get']('*', 'test.fib').iteritems() %}
test-{{ host }}:
cmd.run:
- name: "echo {{ host }}"
{% endfor %}
jmoney-main ~ master # for i in {1..11} ; do salt jmoney-main state.apply test &> /dev/null ; free -m ; done
total used free shared buffers cached
Mem: 2005 1671 333 40 190 999
-/+ buffers/cache: 480 1524
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1671 333 40 190 999
-/+ buffers/cache: 481 1523
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1672 332 40 190 999
-/+ buffers/cache: 481 1523
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1672 332 40 190 999
-/+ buffers/cache: 482 1522
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1672 332 40 190 1000
-/+ buffers/cache: 482 1522
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1672 332 40 190 1000
-/+ buffers/cache: 482 1522
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1673 331 40 190 1000
-/+ buffers/cache: 482 1522
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1673 331 40 190 1000
-/+ buffers/cache: 482 1522
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1673 331 40 190 1000
-/+ buffers/cache: 482 1522
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1672 332 40 190 1000
-/+ buffers/cache: 482 1522
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1673 331 40 190 1000
-/+ buffers/cache: 482 1522
Swap: 0 0 0
jmoney-main ~ master # salt --versions
Salt Version:
Salt: 2015.8.7
Dependency Versions:
Jinja2: 2.7.3
M2Crypto: 0.21.1
Mako: 1.0.0
PyYAML: 3.11
PyZMQ: 14.4.0
Python: 2.7.9 (default, Mar 1 2015, 12:57:24)
RAET: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
cffi: 0.8.6
cherrypy: Not Installed
dateutil: Not Installed
gitdb: 0.5.4
gitpython: 0.3.2 RC1
ioflo: Not Installed
libgit2: Not Installed
libnacl: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.2
mysql-python: Not Installed
pycparser: 2.10
pycrypto: 2.6.1
pygit2: Not Installed
python-gnupg: Not Installed
smmap: 0.8.2
timelib: Not Installed
System Versions:
dist: debian 8.3
machine: x86_64
release: 3.16.0-4-amd64
system: debian 8.3 |
@johje349, I've retried with 20 minions and the master's memory profile increases by about 4 MiB with each state run.
jmoney-main ~ master # salt-key -L
Accepted Keys:
CentOS-01-brain-so9r
CentOS-01-brain-vste
CentOS-01-curly-6wij
CentOS-01-curly-c367
CentOS-01-curly-fh6k
CentOS-01-curly-x0r7
CentOS-01-larry-0o7u
CentOS-01-larry-85rj
CentOS-01-moe-1tfn
CentOS-01-moe-6prr
CentOS-01-moe-jgvl
CentOS-01-pinky-89ui
CentOS-01-pinky-8b01
CentOS-01-pinky-izmh
CentOS-01-pinky-qhg6
jmoney-main
Denied Keys:
Unaccepted Keys:
Rejected Keys:
jmoney-main ~ master # for i in {1..11} ; do salt 'CentOS-*' state.apply test &> /dev/null ; free -m ; done
total used free shared buffers cached
Mem: 2005 1720 284 40 190 990
-/+ buffers/cache: 539 1465
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1725 279 40 190 990
-/+ buffers/cache: 544 1460
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1728 276 40 190 990
-/+ buffers/cache: 547 1457
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1732 272 40 190 991
-/+ buffers/cache: 550 1454
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1735 269 40 190 991
-/+ buffers/cache: 554 1450
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1739 265 40 190 991
-/+ buffers/cache: 557 1447
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1743 261 40 190 991
-/+ buffers/cache: 561 1443
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1746 258 40 190 992
-/+ buffers/cache: 564 1440
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1750 254 40 190 992
-/+ buffers/cache: 567 1437
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1754 250 40 190 992
-/+ buffers/cache: 571 1433
Swap: 0 0 0
total used free shared buffers cached
Mem: 2005 1757 247 40 190 992
-/+ buffers/cache: 573 1431
Swap: 0 0 0
jmoney-main ~ master # salt 'CentOS-*' test.version
CentOS-01-pinky-89ui:
2015.8.7
CentOS-01-brain-vste:
2015.8.7
CentOS-01-moe-1tfn:
2015.8.7
CentOS-01-curly-c367:
2015.8.7
CentOS-01-larry-85rj:
2015.8.7
CentOS-01-pinky-qhg6:
2015.8.7
CentOS-01-curly-x0r7:
2015.8.7
CentOS-01-moe-6prr:
2015.8.7
CentOS-01-larry-0o7u:
2015.8.7
CentOS-01-brain-so9r:
2015.8.7
CentOS-01-curly-6wij:
2015.8.7
CentOS-01-pinky-8b01:
2015.8.7
CentOS-01-pinky-izmh:
2015.8.7
CentOS-01-moe-jgvl:
2015.8.7
CentOS-01-curly-fh6k:
2015.8.7 |
Additionally, the memory profile remains constant several minutes later.
jmoney-main ~ master # free -m
total used free shared buffers cached
Mem: 2005 1798 206 40 190 996
-/+ buffers/cache: 611 1393
Swap: 0 0 0
If I stop the master, the memory use drops:
jmoney-main ~ master # free -m
total used free shared buffers cached
Mem: 2005 1454 550 40 190 996
-/+ buffers/cache: 267 1737
Swap: 0 0 0 |
How many worker threads did you use on the master? |
@johje349 what transport do you use? Zeromq or tcp? |
Zeromq |
@johje349 thnx |
@johje349 I've run an overnight test with a debug Python build under the Valgrind massif tool. My results: salt-master grew by 35 MB after 12 hours, which looks less like a leak and more like Valgrind overhead plus Python's internal caching. I've analyzed the massif output and found nothing critical. @johje349, could you please provide additional information about which of the salt-master processes is consuming the memory:
$ python -c "import setproctitle" || echo pip install setproctitle
$ ps a | grep salt-master
7079 pts/2 S+ 0:00 python ./scripts/salt-master ProcessManager
7087 pts/2 S+ 0:00 python ./scripts/salt-master MultiprocessingLoggingQueue
7090 pts/2 Sl+ 0:00 python ./scripts/salt-master ZeroMQPubServerChannel
7091 pts/2 S+ 0:00 python ./scripts/salt-master EventPublisher
7094 pts/2 Sl+ 0:01 python ./scripts/salt-master Reactor
7095 pts/2 S+ 0:00 python ./scripts/salt-master Maintenance
7096 pts/2 S+ 0:00 python ./scripts/salt-master ReqServer_ProcessManager
7097 pts/2 Sl+ 0:00 python ./scripts/salt-master MWorkerQueue
7099 pts/2 Sl+ 0:00 python ./scripts/salt-master MWorker-0
salt --summary '*' test.ping
ps aux | grep salt-master > ps.start
ps aux | grep salt-master > ps.end
Additionally, please provide the count of executed |
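As a rough illustration of how the two snapshots could be compared afterwards, here is a minimal sketch; it assumes the standard `ps aux` column layout (second field PID, sixth field RSS in KiB), and the helper itself is an assumption for illustration, not part of the request above:

```python
#!/usr/bin/env python
# Hypothetical helper: report how much each salt-master process grew
# between the ps.start and ps.end snapshots captured above.

def rss_by_pid(path):
    # `ps aux` columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    table = {}
    with open(path) as snapshot:
        for line in snapshot:
            fields = line.split()
            if len(fields) > 10 and 'salt-master' in line:
                table[fields[1]] = int(fields[5])   # RSS in KiB
    return table

start = rss_by_pid('ps.start')
end = rss_by_pid('ps.end')
for pid in sorted(set(start) & set(end)):
    print('PID %s grew by %d KiB' % (pid, end[pid] - start[pid]))
```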
@johje349 we'd like to make additional improvements here if you can provide the information requested by Dmitry above. |
Requested information provided below. I now have 66 minions and the master is configured with 24 worker threads. I run state.apply 10 times during the test.
Execution:
|
The memory consumption seems fairly stable across subsequent runs, though, which is a clear improvement since the issue was reported. The master and minions are running 2016.3.2.
|
@johje349 thank you, this makes sense. I'll continue working on this as soon as possible. |
Hopefully it's done: at least in my tests the memory doesn't leak anymore. |
Awesome |
It has been observed that when running this command:

```
salt "*" test.ping
```

sometimes the command would return `Minion did not return. [No response]` for some of the minions even though the minions did indeed respond (reproduced running a Windows salt-master on Python 3 using the TCP transport).

After investigating this further, it seems there is a race condition: if the response arrives via an event before events are being listened for, the response is lost. For instance, `salt.client.LocalClient.cmd_cli`, which is what is invoked by the command above, won't start listening for events until `get_cli_event_returns`, which invokes `get_iter_returns`, which invokes `get_returns_no_block`, which invokes `self.event.get_event`, which connects to the event bus if it hasn't connected yet (which is the case the first time it hits this code). But events may be fired any time after `self.pub()` is executed, which occurs before this code. We need to ensure that events are being listened for before it is possible for them to return. We also want to avoid reintroducing issue saltstack#31454, which PR saltstack#36024 fixed, though that fix in turn caused this issue.

This is the approach I have taken to try to tackle this issue. It doesn't seem possible to generically discern whether events can be returned by a given function that invokes `run_job` and contains an event-searching function such as `get_cli_event_returns`. So for all such functions that could possibly need to search the event bus, we do the following:

- Record whether the event bus is currently being listened to.
- When invoking `run_job`, ensure that `listen=True` so that `self.pub()` will make sure the event bus is listened to before sending the payload.
- When all possible event bus activities are concluded, stop listening to the event bus if it was not originally being listened to. This is designed so that issue saltstack#31454 does not reappear. We do this via a try/finally block in all instances of such code.

Signed-off-by: Sergey Kizunov <sergey.kizunov@ni.com>
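A minimal, hypothetical sketch of the record/listen/try-finally pattern the commit message describes; the names used here (`client`, `event_bus`, `run_job`, `collect_returns`) are illustrative stand-ins rather than Salt's actual API:

```python
# Sketch only: a generic client whose publish path can attach to an event bus.
def run_and_collect(client, tgt, fun):
    was_listening = client.event_bus.listening   # record the current state
    try:
        # listen=True makes the publish path attach to the event bus *before*
        # the payload is sent, so a fast return cannot slip past unheard.
        jid = client.run_job(tgt, fun, listen=True)
        return client.collect_returns(jid)
    finally:
        # If this call is what started the listening, stop again afterwards so
        # the subscription is not held open forever (keeping #31454 fixed).
        if not was_listening:
            client.event_bus.stop_listening()
```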
I have a test state which looks like this:
The master has 36 minions. Each time I run the state, memory usage on the master increases by about 100 MB.
Running the state enough times, memory fills up and everything stops working.