
Memory leak in Reactor #37938

Closed

johje349 opened this issue Nov 29, 2016 · 19 comments
Assignees
Labels
Bug, Core, P2, Pending-Discussion, Reactor, severity-medium, ZD (Zendesk), ZRELEASED - 2016.3.6, ZRELEASED - 2016.11.3
Milestone

Comments

@johje349

Description of Issue/Question

Memory leak, most noticeable in the Reactor and EventPublisher processes. The Reactor process has over 1G of reserved memory. The leak seems to occur during the orchestration that follows minion creation with salt-cloud.
Total used memory after a day of uptime has gone from 2G to 10G.

Possibly due to queuing the states in the orchestration. There are between 70 and 100 minions in the setup (we do auto-scaling).

# ps -xauwww | egrep "2286|1882" | grep -v egrep
root      1882  0.1  2.0 547848 343016 ?       S    Nov28   1:28 /usr/bin/python /usr/bin/salt-master EventPublisher
root      2286  6.9  6.6 2011440 1098436 ?     Rl   Nov28  87:39 /usr/bin/python /usr/bin/salt-master Reactor

Setup

A reactor which fires on the salt/cloud/*/created event:

#  cat minion-created.sls 
startup-orchestration:
  runner.state.orchestrate:
    - mods: orchestration.minion-created
    - pillar:
        id: {{ data['name'] }}

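The master-side binding from the salt/cloud/*/created event tag to this reactor SLS is not shown; a minimal sketch of what it would look like, assuming the reactor file lives at /srv/reactor/minion-created.sls (the path and filename are assumptions):

# cat /etc/salt/master.d/reactor.conf
reactor:
  - 'salt/cloud/*/created':
    - /srv/reactor/minion-created.sls
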
The orchestration SLS:

# cat minion-created.sls 
{% set host = salt.pillar.get('id') %}

notify-slack:
  salt.function:
    - name: notify.slack
    - tgt: 'xxxxx'
    - arg:
      - "[salt-cloud] DigitalOcean droplet created: {{ host }}"

update-hosts:
  salt.state:
    - tgt: '*'
    - sls: hosts
    - queue: True

update-iptables:
  salt.state:
    - tgt: '*'
    - sls: iptables
    - queue: True

update-dns-server:
  salt.function:
    - name: cmd.run
    - tgt: 'xxxxx'
    - arg:
      - /srv/scripts/add_dns_record.sh {{ host }}

run-highstate:
  salt.state:
    - tgt: {{ host }}
    - highstate: True

{% if salt['match.pcre']("xxxxx", host) %}
update-haproxy:
  salt.state:
    - tgt: 'xxxxx*'
    - sls: haproxy
    - queue: True
{% endif %}

{% if salt['match.pcre']("xxxxx", host) %}
update-usersync-haproxy:
  salt.state:
    - tgt: 'xxxxx*'
    - sls: haproxy
    - queue: True
{% endif %}

{% if salt['match.pcre']("xxxxx", host) %}
update-usersync-apache2-ssl:
  salt.state:
    - tgt: 'xxxxx*'
    - sls: apache2-ssl 
    - queue: True
{% endif %}

Steps to Reproduce Issue

Create a few minions using salt-cloud with the above setup.
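
For reference, a minimal sketch of the kind of salt-cloud profile that would drive this setup, assuming a DigitalOcean provider entry named do; the profile name, image and size slugs are illustrative:

# cat /etc/salt/cloud.profiles.d/digitalocean.conf
do-small:
  provider: do             # assumes a matching entry in cloud.providers.d
  image: ubuntu-14-04-x64  # illustrative DigitalOcean image slug
  size: 512mb              # illustrative droplet size

Creating a minion from it (e.g. salt-cloud -p do-small web01) fires a salt/cloud/web01/created event, which the reactor above matches.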

Versions Report

Salt Version:
           Salt: 2016.3.4
 
Dependency Versions:
           cffi: 0.8.6
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.5.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5
 
System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-68-generic
         system: Linux
        version: Ubuntu 14.04 trusty
@gtmanfred
Contributor

@DmitryKuzmenko Can you take a look at this?

I was able to confirm the memory leak with just the salt-api webhook triggering highstates on minions.

@gtmanfred gtmanfred added severity-critical, Bug, Core, severity-medium, P2, Reactor, and TEAM Core labels Nov 30, 2016
@gtmanfred gtmanfred added this to the Approved milestone Nov 30, 2016
@DmitryKuzmenko
Contributor

I can't say anything without a deep analysis. Right now I'm working on another memleak issue; I'll add this one to my ToDo list.
@meggiebot could you please confirm?

@DmitryKuzmenko DmitryKuzmenko self-assigned this Dec 1, 2016
@meggiebot meggiebot modified the milestones: Nitrogen 1, Approved Dec 1, 2016
@meggiebot

@DmitryKuzmenko yes, let's make this a top priority since it is a blocker for an upcoming release.

@johje349
Author

johje349 commented Dec 6, 2016

Same behaviour on 2016.11.0 (Carbon)

@gtmanfred
Contributor

@johje349 I am having trouble replicating this to get to over 1G of memory.

Are you sure your ps command shows the process at 10G of resident memory? I believe ps reports RSS in KiB, which would mean the 1098436 in your first comment is just over 1GiB. Also, the memory percentage looks low: 6.6% would put the server at about 15-16G of RAM. Are you sure you are seeing this grow to 10x that amount?

I have a docker-compose environment simulating minions spinning up and down, but it is going to take a while today to get up to 1G.

Thanks,
Daniel

@gtmanfred gtmanfred removed ZRELEASED - 2016.11.2, ZRELEASED - 2016.3.5, and severity-critical labels Dec 6, 2016
@gtmanfred gtmanfred modified the milestones: Blocked, Nitrogen 1 Dec 6, 2016
@gtmanfred
Contributor

@johje349 I think this may be a leak in the RunnerClient.

Would you be able to convert your orchestrate script over to just a reactor sls that runs local.state.apply and local.cmd.run and see if you still see the memory leak?

Thanks
Daniel
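
A rough sketch of what such a reactor SLS could look like, with LocalClient calls in place of the runner and the targets reused from the orchestration above (the IDs, targets, and kwarg handling are illustrative; this uses the old-style reactor syntax):

# cat minion-created-local.sls
run-highstate:
  local.state.apply:
    - tgt: {{ data['name'] }}

update-hosts:
  local.state.apply:
    - tgt: '*'
    - arg:
      - hosts          # mods to apply
    - kwarg:
        queue: True

update-dns-server:
  local.cmd.run:
    - tgt: 'xxxxx'
    - arg:
      - /srv/scripts/add_dns_record.sh {{ data['name'] }}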

@johje349
Author

johje349 commented Dec 7, 2016

@gtmanfred I have 16G of RAM in the server. Currently 13G is in use, minus 6G of cache.
From top:
21356 root 20 0 1981676 1.184g 3812 S 1.7 7.6 378:23.91 /usr/bin/python

# ps -xauwww | grep 21356 | grep -v grep
root     21356 13.0  7.5 1981676 1241508 ?     Sl   Dec05 378:28 /usr/bin/python /usr/bin/salt-master Reactor

I will test your suggestions.

BR,
/JJ

@rickh563 rickh563 added the ZD label Dec 7, 2016
@rickh563

rickh563 commented Dec 7, 2016

ZD-1056

@johje349
Author

johje349 commented Dec 8, 2016

@gtmanfred Currently I have only beacon load events coming in, which trigger a reactor that runs local.state.sls. These are small events, but I can see memory usage increasing by about 4k per event. It would take quite a while to reach 1G+ with this configuration.

BR,

/JJ

@johje349
Author

I now have an uptime of 20 days, and the Reactor process has consumed more than 2G of memory:
root 2168 8.2 26.7 3055820 2188964 ? Sl Jan04 2363:14 /usr/bin/python /usr/bin/salt-master Reactor

@DmitryKuzmenko
Contributor

@johje349 I'm working on this issue. It's actually not that easy.

@DmitryKuzmenko
Contributor

@johje349 is it possible for you to apply the fix from #38951 manually and check whether it helps?

@johje349
Author

johje349 commented Jan 26, 2017

@DmitryKuzmenko change applied, let's wait and see.

@meggiebot meggiebot added the fixed-pls-verify label Jan 26, 2017
@johje349
Author

johje349 commented Feb 1, 2017

The fix does not seem to have improved the situation, unfortunately.

@DmitryKuzmenko
Contributor

@johje349 bad news. =( That means the issue I reproduced was not yours. Sorry.
Could you please share the SLS files used by your orchestration SLS?

@meggiebot meggiebot removed the fixed-pls-verify label Feb 2, 2017
@meggiebot meggiebot modified the milestones: Nitrogen 4, Nitrogen 3 Feb 6, 2017
@DmitryKuzmenko DmitryKuzmenko added the Pending-Discussion label Feb 7, 2017
@johje349
Author

johje349 commented Feb 9, 2017

@DmitryKuzmenko salt-master has been running for 4 days now and the Reactor process consumes 800M; it seems to be pretty stable around that value, though.

@DmitryKuzmenko
Contributor

@johje349 glad to hear it. Could we close this issue then?

@johje349
Author

@DmitryKuzmenko yes go ahead and close it.

@DmitryKuzmenko
Contributor

@johje349 thank you!

5 participants