salt-master crashes every few days #34215

Closed
rvora opened this issue Jun 22, 2016 · 27 comments
Labels: Bug, fixed-pls-verify, P2, severity-medium, ZRELEASED - 2016.3.2

Comments

@rvora

rvora commented Jun 22, 2016

Description of Issue/Question

Every few days (4 or 5) the salt-master stops responding and we need to restart it.

Setup

We have a number of salt-syndics that connect to this salt-master. The syndics run a very old version (2014.7.5). We get a fair number of "Salt minion claiming to be <redacted> has attempted to communicate with the master and could not be verified" messages in the logs, but functionally everything works until the salt-master crashes.

Steps to Reproduce Issue

2016-06-22 08:31:39,001 [salt.master                                          ][ERROR   ][7815] Unable to decrypt token: Unable to decrypt message
2016-06-22 08:31:39,006 [salt.master                                          ][ERROR   ][7815] Salt minion claiming to be <redacted> has attempted tocommunicate with the master and could not be verified
2016-06-22 08:31:39,006 [salt.master                                          ][WARNING ][7815] Minion id <redacted> is not who it says it is!
2016-06-22 08:31:39,455 [salt.transport.ipc                                   ][ERROR   ][7786] Exception occurred in Subscriber while handling stream: Already reading
2016-06-22 08:31:39,456 [salt.utils.process                                   ][ERROR   ][7786] An un-handled exception from the multiprocessing process 'Reactor-6' was caught:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/utils/process.py", line 613, in _run
    return self._original_run()
  File "/usr/lib/python2.6/site-packages/salt/utils/reactor.py", line 215, in run
    for data in self.event.iter_events(full=True):
  File "/usr/lib/python2.6/site-packages/salt/utils/event.py", line 620, in iter_events
    data = self.get_event(tag=tag, full=full, match_type=match_type)
  File "/usr/lib/python2.6/site-packages/salt/utils/event.py", line 579, in get_event
    ret = self._get_event(wait, tag, match_func, no_block)
  File "/usr/lib/python2.6/site-packages/salt/utils/event.py", line 484, in _get_event
    raw = self.subscriber.read_sync(timeout=wait)
  File "/usr/lib/python2.6/site-packages/salt/transport/ipc.py", line 654, in read_sync
    return ret_future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.6/site-packages/salt/transport/ipc.py", line 631, in _read_sync
    raise exc_to_raise  # pylint: disable=E0702
AssertionError: Already reading
2016-06-22 08:31:39,475 [salt.master                                          ][ERROR   ][7814] Error in function _minion_event:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1540, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1306, in _minion_event
    self.masterapi._minion_event(load)
  File "/usr/lib/python2.6/site-packages/salt/daemons/masterapi.py", line 752, in _minion_event
    self.event.fire_event(event, event['tag'])  # old dup event
  File "/usr/lib/python2.6/site-packages/salt/utils/event.py", line 662, in fire_event
    self.io_loop.run_sync(lambda: self.pusher.send(msg))
  File "/usr/lib64/python2.6/site-packages/tornado/ioloop.py", line 444, in run_sync
    return future_cell[0].result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/ipc.py", line 389, in send
    yield self.stream.write(pack)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
StreamClosedError

The StreamClosedError exception keeps repeating, about 15 times within 2 or 3 seconds, and then the log stops completely.

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

Salt-Master (master of masters that crashes every few days):

[root@salty salt]# salt --versions
Salt Version:
           Salt: 2016.3.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.2.2
       dateutil: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: 0.20.0
        libnacl: Not Installed
       M2Crypto: 0.20.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: 0.20.3
         Python: 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.5.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: 0.2.4
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: centos 6.8 Final
        machine: x86_64
        release: 2.6.32-642.1.1.el6.x86_64
         system: Linux
        version: CentOS 6.8 Final

On the Salt-syndics with older versions:

               Salt: 2014.7.5
             Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
             Jinja2: 2.2.1
           M2Crypto: 0.20.2
     msgpack-python: 0.4.6
       msgpack-pure: Not Installed
           pycrypto: 2.0.1
            libnacl: Not Installed
             PyYAML: 3.10
              ioflo: Not Installed
              PyZMQ: 14.3.1
               RAET: Not Installed
                ZMQ: 3.2.5
               Mako: Not Installed
@Ch3LL
Contributor

Ch3LL commented Jun 23, 2016

@rvora would you mind sharing your master/syndic configs to help me try to replicate this issue?

@Ch3LL Ch3LL added the info-needed label Jun 23, 2016
@Ch3LL Ch3LL modified the milestones: Approved, Blocked Jun 23, 2016
@rvora
Author

rvora commented Jun 27, 2016

Here's the /etc/salt/master for the master of masters (some sensitive information has been redacted or removed). Note that this is from a non-production setup (open_mode: True) where I am seeing this behavior. I have also removed the external_auth and rest_cherrypy clauses below.

open_mode: True
keep_jobs: 2
worker_threads: 10
runner_dirs: [ '/srv/salt/runners' ]

fileserver_backend:
  - roots
  - git

gitfs_remotes:
  - {'ssh://git@internal-git-server:9999/binddns-formula.git': [{'pubkey': '/root/.ssh/id_rsa.pub'}, {'privkey': '/root/.ssh/id_rsa'}]}

order_masters: True

And /etc/salt/master for the sub-level masters running salt-syndic:

    open_mode: True
    file_recv: True
    #log_level: debug
    timeout: 10  # default 5
    gather_job_timeout: 10  # default 5
    runner_dirs: [ '/srv/salt/runners' ]

    pillar_roots:
      base:
        - /srv/pillar
        - /srv/external_pillar


    syndic_master: master-of-masters-ip-address

    nodegroups:
      monit-group: 'G@roles:monit or G@roles:logstash'

@deuscapturus
Contributor

deuscapturus commented Jun 28, 2016

We are having the same problem with 2016.3.1. We typically have to restart our salt-master and salt-api each morning. We do not have syndic enabled.

These errors seem to kick off the problem: it starts with dozens of pillar render errors ([Errno 12] Cannot allocate memory), and once the master is in the bad state the StreamClosedError repeats.

2016-06-27 22:59:47,067 [salt.pillar      ][CRITICAL][20256] Rendering SLS 'learner.narwhal' failed, render error:
[Errno 12] Cannot allocate memory
2016-06-27 22:59:47,068 [salt.pillar      ][CRITICAL][20256] Pillar render error: Rendering SLS 'learner.narwhal' failed. Please see master log for details.
2016-06-27 22:59:48,424 [salt.pillar      ][CRITICAL][20259] Rendering SLS 'tc-clips-service' failed, render error:
[Errno 12] Cannot allocate memory
2016-06-27 22:59:48,446 [salt.pillar      ][CRITICAL][20259] Rendering SLS 'clips-service-stage' failed, render error:
[Errno 12] Cannot allocate memory
2016-06-27 22:59:48,838 [salt.pillar      ][CRITICAL][20259] Rendering SLS 'stackdriver' failed, render error:
[Errno 12] Cannot allocate memory
2016-06-28 02:25:17,549 [salt.master      ][ERROR   ][2130] Error in function _return:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/master.py", line 1540, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.7/site-packages/salt/master.py", line 1353, in _return
    self.opts, load, event=self.event, mminion=self.mminion)
  File "/usr/lib/python2.7/site-packages/salt/utils/job.py", line 67, in store_job
    event.fire_event(load, tagify([load['jid'], 'ret', load['id']], 'job'))
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 656, in fire_event
    log.debug('Sending event - data = {0}'.format(data))
  File "/usr/lib64/python2.7/logging/init.py", line 1137, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python2.7/site-packages/salt/log/setup.py", line 310, in _log
    self, level, msg, args, exc_info=exc_info, extra=extra
  File "/usr/lib64/python2.7/logging/init.py", line 1267, in _log
    record = self.makeRecord(self.name, level, fn, lno, msg, args, exc_info, func, extra)
  File "/usr/lib/python2.7/site-packages/salt/log/setup.py", line 344, in makeRecord
    exc_info, func)
  File "/usr/lib/python2.7/site-packages/salt/log/setup.py", line 191, in init
    self.colormsg = '%s%s%s' % (cmsg, self.getMessage(), reset)
MemoryError
2016-06-28 02:26:05,072 [salt.transport.ipc][ERROR   ][2097] Exception occurred in Subscriber while handling stream: Already reading
2016-06-28 02:26:05,093 [salt.utils.process][ERROR   ][2097] An un-handled exception from the multiprocessing process 'Reactor-6' was caught:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/utils/process.py", line 613, in _run
    return self._original_run()
  File "/usr/lib/python2.7/site-packages/salt/utils/reactor.py", line 215, in run
    for data in self.event.iter_events(full=True):
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 620, in iter_events
    data = self.get_event(tag=tag, full=full, match_type=match_type)
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 579, in get_event
    ret = self._get_event(wait, tag, match_func, no_block)
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 484, in _get_event
    raw = self.subscriber.read_sync(timeout=wait)
  File "/usr/lib/python2.7/site-packages/salt/transport/ipc.py", line 654, in read_sync
    return ret_future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/site-packages/salt/transport/ipc.py", line 631, in _read_sync
    raise exc_to_raise  # pylint: disable=E0702
AssertionError: Already reading
2016-06-28 02:26:21,294 [salt.master      ][ERROR   ][2128] Error in function _return:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/master.py", line 1540, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.7/site-packages/salt/master.py", line 1353, in _return
    self.opts, load, event=self.event, mminion=self.mminion)
  File "/usr/lib/python2.7/site-packages/salt/utils/job.py", line 67, in store_job
    event.fire_event(load, tagify([load['jid'], 'ret', load['id']], 'job'))
  File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 662, in fire_event
    self.io_loop.run_sync(lambda: self.pusher.send(msg))
  File "/usr/lib64/python2.7/site-packages/tornado/ioloop.py", line 444, in run_sync
    return future_cell[0].result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/salt/transport/ipc.py", line 389, in send
    yield self.stream.write(pack)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
StreamClosedError

We have MemTotal: 3689160 kB on this salt-master, which I feel should be enough. I'll increase the memory for now to see if that fixes anything.

Salt Version:
           Salt: 2016.3.1

Dependency Versions:
           cffi: 0.8.6
       cherrypy: 3.2.2
       dateutil: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: 0.21.0
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.7
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: 0.21.4
         Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 14.7.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: centos 7.2.1511 Core
        machine: x86_64
        release: 3.10.0-327.22.2.el7.x86_64
         system: Linux
        version: CentOS Linux 7.2.1511 Core

@deuscapturus
Contributor

I doubled the memory on the salt-master; this has not helped.

@DrMerlin

@markahopper this is our other flavor of the salt master needing a restart several times a day.

@Ch3LL
Contributor

Ch3LL commented Jun 28, 2016

@rvora thanks for the additional information. I'm setting up a test case right now and will report back the results today/tomorrow.

@Ch3LL
Contributor

Ch3LL commented Jun 28, 2016

Also @DrMerlin @deuscapturus or @rvora are all of you using salt-api when this occurs? I'm trying to narrow this down because I have not seen this in any of my environments.

This also seems like it may be correlated with pillars. @deuscapturus, is there anything unique about your pillars? Maybe git pillars, or an external pillar of some sort? Anything that would help me replicate this issue.

@cachedout
Contributor

I wonder if @skizunov has any insight here. This might be pretty tricky to track down.

Does memory usage increase at all prior to this happening or do these errors just seemingly come out of nowhere?

@rvora
Author

rvora commented Jun 28, 2016

I'm using salt-api.


@deuscapturus
Contributor

Our master configuration:

worker_threads: 15 
auto_accept: True
external_auth:
  ldap:
    manageiq:
      - '@runner'
    team1%:
      - 'G@roles:app1':
        - .*
      - 'G@roles:app2':
        - .*
      - 'G@roles:app3':
        - .*
      - 'G@roles:app4':
        - .*
      - 'G@roles:app5':
        - .*
      - 'G@roles:app6':
        - .*
      - 'G@roles:app7':
        - .*
    team2%:
      - 'G@roles:app8':
        - .*
      - 'G@roles:app9':
        - .*
      - 'G@roles:app10':
        - .*
    team3%:
      - 'G@roles:app11':
        - .*
auth.ldap.server: 10.1.1.1
auth.ldap.port: 389
auth.ldap.tls: False
auth.ldap.scope: 2
auth.ldap.no_verify: False
auth.ldap.anonymous: False
auth.ldap.accountattributename: sAMAccountName
auth.ldap.groupclass: group
auth.ldap.groupou: 'Groups'
auth.ldap.filter: sAMAccountName={{ username }}
auth.ldap.binddn: CN=saltstack,CN=Users,DC=corp,DC=company,DC=com
auth.ldap.basedn: dc=corp,dc=company,dc=com
auth.ldap.bindpw: *********
auth.ldap.activedirectory: True
rest_cherrypy:
  port: 8000
  ssl_crt: /etc/pki/tls/certs/localhost.crt
  ssl_key: /etc/pki/tls/certs/localhost.key
file_roots:
  base:
    - /srv/salt/states
fileserver_backend:
  - git
  - roots
gitfs_provider: pygit2
gitfs_base: staging
gitfs_remotes:
  - git@github.com:my-org/saltstack.git:
    - privkey: /root/.ssh/github-saltstack-id_rsa
    - pubkey: /root/.ssh/github-saltstack-id_rsa.pub
  - git@github.com:my-org/saltstack-team1.git:
    - root: states
    - base: master
    - privkey: /root/.ssh/my-org_saltstack-team1-id_rsa
    - pubkey: /root/.ssh/my-org_saltstack-team1-id_rsa.pub
  - git@github.com:my-org/saltstack-team2.git:
    - root: states
    - base: master
    - privkey: /root/.ssh/my-org_saltstack-team2-id_rsa
    - pubkey: /root/.ssh/my-org_saltstack-team2-id_rsa.pub
  - git@github.com:my-org/saltstack-team3.git:
    - privkey: /root/.ssh/github-saltstack-team3
    - pubkey: /root/.ssh/github-saltstack-team3.pub
    - root: states
    - base: master
  - git@github.com:my-org/saltstack-iis-formula.git:
    - privkey: /root/.ssh/github-saltstack-iis-formula-id_rsa
    - pubkey: /root/.ssh/github-saltstack-iis-formula-id_rsa.pub
  - git@github.com:my-org/saltstack-cassandra-formula.git:
    - privkey: /root/.ssh/github-saltstack-cassandra-id_rsa
    - pubkey: /root/.ssh/github-saltstack-cassandra-id_rsa.pub
  - git@github.com:my-org/apache-formula.git:
    - privkey: /root/.ssh/github-saltstack-id_rsa
    - pubkey: /root/.ssh/github-saltstack-id_rsa.pub
  - git@github.com:my-org/node-formula.git:
    - privkey: /root/.ssh/github-saltstack-node-formula-id_rsa
    - pubkey: /root/.ssh/github-saltstack-node-formula-id_rsa.pub
  - git@github.com:my-org/nvm-formula.git: 
    - privkey: /root/.ssh/github-saltstack-nvm-formula-id_rsa
    - pubkey: /root/.ssh/github-saltstack-nvm-formula-id_rsa.pub
pillar_roots:
  base:
    - /srv/salt/pillar
ext_pillar:
  - git:
    - staging git@github.com:my-org/saltstack-pillars.git:
      - privkey: /root/.ssh/github-saltstack-pillars-id_rsa
      - pubkey: /root/.ssh/github-saltstack-pillars-id_rsa.pub
      - env: base
  - git:
    - master git@github.com:my-org/saltstack-pillars-serverconfig.git:
      - privkey: /root/.ssh/github-saltstack-pillars-serverconfig-id_rsa
      - pubkey: /root/.ssh/github-saltstack-pillars-serverconfig-id_rsa.pub
      - env: base
  - git:
    - master git@github.com:my-org/saltstack-team1.git:
      - privkey: /root/.ssh/my-org_saltstack-team1-id_rsa
      - pubkey: /root/.ssh/my-org_saltstack-team1-id_rsa.pub
      - env: base
      - root: pillars
  - git:
    - master git@github.com:my-org/saltstack-team2.git:
      - privkey: /root/.ssh/my-org_saltstack-team2-id_rsa
      - pubkey: /root/.ssh/my-org_saltstack-team2-id_rsa.pub
      - env: base
      - root: pillars
  - git:
    - master git@github.com:my-org/saltstack-team3.git:
      - privkey: /root/.ssh/github-saltstack-team3
      - pubkey: /root/.ssh/github-saltstack-team3.pub
      - env: base
      - root: pillars
pillar_source_merging_strategy: recurse
pillar_opts: False
peer:
  WIN.*:
    - .*
log_level: warning
log_level_logfile: warning
sqs.region: us-west-2
sqs.message_format: json
sqs.key: **********************************
sqs.keyid: **********************

@deuscapturus
Contributor

@cachedout We updated to version 2016.3.0 on June 8. Our memory utilization became erratic at that point in time.

[Screenshot: salt-master memory utilization graph]

Our logs don't show 100% memory utilization. In general memory utilization is around 50%, but it could be peaking to 100% for short periods of time.

@cachedout
Contributor

@deuscapturus Good info, thanks. I'm mostly curious if this is a memory pressure problem or something else entirely.

@deuscapturus
Contributor

We also see this error:

2016-06-28 21:36:57,574 [tornado.general  ][ERROR   ][10940] Uncaught exception, closing connection.
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 433, in _handle_events
    self._handle_recv()
  File "/usr/lib64/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 465, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/lib64/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 407, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/tornado/stack_context.py", line 274, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 909, in mark_future
    future.set_result(self.serial.loads(msg[0]))
  File "/usr/lib/python2.7/site-packages/salt/payload.py", line 117, in loads
    return msgpack.loads(msg, use_list=True)
  File "msgpack/_unpacker.pyx", line 143, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:143)
ExtraData: unpack(b) received extra data.
2016-06-28 21:36:57,574 [tornado.application][ERROR   ][10940] Exception in callback None

@cachedout
Contributor

@deuscapturus Do you see that at or around the same time as the other errors, or do they appear at largely different points in the log?

@deuscapturus
Contributor

@cachedout The memory errors were higher up in the log. We actually don't see the memory errors every time we get into this bad state. The ExtraData: unpack(b) received extra data. error occurs when the master is in the bad state (we see the StreamClosedError streaming into the log) and somebody tries to issue a salt command.

@cachedout
Contributor

@deuscapturus K. That's helpful. Thanks.

@skizunov
Contributor

@cachedout: I haven't seen such an issue before myself. The first error seems to happen in salt.pillar, and I noticed that pillar_source_merging_strategy is set to recurse.

It may be an infinite (or very deep) recursion problem. Unfortunately, I am not familiar with the pillar logic, as I don't use that feature myself.
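
For readers unfamiliar with the strategy, here is a minimal sketch of what a 'recurse'-style dictionary merge does. This is an illustration only, not Salt's actual implementation: each level of nesting adds a stack frame, which is why very deep (or self-referencing) pillar structures would make the merge expensive.

# Minimal sketch of a 'recurse'-style merge -- illustration of the idea
# behind pillar_source_merging_strategy: recurse, not Salt's actual code.
def recursive_merge(dest, upd):
    for key, val in upd.items():
        if isinstance(val, dict) and isinstance(dest.get(key), dict):
            # One extra stack frame per nesting level; pathologically
            # deep pillar data makes this costly or unbounded.
            recursive_merge(dest[key], val)
        else:
            dest[key] = val
    return dest

base = {'app': {'conf': {'port': 80}}}
override = {'app': {'conf': {'port': 8080, 'debug': True}}}
print(recursive_merge(base, override))
# {'app': {'conf': {'port': 8080, 'debug': True}}}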

@Ch3LL
Contributor

Ch3LL commented Jun 29, 2016

UPDATE: I have not been able to replicate this, even after letting my master run through the night. Here is my master config:

worker_threads: 15
fileserver_backend:
  - roots
  - git
gitfs_provider: pygit2
gitfs_remotes:
  - https://github.com/Ch3LL/Ch3LLScripts.git:
    - root: salt/states
ext_pillar:
  - git:
    - master https://github.com/Ch3LL/pillar-test.git
pillar_opts: False
pillar_source_merging_strategy: recurse
#
rest_cherrypy:
  port: 8000
  disable_ssl: True
  debug: True
  host: 0.0.0.0
  webhook_url: /hook
  webhook_disable_auth: True
external_auth:
  pam:
    saltdev:
      - .*
      - '@runner'
      - '@wheel'
      - '@jobs'
    root:
      - .*
      - '@runner'
      - '@wheel'
      - '@jobs'

Are you seeing this issue when running a certain command? If anyone is willing to do a git bisect, that would be useful as well. Is there any other information that might help me replicate this issue so we can track down how to fix it?

@Ch3LL Ch3LL added cannot-reproduce cannot be replicated with info/context provided Bug broken, incorrect, or confusing behavior P2 Priority 2 severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around labels Jun 29, 2016
@Ch3LL
Contributor

Ch3LL commented Jun 30, 2016

Can anyone help me track this down to a simple test case? We will need to be able to reproduce this error in order to fix it. Any ideas?

@meggiebot meggiebot modified the milestones: C 7, Blocked Jul 5, 2016
@meggiebot meggiebot assigned cachedout and unassigned DmitryKuzmenko Jul 5, 2016
@cachedout
Contributor

This is very mysterious. Right now, I am wondering if what's happening here is that we're switching into an IOLoop that's currently waiting on a read callback to complete. (Though, what's even stranger is that I haven't yet found a place where that callback would even be set.) It looks like we're in the context manager salt.async.current_ioloop every time we hit this bug, which does seem suspect. I'm still not sure how to verify that this is the case, though. It's also a little unclear to me why it's necessary to switch the loop in the context, so I'll try to figure that bit out next...

cc: @skizunov (Any insight on the above would be greatly appreciated.)
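
For context, a minimal sketch of the ioloop-switching pattern being described. This is not Salt's salt.async.current_ioloop code, just the general shape of such a context manager built on Tornado's make_current.

import contextlib

import tornado.ioloop


@contextlib.contextmanager
def switch_current_ioloop(io_loop):
    # Sketch only: make io_loop the thread's "current" loop for the
    # duration of the block and restore the previous one afterwards.
    # If a stream already has a read pending on the old loop, starting
    # another read while the new loop is current is the kind of overlap
    # that surfaces as "Already reading".
    orig_loop = tornado.ioloop.IOLoop.current()
    io_loop.make_current()
    try:
        yield
    finally:
        orig_loop.make_current()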

@cachedout
Contributor

OK, so I think we have a better idea of what's going on here.

What this looks like is that we're pushing data onto the Tornado write queue in the EventPublisher and that memory is never being reclaimed. There are a couple of possible factors in play here:

  1. According to this documentation: http://www.tornadoweb.org/en/stable/iostream.html?highlight=iostream#tornado.iostream.BaseIOStream.write, if a write happens while another write is still in progress, the previous write's future is orphaned.

  2. Tornado iostreams have an unlimited write buffer by default. Setting max_write_buffer_size when the stream is created, especially to a small size (like 1), will immediately cause exactly the stack traces seen above. Here's a diff that should show this:

diff --git a/salt/transport/ipc.py b/salt/transport/ipc.py
index b6737c3..9b58963 100644
--- a/salt/transport/ipc.py
+++ b/salt/transport/ipc.py
@@ -509,6 +511,7 @@ class IPCMessagePublisher(object):
             stream = IOStream(
                 connection,
                 io_loop=self.io_loop,
+                max_write_buffer_size=1
             )
             self.streams.add(stream)
         except Exception as exc:

I think that what we likely want to do here is to set a max_write_buffer_size dynamically, perhaps as some percentage of memory on the machine. We should also allow it to be set statically.
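
As a rough illustration of the "percentage of memory" idea, a sketch along these lines could compute the value. The /proc/meminfo source, the 2% figure, and the fallback are assumptions for the sketch, not a settled design.

def dynamic_write_buffer_size(fraction=0.02, fallback=10 * 1024 * 1024):
    # Sketch: size max_write_buffer_size as a fraction of total RAM.
    # The 2% default, the /proc/meminfo source, and the 10 MiB fallback
    # are assumptions for illustration only.
    try:
        with open('/proc/meminfo') as meminfo:
            for line in meminfo:
                if line.startswith('MemTotal:'):
                    total_kb = int(line.split()[1])
                    return int(total_kb * 1024 * fraction)
    except (IOError, ValueError):
        pass
    return fallback

# The result would then be passed as max_write_buffer_size when the
# IOStream is created in IPCMessagePublisher, as in the diff above.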

Upon filling the buffer, we'll need to drop messages. We should log these, hopefully batching the warnings together.
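
A hedged sketch of that drop-and-log behaviour, assuming a bounded stream raises Tornado's StreamBufferFullError when the buffer is exceeded; the 60-second batching interval is an arbitrary choice for illustration.

import time
import logging

from tornado.iostream import StreamBufferFullError

log = logging.getLogger(__name__)


class DroppingPublisher(object):
    # Sketch only: wrap a bounded IOStream, drop events once its write
    # buffer is full, and report the drops in batches rather than one
    # warning per dropped message.
    def __init__(self, stream, report_interval=60):
        self.stream = stream
        self.report_interval = report_interval
        self.dropped = 0
        self.last_report = time.time()

    def publish(self, pack):
        try:
            self.stream.write(pack)
        except StreamBufferFullError:
            self.dropped += 1
        now = time.time()
        if self.dropped and now - self.last_report >= self.report_interval:
            log.warning('Dropped %d event(s): subscriber write buffer full',
                        self.dropped)
            self.dropped = 0
            self.last_report = now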

We may also wish to consider an option to allow users to select the use of ZMQ for the event bus. (Though we totally removed this a while ago.) Some users may wish to use the buffering algorithms already implemented there.

@silenius
Contributor

silenius commented Jul 8, 2016

Hello, just to add that we have the same problem here with a fresh 2016.3.1:

2016-07-08 13:12:17,995 [tornado.application                                        ][ERROR   ][2872] Exception in callback None
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/tornado/ioloop.py", line 883, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/salt/transport/zeromq.py", line 909, in mark_future
    future.set_result(self.serial.loads(msg[0]))
  File "/usr/local/lib/python2.7/site-packages/salt/payload.py", line 117, in loads
    return msgpack.loads(msg, use_list=True)
  File "msgpack/_unpacker.pyx", line 143, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:143)
ExtraData: unpack(b) received extra data.

This is with a simple salt "somehost" test.ping command.

@cachedout
Contributor

All right. I have a first pass at a fix that does seem to resolve the leak. Please see #34683.

We do need extensive real-world testing of this. However, be advised that messages will be dropped when this buffer is reached. Using this in production, especially with a high volume of events, may be risky. We need further tuning and logging before this is truly ready to go.

That said, any testing that people can do here would be incredibly helpful. :]

@cachedout cachedout added the fixed-pls-verify label and removed the cannot-reproduce and info-needed labels Jul 14, 2016
@meggiebot meggiebot modified the milestones: C 6, C 7 Jul 18, 2016
@rvora
Author

rvora commented Aug 8, 2016

This fix has not solved my problem after upgrading salt-master to 2016.3.2.

Here's some new information / correlation I have found.

I have a cronjob that runs every night at 2:00 UTC:

  • 2 * * * flock -xn /tmp/manage-down.lock -c "salt-run manage.down removekeys=True >>/var/log/salt-run-manage-down 2>&1"

I also see the following error in the /var/log/salt-run-manage-down file, though there are no timestamps in that file, so it is hard to tell exactly when this error happened.

The monitoring system also shows memory climbing steadily (over a 24-hour period) to 100%.

Exception occurred in runner manage.down: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/client/mixins.py", line 346, in low
    data['return'] = self.functions[fun](*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/salt/runners/manage.py", line 127, in down
    wheel.call_func('key.delete', match=minion)
  File "/usr/lib/python2.6/site-packages/salt/wheel/__init__.py", line 51, in call_func
    return self.low(fun, kwargs)
  File "/usr/lib/python2.6/site-packages/salt/client/mixins.py", line 255, in low
    self.mminion  # pylint: disable=W0104
  File "/usr/lib/python2.6/site-packages/salt/client/mixins.py", line 236, in mminion
    self._mminion = salt.minion.MasterMinion(self.opts, states=False, rend=False)
  File "/usr/lib/python2.6/site-packages/salt/minion.py", line 664, in __init__
    self.opts['grains'] = salt.loader.grains(opts)
  File "/usr/lib/python2.6/site-packages/salt/loader.py", line 704, in grains
    ret = fun()
  File "/usr/lib/python2.6/site-packages/salt/grains/core.py", line 1709, in hwaddr_interfaces
    ifaces = _get_interfaces()
  File "/usr/lib/python2.6/site-packages/salt/grains/core.py", line 1090, in _get_interfaces
    _INTERFACES = salt.utils.network.interfaces()
  File "/usr/lib/python2.6/site-packages/salt/utils/network.py", line 749, in interfaces
    return linux_interfaces()
  File "/usr/lib/python2.6/site-packages/salt/utils/network.py", line 627, in linux_interfaces
    stderr=subprocess.STDOUT).communicate()[0]
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1144, in _execute_child
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

@cachedout
Contributor

Did you set the ipc_write_buffer config option? It's not enabled by default. Please see the linked PR above for details.
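
For reference, a hedged example of enabling it in /etc/salt/master, assuming (per the linked PR and the follow-up question below) that the option accepts either the string 'dynamic' or an explicit byte count; the 10 MiB figure is only an example value.

# /etc/salt/master -- hedged example only; see the linked PR for the
# authoritative option semantics.
ipc_write_buffer: dynamic
# or an explicit size in bytes, e.g.:
# ipc_write_buffer: 10485760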

@rvora
Author

rvora commented Aug 9, 2016

@cachedout Should I try ipc_write_buffer: 'dynamic'? If not, how should I go about estimating a value in number of bytes?

@deuscapturus
Contributor

Following up: our salt-master has not had any issues in the past few days after upgrading to 2016.3.2.
