Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RuntimeError: maximum recursion depth exceeded while calling a Python object #37646

Closed
himadrisingh001 opened this issue Nov 11, 2016 · 19 comments
Assignees
Labels
Bug broken, incorrect, or confusing behavior fixed-pls-verify fix is linked, bug author to confirm fix P4 Priority 4 severity-critical top severity, seen by most users, serious issues severity-high 2nd top severity, seen by most users, causes major problems v2018.3.5 unsupported version v2019.2.1 unsupported version
Milestone

Comments

@himadrisingh001
Copy link

Description of Issue/Question

Running salt-minion on the master runs into following recursion and eats up the cpu, making master unresponsive.

Process Process-1:926:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/utils/schedule.py", line 737, in handle_func
    channel.send(load)
  File "/usr/lib/python2.7/dist-packages/salt/utils/async.py", line 74, in wrap
    ret = self._block_future(ret)
  File "/usr/lib/python2.7/dist-packages/salt/utils/async.py", line 84, in _block_future
    return future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 240, in send
    ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 208, in _crypted_transfer
    ret = yield _do_transfer()
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 192, in _do_transfer
    self._package_load(self.auth.crypticle.dumps(load)),
  File "/usr/lib/python2.7/dist-packages/salt/crypt.py", line 1187, in dumps
    return self.encrypt(self.PICKLE_PAD + self.serial.dumps(obj))
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 164, in dumps
    return msgpack.dumps(verylong_encoder(msg))
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 149, in verylong_encoder
    obj[key] = verylong_encoder(value)
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 154, in verylong_encoder
    obj[idx] = verylong_encoder(entry)
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 149, in verylong_encoder
    obj[key] = verylong_encoder(value)
  ...
  (1000 repeated lines)
  ...  
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 154, in verylong_encoder
    obj[idx] = verylong_encoder(entry)
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 149, in verylong_encoder
    obj[key] = verylong_encoder(value)
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 154, in verylong_encoder
    obj[idx] = verylong_encoder(entry)
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 151, in verylong_encoder
    elif isinstance(obj, (list, tuple)):
RuntimeError: maximum recursion depth exceeded while calling a Python object

Setup

$ cat /etc/salt/minion.d/*
log_file: file:///dev/log
log_fmt_logfile: '%(name)s %(message)s'
log_fmt_console: '%(asctime)s,%(msecs)03.0f [%(name)-10s][%(levelname)-8s] %(message)s'
master: <master>
#scaling issues
random_reauth_delay: 180

recon_default: 3000
recon_max: 177000
recon_randomize: True

keysize: 2048
hash_type: sha256
schedule:
  __mine_interval: {function: mine.update, jid_include: true, maxrunning: 2, minutes: 60,
    return_job: false}
verify_master_pubkey_sign: True
cat /etc/salt/master
user: salt
auto_accept: True
worker_threads: 64
job_cache: False
hash_type: sha256
open_mode: True
keep_jobs: 1

state_aggregate:
  - pkg

Steps to Reproduce Issue

Its randomly happening so dont have exact steps to reproduce this. But its not uncommon.

Versions Report

$ salt-call  --versions-report
Salt Version:
           Salt: 2015.8.12

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
           Mako: 0.9.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libgit2: Not Installed
        libnacl: Not Installed
   msgpack-pure: 0.1.3
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: 0.8.2
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-63-generic
         system: Ubuntu 14.04 trusty
$ salt  --versions-report
Salt Version:
           Salt: 2015.8.12

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
           Mako: 0.9.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libgit2: Not Installed
        libnacl: Not Installed
   msgpack-pure: 0.1.3
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: 0.8.2
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-63-generic
         system: Ubuntu 14.04 trusty
@Ch3LL
Copy link
Contributor

Ch3LL commented Nov 11, 2016

@himadrisingh001 just a couple questions:

  1. how often do you see this error?
  2. Did this just start occuring and if so what is anything new that you added to your environment? for example did you just upgrade to .12? add a new config?

In the meantime I have a setup with your configs and i'll wait to see if I see the same behavior.

@Ch3LL Ch3LL added the info-needed waiting for more info label Nov 11, 2016
@Ch3LL Ch3LL added this to the Blocked milestone Nov 11, 2016
@himadrisingh001
Copy link
Author

  1. how often do you see this error?

We see around 270+ times in a day. Somehow its not printing the timestamp. We dont see high cpu usage too often though since the udpate.

for d in salt-minion.log.*; do { ls -la $d; zgrep RuntimeError $d| wc -l; } done
-rw-r----- 1 root root 318097 Nov 11 00:12 salt-minion.log.1.gz
278
-rw-r----- 1 root root 330092 Nov 10 00:12 salt-minion.log.2.gz
289
-rw-r----- 1 root root 329437 Nov  9 00:12 salt-minion.log.3.gz
289
-rw-r----- 1 root root 309261 Nov  8 00:12 salt-minion.log.4.gz
269
-rw-r----- 1 root root 271442 Nov  7 00:15 salt-minion.log.5.gz
235
-rw-r----- 1 root root 330250 Nov  6 00:14 salt-minion.log.6.gz
289
-rw-r----- 1 root root 272796 Nov  5 00:14 salt-minion.log.7.gz
236
  1. Did this just start occuring and if so what is anything new that you added to your environment? for example did you just upgrade to .12? add a new config?
    It has been happening since 2015.8.11 too. Update didn't fix the issue.
    We didn't noticed since the high cpu and master becoming unresponsive.

High cpu usage might be unrelated but salt-minion was taking up the most of the cpu.

@himadrisingh001
Copy link
Author

Ran it with debug logging

17:45:21,244 [salt.state                                           ][DEBUG   ] File /var/cache/salt/minion/accumulator/139909768094608 does not exist, no need to cleanup.
17:45:21,245 [salt.transport.zeromq                                ][DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', '<salt-master>', 'tcp://<salt-master-ip>:4506', 'aes')
17:45:21,246 [salt.crypt                                           ][DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', '<salt-master>', 'tcp://<salt-master-ip>:4506')
Process Process-1:3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
...
...
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 154, in verylong_encoder
    obj[idx] = verylong_encoder(entry)
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 151, in verylong_encoder
    elif isinstance(obj, (list, tuple)):
RuntimeError: maximum recursion depth exceeded while calling a Python object

17:45:24,141 [salt.minion                                          ][DEBUG   ] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xda\x00,<salt-master>\xa6_stamp\xba2016-11-14T17:45:24.140986'
17:45:24,142 [salt.transport.zeromq                                ][DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', '<salt-master>', 'tcp://<salt-master-ip>:4506', 'aes')
17:45:24,142 [salt.crypt                                           ][DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', '<salt-master>', 'tcp://<salt-master-ip>:4506')

@Ch3LL
Copy link
Contributor

Ch3LL commented Nov 17, 2016

@himadrisingh001 still am having a hard time replicating this. But can you clarify a couple things as well:

  1. So just to clarify you didn't change anything with your salt environment? For example an upgrade of salt
  2. also to clarify you are only seeing this on the minion thats located on the master correct?

@himadrisingh001
Copy link
Author

  1. We aren't sure when this started occurring so its hard to decide the change which could have caused this. It was happening with 2015-11 versions too.
  2. This only happens on the salt-master node, rest of them are working fine.

@Ch3LL
Copy link
Contributor

Ch3LL commented Nov 19, 2016

Thanks for clarifying Do you have any scheduled jobs that are running?

@himadrisingh001
Copy link
Author

yes we do have one

schedule:
  __mine_interval: {function: mine.update, jid_include: true, maxrunning: 2, minutes: 60,
    return_job: false}

@Ch3LL
Copy link
Contributor

Ch3LL commented Nov 22, 2016

ping @rallytime or @cachedout i'm pretty stumped on what could be causing this. do either of you have any ideas on this one? I can't seem to replicate it and not sure of any further follow up questions to ensure we get @himadrisingh001 up and running.

@phsteve
Copy link
Contributor

phsteve commented Jul 6, 2017

@Ch3LL @himadrisingh001 I think I have a reproducer for this issue. I'm not seeing high CPU load, but I have a very similar RuntimeError about recursion depth exceeded in the same section of payload.py.

Try this state:

Perform a highstate:
  schedule.present:
    - function: state.highstate
    - job_kwargs:
        test: True
    - when: 1:55pm

but modify the when parameter to be a couple minutes in the future. Then do the following:

  1. Run a highstate, which should successfully schedule the job on the minion.
  2. You should see the exception occur when the minion attempts to do the scheduled highstate.
  3. Remove the - job_kwargs: test: True parameter and bump the when parameter to a few more minutes in the future.
  4. Do another highstate, which should succeed but will drop the kwargs param for the scheduled job.
  5. Watch the next scheduled highstate succeed.

I'm not sure if this is related, but if you just modify the when parameter and keep the test: True parameter, you'll see this diff when you run a highstate:

----------
          ID: Perform a highstate
    Function: schedule.present
      Result: True
     Comment: Modifying job Perform a highstate in schedule
     Started: 14:14:48.576346
    Duration: 25.37 ms
     Changes:
              ----------
              diff:
                  ---
                  +++
                  @@ -1,7 +1,7 @@
                   enabled:True
                   function:state.highstate
                   jid_include:True
                  -kwargs:{'test': True}
                  +kwargs:OrderedDict([('test', True)])
                   maxrunning:1
                   name:Perform a scheduled highstate
                  -when:2:19pm
                  +when:2:18pm

so it looks like something funny is happening with dicts vs OrderedDicts in kwargs.

Here is my version report:

# salt-call --versions-report
Salt Version:
           Salt: 2016.11.4

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.6.0
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8.1
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: 3.4.3
         pygit2: Not Installed
         Python: 2.6.6 (r266:84292, Aug 18 2016, 15:13:37)
   python-gnupg: Not Installed
         PyYAML: 3.09
          PyZMQ: 14.5.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: centos 6.4 Final
        machine: x86_64
        release: 2.6.32-358.el6.x86_64
         system: Linux
        version: CentOS 6.4 Final

Let me know if I can provide any more info!

@stale
Copy link

stale bot commented Nov 14, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label Nov 14, 2018
@stale stale bot closed this as completed Nov 22, 2018
@thusoy
Copy link
Contributor

thusoy commented Jan 21, 2019

I'm able to see this consistently on all versions of salt from 2016.11 to latest RC if the message to be serialized contains a circular reference. Other encoders like json.dumps are able to detect this and raise an exception that accurately describes the problem, msgpack crashes with the error described above.

@cachedout
Copy link
Contributor

@Ch3LL or @sagetherage this needs to be re-opened, please.

@sagetherage sagetherage reopened this Jan 22, 2019
@stale
Copy link

stale bot commented Jan 22, 2019

Thank you for updating this issue. It is no longer marked as stale.

@stale stale bot removed the stale label Jan 22, 2019
@Ch3LL Ch3LL added severity-high 2nd top severity, seen by most users, causes major problems and removed info-needed waiting for more info labels Jan 22, 2019
@Ch3LL Ch3LL modified the milestones: Blocked, Approved Jan 22, 2019
@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior P4 Priority 4 labels Jan 22, 2019
@Ch3LL Ch3LL added severity-critical top severity, seen by most users, serious issues v2018.3.5 unsupported version v2019.2.1 unsupported version labels Jan 22, 2019
@waynew
Copy link
Contributor

waynew commented Jan 24, 2019

@thusoy do you have some specific states that can reproduce this? I'm trying with this:

Perform a highstate:
  schedule.present:
    - function: state.highstate
    - job_kwargs:
        test: True
    - when: 1:55pm

I'm not getting the max recursion depth, but I am seeing

minion_1  | [ERROR   ] Exception Parser must be a string or character stream, not int occurred in scheduled job

pretty repeatedly :suspect:

@thusoy
Copy link
Contributor

thusoy commented Jan 24, 2019

Try to run a state like this:

#!py

def run():
    array = []
    array.append(array)
    return {
        'test-circular-reference': {
            'file.managed': [
                {'name': '/tmp/foo'},
                {'source': 'salt://foo.txt'},
                {'template': 'jinja'},
                {'context': {
                    'array': array,
                }},
            ]
        }
    }

@waynew
Copy link
Contributor

waynew commented Jan 24, 2019

@thusoy yep, that'll do it. I'll see what I can figure out here 👍

@waynew waynew added the fixed-pls-verify fix is linked, bug author to confirm fix label Jan 25, 2019
@KChandrashekhar KChandrashekhar modified the milestones: March 29th, Priority Apr 5, 2019
@waynew waynew closed this as completed Jul 2, 2019
@thusoy
Copy link
Contributor

thusoy commented Jul 19, 2019

@waynew The fix from #51322 seems weird to me, I don't think anyone would want the value to be silently converted to a string, an explicit exception raised describing the recursion is probably preferable. The other thing I noticed while reading the fix is that the recursion detection seems flawed, any repeated object in the value would be detected as a recursion. Ie

repeated = 'repeated'
data = [repeated, repeated]

Would also be detected as recursion if the verylong_encoder is used (which I couldn't easily reproduce in a test, but I'm sure you'll be able to figure it out).

@thusoy
Copy link
Contributor

thusoy commented Jul 19, 2019

Note that this isn't likely to happen while using yaml states, but more complicated states written in python could easily trip on this.

@waynew
Copy link
Contributor

waynew commented Jul 25, 2019

I'm not sure that there's an reason necessarily to prohibit recursion - and we're never actually losing information, unless something is depending on a self-referential cycle to be present.

It's possible that a recursive structure indicates a deeper failure, but that's not clear to me right now.

You are right that we're not handling the case where we have multiple of the same immutable object in a list. It may cause a problem due to an implementation detail, like this:

a = 'a'
aa = 'a'
data = [a, aa, 123, 123]
print(verylong_encoder(data, set()))
['a', '<Recursion on str with id=4551544312>', 123, '<Recursion on int with id=4549087184>']

I'll go ahead and put in a PR to fix that, but without some evidence to the contrary, I'm going to leave the recursive str-ifying in there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior fixed-pls-verify fix is linked, bug author to confirm fix P4 Priority 4 severity-critical top severity, seen by most users, serious issues severity-high 2nd top severity, seen by most users, causes major problems v2018.3.5 unsupported version v2019.2.1 unsupported version
Projects
None yet
Development

No branches or pull requests

8 participants