
[BUG] "Minion did not return. [No response]" with salt-proxy in 3005. #62694

Open
ichilton opened this issue Sep 16, 2022 · 7 comments
Labels
Bug (broken, incorrect, or confusing behavior), info-needed (waiting for more info), needs-triage, Performance, Phosphorus v3005.0 (Release code name and version), Proxy-Minion, Regression (The issue is a bug that breaks functionality known to work in previous releases)

Comments

@ichilton

Description

We have had a working Salt, salt-proxy, NAPALM network automation setup for around a year.

Suddenly, we find that any state.highstate call fails with Minion did not return. [No response].

$ sudo salt mydevice state.highstate test=true
mydevice:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20220916102020424551

If I run salt-run jobs.lookup_jid, it shows the expected output.

Other calls, like test.ping or querying grains, work fine.

Looking at our unattended-upgrades logs, it seems there was an update from 3004 to 3005 recently, so I am wondering if that is the cause.

Setup

Using Debian packages on Debian stable (bullseye).

Running on our own VMs.

The salt-proxy processes are on a separate VM, in the same VLAN.

Steps to Reproduce the behavior

Running $ sudo salt mydevice state.highstate test=true fails, as above.

Running with -l debug shows:

[DEBUG   ] get_iter_returns for jid 20220916103839235384 sent to {'mydevice'} will timeout at 11:38:44.249984
[DEBUG   ] Checking whether jid 20220916103839235384 is still running
[DEBUG   ] Closing AsyncReqChannel instance
[DEBUG   ] retcode missing from client return
[DEBUG   ] return event: {'mydevice': {'failed': True}}

Versions Report

root@network-master:~# salt --versions-report
Salt Version:
          Salt: 3005

Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.3
       libgit2: 1.1.0
      M2Crypto: 0.37.1
          Mako: Not Installed
       msgpack: 1.0.0
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.20
      pycrypto: Not Installed
  pycryptodome: 3.9.7
        pygit2: 1.4.0
        Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
  python-gnupg: Not Installed
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.10.0-18-amd64
        system: Linux
       version: Debian GNU/Linux 11 bullseye
ichilton added the Bug (broken, incorrect, or confusing behavior) and needs-triage labels on Sep 16, 2022
@OrangeDog
Contributor

As an aside, you should not allow unattended-upgrades to install new major versions of anything. Things will break.
First step is to check the release notes: https://docs.saltproject.io/en/master/topics/releases/3005.html

Does the minion eventually return, allowing you to lookup the job result later?
Can you get logs from the minion to see what's happening?
Can you identify which state(s) are causing the issue?

OrangeDog added the info-needed (waiting for more info) label on Sep 16, 2022
@ichilton
Author

Hi @OrangeDog,

Thanks for the reply.

Yes, this is the danger of unattended upgrades (I do maintain a blacklist of critical packages not to update, so it is mainly system/background packages which get updated). I need to review whether Salt should be on that list, but that is an aside, since if it is the upgrade which caused it, I would have had the same issue on a manual upgrade too.
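For example, one way to keep Salt out of unattended upgrades would be to hold the packages with apt (illustrative only, not something I have applied yet):

$ sudo apt-mark hold salt-common salt-master salt-minion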

I can't see anything obvious in the release notes that stands out as an obvious cause.

As I noted above, running salt-run jobs.lookup_jid <jid> a few seconds later does show the expected output, so the task does seem to be executing properly. It appears to be the communication between master and minion that is the problem: either the minion is not replying quickly enough, or, as the debug snippet I pasted above implies, the response may be invalid (retcode missing from client return).

With regards to logs, nothing is appended to the proxy minion logs when this issue happens. I included what looks like the relevant snippet from the master (with -l debug) above.

With regards to the state, I can say with certainty which state it is, as I only use one :)

This is my state file:

switch:
  netconfig.managed:
    - template_name: salt://{{ slspath }}/templates/switch.jinja

Then switch.jinja has includes to many other jinja templates.
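Roughly, the include structure is along these lines (the template names below are illustrative placeholders, not the real files):

{# switch.jinja - top-level template pulling in per-section templates #}
{% include "switch/templates/vlans.jinja" %}
{% include "switch/templates/interfaces.jinja" %}
{% include "switch/templates/routing.jinja" %}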

Thanks,

Ian

@ichilton
Author

An update on this.

I downgraded salt on my proxy machine to 3004.2 and it fixes the problem.

root@network-proxy:~# dpkg -l |grep salt
ii  salt-common                    3004.2+ds-1                    all          shared libraries that salt requires for all packages
ii  salt-minion                    3004.2+ds-1                    all          client package for salt, the distributed remote execution system

The salt master is still running 3005:

root@network-master:~# dpkg -l |grep salt
ii  salt-common                    3005+ds-1                      all          shared libraries that salt requires for all packages
ii  salt-master                    3005+ds-1                      all          remote manager to administer servers via salt
ii  salt-minion                    3005+ds-1                      all          client package for salt, the distributed remote execution system

Interestingly, state.highstate commands now feel like they take much longer to return than it took for 3005 to give the 'Minion did not return' error, so it could just be that a timeout has been lowered in 3005?

root@network-master:~# time salt mydevice state.highstate test=true
mydevice:

Summary for mydevice
------------
Succeeded: 1
Failed:    0
------------
Total states run:     1
Total run time:  22.714 s

real	0m25.221s
user	0m0.777s
sys	0m0.125s

@OrangeDog
Contributor

(with -l debug)

That only sets the log level of the master, not the (proxy) minion.
You need to adjust the minion config to increase the log level, I believe.
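For example, something along these lines in the proxy config (e.g. /etc/salt/proxy) should get debug output into the proxy log file - a sketch assuming default file locations, not a confirmed fix:

log_level: debug            # level for console/foreground output
log_level_logfile: debug    # level written to the log file

Alternatively, running the proxy in the foreground with salt-proxy --proxyid=mydevice -l debug would show the same detail without touching the config.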

whether Salt should be on that list

Almost certainly. Things are frequently changed, and you want to be able to test things first before you take down your whole infra. That's also why Salt provides repos for a single major version so you only get the minor updates.

@ichilton
Author

Any suggestions as to whether there is a timeout which may be adjustable, which I could increase as a workaround, instead of having to use the older version?

@OrangeDog
Contributor

I don't know which one specifically would apply, but you could start with
https://docs.saltproject.io/en/latest/ref/configuration/master.html#timeout

More useful would be getting the debug log from the minion, though if it already doesn't even have any warnings then it's probably just slowed down a bit.
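For reference, the relevant knobs in /etc/salt/master would be along these lines (values are illustrative, not recommendations from this thread):

timeout: 30              # seconds the salt CLI waits for minion returns (default 5)
gather_job_timeout: 15   # seconds to wait when polling whether a job is still running (default 10)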

OrangeDog added the Regression (The issue is a bug that breaks functionality known to work in previous releases), Performance, and Phosphorus v3005.0 (Release code name and version) labels on Sep 16, 2022
@network-shark
Contributor

Are you using the default timers? In my environment it does not work at all with the defaults. If the job cache returns the correct result, you should increase the master timeout.
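As a per-command workaround, the timeout can also be raised on the CLI with -t, e.g.:

$ sudo salt -t 60 mydevice state.highstate test=true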
