Minion did not return. [No response] appears occasionally, but once it happens, the minion never returns #56467

Open
marilyn6483 opened this issue Mar 28, 2020 · 31 comments
Labels: Bug (broken, incorrect, or confusing behavior), info-needed (waiting for more info)

Comments

@marilyn6483

marilyn6483 commented Mar 28, 2020

Description of Issue

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)

  • The minion and master config files are as follows; sensitive info is replaced by ***. Because more than one minion has been set up, we specify our config file path as /etc/salt2019.

cat minion | grep -v "^#" | grep -v "^$"

master: ***
user: root
pidfile: /var/run/salt2019-minion.pid
conf_file: /etc/salt2019/minion
pki_dir: /etc/salt2019/pki/minion
id: ***
cachedir: /var/cache/salt2019/minion
sock_dir: /var/run/salt2019/minion
file_roots: /srv/salt2019
log_file: /var/log/salt2019/minion
key_logfile: /var/log/salt2019/key
log_level: debug
tcp_keepalive: True
  • master config file
    cat master | grep -v "^#" | grep -v "^$"
user: ***
publisher_acl:
  user***:
    - test.*
    - state.*
    - cmd.*

Steps to Reproduce Issue

(Include debug logs if possible and relevant.)

The following is the debug log; the master and minion IPs are replaced by 1.2.3.4 and 2.2.2.2. We are executing an SLS file. The salt minion runs very well for a couple of days before it suddenly stops returning data to the master. Looking at the source code, I found that the minion just stops at “Connecting the Minion to the Master URI (for the return server)” and no further debug log info is printed.

  • Here is the log when minion does not return
 2020-03-20 14:50:40,465 [salt.minion                                                      :1465][INFO    ][1397] User saltapi Executing command state.sls with jid 20200320145808931182
2020-03-20 14:50:40,493 [salt.minion                                                      :1472][DEBUG   ][1397] Command details {'fun': 'state.sls', 'ret': '', 'tgt': '1.2.3.4', 'user': 'saltapi', 'tgt_type': 'glob', 'jid': '20200320145808931182', 'arg': ['------sensitive info is replaced ----------', 'pillar={"tasktype":"ent-serial-collect"}', {'queue': True, '__kwarg__': True}]}
2020-03-20 14:50:40,526 [salt.minion                                                      :1605][INFO    ][17093] Starting a new job 20200320145808931182 with PID 17093
2020-03-20 14:50:40,550 [salt.utils.lazy                                                  :107 ][DEBUG   ][17093] Could not LazyLoad {0}.allow_missing_func: '{0}.allow_missing_func' is not available.
2020-03-20 14:50:40,570 [salt.utils.lazy                                                  :104 ][DEBUG   ][17093] LazyLoaded state.sls
2020-03-20 14:50:40,582 [salt.utils.lazy                                                  :104 ][DEBUG   ][17093] LazyLoaded saltutil.is_running
2020-03-20 14:50:40,587 [salt.utils.lazy                                                  :104 ][DEBUG   ][17093] LazyLoaded grains.get
2020-03-20 14:50:40,589 [salt.loader.2.2.2.2.int.module.config                       :398 ][DEBUG   ][17093] key: test, ret: _|-
2020-03-20 14:50:40,611 [salt.transport.zeromq                                            :132 ][DEBUG   ][17093] Initializing new AsyncZeroMQReqChannel for ('/etc/salt2019/pki/minion', '1.2.3.4', 'tcp://2.2.2.2:4506', 'aes')
2020-03-20 14:50:40,611 [salt.crypt                                                       :463 ][DEBUG   ][17093] Initializing new AsyncAuth for ('/etc/salt2019/pki/minion', '1.2.3.4', 'tcp://2.2.2.2:4506')
2020-03-20 14:50:40,613 [salt.transport.zeromq                                            :203 ][DEBUG   ][17093] Connecting the Minion to the Master URI (for the return server): tcp://2.2.2.2:4506
2020-03-20 14:50:41,995 [salt.minion                                                      :1465][INFO    ][1397] User saltapi Executing command state.sls with jid 20200320145810461355
2020-03-20 14:50:41,995 [salt.minion                                                      :1472][DEBUG   ][1397] Command details {'fun': 'state.sls', 'ret': '', 'tgt': '1.2.3.4', 'user': 'saltapi', 'tgt_type': 'glob', 'jid': '20200320145810461355', 'arg': ['------sensitive info is replaced ----------', 'pillar={"tasktype":"ent-serial-collect"}', {'queue': True, '__kwarg__': True}]}
2020-03-20 14:50:42,006 [salt.minion                                                      :1605][INFO    ][17213] Starting a new job 20200320145810461355 with PID 17213
2020-03-20 14:50:42,010 [salt.utils.lazy                                                  :107 ][DEBUG   ][17213] Could not LazyLoad {0}.allow_missing_func: '{0}.allow_missing_func' is not available.
2020-03-20 14:50:42,014 [salt.utils.lazy                                                  :104 ][DEBUG   ][17213] LazyLoaded state.sls
2020-03-20 14:50:42,018 [salt.utils.lazy                                                  :104 ][DEBUG   ][17213] LazyLoaded saltutil.is_running
2020-03-20 14:50:42,020 [salt.utils.lazy                                                  :104 ][DEBUG   ][17213] LazyLoaded grains.get
2020-03-20 14:50:42,022 [salt.loader.2.2.2.2.int.module.config                       :398 ][DEBUG   ][17213] key: test, ret: _|-
2020-03-20 14:50:42,038 [salt.transport.zeromq                                            :132 ][DEBUG   ][17213] Initializing new AsyncZeroMQReqChannel for ('/etc/salt2019/pki/minion', '1.2.3.4', 'tcp://2.2.2.2:4506', 'aes')
2020-03-20 14:50:42,038 [salt.crypt                                                       :463 ][DEBUG   ][17213] Initializing new AsyncAuth for ('/etc/salt2019/pki/minion', '1.2.3.4', 'tcp://2.2.2.2:4506')
2020-03-20 14:50:42,040 [salt.transport.zeromq                                            :203 ][DEBUG   ][17213] Connecting the Minion to the Master URI (for the return server): tcp://2.2.2.2:4506
2020-03-20 15:01:15,337 [salt.utils.schedule                                              :1627][DEBUG   ][1397] schedule: Job __mine_interval was scheduled with jid_include, adding to cache (jid_include defaults to True)
2020-03-20 15:01:15,337 [salt.utils.schedule                                              :1630][DEBUG   ][1397] schedule: Job __mine_interval was scheduled with a max number of 2
2020-03-20 15:01:15,337 [salt.utils.schedule                                              :1647][INFO    ][1397] Running scheduled job: __mine_interval
2020-03-20 15:01:15,458 [salt.utils.schedule                                              :689 ][DEBUG   ][10760] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'pid': 10760, 'id': '1.2.3.4', 'jid': '20200320150115452799', 'schedule': '__mine_interval', 'fun_args': []}
2020-03-20 15:01:15,461 [salt.utils.lazy                                                  :104 ][DEBUG   ][10760] LazyLoaded mine.update
2020-03-20 15:01:15,463 [salt.utils.lazy                                                  :104 ][DEBUG   ][10760] LazyLoaded config.merge
2020-03-20 15:01:15,463 [salt.utils.schedule                                              :836 ][DEBUG   ][10760] schedule.handle_func: Removing /var/cache/salt2019/minion/proc/20200320150115452799

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

  • master version

[root@localhost chenyanyan]# salt-master -V

Salt Version:
           Salt: 2018.3.4

Dependency Versions:
           cffi: 1.6.0
       cherrypy: 5.6.0
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: 0.8.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Aug 27 2018, 16:21:36)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist:
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-693.21.1.el7.x86_64
         system: Linux
        version: Not Installed
  • minion version

linux-jc57:/var/log # /root/miniconda3/bin/salt-minion -V

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.1
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.5

System Versions:
           dist: SuSE 11 x86_64
         locale: UTF-8
        machine: x86_64
        release: 3.0.76-0.11-default
         system: Linux
        version: SUSE Linux Enterprise Server  11 x86_64
@xeacott
Contributor

xeacott commented Mar 30, 2020

How many minions do you have connected to your master? And am I reading this correctly: do some minions stop connecting to the master after a few days, or do all of them stop?

@xeacott
Contributor

xeacott commented Mar 30, 2020

Also, I noticed your master is on 2018.3.4, which is a version we are no longer officially supporting. Is there any way you'd be able to try this setup in another environment with your master on 3000 or also on 2019.2?

@xeacott xeacott added this to the Blocked milestone Mar 30, 2020
@xeacott xeacott added the info-needed waiting for more info label Mar 30, 2020
@marilyn6483
Author

marilyn6483 commented Mar 30, 2020 via email

@xeacott
Contributor

xeacott commented Mar 30, 2020

Are you able to run basic commands such as ip ping <address> and also check the ports on the suse minions? It could be a configuration issue on the suse minion end since it seems your other in-house minions are running fine without any hiccups!
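For the port check, a minimal sketch of what could be run from the SUSE minion (an assumption-laden example: it uses the default Salt ports 4505 for publish and 4506 for returns, and a placeholder master address) might be:

import socket

MASTER = "2.2.2.2"  # placeholder; substitute the real master address

# Assumption: the master listens on the default Salt ports
# (4505 = publish channel, 4506 = request/return channel).
for port in (4505, 4506):
    try:
        s = socket.create_connection((MASTER, port), timeout=5)
        s.close()
        print("TCP connect to %s:%d OK" % (MASTER, port))
    except OSError as exc:
        print("TCP connect to %s:%d FAILED: %s" % (MASTER, port, exc))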

@marilyn6483
Author

marilyn6483 commented Mar 31, 2020 via email

@marilyn6483
Author

Are you able to run basic commands such as ip ping <address> and also check the ports on the suse minions? It could be a configuration issue on the suse minion end since it seems your other in-house minions are running fine without any hiccups!

I found some system logs:
Mar 20 14:50:40 linux kernel: [89579.613557] python[17093]: segfault at 5f436 ip 000000000005f436 sp 00007fff29e20658 error 14 in python3.5[400000+1000]
Mar 20 14:50:42 linux kernel: [89581.040805] python[17213]: segfault at 5f436 ip 000000000005f436 sp 00007fff29e20658 error 14 in python3.5[400000+1000]

At the time 'Mar 20 14:50:40', the minion's log is:
2020-03-20 14:50:40,613 [salt.transport.zeromq :203 ][DEBUG ][17093] Connecting the Minion to the Master URI (for the return server): tcp://xxxxx

Is there any connection between the two?

@SaravanakumarSivasankaran

I am facing the same issue in Salt 2019.2.5, where the salt master and minion run inside a CentOS 7 container. Roughly the first 24 test.ping calls work; after that, no matter how many times I try, I keep getting "Minion did not return. [No response]".

If I restart the salt-minion, the next ~24 test.ping calls work and then the problem occurs again.

@marilyn6483
Author

marilyn6483 commented Jul 14, 2020 via email

@SaravanakumarSivasankaran

Python 2.7.5

I found the observations below in the salt-minion trace logs for a test.ping that got no response:

  • Looks like the salt-minion responded to that test.ping:
    [DEBUG ] minion return: {u'fun_args': [], u'jid': u'20200715071230671893', u'return': True, u'retcode': 0, u'success': True, u'fun': u'test.ping'}

  • Looks like the salt-master did not receive the above response, so it asks the salt-minion about that jid again:
    [DEBUG ] Command details {u'tgt_type': u'list', u'jid': u'20200715071235735268', u'tgt': [u' node-name'], u'ret': u'', u'user': u'root', u'arg': [u'20200715071230671893'], u'fun': u'saltutil.find_job'}

  • Since the salt-minion had already returned that jid, it responds to the find_job with an empty result (a way to check what the master's job cache actually recorded for the original jid is sketched after this list):
    [DEBUG ] minion return: {u'fun_args': [u'20200715071230671893'], u'jid': u'20200715071235735268', u'return': {}, u'retcode': 0, u'success': True, u'fun': u'saltutil.find_job'}
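One way to narrow down where the return gets lost is to ask the master's job cache directly for the original jid. A rough sketch, assuming Python 3 and the salt-run CLI on the master, with the jid taken from the example above as a placeholder:

import json
import subprocess

# Placeholder jid; substitute the jid of the call that reported "[No response]".
JID = "20200715071230671893"

# jobs.lookup_jid asks the master's job cache which returns were recorded for that jid.
out = subprocess.check_output(["salt-run", "jobs.lookup_jid", JID, "--out=json"])
data = json.loads(out.decode() or "{}")

# Empty  -> the master never recorded the minion's return for that jid.
# Filled -> the return did arrive, but only after the CLI had stopped waiting.
print("return recorded on master:", bool(data))
print(json.dumps(data, indent=2))

If the return shows up in the job cache, the problem is on the waiting/collection side; if it never appears, the master really did not receive (or did not process) the minion's return.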

@edhgoose

I have the same problem in Salt 3000.3 (master and minions) too, and it's been going on for quite some time across prior versions as well. I've struggled to find any ideas on how to identify it or how to even write a bug report for it.

We're running Python 2.7 too, on AWS Linux.

What I can determine is that there appear to be two cases:

  1. The minion responds to an event, and the master accepts that the event was returned.
  2. The minion responds (or at least says it responds), but the master never receives the event.

You can see it play out by using salt-run state.event pretty=True.

In the happy case, the following happens:

20200716191834527792    {
    "_stamp": "2020-07-16T19:18:34.528083",
    "minions": [
        "the-minion,"
    ]
}
salt/job/20200716191834527792/new   {
    "_stamp": "2020-07-16T19:18:34.528555",
    "arg": [],
    "fun": "test.ping",
    "jid": "20200716191834527792",
    "minions": [
        "the-minion,"
    ],
    "missing": [],
    "tgt": "the-minion,",
    "tgt_type": "glob",
    "user": "root"
}
salt/job/20200716191834527792/ret/the-minion, {
    "_stamp": "2020-07-16T19:18:34.642001",
    "cmd": "_return",
    "fun": "test.ping",
    "fun_args": [],
    "id": "the-minion,",
    "jid": "20200716191834527792",
    "retcode": 0,
    "return": true,
    "success": true
}

The job is created, the return is received and all is right with the world.

In the unhappy case:

20200716191533486948    {
    "_stamp": "2020-07-16T19:15:33.487146",
    "minions": [
        "the-minion,"
    ]
}
salt/job/20200716191533486948/new   {
    "_stamp": "2020-07-16T19:15:33.487613",
    "arg": [],
    "fun": "test.ping",
    "jid": "20200716191533486948",
    "minions": [
        "the-minion,"
    ],
    "missing": [],
    "tgt": "the-minion,",
    "tgt_type": "glob",
    "user": "root"
}
20200716191538601450    {
    "_stamp": "2020-07-16T19:15:38.601653",
    "minions": [
        "the-minion,"
    ]
}
salt/job/20200716191538601450/new   {
    "_stamp": "2020-07-16T19:15:38.602129",
    "arg": [
        "20200716191533486948"
    ],
    "fun": "saltutil.find_job",
    "jid": "20200716191538601450",
    "minions": [
        "the-minion,"
    ],
    "missing": [],
    "tgt": [
        "the-minion,"
    ],
    "tgt_type": "list",
    "user": "root"
}
salt/job/20200716191538601450/ret/the-minion, {
    "_stamp": "2020-07-16T19:15:38.711512",
    "cmd": "_return",
    "fun": "saltutil.find_job",
    "fun_args": [
        "20200716191533486948"
    ],
    "id": "the-minion,",
    "jid": "20200716191538601450",
    "retcode": 0,
    "return": {},
    "success": true
}

It's broadly the same, but you can see a find_job instance is created as @SaravanakumarSivasankaran suggests.

The minion logs similarly in both cases:

In the happy case:

2020-07-16 19:18:34,536 [salt.minion                                           :1491][INFO    ][8630] User root Executing command test.ping with jid 20200716191834527792
2020-07-16 19:18:34,619 [salt.minion                                           :1621][INFO    ][5442] Starting a new job 20200716191834527792 with PID 5442
2020-07-16 19:18:34,637 [salt.minion                                           :1949][INFO    ][5442] Returning information for job: 20200716191834527792

In the unhappy case, there is the additional find_job:

2020-07-16 19:15:33,494 [salt.minion                                           :1491][INFO    ][8630] User root Executing command test.ping with jid 20200716191533486948
2020-07-16 19:15:33,595 [salt.minion                                           :1621][INFO    ][5192] Starting a new job 20200716191533486948 with PID 5192
2020-07-16 19:15:33,616 [salt.minion                                           :1949][INFO    ][5192] Returning information for job: 20200716191533486948
2020-07-16 19:15:38,610 [salt.minion                                           :1491][INFO    ][8630] User root Executing command saltutil.find_job with jid 20200716191538601450
2020-07-16 19:15:38,701 [salt.minion                                           :1621][INFO    ][5195] Starting a new job 20200716191538601450 with PID 5195
2020-07-16 19:15:38,706 [salt.minion                                           :1949][INFO    ][5195] Returning information for job: 20200716191538601450

Note there are two returns - one for the first test.ping and one for the following find_job.

Some print debugging shows that the minion is, as best I can tell, returning sensible messages. I modified the minion function here:

def _do_transfer():
    # Yield control to the caller. When send() completes, resume by populating data with the Future.result
    data = yield self.message_client.send(
        self._package_load(self.auth.crypticle.dumps(load)),
        timeout=timeout,
        tries=tries,
    )

to:

        def _do_transfer():
            log.debug("_do_transfer")
            # note: this also needs `import base64` and `import pickle` at the top of the module
            log.debug(base64.b64encode(pickle.dumps(load)))
            # Yield control to the caller. When send() completes, resume by populating data with the Future.result
            data = yield self.message_client.send(
                self._package_load(self.auth.crypticle.dumps(load)),
                timeout=timeout,
                tries=tries,
            )

And used this to dump the output in JSON:

print(json.dumps(pickle.loads(base64.b64decode("...."))))
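For reference, a self-contained variant of that decode step (a sketch only: standard library, a placeholder blob, and assuming it is decoded with the same Python major version that produced the pickle):

import base64
import pickle
import sys
from pprint import pprint

# Placeholder: paste the base64 string captured by the log.debug() call above,
# or pass it as the first command-line argument.
blob = sys.argv[1] if len(sys.argv) > 1 else "...."

load = pickle.loads(base64.b64decode(blob))
pprint(load)  # the payload the minion handed to message_client.send()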

Here are the successful and unsuccessful cases from two runs.

Successful case:

 {"fun_args": ["sleep 10"], "jid": "20200716224848367462", "return": "", "retcode": 0, "success": true, "cmd": "_return", "fun": "cmd.run", "id": "the-minion"}

Unsuccessful case:

{"fun_args": ["sleep 10"], "jid": "20200716224718720155", "return": "", "retcode": 0, "success": true, "cmd": "_return", "fun": "cmd.run", "id": "ip-172-20-5-15.eu-west-1.compute.internal"}
{"fun_args": ["20200716224718720155"], "jid": "20200716224723832936", "return": {}, "retcode": 0, "success": true, "cmd": "_return", "fun": "saltutil.find_job", "id": "the-minion"}

Nothing to suggest that the minions are behaving differently.

tcpdump shows that messages are going from the minion to the master - I can see no obvious differences between the two cases.

From running with -l trace and some print debugging, it looks to me like the minions are doing exactly what they're supposed to, and the problem lies with the master not picking up the response from the minion, or at least not doing anything with it.

I find that restarting the master almost always resolves the problem, which makes tracking it down really hard. As far as I can tell, my best guess is that something about the socket receiving logic means that the messages aren't received correctly. I found that if the following code: https://github.com/saltstack/salt/blob/master/salt/ext/tornado/iostream.py#L1051-L1054
is modified to:

    def read_from_fd(self):
        try:
            chunk = self.socket.recv(self.read_chunk_size)
            print(chunk) # print the chunk out when received
        except socket.error as e:

would print the chunk message from the minion immediately when sending test.ping while things were working, but would never print those messages when they weren't. But I'm unfamiliar with how this socket is created, what it's used for, and whether this is indicative of an error or not. I'm running out of ideas on how to track this down; my socket, Python, Tornado and IPC knowledge is very limited, so I'm struggling to determine where to even look.

I don't think there's anything unusual about our environment: we have around 50-70 nodes. Most are static, with ~15 auto-scaling; the rest don't change. It's all in AWS within peered VPCs, and as above, according to tcpdump the traffic seems to be traversing the network perfectly fine. Sometimes it works, sometimes it doesn't, and it's proving impossible to tell why.

Any thoughts or recommendations on how best to proceed would be utterly and completely magnificent.

@sagetherage sagetherage modified the milestones: Blocked, Follow up Jul 29, 2020
@sagetherage sagetherage self-assigned this Jul 29, 2020
@sagetherage
Contributor

@edhgoose I apologize it took me so long to follow up here. I'm wondering if we should schedule a session with a Core Team member to hammer on this. I can help set something up; is there a time that works best for you?

@edhgoose

edhgoose commented Aug 5, 2020

@sagetherage, that'd be great.

If you want to take a look at https://calend.ly/edhgoose that's probably a good start? Generally evenings (UK time) are pretty good too.

@edhgoose

@sagetherage - hey, I'd really love to get that help. This issue is driving me mad, sorry to chase. Can we try and set up that session with a team member?

@sagetherage sagetherage added Bug broken, incorrect, or confusing behavior needs-triage labels Aug 12, 2020
@sagetherage sagetherage removed this from the Follow up milestone Aug 12, 2020
@sagetherage
Contributor

@sagetherage - hey, I'd really love to get that help. This issue is driving me mad, sorry to chase. Can we try and set up that session with a team member?

yes! I have assigned @DmitryKuzmenko who can help here.

@sagetherage sagetherage added this to the Blocked milestone Aug 17, 2020
@sagetherage sagetherage removed this from the Blocked milestone Aug 26, 2020
@sagetherage sagetherage removed the info-needed waiting for more info label Aug 26, 2020
@sagetherage
Contributor

@DmitryKuzmenko have you been able to get with @edhgoose on this issue?

@edhgoose

@sagetherage @DmitryKuzmenko and I had a conversation about turning on certain debugging, but I'm on holiday at the moment and haven't seen the issue for a little while. Unfortunately it seems extremely intermittent, but if it comes up again I'll be sure to confirm what the logs show.

@sagetherage
Contributor

@edhgoose no worries! I was checking in because I didn't want things to be left undone, thank you for the extra effort and have a nice holiday.

@sagetherage sagetherage added the info-needed waiting for more info label Aug 27, 2020
@sagetherage sagetherage added this to the Blocked milestone Aug 27, 2020
@JustOneMoreBlock

JustOneMoreBlock commented Aug 31, 2020

Unfortunately, I'm getting this same problem, and I manage thousands of servers.

[root@master]# salt-master -V
Salt Version:
           Salt: 3001.1
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.1
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Apr  2 2020, 13:34:55)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 17.0.0
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.4
 
System Versions:
           dist: centos 7 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1127.19.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7 Core
[root@server salt]# salt-minion -V
Salt Version:
           Salt: 3001.1
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.1
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Apr  2 2020, 13:34:55)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 17.0.0
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.4
 
System Versions:
           dist: centos 7 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1127.19.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7 Core

Like @edhgoose mentioned, a restart usually fixes it; however, it's been happening more and more lately. The company I work for adds new dedicated servers weekly, and the issue persists even on freshly provisioned, fully updated servers.

We use the Python 3 builds of Salt.

@DmitryKuzmenko
Contributor

@marilyn6483 is there any chance you could provide the corresponding minion debug log?

@marilyn6483
Author

marilyn6483 commented Sep 1, 2020 via email

@JustOneMoreBlock

Well, I'm running Python 3.6 and the issue persists.

I do run a few cmd.script states that execute shell scripts; however, it's never been this bad.

@JustOneMoreBlock

Just a follow-up as it's been 10 days without a reply.

  • I'm running CentOS 7 x64.
  • Python 3.6
  • Salt is running on python3.6.

I checked this bug and ran:

strace python3.6 -c 'import select; select.epoll().poll(timeout=-1)' &
stat("/usr/lib64/python3.6", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/usr/lib64/python3.6/lib-dynload", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/lib/python3.6/site-packages", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/lib64/python3.6/site-packages", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/lib/python3.6/site-packages", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
brk(NULL)                               = 0x1184000
brk(0x11a5000)                          = 0x11a5000
getcwd("/root", 1024)                   = 6
stat("/root", {st_mode=S_IFDIR|0550, st_size=246, ...}) = 0
stat("/root", {st_mode=S_IFDIR|0550, st_size=246, ...}) = 0
stat("/root", {st_mode=S_IFDIR|0550, st_size=246, ...}) = 0
openat(AT_FDCWD, "/root", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
getdents(3, /* 17 entries */, 32768)    = 528
getdents(3, /* 0 entries */, 32768)     = 0
close(3)                                = 0
stat("/usr/lib64/python3.6", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/usr/lib64/python3.6/lib-dynload", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/lib64/python3.6/lib-dynload/select.cpython-36m-x86_64-linux-gnu.so", {st_mode=S_IFREG|0755, st_size=33536, ...}) = 0
futex(0x7f3acc3a80d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/usr/lib64/python3.6/lib-dynload/select.cpython-36m-x86_64-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000 \0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=33536, ...}) = 0
mmap(NULL, 2127544, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f3ac5187000
mprotect(0x7f3ac518d000, 2093056, PROT_NONE) = 0
mmap(0x7f3ac538c000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x7f3ac538c000
close(3)                                = 0
mprotect(0x7f3ac538c000, 4096, PROT_READ) = 0
epoll_create1(EPOLL_CLOEXEC)            = 3
epoll_wait(3,

Just waits...

You mentioned nested shell scripts. We do something like this:

install_backups:
  cmd.script:
    - name: install_backups.sh
    - source: salt://backups/files/install_backups.sh
  • This happens on new servers too with all the updates applied.

@DmitryKuzmenko
Contributor

@JustOneMoreBlock I've asked before and I'm asking again: could you please provide a debug log from a hung minion?

@ahan-ai

ahan-ai commented Oct 24, 2020

I got the same problem. My OS is CentOS, the Python version is 3.6.8, and the salt-master version is 3001.1.

@sagetherage sagetherage added needs-triage and removed info-needed waiting for more info labels Oct 26, 2020
@sagetherage sagetherage removed this from the Blocked milestone Oct 26, 2020
@garethgreenaway
Contributor

It seems like we might have a few different issues all in the same ticket here:

@marilyn6483 Just to confirm, was the original issue you posted related to Python 3.5, and has it been resolved by an upgrade?

@JustOneMoreBlock Following on from the comments from @DmitryKuzmenko, can you provide a minion log with information from when the issue occurs?

@edhgoose Is the issue you're seeing still persisting? Does it appear to be a timeout issue between master and minion? Have you tried increasing the timeout values? Are you running Salt in the same AWS region, or between regions? From outside AWS into AWS?

Thanks!

@garethgreenaway garethgreenaway added this to the Blocked milestone Nov 3, 2020
@garethgreenaway garethgreenaway added info-needed waiting for more info and removed needs-triage labels Nov 3, 2020
@ahan-ai

ahan-ai commented Jan 24, 2021

I think the issue still exists.

@voidmain

I believe I'm seeing this same behavior. Though being new to Salt, I'm not sure if it is exactly the same.

The issue I'm seeing is that some of our Minions will eventually stop responding to the Master completely. Logging into the Minion, the salt-minion process is running and there aren't any errors in the logs. The Master is still running and we can confirm that some of the Minions are in this state using a test.ping to *.

We are running everything via systemd on Ubuntu minimal and I'd be happy to help debug this.

To start, here's my salt-minion -V for a Minion that was in this failure state:

Salt Version:
           Salt: 3001.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.7.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: 3.6.1
         pygit2: Not Installed
         Python: 3.8.5 (default, May 27 2021, 13:30:53)
   python-gnupg: 0.4.5
         PyYAML: 5.3.1
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: ubuntu 20.04 focal
         locale: utf-8
        machine: x86_64
        release: 5.4.0-1024-aws
         system: Linux
        version: Ubuntu 20.04 focal

Here's my salt-master -V:

Salt Version:
          Salt: 3003

Dependency Versions:
          cffi: Not Installed
      cherrypy: 8.9.1
      dateutil: 2.7.3
     docker-py: Not Installed
         gitdb: 2.0.6
     gitpython: 3.0.7
        Jinja2: 2.10.1
       libgit2: 0.28.3
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: 3.6.1
        pygit2: 1.0.3
        Python: 3.8.5 (default, May 27 2021, 13:30:53)
  python-gnupg: 0.4.5
        PyYAML: 5.3.1
         PyZMQ: 18.1.1
         smmap: 2.0.5
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.2

System Versions:
          dist: ubuntu 20.04 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-1047-aws
        system: Linux
       version: Ubuntu 20.04 focal

We have other Minions that have gone into this failure state and they are running version 3003, so that would imply it isn't a version problem.

I believe that if I update all of my Minions so that the systemd service file includes the -l debug flag, it should spit out a bunch of information to the system journal and the log file. Then we can leverage this debug information to determine why the Minions are entering this failure state.

Let me know if I should do anything else to assist in tracking this down and hopefully getting it fixed.

@sagetherage
Contributor

@marilyn6483
I would still update any minions on v3001.1 (or masters) to at least v3001.7 since there have been several changes between versions including security fixes.

@iprashantjain

iprashantjain commented Feb 17, 2022

Hello,

We faced the same issue with our salt setup.
salt-minion

root@hdfclife-mkt-prod3-1:~# salt-minion -V
Salt Version:
           Salt: 3000.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.3 (default, Apr  5 2021, 09:00:41)
   python-gnupg: 0.3.9
         PyYAML: 3.12
          PyZMQ: 19.0.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: debian 9.12
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 4.9.0-14-amd64
         system: Linux
        version: debian 9.12

salt-master

root@salt-syndic-1:~# salt-master -V
/usr/lib/python2.7/dist-packages/salt/scripts.py:109: DeprecationWarning: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date.  Salt will drop support for Python 2.7 in the Sodium release or later.
Salt Version:
           Salt: 3000.5

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.5.0
       dateutil: 2.2
      docker-py: Not Installed
          gitdb: 2.0.6
      gitpython: 2.1.15
         Jinja2: 2.9.4
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.9 (default, Sep 14 2019, 20:00:08)
   python-gnupg: 0.3.6
         PyYAML: 3.11
          PyZMQ: 14.4.0
          smmap: 3.0.4
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.11
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 3.16.0-4-amd64
         system: Linux
        version: debian 8.11

Can this issue be looked at?

@gregorg

gregorg commented Jan 29, 2024

Same here, anything new on this?

# salt --versions-report
Salt Version:
          Salt: 3006.5

Python Version:
        Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: 1.3.0
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: 1.7.0
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.14.2
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.10.0-25-amd64
        system: Linux
       version: Debian GNU/Linux 11 bullseye

@m4djack

m4djack commented Apr 2, 2024

Still an issue at our company; we have a restart script as a procedure to resolve it for now.
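For what it's worth, such a restart procedure could be automated roughly as follows. This is only a sketch of the workaround, not a fix: the canary minion id is hypothetical, and it assumes systemd plus the salt CLI on the master.

import json
import subprocess
import sys

TARGET = "canary-minion"  # hypothetical minion id that should always respond

def master_sees_minion():
    # Run a short test.ping through the master and report whether a return came back.
    out = subprocess.run(
        ["salt", "--timeout=15", "--static", "--out=json", TARGET, "test.ping"],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
    ).stdout.decode()
    try:
        return json.loads(out).get(TARGET) is True
    except ValueError:
        return False

if not master_sees_minion():
    # Workaround only: restarting the master re-establishes the return path in the reports above.
    subprocess.run(["systemctl", "restart", "salt-master"])
    sys.exit(1)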

Master:

salt --versions-report
Salt Version:
          Salt: 3006.3

Python Version:
        Python: 3.10.13 (main, Sep  6 2023, 02:11:27) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.13.10
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: centos 7.9.2009 Core
        locale: utf-8
       machine: x86_64
       release: 3.10.0-1160.36.2.el7.x86_64
        system: Linux
       version: CentOS Linux 7.9.2009 Core

Minion

Salt Version:
          Salt: 3006.3

Python Version:
        Python: 3.10.13 (main, Sep  6 2023, 02:15:03) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: 1.10.6
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.13.10
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: amzn 2
        locale: utf-8
       machine: aarch64
       release: 4.14.336-257.562.amzn2.aarch64
        system: Linux
       version: Amazon Linux 2
