
Salt masters behind AWS ELB have flaky connection to minions #43368

Closed
@fzk-rec

Description of Issue/Question

The connection between the salt-masters, which run behind an AWS ELB, and the salt-minions is flaky: commands sometimes get through, but most of the time they don't.
I would like to know whether there is a flaw in my setup that I am not seeing, or whether Salt only works behind a load balancer such as HAProxy rather than an ELB.

Or maybe Salt doesn't work behind an ELB at all?

Setup

I am running the following setup at AWS:

  • Elastic Load Balancer in front of two EC2 machines (Amazon Linux), each running the salt-master in a Docker container
  • Two EC2 instances with salt-minions installed
  • The 'master' value in the minion config is set to the DNS name of the load balancer (SaltMaster-env-vpc-test.szfegmankg.us-east-1.elasticbeanstalk.com)
  • The ELB accepts all traffic from the minions
  • The salt-masters accept all traffic from the ELB as well as from the minions
  • The salt-masters' PKI folder is shared between the two masters
  • The salt-masters have the same private and public keys
  • The salt-masters run 2017.7.1
  • The salt-minions run 2016.11.5 (I tried 2017.7.1 as well, but got the same results)
  • The salt-minions accept all traffic from the ELB as well as from the masters
  • The master config looks as follows:
worker_threads: 20
auto_accept: True
log_level: error
log_level_logfile: debug
extension_modules: srv/salt/ext
rest_cherrypy:
  port: 8000
  disable_ssl: True
  debug: True
external_auth:
  pam:
    saltdev:
      - .*
      - '@runner'
# Setting the job_cache to redis.
# The redis config settings are generated at the start of the docker container and
# will be written into /etc/salt/master.d/redis.conf
master_job_cache: redis
cache: redis
pki_dir: /etc/salt/pki/master/efs

The minion config looks as follows:

id: WIN-AB3GO7BJ72I
log_file: C:\salt.log
multiprocessing: False
log_level_logfile: debug
pki_dir: /conf/pki/minion
master: SaltMaster-env-vpc-test.szfegmankg.us-east-1.elasticbeanstalk.com
master_type: str
master_alive_interval: 30
open_mode: True
root_dir: c:\salt
ipc_mode: tcp
recon_default: 1000
recon_max: 199000
recon_randomize: True

In the master log files, I can see on both masters:
2017-09-05 10:06:18,118 [salt.utils.verify][DEBUG ][35] This salt-master instance has accepted 2 minion keys.

Running salt-key -L on both masters yields the same result:

Accepted Keys:
WIN-AB3GO7BJ72I
WIN-EDMP9VB716B
Denied Keys:
Unaccepted Keys:
Rejected Keys:

So it looks like everything is fine and should work. However, test.ping is extremely flaky: sometimes it works, but most of the time it doesn't. Usually neither master gets any return from the minion, and the minion log shows that the minion never receives the request to execute 'test.ping' from the master.
Example 1:
test.ping from Master1:

root@d7383ff8f8bf:/# salt 'WIN-EDMP9VB716B' test.ping
[ERROR   ] Exception raised when processing __virtual__ function for salt.loaded.int.cache.consul. Module will not be loaded: 'module' object has no attribute 'Consul'
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
KeyError: 'redis.ls'
Traceback (most recent call last):
  File "/usr/bin/salt", line 10, in <module>
    salt_main()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 476, in salt_main
    client.run()
  File "/usr/lib/python2.7/dist-packages/salt/cli/salt.py", line 173, in run
    for full_ret in cmd_func(**kwargs):
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 805, in cmd_cli
    **kwargs):
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 1597, in get_cli_event_returns
    connected_minions = salt.utils.minions.CkMinions(self.opts).connected_ids()
  File "/usr/lib/python2.7/dist-packages/salt/utils/minions.py", line 577, in connected_ids
    search = self.cache.ls('minions')
  File "/usr/lib/python2.7/dist-packages/salt/cache/__init__.py", line 244, in ls
    return self.modules[fun](bank, **self._kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/loader.py", line 1113, in __getitem__
    func = super(LazyLoader, self).__getitem__(item)
  File "/usr/lib/python2.7/dist-packages/salt/utils/lazy.py", line 101, in __getitem__
    raise KeyError(key)
KeyError: 'redis.ls'
Traceback (most recent call last):
  File "/usr/bin/salt", line 10, in <module>
    salt_main()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 476, in salt_main
    client.run()
  File "/usr/lib/python2.7/dist-packages/salt/cli/salt.py", line 173, in run
    for full_ret in cmd_func(**kwargs):
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 805, in cmd_cli
    **kwargs):
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 1597, in get_cli_event_returns
    connected_minions = salt.utils.minions.CkMinions(self.opts).connected_ids()
  File "/usr/lib/python2.7/dist-packages/salt/utils/minions.py", line 577, in connected_ids
    search = self.cache.ls('minions')
  File "/usr/lib/python2.7/dist-packages/salt/cache/__init__.py", line 244, in ls
    return self.modules[fun](bank, **self._kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/loader.py", line 1113, in __getitem__
    func = super(LazyLoader, self).__getitem__(item)
  File "/usr/lib/python2.7/dist-packages/salt/utils/lazy.py", line 101, in __getitem__
    raise KeyError(key)
KeyError: 'redis.ls'

I am aware that the redis error will be fixed soon (see #43295).

Example 2:
test.ping from Master1, about 1 minute after Example 1:

root@d7383ff8f8bf:/# salt 'WIN-EDMP9VB716B' test.ping
WIN-EDMP9VB716B:
    True

Also during my tests, a test.ping from Master2 never succeeded.
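
To put a number on "flaky", the repeated test.ping attempts could be tallied with something like the sketch below. It is only a minimal illustration, assuming Python 3: the helper name ping_success_rate and the run_ping callable are hypothetical and not part of Salt. The caller supplies run_ping, which should return True when the minion answered.

```python
import time


def ping_success_rate(run_ping, attempts=10, delay=0.0):
    """Call run_ping() `attempts` times and return the fraction that returned True.

    run_ping is a caller-supplied callable (hypothetical, not a Salt API).
    Any exception it raises is counted as a failed attempt.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    successes = 0
    for i in range(attempts):
        try:
            if run_ping() is True:
                successes += 1
        except Exception:
            # A crash (such as the KeyError traceback above) counts as a failure.
            pass
        if delay and i < attempts - 1:
            time.sleep(delay)  # pause between attempts, but not after the last one
    return successes / attempts
```

On a master, run_ping could for example wrap salt.client.LocalClient().cmd('WIN-EDMP9VB716B', 'test.ping', timeout=10) and check that the returned dict maps the minion id to True. Running it on each master separately would show whether Master2 really never gets a return.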

Steps to Reproduce Issue

  • Create an internal ELB at AWS
  • Launch two salt-masters and attach them to the ELB
  • Launch two salt-minion machines and configure the ELB DNS name as the master
  • Create a Redis cluster and configure the salt-masters' job cache and minion data cache to use Redis
  • Create an EFS volume and mount it on both salt-masters
  • Configure the PKI folder to be put on the mounted EFS volume
  • Ensure that the network communication between the minions, the ELB and the masters is OK
  • Try to send test.ping commands from each of the salt-masters
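
The network check in the steps above can be sketched roughly as follows. This is a minimal sketch, assuming Python 3 and Salt's default ports (4505 for publish, 4506 for returns); the helper name check_ports is hypothetical. Note that a successful TCP connect to the ELB only shows that its listener accepts connections, not that a healthy master sits behind it.

```python
import socket

# Salt's default master ports: 4505 (publish) and 4506 (request/return).
SALT_PORTS = (4505, 4506)


def check_ports(host, ports=SALT_PORTS, timeout=3.0):
    """Return {port: bool} telling whether a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the host name and opens a TCP connection.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and DNS resolution failures.
            results[port] = False
    return results
```

Run from a minion, check_ports("SaltMaster-env-vpc-test.szfegmankg.us-east-1.elasticbeanstalk.com") would report both ports against the ELB name. Repeating it against each master's own address would separate ELB problems from master problems.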

Versions Report

Salt Version:
           Salt: 2017.7.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: unknown
       dateutil: 2.4.2
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.12 (default, Nov 19 2016, 06:48:10)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: Ubuntu 16.04 xenial
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 4.9.43-17.38.amzn1.x86_64
         system: Linux
        version: Ubuntu 16.04 xenial

Labels: pending-discussion (the issue or pull request needs more discussion before it can be closed or merged), stale