
[BUG] Salt Deltaproxy NAPALM Issues - PR #60090 #60177

Closed
ggiesen opened this issue May 12, 2021 · 16 comments · Fixed by #60090
Labels: Bug (broken, incorrect, or confusing behavior), Delta-Proxy, severity-high (2nd top severity, seen by most users, causes major problems), Silicon (v3004.0 release code name)

Comments

ggiesen (Contributor) commented May 12, 2021

Description
There are a number of issues related to the open-sourcing of deltaproxy in PR #60090 when NAPALM is used as the proxy type.

  1. Slow startup. Running in debug mode, the deltaproxy appears to connect to each device sequentially and issue a number of commands. When testing a 21-node setup on a mix of (admittedly slow) IOS and NX-OS devices, it took 13 minutes to connect to all of the devices.
  2. Only the last device in the list stays connected after the initial connection setup:

router1:

router1#show users
    Line       User       Host(s)              Idle       Location
*  1 vty 0     real_user  idle                 00:00:00 198.51.100.254

router2:

router2#show users
    Line       User       Host(s)              Idle       Location
*  1 vty 0     real_user  idle                 00:00:00 198.51.100.254

router3:

router3#show users
    Line       User       Host(s)              Idle       Location
   1 vty 0     salt_user  idle                 00:00:27 198.51.100.1
*  2 vty 1     real_user  idle                 00:00:00 198.51.100.254
  3. All commands that target any device in the list of ids are answered by the last device only, no matter which minion ID is targeted (see the sketch after these outputs):
$ salt router1 net.facts
router1:
    ----------
    comment:
    out:
        ----------
        fqdn:
            router3.example.com
        hostname:
            router3
        interface_list:
            - TenGigabitEthernet5/2
            - GigabitEthernet5/3
            - TenGigabitEthernet6/1
            - TenGigabitEthernet6/2
            - GigabitEthernet6/3
            - Loopback0
        model:
            WS-C6509
        os_version:
            s3223_rp Software (s3223_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(33)SXJ10, RELEASE SOFTWARE (fc3)
        serial_number:
            ABC00000003
        uptime:
            48431880
        vendor:
            Cisco
    result:
        True
$ salt router2 net.facts
router2:
    ----------
    comment:
    out:
        ----------
        fqdn:
            router3.example.com
        hostname:
            router3
        interface_list:
            - TenGigabitEthernet5/2
            - GigabitEthernet5/3
            - TenGigabitEthernet6/1
            - TenGigabitEthernet6/2
            - GigabitEthernet6/3
            - Loopback0
        model:
            WS-C6509
        os_version:
            s3223_rp Software (s3223_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(33)SXJ10, RELEASE SOFTWARE (fc3)
        serial_number:
            ABC00000003
        uptime:
            48432420
        vendor:
            Cisco
    result:
        True
$ salt router3 net.facts
router3:
    ----------
    comment:
    out:
        ----------
        fqdn:
            router3.example.com
        hostname:
            router3
        interface_list:
            - TenGigabitEthernet5/2
            - GigabitEthernet5/3
            - TenGigabitEthernet6/1
            - TenGigabitEthernet6/2
            - GigabitEthernet6/3
            - Loopback0
        model:
            WS-C6509
        os_version:
            s3223_rp Software (s3223_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(33)SXJ10, RELEASE SOFTWARE (fc3)
        serial_number:
            ABC00000003
        uptime:
            48432480
        vendor:
            Cisco
    result:
        True
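
One classic pattern that produces exactly this symptom (whether or not it is the actual cause here) is module-level connection state that each sub-proxy's init() overwrites, so the last device connected wins. A purely hypothetical illustration, not Salt's actual code:

# Hypothetical illustration of shared-state overwriting; all names invented.
_CONN = {}  # module-level: one dict shared by every sub-proxy that loads it

def open_connection(host):
    # Placeholder for the real NAPALM get_network_driver()/open() dance.
    return {"host": host, "connected": True}

def init(opts):
    # Each sub-proxy's init() writes to the same shared dict, so once the
    # last device connects, every minion ID resolves to that connection.
    _CONN["device"] = open_connection(opts["proxy"]["host"])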

Setup
Salt master and proxy are running 3003 with the patches from #60090.

/etc/salt/proxy:

master: 198.51.100.1
...
metaproxy: deltaproxy

/srv/pillar/top.sls:

base:
  deltaproxy:
    - deltaproxy
  router1:
    - router1
  router2:
    - router2
  router3:
    - router3

/srv/pillar/deltaproxy.sls:

proxy:
  proxytype: deltaproxy
  ids:
    - router1
    - router2
    - router3

/srv/pillar/router1.sls:

proxy:
  driver: ios
  host: 192.0.2.1
  proxytype: napalm
  username: salt_user
  passwd: salt_pass
  multiprocessing: False
  optional_args:
    global_delay_factor: 20

/srv/pillar/router2.sls:

proxy:
  driver: ios
  host: 192.0.2.2
  proxytype: napalm
  username: salt_user
  passwd: salt_pass
  multiprocessing: False
  optional_args:
    global_delay_factor: 20

/srv/pillar/router3.sls:

proxy:
  driver: ios
  host: 192.0.2.3
  proxytype: napalm
  username: salt_user
  passwd: salt_pass
  multiprocessing: False
  optional_args:
    global_delay_factor: 20

Steps to Reproduce the behavior
$ salt-proxy --proxyid=deltaproxy -l debug

Expected behavior

  1. Ideally, deltaproxies should set up and connect in parallel to speed up startup time (see the sketch after this list)
  2. All devices should stay connected
  3. Commands should execute on the correct device
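
A minimal sketch of the kind of parallel connection setup item 1 asks for, using Python's concurrent.futures. This is illustrative only: connect_device() and device_configs are hypothetical stand-ins, not Salt internals.

from concurrent.futures import ThreadPoolExecutor, as_completed

def connect_device(device_id, config):
    # Placeholder for opening one sub-proxy's NAPALM connection.
    return {"id": device_id, "host": config["host"], "connected": True}

def connect_all(device_configs, max_workers=10):
    # Connect to all devices concurrently, so total startup time approaches
    # that of the slowest single device instead of the sum of all of them.
    connections = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(connect_device, dev_id, cfg): dev_id
            for dev_id, cfg in device_configs.items()
        }
        for future in as_completed(futures):
            connections[futures[future]] = future.result()  # raises on failure
    return connections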


Versions Report

salt --versions-report:
Salt Version:
          Salt: 3003

Dependency Versions:
          cffi: 1.14.0
      cherrypy: unknown
      dateutil: 2.6.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.2
       libgit2: 1.0.1
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.20
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: 1.3.0
        Python: 3.6.8 (default, Aug 24 2020, 17:57:11)
  python-gnupg: 0.4.6
        PyYAML: 5.3.1
         PyZMQ: 19.0.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: centos 8
        locale: UTF-8
       machine: x86_64
       release: 4.18.0-240.15.1.el8_3.x86_64
        system: Linux
       version: CentOS Linux 8

Additional context
Using NAPALM 3.2.0

garethgreenaway (Member) commented:

@ggiesen A couple of things to look at here, but can you update your proxy configurations and enable multiprocessing? Thanks!

ggiesen commented May 12, 2021

Way ahead of you. It made no difference. FYI, per the docs at https://docs.saltproject.io/en/latest/ref/proxy/all/salt.proxy.napalm.html, multiprocessing must be set to False for SSH-based proxies. I had the same thought you did, though, and tried it for sh*ts and giggles.

garethgreenaway commented:

@ggiesen Can you pull down the latest from the adding_deltaproxy branch, please? There are some changes to the napalm proxy minion that should address the issue of all commands going to the last connected proxy minion. Thanks!

ggiesen commented May 13, 2021

Will do. I'm building a proper Vagrant lab and will test as soon as it's done.

ggiesen commented May 14, 2021

Initial impressions are that it is looking much, much better:

$ salt router1 net.facts
router1:
    ----------
    comment:
    out:
        ----------
        fqdn:
            csr1000v.localdomain
        hostname:
            csr1000v
        interface_list:
            - GigabitEthernet1
        model:
            CSR1000V
        os_version:
            Virtual XE Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 16.9.5, RELEASE SOFTWARE (fc1)
        serial_number:
            9XICW0VWS9P
        uptime:
            5940
        vendor:
            Cisco
    result:
        True
$ salt router2 net.facts
router2:
    ----------
    comment:
    out:
        ----------
        fqdn:
            csr1000v.localdomain
        hostname:
            csr1000v
        interface_list:
            - GigabitEthernet1
        model:
            CSR1000V
        os_version:
            Virtual XE Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 16.9.5, RELEASE SOFTWARE (fc1)
        serial_number:
            9VFH0G7B8OX
        uptime:
            5220
        vendor:
            Cisco
    result:
        True
$ salt router* net.facts
router1:
    ----------
    comment:
    out:
        ----------
        fqdn:
            csr1000v.localdomain
        hostname:
            csr1000v
        interface_list:
            - GigabitEthernet1
        model:
            CSR1000V
        os_version:
            Virtual XE Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 16.9.5, RELEASE SOFTWARE (fc1)
        serial_number:
            9XICW0VWS9P
        uptime:
            5820
        vendor:
            Cisco
    result:
        True
router2:
    ----------
    comment:
    out:
        ----------
        fqdn:
            csr1000v.localdomain
        hostname:
            csr1000v
        interface_list:
            - GigabitEthernet1
        model:
            CSR1000V
        os_version:
            Virtual XE Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 16.9.5, RELEASE SOFTWARE (fc1)
        serial_number:
            9VFH0G7B8OX
        uptime:
            5100
        vendor:
            Cisco
    result:
        True
$ salt router* net.ping 8.8.8.8
router1:
    ----------
    comment:
    out:
        ----------
        success:
            ----------
            packet_loss:
                0
            probes_sent:
                5
            results:
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
            rtt_avg:
                20.0
            rtt_max:
                21.0
            rtt_min:
                19.0
            rtt_stddev:
                0.0
    result:
        True
router2:
    ----------
    comment:
    out:
        ----------
        success:
            ----------
            packet_loss:
                0
            probes_sent:
                5
            results:
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
                |_
                  ----------
                  ip_address:
                      8.8.8.8
                  rtt:
                      0.0
            rtt_avg:
                23.0
            rtt_max:
                37.0
            rtt_min:
                19.0
            rtt_stddev:
                0.0
    result:
        True

garethgreenaway commented:

@ggiesen Excellent! Thanks for testing that out. I'm looking into the other issue, making the connections happen in parallel to speed things up.

ggiesen commented May 19, 2021

@garethgreenaway

Another issue:

When the SSH session is dropped, it does not reconnect until the proxy is restarted:

router1#show users
    Line       User       Host(s)              Idle       Location
   1 vty 0     vagrant    idle                 00:00:53 192.168.33.163
*  2 vty 1     vagrant    idle                 00:00:00 192.168.33.2
router1#clear line 1
[confirm]
 [OK] 
router1#show users
    Line       User       Host(s)              Idle       Location  
*  2 vty 1     vagrant    idle                 00:00:00 192.168.33.2

Logs from when I disconnect the session:

# salt-proxy --proxyid=deltaproxy -l debug
...
[DEBUG   ] [chan 0] EOF received (0)
[DEBUG   ] [chan 0] EOF sent (0)
[DEBUG   ] EOF in transport thread
...
# salt router1 net.facts -l debug
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: salt.local
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Override  __grains__: <module 'salt.loaded.int.log_handlers.sentry_mod' from '/usr/lib/python3.6/site-packages/salt/log/handlers/sentry_mod.py'>
[DEBUG   ] Configuration file path: /etc/salt/master
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: salt.local
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'salt.local_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
[DEBUG   ] Trying to connect to: tcp://127.0.0.1:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] Reading minion list from /var/cache/salt/master/jobs/4b/38fccfcbea7caa584faffd095a73ca664a75af4ddfb0864eae907ef8c25509/.minions.p
[DEBUG   ] get_iter_returns for jid 20210519220732948048 sent to {'router1'} will timeout at 22:07:37.992524
[DEBUG   ] jid 20210519220732948048 return from router1
[DEBUG   ] return event: {'router1': {'ret': {'out': {}, 'result': False, 'comment': 'Cannot execute "get_facts" on 192.168.33.164 as vagrant. Reason: Socket is closed!', 'traceback': 'Traceback (most recent call last):\n  File "/usr/local/lib/python3.6/site-packages/napalm/ios/ios.py", line 205, in _send_command\n    output = self.device.send_command(command)\n  File "/usr/local/lib/python3.6/site-packages/netmiko/utilities.py", line 500, in wrapper_decorator\n    return func(self, *args, **kwargs)\n  File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 1475, in send_command\n    prompt = self.find_prompt(delay_factor=delay_factor)\n  File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 1179, in find_prompt\n    self.write_channel(self.RETURN)\n  File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 459, in write_channel\n    self._write_channel(out_data)\n  File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 417, in _write_channel\n    self.remote_conn.sendall(write_bytes(out_data, encoding=self.encoding))\n  File "/usr/local/lib/python3.6/site-packages/paramiko/channel.py", line 846, in sendall\n    sent = self.send(s)\n  File "/usr/local/lib/python3.6/site-packages/paramiko/channel.py", line 801, in send\n    return self._send(s, m)\n  File "/usr/local/lib/python3.6/site-packages/paramiko/channel.py", line 1198, in _send\n    raise socket.error("Socket is closed")\nOSError: Socket is closed\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/salt/utils/napalm.py", line 171, in call\n    out = getattr(napalm_device.get("DRIVER"), method)(*args, **kwargs)\n  File "/usr/local/lib/python3.6/site-packages/napalm/ios/ios.py", line 984, in get_facts\n    show_ver = self._send_command("show version")\n  File "/usr/local/lib/python3.6/site-packages/napalm/ios/ios.py", line 208, in _send_command\n    raise ConnectionClosedException(str(e))\nnapalm.base.exceptions.ConnectionClosedException: Socket is closed\n'}, 'retcode': 1, 'jid': '20210519220732948048'}}
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] LazyLoaded nested.output
router1:
    ----------
    comment:
        Cannot execute "get_facts" on 192.168.33.164 as vagrant. Reason: Socket is closed!
    out:
        ----------
    result:
        False
    traceback:
        Traceback (most recent call last):
          File "/usr/local/lib/python3.6/site-packages/napalm/ios/ios.py", line 205, in _send_command
            output = self.device.send_command(command)
          File "/usr/local/lib/python3.6/site-packages/netmiko/utilities.py", line 500, in wrapper_decorator
            return func(self, *args, **kwargs)
          File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 1475, in send_command
            prompt = self.find_prompt(delay_factor=delay_factor)
          File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 1179, in find_prompt
            self.write_channel(self.RETURN)
          File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 459, in write_channel
            self._write_channel(out_data)
          File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 417, in _write_channel
            self.remote_conn.sendall(write_bytes(out_data, encoding=self.encoding))
          File "/usr/local/lib/python3.6/site-packages/paramiko/channel.py", line 846, in sendall
            sent = self.send(s)
          File "/usr/local/lib/python3.6/site-packages/paramiko/channel.py", line 801, in send
            return self._send(s, m)
          File "/usr/local/lib/python3.6/site-packages/paramiko/channel.py", line 1198, in _send
            raise socket.error("Socket is closed")
        OSError: Socket is closed

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
          File "/usr/lib/python3.6/site-packages/salt/utils/napalm.py", line 171, in call
            out = getattr(napalm_device.get("DRIVER"), method)(*args, **kwargs)
          File "/usr/local/lib/python3.6/site-packages/napalm/ios/ios.py", line 984, in get_facts
            show_ver = self._send_command("show version")
          File "/usr/local/lib/python3.6/site-packages/napalm/ios/ios.py", line 208, in _send_command
            raise ConnectionClosedException(str(e))
        napalm.base.exceptions.ConnectionClosedException: Socket is closed
[DEBUG   ] jid 20210519220732948048 found all minions {'router1'}
[DEBUG   ] Closing IPCMessageSubscriber instance
ERROR: Minions returned with non-zero exit code

Log from the minion while executing the above:

[INFO    ] User sudo_vagrant Executing command net.facts with jid 20210519222320025484
[DEBUG   ] Command details {'fun': 'net.facts', 'arg': [], 'tgt': 'router1', 'jid': '20210519222320025484', 'ret': '', 'tgt_type': 'glob', 'user': 'sudo_vagrant'}
[DEBUG   ] Deltaproxy minion_instance <salt.minion.ProxyMinion object at 0x7fa452dbf898>(ID: router1). Target: router1
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] LazyLoaded deltaproxy.thread_return
[INFO    ] Starting a new job with PID 477944
[DEBUG   ] Subprocess 20210519222320025484-Job-20210519222320025484 added
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] LazyLoaded deltaproxy.target_load
[DEBUG   ] LazyLoaded glob_match.match
[DEBUG   ] Could not LazyLoad {}.allow_missing_func: '{}.allow_missing_func' is not available.
[DEBUG   ] LazyLoaded net.facts
[DEBUG   ] Executors list ['direct_call']
[DEBUG   ] read_channel: 
[ERROR   ] Cannot execute "get_facts" on 192.168.33.164 as vagrant. Reason: Socket is closed!
[ERROR   ] Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/napalm/ios/ios.py", line 205, in _send_command
    output = self.device.send_command(command)
  File "/usr/local/lib/python3.6/site-packages/netmiko/utilities.py", line 500, in wrapper_decorator
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 1475, in send_command
    prompt = self.find_prompt(delay_factor=delay_factor)
  File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 1179, in find_prompt
    self.write_channel(self.RETURN)
  File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 459, in write_channel
    self._write_channel(out_data)
  File "/usr/local/lib/python3.6/site-packages/netmiko/base_connection.py", line 417, in _write_channel
    self.remote_conn.sendall(write_bytes(out_data, encoding=self.encoding))
  File "/usr/local/lib/python3.6/site-packages/paramiko/channel.py", line 846, in sendall
    sent = self.send(s)
  File "/usr/local/lib/python3.6/site-packages/paramiko/channel.py", line 801, in send
    return self._send(s, m)
  File "/usr/local/lib/python3.6/site-packages/paramiko/channel.py", line 1198, in _send
    raise socket.error("Socket is closed")
OSError: Socket is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/napalm.py", line 171, in call
    out = getattr(napalm_device.get("DRIVER"), method)(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/napalm/ios/ios.py", line 984, in get_facts
    show_ver = self._send_command("show version")
  File "/usr/local/lib/python3.6/site-packages/napalm/ios/ios.py", line 208, in _send_command
    raise ConnectionClosedException(str(e))
napalm.base.exceptions.ConnectionClosedException: Socket is closed

[DEBUG   ] Minion return retry timer set to 9 seconds (randomized)
[INFO    ] Returning information for job: 20210519222320025484
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/proxy', 'router1', 'tcp://127.0.1.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/proxy', 'router1', 'tcp://127.0.1.1:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.1.1:4506
[DEBUG   ] Trying to connect to: tcp://127.0.1.1:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] Subprocess 20210519222320025484-Job-20210519222320025484 cleaned up

ggiesen commented May 19, 2021

Note that running salt router1 napalm.reconnect does successfully restart the connection. I also tested this with a regular proxy minion to be sure, and it exhibits the same behaviour. I don't recall this ever happening on 3001. Is it possible your patch broke the auto-reconnect?

I'll try spinning up another unpatched Vagrant lab to confirm.
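
For anyone else landing here, the manual workaround is the napalm.reconnect execution function; per the napalm execution module docs it also accepts force=True to re-establish the connection even when Salt believes it is still alive (verify against your version):

$ salt router1 napalm.reconnect
$ salt router1 napalm.reconnect force=True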

ggiesen commented May 20, 2021

Unpatched 3003 still exhibits the same behaviour, so it's probably not deltaproxy-related. Going to revalidate on 3001 for my own sanity ;)

ggiesen commented May 20, 2021

Okay, I'm not crazy. I tried 3001.7, and the regular proxy minion does reconnect:

Session is connected:

router1#show users  
    Line       User       Host(s)              Idle       Location
   1 vty 0     vagrant    idle                 00:00:24 192.168.33.167
*  2 vty 1     vagrant    idle                 00:00:00 192.168.33.2

Kill connection:

router1#clear line 1
[confirm]
 [OK]

Logs from salt-proxy confirm the disconnect:

...
[DEBUG   ] [chan 0] EOF received (0)
[DEBUG   ] [chan 0] EOF sent (0)
[DEBUG   ] EOF in transport thread

User is definitely disconnected from router:

router1#show users  
    Line       User       Host(s)              Idle       Location
*  2 vty 1     vagrant    idle                 00:00:00 192.168.33.2

A few tens of seconds later:

...
[DEBUG   ] schedule: Job __proxy_keepalive was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG   ] schedule: Job __proxy_keepalive was scheduled with a max number of 1
[INFO    ] Running scheduled job: __proxy_keepalive
[DEBUG   ] Subprocess Thread-96-Schedule-__proxy_keepalive added
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'id': 'router1', 'fun': 'status.proxy_reconnect', 'fun_args': [], 'schedule': '__proxy_keepalive', 'jid': '20210520034353581963', 'pid': 36808}
[DEBUG   ] LazyLoaded status.proxy_reconnect
[DEBUG   ] Is router1 still alive? No.
[INFO    ] router1 (napalm proxy) is down. Restarting.
[DEBUG   ] Setting up NAPALM connection
[DEBUG   ] starting thread (client mode): 0x82e4a7b8
[DEBUG   ] Local version/idstring: SSH-2.0-paramiko_2.7.2
[DEBUG   ] Remote version/idstring: SSH-2.0-Cisco-1.25
[INFO    ] Connected (version 2.0, client Cisco-1.25)
[DEBUG   ] kex algos:['diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa'] client encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr'] server encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr'] client mac:['hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1', 'hmac-sha1-96'] server mac:['hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1', 'hmac-sha1-96'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
[DEBUG   ] Kex agreed: diffie-hellman-group-exchange-sha1
[DEBUG   ] HostKey agreed: ssh-rsa
[DEBUG   ] Cipher agreed: aes128-ctr
[DEBUG   ] MAC agreed: hmac-sha2-256
[DEBUG   ] Compression agreed: none
[DEBUG   ] Got server p (2048 bits)
[DEBUG   ] kex engine KexGex specified hash_algo <built-in function openssl_sha1>
[DEBUG   ] Switch to new keys ...
[DEBUG   ] Adding ssh-rsa host key for 192.168.33.168: b'1e128dac15b49082ad116da8a3207171'
[DEBUG   ] userauth is OK
[INFO    ] Authentication (password) successful!
[DEBUG   ] [chan 0] Max packet in: 32768 bytes
[DEBUG   ] [chan 0] Max packet out: 4096 bytes
[DEBUG   ] Secsh channel 0 opened.
[DEBUG   ] [chan 0] Sesch channel 0 request ok
[DEBUG   ] [chan 0] Sesch channel 0 request ok
[DEBUG   ] write_channel: b'terminal width 511\n'
[DEBUG   ] Pattern is: terminal width 511
[DEBUG   ] _read_channel_expect read_data:

And the session is reconnected:

router1#show users
    Line       User       Host(s)              Idle       Location
   1 vty 0     vagrant    idle                 00:00:08 192.168.33.167
*  2 vty 1     vagrant    idle                 00:00:00 192.168.33.2
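
For context, the __proxy_keepalive job seen in these logs is Salt's built-in proxy keepalive, driven by these /etc/salt/proxy options (the values shown are, to the best of my knowledge, the documented defaults; verify against your version):

proxy_keep_alive: True           # schedule the status.proxy_reconnect check
proxy_keep_alive_interval: 1     # minutes between keepalive checks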

Will try with 3002 next and then 3003 again.

ggiesen commented May 20, 2021

Confirmed 3002.6 works properly as well:

Session is connected:

router1#show users
    Line       User       Host(s)              Idle       Location
   1 vty 0     vagrant    idle                 00:00:55 192.168.33.167
*  2 vty 1     vagrant    idle                 00:00:00 192.168.33.2

Kill connection:

router1#clear line 1
[confirm]
 [OK]

Logs from salt-proxy confirm the disconnect:

...
[DEBUG   ] [chan 0] EOF received (0)
[DEBUG   ] [chan 0] EOF sent (0)
[DEBUG   ] EOF in transport thread

User is definitely disconnected from router:

router1#show users  
    Line       User       Host(s)              Idle       Location
*  2 vty 1     vagrant    idle                 00:00:00 192.168.33.2

A few tens of seconds later:

[DEBUG   ] schedule: Job __proxy_keepalive was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG   ] schedule: Job __proxy_keepalive was scheduled with a max number of 1
[INFO    ] Running scheduled job: __proxy_keepalive
[DEBUG   ] Subprocess Thread-31-Schedule-__proxy_keepalive added
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'id': 'router1', 'fun': 'status.proxy_reconnect', 'fun_args': [], 'schedule': '__proxy_keepalive', 'jid': '20210520040132575662', 'pid': 44421}
[DEBUG   ] LazyLoaded status.proxy_reconnect
[DEBUG   ] Is router1 still alive? No.
[INFO    ] router1 (napalm proxy) is down. Restarting.
[DEBUG   ] Setting up NAPALM connection
[DEBUG   ] starting thread (client mode): 0x2189deb8
[DEBUG   ] Local version/idstring: SSH-2.0-paramiko_2.7.2
[DEBUG   ] Remote version/idstring: SSH-2.0-Cisco-1.25
[INFO    ] Connected (version 2.0, client Cisco-1.25)
[DEBUG   ] kex algos:['diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa'] client encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr'] server encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr'] client mac:['hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1', 'hmac-sha1-96'] server mac:['hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1', 'hmac-sha1-96'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
[DEBUG   ] Kex agreed: diffie-hellman-group-exchange-sha1
[DEBUG   ] HostKey agreed: ssh-rsa
[DEBUG   ] Cipher agreed: aes128-ctr
[DEBUG   ] MAC agreed: hmac-sha2-256
[DEBUG   ] Compression agreed: none
[DEBUG   ] Got server p (2048 bits)
[DEBUG   ] kex engine KexGex specified hash_algo <built-in function openssl_sha1>
[DEBUG   ] Switch to new keys ...
[DEBUG   ] Adding ssh-rsa host key for 192.168.33.168: b'1e128dac15b49082ad116da8a3207171'
[DEBUG   ] userauth is OK
[INFO    ] Authentication (password) successful!
[DEBUG   ] [chan 0] Max packet in: 32768 bytes
[DEBUG   ] [chan 0] Max packet out: 4096 bytes
[DEBUG   ] Secsh channel 0 opened.
[DEBUG   ] [chan 0] Sesch channel 0 request ok
[DEBUG   ] [chan 0] Sesch channel 0 request ok
[DEBUG   ] write_channel: b'terminal width 511\n'
[DEBUG   ] Pattern is: terminal width 511
[DEBUG   ] _read_channel_expect read_data:


router1#
[DEBUG   ] _read_channel_expect read_data: t
[DEBUG   ] _read_channel_expect read_data: ermi
[DEBUG   ] _read_channel_expect read_data: nal 
[DEBUG   ] _read_channel_expect read_data: widt
[DEBUG   ] _read_channel_expect read_data: h 511
router1#
[DEBUG   ] Pattern found: terminal width 511


router1#terminal width 511
router1#
[DEBUG   ] In disable_paging
[DEBUG   ] Command: terminal length 0

[DEBUG   ] write_channel: b'terminal length 0\n'
[DEBUG   ] Pattern is: terminal\ length\ 0
[DEBUG   ] _read_channel_expect read_data: te
[DEBUG   ] _read_channel_expect read_data: r
[DEBUG   ] _read_channel_expect read_data: minal len
[DEBUG   ] _read_channel_expect read_data: gth 0
router1#
[DEBUG   ] Pattern found: terminal\ length\ 0 terminal length 0
router1#
[DEBUG   ] terminal length 0
router1#
[DEBUG   ] Exiting disable_paging
[DEBUG   ] read_channel: 
[DEBUG   ] write_channel: b'\n'
[DEBUG   ] read_channel: 
router1#
[DEBUG   ] read_channel: 
[DEBUG   ] [find_prompt()]: prompt is router1#
[DEBUG   ] write_channel: b'\n'
[DEBUG   ] Pattern is: router1
[DEBUG   ] _read_channel_expect read_data: 
router1#
[DEBUG   ] Pattern found: router1
router1#
[DEBUG   ] Restarted router1 (napalm proxy)!
[DEBUG   ] schedule.handle_func: Removing /var/cache/salt/proxy/router1/proc/20210520040132575662
[DEBUG   ] LazyLoaded mine.update
[DEBUG   ] Subprocess Thread-31-Schedule-__proxy_keepalive cleaned up

And the session is reconnected:

router1#show users
    Line       User       Host(s)              Idle       Location
   1 vty 0     vagrant    idle                 00:00:00 192.168.33.167
*  2 vty 1     vagrant    idle                 00:00:00 192.168.33.2

ggiesen commented May 20, 2021

Reconfirmed 3003 is broken. It's actually this bug: #60025 (and it's not cosmetic).

garethgreenaway commented:

@ggiesen To confirm: this is not an issue with the deltaproxy changes, but rather the other issue you linked?

ggiesen commented May 26, 2021

@garethgreenaway That's correct. The patches made for the other issue have introduced some new issues as well, and while they solve the reconnect issue for regular proxy minions, they do not solve it for deltaproxy.

garethgreenaway commented:

@ggiesen Found the issue and updated the adding_deltaproxy branch with the fix. The problem was in how the code checked for the existence of ConnectionClosedException via an import.
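
A minimal sketch of the import-guard pattern being described, illustrative only (this is not the actual patch on the adding_deltaproxy branch):

# Guard the optional NAPALM exception so later `except` clauses keep working
# even when the installed NAPALM version does not expose it.
try:
    from napalm.base.exceptions import ConnectionClosedException

    HAS_CONN_CLOSED_EXC = True
except ImportError:

    class ConnectionClosedException(Exception):
        # Fallback stub; never raised by NAPALM itself.
        pass

    HAS_CONN_CLOSED_EXC = False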

ggiesen commented May 27, 2021

@garethgreenaway After the latest changes in 4a1efd0, deltaproxy now reconnects automatically with no user intervention:

Session is killed:

[DEBUG   ] [chan 0] EOF received (0)
[DEBUG   ] [chan 0] EOF sent (0)
[DEBUG   ] EOF in transport thread

Session is automatically reconnected:

[DEBUG   ] schedule: Job __proxy_keepalive was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG   ] schedule: Job __proxy_keepalive was scheduled with a max number of 1
[INFO    ] Running scheduled job: __proxy_keepalive with jid 20210527060351684329
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Subprocess Thread-77-Schedule-__proxy_keepalive added
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] LazyLoaded status.proxy_reconnect
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'id': 'router1', 'fun': 'status.proxy_reconnect', 'fun_args': [{'proxy_name': 'napalm'}], 'schedule': '__proxy_keepalive', 'jid': '20210527060351684329', 'pid': 67432}
[DEBUG   ] Is router1 still alive? No.
[INFO    ] router1 (napalm proxy) is down. Restarting.
[DEBUG   ] Setting up NAPALM connection
[DEBUG   ] starting thread (client mode): 0x5fcf128
[DEBUG   ] Local version/idstring: SSH-2.0-paramiko_2.7.2
[DEBUG   ] Remote version/idstring: SSH-2.0-Cisco-1.25
[INFO    ] Connected (version 2.0, client Cisco-1.25)
[DEBUG   ] kex algos:['diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1'] server key:['ssh-rsa'] client encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr'] server encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr'] client mac:['hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1', 'hmac-sha1-96'] server mac:['hmac-sha2-256', 'hmac-sha2-512', 'hmac-sha1', 'hmac-sha1-96'] client compress:['none'] server compress:['none'] client lang:[''] server lang:[''] kex follows?False
[DEBUG   ] Kex agreed: diffie-hellman-group-exchange-sha1
[DEBUG   ] HostKey agreed: ssh-rsa
[DEBUG   ] Cipher agreed: aes128-ctr
[DEBUG   ] MAC agreed: hmac-sha2-256
[DEBUG   ] Compression agreed: none
[DEBUG   ] Got server p (2048 bits)
[DEBUG   ] kex engine KexGex specified hash_algo <built-in function openssl_sha1>
[DEBUG   ] Switch to new keys ...
[DEBUG   ] Adding ssh-rsa host key for 192.168.33.130: b'1e128dac15b49082ad116da8a3207171'
[DEBUG   ] userauth is OK
[INFO    ] Authentication (password) successful!
[DEBUG   ] [chan 0] Max packet in: 32768 bytes
[DEBUG   ] [chan 0] Max packet out: 4096 bytes
[DEBUG   ] Secsh channel 0 opened.
[DEBUG   ] [chan 0] Sesch channel 0 request ok
[DEBUG   ] [chan 0] Sesch channel 0 request ok
[DEBUG   ] write_channel: b'terminal width 511\n'
[DEBUG   ] Pattern is: terminal width 511
[DEBUG   ] _read_channel_expect read_data:


router1#
[DEBUG   ] _read_channel_expect read_data: termi
[DEBUG   ] _read_channel_expect read_data: nal
[DEBUG   ] _read_channel_expect read_data:  w
[DEBUG   ] _read_channel_expect read_data: i
[DEBUG   ] _read_channel_expect read_data: dt
[DEBUG   ] _read_channel_expect read_data: h 5
[DEBUG   ] _read_channel_expect read_data: 1
[DEBUG   ] _read_channel_expect read_data: 1

[DEBUG   ] Pattern found: terminal width 511 


router1#terminal width 511

[DEBUG   ] In disable_paging
[DEBUG   ] Command: terminal length 0

[DEBUG   ] write_channel: b'terminal length 0\n'
[DEBUG   ] Pattern is: terminal\ length\ 0
[DEBUG   ] _read_channel_expect read_data: router1#
[DEBUG   ] _read_channel_expect read_data: t
[DEBUG   ] _read_channel_expect read_data: erm
[DEBUG   ] _read_channel_expect read_data: i
[DEBUG   ] _read_channel_expect read_data: nal le
[DEBUG   ] _read_channel_expect read_data: ng
[DEBUG   ] _read_channel_expect read_data: th 
[DEBUG   ] _read_channel_expect read_data: 0

[DEBUG   ] Pattern found: terminal\ length\ 0 router1#terminal length 0

[DEBUG   ] router1#terminal length 0

[DEBUG   ] Exiting disable_paging
[DEBUG   ] read_channel: router1#
[DEBUG   ] Clear buffer detects data in the channel
[DEBUG   ] LazyLoaded mine.update
[DEBUG   ] read_channel: 
[DEBUG   ] write_channel: b'\n'
[DEBUG   ] read_channel: 
router1#
[DEBUG   ] read_channel: 
[DEBUG   ] [find_prompt()]: prompt is router1#
[DEBUG   ] write_channel: b'\n'
[DEBUG   ] Pattern is: router1
[DEBUG   ] _read_channel_expect read_data: 

[DEBUG   ] _read_channel_expect read_data: router1#
[DEBUG   ] Pattern found: router1
router1#
[DEBUG   ] Restarted router1 (napalm proxy)!
[DEBUG   ] schedule.handle_func: Removing /var/cache/salt/proxy/router1/proc/20210527060351684329
[DEBUG   ] Subprocess Thread-77-Schedule-__proxy_keepalive cleaned up
