[warm-reboot] warm-reboot breaks if ssh session in which it was started drops #7127

Closed
Hedgehog-Guru opened this issue Mar 23, 2021 · 6 comments · Fixed by sonic-net/sonic-utilities#1529
Labels
Triaged this issue has been triaged

Comments

@Hedgehog-Guru

Hedgehog-Guru commented Mar 23, 2021

Description
If the SSH session in which warm-reboot was started drops, warm-reboot breaks at the syncd stage and the switch becomes inoperable.

This is a degradation in comparison to the 201911 version.

Steps to reproduce the issue

  1. Verify that warm-reboot is enabled on the DUT:
show platform mlnx issu
ISSU is enabled
  2. Apply VLAN, IP, IPv6, and BGP configuration to the DUT (to approximate a real production environment):
config interface ip add Loopback0 1.1.1.1/32
sonic-cfggen -j vlan.json --write-to-db       <------ random VLAN + IP + IPv6 configuration
config interface ip add Ethernet192 101.1.0.1/24
config interface ip add Ethernet192 2123::1/64
sonic-cfggen -j bgp_ipv4.json --write-to-db   <------ IPv4 BGP configuration
sonic-cfggen -j bgp_ipv6.json --write-to-db   <------ IPv6 BGP configuration

config save -y
config reload -y
  3. Run "warm-reboot -v" from an SSH session and disconnect the session immediately to simulate an SSH session drop.
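
The disconnect in step 3 can be scripted so that it happens at a repeatable moment. The sketch below is only an illustration: the function name run_then_drop, the delay, and the host name in the commented usage line are assumptions, not part of the original report. It starts a client command in the background and kills it after a delay, which imitates an SSH client that dies mid-operation.

```shell
# Start a command in the background, then kill it after a delay.
# Usage: run_then_drop <delay-seconds> <command> [args...]
run_then_drop() {
    local delay="$1"
    shift
    "$@" &                      # start the client (for example ssh) in the background
    local pid=$!
    sleep "$delay"              # let the remote command get under way
    kill "$pid" 2>/dev/null     # drop the client abruptly
    wait "$pid" 2>/dev/null     # reap it; its exit status is not interesting here
    return 0
}

# Hypothetical usage against a lab switch (host name is an assumption).
# -tt forces a terminal to be allocated, as in an interactive login:
# run_then_drop 5 ssh -tt admin@dut.example 'sudo warm-reboot -v'
```

Killing the client rather than typing exit keeps the timing of the drop consistent between runs.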

Describe the results you received
After the SSH session disconnects, the rcon session shows that warm-reboot stops at the syncd shutdown stage, and the system remains unavailable until it is rebooted.

Describe the results you expected
Warm-reboot should be resistant to SSH disconnects: it should not stop when the SSH session drops for any reason (for example, the SSH client being disrupted because a server hangs), and it should not crash the system.
In a data center this is dangerous, because the system cannot be upgraded with confidence.

Output of show version

SONiC Software Version: SONiC.sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023
Distribution: Debian 10.8
Kernel: 4.19.0-12-2-amd64
Build commit: 5cb07fad
Build date: Fri Mar 19 12:59:10 UTC 2021
Built by: vadymh@r-build-sonic03

Platform: x86_64-mlnx_msn3800-r0
HwSKU: ACS-MSN3800
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1937X00565
Uptime: 13:31:35 up 3 min,  1 user,  load average: 1.08, 0.84, 0.37

Docker images:
REPOSITORY                    TAG                                                   IMAGE ID            SIZE
docker-syncd-mlnx             latest                                                41b21b570cb4        662MB
docker-syncd-mlnx             sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   41b21b570cb4        662MB
docker-sflow                  latest                                                daec01591301        409MB
docker-sflow                  sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   daec01591301        409MB
docker-snmp                   latest                                                8b1fd0019321        439MB
docker-snmp                   sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   8b1fd0019321        439MB
docker-dhcp-relay             latest                                                6ccc99d7ad71        405MB
docker-dhcp-relay             sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   6ccc99d7ad71        405MB
docker-teamd                  latest                                                d99c06d2ed81        408MB
docker-teamd                  sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   d99c06d2ed81        408MB
docker-nat                    latest                                                422e6bfc371f        411MB
docker-nat                    sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   422e6bfc371f        411MB
docker-router-advertiser      latest                                                d97a9c6d2e9a        398MB
docker-router-advertiser      sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   d97a9c6d2e9a        398MB
docker-platform-monitor       latest                                                c30aaf3e6de1        689MB
docker-platform-monitor       sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   c30aaf3e6de1        689MB
docker-lldp                   latest                                                ecf251673b0a        438MB
docker-lldp                   sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   ecf251673b0a        438MB
docker-database               latest                                                3ee3d100394d        398MB
docker-database               sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   3ee3d100394d        398MB
docker-sonic-mgmt-framework   latest                                                4d5e159ef26e        617MB
docker-sonic-mgmt-framework   sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   4d5e159ef26e        617MB
docker-orchagent              latest                                                cd97498e4604        427MB
docker-orchagent              sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   cd97498e4604        427MB
docker-sonic-telemetry        latest                                                eadc019e1177        487MB
docker-sonic-telemetry        sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   eadc019e1177        487MB
docker-fpm-frr                latest                                                cecb8f7c5f6d        426MB
docker-fpm-frr                sonic_build_358_nbrmgrd_fix.0-dirty-20210319.125023   cecb8f7c5f6d        426MB

This is a degradation in comparison to the 201911 release.

sonic_dump_r-qa-sw-eth-2322_20210322_135917.tar.gz

@ghost

ghost commented Mar 25, 2021

Investigating the issue.

@ghost

ghost commented Mar 26, 2021

Investigation result.
Warm reboot fails when the SSH session disconnects because the warm-reboot process is attached to the terminal session in which it is running. The only solution I found is to run the script in the background, detached from the current terminal session, with its output redirected to a file.
I have created a PR with the fix: sonic-net/sonic-utilities#1529
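
A minimal sketch of that approach, not the code from the PR: run the command in a new session so that it has no controlling terminal, and point all of its standard streams away from the terminal. The function name run_detached and the log path in the usage line are illustrative. It assumes setsid (from util-linux) is installed and falls back to nohup otherwise.

```shell
# Run a command so that losing the terminal does not affect it.
# Usage: run_detached <log-file> <command> [args...]
run_detached() {
    local log="$1"
    shift
    if command -v setsid >/dev/null 2>&1; then
        # New session: the command has no controlling terminal at all.
        setsid "$@" </dev/null >"$log" 2>&1 &
    else
        # Fallback: still tied to the terminal, but SIGHUP is ignored.
        nohup "$@" </dev/null >"$log" 2>&1 &
    fi
}

# Hypothetical usage (log path is an assumption):
# run_detached /tmp/long-job.log sh -c 'sleep 60; echo done'
```

Redirecting stdin, stdout, and stderr matters as much as the detach itself: a process that survives the hangup can still fail later if it writes to a terminal that no longer exists.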

@Hedgehog-Guru, could you clarify what you mean by the comment "This is a degradation in comparison to the 201911 version"? I tested the issue on 201911 and the behavior is the same as on the master branch.

@liat-grozovik
Collaborator

@maksymbelei95 thanks for the PR. I think the same issue will occur with fastboot, or with any long-running command whose SSH session can be closed midway.
May I suggest you do the same for the fastboot command as well? But let's first see that it is approved for warm boot.

@ghost

ghost commented Mar 30, 2021

@liat-grozovik, OK, I will make a note of this.

@Hedgehog-Guru
Author

> @Hedgehog-Guru, could you clarify what you mean by the comment "This is a degradation in comparison to the 201911 version"? I tested the issue on 201911 and the behavior is the same as on the master branch.

@maksymbelei95, according to my analysis the issue does not exist on the 201911 branch.
Perhaps it exists in master, but I didn't test that. The 202012 branch was more important for me.
Thanks.

anshuv-mfst added the Triaged label Mar 31, 2021
@stepanblyschak
Collaborator

@maksymbelei95 This might be a more general issue. As I understand it, any SONiC command will be killed when the SSH session drops. @Hedgehog-Guru We need an analysis of disruptive commands that could lead to switch failure if the SSH session that invoked them drops (e.g., does config reload leave the switch in a bad state if interrupted?).
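
For background on why this is likely to be general: when a terminal goes away, the kernel sends SIGHUP to processes in the session attached to it (typically the shell and the foreground job), and the default action for SIGHUP is to terminate the process. The sketch below shows the effect locally by sending SIGHUP to a child by hand, once started plainly and once under nohup. The function name survives_hup is illustrative, and the result for the plain case depends on how the calling shell itself was started.

```shell
# Return 0 if a child started via the given wrapper survives SIGHUP, 1 otherwise.
# Usage: survives_hup <wrapper-command>   (for example: env, nohup)
survives_hup() {
    "$@" sleep 30 >/dev/null 2>&1 &      # wrapper execs sleep, so $! is sleep's PID
    local pid=$!
    sleep 1                              # give the wrapper time to exec
    kill -HUP "$pid"                     # what a vanished terminal would deliver
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then  # still running: SIGHUP was ignored
        kill -KILL "$pid"
        wait "$pid" 2>/dev/null
        return 0
    fi
    wait "$pid" 2>/dev/null
    return 1
}

if survives_hup env;   then echo "plain: survived"; else echo "plain: terminated by SIGHUP"; fi
if survives_hup nohup; then echo "nohup: survived"; else echo "nohup: terminated by SIGHUP"; fi
```

A command that is interrupted this way stops wherever it happens to be, which is why multi-step operations such as config reload are worth auditing.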

liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this issue Nov 25, 2021
#1529)

Starting the script in background mode and detaching the process from the terminal session to prevent failures caused by the terminal session being closed or suddenly disconnected.
Redirecting the output of the script to a file to prevent failures when the script tries to write to the file descriptor of a terminal session that no longer exists.
Adding a new parameter to the script so that it can be run explicitly in foreground mode with output to the terminal.
Updating the command reference doc according to the changes.

- What I did
Resolves sonic-net/sonic-buildimage#7127
Fixed warm-reboot failures that occur when the SSH session is disconnected.
Because the script now runs in background mode by default, added a parameter to run it explicitly in foreground mode, as before.
Updated the command reference according to the changes.

- How I did it
By restarting the script in background mode, detached from the terminal session.
All output is redirected to /var/log/warm-reboot.txt for warm-reboot or /var/log/fast-reboot.txt for fast-reboot, depending on REBOOT_TYPE. This prevents the script from crashing when it tries to write to the file descriptor of the disconnected terminal session.
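
The restart-in-background step might look roughly like the sketch below. It is an illustration under stated assumptions, not the code that was merged: the function name relaunch_detached, the LOG_DIR variable, and the ALREADY_DETACHED guard variable are all hypothetical. Only the log file naming (REBOOT_TYPE plus .txt under /var/log) and the message text come from the description above.

```shell
# Directory for the log file; /var/log matches the description above.
LOG_DIR="${LOG_DIR:-/var/log}"

# Re-run a script in the background, detached, with output going to
# <LOG_DIR>/<reboot-type>.txt. ALREADY_DETACHED marks the second pass.
# Usage: relaunch_detached <reboot-type> <script> [args...]
relaunch_detached() {
    local reboot_type="$1" script="$2"
    shift 2
    local log_path="${LOG_DIR}/${reboot_type}.txt"
    echo "Detaching the process from the terminal session. Redirecting output to ${log_path}."
    if command -v setsid >/dev/null 2>&1; then
        ALREADY_DETACHED=yes setsid "$script" "$@" </dev/null >"$log_path" 2>&1 &
    else
        ALREADY_DETACHED=yes nohup "$script" "$@" </dev/null >"$log_path" 2>&1 &
    fi
}

# Intended use near the top of a reboot script (commented out so that this
# sketch is safe to source):
# if [ -z "${ALREADY_DETACHED:-}" ]; then
#     relaunch_detached "${REBOOT_TYPE}" "$0" "$@"
#     exit 0
# fi
```

The guard variable is what stops the detached copy from relaunching itself again. A foreground option would simply skip the guarded block.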

- How to verify it
1. Connect to the switch over SSH;
2. Run sudo warm-reboot -v;
3. Check the progress of the warm reboot with cat /var/log/warm-reboot.txt;
4. Close the SSH connection before the warm reboot finishes;
    The warm reboot should finish successfully regardless of the state of the SSH session.
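
Step 3 can also be automated by polling the log for a line of interest instead of re-running cat by hand. This is a sketch only: the function name wait_for_log is illustrative, and the pattern in the commented usage line is a placeholder, since the exact messages written to the log are not listed here.

```shell
# Poll a log file until a pattern appears or a timeout expires.
# Usage: wait_for_log <file> <pattern> <timeout-seconds>
# Returns 0 if the pattern was seen, 1 on timeout.
wait_for_log() {
    local file="$1" pattern="$2" timeout="$3"
    local waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # The file may not exist yet, so errors from grep are silenced.
        if grep -q -- "$pattern" "$file" 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Hypothetical usage (the pattern is a placeholder, not a real log message):
# wait_for_log /var/log/warm-reboot.txt "PLACEHOLDER-PATTERN" 300
```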

- New command output (if the output of a command-line utility has changed)
The script runs in detached background mode with its output written to the file. The following message is shown in the terminal before the script restarts in background mode:

admin@sonic:~$ sudo warm-reboot
Detaching the process from the terminal session. Redirecting output to /var/log/warm-reboot.txt.

All the usual logs are written to warm-reboot.txt.
abdosi pushed a commit to sonic-net/sonic-utilities that referenced this issue Dec 8, 2021
#1529)

malletvapid23 added a commit to malletvapid23/Sonic-Utility that referenced this issue Aug 3, 2023
…n (#1529)

4 participants