Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CLI configuration options for teamd retry count feature #2642

Merged
merged 28 commits into from
Jun 2, 2023

Conversation

saiarcot895
Copy link
Contributor

@saiarcot895 saiarcot895 commented Feb 1, 2023

Add a SONiC CLI to more easily configure the retry count for port channels. This effectively acts like a wrapper around the underlying teamdctl command.

Also add a python script that'll be installed into /usr/local/bin/teamd_increase_retry_count.py that will detect if the peer device likely supports this teamd feature (based on LLDP neighbor info) and increases the teamd retry count to 5 in preparation for warm upgrade. This script requires sudo to run.

This is tied to sonic-net/sonic-buildimage#13453.

Signed-off-by: Saikrishna Arcot sarcot@microsoft.com

What I did

How I did it

How to verify it

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

admin@vlab-02:~$ sudo config portchannel retry-count get PortChannel101
3
admin@vlab-02:~$ sudo config portchannel retry-count get PortChannel999
Usage: config portchannel retry-count get [OPTIONS] <portchannel_name>
Try "config portchannel retry-count get -h" for help.

Error: PortChannel999 is not present.
admin@vlab-02:~$ sudo config portchannel retry-count set PortChannel101 3
admin@vlab-02:~$ sudo config portchannel retry-count set PortChannel101 5

Add a SONiC CLI to more easily configure the retry count for port
channels. This effectively acts like a wrapper around the underlying
teamdctl command.

Also add a python script that'll be installed into
/usr/local/bin/teamd_increase_retry_count.py that will detect if the
peer device likely supports this teamd feature (based on LLDP neighbor
info) and increases the teamd retry count to 5 in preparation for warm
upgrade. This script requires sudo to run.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
yxieca
yxieca previously approved these changes Feb 10, 2023
… or set

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Scapy's IPv6 support appears to have caused some issues with older
versions of scapy, which may be present on older SONiC images.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Don't fail warm reboot if teamd retry count support doesn't happen to be
present.

Also use fastfast-reboot for Mellanox devices.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
config/main.py Outdated Show resolved Hide resolved
config/main.py Outdated Show resolved Hide resolved
config/main.py Outdated Show resolved Hide resolved
config/main.py Outdated Show resolved Hide resolved
scripts/fast-reboot Outdated Show resolved Hide resolved
scripts/teamd_increase_retry_count.py Show resolved Hide resolved
scripts/teamd_increase_retry_count.py Outdated Show resolved Hide resolved
scripts/teamd_increase_retry_count.py Show resolved Hide resolved
scripts/teamd_increase_retry_count.py Outdated Show resolved Hide resolved
scripts/fast-reboot Outdated Show resolved Hide resolved
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Also update a failure message in the warm-reboot script if the retry
count script fails.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
/usr/local/bin/teamd_increase_retry_count.py --probe-only || TEAMD_RETRY_COUNT_PROBE_RC=$?
if [[ TEAMD_RETRY_COUNT_PROBE_RC -ne 0 ]]; then
debug "Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available"
TEAMD_INCREASE_RETRY_COUNT=0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we supposed to bail out from here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the images we deploy to our internal environment, this will be changed to bail out and fail the warm-reboot if SONiC neighbors are not seen. For the public version, since we can't require using SONiC-only neighbors for warm reboot, this will only be a warning.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…equired or not

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
This is so that in the case of the teamd retry count check failing,
there's fewer changes that happen on the system (it'll fail faster).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
This doesn't need to be after the point-of-no-return, since this will
detach and be sending LACPDUs on its own.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@saiarcot895 saiarcot895 requested a review from yxieca May 26, 2023 17:11
yxieca
yxieca previously approved these changes May 26, 2023
Copy link
Contributor

@vaibhavhd vaibhavhd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Posted minor concerns.

scripts/fast-reboot Show resolved Hide resolved
if [[ "${REBOOT_TYPE}" = "warm-reboot" || "${REBOOT_TYPE}" = "fastfast-reboot" ]]; then
TEAMD_RETRY_COUNT_PROBE_RC=0
/usr/local/bin/teamd_increase_retry_count.py --probe-only || TEAMD_RETRY_COUNT_PROBE_RC=$?
if [[ $TEAMD_RETRY_COUNT_PROBE_RC -ne 0 ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: use ${} syntax

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

/usr/local/bin/teamd_increase_retry_count.py --probe-only || TEAMD_RETRY_COUNT_PROBE_RC=$?
if [[ $TEAMD_RETRY_COUNT_PROBE_RC -ne 0 ]]; then
if [[ "${REQUIRE_TEAMD_RETRY_COUNT}" = "yes" ]]; then
error "Could not confirm that all neighbor devices are running SONiC with the retry count feature"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you are missing an exit w/ error code here?
Wouldn't the logic detect the error state and still continue to warm-reboot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think error does the exit as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it doesn't, not sure why I didn't see this not failing before. Added an exit with a new error code here.

error "Could not confirm that all neighbor devices are running SONiC with the retry count feature"
else
debug "Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available"
TEAMD_INCREASE_RETRY_COUNT=0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed to be set to 0 again? The variable is initially set to 0 anyway, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, removed this line.

@@ -663,6 +688,10 @@ if [[ "$REBOOT_TYPE" = "fast-reboot" ]]; then
fi
fi

if [[ ( "${REBOOT_TYPE}" = "warm-reboot" || "${REBOOT_TYPE}" = "fastfast-reboot" ) && "${TEAMD_INCREASE_RETRY_COUNT}" -eq 1 ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this feature is needed even for fast-reboot?
For fast-reboot it will help avoid LAG flaps too - if the boot up is slow enough that 90s is reached before even syncd does ASIC reset call. So essentially, preventing LAG flap other than where it should really happen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make decision on this later too.

if os.geteuid() != 0:
log.log_error("Root privileges required for this operation", also_print_to_console=True)
sys.exit(1)
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This return will never be reached after sys.exit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the extra returns.

return False
portChannels = getPortChannels()
if not portChannels:
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this main returning True/False? What is checking on these return values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at one point of time, I had some logic that would indicate success/failure via the True/False returns, but this isn't used at all. Removed the True/False returns.

activePortChannels = []
for portChannel in portChannels:
state = portChannelTable.get(portChannel)
if not state[0]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risk of index error here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think libswsscommon returns whether the call is a success or failure in the first index of the tuple that's returned, but regardless, I added a check to see if an empty tuple/list was returned.

log.log_warning("WARNING: No peer description available via LLDP for {}; skipping".format(portName))
continue
portChannelChecked = True
if "SONiC" not in peerInfo["descr"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Check w/ ignore case here might be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
vaibhavhd
vaibhavhd previously approved these changes May 30, 2023
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@saiarcot895 saiarcot895 merged commit 72ca484 into sonic-net:master Jun 2, 2023
5 checks passed
@saiarcot895 saiarcot895 deleted the teamd-retry-count-cli branch June 2, 2023 20:34
dprital added a commit to dprital/sonic-buildimage that referenced this pull request Jun 5, 2023
Update sonic-utilities submodule pointer to include the following:
* 5c9b2177 Fix issue: out of range sflow polling interval is accepted and stored in config_db ([sonic-net#2847](sonic-net/sonic-utilities#2847))
* 72ca4848 Add CLI configuration options for teamd retry count feature ([sonic-net#2642](sonic-net/sonic-utilities#2642))
* 359dfc0c [Clock] Implement clock CLI ([sonic-net#2793](sonic-net/sonic-utilities#2793))
* b316fc27 Add transceiver status CLI to show output from TRANSCEIVER_STATUS table ([sonic-net#2772](sonic-net/sonic-utilities#2772))
* dc59dbd2 Replace pickle by json ([sonic-net#2849](sonic-net/sonic-utilities#2849))
* a66f41c4 [show] replace shell=True, replace xml by lxml, replace exit by sys.exit ([sonic-net#2666](sonic-net/sonic-utilities#2666))
* 57500572 [utilities_common] replace shell=True ([sonic-net#2718](sonic-net/sonic-utilities#2718))
* 6e0ee3e7 [CRM][DASH] Extend CRM utility to support DASH resources. ([sonic-net#2800](sonic-net/sonic-utilities#2800))
* b2c29b0b [config] Generate sysinfo in single asic ([sonic-net#2856](sonic-net/sonic-utilities#2856))

Signed-off-by: dprital <drorp@nvidia.com>
StormLiangMS pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 5, 2023
…nic-utilities submodule on master (#15193)

Dependency:
sonic-net/sonic-utilities#2718

Why I did it
This PR sonic-net/sonic-utilities#2718 reduce shell=True usage in utilities_common.cli.run_command() function.

Work item tracking
Microsoft ADO (number only): 15022050
How I did it
Replace strings commands using utilities_common.cli.run_command() function to list of strings

due to circular dependency, advance sonic-utilities submodule
72ca4848 (HEAD -> master, upstream/master, upstream/HEAD) Add CLI configuration options for teamd retry count feature (sonic-net/sonic-utilities#2642)
359dfc0c [Clock] Implement clock CLI (sonic-net/sonic-utilities#2793)
b316fc27 Add transceiver status CLI to show output from TRANSCEIVER_STATUS table (sonic-net/sonic-utilities#2772)
dc59dbd2 Replace pickle by json (sonic-net/sonic-utilities#2849)
a66f41c4 [show] replace shell=True, replace xml by lxml, replace exit by sys.exit (sonic-net/sonic-utilities#2666)
57500572 [utilities_common] replace shell=True (sonic-net/sonic-utilities#2718)
6e0ee3e7 [CRM][DASH] Extend CRM utility to support DASH resources. (sonic-net/sonic-utilities#2800)
b2c29b0b [config] Generate sysinfo in single asic (sonic-net/sonic-utilities#2856)
pdhruv-marvell pushed a commit to pdhruv-marvell/sonic-utilities that referenced this pull request Aug 23, 2023
…t#2642)

* Add CLI configuration options for teamd retry count feature

Add a SONiC CLI to more easily configure the retry count for port
channels. This effectively acts like a wrapper around the underlying
teamdctl command.

Also add a python script that'll be installed into
/usr/local/bin/teamd_increase_retry_count.py that will detect if the
peer device likely supports this teamd feature (based on LLDP neighbor
info) and increases the teamd retry count to 5 in preparation for warm
upgrade. This script requires sudo to run.

This is tied to sonic-net/sonic-buildimage#13453.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add test for error case from teamd when it's not running

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix up test cases

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add some error handling if teamdctl doesn't exist

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add probe functionality and sending current LACPDU packet functionality

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Check to see if the retry count feature is enabled before doing a get or set

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add option to only send probe packets or only change retry count

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Call the teamd retry count script if doing a warm-reboot

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix pycheck errors, and disable scapy's IPv6 and verbose mode

Scapy's IPv6 support appears to have caused some issues with older
versions of scapy, which may be present on older SONiC images.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Make teamd retry count support optional

Don't fail warm reboot if teamd retry count support doesn't happen to be
present.

Also use fastfast-reboot for Mellanox devices.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address review comments, and restructure code to increase code coverage

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address some review comments

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Replace tabs with spaces

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Verify that expected keys are present in the data returned from teamdctl

Also update a failure message in the warm-reboot script if the retry
count script fails.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix TimeoutExpired undefined error

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add ability to mock subprocess calls (at a limited level)

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Return an actual subprocess object, and add a test for checking timeout

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Change variable syntax

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix set being accessed with an index

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add option to warm-reboot script to control if teamd retry count is required or not

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Move the teamd retry count check to before orchagent

This is so that in the case of the teamd retry count check failing,
there's fewer changes that happen on the system (it'll fail faster).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Move retry count script start to be prior to point-of-no-return

This doesn't need to be after the point-of-no-return, since this will
detach and be sending LACPDUs on its own.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Set executable bit

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address PR comments

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Change to case-insensitive string contains check

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Make sure the global abort variable is used

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

---------

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…nic-utilities submodule on master (sonic-net#15193)

Dependency:
sonic-net/sonic-utilities#2718

Why I did it
This PR sonic-net/sonic-utilities#2718 reduce shell=True usage in utilities_common.cli.run_command() function.

Work item tracking
Microsoft ADO (number only): 15022050
How I did it
Replace strings commands using utilities_common.cli.run_command() function to list of strings

due to circular dependency, advance sonic-utilities submodule
72ca4848 (HEAD -> master, upstream/master, upstream/HEAD) Add CLI configuration options for teamd retry count feature (sonic-net/sonic-utilities#2642)
359dfc0c [Clock] Implement clock CLI (sonic-net/sonic-utilities#2793)
b316fc27 Add transceiver status CLI to show output from TRANSCEIVER_STATUS table (sonic-net/sonic-utilities#2772)
dc59dbd2 Replace pickle by json (sonic-net/sonic-utilities#2849)
a66f41c4 [show] replace shell=True, replace xml by lxml, replace exit by sys.exit (sonic-net/sonic-utilities#2666)
57500572 [utilities_common] replace shell=True (sonic-net/sonic-utilities#2718)
6e0ee3e7 [CRM][DASH] Extend CRM utility to support DASH resources. (sonic-net/sonic-utilities#2800)
b2c29b0b [config] Generate sysinfo in single asic (sonic-net/sonic-utilities#2856)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants