Add CLI configuration options for teamd retry count feature #2642

Merged: 28 commits merged on Jun 2, 2023 (changes shown from 25 commits)

Commits:
42b20c1
Add CLI configuration options for teamd retry count feature
saiarcot895 Feb 1, 2023
b42b635
Add test for error case from teamd when it's not running
saiarcot895 Feb 1, 2023
e9e9af0
Fix up test cases
saiarcot895 Feb 1, 2023
f834b8a
Add some error handling if teamdctl doesn't exist
saiarcot895 Feb 7, 2023
bd40c1b
Merge commit 'd433b2f954e446db7a655e882a7274cd5bce3a50' into teamd-re…
saiarcot895 Apr 20, 2023
7fc5ebd
Add probe functionality and sending current LACPDU packet functionality
saiarcot895 Apr 26, 2023
b5b372b
Check to see if the retry count feature is enabled before doing a get…
saiarcot895 May 4, 2023
c3c6b2e
Add option to only send probe packets or only change retry count
saiarcot895 May 4, 2023
ad54c4c
Call the teamd retry count script if doing a warm-reboot
saiarcot895 May 4, 2023
fc9195f
Fix pycheck errors, and disable scapy's IPv6 and verbose mode
saiarcot895 May 15, 2023
44a6712
Make teamd retry count support optional
saiarcot895 May 15, 2023
5aa89b5
Address review comments, and restructure code to increase code coverage
saiarcot895 May 17, 2023
1a4e17f
Address some review comments
saiarcot895 May 17, 2023
bbad1e3
Replace tabs with spaces
saiarcot895 May 17, 2023
5acd304
Verify that expected keys are present in the data returned from teamdctl
saiarcot895 May 17, 2023
0f4f822
Merge commit '7d2ca0b' into teamd-retry-count-cli
saiarcot895 May 18, 2023
e6acfe0
Fix TimeoutExpired undefined error
saiarcot895 May 18, 2023
9e2d7a3
Add ability to mock subprocess calls (at a limited level)
saiarcot895 May 19, 2023
65c1bdb
Return an actual subprocess object, and add a test for checking timeout
saiarcot895 May 19, 2023
edea4ce
Change variable syntax
saiarcot895 May 22, 2023
f76e1e9
Fix set being accessed with an index
saiarcot895 May 22, 2023
e547f50
Add option to warm-reboot script to control if teamd retry count is r…
saiarcot895 May 23, 2023
a5966f2
Move the teamd retry count check to before orchagent
saiarcot895 May 23, 2023
1670cb7
Move retry count script start to be prior to point-of-no-return
saiarcot895 May 23, 2023
ee72908
Set executable bit
saiarcot895 May 23, 2023
c719f32
Address PR comments
saiarcot895 May 30, 2023
01852c6
Change to case-insensitive string contains check
saiarcot895 May 30, 2023
18fba7c
Make sure the global abort variable is used
saiarcot895 Jun 2, 2023
87 changes: 87 additions & 0 deletions config/main.py
@@ -2199,6 +2199,93 @@ def del_portchannel_member(ctx, portchannel_name, port_name):
except JsonPatchConflict:
ctx.fail("Invalid or nonexistent portchannel or interface. Please ensure existence of portchannel member.")

@portchannel.group(cls=clicommon.AbbreviationGroup, name='retry-count')
@click.pass_context
def portchannel_retry_count(ctx):
pass

def check_if_retry_count_is_enabled(ctx, portchannel_name):
try:
proc = subprocess.Popen(["teamdctl", portchannel_name, "state", "item", "get", "runner.enable_retry_count_feature"], text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, err = proc.communicate(timeout=10)
if proc.returncode != 0:
ctx.fail("Unable to determine if the retry count feature is enabled or not: {}".format(err.strip()))
return output.strip() == "true"
except subprocess.TimeoutExpired as e:
proc.kill()
proc.communicate()
ctx.fail("Unable to determine if the retry count feature is enabled or not: {}".format(e))
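The helper above uses the standard Popen timeout pattern: call communicate() with a deadline, and on TimeoutExpired kill the child and call communicate() again to reap it and drain its pipes. A minimal stand-alone sketch of the same pattern, using echo/sleep in place of teamdctl so it runs anywhere (the helper name is illustrative, not from the PR):

```python
import subprocess

def run_with_timeout(cmd, timeout):
    """Run cmd; return (returncode, stdout), or (None, "") if it timed out."""
    proc = subprocess.Popen(cmd, text=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        proc.kill()           # terminate the stuck child
        proc.communicate()    # reap it and drain the pipes
        return None, ""

print(run_with_timeout(["echo", "true"], 5))   # (0, 'true\n')
```

The second communicate() after kill() matters: without it the child stays a zombie and its pipe buffers are never flushed.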

@portchannel_retry_count.command('get')
@click.argument('portchannel_name', metavar='<portchannel_name>', required=True)
@click.pass_context
def get_portchannel_retry_count(ctx, portchannel_name):
"""Get the retry count for a port channel"""
db = ValidatedConfigDBConnector(ctx.obj['db'])

# Don't proceed if the port channel name is not valid
if is_portchannel_name_valid(portchannel_name) is False:
ctx.fail("{} is invalid!, name should have prefix '{}' and suffix '{}'"
.format(portchannel_name, CFG_PORTCHANNEL_PREFIX, CFG_PORTCHANNEL_NO))

# Don't proceed if the port channel does not exist
if is_portchannel_present_in_db(db, portchannel_name) is False:
ctx.fail("{} is not present.".format(portchannel_name))

try:
is_retry_count_enabled = check_if_retry_count_is_enabled(ctx, portchannel_name)
if not is_retry_count_enabled:
ctx.fail("Retry count feature is not enabled!")

proc = subprocess.Popen(["teamdctl", portchannel_name, "state", "item", "get", "runner.retry_count"], text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, err = proc.communicate(timeout=10)
if proc.returncode != 0:
ctx.fail("Unable to get the retry count: {}".format(err.strip()))
click.echo(output.strip())
except FileNotFoundError:
ctx.fail("Unable to get the retry count: teamdctl could not be run")
except subprocess.TimeoutExpired as e:
proc.kill()
proc.communicate()
ctx.fail("Unable to get the retry count: {}".format(e))
except Exception as e:
ctx.fail("Unable to get the retry count: {}".format(e))

@portchannel_retry_count.command('set')
@click.argument('portchannel_name', metavar='<portchannel_name>', required=True)
@click.argument('retry_count', metavar='<retry_count>', required=True, type=click.IntRange(3,10))
@click.pass_context
def set_portchannel_retry_count(ctx, portchannel_name, retry_count):
"""Set the retry count for a port channel"""
db = ValidatedConfigDBConnector(ctx.obj['db'])

# Don't proceed if the port channel name is not valid
if is_portchannel_name_valid(portchannel_name) is False:
ctx.fail("{} is invalid!, name should have prefix '{}' and suffix '{}'"
.format(portchannel_name, CFG_PORTCHANNEL_PREFIX, CFG_PORTCHANNEL_NO))

# Don't proceed if the port channel does not exist
if is_portchannel_present_in_db(db, portchannel_name) is False:
ctx.fail("{} is not present.".format(portchannel_name))

try:
is_retry_count_enabled = check_if_retry_count_is_enabled(ctx, portchannel_name)
if not is_retry_count_enabled:
ctx.fail("Retry count feature is not enabled!")

proc = subprocess.Popen(["teamdctl", portchannel_name, "state", "item", "set", "runner.retry_count", str(retry_count)], text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, err = proc.communicate(timeout=10)
if proc.returncode != 0:
ctx.fail("Unable to set the retry count: {}".format(err.strip()))
except FileNotFoundError:
ctx.fail("Unable to set the retry count: teamdctl could not be run")
except subprocess.TimeoutExpired as e:
proc.kill()
proc.communicate()
ctx.fail("Unable to set the retry count: {}".format(e))
except Exception as e:
ctx.fail("Unable to set the retry count: {}".format(e))
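The set command bounds retry_count via click.IntRange(3, 10), so out-of-range values are rejected before the command body runs. A stdlib-only sketch of the equivalent check (a hypothetical helper, not part of the PR):

```python
def validate_retry_count(value):
    """Accept an integer in [3, 10], mirroring click.IntRange(3, 10)."""
    n = int(value)  # raises ValueError for non-integer input
    if not 3 <= n <= 10:
        raise ValueError(f"{n} is not in the valid range 3..10")
    return n
```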


#
# 'mirror_session' group ('config mirror_session ...')
31 changes: 30 additions & 1 deletion scripts/fast-reboot
@@ -30,6 +30,7 @@ TAG_LATEST=yes
DETACH=no
LOG_PATH="/var/log/${REBOOT_TYPE}.txt"
UIMAGE_HDR_SIZE=64
REQUIRE_TEAMD_RETRY_COUNT=no

# Require 100M available on the hard drive for warm reboot temp files,
# Size is in 1K blocks:
@@ -78,13 +79,15 @@ function showHelpAndExit()
echo " -t : Don't tag the current kube images as latest"
echo " -D : detached mode - closing terminal will not cause stopping reboot"
echo " -u : include ssd-upgrader-part in boot options"
echo " -n : don't require peer devices to be running SONiC with retry count feature [default]"
echo " -N : require peer devices to be running SONiC with retry count feature"

exit "${EXIT_SUCCESS}"
}

function parseOptions()
{
-    while getopts "vfidh?rkxc:sDu" opt; do #TODO "t" is missing
+    while getopts "vfidh?rkxc:sDunN" opt; do #TODO "t" is missing
case ${opt} in
h|\? )
showHelpAndExit
@@ -125,6 +128,12 @@ function parseOptions()
u )
SSD_FW_UPDATE_BOOT_OPTION=yes
;;
n )
REQUIRE_TEAMD_RETRY_COUNT=no
;;
N )
REQUIRE_TEAMD_RETRY_COUNT=yes
;;
esac
done
}
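The -n/-N pair toggles a single variable, with the last flag on the command line winning (standard getopts behavior). For illustration only, the same flag semantics can be mirrored with Python's argparse (the script itself uses bash getopts; this sketch is an assumption, not part of the PR):

```python
import argparse

# Two flags writing the same destination: argparse applies actions in
# command-line order, so the last of -n/-N wins, matching bash getopts.
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("-n", dest="require_teamd_retry_count", action="store_false",
                    help="don't require peers to run the retry count feature (default)")
parser.add_argument("-N", dest="require_teamd_retry_count", action="store_true",
                    help="require peers to run the retry count feature")
parser.set_defaults(require_teamd_retry_count=False)

args = parser.parse_args(["-N"])
```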
@@ -635,6 +644,22 @@ init_warm_reboot_states

setup_control_plane_assistant

TEAMD_INCREASE_RETRY_COUNT=0
if [[ "${REBOOT_TYPE}" = "warm-reboot" || "${REBOOT_TYPE}" = "fastfast-reboot" ]]; then
TEAMD_RETRY_COUNT_PROBE_RC=0
/usr/local/bin/teamd_increase_retry_count.py --probe-only || TEAMD_RETRY_COUNT_PROBE_RC=$?
if [[ $TEAMD_RETRY_COUNT_PROBE_RC -ne 0 ]]; then
Contributor: Minor: use ${} syntax

Author (saiarcot895): Changed.

if [[ "${REQUIRE_TEAMD_RETRY_COUNT}" = "yes" ]]; then
error "Could not confirm that all neighbor devices are running SONiC with the retry count feature"
Contributor: Looks like you are missing an exit w/ error code here? Wouldn't the logic detect the error state and still continue to warm-reboot?

Author (saiarcot895): I think error does the exit as well?

Author (saiarcot895): Looks like it doesn't, not sure why I didn't see this not failing before. Added an exit with a new error code here.

else
debug "Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available"
TEAMD_INCREASE_RETRY_COUNT=0
Contributor: Is this needed to be set to 0 again? The variable is initially set to 0 anyway, right?

Author (saiarcot895): That's true, removed this line.

fi
else
TEAMD_INCREASE_RETRY_COUNT=1
fi
fi
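The block above reduces to a small decision: for warm-reboot or fastfast-reboot, a clean probe enables the retry count increase, while a failed probe either aborts (when -N required the feature) or falls back to skipping it. A hedged Python sketch of that logic (function and argument names are illustrative, not from the script):

```python
def decide_teamd_retry(reboot_type, probe_rc, require_retry_count):
    """Return True if the teamd retry count should be raised, False to skip.

    Raises SystemExit when the operator required the feature (-N) but the
    probe could not confirm all neighbors support it.
    """
    if reboot_type not in ("warm-reboot", "fastfast-reboot"):
        return False
    if probe_rc != 0:
        if require_retry_count:
            raise SystemExit(
                "Could not confirm that all neighbor devices are running "
                "SONiC with the retry count feature")
        return False  # support unknown: assume it's not available
    return True
```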

if [[ "$REBOOT_TYPE" = "warm-reboot" || "$REBOOT_TYPE" = "fastfast-reboot" || "$REBOOT_TYPE" = "fast-reboot" ]]; then
# Freeze orchagent for warm restart
# Ask orchagent_restart_check to try freeze 5 times with interval of 2 seconds,
@@ -663,6 +688,10 @@ if [[ "$REBOOT_TYPE" = "fast-reboot" ]]; then
fi
fi

if [[ ( "${REBOOT_TYPE}" = "warm-reboot" || "${REBOOT_TYPE}" = "fastfast-reboot" ) && "${TEAMD_INCREASE_RETRY_COUNT}" -eq 1 ]]; then
Contributor: I think this feature is needed even for fast-reboot? For fast-reboot it will help avoid LAG flaps too - if the boot up is slow enough that 90s is reached before even syncd does ASIC reset call. So essentially, preventing LAG flap other than where it should really happen.

Contributor: We can make decision on this later too.
/usr/local/bin/teamd_increase_retry_count.py
fi

# We are fully committed to reboot from this point on because critical
# service will go down and we cannot recover from it.
set +e