[chassis][midplane] Modify the chassisd to log expected/unexpected midplane connectivity messages #480

Merged
merged 4 commits into from
May 15, 2024

Contributor

@mlok-nokia mlok-nokia commented Apr 26, 2024

Modified the SUP chassisd check_midplane_reachability() function to use the CHASSIS_MODULE_REBOOT_INFO_TABLE data (which is set by the linecard "sudo reboot" command) to log whether a module's loss of midplane connectivity is expected or unexpected. This addresses issue sonic-net/sonic-buildimage#18540

Description

Add a new method is_module_reboot_expected() to check whether a CHASSIS_MODULE_REBOOT_INFO_TABLE|LINECARD# entry exists in CHASSIS_STATE_DB when a linecard is not reachable from the SUP. If the entry exists, the reboot is expected and check_midplane_reachability() will log "pmon#chassisd: Expected: Module LINE-CARD1 lost midplane connectivity". If the entry doesn't exist, it will log "pmon#chassisd: Unexpected: Module LINE-CARD1 lost midplane connectivity". The CHASSIS_MODULE_REBOOT_INFO_TABLE|LINECARD# entry is created and inserted by the linecard "sudo reboot" command, which is changed in an associated PR. This means that when a user issues a linecard reboot, "lost midplane connectivity" is expected. Otherwise, for example after a linecard crash or a missing-heartbeat reboot, it is unexpected.
Add new methods module_reboot_set_time() and is_module_reboot_system_up_expired() to check whether a linecard that went through an expected reboot has failed to come up and be detected by the SUP within 3 minutes. In that case check_midplane_reachability() will log an "Unexpected" message stating that midplane connectivity is not restored (see the timeout test case under "How Has This Been Tested?"). This gives the monitoring tool a log message on which it can take further action.
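The decision flow described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: a plain dict stands in for CHASSIS_STATE_DB, the class name `MidplaneMonitor` and the method `classify_loss` are hypothetical, the method signatures are assumptions, and the 180-second default is taken from the test log in this description.

```python
import time

REBOOT_INFO_TABLE = "CHASSIS_MODULE_REBOOT_INFO_TABLE"
DEFAULT_TIMEOUT_SECS = 180  # assumption: matches the "180 seconds" test log


class MidplaneMonitor:
    """Hypothetical stand-in for the chassisd midplane reachability check."""

    def __init__(self, state_db, timeout=DEFAULT_TIMEOUT_SECS, clock=time.time):
        self.state_db = state_db    # dict standing in for CHASSIS_STATE_DB
        self.timeout = timeout
        self.clock = clock          # injectable so the sketch can be tested
        self.reboot_start = {}      # module name -> time the loss was first seen

    def is_module_reboot_expected(self, module):
        # An entry written by the linecard "sudo reboot" marks the reboot expected.
        key = f"{REBOOT_INFO_TABLE}|{module}"
        return self.state_db.get(key, {}).get("reboot") == "expected"

    def module_reboot_set_time(self, module):
        # Record the start time only once per outage.
        self.reboot_start.setdefault(module, self.clock())

    def is_module_reboot_system_up_expired(self, module):
        start = self.reboot_start.get(module)
        return start is not None and self.clock() - start >= self.timeout

    def classify_loss(self, module):
        """Return the message to log for an unreachable module."""
        if not self.is_module_reboot_expected(module):
            return f"Unexpected: Module {module} lost midplane connectivity"
        self.module_reboot_set_time(module)
        if self.is_module_reboot_system_up_expired(module):
            return (f"Unexpected: Module {module} midplane connectivity "
                    f"is not restored in {self.timeout} seconds")
        return f"Expected: Module {module} lost midplane connectivity"
```

Injecting the clock keeps the timeout path testable without sleeping; the real daemon would presumably read the entry from the state database rather than from a dict.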
This PR requires and is associated with the following PRs:
PR sonic-net/sonic-buildimage#18805
sonic-net/sonic-utilities#3292
#480
sonic-net/sonic-buildimage#18862

Motivation and Context

This provides a proper log message indicating whether a module's "lost midplane connectivity" event is expected or not, which gives the monitoring tool the information it needs to take further action. Fixes sonic-net/sonic-buildimage#18540

How Has This Been Tested?

This PR requires PR https://github.com/sonic-net/sonic-utilities/pull/3292 to work.

  1. Test the expected log. Use the CLI command "sudo reboot" to reboot a linecard, then check the syslog on the Supervisor. The message below is logged.
Apr 25 19:44:40.818378 ixre-cpm-chassis7 WARNING pmon#chassisd: Expected: Module LINE-CARD0 lost midplane connectivity
  2. Test the unexpected log. Use "sudo /sbin/reboot" or reboot a linecard with any crash method, then check the syslog on the Supervisor. The message below is logged.
Apr 25 19:50:22.549416 ixre-cpm-chassis7 WARNING pmon#chassisd: Unexpected: Module LINE-CARD0 lost midplane connectivity
  3. Test the expected reboot with the timeout case. Use the CLI command "sudo reboot" on the linecard and keep it down for more than 4 minutes. The messages below are logged.
Apr 25 01:25:53.877143 ixre-cpm-chassis7 WARNING sr_device_mgr: Unable to reach slot 1 (Linecard) via Midplane
Apr 25 01:25:58.402511 ixre-cpm-chassis7 WARNING pmon#chassisd: Module LINE-CARD0 went off-line!
Apr 25 01:26:01.658959 ixre-cpm-chassis7 WARNING pmon#chassisd: Expected: Module LINE-CARD0 lost midplane connectivity.
(3 minutes after the first log)
Apr 25 01:29:10.259527 ixre-cpm-chassis7 WARNING pmon#chassisd: Unexpected: Module LINE-CARD0 midplane connectivity is not restored in 180 seconds
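
As a rough illustration of how a monitoring tool might consume these messages, the sketch below classifies syslog lines with a regular expression. The pattern is an assumption derived only from the sample lines above, and `classify_log_line` is a hypothetical helper, not part of chassisd.

```python
import re

# Assumption: the message layout matches the sample syslog lines shown above.
LOG_RE = re.compile(
    r"pmon#chassisd: (?P<kind>Expected|Unexpected): "
    r"Module (?P<module>\S+) (?P<detail>.+)$"
)


def classify_log_line(line):
    """Return (kind, module, detail), or None if the line is not a match."""
    match = LOG_RE.search(line.strip())
    if match is None:
        return None
    # Some sample lines end with a period; strip it for a stable detail string.
    return match.group("kind"), match.group("module"), match.group("detail").rstrip(".")
```

Lines such as "Module LINE-CARD0 went off-line!" carry no Expected/Unexpected tag and are deliberately ignored by this sketch.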

Additional Information (Optional)

This PR needs to be backported to the following branches:
[x] 202205

@mlok-nokia
Contributor Author

@deepak-singhal0408 @judyjoseph This PR addresses the logging of lost midplane connectivity. There are 3 PRs in total. Please review them. Thanks

@bmridul
Collaborator

bmridul commented Apr 30, 2024

Can you provide details (schema) on Chassis Module Reboot Info table which is introduced here.
A small write up/enhancement to PMON Chassisd HLD will also be useful.

@bmridul
Collaborator

bmridul commented Apr 30, 2024

It is not clear why the Chassis module reboot info entry needs to be removed from platform-specific code. Isn't this handled entirely in SONiC common code?

https://github.com/sonic-net/sonic-buildimage/blob/9ad556336d7e8216daaf6bd8a769532434b2f7e4/device/nokia/x86_64-nokia_ixr7250e_36x400g-r0/platform_reboot#L16

@mlok-nokia mlok-nokia force-pushed the midplane-module-connectivity-log branch 2 times, most recently from d38ebe6 to 386748a Compare April 30, 2024 19:56
@mlok-nokia
Contributor Author

It is not clear why the Chassis module reboot info entry needs to be removed from platform-specific code. Isn't this handled entirely in SONiC common code?

https://github.com/sonic-net/sonic-buildimage/blob/9ad556336d7e8216daaf6bd8a769532434b2f7e4/device/nokia/x86_64-nokia_ixr7250e_36x400g-r0/platform_reboot#L16

On the Nokia platform, one of the unexpected reboots (the missing-heartbeat reboot) calls "sudo reboot". Since "sudo reboot" creates the CHASSIS_MODULE_REBOOT_INFO_TABLE entry that marks a reboot as expected, we need to remove the entry in this case. This is platform-specific behavior.

@mlok-nokia mlok-nokia closed this Apr 30, 2024
@mlok-nokia mlok-nokia reopened this Apr 30, 2024
@mlok-nokia
Contributor Author

Can you provide details (schema) on Chassis Module Reboot Info table which is introduced here. A small write up/enhancement to PMON Chassisd HLD will also be useful.

The CHASSIS_MODULE_REBOOT_INFO_TABLE is defined as below:
keys: "CHASSIS_MODULE_REBOOT_INFO_TABLE|"
value: "reboot" : "expected"

Example:
{
  "CHASSIS_MODULE_REBOOT_INFO_TABLE|LINE-CARD1": {
    "expireat": 1714507996.2318723,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "reboot": "expected"
    }
  }
}
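
To make the schema concrete, here is a small sketch of how a key and its hash fields could be built and parsed. The helper names are hypothetical, and the module-name suffix follows the LINE-CARD1 example above rather than any documented format.

```python
TABLE = "CHASSIS_MODULE_REBOOT_INFO_TABLE"
SEPARATOR = "|"


def reboot_info_key(module_name):
    """Build the key for a module, e.g. 'LINE-CARD1' (per the example)."""
    return f"{TABLE}{SEPARATOR}{module_name}"


def reboot_info_fields():
    """Hash fields stored under the key, per the schema above."""
    return {"reboot": "expected"}


def parse_reboot_info_key(key):
    """Extract the module name from a key, or raise ValueError."""
    table, sep, module = key.partition(SEPARATOR)
    if table != TABLE or not sep or not module:
        raise ValueError(f"not a {TABLE} key: {key!r}")
    return module
```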

@deepak-singhal0408

@mlok-nokia, could you please also add UT case?

…dplane connectivity messages

Signed-off-by: mlok <marty.lok@nokia.com>
@mlok-nokia mlok-nokia force-pushed the midplane-module-connectivity-log branch from 386748a to 918461f Compare May 3, 2024 13:48
Add a mechanism to get the linecard_reboot_timeout value from the platform_env.conf file.
This allows different platforms to have different timeout values.
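
A reader might picture the lookup along these lines. This is only a sketch under assumptions: it supposes platform_env.conf holds simple `key=value` lines, the function name is hypothetical, and the 180-second fallback is taken from the test log in the PR description rather than from the actual code.

```python
DEFAULT_LINECARD_REBOOT_TIMEOUT = 180  # assumption: fallback when no value is set


def parse_linecard_reboot_timeout(conf_text, default=DEFAULT_LINECARD_REBOOT_TIMEOUT):
    """Return linecard_reboot_timeout from key=value text, else the default."""
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "linecard_reboot_timeout":
            try:
                return int(value.strip())
            except ValueError:
                return default  # malformed value: fall back to the default
    return default
```

Taking the file contents as a string keeps the sketch independent of where the file lives on a given platform.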

@deepak-singhal0408 deepak-singhal0408 left a comment

LGTM.

@deepak-singhal0408

@bmridul @amulyan7 could you please have a look at the latest change and share any further comments? Thanks!

@mlok-nokia
Contributor Author

@mlok-nokia, could you please also add UT case?

UT has been added

Contributor

@judyjoseph judyjoseph left a comment

@judyjoseph
Contributor

@kenneth-arista could you review as well

@rlhui rlhui merged commit 88bf8ec into sonic-net:master May 15, 2024
5 checks passed
@deepak-singhal0408

MSFT ADO: 28164958

rlhui pushed a commit to sonic-net/sonic-buildimage that referenced this pull request May 31, 2024
… for Nokia-IXR7250E platform (#18862)

This PR adds the platform-specific linecard_reboot_timeout value to platform_env.conf. It works with PR sonic-net/sonic-platform-daemons#480 and sonic-net/sonic-utilities#3292 to address issue #18540

Signed-off-by: mlok <marty.lok@nokia.com>
Successfully merging this pull request may close these issues.

T2-VOQ: Correlate and rephrase the Midplane/Module connectivity related logs if its a genuine scenario
7 participants