[chassisd] Capture DPU reboot cause via boot_id and add midplane-down reason for Smart Switch #850
Open
chartsai-nvidia wants to merge 1 commit into
NPU chassisd able to read boot_id and update and persist
add subscribe to chassisd
add tests
add persist previous midplane down reason

Signed-off-by: Charles Tsai <chartsai@nvidia.com>
Why I did it
On Smart Switch platforms, `chassisd` needs a reliable way to detect when a DPU has rebooted and to record the correct reboot cause for it. The previous approach inferred reboots from midplane operational-status transitions and time-window heuristics, which was fragile and could miss or misattribute reboots. This change switches to a deterministic signal: the DPU publishes its kernel `boot_id`, and the NPU captures the reboot cause whenever that `boot_id` changes. It also adds a proper midplane-down reason (Planned vs. Unplanned) per HLD rev 0.7.
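The `boot_id` signal described above can be illustrated with a minimal sketch. This is not the code from this PR: `read_boot_id` and `BootIdTracker` are hypothetical names, and treating the first `boot_id` seen for a DPU as a baseline rather than a reboot is an assumption made only for this example. The only detail taken from the PR is the path of the kernel `boot_id` file.

```python
from pathlib import Path
from typing import Dict, Optional

# Path the Linux kernel exposes; it holds a UUID regenerated on every boot.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def read_boot_id(path: Path = BOOT_ID_PATH) -> Optional[str]:
    """Return the kernel boot_id, or None if the file is missing or empty."""
    try:
        return path.read_text().strip() or None
    except OSError:
        return None


class BootIdTracker:
    """Hypothetical helper: remembers the last boot_id seen for each DPU."""

    def __init__(self, known: Optional[Dict[str, str]] = None) -> None:
        # 'known' stands in for boot_ids loaded from persistent storage.
        self._last: Dict[str, str] = dict(known or {})

    def observe(self, dpu: str, boot_id: Optional[str]) -> bool:
        """Record boot_id for dpu; return True only if it replaced a different one."""
        if not boot_id:
            # Nothing published yet; keep the previous value untouched.
            return False
        previous = self._last.get(dpu)
        self._last[dpu] = boot_id
        # Assumption: a first-ever observation is a baseline, not a reboot.
        return previous is not None and previous != boot_id
```

A caller would invoke `observe()` each time a new `boot_id` arrives for a DPU and capture the reboot cause only when it returns `True`. Because the comparison is on identity of the boot rather than on link state or timing, a midplane flap without a reboot does not trigger a capture.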
Work item tracking
How I did it
- `DpuStateUpdater` / `DpuStateManagerTask`: read the kernel `boot_id` from `/proc/sys/kernel/random/boot_id` and publish it into the `DPU_STATE` table in `CHASSIS_STATE_DB`, updating it only when it changes.
- `SmartSwitchModuleUpdater`: add `check_dpu_boot_id_changes()` to capture and persist the reboot cause when a DPU's `boot_id` changes (only while the midplane is reachable), and load persisted `boot_id`s on startup from `previous-reboot-cause.json`.
- Add `RebootCauseSubscriberTask`, which subscribes to `DPU_STATE` changes and triggers reboot-cause capture on `boot_id` updates, replacing the old status-transition / time-window reboot heuristic.
- Persist `boot_id` alongside the reboot-cause record and simplify the history file naming by removing the separate `prev_reboot_time.txt` bookkeeping.
- Add `_resolve_midplane_down_reason()` to classify midplane up->down transitions as Planned (from the STATE_DB transition flag/type) or Unplanned (from the platform's `get_midplane_down_reason()`), and write it into `dpu_midplane_link_reason`.
How to verify it
Tests in `tests/test_chassisd.py`, with new mock support in `tests/mock_module_base.py` (midplane-down reason constants) and `tests/mock_platform.py` (`get_module_state_transition` / `get_midplane_down_reason`), cover `boot_id` capture and the Planned/Unplanned midplane-down reason logic.
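To make the Planned/Unplanned rule concrete, here is a small standalone sketch of the kind of logic those tests would exercise. It is an illustration under stated assumptions, not the PR's `_resolve_midplane_down_reason()`: the function name, its parameters, the `(category, detail)` return shape, and the `"unknown"` fallback are all hypothetical, and the platform query is modeled as a plain callable instead of the real platform API.

```python
from typing import Callable, Optional, Tuple

PLANNED = "Planned"
UNPLANNED = "Unplanned"


def resolve_midplane_down_reason(
    was_up: bool,
    is_up: bool,
    transition_type: Optional[str],
    platform_reason: Callable[[], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Classify a midplane up->down transition (hypothetical signature).

    transition_type stands in for the STATE_DB transition flag/type;
    platform_reason stands in for the platform's down-reason query.
    """
    # Only an up->down edge is classified; anything else yields no reason.
    if not (was_up and not is_up):
        return None
    # A transition already recorded as in progress means the drop was requested.
    if transition_type:
        return (PLANNED, transition_type)
    # Otherwise ask the platform; fall back to a placeholder if it has no answer.
    return (UNPLANNED, platform_reason() or "unknown")
```

In a test, the callable can be a stub that returns a fixed string, which mirrors how a mock platform would supply the Unplanned reason. Checking the transition type before calling the platform means a planned shutdown never needs a platform query.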