[chassisd] Capture DPU reboot cause via boot_id and add midplane-down reason for Smart Switch #850

Open
chartsai-nvidia wants to merge 1 commit into sonic-net:master from chartsai-nvidia:chartsai/pmon-hld-update

Conversation

chartsai-nvidia commented Jul 2, 2026

Why I did it

On Smart Switch platforms, chassisd needs a reliable way to detect when a DPU has
rebooted and to record the correct reboot cause for it. The previous approach inferred
reboots from midplane operational-status transitions and time-window heuristics, which was
fragile and could miss or misattribute reboots. This change switches to a deterministic
signal: the DPU publishes its kernel boot_id, and the NPU captures the reboot cause
whenever that boot_id changes. It also adds a proper midplane-down reason
(Planned vs. Unplanned) per HLD rev 0.7.
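
The boot_id signal described above can be illustrated with a minimal sketch. It assumes a plain dict stands in for the DPU_STATE table; `read_boot_id` and `publish_if_changed` are hypothetical names chosen for illustration and are not the functions added by this PR.

```python
from pathlib import Path

# The kernel generates a fresh UUID here on every boot.
BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id(path=BOOT_ID_PATH):
    """Return the kernel boot_id string, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def publish_if_changed(table, boot_id):
    """Store boot_id in `table` only when it differs from the stored value.

    `table` is a dict standing in for the DPU_STATE entry (an assumption).
    Returns True when a write happened.
    """
    if boot_id is None or table.get("boot_id") == boot_id:
        return False
    table["boot_id"] = boot_id
    return True
```

Because the boot_id changes only across a kernel boot, a changed value is an unambiguous reboot indicator, whereas a midplane flap leaves it untouched.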

Work item tracking
  • Microsoft ADO (number only): N/A

How I did it

  • DPU side (DpuStateUpdater / DpuStateManagerTask): read the kernel boot_id from
    /proc/sys/kernel/random/boot_id and publish it into the DPU_STATE table in
    CHASSIS_STATE_DB, updating it only when it changes.
  • NPU side (SmartSwitchModuleUpdater): add check_dpu_boot_id_changes() to capture and
    persist the reboot cause when a DPU's boot_id changes (only while the midplane is
    reachable), and load persisted boot_ids on startup from previous-reboot-cause.json.
  • Add RebootCauseSubscriberTask, which subscribes to DPU_STATE changes and triggers
    reboot-cause capture on boot_id updates, replacing the old status-transition /
    time-window reboot heuristic.
  • Persist boot_id alongside the reboot-cause record and simplify the history file naming
    by removing the separate prev_reboot_time.txt bookkeeping.
  • Add _resolve_midplane_down_reason() to classify midplane up->down transitions as
    Planned (from the STATE_DB transition flag/type) or Unplanned (from the platform's
    get_midplane_down_reason()), and write it into dpu_midplane_link_reason.
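
The NPU-side behavior listed above could look roughly like the following sketch. Everything here is illustrative: the function names, the JSON field names (`cause`, `boot_id`), and the `(category, detail)` return shape are assumptions, not the layout of previous-reboot-cause.json or the signature of `_resolve_midplane_down_reason()` in this PR.

```python
import json
from pathlib import Path


def capture_on_boot_id_change(dpu, new_boot_id, known_boot_ids,
                              midplane_up, record_path, cause):
    """Persist a reboot-cause record when a DPU's boot_id changes.

    Returns True if a record was written. Nothing is captured while the
    midplane is down, or when the boot_id is missing or unchanged.
    """
    if not midplane_up or not new_boot_id:
        return False
    if known_boot_ids.get(dpu) == new_boot_id:
        return False
    # Hypothetical record layout: the boot_id is stored next to the cause.
    record = {"cause": cause, "boot_id": new_boot_id}
    Path(record_path).write_text(json.dumps(record))
    known_boot_ids[dpu] = new_boot_id
    return True


def resolve_midplane_down_reason(transition_in_progress, transition_type,
                                 platform_reason):
    """Classify an up->down midplane transition as Planned or Unplanned."""
    if transition_in_progress:
        # A transition flagged in STATE_DB means the operator asked for it.
        return ("Planned", transition_type or "unknown")
    # Otherwise fall back to whatever the platform reports.
    return ("Unplanned", platform_reason or "unknown")
```

In this sketch a DPU with no known boot_id is treated as changed, which is why loading the persisted boot_ids at startup matters: without it, every chassisd restart would record a spurious reboot.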

How to verify it

  • Run the chassisd unit tests:
    cd sonic-chassisd && pytest tests/test_chassisd.py
  • Tests added/updated in tests/test_chassisd.py, with new mock support in
    tests/mock_module_base.py (midplane-down reason constants) and tests/mock_platform.py
    (get_module_state_transition / get_midplane_down_reason) to cover boot_id capture
    and the Planned/Unplanned midplane-down reason logic.
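
As an illustration of the mocking style such tests can use, here is a self-contained sketch that patches the file read with `unittest.mock.mock_open`. The reader function `read_boot_id_from_proc` is a hypothetical stand-in, not code from tests/test_chassisd.py.

```python
from unittest import mock

BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id_from_proc(path=BOOT_ID_PATH):
    # Hypothetical stand-in for the reader under test.
    with open(path) as f:
        return f.read().strip()


def test_read_boot_id_strips_newline():
    fake_open = mock.mock_open(
        read_data="1b2c3d4e-0000-4000-8000-abcdefabcdef\n")
    # Patch the built-in open so no real /proc access is needed.
    with mock.patch("builtins.open", fake_open):
        assert read_boot_id_from_proc() == "1b2c3d4e-0000-4000-8000-abcdefabcdef"
    fake_open.assert_called_once_with(BOOT_ID_PATH)
```

Patching the read keeps the test independent of the host kernel, so it runs the same way in a build container as on a switch.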

NPU chassisd can read the boot_id, update it, and persist it
add a subscriber to chassisd
add tests
persist the previous midplane-down reason

Signed-off-by: Charles Tsai <chartsai@nvidia.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Collaborator

Hi, there are workflow run(s) waiting for approval, you may be first-time contributor. I will notify maintainers to help approve once PR is approved. Thanks!

---Powered by SONiC BuildBot
