Add a patch for printing the AMD Zen CPU reset reason #514

Merged
saiarcot895 merged 2 commits into sonic-net:master from nexthop-ai:lotus.amd-log
Mar 3, 2026

@lotus-nexthop lotus-nexthop commented Oct 29, 2025

Upstream commits:

The patch had to be adapted to the v6.1 kernel we're using. That mostly meant adding the entire contents (5 constants) of fch.h, as the file didn't exist in v6.1, and refreshing the amd.c hunks so they apply against the v6.1 context.

Testing

If we intentionally trigger a CPU soft reset (with sudo reboot -f), we see this:

admin@gold208-dut:~$ sudo dmesg | grep -i reason
[    0.635233] x86/amd: Previous system reset reason [0x00080800]: software wrote 0x6 to reset control register 0xCF9

If we intentionally trigger the CPU FCH watchdog, we see this:

admin@gold208-dut:~$ sudo dmesg | grep reason
[    0.632563] x86/amd: Previous system reset reason [0x02000800]: hardware watchdog timer expired
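The two logged values differ in a single bit: 0x00080800 has bits 11 and 19 set, and 0x02000800 has bits 11 and 25 set. As a rough sketch of how these lines could be checked in a test script, the snippet below parses the dmesg line and lists the set bits. The bit-to-text mapping is inferred only from the two log lines above, not copied from the kernel's table, and the helper names are hypothetical.

```python
import re

# Assumption: mapping inferred from the two log lines in this PR only.
# The kernel patch carries the authoritative table; bit 11 is set in both
# samples and has no text in either log line.
OBSERVED_REASONS = {
    19: "software wrote 0x6 to reset control register 0xCF9",
    25: "hardware watchdog timer expired",
}

LINE_RE = re.compile(
    r"Previous system reset reason \[0x([0-9a-fA-F]+)\]: (.*)$"
)


def parse_reset_line(line):
    """Return (register_value, reason_text) from a dmesg line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2).strip()


def set_bits(value):
    """List the bit positions that are set in a 32-bit register value."""
    return [bit for bit in range(32) if value & (1 << bit)]


def observed_reasons(value):
    """Map set bits to the reason strings seen in this PR's logs."""
    return [OBSERVED_REASONS[b] for b in set_bits(value) if b in OBSERVED_REASONS]
```

Feeding it the watchdog line above should yield the value 0x02000800 with bits 11 and 25 set.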

To enable the watchdog, we create /etc/systemd/system.conf.d/override.conf with the contents:

[Manager]
RuntimeWatchdogSec=default
WatchdogDevice=/dev/watchdog1
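
If the drop-in needs to be generated from a test harness rather than written by hand, something along these lines might work. This is only a sketch: the function name and its defaults are hypothetical, and it only renders the text; writing it to /etc/systemd/system.conf.d/override.conf and reloading systemd is left to the caller.

```python
import configparser
import io


def render_watchdog_dropin(device="/dev/watchdog1", runtime_sec="default"):
    """Render the [Manager] drop-in shown above as a string."""
    parser = configparser.ConfigParser()
    # configparser lowercases keys by default; systemd keys are CamelCase.
    parser.optionxform = str
    parser["Manager"] = {
        "RuntimeWatchdogSec": runtime_sec,
        "WatchdogDevice": device,
    }
    buf = io.StringIO()
    # systemd unit-style files conventionally use KEY=VALUE without spaces.
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()
```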

To trigger the watchdog:
run sudo tee /dev/watchdog1, enter just one character, and leave the device alone for a minute or so.
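
The same trick can be scripted. The sketch below mirrors what the tee approach does: open the device, write a single byte, then keep the descriptor open without writing again so the timer is never petted. The function name is hypothetical, and running it against a real watchdog device as root is expected to reset the machine, so it takes the path as a parameter rather than hard-coding it.

```python
import time


def starve_watchdog(device_path, hold_seconds):
    """Write one byte to the device, then hold it open without petting.

    WARNING: pointed at a real watchdog device (for example the
    /dev/watchdog1 used in this PR), this is expected to reboot the system
    once the timeout expires.
    """
    # buffering=0 so the single byte reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as dev:
        written = dev.write(b"x")
        # Keep the descriptor open and do nothing, as tee does while it
        # waits on stdin.
        time.sleep(hold_seconds)
    return written
```

The hold time would need to exceed the watchdog timeout; the "minute or so" above is what was observed on this platform, not a value taken from the driver.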

@mssonicbld
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@lotus-nexthop lotus-nexthop marked this pull request as ready for review October 29, 2025 02:08
@lotus-nexthop lotus-nexthop requested a review from a team as a code owner October 29, 2025 02:08
@paulmenzel
How can these events be triggered?

From: Yazen Ghannam <yazen.ghannam@amd.com>
Date: Tue, 22 Apr 2025 18:48:30 -0500
Subject: [PATCH 1/2] x86/CPU/AMD: Print the reason for the last reset

Please add the upstream commit hash as done for stable series commits.

Thanks @paulmenzel, please take a look at 46ca756 to see if that is what you had in mind.

@nate-nexthop
> How can these events be triggered?

As I understand it, writing 0x6 to 0xcf9 is a standard way of rebooting an x86 CPU. I can trigger this with sudo reboot -f on SONiC with this CPU.

I can trigger the FCH watchdog on SONiC with an AMD Zen3 CPU by enabling the watchdog and never petting it.
As a hack, I can do this:
sudo tee /dev/watchdog1, enter just one character, and leave the device alone for a minute or so.
After the reboot, I see this:
[ 0.613853] x86/amd: Previous system reset reason [0x02000800]: hardware watchdog timer expired

If I intentionally trigger a CPU soft reset I see this:
```
admin@gold208-dut:~$ sudo dmesg | grep -i reason
[    0.635233] x86/amd: Previous system reset reason [0x00080800]: software wrote 0x6 to reset control register 0xCF9
```

If I intentionally trigger the CPU FCH Watchdog, I see this:
```
admin@gold208-dut:~$ sudo dmesg | grep reason
[    0.632563] x86/amd: Previous system reset reason [0x02000800]: hardware watchdog timer expired
```

Upstream from here:

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=ab8131028710d009ab93d6bffd2a2749ade909b0

The patch had to be adapted to the v6.1 kernel we're using. That mostly
meant adding the entire contents (5 constants) of `fch.h`, as the file
didn't exist in v6.1, and refreshing the `amd.c` hunks for context.

Signed-off-by: Nate White <nate@nexthop.ai>

@nate-nexthop
I've adapted the backported patches to the current (6.12) kernel sources, and can confirm the functionality still works:

admin@sonic:~$ sudo dmesg | grep -i reason
[    0.941066] x86/amd: Previous system reset reason [0x00080800]: software wrote 0x6 to reset control register 0xCF9
admin@sonic:~$ uname -a
Linux sonic 6.12.41+deb13-sonic-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.41-1 (2025-08-12) x86_64 GNU/Linux
admin@sonic:~$

@nate-nexthop
@saiarcot895 could you take a look please?

@saiarcot895
@lotus-nexthop Could you merge the master branch into your branch?

@nate-nexthop
@saiarcot895 Updated the branch with the latest master, looks like pipelines have started.

@saiarcot895 saiarcot895 merged commit 2afef96 into sonic-net:master Mar 3, 2026
8 checks passed