CPU Stall: HARD LOCKUP Issue Observed in 202305 Release Branch [Kernel: 5.10.140-1] [Accton-AS7716-32X] #17363

Open
mithun2498 opened this issue Nov 30, 2023 · 2 comments
Labels
Triaged this issue has been triaged

Comments

@mithun2498

Description

We took the 202305 release branch image from the community builds, loaded it on an Accton-AS7716-32X, and ran the T1 test cases. During the run we observed a CPU stall with the following details:

Issue Description:

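Hard lockup on CPU 1, reported while the CPU was running "swapper/1" (PID 0). Note that swapper/N is the kernel's per-CPU idle task, not a daemon that moves processes between main memory and secondary storage. The system was idle at the time: the call trace shows a timer interrupt arriving in cpuidle_enter_state, with the timer softirq reaching qi_flush_iotlb, which is the top frame when the watchdog fired.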

        Supporting Log Snippets:
              [ 5300.443503] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
              [ 5300.443571] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 5.10.0-18-2-amd64 #1 Debian 5.10.140-1
        
        Console Call Trace:
              [ 5300.443584]  qi_flush_iotlb+0x83/0xb0
              ...
              [ 5300.443586]  ? fq_ring_free+0x100/0x100
              ...
              [ 5300.443590]  ? clockevents_program_event+0x8d/0xf0
              [ 5300.443591]  run_timer_softirq+0x26/0x50
              ...
              [ 5300.443595]  asm_sysvec_apic_timer_interrupt+0x12/0x20
              [ 5300.443595] RIP: 0010:cpuidle_enter_state+0xc7/0x350
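For triage, the lines above can be picked apart with a short script. This is only a minimal sketch: `summarize` and the regular expressions are hypothetical helpers written for this issue, not part of SONiC or any kernel tooling, and the sample text contains just the lines quoted above (the elided `...` lines are left out). Frames prefixed with `?` are marked unreliable, since the kernel prints `?` for addresses found on the stack that the unwinder could not confirm as part of the call chain.

```python
import re

# Sample lines copied from the console log quoted in this issue.
LOG = """\
[ 5300.443503] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[ 5300.443571] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 5.10.0-18-2-amd64 #1 Debian 5.10.140-1
[ 5300.443584]  qi_flush_iotlb+0x83/0xb0
[ 5300.443586]  ? fq_ring_free+0x100/0x100
[ 5300.443590]  ? clockevents_program_event+0x8d/0xf0
[ 5300.443591]  run_timer_softirq+0x26/0x50
[ 5300.443595]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 5300.443595] RIP: 0010:cpuidle_enter_state+0xc7/0x350
"""

# Hypothetical patterns, written against the lines above only.
LOCKUP_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
TASK_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
FRAME_RE = re.compile(
    r"^\[\s*[\d.]+\]\s+(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+$"
)


def summarize(log):
    """Return the locked-up CPU, the running task, and the stack frames."""
    summary = {"cpu": None, "pid": None, "comm": None, "frames": []}
    for line in log.splitlines():
        line = line.rstrip()
        m = LOCKUP_RE.search(line)
        if m:
            summary["cpu"] = int(m.group(1))
            continue
        m = TASK_RE.search(line)
        if m:
            summary["pid"] = int(m.group(2))
            summary["comm"] = m.group(3)
            continue
        m = FRAME_RE.match(line)
        if m:
            # A leading "? " means the unwinder is not sure about this frame.
            reliable = m.group(1) is None
            summary["frames"].append((m.group(2), reliable))
    return summary


if __name__ == "__main__":
    result = summarize(LOG)
    print("CPU:", result["cpu"], "task:", result["comm"], "PID:", result["pid"])
    for symbol, reliable in result["frames"]:
        print(("  " if reliable else "? ") + symbol)
```

Running it against the attached console log instead of the inline sample would only require reading the file contents into `LOG`.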

Image Version:

SONiC-OS-202305.366435-a49860cc7
SONiC NOS Debian kernel (5.10.140-1)

Current Behavior: The board hung with the CPU stall error message shown above.

Expected Behavior: The board should not hang, and no CPU stall should be reported.

Please let us know whether this is a known issue and whether there is a way to avoid this CPU hard lockup.

@prgeor
Contributor

prgeor commented Dec 6, 2023

@mithun2498 could you check with Accton whether this is a platform-specific issue?

@prgeor prgeor added the Triaged this issue has been triaged label Dec 6, 2023
@mithun2498
Author

mithun2498 commented Dec 11, 2023

Hi @prgeor,

I reported the same issue to the Accton team, and they responded with the following analysis:

According to our understanding, "NMI watchdog: Watchdog detected" usually appears when the NOS operates the hardware but does not wait for a response.
When the NOS handles an event interrupt, it first disables the IRQ, handles the current interrupt promptly, and then enables the IRQ again.
If the interrupt handling goes wrong and the IRQ is not re-enabled, then once this state has lasted as long as the period configured for the watchdog, this type of message is printed.
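The mechanism described in this analysis can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel's implementation: `simulate` and all of its parameters are hypothetical, time advances in whole seconds, and the 10-second check interval and 4-second timer period are chosen to resemble the default `kernel.watchdog_thresh` setting. A periodic timer interrupt bumps a counter only while IRQs are enabled, while the NMI check still runs with IRQs disabled and reports a lockup if the counter has not moved since the previous check.

```python
def simulate(irqs_off_from, irqs_off_until, total_seconds=60,
             watchdog_thresh=10, timer_period=4):
    """Toy model of hard lockup detection (all names are illustrative).

    IRQs are treated as disabled for irqs_off_from <= t < irqs_off_until.
    Returns the second at which a lockup is reported, or None.
    """
    timer_interrupts = 0   # bumped by the periodic timer interrupt
    saved_interrupts = 0   # value seen at the previous NMI check

    for t in range(1, total_seconds + 1):
        irqs_enabled = not (irqs_off_from <= t < irqs_off_until)

        # The timer interrupt is a normal IRQ: it cannot run while IRQs are off.
        if t % timer_period == 0 and irqs_enabled:
            timer_interrupts += 1

        # The NMI cannot be masked, so this check runs regardless.
        if t % watchdog_thresh == 0:
            if timer_interrupts == saved_interrupts:
                return t  # no timer interrupt since the last check
            saved_interrupts = timer_interrupts

    return None


if __name__ == "__main__":
    print("IRQs never disabled:", simulate(0, 0))
    print("IRQs off briefly:   ", simulate(15, 18))
    print("IRQs off for 30 s:  ", simulate(15, 45))
```

In this model only the long IRQs-off window is reported. A brief window passes unnoticed because a timer interrupt still lands between two consecutive checks.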

They suspect that the issue lies in the NOS (OS/kernel).
Kindly help us with this.
DUT_Console_Logs_7716_27_11_23.txt
