Collaborator

@jserv jserv commented Nov 2, 2025

Problem

#110 introduced a regression where SMP=4 configurations hang during boot after SMP initialization completes.

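Symptoms

  • SMP=4 hangs after "smp: Brought up 1 node, 4 CPUs" (49 lines of output)
  • Boot never reaches "clocksource: Switched to clocksource" or the login prompt
  • SMP=1 continues to work correctly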

Root Cause

The adaptive timer/UART fd registration logic excluded the timer fd from poll() monitoring whenever all harts were in the WFI state. During early boot, the kernel temporarily puts all harts in WFI while waiting for timer interrupts. This created a deadlock:

  1. All 4 harts enter WFI waiting for timer interrupts
  2. Logic sees idle_harts == started_harts, sets harts_active = false
  3. Timer fd not added to poll() fds
  4. poll() blocks indefinitely (or sleeps 10ms repeatedly)
  5. Timer interrupts never fire → harts never wake → deadlock
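
The steps above can be sketched as a small helper that decides which descriptors go into the poll() set. This is only an illustration of the failure mode, not the code from main.c: `build_pollfds`, its parameters, and the exact form of the `harts_active` test are assumptions made for the example.

```c
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from main.c): fill `pfds` with the descriptors
 * to monitor and return how many were added. It mimics the pre-fix behavior
 * described above, where the timer and UART fds are registered only while
 * at least one started hart is not idle. */
static size_t build_pollfds(struct pollfd *pfds, int timer_fd, int uart_fd,
                            unsigned idle_harts, unsigned started_harts)
{
    size_t n = 0;
    /* Assumed form of the check: every started hart in WFI means inactive. */
    bool harts_active = idle_harts < started_harts;

    if (harts_active) {
        pfds[n].fd = timer_fd;
        pfds[n].events = POLLIN;
        pfds[n].revents = 0;
        n++;

        pfds[n].fd = uart_fd;
        pfds[n].events = POLLIN;
        pfds[n].revents = 0;
        n++;
    }

    /* When n == 0, poll(pfds, 0, timeout) can only return on timeout (or a
     * signal), so a timer expiry is never observed and no hart is woken. */
    return n;
}
```

With four started harts all in WFI the helper returns an empty set, which is the situation in step 3; as soon as one hart is running, both descriptors are registered again.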

Fix

Introduce a boot completion heuristic using peripheral_update_ctr:

const uint64_t BOOT_SETTLE_ITERATIONS = 5000;
bool boot_complete = all_harts_started && 
                     (emu->peripheral_update_ctr > BOOT_SETTLE_ITERATIONS);
bool harts_active = (vm->n_hart == 1) || !boot_complete || (idle_harts == 0);

During the first 5000 scheduler iterations after SMP initialization, always keep timer and UART fds active. This ensures harts can receive timer interrupts even when temporarily in WFI during early boot.
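
To make the intended behavior of the predicate easier to check, here is a minimal sketch that lifts it into a pure function. `compute_harts_active` and its `iterations` parameter are hypothetical names; `iterations` stands in for `emu->peripheral_update_ctr` and is assumed here to be a value that grows with each scheduler iteration.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOT_SETTLE_ITERATIONS 5000u

/* Hypothetical pure-function version of the predicate from the snippet
 * above. `iterations` is assumed to increase over time. */
static bool compute_harts_active(unsigned n_hart, bool all_harts_started,
                                 uint64_t iterations, unsigned idle_harts)
{
    bool boot_complete =
        all_harts_started && (iterations > BOOT_SETTLE_ITERATIONS);

    /* Keep the fds active for single-hart VMs, during the settle period,
     * or whenever no hart is idle. */
    return (n_hart == 1) || !boot_complete || (idle_harts == 0);
}
```

Under these assumptions, `compute_harts_active(4, true, 100, 4)` is true (still inside the settle period, so the fds stay registered even with all four harts in WFI), while `compute_harts_active(4, true, 5001, 4)` is false (boot is considered complete and the idle optimization applies).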


Summary by cubic

Fixes the SMP=4 boot hang introduced in PR #110 by keeping timer and UART polling active during early boot to avoid a WFI deadlock. Boots reliably on SMP=1 and SMP=4, with no impact to post-boot idle optimizations.

  • Bug Fixes
    • Root cause: timer fd was dropped from poll() when all harts entered WFI during early boot, blocking timer interrupts.
    • Change: add a boot completion heuristic using peripheral_update_ctr with a 5000-iteration settle period; treat boot as incomplete and keep timer/UART fds active until then or while any hart is active.
    • Result: SMP=4 now completes boot; SMP=1 remains stable.
    • Performance: preserves PR #110’s idle CPU optimization after boot.

Written for commit 523748e. Summary will update automatically on new commits.

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 1 file

const uint64_t BOOT_SETTLE_ITERATIONS = 5000;
bool boot_complete =
    all_harts_started &&
    (emu->peripheral_update_ctr > BOOT_SETTLE_ITERATIONS);
@cubic-dev-ai cubic-dev-ai bot Nov 2, 2025

peripheral_update_ctr never exceeds 64 because it is a countdown reset to 64 each tick, so this > BOOT_SETTLE_ITERATIONS check never succeeds and boot_complete stays false. That keeps the timer and UART fds permanently active, undoing the post-boot idle optimization this block is supposed to preserve.


Collaborator Author

peripheral_update_ctr is a countdown timer (64→0→64) and will never exceed 5000, meaning boot_complete stays false permanently. This breaks the intended idle optimization.
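
The point can be checked with a small simulation. The sketch below models a countdown that reloads to 64 (the value given in the review comment) next to a hypothetical counter that is only ever incremented. The struct, the function names, and the exact reload behavior are assumptions for illustration, and the monotonic counter is one possible remedy, not necessarily what the merged commit does.

```c
#include <stdbool.h>
#include <stdint.h>

#define PERIPHERAL_UPDATE_PERIOD 64u /* reload value from the review comment */
#define BOOT_SETTLE_ITERATIONS 5000u

/* Hypothetical state for the illustration; not the layout of emu_state_t. */
typedef struct {
    uint32_t countdown;        /* modeled after peripheral_update_ctr */
    uint64_t total_iterations; /* assumed extra counter, never reset */
} sched_counters_t;

/* One scheduler iteration: the countdown wraps from 0 back to 64, while
 * the total keeps growing. */
static void sched_tick(sched_counters_t *c)
{
    if (c->countdown == 0)
        c->countdown = PERIPHERAL_UPDATE_PERIOD;
    else
        c->countdown--;
    c->total_iterations++;
}

/* The check as written in the PR: can never be true, since countdown <= 64. */
static bool settled_by_countdown(const sched_counters_t *c)
{
    return c->countdown > BOOT_SETTLE_ITERATIONS;
}

/* The same threshold applied to a monotonic counter. */
static bool settled_by_total(const sched_counters_t *c)
{
    return c->total_iterations > BOOT_SETTLE_ITERATIONS;
}
```

Starting from `{ PERIPHERAL_UPDATE_PERIOD, 0 }` and calling `sched_tick` 5001 times, `settled_by_total` becomes true while `settled_by_countdown` stays false on every iteration, which matches the reviewer's observation that `boot_complete` would remain false permanently.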

@jserv jserv force-pushed the fix-smp4-boot-hang branch 3 times, most recently from d1eaca7 to 2a8451a November 2, 2025 14:59
@jserv jserv changed the title from "Fix SMP=4 boot hang regression from PR #110" to "Fix SMP=4 boot hang regression" Nov 2, 2025
After the introduction of #110, the adaptive timer/UART fd registration
logic would exclude timer fd monitoring when all harts entered WFI state
during early boot. This created a deadlock: harts waited for timer
interrupts, but the timer fd was not being polled, preventing wakeup.

Symptom:
- SMP=4 hung after "smp: Brought up 1 node, 4 CPUs" (49 lines of output)
- Never reached "clocksource: Switched to clocksource" or login prompt
- SMP=1 continued to work correctly

This commit introduces a boot completion heuristic using
peripheral_update_ctr. Boot is considered "incomplete" for the first
5000 scheduler iterations after all harts start. During this period,
the timer and UART fds are always kept active so that harts can receive
timer interrupts even when temporarily in WFI.
@jserv jserv force-pushed the fix-smp4-boot-hang branch from 2a8451a to 523748e November 2, 2025 15:20
@jserv jserv merged commit c2809ea into master Nov 2, 2025
10 checks passed
@jserv jserv deleted the fix-smp4-boot-hang branch November 2, 2025 15:36