feat: Phase 0 architecture optimize and LLM heartbeat #38

Merged
zevorn merged 9 commits into main from feat/phase0-arch-optimize
Mar 16, 2026

zevorn commented Mar 16, 2026

Summary

  • Tool registry: add required_caps (SWARM_CAP_* bitmap) and flags (CLAW_TOOL_LOCAL_ONLY) fields for swarm routing decisions
  • Swarm: 20-byte heartbeat with node role/load, load-aware node selection, exponential-backoff RPC retry, required_caps matching
  • Gateway: service registry with type_mask bitmap dispatch, AI_REQ message type
  • ESP32-C3 shell: proper UTF-8 multi-byte character handling (backspace/cursor)
  • Scheduler: round-robin pending queue preventing task starvation; fix worker infinite loop caused by goto drain
  • Heartbeat: LLM connectivity probe (ai_ping() — max_tokens=1, ~200B) with online/offline state change notifications
  • Docs: architecture.md, tuning.md, CLAUDE.md updated for all changes
  • Cleanup: remove unused http_get_test(), guard ai_boot_test_thread() with matching #ifdef

Test plan

  • meson compile passes on vexpress-a9 (zero warnings)
  • make build-esp32c3-qemu passes (zero warnings)
  • scripts/check-patch.sh --staged passes on all commits
  • QEMU boot test: make run-esp32c3-qemu
  • Verify heartbeat ping logs state transitions on API up/down

🦞 Generated with Claude Code

zevorn added 9 commits March 16, 2026 10:10
Extend claw_tool_t with required_caps (SWARM_CAP_* bitmap) and
flags (CLAW_TOOL_LOCAL_ONLY) so the swarm RPC layer can match
tools to capable nodes and refuse to delegate local-only tools.

Update claw_tool_register() signature and all 29 call sites:
- GPIO tools: SWARM_CAP_GPIO
- LCD tools: SWARM_CAP_LCD
- Audio tools: SWARM_CAP_SPEAKER
- Net tools: SWARM_CAP_INTERNET
- System/sched/skill tools: CLAW_TOOL_LOCAL_ONLY
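
As a rough illustration of how the two new fields could drive a delegation decision, here is a minimal sketch. The names `required_caps`, `flags`, `SWARM_CAP_*` and `CLAW_TOOL_LOCAL_ONLY` come from this commit message; the bit values, the trimmed-down `struct tool_entry`, and the helper `tool_can_delegate()` are assumptions made for the example and are not taken from the actual code.

```c
#include <stdint.h>

/* Bit values are illustrative; only the names appear in the commit message. */
#define SWARM_CAP_GPIO       (1u << 0)
#define SWARM_CAP_LCD        (1u << 1)
#define SWARM_CAP_SPEAKER    (1u << 2)
#define SWARM_CAP_INTERNET   (1u << 3)

#define CLAW_TOOL_LOCAL_ONLY (1u << 0)

/* Hypothetical, trimmed-down view of a tool registry entry. */
struct tool_entry {
    const char *name;
    uint32_t required_caps;  /* SWARM_CAP_* bits a remote node must advertise */
    uint32_t flags;          /* CLAW_TOOL_* flags */
};

/* Returns 1 if the tool may be delegated to a node advertising node_caps. */
int tool_can_delegate(const struct tool_entry *t, uint32_t node_caps)
{
    if (t->flags & CLAW_TOOL_LOCAL_ONLY)
        return 0;  /* local-only tools are never sent over swarm RPC */
    /* Every required capability bit must be present on the node. */
    return (node_caps & t->required_caps) == t->required_caps;
}
```

With a bitmap, a tool that needs several capabilities at once only has to OR the bits together; the same comparison still applies.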

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

Extend heartbeat from 16 to 20 bytes with role and active_tasks
fields. Add enum swarm_role (WORKER/THINKER/COORDINATOR/OBSERVER)
with automatic self-detection based on capabilities.

Replace first-match node selection with load-aware strategy that
picks the online node with lowest load among those matching the
required capability bitmap.
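
The selection rule described above can be sketched as follows. The `struct node_info` layout and the function `pick_node()` are hypothetical; the sketch also assumes that "load" is a small integer such as the `active_tasks` value from the heartbeat, and that ties go to the first matching node.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node record built from received heartbeats. */
struct node_info {
    uint32_t caps;    /* SWARM_CAP_* bitmap advertised by the node */
    uint8_t  online;  /* non-zero while heartbeats keep arriving */
    uint8_t  load;    /* assumed: active task count from the heartbeat */
};

/*
 * Return the index of the online node with the lowest load among those
 * that advertise every bit in required_caps, or -1 if none qualifies.
 */
int pick_node(const struct node_info *nodes, size_t n, uint32_t required_caps)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!nodes[i].online)
            continue;
        if ((nodes[i].caps & required_caps) != required_caps)
            continue;
        if (best < 0 || nodes[i].load < nodes[best].load)
            best = (int)i;
    }
    return best;
}
```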

Add exponential-backoff RPC retry (3 attempts, 500ms/1s/2s) and
refuse to delegate tools marked CLAW_TOOL_LOCAL_ONLY.
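
A minimal sketch of the retry schedule, assuming the delay doubles from a 500 ms base. The macro names match the tuning parameters mentioned later in this PR, but their values here are inferred from "3 attempts, 500ms/1s/2s". The function names, the callback types, and the choice to wait after every failed attempt (including the last) are assumptions.

```c
#include <stdint.h>

#define SWARM_RPC_MAX_RETRIES   3u    /* assumed value */
#define SWARM_RPC_RETRY_BASE_MS 500u  /* assumed value */

/* Delay after failed attempt number `attempt` (0-based): 500, 1000, 2000 ms. */
uint32_t swarm_rpc_backoff_ms(unsigned attempt)
{
    return SWARM_RPC_RETRY_BASE_MS << attempt;
}

/* Hypothetical hooks so the sketch does not depend on any RTOS API. */
typedef int  (*rpc_send_fn)(void *ctx);     /* returns 0 on success */
typedef void (*rpc_sleep_fn)(uint32_t ms);

/* Try the call up to SWARM_RPC_MAX_RETRIES times, backing off in between. */
int swarm_rpc_call(rpc_send_fn send, void *ctx, rpc_sleep_fn sleep_ms)
{
    for (unsigned attempt = 0; attempt < SWARM_RPC_MAX_RETRIES; attempt++) {
        if (send(ctx) == 0)
            return 0;
        sleep_ms(swarm_rpc_backoff_ms(attempt));
    }
    return -1;  /* all attempts failed */
}
```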

Replace tool_name_to_cap() prefix matching with tool registry
lookup via claw_tool_find()->required_caps, falling back to
prefix heuristic for unregistered tools.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

Replace the routing skeleton with a working service registry and
type-based message dispatch. Services register with a type_mask
bitmap and their own message queue; gateway delivers incoming
messages to all matching consumers.
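
The dispatch idea can be sketched like this. Only `type_mask` and `GW_MSG_AI_REQ` are named in the commit message; the other message types, the enum values, the fixed-size array standing in for a real message queue, and the functions `gw_register()` and `gw_dispatch()` are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Only GW_MSG_AI_REQ is from the commit; other names and values are assumed. */
enum gw_msg_type { GW_MSG_DATA = 0, GW_MSG_CMD = 1, GW_MSG_AI_REQ = 2 };

#define GW_MAX_SERVICES 4
#define GW_QUEUE_DEPTH  4

struct gw_msg {
    enum gw_msg_type type;
    uint32_t payload;
};

/* A plain array stands in for the service's RTOS message queue. */
struct gw_service {
    uint32_t type_mask;  /* bit (1u << type) set for each accepted type */
    struct gw_msg queue[GW_QUEUE_DEPTH];
    size_t count;
};

static struct gw_service *s_services[GW_MAX_SERVICES];
static size_t s_service_count;

/* Register a service for the message types in type_mask; -1 if table full. */
int gw_register(struct gw_service *svc, uint32_t type_mask)
{
    if (s_service_count >= GW_MAX_SERVICES)
        return -1;
    svc->type_mask = type_mask;
    svc->count = 0;
    s_services[s_service_count++] = svc;
    return 0;
}

/* Copy msg into every matching service queue; returns the delivery count. */
int gw_dispatch(const struct gw_msg *msg)
{
    int delivered = 0;

    for (size_t i = 0; i < s_service_count; i++) {
        struct gw_service *svc = s_services[i];

        if (!(svc->type_mask & (1u << msg->type)))
            continue;
        if (svc->count >= GW_QUEUE_DEPTH)
            continue;  /* queue full: this consumer misses the message */
        svc->queue[svc->count++] = *msg;
        delivered++;
    }
    return delivered;
}
```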

Add GW_MSG_AI_REQ message type for future AI request queuing.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

The serial shell processed backspace byte-by-byte, requiring
multiple presses to delete a single CJK character (3-byte UTF-8).

Fix all line-editing operations to be UTF-8 aware:
- Backspace: walk back over continuation bytes, delete entire
  sequence, erase correct column count (2 for CJK/emoji)
- Delete key: detect UTF-8 lead byte to determine sequence length
- Left/right arrows: skip complete UTF-8 sequences
- Character input: read all continuation bytes atomically before
  inserting, preventing partial-character echo
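
The byte-level part of these operations rests on two facts about UTF-8: the lead byte encodes the sequence length, and continuation bytes always match the bit pattern `10xxxxxx`. A minimal sketch follows; the function names are hypothetical, and display-column width (the "2 for CJK/emoji" part) is not covered here.

```c
#include <stddef.h>

/* Sequence length implied by a lead byte; invalid lead bytes count as 1. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: most CJK */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: most emoji */
    return 1;
}

/*
 * Number of bytes a backspace at byte offset pos should delete:
 * walk back over continuation bytes (10xxxxxx) until a lead byte
 * or ASCII byte is reached. Returns 0 at the start of the line.
 */
size_t utf8_prev_char_len(const char *buf, size_t pos)
{
    size_t n = 0;

    while (pos > 0) {
        unsigned char c = (unsigned char)buf[--pos];

        n++;
        if ((c & 0xC0) != 0x80)
            break;  /* found the start of the character */
        if (n == 4)
            break;  /* guard against malformed input */
    }
    return n;
}
```

The forward-looking function serves the Delete key and right arrow, while the backward walk serves Backspace and left arrow.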

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

The single AI worker thread dropped task callbacks when busy,
starving lower-frequency tasks. With 3 tasks (10s, 15s, 30s),
the 10s GPIO task monopolized the worker while the other two
never executed.

Add a pending flag to sched_ai_ctx_t. When the worker is busy,
mark the task as pending instead of discarding it. After each
task completes, the worker scans all contexts in round-robin
order and immediately executes the next pending task before
sleeping on the semaphore.
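
The round-robin scan can be sketched as below. Only the `pending` flag on `sched_ai_ctx_t` is taken from this commit; the reduced struct, the cursor argument, and the function `sched_next_pending()` are assumptions. Starting the scan one slot past the last executed task is what keeps a frequently firing task from always winning.

```c
#include <stddef.h>

/* Reduced stand-in: the real sched_ai_ctx_t has more fields. */
typedef struct {
    int pending;  /* set by the timer callback while the worker is busy */
} sched_ai_ctx_t;

/*
 * Find the next pending task, scanning from the slot after *cursor and
 * wrapping around. Clears the flag, updates *cursor, and returns the
 * task index, or -1 if nothing is pending.
 */
int sched_next_pending(sched_ai_ctx_t *ctx, size_t n, size_t *cursor)
{
    for (size_t i = 1; i <= n; i++) {
        size_t idx = (*cursor + i) % n;

        if (ctx[idx].pending) {
            ctx[idx].pending = 0;
            *cursor = idx;
            return (int)idx;
        }
    }
    return -1;
}
```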

This ensures all scheduled tasks eventually execute regardless
of their interval, with zero additional memory or threads.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

The goto next_task loop never returned to sem_take because
timer callbacks continuously set pending=1 during AI calls.
With a 10s GPIO task taking ~5s per AI call, the worker
drained one pending just as the next arrived, spinning
forever and flooding the console.

Remove the goto loop. The worker now processes exactly one
task per sem_take wakeup, then sets worker_busy=0 and sleeps.
The callback does sem_give when marking pending, so the worker
wakes up promptly for the next queued task without spinning.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

Reflect the recent architecture optimizations across all docs:

architecture.md (en/zh):
- Gateway: service registry with type_mask dispatch, AI_REQ type
- Scheduler: round-robin pending queue, task starvation prevention
- AI Engine: tool capability declarations (SWARM_CAP_*, LOCAL_ONLY)
- Swarm: 20B heartbeat with role/load, load-aware node selection,
  exponential-backoff RPC retry, required_caps matching
- Resource budget: updated to measured 43% usage (100KB free heap)

tuning.md (en/zh):
- ESP32-C3 memory section: measured runtime data, NET_RESP_MAX
  reduction (16KB->4KB), heap-allocated sched buffers
- Add SWARM_RPC_MAX_RETRIES and SWARM_RPC_RETRY_BASE_MS params

CLAUDE.md:
- Key Paths: gateway and tools descriptions updated

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

Implement the "cheap checks first" pattern used by OpenClaw and
other Claw projects. When no events are pending, the heartbeat
tick performs a lightweight LLM ping (max_tokens=1, ~200B request)
instead of skipping entirely.

New ai_ping() in ai_engine sends a minimal API request without
acquiring s_api_lock, so it never blocks interactive ai_chat()
calls. Any HTTP response (including 4xx) counts as "online";
only network failures count as "offline".

State transitions (online<->offline) are logged and delivered
to IM/console. heartbeat_llm_online() exposes the current state
for other modules to query.
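
The online/offline rule and the transition detection can be sketched as follows. The convention that a positive value is an HTTP status code and a non-positive value is a network failure, the `struct llm_state` type, and both function names are assumptions for the example; the real `ai_ping()` return convention is not shown in this PR text.

```c
#include <stdbool.h>

/*
 * Assumed convention: http_status > 0 means some HTTP response arrived
 * (200, 401, 429, ...); http_status <= 0 means the request never got one.
 */
bool llm_ping_is_online(int http_status)
{
    return http_status > 0;  /* even a 4xx proves the API is reachable */
}

/* Hypothetical holder for the last known state. */
struct llm_state {
    bool online;
};

/*
 * Update the state from a ping result. Returns +1 on an offline->online
 * transition, -1 on online->offline, and 0 if nothing changed, so the
 * caller only notifies IM/console on actual transitions.
 */
int llm_state_update(struct llm_state *s, int http_status)
{
    bool now = llm_ping_is_online(http_status);

    if (now == s->online)
        return 0;
    s->online = now;
    return now ? 1 : -1;
}
```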

Ping thread uses a 4KB stack (vs 8KB for full heartbeat AI
thread), keeping memory overhead minimal.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

Move ai_boot_test_thread() inside the same #ifdef guards that
protect its only call site (CONFIG_RTCLAW_AI_BOOT_TEST && no IM).

Remove http_get_test() from net_service.c — dead code with zero
callers, leftover from early bring-up debugging.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

zevorn merged commit e04a549 into main Mar 16, 2026
9 checks passed
zevorn deleted the feat/phase0-arch-optimize branch March 16, 2026 14:36