feat: Phase 0 architecture optimize and LLM heartbeat by zevorn · Pull Request #38 · zevorn/rt-claw

zevorn · 2026-03-16T02:11:18Z

Summary

- Tool registry: add required_caps (SWARM_CAP_* bitmap) and flags (CLAW_TOOL_LOCAL_ONLY) fields for swarm routing decisions
- Swarm: 20-byte heartbeat with node role/load, load-aware node selection, exponential-backoff RPC retry, required_caps matching
- Gateway: service registry with type_mask bitmap dispatch, AI_REQ message type
- ESP32-C3 shell: proper UTF-8 multi-byte character handling (backspace/cursor)
- Scheduler: round-robin pending queue preventing task starvation; fix worker infinite loop caused by goto drain
- Heartbeat: LLM connectivity probe (ai_ping(): max_tokens=1, ~200B request) with online/offline state change notifications
- Docs: architecture.md, tuning.md, CLAUDE.md updated for all changes
- Cleanup: remove unused http_get_test(), guard ai_boot_test_thread() with matching #ifdef

Test plan

- meson compile passes on vexpress-a9 (zero warnings)
- make build-esp32c3-qemu passes (zero warnings)
- scripts/check-patch.sh --staged passes on all commits
- QEMU boot test: make run-esp32c3-qemu
- Verify heartbeat ping logs state transitions on API up/down

🦞 Generated with Claude Code

Extend claw_tool_t with required_caps (SWARM_CAP_* bitmap) and flags (CLAW_TOOL_LOCAL_ONLY) so the swarm RPC layer can match tools to capable nodes and refuse to delegate local-only tools.

Update claw_tool_register() signature and all 29 call sites:
- GPIO tools: SWARM_CAP_GPIO
- LCD tools: SWARM_CAP_LCD
- Audio tools: SWARM_CAP_SPEAKER
- Net tools: SWARM_CAP_INTERNET
- System/sched/skill tools: CLAW_TOOL_LOCAL_ONLY

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>
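
For illustration, a minimal sketch of how a capability bitmap and a local-only flag could drive the delegation decision. It is not the code from this commit: the bit values, the three-argument claw_tool_register() signature, the table size, and the tool_can_delegate() helper are assumptions made for the example. Only the names claw_tool_t, claw_tool_register(), claw_tool_find(), SWARM_CAP_* and CLAW_TOOL_LOCAL_ONLY come from the commit message.

```c
#include <stdint.h>
#include <string.h>

/* Assumed bit assignments; the real constants may use different values. */
#define SWARM_CAP_GPIO       (1u << 0)
#define SWARM_CAP_LCD        (1u << 1)
#define SWARM_CAP_SPEAKER    (1u << 2)
#define SWARM_CAP_INTERNET   (1u << 3)
#define CLAW_TOOL_LOCAL_ONLY (1u << 0)

#define MAX_TOOLS 32 /* hypothetical registry size */

/* Simplified tool record: the real struct also carries a handler, schema, etc. */
typedef struct {
    const char *name;
    uint32_t    required_caps; /* SWARM_CAP_* bits a node must provide */
    uint32_t    flags;         /* CLAW_TOOL_* behaviour flags */
} claw_tool_t;

static claw_tool_t s_tools[MAX_TOOLS];
static int s_tool_count;

/* Register a tool; returns 0 on success, -1 if the table is full. */
int claw_tool_register(const char *name, uint32_t required_caps, uint32_t flags)
{
    if (name == NULL || s_tool_count >= MAX_TOOLS) {
        return -1;
    }
    s_tools[s_tool_count++] = (claw_tool_t){ name, required_caps, flags };
    return 0;
}

/* Look a tool up by exact name; returns NULL when it is not registered. */
const claw_tool_t *claw_tool_find(const char *name)
{
    for (int i = 0; i < s_tool_count; i++) {
        if (strcmp(s_tools[i].name, name) == 0) {
            return &s_tools[i];
        }
    }
    return NULL;
}

/* Hypothetical helper: may this tool be delegated to a node with node_caps? */
int tool_can_delegate(const claw_tool_t *tool, uint32_t node_caps)
{
    if (tool == NULL || (tool->flags & CLAW_TOOL_LOCAL_ONLY)) {
        return 0; /* unknown or local-only tools never leave this node */
    }
    /* Every required capability bit must be present on the node. */
    return (node_caps & tool->required_caps) == tool->required_caps;
}
```

Under these assumptions, a GPIO tool registered with SWARM_CAP_GPIO can go to any node advertising that bit, while a tool flagged CLAW_TOOL_LOCAL_ONLY is rejected regardless of the node's capabilities.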

Extend heartbeat from 16 to 20 bytes with role and active_tasks fields. Add enum swarm_role (WORKER/THINKER/COORDINATOR/OBSERVER) with automatic self-detection based on capabilities.

Replace first-match node selection with load-aware strategy that picks the online node with lowest load among those matching the required capability bitmap. Add exponential-backoff RPC retry (3 attempts, 500ms/1s/2s) and refuse to delegate tools marked CLAW_TOOL_LOCAL_ONLY.

Replace tool_name_to_cap() prefix matching with tool registry lookup via claw_tool_find()->required_caps, falling back to prefix heuristic for unregistered tools.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>
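
The selection and retry rules described above can be sketched roughly as follows. This is an illustrative approximation, not the swarm code itself: swarm_node_t, swarm_pick_node() and swarm_retry_delay_ms() are hypothetical names, and the values given to SWARM_RPC_MAX_RETRIES and SWARM_RPC_RETRY_BASE_MS are inferred from the "3 attempts, 500ms/1s/2s" wording rather than read from the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node record built from received heartbeats. */
typedef struct {
    uint32_t caps;   /* SWARM_CAP_* bits the node advertises */
    uint8_t  online; /* non-zero while heartbeats keep arriving */
    uint8_t  load;   /* reported load; lower is better */
} swarm_node_t;

/*
 * Load-aware selection: among online nodes that provide every required
 * capability bit, return the index of the one with the lowest load,
 * or -1 when no node qualifies. Ties go to the earliest entry.
 */
int swarm_pick_node(const swarm_node_t *nodes, size_t count,
                    uint32_t required_caps)
{
    int best = -1;

    for (size_t i = 0; i < count; i++) {
        if (!nodes[i].online) {
            continue;
        }
        if ((nodes[i].caps & required_caps) != required_caps) {
            continue;
        }
        if (best < 0 || nodes[i].load < nodes[best].load) {
            best = (int)i;
        }
    }
    return best;
}

/* Assumed values, inferred from the commit message. */
#define SWARM_RPC_MAX_RETRIES   3
#define SWARM_RPC_RETRY_BASE_MS 500

/* Exponential backoff: the delay doubles per attempt (0 -> 500, 1 -> 1000, 2 -> 2000). */
uint32_t swarm_retry_delay_ms(unsigned attempt)
{
    return (uint32_t)SWARM_RPC_RETRY_BASE_MS << attempt;
}
```

A caller would loop over attempt = 0 .. SWARM_RPC_MAX_RETRIES - 1 and wait swarm_retry_delay_ms(attempt) after each failed send.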

Replace the routing skeleton with a working service registry and type-based message dispatch. Services register with a type_mask bitmap and their own message queue; gateway delivers incoming messages to all matching consumers. Add GW_MSG_AI_REQ message type for future AI request queuing.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>
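
As a rough illustration of type_mask dispatch, the sketch below delivers each message to every registered service whose mask contains the message's type bit. All names except GW_MSG_AI_REQ are hypothetical, the enum numbering is assumed, and the fixed-size array stands in for whatever RTOS message queue the gateway really uses.

```c
#include <stddef.h>
#include <stdint.h>

#define GW_MAX_SERVICES 8 /* hypothetical limits */
#define GW_QUEUE_DEPTH  4

/* Assumed numbering; only GW_MSG_AI_REQ is named in the commit message. */
enum gw_msg_type { GW_MSG_DATA = 0, GW_MSG_CTRL = 1, GW_MSG_AI_REQ = 2 };

typedef struct {
    uint8_t  type;    /* one of enum gw_msg_type */
    uint32_t payload; /* placeholder for the real message body */
} gw_msg_t;

/* Stand-in for an RTOS message queue owned by a service. */
typedef struct {
    gw_msg_t items[GW_QUEUE_DEPTH];
    size_t   count;
} gw_queue_t;

typedef struct {
    uint32_t    type_mask; /* bit N set => wants messages of type N */
    gw_queue_t *queue;
} gw_service_t;

static gw_service_t s_services[GW_MAX_SERVICES];
static size_t s_service_count;

/* Register a consumer; returns 0 on success, -1 if the registry is full. */
int gw_register_service(uint32_t type_mask, gw_queue_t *queue)
{
    if (queue == NULL || s_service_count >= GW_MAX_SERVICES) {
        return -1;
    }
    s_services[s_service_count++] = (gw_service_t){ type_mask, queue };
    return 0;
}

/* Copy msg into every matching service queue; returns the delivery count. */
int gw_dispatch(const gw_msg_t *msg)
{
    int delivered = 0;

    if (msg->type >= 32) {
        return 0; /* type does not fit in the 32-bit mask */
    }
    uint32_t bit = 1u << msg->type;

    for (size_t i = 0; i < s_service_count; i++) {
        gw_queue_t *q = s_services[i].queue;

        if (!(s_services[i].type_mask & bit)) {
            continue; /* service did not subscribe to this type */
        }
        if (q->count >= GW_QUEUE_DEPTH) {
            continue; /* queue full: drop for this consumer only */
        }
        q->items[q->count++] = *msg;
        delivered++;
    }
    return delivered;
}
```

One bitmap test per service keeps dispatch cheap, and a full queue on one consumer does not stop delivery to the others in this sketch.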

The serial shell processed backspace byte-by-byte, requiring multiple presses to delete a single CJK character (3-byte UTF-8). Fix all line-editing operations to be UTF-8 aware:
- Backspace: walk back over continuation bytes, delete entire sequence, erase correct column count (2 for CJK/emoji)
- Delete key: detect UTF-8 lead byte to determine sequence length
- Left/right arrows: skip complete UTF-8 sequences
- Character input: read all continuation bytes atomically before inserting, preventing partial-character echo

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>
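
The backspace and lead-byte rules above can be sketched as below. This is a simplified illustration with hypothetical function names, not the shell code: it handles only a cursor at end of line, and it assumes every 3- or 4-byte sequence occupies two terminal columns, which holds for CJK and most emoji but not for all such code points.

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
static int utf8_is_cont(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Sequence length implied by a lead byte; 1 for ASCII or an invalid lead. */
size_t utf8_seq_len(uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Index where the character ending just before pos starts. */
size_t utf8_prev_start(const char *buf, size_t pos)
{
    if (pos == 0) {
        return 0;
    }
    size_t i = pos - 1;
    size_t walked = 0;

    /* Walk back over at most 3 continuation bytes to reach the lead byte. */
    while (i > 0 && walked < 3 && utf8_is_cont((uint8_t)buf[i])) {
        i--;
        walked++;
    }
    return i;
}

/*
 * Delete the whole character before an end-of-line cursor.
 * Returns the number of terminal columns the caller should erase.
 */
int shell_backspace(char *buf, size_t *len)
{
    if (*len == 0) {
        return 0;
    }
    size_t start  = utf8_prev_start(buf, *len);
    size_t nbytes = *len - start;

    *len = start;
    buf[start] = '\0';
    /* Assumption: 3- and 4-byte sequences are double-width. */
    return nbytes >= 3 ? 2 : 1;
}
```

With the buffer holding "a" followed by a 3-byte CJK character, one call removes all three bytes and reports two columns; the next call removes "a" and reports one.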

The single AI worker thread dropped task callbacks when busy, starving lower-frequency tasks. With 3 tasks (10s, 15s, 30s), the 10s GPIO task monopolized the worker while the other two never executed.

Add a pending flag to sched_ai_ctx_t. When the worker is busy, mark the task as pending instead of discarding it. After each task completes, the worker scans all contexts in round-robin order and immediately executes the next pending task before sleeping on the semaphore.

This ensures all scheduled tasks eventually execute regardless of their interval, with zero additional memory or threads.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

The goto next_task loop never returned to sem_take because timer callbacks continuously set pending=1 during AI calls. With a 10s GPIO task taking ~5s per AI call, the worker drained one pending just as the next arrived, spinning forever and flooding the console.

Remove the goto loop. The worker now processes exactly one task per sem_take wakeup, then sets worker_busy=0 and sleeps. The callback does sem_give when marking pending, so the worker wakes up promptly for the next queued task without spinning.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>
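
The resulting "one task per wakeup" behaviour can be modelled in a few lines. The sketch below is a single-threaded simulation under stated assumptions: an integer counter stands in for the semaphore, the round-robin cursor and function names are hypothetical, and the callback gives the semaphore only when the task was not already pending. Only sched_ai_ctx_t, pending and worker_busy are taken from the commit messages.

```c
#include <stddef.h>

#define SCHED_MAX_TASKS 4 /* hypothetical table size */

typedef struct {
    int pending; /* set by the timer callback, cleared by the worker */
    int runs;    /* how many times the task has executed (for the demo) */
} sched_ai_ctx_t;

static sched_ai_ctx_t s_ctx[SCHED_MAX_TASKS];
static int    s_worker_busy;
static int    s_sem_count; /* stand-in for a counting semaphore */
static size_t s_next;      /* round-robin scan starts here */

/* Timer callback: queue the task instead of dropping it, then wake the worker. */
void sched_timer_cb(size_t id)
{
    if (!s_ctx[id].pending) {
        s_ctx[id].pending = 1;
        s_sem_count++; /* models sem_give */
    }
}

/*
 * One worker wakeup. Returns the id of the task that ran, or -1 when the
 * worker would block in sem_take because nothing is queued.
 */
int sched_worker_step(void)
{
    int ran = -1;

    if (s_sem_count == 0) {
        return -1; /* models blocking in sem_take */
    }
    s_sem_count--;
    s_worker_busy = 1;

    /* Scan from the cursor so a frequent task cannot starve the others. */
    for (size_t n = 0; n < SCHED_MAX_TASKS; n++) {
        size_t id = (s_next + n) % SCHED_MAX_TASKS;

        if (s_ctx[id].pending) {
            s_ctx[id].pending = 0;
            s_ctx[id].runs++; /* the AI call would happen here */
            s_next = (id + 1) % SCHED_MAX_TASKS;
            ran = (int)id;
            break; /* exactly one task per wakeup, no goto drain */
        }
    }

    s_worker_busy = 0;
    return ran;
}
```

In this model, if task 0 fires again while tasks 1 and 2 are still queued, the worker runs 1 and 2 before returning to 0, and it goes back to waiting as soon as the counter reaches zero.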

Reflect the recent architecture optimizations across all docs:

architecture.md (en/zh):
- Gateway: service registry with type_mask dispatch, AI_REQ type
- Scheduler: round-robin pending queue, task starvation prevention
- AI Engine: tool capability declarations (SWARM_CAP_*, LOCAL_ONLY)
- Swarm: 20B heartbeat with role/load, load-aware node selection, exponential-backoff RPC retry, required_caps matching
- Resource budget: updated to measured 43% usage (100KB free heap)

tuning.md (en/zh):
- ESP32-C3 memory section: measured runtime data, NET_RESP_MAX reduction (16KB->4KB), heap-allocated sched buffers
- Add SWARM_RPC_MAX_RETRIES and SWARM_RPC_RETRY_BASE_MS params

CLAUDE.md:
- Key Paths: gateway and tools descriptions updated

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

Implement the "cheap checks first" pattern used by OpenClaw and other Claw projects. When no events are pending, the heartbeat tick performs a lightweight LLM ping (max_tokens=1, ~200B request) instead of skipping entirely.

New ai_ping() in ai_engine sends a minimal API request without acquiring s_api_lock, so it never blocks interactive ai_chat() calls. Any HTTP response (including 4xx) counts as "online"; only network failures count as "offline".

State transitions (online<->offline) are logged and delivered to IM/console. heartbeat_llm_online() exposes the current state for other modules to query.

Ping thread uses a 4KB stack (vs 8KB for full heartbeat AI thread), keeping memory overhead minimal.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>
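
The online/offline classification and edge-triggered notification might look roughly like the sketch below. It makes several assumptions that the commit message does not confirm: the ping result is represented as an HTTP status code with a negative value for network failure, the state starts as "unknown", and the first observation does not count as a transition. Apart from heartbeat_llm_online(), the names are hypothetical.

```c
#include <stdbool.h>

typedef enum { LLM_UNKNOWN, LLM_ONLINE, LLM_OFFLINE } llm_state_t;

static llm_state_t s_llm_state = LLM_UNKNOWN;

/*
 * Classify a ping result. Assumed convention: http_status is the HTTP
 * status code, or a negative value when the request failed at the network
 * level. Any HTTP response, including 4xx, proves the API is reachable.
 */
bool ai_ping_result_online(int http_status)
{
    return http_status > 0;
}

/*
 * Record the latest ping result. Returns true only when the state flipped
 * between online and offline, i.e. when a notification should be sent.
 */
bool heartbeat_update_llm_state(int http_status)
{
    llm_state_t next = ai_ping_result_online(http_status) ? LLM_ONLINE
                                                          : LLM_OFFLINE;
    /* Assumption: the very first observation is not reported as a change. */
    bool changed = (s_llm_state != LLM_UNKNOWN) && (next != s_llm_state);

    s_llm_state = next;
    return changed;
}

/* Query used by other modules. */
bool heartbeat_llm_online(void)
{
    return s_llm_state == LLM_ONLINE;
}
```

Reporting only on transitions keeps the IM/console channel quiet during long stretches of steady connectivity, and a 401 from an expired key still counts as "online" because the endpoint answered.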

Move ai_boot_test_thread() inside the same #ifdef guards that protect its only call site (CONFIG_RTCLAW_AI_BOOT_TEST && no IM).

Remove http_get_test() from net_service.c, which was dead code with zero callers, left over from early bring-up debugging.

Signed-off-by: Chao Liu <chao.liu.zevorn@gmail.com>

zevorn added 9 commits March 16, 2026 10:10

zevorn merged commit e04a549 into main Mar 16, 2026
9 checks passed

zevorn deleted the feat/phase0-arch-optimize branch March 16, 2026 14:36
