
feat(voice): Phase 2 always-on listening engine + RPC (4/8 of #3307) #3343

Merged
senamakel merged 6 commits into tinyhumansai:main from M3gA-Mind:feat/voice-split-4-alwayson-core
Jun 4, 2026

@M3gA-Mind
Contributor

@M3gA-Mind M3gA-Mind commented Jun 4, 2026

Summary

Slice 4/8 of #3307: Phase 2 always-on listening engine + config + RPC.

  • Continuous cpal mic → VAD segmenter → STT → agent, no hotkey; opt-in via voice_server.always_on_enabled.
  • "Hey Tiny" wake word (English-forced STT + fuzzy match); screen-lock privacy pause.
  • Config schema + live-apply on the settings RPC + start_if_enabled wiring + a JSON-RPC roundtrip E2E.

Files (9)

voice/{always_on,audio_capture,mod}.rs, config/schema/voice_server.rs, config/{schemas,ops,ops_tests}.rs, credentials/ops.rs, tests/json_rpc_e2e.rs.

Part of the #3307 split. PR 3307 (72 files) is being replaced by 7 small, dependency-ordered PRs (merge-train). Branches are stacked; each PR's true slice is shown once its predecessors merge and it is rebased onto main.
Stacked on slice 3 (#3342).

Summary by CodeRabbit

  • New Features

    • Enabled always-on continuous voice listening (configurable)
    • Added custom wake-word support with fuzzy matching and whitespace handling
    • Exposed voice-activity-detection (VAD) tuning parameters for sensitivity and duration control
    • Automatic pause of voice capture when screen is locked for privacy
  • Tests

    • Added end-to-end test validating voice server settings persistence and defaults

M3gA-Mind added 4 commits June 4, 2026 14:09
Run enigo keyboard/mouse on the app main thread via a native-registry
executor; enigo's macOS TSMGetInputSourceProperty traps off-thread and
crashes the CEF host. Adds mouse/keyboard tools, the main_thread bridge,
and downscaled screenshots so the model can see them.

Slice 1/7 of tinyhumansai#3307 (was the 'computer control' area).
… loop

Adds the Rust-internal automate engine (poll-until-stable settle, playback
verification), the AXEnabled diagnostics field + settle primitives on
ax_interact, the Music fast-path, and the Windows UIA superset. Exposes
launch_platform as pub(crate) so the automate loop can launch apps mid-flow.

Slice 2/7 of tinyhumansai#3307 (accessibility/automate engine).
…trator

Registers the AutomateTool (multi-step UI flows in one call) and the
ax_interact denylist/opt-in plumbing; adds the catalog toggle, tool
definition, and orchestrator prompt guidance (automate + screenshot/
mouse/keyboard fallback for Electron apps with empty AX trees).

Slice 3/7 of tinyhumansai#3307 (tool wiring + prompts).
Continuous cpal mic → VAD segmenter → STT → agent with no hotkey, opt-in
via voice_server.always_on_enabled, 'Hey Tiny' wake word (English-forced
STT + fuzzy match), and screen-lock privacy pause. Adds the config schema,
live-apply on the settings RPC, start_if_enabled wiring, and a JSON-RPC
roundtrip E2E.

Slice 4/7 of tinyhumansai#3307 (always-on core).
@coderabbitai
Contributor

coderabbitai Bot commented Jun 4, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3593d6b0-537e-499b-9ede-1f6f98260324

📥 Commits

Reviewing files that changed from the base of the PR and between 9ba7d73 and a834a75.

📒 Files selected for processing (4)
  • src/openhuman/config/ops.rs
  • src/openhuman/config/schemas.rs
  • src/openhuman/credentials/ops.rs
  • src/openhuman/voice/always_on.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/openhuman/config/ops.rs
  • src/openhuman/config/schemas.rs
  • src/openhuman/voice/always_on.rs

📝 Walkthrough

Walkthrough

Adds an always-on listening feature: VAD-based speech segmentation, capture thread + resampling, WAV encoding and STT transcription, wake-word extraction with fuzzy matching, config schema fields (always_on_enabled, wake_word) and RPC wiring, login-time start/stop, macOS lock privacy hook, and tests.
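
The VAD-based segmentation named in the walkthrough can be pictured with a small state machine. This is only a sketch under assumptions, not the code in always_on.rs: it works on per-frame RMS values, counts in frames rather than the millisecond/second config fields, and every field and method name below is hypothetical. It also assumes hangover_frames is at least 1.

```rust
// Hypothetical sketch of an onset/hangover/min-duration/max-utterance segmenter.
// Names and frame-based units are assumptions, not the PR's actual API.
#[derive(Debug, Clone, Copy)]
pub struct VadConfig {
    pub onset_threshold: f32,      // RMS level at or above which a frame counts as speech
    pub hangover_frames: u32,      // consecutive quiet frames that end an utterance
    pub min_speech_frames: u32,    // utterances with fewer speech frames are dropped
    pub max_utterance_frames: u32, // force-flush once this many frames are buffered
}

#[derive(Debug, PartialEq)]
pub enum VadEvent {
    /// A completed utterance and the number of frames it spanned.
    Utterance { frames: u32 },
}

#[derive(Debug)]
pub struct VadSegmenter {
    cfg: VadConfig,
    in_speech: bool,
    total_frames: u32,
    speech_frames: u32,
    quiet_run: u32,
}

impl VadSegmenter {
    pub fn new(cfg: VadConfig) -> Self {
        Self { cfg, in_speech: false, total_frames: 0, speech_frames: 0, quiet_run: 0 }
    }

    /// Abort any in-progress utterance without emitting an event.
    pub fn reset(&mut self) {
        self.in_speech = false;
        self.total_frames = 0;
        self.speech_frames = 0;
        self.quiet_run = 0;
    }

    /// Feed the RMS of one frame; returns an event when an utterance completes.
    pub fn push(&mut self, rms: f32) -> Option<VadEvent> {
        let loud = rms >= self.cfg.onset_threshold;
        if !self.in_speech {
            if !loud {
                return None; // still idle
            }
            self.in_speech = true; // onset
        }
        self.total_frames += 1;
        if loud {
            self.speech_frames += 1;
            self.quiet_run = 0; // a short pause does not split the utterance
        } else {
            self.quiet_run += 1;
        }

        let ended = self.quiet_run >= self.cfg.hangover_frames;
        let too_long = self.total_frames >= self.cfg.max_utterance_frames;
        if !(ended || too_long) {
            return None;
        }
        // Short blips are dropped; everything else is flushed as one utterance.
        let event = if self.speech_frames >= self.cfg.min_speech_frames {
            Some(VadEvent::Utterance { frames: self.total_frames })
        } else {
            None
        };
        self.reset();
        event
    }
}
```

Keeping the segmenter free of I/O, as here, is what makes the short-blip, pause, and force-flush cases easy to unit-test in isolation.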

Changes

Always-On Voice Listening Feature

| Layer / File(s) | Summary |
| --- | --- |
| Voice server configuration schema and defaults: src/openhuman/config/schema/voice_server.rs | VoiceServerConfig adds always_on_enabled, VAD params (vad_onset_threshold, vad_hangover_ms, vad_min_speech_ms, vad_max_utterance_secs) and wake_word with serde defaults and unit tests; Default impl updated. |
| Audio capture utilities for always-on pipeline: src/openhuman/voice/audio_capture.rs | TARGET_SAMPLE_RATE made pub(crate); chunk_rms, to_mono, resample made pub(crate); new encode_wav_16k() added to emit 16kHz mono 16-bit PCM WAV bytes. |
| Always-on VAD, capture, and transcription engine: src/openhuman/voice/always_on.rs | Implements VadConfig, VadSegmenter, VadEvent, capture thread (CPAL) and frame pipeline, pause on screen lock, utterance buffering, WAV→base64→STT transcription, extract_command wake-word fuzzy gate, macOS lock FFI, start/stop APIs, and comprehensive unit tests. |
| RPC/config API endpoints for always-on settings: src/openhuman/config/ops.rs, src/openhuman/config/schemas.rs | VoiceServerSettingsPatch extended with always_on_enabled and wake_word; GET/SET handlers updated to include/persist these fields; handler triggers voice::always_on::start_if_enabled after apply. |
| Module wiring and login-time initialization: src/openhuman/voice/mod.rs, src/openhuman/credentials/ops.rs | pub mod always_on; added; login startup calls start_if_enabled and logout calls stop() to gate runtime capture. |
| Unit and E2E tests: src/openhuman/config/ops_tests.rs, tests/json_rpc_e2e.rs | Config tests updated to set new patch fields; new E2E test round-trips defaults and updates for always_on_enabled and wake_word. |
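
For readers unfamiliar with what "16kHz mono 16-bit PCM WAV bytes" means on the wire, here is a rough stdlib-only sketch of such an encoder. It is not the PR's encode_wav_16k(); the function name, the f32 input type, and the clamping/scaling choice are assumptions made for illustration.

```rust
/// Hypothetical sketch: encode mono f32 samples in [-1.0, 1.0] as a
/// 16 kHz, 16-bit PCM WAV file (canonical 44-byte RIFF header + data).
/// Not the PR's encode_wav_16k(); input type and scaling are assumptions.
pub fn encode_wav_16k_sketch(samples: &[f32]) -> Vec<u8> {
    const SAMPLE_RATE: u32 = 16_000;
    const CHANNELS: u16 = 1;
    const BITS_PER_SAMPLE: u16 = 16;

    let block_align: u16 = CHANNELS * (BITS_PER_SAMPLE / 8); // bytes per frame
    let byte_rate: u32 = SAMPLE_RATE * u32::from(block_align);
    let data_len = (samples.len() * 2) as u32; // 2 bytes per sample

    let mut out = Vec::with_capacity(44 + samples.len() * 2);
    // RIFF chunk descriptor
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data_len).to_le_bytes());
    out.extend_from_slice(b"WAVE");
    // "fmt " sub-chunk (16 bytes, PCM)
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes());
    out.extend_from_slice(&1u16.to_le_bytes()); // audio format 1 = PCM
    out.extend_from_slice(&CHANNELS.to_le_bytes());
    out.extend_from_slice(&SAMPLE_RATE.to_le_bytes());
    out.extend_from_slice(&byte_rate.to_le_bytes());
    out.extend_from_slice(&block_align.to_le_bytes());
    out.extend_from_slice(&BITS_PER_SAMPLE.to_le_bytes());
    // "data" sub-chunk
    out.extend_from_slice(b"data");
    out.extend_from_slice(&data_len.to_le_bytes());
    for &s in samples {
        // Clamp to avoid wrap-around, then scale to the i16 range.
        let v = (s.clamp(-1.0, 1.0) * f32::from(i16::MAX)) as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

Under these assumptions, one second of audio (16,000 samples) produces 44 + 32,000 bytes, which is what would then be base64-encoded for the STT request.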

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant RPCHandler
  participant ConfigOps
  participant VoiceAlwaysOn
  Client->>RPCHandler: config_update_voice_server_settings
  RPCHandler->>ConfigOps: apply_voice_server_settings (includes always_on_enabled, wake_word)
  ConfigOps->>ConfigOps: persist trimmed wake_word, always_on flag
  ConfigOps-->>RPCHandler: apply result
  RPCHandler->>VoiceAlwaysOn: start_if_enabled(config)
  VoiceAlwaysOn-->>RPCHandler: started/stopped
  RPCHandler-->>Client: success
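
The "persist trimmed wake_word, always_on flag" step in the diagram might look roughly like the following. This is a sketch only: the struct names are hypothetical stand-ins for VoiceServerConfig and VoiceServerSettingsPatch, and it assumes wake_word is stored as a plain String where the empty string means "no wake word".

```rust
// Hypothetical stand-ins for the real config and patch types.
#[derive(Debug, Clone, PartialEq)]
pub struct VoiceServerSketch {
    pub always_on_enabled: bool,
    pub wake_word: String, // assumption: empty string = no wake word
}

#[derive(Debug, Default)]
pub struct SettingsPatchSketch {
    pub always_on_enabled: Option<bool>,
    pub wake_word: Option<String>,
}

/// Apply only the fields present in the patch, normalizing the wake word.
pub fn apply_patch(config: &mut VoiceServerSketch, patch: SettingsPatchSketch) {
    if let Some(enabled) = patch.always_on_enabled {
        config.always_on_enabled = enabled;
    }
    if let Some(wake) = patch.wake_word {
        // Trimming makes whitespace-only input collapse to the empty string,
        // so it cannot persist as a non-empty wake word that never matches.
        config.wake_word = wake.trim().to_string();
    }
}
```

Fields left as None in the patch keep their current value, so a client can toggle always_on_enabled without resending the wake word.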

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • graycyrus
  • senamakel

Poem

🐇 I listen softly, waiting for the cue,
RMS hums and VAD wakes up anew,
Trimmed wake words whisper, fuzzy but kind,
WAVs and transcripts bring the command to mind,
Ever ready—always-on—ears for you.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: implementing Phase 2 always-on listening engine with RPC integration, which is the core focus across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@M3gA-Mind
Contributor Author

📚 Stacked PR series (8 total) — split from #3307

Merge bottom-up; each builds on the one above it:

  1. feat(computer): main-thread synthetic-input executor + CEF crash fix (1/8 of #3307) #3340 — main-thread synthetic-input executor + CEF crash fix
  2. feat(accessibility): AX/UIA perception + automate engine (2/8 of #3307) #3341 — AX/UIA perception + automate engine
  3. feat(agent): wire automate/ax_interact computer tools (3/8 of #3307) #3342 — wire automate/ax_interact computer tools
  4. feat(voice): Phase 2 always-on listening engine + RPC (4/8 of #3307) #3343 — Phase 2 always-on listening engine + RPC
  5. feat(voice): always-on Settings toggle + debug panel + i18n (5/8 of #3307) #3344 — always-on Settings toggle + debug panel + i18n
  6. feat(notch): always-visible macOS notch status pill (6/8 of #3307) #3345 — always-visible macOS notch status pill
  7. feat(voice): Phase 3 fast command router (7/8 of #3307) #3346 — Phase 3 fast command router
  8. feat(accessibility): vision-click fallback for Electron/partial-AX apps (8/8 of #3307) #3362 — vision-click fallback for Electron/partial-AX apps (Phase 1.5 complete)

Tracker: docs/voice-system-actions.md.

Take main's versions of already-merged slice-1/2/3 files.
@M3gA-Mind M3gA-Mind marked this pull request as ready for review June 4, 2026 13:46
@M3gA-Mind M3gA-Mind requested a review from a team June 4, 2026 13:46
@coderabbitai coderabbitai Bot added feature Net-new user-facing capability or product behavior. rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. labels Jun 4, 2026
@M3gA-Mind
Contributor Author

Independent review (beyond the CodeRabbit pass)

Reviewed the always-on listening core — the VadSegmenter, continuous cpal capture, wake-word gating, screen-lock privacy pause, and the config/RPC wiring.

Reviewed clean

  • extract_command (wake word) — case/punctuation-insensitive tokenization, anchors on the longest wake token and fuzzy-matches it (levenshtein) within the first 3 tokens so STT homophones ("tiny"→"tony/tinny", "hey"→"a/ok") still trigger. The wake.iter().max_by_key(..).unwrap() at L416 is safe — the empty-wake-word case returns early at L402. Bounded match window avoids mid-sentence false triggers. Well covered (homophones / absent / empty-passes-everything).
  • VadSegmenter — onset/hangover/min-duration/max-utterance state machine is pure and unit-tested (short-blip drop, mid-utterance pause doesn't split, max-utterance force-flush, reset aborts without event).
  • Capture — cpal runs on a dedicated OS thread (not the tokio runtime); samples flow over an unbounded mpsc to an async processor. Screen-lock watcher gates delivery (macOS CGSessionCopyCurrentDictionary), so audio isn't transcribed while locked.
  • Config/RPC: always_on_enabled + wake_word thread through the voice-server schema and apply live via start_if_enabled; covered by the JSON-RPC E2E (json_rpc_voice_server_settings_roundtrip_always_on_and_wake_word).
  • Opt-in + off by default; start_if_enabled is idempotent (process-once guard).
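
To make the wake-word gate described in the first bullet concrete, here is a minimal sketch of the idea. It is not the PR's extract_command: the function names, the edit-distance threshold of 1, and returning the command as normalized (lowercased, punctuation-stripped) tokens are all assumptions for illustration.

```rust
/// Lowercase, strip punctuation, split on whitespace.
fn tokens(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| {
            w.chars()
                .filter(|c| c.is_alphanumeric())
                .flat_map(|c| c.to_lowercase())
                .collect::<String>()
        })
        .filter(|w| !w.is_empty())
        .collect()
}

/// Classic two-row Levenshtein edit distance over chars.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Hypothetical sketch of a wake-word gate. Returns the command text after
/// the wake word, or None when the wake word is not heard near the start.
/// An empty wake word lets every transcript through.
pub fn extract_command_sketch(transcript: &str, wake_word: &str) -> Option<String> {
    let wake = tokens(wake_word);
    // Anchor on the longest wake token: it is the most distinctive one.
    let anchor = match wake.iter().max_by_key(|t| t.chars().count()) {
        Some(t) => t,
        None => return Some(transcript.trim().to_string()),
    };
    let heard = tokens(transcript);
    // Only the first 3 tokens are searched, to avoid mid-sentence triggers.
    // The threshold of 1 edit is an assumption, not the PR's value.
    let hit = heard
        .iter()
        .take(3)
        .position(|t| levenshtein(t, anchor) <= 1)?;
    Some(heard[hit + 1..].join(" "))
}
```

With the wake word "Hey Tiny", the anchor is "tiny", so a transcript such as "Hey Tony, open Music." passes the gate (one substitution) and yields "open music", while "what is the weather tiny" is rejected because the match falls outside the first three tokens.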

No correctness issues found. LGTM once CI is green.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
src/openhuman/config/schema/voice_server.rs (1)

165-171: ⚡ Quick win

Assert all newly added defaults in the legacy-deserialization test.

This test currently verifies only a subset of the new Phase-2 fields. Adding assertions for vad_onset_threshold, vad_max_utterance_secs, and wake_word will better protect backward-compat deserialization.

Proposed test hardening
 #[test]
 fn deserializes_with_all_vad_fields_defaulted() {
     // An older config file with none of the Phase 2 keys must still load.
     let c: VoiceServerConfig = serde_json::from_str("{}").unwrap();
     assert!(!c.always_on_enabled);
+    assert_eq!(c.vad_onset_threshold, default_vad_onset_threshold());
     assert_eq!(c.vad_hangover_ms, default_vad_hangover_ms());
     assert_eq!(c.vad_min_speech_ms, default_vad_min_speech_ms());
+    assert_eq!(c.vad_max_utterance_secs, default_vad_max_utterance_secs());
+    assert_eq!(c.wake_word, default_wake_word());
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/config/schema/voice_server.rs` around lines 165 - 171, The
legacy-deserialization test deserializes an empty JSON but only asserts a subset
of new Phase-2 default VAD fields; update the test function
deserializes_with_all_vad_fields_defaulted() to also assert that
c.vad_onset_threshold == default_vad_onset_threshold(), c.vad_max_utterance_secs
== default_vad_max_utterance_secs(), and c.wake_word == default_wake_word() (or
the appropriate default accessor used elsewhere) so all newly added defaults are
verified on legacy deserialization.
src/openhuman/config/ops_tests.rs (1)

1046-1048: ⚡ Quick win

Assert the new always-on fields in this roundtrip test.

The patch now sets always_on_enabled and wake_word, but the test doesn’t verify those values in the returned/saved config.

✅ Suggested assertion additions
     let outcome = load_and_apply_voice_server_settings(patch)
         .await
         .expect("ok");
+    assert_eq!(
+        outcome.value["config"]["voice_server"]["always_on_enabled"],
+        serde_json::json!(true)
+    );
+    assert_eq!(
+        outcome.value["config"]["voice_server"]["wake_word"],
+        serde_json::json!("Hey Tiny")
+    );
     assert!(
         outcome.value["config"]["voice_server"]["min_duration_secs"]
             .as_f64()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/config/ops_tests.rs` around lines 1046 - 1048, The test that
performs the config roundtrip must assert the newly-set fields
`always_on_enabled` and `wake_word`; after you load or receive the roundtripped
config (the variable holding the returned/saved config in the roundtrip test),
add assertions that `always_on_enabled` equals Some(true) and `wake_word` equals
Some("Hey Tiny".to_string()) to ensure those values are preserved by the
roundtrip logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/config/ops.rs`:
- Around line 2044-2046: The code assigns update.wake_word directly to
config.voice_server.wake_word, allowing whitespace-only values to persist;
change the assignment to normalize by trimming surrounding whitespace from
update.wake_word (use update.wake_word.trim()), and if the trimmed string is
empty treat it as the documented “no wake word” (set
config.voice_server.wake_word to the empty string or None consistent with its
type); update the logic around update.wake_word and
config.voice_server.wake_word to use the trimmed value so whitespace-only inputs
do not become a non-empty wake word.

In `@src/openhuman/config/schemas.rs`:
- Around line 1731-1736: The current code calls
config_rpc::load_and_apply_voice_server_settings(patch).await? then attempts to
refresh live state with config_rpc::load_config_with_timeout().await but
silently ignores failures to reload; change this to log a grep-friendly error
when load_config_with_timeout() fails and only call
crate::openhuman::voice::always_on::start_if_enabled(&config).await on success;
specifically, replace the if let Ok(config) =
config_rpc::load_config_with_timeout().await block with explicit match handling
that logs an error (including the returned error) using a stable prefix like
"voice:reload:error" and, on Ok(config), logs a "voice:reload:ok" message before
calling always_on::start_if_enabled to ensure failures are visible and state
transitions are traceable.

In `@src/openhuman/credentials/ops.rs`:
- Around line 44-47: The always-on voice service is started in start_if_enabled
but never stopped during logout; update the logout teardown
(stop_login_gated_services) to symmetrically stop or disable the always-on
service by invoking the appropriate shutdown API on
crate::openhuman::voice::always_on (e.g., add a stop_if_enabled/stop function or
call the existing stop method) so microphone capture is explicitly halted on
logout; locate start_if_enabled in the always_on module and add the matching
stop call in stop_login_gated_services to ensure resources are released and
VAD/STT are disabled.

In `@src/openhuman/voice/always_on.rs`:
- Around line 224-230: The code returns early when RUNNING is true, skipping
refresh of runtime settings so the process keeps using the original
VadConfig/Config; fix by ensuring settings are refreshed before the early return
or by applying them to the running processor: move the
VadConfig::from_server_config(&app_config.voice_server) and let config =
app_config.clone() to occur before the RUNNING.swap(...) check (or,
alternatively, call a runtime update method on the running processor with the
new VadConfig/Config when RUNNING was already true), so wake-word/VAD changes
take effect without restart.

---

Nitpick comments:
In `@src/openhuman/config/ops_tests.rs`:
- Around line 1046-1048: The test that performs the config roundtrip must assert
the newly-set fields `always_on_enabled` and `wake_word`; after you load or
receive the roundtripped config (the variable holding the returned/saved config
in the roundtrip test), add assertions that `always_on_enabled` equals
Some(true) and `wake_word` equals Some("Hey Tiny".to_string()) to ensure those
values are preserved by the roundtrip logic.

In `@src/openhuman/config/schema/voice_server.rs`:
- Around line 165-171: The legacy-deserialization test deserializes an empty
JSON but only asserts a subset of new Phase-2 default VAD fields; update the
test function deserializes_with_all_vad_fields_defaulted() to also assert that
c.vad_onset_threshold == default_vad_onset_threshold(), c.vad_max_utterance_secs
== default_vad_max_utterance_secs(), and c.wake_word == default_wake_word() (or
the appropriate default accessor used elsewhere) so all newly added defaults are
verified on legacy deserialization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 315fc0e5-c3ab-4ba7-a34d-f7e429a45406

📥 Commits

Reviewing files that changed from the base of the PR and between cd31484 and 9ba7d73.

📒 Files selected for processing (9)
  • src/openhuman/config/ops.rs
  • src/openhuman/config/ops_tests.rs
  • src/openhuman/config/schema/voice_server.rs
  • src/openhuman/config/schemas.rs
  • src/openhuman/credentials/ops.rs
  • src/openhuman/voice/always_on.rs
  • src/openhuman/voice/audio_capture.rs
  • src/openhuman/voice/mod.rs
  • tests/json_rpc_e2e.rs

Comment thread src/openhuman/config/ops.rs
Comment thread src/openhuman/config/schemas.rs
Comment thread src/openhuman/credentials/ops.rs
Comment thread src/openhuman/voice/always_on.rs
Comment thread src/openhuman/voice/always_on.rs
…n logout

CodeRabbit tinyhumansai#3343:
- config/ops.rs: trim wake_word so whitespace-only collapses to the
  documented 'empty = no wake word' case.
- config/schemas.rs: match on the live-apply config reload — warn (don't
  silently skip) when it fails so the saved toggle's non-application is traceable.
- credentials/ops.rs: add symmetric always_on::stop() on logout so mic
  capture stops transcribing/delivering after logout (privacy). New
  always_on::stop() flips the ENABLED gate; covered by a unit test.
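
As a rough illustration of how a process-once start guard and a separate delivery gate can coexist, consider the sketch below. It is not the code in always_on.rs: it wraps the flags in a struct instead of module-level statics so it can be exercised in isolation, and the type, method names, and spawn counter are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical sketch: `running` guards the one-time capture spawn,
/// `enabled` gates whether captured audio is transcribed/delivered.
pub struct AlwaysOnGate {
    running: AtomicBool,
    enabled: AtomicBool,
    spawns: AtomicUsize, // stand-in for "capture thread was spawned"
}

impl AlwaysOnGate {
    pub const fn new() -> Self {
        Self {
            running: AtomicBool::new(false),
            enabled: AtomicBool::new(false),
            spawns: AtomicUsize::new(0),
        }
    }

    /// Idempotent: refreshes the gate on every call, spawns capture at most once.
    pub fn start_if_enabled(&self, always_on_enabled: bool) {
        self.enabled.store(always_on_enabled, Ordering::SeqCst);
        if !always_on_enabled {
            return;
        }
        // swap returns the previous value; true means capture already exists.
        if self.running.swap(true, Ordering::SeqCst) {
            return;
        }
        self.spawns.fetch_add(1, Ordering::SeqCst); // real code would spawn the capture thread here
    }

    /// Called on logout: closes the gate so nothing is delivered afterwards.
    pub fn stop(&self) {
        self.enabled.store(false, Ordering::SeqCst);
    }

    /// The processing loop would check this before sending audio to STT.
    pub fn should_deliver(&self) -> bool {
        self.running.load(Ordering::SeqCst) && self.enabled.load(Ordering::SeqCst)
    }

    pub fn spawn_count(&self) -> usize {
        self.spawns.load(Ordering::SeqCst)
    }
}
```

In this arrangement a logout followed by a new login reopens the gate without spawning a second capture thread, which is the property the symmetric stop() on logout relies on.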
@senamakel senamakel merged commit f5dc9ea into tinyhumansai:main Jun 4, 2026
22 checks passed

Labels

agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. feature Net-new user-facing capability or product behavior. rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.


2 participants