v0.7.0

@sboily released this 15 May 15:48 · 106 commits to main since this release

First stable release after the 0.7.0a1 through 0.7.0a18 alpha series. Major architectural changes since 0.6.13 (2026-03-05).

pip install roomkit==0.7.0

Highlights

  • Real-time speech-to-speech AI is the headline feature. New RealtimeVoiceChannel wraps OpenAI Realtime, Gemini Live, xAI, ElevenLabs, Anam, and PersonaPlex behind one Channel ABC. Ten-mixin architecture (_realtime_audio, _realtime_tools, _realtime_speech, _realtime_skills, _realtime_transcription, _realtime_response, _realtime_tool_search, _realtime_tool_recovery, _realtime_context, _skill_handlers) keeps each concern focused.
  • Tool Search for tool-heavy realtime sessions. find_tools(query) + list_tools keep the active tool surface under the limit of roughly 20 tools, below which Gemini Live function-calling stays reliable, while exposing thousands of tools dynamically via provider.reconfigure.
  • Skill delivery modes: on_demand vs inline_full. Handles providers that can't reconfigure mid-session (Gemini 3.x) by baking skill bodies into system_instruction at session start.
  • Carrier-grade SIP: NAT traversal via advertised_ip, BYE routing fixed for inbound calls behind SBCs, RFC 3326 Reason header parse + emit, runtime auth resolver (set_auth_resolver), runtime invite filter (set_invite_filter), PSTN-compatibility knobs for outbound dial.
  • Orchestration: Supervisor strategy with sequential / parallel / auto_delegate execution + async_delivery for non-blocking pipelines. HandoffHandler state machine. Loop producer/reviewer pattern. All wired to kit.status_bus for observable multi-agent flows.
  • Video / vision / avatar: vision providers (OpenAI, Gemini), avatar providers (MuseTalk lip-sync, WebSocket, Anam cloud), video filters (watermark, YOLO, censor, MediaPipe face-touch detection), screen capture + control tools (DescribeScreenTool, ScreenInputTools), webcam capture (DescribeWebcamTool), PyAV recorder with A/V sync, video bridge.
  • Storage: PostgresStore v2 relational schema with proper indexes (replacing JSONB blobs). PostgresKnowledgeSource for full-text retrieval. SummarizingMemory + RetrievalMemory providers.
  • Delivery backends: pluggable InMemoryDeliveryBackend and RedisDeliveryBackend (Streams + consumer groups) so deliveries survive process restarts and scale across workers.
  • Twilio Media Streams voice backend with stateful soxr resampling and pure-Python G.711 mu-law codec — no audioop dependency.
  • Quality: ON_AI_RESPONSE + ON_FEEDBACK hooks, ConversationScorer ABC, ScoringHook, QualityTracker reports.
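
The pure-Python G.711 mu-law codec mentioned for the Twilio backend can be illustrated with the textbook algorithm. This is a minimal sketch, not RoomKit's implementation: the names linear_to_ulaw, ulaw_to_linear, encode_pcm16, and decode_ulaw are hypothetical, and the input is assumed to be 16-bit little-endian PCM.

```python
import struct

BIAS = 0x84   # 132, added to the magnitude before the segment is chosen
CLIP = 32635  # largest magnitude for which magnitude + BIAS fits in 15 bits


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit, 7 down to 0.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # G.711 transmits the bitwise complement of sign/exponent/mantissa.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte: int) -> int:
    """Decode one mu-law byte back to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude


def encode_pcm16(pcm: bytes) -> bytes:
    """Encode little-endian 16-bit PCM bytes to mu-law bytes (2:1 compression)."""
    return bytes(linear_to_ulaw(s) for (s,) in struct.iter_unpack("<h", pcm))


def decode_ulaw(data: bytes) -> bytes:
    """Decode mu-law bytes to little-endian 16-bit PCM bytes."""
    return struct.pack(f"<{len(data)}h", *(ulaw_to_linear(b) for b in data))
```

Because the codec is lossy, decoding and re-encoding a mu-law byte returns the same byte, but encoding and decoding a PCM sample generally returns only a nearby value.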

Migration from 0.6.x

Removed APIs (BREAKING)

  • kit.connect_voice / kit.disconnect_voice / kit.connect_video / kit.disconnect_video / kit.bind_voice_session / kit.connect_realtime_voice / kit.disconnect_realtime_voice → use kit.join(...) and kit.leave(session).
  • RoomKit(stt=..., tts=..., voice=...) constructor parameters → pass providers to VoiceChannel(stt=..., tts=..., backend=...) directly. kit.stt / kit.tts / kit.voice properties now look up from registered channels.
  • Top-level from roomkit import … exports slimmed from 399 to 66. Providers, voice/video types, mocks, recording, orchestration, and telemetry must be imported from their subpackages (e.g. from roomkit.providers.anthropic.ai import AnthropicAIProvider).
  • HookTrigger.ON_REALTIME_TOOL_CALL → renamed to HookTrigger.ON_TOOL_CALL. Event payload is now a channel-agnostic ToolCallEvent. Return results via HookResult(action="allow", metadata={"result": ...}).
  • Tool handler signature: 3-arg (session, name, arguments) → 2-arg (name, arguments). Use get_current_voice_session() contextvar for session access in voice tool handlers.
  • audit_realtime_tool_handler → use audit_tool_handler (now channel-agnostic).
  • parse_voicemeup_webhook() / configure_voicemeup_mms() module-level functions → per-instance provider.parse_inbound(payload, channel_id) / provider.configure_mms(...). Enables multi-tenant isolation.
  • GeminiLiveProvider.prime_realtime_input() → provider.start_audio_stream(session) (also exposed on RealtimeVoiceChannel.inject_text(..., start_audio_stream=True)).
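
The move from 3-arg to 2-arg tool handlers can be bridged with a small adapter while legacy handlers are being ported. The sketch below uses only the standard library: _current_session and current_session_standin are hypothetical stand-ins for the get_current_voice_session() accessor named above (whose import path is not given in these notes), and handlers are assumed to be plain synchronous callables returning a string.

```python
import contextvars
import json
from typing import Any, Callable

# Stand-in for the library's session contextvar (assumption, not RoomKit code).
_current_session: contextvars.ContextVar[Any] = contextvars.ContextVar(
    "current_session", default=None
)


def current_session_standin() -> Any:
    """Return whatever session is bound in the current context, or None."""
    return _current_session.get()


LegacyHandler = Callable[[Any, str, dict], str]  # (session, name, arguments)
NewHandler = Callable[[str, dict], str]          # (name, arguments)


def adapt_legacy_handler(legacy: LegacyHandler) -> NewHandler:
    """Wrap a 3-arg handler so it can be called with the 2-arg signature."""
    def handler(name: str, arguments: dict) -> str:
        # The session is no longer passed in; read it from the contextvar.
        return legacy(current_session_standin(), name, arguments)
    return handler


def old_style_handler(session: Any, name: str, arguments: dict) -> str:
    return json.dumps({"session": session, "tool": name, "args": arguments})


def call_with_session(session: Any, handler: NewHandler, name: str, arguments: dict) -> str:
    """Simulate the framework binding a session around a tool call."""
    token = _current_session.set(session)
    try:
        return handler(name, arguments)
    finally:
        _current_session.reset(token)


result = call_with_session(
    "session-42", adapt_legacy_handler(old_style_handler), "get_weather", {"city": "Paris"}
)
```

If handlers are coroutines in a given setup, the same pattern applies with async def, since context variables propagate into awaited calls.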

Behavior changes

  • Recording is opt-out, not opt-in. Rooms with recorders now capture every attached channel by default. Disable per-channel with ChannelRecordingConfig(audio=False, video=False). Recording now captures both inbound (mic) and outbound (TTS) audio mixed into a single track.
  • Tool protocol is the standard tool registration path. Pass any object with .definition: dict and .handler(name, args) -> str via tools=[my_tool]. The legacy tool_handler= parameter still exists for MCP / audit middleware.
  • PostgresStore is now relational (schema v2). v1 JSONB-blob databases are auto-migrated on first connect; the migration drops the old data columns and rebuilds the relational schema.
  • OpenAIRealtimeProvider honours input_sample_rate / output_sample_rate. PCM is only accepted at 24 kHz by the GA API; invalid rates now raise ValueError at construction.
  • audioop dependency removed. Replaced with pure-Python G.711 codec + linear interpolation resampler — runs on Python 3.13+ without audioop-lts.
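
The Tool protocol described above asks only for an object with a .definition dict and a .handler(name, args) -> str method. A minimal sketch of such an object follows; ToolLike and EchoTool are hypothetical names, and the function-schema layout inside definition is an assumption, since these notes do not specify the expected keys.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ToolLike(Protocol):
    """Structural shape taken from the release note; not a RoomKit class."""

    definition: dict[str, Any]

    def handler(self, name: str, args: dict[str, Any]) -> str: ...


def _echo_definition() -> dict[str, Any]:
    # Assumed JSON-schema style layout; adjust to whatever the channel expects.
    return {
        "name": "echo",
        "description": "Return the text that was passed in.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    }


@dataclass
class EchoTool:
    definition: dict[str, Any] = field(default_factory=_echo_definition)

    def handler(self, name: str, args: dict[str, Any]) -> str:
        # Results are returned as a string; JSON is used here for convenience.
        if name != self.definition["name"]:
            return json.dumps({"error": f"unknown tool: {name}"})
        return json.dumps({"echo": args.get("text", "")})


my_tool = EchoTool()
```

An object like my_tool would then be passed via tools=[my_tool], as the note describes.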

Security

Five vulnerabilities closed in 3cd5124 immediately before the release.

  • HTTP webhook SSRF guard hardened (HTTPProviderConfig.webhook_url). Previous validator missed 127.1, 2130706433, 0x7f000001, localhost., and any hostname whose A record points to RFC 1918 / loopback / link-local. New roomkit.providers.url_safety.validate_public_url normalizes IPv4 numeric forms, strips trailing-dot DNS, and resolves every A/AAAA record at validation time.
  • DeepgramSTTProvider no longer fetches AudioContent.url server-side. Switched to Deepgram's native transcribe_url so the fetch happens from Deepgram's network, not ours. Closes an SSRF vector reachable from any inbound webhook channel.
  • PersonaPlexConfig.ssl_verify default flipped from False to True. Local self-signed dev must now pass ssl_verify=False explicitly.
  • Telnyx webhook signatures now check timestamp freshness. Signatures whose timestamp is more than 300 s away from the current clock are rejected; the window is configurable via tolerance_seconds. Closes an indefinite replay window.
  • DescribeWebcamTool no longer exposes save_path to the AI. Operator-controlled save_dir at construction; handler auto-generates filenames. Closes a prompt-injection → arbitrary-file-write primitive. save_path in tool arguments is now silently ignored.
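
The SSRF bypass forms listed in the first item (127.1, 2130706433, 0x7f000001, localhost.) all reduce to the same loopback address once normalized. The sketch below shows one way such a check could work using only the standard library. It is not the code in roomkit.providers.url_safety: parse_ipv4_numeric and check_public_url are hypothetical names, and the resolve parameter is an assumption added so the check can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, Optional
from urllib.parse import urlsplit


def parse_ipv4_numeric(host: str) -> Optional[ipaddress.IPv4Address]:
    """Normalize inet_aton-style forms (127.1, 2130706433, 0x7f000001)."""
    parts = host.split(".")
    if not 1 <= len(parts) <= 4:
        return None
    nums = []
    for part in parts:
        if not part or not part[0].isdigit():
            return None
        base = 16 if part.lower().startswith("0x") else 8 if len(part) > 1 and part[0] == "0" else 10
        try:
            nums.append(int(part, base))
        except ValueError:
            return None
    *head, last = nums
    # Leading parts are single bytes; the last part fills the remaining bytes.
    if any(n > 255 for n in head) or last >= 256 ** (4 - len(head)):
        return None
    value = last
    for index, n in enumerate(head):
        value |= n << (8 * (3 - index))
    return ipaddress.IPv4Address(value)


def _default_resolve(host: str) -> list:
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def check_public_url(url: str, resolve: Callable[[str], list] = _default_resolve) -> None:
    """Raise ValueError unless every address behind the URL is globally routable."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    host = (parts.hostname or "").rstrip(".")  # strip trailing-dot DNS form
    if not host:
        raise ValueError("URL has no host")
    literal = parse_ipv4_numeric(host)
    if literal is None:
        try:
            literal = ipaddress.ip_address(host)  # IPv6 literals such as ::1
        except ValueError:
            literal = None
    if literal is not None:
        candidates = [literal]
    else:
        # Drop any IPv6 zone suffix ("%eth0") before parsing resolved addresses.
        candidates = [ipaddress.ip_address(a.split("%")[0]) for a in resolve(host)]
        if not candidates:
            raise ValueError(f"{host} did not resolve")
    for ip in candidates:
        if not ip.is_global:
            raise ValueError(f"{host} maps to non-public address {ip}")
```

A check of this kind validates the addresses seen at validation time only; a hostname whose records change afterwards would need a further guard at connect time.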

Full per-PR detail

See CHANGELOG.md for the granular 0.7.0a1 through 0.7.0a18 entries.

Full compare: v0.6.13...v0.7.0