Talk to an AI out loud and hear it talk back, running locally on your own GPU. Total turn latency is cut from 13.65s to ~8s (~40% faster), with audio that starts playing in ~0.5s.
Zencia Vocalis is a full two-way voice assistant pipeline built on top of Microsoft's VibeVoice real-time text-to-speech model.
VibeVoice on its own only does text → speech. Zencia Vocalis turns it into a complete spoken conversation loop and then optimizes that loop for low latency:
You speak → it transcribes you → an LLM thinks → it speaks the answer back
No cloud TTS, no audio leaving your machine for synthesis: the voice runs locally on your GPU. Only the LLM step calls an external API, and that is optional: you can swap in a fully local model.
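The loop above can be sketched as four pluggable stages. This is a minimal illustration, not the project's actual code: `run_conversation`, the stage type aliases, and `STOP_WORDS` are hypothetical names, and the stages are stubbed so the sketch runs without a microphone, model, or GPU.

```python
from typing import Callable, List, Tuple

# Each stage is a plain callable so any backend can be swapped in.
# All names here are illustrative assumptions, not the project's real API.
Record = Callable[[], bytes]          # microphone -> raw audio
Transcribe = Callable[[bytes], str]   # raw audio -> text
Think = Callable[[str], str]          # user text -> reply text
Speak = Callable[[str], None]         # reply text -> audible speech

STOP_WORDS = {"exit", "quit"}


def run_conversation(record: Record, transcribe: Transcribe,
                     think: Think, speak: Speak,
                     max_turns: int = 100) -> List[Tuple[str, str]]:
    """Run record -> transcribe -> think -> speak until the user says exit or quit."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        user_text = transcribe(record()).strip()
        # Ignore case and trailing punctuation when checking for a stop word.
        if user_text.lower().strip(".!? ") in STOP_WORDS:
            break
        if not user_text:
            continue  # nothing was heard; listen again
        reply = think(user_text)
        speak(reply)
        history.append((user_text, reply))
    return history


if __name__ == "__main__":
    # Stub stages: scripted "audio", identity transcription, echo reply.
    scripted = iter([b"hello", b"exit"])
    spoken: List[str] = []
    log = run_conversation(
        record=lambda: next(scripted),
        transcribe=lambda audio: audio.decode(),
        think=lambda text: f"You said: {text}",
        speak=spoken.append,
    )
    print(log)
```

Keeping each stage behind a plain callable is one way to make the LLM (or any other step) swappable without touching the loop itself.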
Demo video/GIF coming soon. To add one: drop a demo.gif in the repo root and replace this block with an image link to it. A short screen recording of a live conversation dramatically helps people understand (and star) the project.
- Full voice loop: speech-to-text, reasoning, and text-to-speech wired into one program.
- ~40% lower latency: measured turn time down from 13.65s to ~8s (see benchmarks).
- Streaming speech: the first audio chunk plays in ~0.5s, while the rest is still generating.
- Smart recording: Voice Activity Detection stops listening the instant you stop talking.
- Pluggable LLM: Google Gemini (default), OpenAI, or a fully local Hugging Face model.
- Multilingual voices: English plus experimental DE / FR / IT / JP / KR / NL / PL / PT / ES speakers.
- Production API: an ElevenLabs-style REST endpoint for integrating TTS into other apps.
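To make the "smart recording" idea concrete, here is a minimal sketch of end-of-speech detection: keep recording until speech has started and is then followed by a run of silent frames. The function name `record_until_silence` and the `max_trailing_silence` threshold are assumptions for illustration, not the project's settings. The speech classifier is passed in as a callable so the sketch runs without audio hardware.

```python
from typing import Callable, Iterable, List


def record_until_silence(frames: Iterable[bytes],
                         is_speech: Callable[[bytes], bool],
                         max_trailing_silence: int = 10) -> List[bytes]:
    """Collect frames until speech has started and has then been followed by
    `max_trailing_silence` consecutive non-speech frames."""
    kept: List[bytes] = []
    started = False
    silent_run = 0
    for frame in frames:
        if is_speech(frame):
            started = True
            silent_run = 0          # speech resets the silence counter
        elif started:
            silent_run += 1
        if started:
            kept.append(frame)      # leading silence is dropped
        if started and silent_run >= max_trailing_silence:
            break                   # the speaker has stopped; end the recording
    return kept


# With py-webrtcvad installed, the classifier could be wired up like this
# (frames must be 10, 20, or 30 ms of 16-bit mono PCM):
#   import webrtcvad
#   vad = webrtcvad.Vad(2)                      # aggressiveness 0-3
#   is_speech = lambda f: vad.is_speech(f, 16000)

if __name__ == "__main__":
    fake = [b"s", b"s", b"V", b"V", b"s", b"s", b"s", b"V"]
    print(record_until_silence(fake, lambda f: f == b"V", max_trailing_silence=2))
```

With 30 ms frames, a threshold of 10 frames corresponds to roughly 300 ms of trailing silence; a shorter threshold ends the turn sooner but risks cutting off natural pauses.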
You speak
  ↓
Voice Activity Detection (WebRTC VAD): stops recording when you stop talking
  ↓
Speech-to-Text · OpenAI Whisper: transcribes your words
  ↓
LLM · Gemini / OpenAI / Local: generates a reply
  ↓
Text-to-Speech · VibeVoice (STREAMING): speaks the reply as it generates
  chunk 1 → play (0.5s)
  chunk 2 → play
  chunk 3 → play …
  (chunks are played in parallel while more generate)
  ↓
You hear the answer
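The streaming stage in the diagram is a producer/consumer pattern: chunks are handed to a player thread as soon as they exist, so playback of chunk 1 overlaps with generation of chunk 2. The sketch below shows that pattern with the standard library only. `stream_and_play` and the `play` callback are hypothetical names, and this is not the project's StreamingAudioPlayer.

```python
import queue
import threading
from typing import Callable, Iterable

_DONE = object()  # sentinel marking the end of the stream


def stream_and_play(chunks: Iterable[bytes],
                    play: Callable[[bytes], None]) -> None:
    """Play chunks on a background thread while the caller keeps generating."""
    buffer: queue.Queue = queue.Queue()

    def player() -> None:
        while True:
            item = buffer.get()     # waits until a chunk is available
            if item is _DONE:
                return
            play(item)              # in real use, blocks for the chunk's duration

    worker = threading.Thread(target=player, daemon=True)
    worker.start()
    try:
        for chunk in chunks:        # generation happens here, chunk by chunk
            buffer.put(chunk)       # hand off as soon as each chunk exists
    finally:
        buffer.put(_DONE)           # always let the player thread finish
        worker.join()


if __name__ == "__main__":
    played = []
    stream_and_play(iter([b"chunk1", b"chunk2", b"chunk3"]), played.append)
    print(played)
```

In a real pipeline `chunks` would be the TTS model's output iterator and `play` would write to the audio device; time-to-first-audio then depends only on how long the first chunk takes to generate.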
Measured on a single conversational turn (full write-up in demo/OPTIMIZATION_SUMMARY.md):
| Stage | Before | After | How it was sped up |
|---|---|---|---|
| Recording | 5.73s | ~3.0s | VAD stops recording the moment you stop speaking |
| Speech-to-Text | 2.09s | ~1.5s | Whisper tiny, greedy decoding, deterministic |
| LLM | 1.52s | ~1.3s | Gemini (already fast) |
| TTS + Playback | 4.31s | streams | Streaming playback: first audio in ~0.5s |
| DDPM steps | 3 | 2 | Fewer diffusion steps, minimal quality loss |
| Total turn | 13.65s | ~8s | ~40% faster |
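As a quick sanity check on the table, the "Before" stage timings can be summed and compared with the ~8s total. The sketch below only redoes the arithmetic from the figures above. The residual attributed to TTS/playback is an inference from rounded estimates, not a measured value.

```python
# Stage timings in seconds, copied from the "Before" column of the table.
before = {"recording": 5.73, "stt": 2.09, "llm": 1.52, "tts_playback": 4.31}

# "After" values are the table's rounded estimates. TTS has no single figure
# because playback overlaps with generation.
after_known = {"recording": 3.0, "stt": 1.5, "llm": 1.3}
after_total = 8.0


def percent_reduction(old: float, new: float) -> float:
    """Relative reduction from `old` to `new`, as a percentage."""
    return (old - new) / old * 100.0


total_before = sum(before.values())
reduction = percent_reduction(total_before, after_total)

# Time not accounted for by the three non-streaming stages. Attributing it
# to TTS/playback is an assumption, since all "After" numbers are approximate.
tts_residual = after_total - sum(after_known.values())

print(f"total before:        {total_before:.2f}s")
print(f"reduction:           {reduction:.1f}%")
print(f"implied TTS residual: {tts_residual:.1f}s")
```

The stage values sum to the reported 13.65s, and the reduction works out to a little over 41%, consistent with the "~40% faster" headline.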
Requires Python 3.9+, a CUDA-capable GPU, and the base VibeVoice setup (see VIBEVOICE_UPSTREAM.md for the underlying model).
# 1. Clone
git clone https://github.com/tartendu/vocalis.git
cd vocalis
# 2. Install the base VibeVoice package (editable)
pip install -e .
# 3. Install the conversation-pipeline extras
pip install -r demo/requirements_conversation.txt
pip install webrtcvad  # optional but recommended for fastest recording
Windows: if installation fails on long paths, apply enable_long_paths.reg.
Copy the example env file and add your own key (the file is gitignored and never committed):
cp demo/.env.example demo/.env
# then edit demo/.env and set GEMINI_API_KEY
Or set it in your shell (PowerShell):
$env:GEMINI_API_KEY = "your-api-key-here"  # free key: https://aistudio.google.com/apikey
Start a conversation:
python demo/realtime_conversation_optimized.py `
--model_path microsoft/VibeVoice-Realtime-0.5B `
--speaker_name Carter `
--device cuda `
--whisper_model tiny
Or use the launcher script:
.\demo\start_conversation.ps1
Say exit or quit to end the conversation; press Ctrl+C to force quit.
| Script | Purpose |
|---|---|
| demo/realtime_conversation.py | Baseline conversation pipeline |
| demo/realtime_conversation_optimized.py | Optimized: VAD + streaming playback + tuned settings |
| demo/realtime_conversation_with_timing.py | Same loop with per-stage timing printout (for profiling) |
| demo/api_server.py | REST API server (ElevenLabs-style) |
| demo/api_client_example.py | Example client for the REST API |
| demo/test_audio_simple.py | Verify your speakers work |
| demo/test_gemini_setup.py | Verify your Gemini key works |
# start the server
python demo/api_server.py --model_path microsoft/VibeVoice-Realtime-0.5B --device cuda --port 8000
# POST /v1/text-to-speech: returns a WAV
curl -X POST http://localhost:8000/v1/text-to-speech \
-H "Content-Type: application/json" \
-d '{"text": "Hello from Zencia Vocalis", "voice": "en-Carter_man", "steps": 10}' \
--output speech.wav
# GET /v1/voices: list available voices
curl http://localhost:8000/v1/voices

| Option | Choices | Notes |
|---|---|---|
| Whisper model | tiny · base · small · medium · large | tiny is fastest (recommended for real-time) |
| DDPM steps | 1 · 2 · 3 | 2 = good balance; 1 = fastest |
| LLM provider | gemini · openai · local | gemini is fast and the default |
| Voice | Carter, Emma, Davis, Grace, … | any preset in demo/voices/streaming_model/ |
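
The REST endpoint shown in the curl example above can also be called from Python. The sketch below mirrors that curl request using only the standard library. The helper names `build_tts_request` and `synthesize_to_file` are hypothetical, and the JSON fields are taken from the curl example rather than from the server's source, so treat the schema as an assumption.

```python
import json
import urllib.request


def build_tts_request(text: str,
                      voice: str = "en-Carter_man",
                      steps: int = 10,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the same POST request as the curl example (no network access here)."""
    payload = json.dumps({"text": text, "voice": voice, "steps": steps}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/text-to-speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str = "speech.wav", **kwargs) -> str:
    """Send the request to a running server and save the returned WAV bytes."""
    request = build_tts_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as wav_file:
        wav_file.write(audio)
    return out_path
```

With the server running, calling `synthesize_to_file("Hello from Zencia Vocalis")` should write speech.wav to the current directory, assuming the endpoint behaves as in the curl example.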
This repository builds on Microsoft's VibeVoice. To be fully transparent about authorship:
| Component | Role | License |
|---|---|---|
| VibeVoice-Realtime-0.5B | Streaming text-to-speech (the voice) | MIT |
| OpenAI Whisper | Speech-to-text | MIT |
| Google Gemini | Default LLM | API service |
| OpenAI / Transformers | Alternative LLM backends | Apache 2.0 |
| py-webrtcvad | Voice activity detection | BSD |
| FastAPI · PyTorch · sounddevice / soundfile | Server + runtime + audio I/O | MIT / BSD |

This project adds:
- The real-time conversation pipeline stitching STT + LLM + TTS into one spoken loop.
- The latency optimizations: Voice Activity Detection, streaming/parallel playback (StreamingAudioPlayer extending VibeVoice's AudioStreamer), and tuned Whisper and DDPM settings, which together give the ~40% turn-time reduction.
- The ElevenLabs-style REST API server and example client.
- Web demo enhancements, setup/test tooling, and the optimization write-up.

Planned next:
- Interrupt support: cut off the AI mid-response by speaking.
- Full-duplex: talk while the AI is talking.
- INT8 quantization for lower latency.
- Distilled TTS model for an even smaller footprint.
Issues and pull requests are welcome! If you build something on top of this, I'd love to hear about it.
This project's additions are released under the MIT License. It builds on Microsoft VibeVoice (MIT); see VIBEVOICE_UPSTREAM.md for the original model documentation and attribution. All third-party components retain their own licenses (listed above).
Responsible use: synthetic speech can be misused for impersonation or disinformation. Please disclose AI-generated audio and use this project lawfully and ethically.
Built by Tartendu Kumar · on the shoulders of VibeVoice
Connect: GitHub · X / Twitter · LinkedIn
If this is useful to you, consider starring the repo!