🎙️ Zencia Vocalis

On-device conversational voice AI — almost real-time, zero cloud TTS

Talk to an AI out loud and hear it talk back, running locally on your own GPU. Measured turn latency drops from 13.65s to ~8s (~40% faster), and audio starts playing in ~0.5s.

✨ What is this?

Zencia Vocalis is a full two-way voice assistant pipeline built on top of Microsoft's VibeVoice real-time text-to-speech model.

VibeVoice on its own only does text → speech. Zencia Vocalis turns it into a complete spoken conversation loop and then optimizes that loop for low latency:

🎤 You speak → 📝 it transcribes you → 🧠 an LLM thinks → 🔊 it speaks the answer back

No cloud TTS, and no audio leaves your machine for synthesis: the voice runs locally on your GPU. By default only the LLM step calls an external API, and you can swap in a fully local model to avoid that call too.
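As a rough illustration of the loop described above, one turn can be pictured as three pluggable stages called in order. This is a minimal sketch, not the code in `demo/`: the function name `run_turn` and the stage signatures are hypothetical, and the lambdas in the usage part are stand-ins for Whisper, the LLM, and VibeVoice.

```python
from typing import Callable

# Hypothetical stage signatures; the real scripts in demo/ may be organised differently.
Transcriber = Callable[[bytes], str]   # recorded audio -> user text
Responder = Callable[[str], str]       # user text -> reply text
Speaker = Callable[[str], None]        # reply text -> audible speech


def run_turn(audio: bytes, transcribe: Transcriber, respond: Responder, speak: Speaker) -> str:
    """Run one spoken turn and return the reply text."""
    user_text = transcribe(audio)  # speech-to-text stage
    reply = respond(user_text)     # LLM stage (remote API or local model)
    speak(reply)                   # text-to-speech stage
    return reply


if __name__ == "__main__":
    spoken = []  # collects what the stand-in "speaker" was asked to say
    reply = run_turn(
        b"\x00\x00",
        transcribe=lambda audio: "hello",
        respond=lambda text: f"You said: {text}",
        speak=spoken.append,
    )
    print(reply, spoken)
```

Passing the stages in as callables is one way to keep the LLM provider swappable without touching the rest of the loop.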

🎬 Demo

📹 Demo video/GIF coming soon. To add one: drop a demo.gif in the repo root and replace this block with ![demo](demo.gif). A short screen recording of a live conversation dramatically helps people understand (and star) the project.

🚀 Highlights

🗣️ Full voice loop — speech-to-text, reasoning, and text-to-speech wired into one program.
⚡ ~40% lower latency — measured turn time down from 13.65s → ~8s (see benchmarks).
🔊 Streaming speech — first audio chunk plays in ~0.5s, while the rest is still generating.
🎚️ Smart recording — Voice Activity Detection stops listening the instant you stop talking.
🔌 Pluggable LLM — Google Gemini (default), OpenAI, or a fully local Hugging Face model.
🌍 Multilingual voices — English plus experimental DE / FR / IT / JP / KR / NL / PL / PT / ES speakers.
🛠️ Production API — an ElevenLabs-style REST endpoint for integrating TTS into other apps.
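The smart-recording highlight comes down to one rule: once speech has been heard, stop after a short run of silent frames. The sketch below shows that rule under stated assumptions. The names `record_until_silence` and `energy_is_speech` are hypothetical, the frame size and silence limit are illustrative, and the energy threshold is only a standard-library stand-in for WebRTC VAD so the example runs without extra packages.

```python
import math
import struct
from typing import Callable, Iterable, List

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per 30 ms frame


def energy_is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Stand-in detector: RMS of 16-bit little-endian mono PCM above a threshold."""
    count = len(frame) // 2
    if count == 0:
        return False
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    rms = math.sqrt(sum(s * s for s in samples) / count)
    return rms > threshold


def record_until_silence(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool] = energy_is_speech,
    max_silence_frames: int = 20,
) -> List[bytes]:
    """Keep frames until `max_silence_frames` silent frames in a row follow some speech."""
    kept: List[bytes] = []
    heard_speech = False
    silent_run = 0
    for frame in frames:
        kept.append(frame)
        if is_speech(frame):
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= max_silence_frames:
                break  # the speaker has stopped; end the recording early
    return kept
```

With `webrtcvad` installed, the detector can be swapped for the real one, for example `vad = webrtcvad.Vad(2)` and `is_speech=lambda f: vad.is_speech(f, SAMPLE_RATE)`; 30 ms frames at 16 kHz are a frame size that library accepts.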

🧩 How it works

🎤 You speak
   │
   ▼
┌─────────────────────────────────────────────┐
│  Voice Activity Detection (WebRTC VAD)       │  stops recording when you stop talking
└─────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────┐
│  Speech-to-Text  ·  OpenAI Whisper           │  transcribes your words
└─────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────┐
│  LLM  ·  Gemini / OpenAI / Local             │  generates a reply
└─────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────┐
│  Text-to-Speech  ·  VibeVoice (STREAMING)    │  speaks the reply as it generates
│   chunk 1 → play (0.5s) ┐                     │
│   chunk 2 → play        ├─ played in parallel │
│   chunk 3 → play …      ┘  while more generate│
└─────────────────────────────────────────────┘
   │
   ▼
🔊 You hear the answer
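The streaming box in the diagram is a producer/consumer arrangement: chunks are generated on one thread while a second thread plays whatever is already available. The following is a minimal sketch of that pattern using only the standard library. It is not the project's `StreamingAudioPlayer`; the name `stream_and_play` and the `play` callback are assumptions made for illustration.

```python
import queue
import threading
from typing import Callable, Iterable

_DONE = object()  # sentinel marking the end of the stream


def stream_and_play(
    chunks: Iterable[bytes],
    play: Callable[[bytes], None],
    max_buffered: int = 8,
) -> None:
    """Play chunks on a background thread while the caller keeps generating them."""
    buffer: queue.Queue = queue.Queue(maxsize=max_buffered)

    def player() -> None:
        while True:
            item = buffer.get()
            if item is _DONE:
                return
            play(item)  # in a real player this would write to the audio device

    worker = threading.Thread(target=player, daemon=True)
    worker.start()
    try:
        for chunk in chunks:   # generation runs here, on the calling thread
            buffer.put(chunk)  # blocks only when playback is far behind
    finally:
        buffer.put(_DONE)
        worker.join()          # wait until everything queued has been played


if __name__ == "__main__":
    played = []
    stream_and_play((bytes([i]) for i in range(3)), played.append)
    print(len(played), "chunks played")
```

Because the first chunk is handed to the player as soon as it exists, time to first audio depends on the first chunk rather than on the whole reply.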

📊 Benchmarks

Measured on a single conversational turn (full write-up in demo/OPTIMIZATION_SUMMARY.md):

| Stage | Before | After | How it was sped up |
|---|---|---|---|
| Recording | 5.73s | ~3.0s | VAD stops recording the moment you stop speaking |
| Speech-to-Text | 2.09s | ~1.5s | Whisper `tiny`, greedy decoding, deterministic |
| LLM | 1.52s | ~1.3s | Gemini (already fast) |
| TTS + Playback | 4.31s | streams | Streaming playback: first audio in ~0.5s |
| DDPM steps | 3 | 2 | Fewer diffusion steps, minimal quality loss |
| Total turn | 13.65s | ~8s | ~40% faster |
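The totals in the table can be checked with a few lines of arithmetic. The sketch below uses only the figures from the table. The "implied" TTS value is derived by subtraction from the approximate ~8s total, not measured, so treat it as a rough estimate.

```python
# Per-stage timings copied from the benchmark table (seconds).
before = {"recording": 5.73, "stt": 2.09, "llm": 1.52, "tts_playback": 4.31}
after_known = {"recording": 3.0, "stt": 1.5, "llm": 1.3}  # approximate "after" values
total_after = 8.0  # approximate total turn time reported after optimization

total_before = round(sum(before.values()), 2)
reduction_pct = round(100 * (total_before - total_after) / total_before, 1)

# Time left for TTS and playback once the other "after" stages are subtracted.
implied_tts_after = round(total_after - sum(after_known.values()), 2)

print(f"before: {total_before:.2f}s")
print(f"reduction: {reduction_pct}%")
print(f"implied TTS + playback after: {implied_tts_after}s")
```

The four "before" stages add up to the 13.65s total, and going from 13.65s to about 8s is a reduction of roughly 41%, consistent with the ~40% headline figure.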

📦 Installation

Requires Python 3.9+, a CUDA-capable GPU, and the base VibeVoice setup (see VIBEVOICE_UPSTREAM.md for the underlying model).

# 1. Clone
git clone https://github.com/tartendu/vocalis.git
cd vocalis

# 2. Install the base VibeVoice package (editable)
pip install -e .

# 3. Install the conversation-pipeline extras
pip install -r demo/requirements_conversation.txt
pip install webrtcvad   # optional but recommended for fastest recording

🪟 Windows: if installation fails on long paths, apply enable_long_paths.reg.

🔑 Configuration

Copy the example env file and add your own key (the file is gitignored and never committed):

cp demo/.env.example demo/.env
# then edit demo/.env and set GEMINI_API_KEY

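Or set it in your shell (PowerShell first, then bash/zsh):

$env:GEMINI_API_KEY = "your-api-key-here"   # PowerShell; free key: https://aistudio.google.com/apikey
export GEMINI_API_KEY="your-api-key-here"   # bash / zsh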
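For readers wiring the key into their own script, the lookup can be sketched as below. This is an assumption-laden illustration rather than how the demo scripts load configuration: `parse_env_text` and `load_gemini_key` are hypothetical helpers, and the tiny parser only handles plain `KEY=VALUE` lines.

```python
import os
from typing import Dict, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def load_gemini_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GEMINI_API_KEY from the given mapping (default: the process environment)."""
    source = os.environ if env is None else env
    key = source.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; add it to demo/.env or set it in your shell"
        )
    return key
```

Failing early with a clear message is usually friendlier than letting the first LLM request fail with an authentication error.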

▶️ Usage

Optimized conversation (recommended)

# PowerShell syntax: a trailing backtick continues the line (use \ instead in bash/zsh)
python demo/realtime_conversation_optimized.py `
  --model_path microsoft/VibeVoice-Realtime-0.5B `
  --speaker_name Carter `
  --device cuda `
  --whisper_model tiny

One-click start (Windows)

.\demo\start_conversation.ps1

Say "exit" or "quit" to end the conversation, or press Ctrl+C to force quit.

Other entry points

| Script | Purpose |
|---|---|
| demo/realtime_conversation.py | Baseline conversation pipeline |
| demo/realtime_conversation_optimized.py | Optimized: VAD + streaming playback + tuned settings |
| demo/realtime_conversation_with_timing.py | Same loop with per-stage timing printout (for profiling) |
| demo/api_server.py | REST API server (ElevenLabs-style) |
| demo/api_client_example.py | Example client for the REST API |
| demo/test_audio_simple.py | Verify your speakers work |
| demo/test_gemini_setup.py | Verify your Gemini key works |

REST API

# start the server
python demo/api_server.py --model_path microsoft/VibeVoice-Realtime-0.5B --device cuda --port 8000

# POST /v1/text-to-speech  →  returns a WAV
curl -X POST http://localhost:8000/v1/text-to-speech \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Zencia Vocalis", "voice": "en-Carter_man", "steps": 10}' \
  --output speech.wav

# GET /v1/voices  →  list available voices
curl http://localhost:8000/v1/voices
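The same text-to-speech call can be made from Python with the standard library. This is a minimal sketch that mirrors the curl example above. It is not the contents of `demo/api_client_example.py`, and the helper names `build_tts_request` and `synthesize_to_file` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the --port 8000 used above


def build_tts_request(text, voice="en-Carter_man", steps=10, base_url=BASE_URL):
    """Build the POST request for /v1/text-to-speech without sending it."""
    payload = json.dumps({"text": text, "voice": voice, "steps": steps}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/text-to-speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path="speech.wav", **kwargs):
    """Send the request and save the returned WAV bytes; returns the byte count."""
    request = build_tts_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)


if __name__ == "__main__":
    # Requires the API server from the previous step to be running.
    print(synthesize_to_file("Hello from Zencia Vocalis"), "bytes written")
```

Separating request construction from sending keeps the payload easy to inspect or unit test without a running server.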

⚙️ Configuration cheatsheet

| Option | Choices | Notes |
|---|---|---|
| Whisper model | `tiny` · `base` · `small` · `medium` · `large` | `tiny` is fastest (recommended for real-time) |
| DDPM steps | `1` · `2` · `3` | `2` = good balance; `1` = fastest |
| LLM provider | `gemini` · `openai` · `local` | `gemini` is fast and the default |
| Voice | `Carter`, `Emma`, `Davis`, `Grace`, … | any preset in `demo/voices/streaming_model/` |
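If you script around these options, the allowed values in the cheatsheet can be captured in a small validated settings object. The sketch below is illustrative only: `ConversationSettings` and its field names are hypothetical and do not correspond to a class in this repository. The voice is left unchecked because the full preset list is not given here.

```python
from dataclasses import dataclass

# Allowed values taken from the cheatsheet above.
WHISPER_MODELS = ("tiny", "base", "small", "medium", "large")
DDPM_STEPS = (1, 2, 3)
LLM_PROVIDERS = ("gemini", "openai", "local")


@dataclass(frozen=True)
class ConversationSettings:
    """Hypothetical settings holder with the recommended values as defaults."""

    whisper_model: str = "tiny"
    ddpm_steps: int = 2
    llm_provider: str = "gemini"
    speaker_name: str = "Carter"

    def __post_init__(self) -> None:
        if self.whisper_model not in WHISPER_MODELS:
            raise ValueError(f"whisper_model must be one of {WHISPER_MODELS}")
        if self.ddpm_steps not in DDPM_STEPS:
            raise ValueError(f"ddpm_steps must be one of {DDPM_STEPS}")
        if self.llm_provider not in LLM_PROVIDERS:
            raise ValueError(f"llm_provider must be one of {LLM_PROVIDERS}")


if __name__ == "__main__":
    print(ConversationSettings())                 # recommended defaults
    print(ConversationSettings(ddpm_steps=1))     # fastest synthesis setting
```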

🧠 What's mine vs. what's open source

This repository builds on Microsoft's VibeVoice. To be fully transparent about authorship:

Open-source components used (not written by me)

| Component | Role | License |
|---|---|---|
| VibeVoice-Realtime-0.5B | Streaming text-to-speech (the voice) | MIT |
| OpenAI Whisper | Speech-to-text | MIT |
| Google Gemini | Default LLM | API service |
| OpenAI / Transformers | Alternative LLM backends | Apache 2.0 |
| py-webrtcvad | Voice activity detection | BSD |
| FastAPI · PyTorch · sounddevice / soundfile | Server + runtime + audio I/O | MIT / BSD |

What I (Tartendu Kumar) built on top

The real-time conversation pipeline stitching STT + LLM + TTS into one spoken loop.
The latency optimizations: Voice Activity Detection, streaming/parallel playback (StreamingAudioPlayer extending VibeVoice's AudioStreamer), and tuned Whisper and DDPM settings, which together account for the ~40% turn-time reduction.
The ElevenLabs-style REST API server and example client.
Web demo enhancements, setup/test tooling, and the optimization write-up.

🗺️ Roadmap

Interrupt support — cut off the AI mid-response by speaking.
Full-duplex — talk while the AI is talking.
INT8 quantization for lower latency.
Distilled TTS model for an even smaller footprint.

🤝 Contributing

Issues and pull requests are welcome! If you build something on top of this, I'd love to hear about it.

📄 License

This project's additions are released under the MIT License. It builds on Microsoft VibeVoice (MIT) — see VIBEVOICE_UPSTREAM.md for the original model documentation and attribution. All third-party components retain their own licenses (listed above).

⚠️ Responsible use: synthetic speech can be misused for impersonation or disinformation. Please disclose AI-generated audio and use this project lawfully and ethically.

Built by Tartendu Kumar · on the shoulders of VibeVoice

Connect: GitHub · X / Twitter · LinkedIn

⭐ If this is useful to you, consider starring the repo!
