🎙️ Zencia Vocalis

On-device conversational voice AI: near real-time, zero cloud TTS

Talk to an AI out loud and hear it talk back, with the voice running locally on your own GPU. Total turn latency is cut from 13.65s to ~8s (~40% faster), and audio starts playing in ~0.5s.

License: MIT · Python · Built on VibeVoice · PyTorch



✨ What is this?

Zencia Vocalis is a full two-way voice assistant pipeline built on top of Microsoft's VibeVoice real-time text-to-speech model.

VibeVoice on its own only does text → speech. Zencia Vocalis turns it into a complete spoken conversation loop and then optimizes that loop for low latency:

🎤 You speak → 📝 it transcribes you → 🧠 an LLM thinks → 🔊 it speaks the answer back

No cloud TTS and no audio leaving your machine for synthesis: the voice runs locally on your GPU. By default, only the LLM step calls an external API, and you can swap in a fully local model instead.
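
The loop described above can be pictured as three pluggable stages run in sequence. The sketch below is only an illustration under stated assumptions: `Turn`, `run_turn`, and the stub stages are hypothetical names, not the repository's code, and the real scripts in demo/ plug in Whisper, an LLM client, and VibeVoice instead.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the real pipeline uses Whisper, an LLM
# client, and VibeVoice behind interfaces that may look quite different.
Transcriber = Callable[[bytes], str]   # recorded audio -> text
Responder = Callable[[str], str]       # user text -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio


@dataclass
class Turn:
    user_text: str
    reply_text: str
    reply_audio: bytes


def run_turn(audio: bytes, stt: Transcriber, llm: Responder, tts: Synthesizer) -> Turn:
    """Run one spoken turn: transcribe, think, then speak."""
    user_text = stt(audio)
    reply_text = llm(user_text)
    reply_audio = tts(reply_text)
    return Turn(user_text, reply_text, reply_audio)


if __name__ == "__main__":
    # Stub stages so the sketch runs without a GPU, microphone, or API key.
    turn = run_turn(
        b"hello",
        stt=lambda audio: audio.decode("utf-8"),
        llm=lambda text: f"You said: {text}",
        tts=lambda text: text.encode("utf-8"),
    )
    print(turn.reply_text)
```

Keeping each stage behind a plain callable is one way to make the LLM swappable (Gemini, OpenAI, or local) without touching the rest of the loop.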


🎬 Demo

📹 Demo video/GIF coming soon. To add one, drop a demo.gif in the repo root and replace this block with ![demo](demo.gif). A short screen recording of a live conversation helps people understand (and star) the project.


🚀 Highlights

  • 🗣️ Full voice loop: speech-to-text, reasoning, and text-to-speech wired into one program.
  • ⚡ ~40% lower latency: measured turn time down from 13.65s → ~8s (see Benchmarks).
  • 🔊 Streaming speech: first audio chunk plays in ~0.5s, while the rest is still generating.
  • 🎚️ Smart recording: Voice Activity Detection stops listening the instant you stop talking.
  • 🔌 Pluggable LLM: Google Gemini (default), OpenAI, or a fully local Hugging Face model.
  • 🌍 Multilingual voices: English plus experimental DE / FR / IT / JP / KR / NL / PL / PT / ES speakers.
  • 🛠️ Production API: an ElevenLabs-style REST endpoint for integrating TTS into other apps.
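
To make the "smart recording" idea concrete: the recorder classifies short audio frames as speech or silence and stops once enough silent frames follow some speech. The sketch below is an assumption-laden stand-in, not the project's code. It uses a plain energy threshold instead of WebRTC VAD so that it runs with the standard library only, and `record_until_silence`, the threshold, and the frame counts are all hypothetical.

```python
from typing import Iterable, List, Sequence


def frame_energy(frame: Sequence[int]) -> float:
    """Mean squared amplitude of one frame of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def record_until_silence(
    frames: Iterable[Sequence[int]],
    energy_threshold: float = 1_000_000.0,  # assumed value; tune per microphone
    max_trailing_silence: int = 10,         # e.g. 10 frames of 30 ms = 300 ms
) -> List[Sequence[int]]:
    """Collect frames until speech has started and then gone quiet."""
    kept: List[Sequence[int]] = []
    heard_speech = False
    silent_run = 0
    for frame in frames:
        is_speech = frame_energy(frame) >= energy_threshold
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
        if heard_speech:
            kept.append(frame)  # leading silence is dropped
        if heard_speech and silent_run >= max_trailing_silence:
            break  # the speaker has stopped; end the recording early
    return kept
```

With py-webrtcvad installed, the energy test could be replaced by `webrtcvad.Vad(mode).is_speech(frame_bytes, sample_rate)`, which expects 10, 20, or 30 ms frames of 16-bit mono PCM at 8, 16, 32, or 48 kHz.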

🧩 How it works

🎤 You speak
   │
   ▼
┌────────────────────────────────────────────────┐
│  Voice Activity Detection (WebRTC VAD)         │  stops recording when you stop talking
└────────────────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────────┐
│  Speech-to-Text  ·  OpenAI Whisper             │  transcribes your words
└────────────────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────────┐
│  LLM  ·  Gemini / OpenAI / Local               │  generates a reply
└────────────────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────────┐
│  Text-to-Speech  ·  VibeVoice (STREAMING)      │  speaks the reply as it generates
│   chunk 1 → play (0.5s) ┐                      │
│   chunk 2 → play        ├─ played in parallel  │
│   chunk 3 → play …      ┘  while more generate │
└────────────────────────────────────────────────┘
   │
   ▼
🔊 You hear the answer
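
The streaming step at the bottom of the diagram is a producer/consumer pattern: one thread generates audio chunks while the main thread plays whatever has arrived. Below is a minimal sketch of that pattern, assuming a hypothetical `stream_and_play` helper with stand-in generator and player callables. It is not the repository's StreamingAudioPlayer, which builds on VibeVoice's AudioStreamer.

```python
import queue
import threading
from typing import Callable, Iterable

_END = object()  # sentinel marking the end of the stream


def stream_and_play(
    generate_chunks: Callable[[], Iterable[bytes]],
    play_chunk: Callable[[bytes], None],
    max_buffered: int = 8,
) -> int:
    """Play chunks as they are produced; return how many were played."""
    buffer: "queue.Queue[object]" = queue.Queue(maxsize=max_buffered)

    def producer() -> None:
        try:
            for chunk in generate_chunks():
                buffer.put(chunk)  # blocks if playback falls far behind
        finally:
            buffer.put(_END)       # always unblock the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    played = 0
    while True:
        item = buffer.get()
        if item is _END:
            break
        play_chunk(item)  # chunk 1 plays while later chunks still generate
        played += 1
    worker.join()
    return played


if __name__ == "__main__":
    import time

    def fake_tts():
        for i in range(3):
            time.sleep(0.1)  # pretend each chunk takes time to generate
            yield f"chunk {i + 1}".encode()

    stream_and_play(fake_tts, lambda c: print("playing", c.decode()))
```

Because playback starts as soon as the first chunk lands in the queue, perceived latency depends on the time to the first chunk rather than on the full synthesis time.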

📊 Benchmarks

Measured on a single conversational turn (full write-up in demo/OPTIMIZATION_SUMMARY.md):

| Stage | Before | After | How it was sped up |
| --- | --- | --- | --- |
| Recording | 5.73s | ~3.0s | VAD stops recording the moment you stop speaking |
| Speech-to-Text | 2.09s | ~1.5s | Whisper tiny, greedy decoding, deterministic |
| LLM | 1.52s | ~1.3s | Gemini (already fast) |
| TTS + Playback | 4.31s | streams | Streaming playback: first audio in ~0.5s |
| DDPM steps | 3 | 2 | Fewer diffusion steps, minimal quality loss |
| Total turn | 13.65s | ~8s | ~40% faster |
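
The headline figure can be checked from the table with a few lines of arithmetic. The snippet below uses only the numbers listed above. The "after" values are approximate, so the derived TTS figure is an implied estimate rather than a measurement.

```python
# Per-stage timings from the "Before" column, in seconds.
before = {"recording": 5.73, "stt": 2.09, "llm": 1.52, "tts_playback": 4.31}
total_before = round(sum(before.values()), 2)  # matches the 13.65s total

total_after = 8.0  # approximate total from the "After" column
reduction = (total_before - total_after) / total_before
print(f"{total_before:.2f}s -> ~{total_after:.0f}s is a {reduction:.0%} reduction")
# prints: 13.65s -> ~8s is a 41% reduction

# The "After" column gives no number for TTS + Playback; what is left of
# the ~8s after the other stages is roughly the time spent there.
after_known = 3.0 + 1.5 + 1.3
tts_after = total_after - after_known
print(f"implied TTS + playback after: ~{tts_after:.1f}s")
```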

📦 Installation

Requires Python 3.9+, a CUDA-capable GPU, and the base VibeVoice setup (see VIBEVOICE_UPSTREAM.md for the underlying model).

# 1. Clone
git clone https://github.com/tartendu/vocalis.git
cd vocalis

# 2. Install the base VibeVoice package (editable)
pip install -e .

# 3. Install the conversation-pipeline extras
pip install -r demo/requirements_conversation.txt
pip install webrtcvad   # optional but recommended for fastest recording

🪟 Windows: if installation fails on long paths, apply enable_long_paths.reg.


🔑 Configuration

Copy the example env file and add your own key (the file is gitignored and never committed):

cp demo/.env.example demo/.env
# then edit demo/.env and set GEMINI_API_KEY

Or set it for the current PowerShell session:

$env:GEMINI_API_KEY = "your-api-key-here"   # free key: https://aistudio.google.com/apikey
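
For readers wiring the key into their own scripts, here is one way the lookup could work: check the process environment first, then fall back to the env file. This is a sketch under assumptions. `parse_env_file` and `get_api_key` are hypothetical helpers, and the repository's scripts may load demo/.env differently (for example with a dotenv library).

```python
import os
from pathlib import Path
from typing import Dict, Optional


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(name: str = "GEMINI_API_KEY",
                env_file: Path = Path("demo/.env")) -> Optional[str]:
    """Prefer the process environment, then fall back to the env file."""
    return os.environ.get(name) or parse_env_file(env_file).get(name)


if __name__ == "__main__":
    print("key found" if get_api_key() else "GEMINI_API_KEY is not set")
```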

▶️ Usage

Optimized conversation (recommended)

python demo/realtime_conversation_optimized.py `
  --model_path microsoft/VibeVoice-Realtime-0.5B `
  --speaker_name Carter `
  --device cuda `
  --whisper_model tiny

One-click start (Windows)

.\demo\start_conversation.ps1

Say exit or quit to end the conversation, or press Ctrl+C to force quit.

Other entry points

| Script | Purpose |
| --- | --- |
| demo/realtime_conversation.py | Baseline conversation pipeline |
| demo/realtime_conversation_optimized.py | Optimized: VAD + streaming playback + tuned settings |
| demo/realtime_conversation_with_timing.py | Same loop with per-stage timing printout (for profiling) |
| demo/api_server.py | REST API server (ElevenLabs-style) |
| demo/api_client_example.py | Example client for the REST API |
| demo/test_audio_simple.py | Verify your speakers work |
| demo/test_gemini_setup.py | Verify your Gemini key works |
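
The per-stage timing printout mentioned in the table can be approximated with a small context manager. The sketch below shows the general technique and is not the contents of demo/realtime_conversation_with_timing.py. `StageTimer` and the stage names are hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulate wall-clock seconds per named pipeline stage."""

    def __init__(self) -> None:
        self.seconds: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def report(self) -> str:
        total = sum(self.seconds.values())
        lines = [f"{name:<12}{secs:6.2f}s" for name, secs in self.seconds.items()]
        lines.append(f"{'total':<12}{total:6.2f}s")
        return "\n".join(lines)


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("stt"):
        time.sleep(0.05)  # stand-in for transcription
    with timer.stage("llm"):
        time.sleep(0.02)  # stand-in for the LLM call
    print(timer.report())
```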

REST API

# start the server
python demo/api_server.py --model_path microsoft/VibeVoice-Realtime-0.5B --device cuda --port 8000
# POST /v1/text-to-speech  →  returns a WAV
curl -X POST http://localhost:8000/v1/text-to-speech \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Zencia Vocalis", "voice": "en-Carter_man", "steps": 10}' \
  --output speech.wav

# GET /v1/voices  →  list available voices
curl http://localhost:8000/v1/voices
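
The same POST can be made from Python with only the standard library. The sketch below mirrors the endpoint and JSON fields from the curl example. The helper names `build_tts_request` and `synthesize_to_file` are hypothetical, and demo/api_client_example.py may be structured differently.

```python
import json
import urllib.request


def build_tts_request(text: str, voice: str = "en-Carter_man", steps: int = 10,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request; the JSON fields mirror the curl example."""
    body = json.dumps({"text": text, "voice": voice, "steps": steps}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/text-to-speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str = "speech.wav",
                       timeout: float = 60.0) -> int:
    """Send the request and save the returned WAV; returns bytes written."""
    with urllib.request.urlopen(build_tts_request(text), timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return len(audio)


if __name__ == "__main__":
    # Requires the API server from the command above to be running.
    print(synthesize_to_file("Hello from Zencia Vocalis"), "bytes written")
```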

⚙️ Configuration cheatsheet

| Option | Choices | Notes |
| --- | --- | --- |
| Whisper model | tiny · base · small · medium · large | tiny is fastest (recommended for real-time) |
| DDPM steps | 1 · 2 · 3 | 2 = good balance; 1 = fastest |
| LLM provider | gemini · openai · local | gemini is fast and the default |
| Voice | Carter, Emma, Davis, Grace, … | any preset in demo/voices/streaming_model/ |
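
If you script against these options, it can help to validate them in one place. The dataclass below is a hypothetical settings bundle whose allowed values are copied from the table. The real scripts take these as command-line flags and may accept other values (the REST example above, for instance, passes steps 10).

```python
from dataclasses import dataclass

# Allowed values as listed in the cheatsheet; treat them as assumptions.
WHISPER_MODELS = ("tiny", "base", "small", "medium", "large")
DDPM_STEPS = (1, 2, 3)
LLM_PROVIDERS = ("gemini", "openai", "local")


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings bundle; defaults follow the cheatsheet notes."""
    whisper_model: str = "tiny"   # fastest option
    ddpm_steps: int = 2           # the balance point from the table
    llm_provider: str = "gemini"  # the default provider
    speaker_name: str = "Carter"  # any preset voice name

    def __post_init__(self) -> None:
        if self.whisper_model not in WHISPER_MODELS:
            raise ValueError(f"whisper_model must be one of {WHISPER_MODELS}")
        if self.ddpm_steps not in DDPM_STEPS:
            raise ValueError(f"ddpm_steps must be one of {DDPM_STEPS}")
        if self.llm_provider not in LLM_PROVIDERS:
            raise ValueError(f"llm_provider must be one of {LLM_PROVIDERS}")


if __name__ == "__main__":
    print(PipelineConfig(ddpm_steps=1, llm_provider="local"))
```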

🧠 What's mine vs. what's open source

This repository builds on Microsoft's VibeVoice. To be fully transparent about authorship:

Open-source components used (not written by me)

| Component | Role | License |
| --- | --- | --- |
| VibeVoice-Realtime-0.5B | Streaming text-to-speech (the voice) | MIT |
| OpenAI Whisper | Speech-to-text | MIT |
| Google Gemini | Default LLM | API service |
| OpenAI / Transformers | Alternative LLM backends | Apache 2.0 |
| py-webrtcvad | Voice activity detection | BSD |
| FastAPI · PyTorch · sounddevice / soundfile | Server + runtime + audio I/O | MIT / BSD |

What I (Tartendu Kumar) built on top

  • The real-time conversation pipeline stitching STT + LLM + TTS into one spoken loop.
  • The latency optimizations: Voice Activity Detection, streaming/parallel playback (StreamingAudioPlayer extending VibeVoice's AudioStreamer), and tuned Whisper and DDPM settings, which together give the ~40% turn-time reduction.
  • The ElevenLabs-style REST API server and example client.
  • Web demo enhancements, setup/test tooling, and the optimization write-up.

🗺️ Roadmap

  • Interrupt support: cut off the AI mid-response by speaking.
  • Full-duplex: talk while the AI is talking.
  • INT8 quantization for lower latency.
  • Distilled TTS model for an even smaller footprint.

🀝 Contributing

Issues and pull requests are welcome! If you build something on top of this, I'd love to hear about it.


📄 License

This project's additions are released under the MIT License. It builds on Microsoft VibeVoice (MIT); see VIBEVOICE_UPSTREAM.md for the original model documentation and attribution. All third-party components retain their own licenses (listed above).

⚠️ Responsible use: synthetic speech can be misused for impersonation or disinformation. Please disclose AI-generated audio and use this project lawfully and ethically.


Built by Tartendu Kumar · on the shoulders of VibeVoice

Connect: GitHub · X / Twitter · LinkedIn

⭐ If this is useful to you, consider starring the repo!
