Real-time voice simulation for bank customer support training with an ASR → LLM → TTS pipeline
- Overview
- Features
- Architecture
- System Requirements
- Complete Installation Guide
- Configuration
- Usage
- Project Structure
- Troubleshooting
- Performance
- License
An AI-powered voice assistant designed for bank customer support training. The system provides realistic voice interactions using state-of-the-art technologies:
- ASR (Speech Recognition): Groq Whisper (whisper-large-v3-turbo)
- LLM (Language Model): Groq LLM for intelligent responses
- TTS (Text-to-Speech): Kokoro-82M for natural voice (local, offline)
- Bank support agent training and assessment
- Customer service simulation scenarios
- Voice interface prototyping and testing
- Multi-modal AI demonstrations
- Manual Recording Controls - Explicit START/STOP buttons for precise control
- High-Accuracy ASR - Groq Whisper for speech transcription
- Intelligent Responses - Scenario-specific AI behavior and prompts
- Natural Voice - Kokoro-82M local TTS (11+ voices, runs offline)
- Three Scenarios - Lost Card, Failed Transfer, Locked Account
- Performance Metrics - Real-time latency tracking (ASR, LLM, TTS)
- Conversation History - Multi-turn dialogue with context
- State Management - Context-aware conversation flow
- Local TTS - Zero API costs for voice synthesis, runs offline
- Streaming Pipeline - Real-time audio processing
- Modular Architecture - Easy to extend and maintain
- Error Handling - Comprehensive exception management
- Production Ready - Clean code, proper logging, tested
```
┌──────────────────────────────────────────────────┐
│                 Streamlit Web UI                 │
│       [START RECORDING]   [STOP & PROCESS]       │
└────────────────────────┬─────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────┐
│           SimpleVoiceHandler Pipeline            │
│                                                  │
│   Record → ASR → LLM → TTS → Playback            │
│     │       │     │     │       │                │
│   Audio   Text  Reply Audio  Speaker             │
└─────┬───────┬─────┬─────┬────────────────────────┘
      │       │     │     │
      ▼       ▼     ▼     ▼
     Mic    Groq  Groq  Kokoro
          Whisper  LLM   TTS
                        (Local)
```
Pipeline Flow:
- User speaks → Microphone captures audio
- ASR → Groq Whisper transcribes to text
- LLM → Groq generates intelligent response
- TTS → Kokoro synthesizes speech (offline)
- Playback → User hears AI response
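For orientation, here is a minimal Python sketch of one pipeline turn wired directly against the Groq SDK and Kokoro. It is illustrative only, not the actual SimpleVoiceHandler code; the model and voice names match the defaults documented below.

```python
# A sketch of one pipeline turn, wired directly against the Groq SDK and
# Kokoro. Illustrative only -- not the actual SimpleVoiceHandler code.
import numpy as np
import sounddevice as sd
import soundfile as sf
from groq import Groq
from kokoro import KPipeline

client = Groq()                 # reads GROQ_API_KEY from the environment
tts = KPipeline(lang_code="a")  # "a" = American English

def one_turn(seconds: int = 5) -> None:
    # 1. Record: capture microphone audio (16 kHz mono)
    audio = sd.rec(int(seconds * 16000), samplerate=16000, channels=1)
    sd.wait()
    sf.write("turn.wav", audio, 16000)

    # 2. ASR: transcribe with Groq Whisper
    with open("turn.wav", "rb") as f:
        text = client.audio.transcriptions.create(
            file=f, model="whisper-large-v3-turbo"
        ).text

    # 3. LLM: generate the agent's reply
    reply = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[
            {"role": "system", "content": "You are an empathetic bank support agent."},
            {"role": "user", "content": text},
        ],
    ).choices[0].message.content

    # 4-5. TTS + Playback: synthesize locally with Kokoro (24 kHz output)
    for _, _, chunk in tts(reply, voice="af_sky"):
        sd.play(np.asarray(chunk), samplerate=24000)
        sd.wait()
```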
- CPU: Modern multi-core processor (Intel/AMD/Apple Silicon)
- RAM: Minimum 4GB, recommended 8GB+
- Storage: ~500MB for dependencies and models
- Microphone: Any working microphone (built-in or external)
- Speakers/Headphones: For audio output
- Operating System: macOS, Linux, or Windows
- Python: Version 3.11 or higher
- Internet: Required for ASR/LLM API calls (Groq)
- TTS works offline (Kokoro is local)
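If you want to sanity-check the host, a tiny script like the following (a sketch, not part of the project) verifies the Python version and, once Step 2 below is done, that espeak-ng is reachable:

```python
# Quick host sanity check (a sketch, not part of the project).
# Run after Step 2 so espeak-ng is installed.
import shutil
import sys

ok = sys.version_info >= (3, 11)
print(f"Python {sys.version.split()[0]}: {'OK' if ok else 'need 3.11+'}")
print(f"espeak-ng on PATH: {'OK' if shutil.which('espeak-ng') else 'missing'}")
```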
Follow these steps carefully to set up the system on your local machine.
Check if Python is installed:
```bash
python --version
# or
python3 --version
```
You need Python 3.11 or higher. If not installed:
macOS:
```bash
# Using Homebrew (install Homebrew first if needed: https://brew.sh)
brew install python@3.11
```
Ubuntu/Debian:
```bash
sudo apt update
sudo apt install python3.11 python3.11-venv python3-pip
```
Windows:
- Download from python.org
- Run installer
- IMPORTANT: Check "Add Python to PATH" during installation

Verify installation:
```bash
python3 --version
# Should show: Python 3.11.x or higher
```
The Kokoro TTS engine requires espeak-ng for phoneme generation.
macOS:
```bash
brew install espeak-ng
```
Ubuntu/Debian:
```bash
sudo apt-get update
sudo apt-get install espeak-ng
```
Windows:
- Download the installer from the espeak-ng releases page
- Run the installer (choose default options)
- Add the installation directory (default: `C:\Program Files\eSpeak NG\`) to the System PATH in Environment Variables

Verify installation:
```bash
espeak-ng --version
# Should show: eSpeak NG version info
```
You need a free Groq API key for ASR and LLM.
- Visit Groq Console
- Sign up for a free account (no credit card required)
- Navigate to API Keys section
- Click "Create API Key"
- Copy and save the key (you'll need it in Step 7)
Example key format: gsk_... (starts with gsk_)
Option A: Using Git
```bash
git clone <your-repository-url>
cd Voice_Test_Project
```
Option B: Download ZIP
- Download the project ZIP file
- Extract to your desired location
- Open terminal/command prompt and navigate:
```bash
cd /path/to/Voice_Test_Project
```
A virtual environment keeps project dependencies isolated from your system Python.
Create the environment:
```bash
python3 -m venv venv
```
Activate the environment:

macOS/Linux:
```bash
source venv/bin/activate
```
Windows (Command Prompt):
```bash
venv\Scripts\activate.bat
```
Windows (PowerShell):
```powershell
venv\Scripts\Activate.ps1
```
Success indicator: You should see `(venv)` at the beginning of your terminal prompt.

Example:
```
(venv) user@computer Voice_Test_Project %
```
Upgrade pip first:
```bash
pip install --upgrade pip
```
Install all required packages:
```bash
pip install -r requirements.txt
```
This installs 100+ packages including:
- `streamlit` - Web UI framework
- `groq` - ASR and LLM API client
- `kokoro` - Local TTS engine
- `sounddevice` - Audio recording/playback
- `numpy`, `torch` - Audio processing
- `loguru` - Logging
- And many dependencies

Installation takes 3-5 minutes. Wait for completion.

Verify installation:
```bash
python -c "import streamlit, groq, kokoro, sounddevice; print('All core modules installed!')"
```
Create the .env file from the template:
```bash
cp .env.example .env
```
Edit the .env file:
macOS/Linux:
```bash
nano .env
# or use: vim .env, code .env, open -a TextEdit .env
```
Windows:
```bash
notepad .env
```
Add your configuration:
```bash
# ========================================
# REQUIRED: Groq API Configuration
# ========================================
GROQ_API_KEY=your_actual_groq_api_key_here

# ========================================
# Optional: Model Selection (defaults work well)
# ========================================
GROQ_ASR_MODEL=whisper-large-v3-turbo
GROQ_LLM_MODEL=openai/gpt-oss-20b

# ========================================
# Optional: Kokoro TTS Configuration
# ========================================
KOKORO_VOICE=af_sky
KOKORO_LANG_CODE=a

# ========================================
# Optional: Advanced Settings
# ========================================
ALLOW_FALLBACK_TTS=0
SEED=0
LLM_PRICE_IN_PER_1K=0
LLM_PRICE_OUT_PER_1K=0
```
Important: Replace `your_actual_groq_api_key_here` with your real API key from Step 3!
Save the file:
- nano: Press `Ctrl+O`, `Enter`, then `Ctrl+X`
- vim: Press `Esc`, type `:wq`, press `Enter`
- Windows Notepad: File → Save
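The application reads these values with python-dotenv (the same library used by the verification command in the next step). Here is a minimal sketch of how loading and validating might look; the defaults and error handling are illustrative:

```python
# Sketch: how the .env values can be loaded and validated with
# python-dotenv. Variable names match the template above; the
# defaults and error handling here are illustrative.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

api_key = os.getenv("GROQ_API_KEY", "")
if not api_key.startswith("gsk_"):
    raise SystemExit("GROQ_API_KEY missing or malformed (should start with gsk_)")

asr_model = os.getenv("GROQ_ASR_MODEL", "whisper-large-v3-turbo")
llm_model = os.getenv("GROQ_LLM_MODEL", "openai/gpt-oss-20b")
voice = os.getenv("KOKORO_VOICE", "af_sky")
print(f"ASR={asr_model}  LLM={llm_model}  voice={voice}")
```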
Run these checks to ensure everything is set up correctly:

1. Check virtual environment:
```bash
which python
# macOS/Linux: should show /path/to/Voice_Test_Project/venv/bin/python
# Windows: should show \path\to\Voice_Test_Project\venv\Scripts\python
```
2. Check Python modules:
```bash
python -c "import streamlit, groq, kokoro, sounddevice, numpy, torch; print('All modules imported successfully!')"
```
3. Check espeak-ng:
```bash
espeak-ng --version
```
4. Check Groq API key:
```bash
python -c "import os; from dotenv import load_dotenv; load_dotenv(); key = os.getenv('GROQ_API_KEY'); print('API Key loaded!' if key and key.startswith('gsk_') else 'API Key missing or invalid!')"
```
5. Check project structure:
```bash
ls -la src/ config/
# Should show: asr_module.py, llm_module.py, tts_module.py, etc.
# Should show: personas/ directory with JSON files
```
All checks should pass before proceeding to Step 9.
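As an optional extra check, you can confirm the key is accepted by the live API. This sketch assumes the Groq SDK's `models.list()` endpoint, a lightweight read-only call:

```python
# Optional live check (a sketch): confirm the key is accepted by the
# Groq API. models.list() is a lightweight read-only endpoint.
from dotenv import load_dotenv
from groq import Groq

load_dotenv()
models = Groq().models.list()
print(f"Key accepted; {len(models.data)} models available")
```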
Start the Streamlit web application:
```bash
streamlit run streamlit_app.py
```
Expected output:
```
You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://192.168.x.x:8501

For better performance, install the Watchdog module:

  $ xcode-select --install
  $ pip install watchdog
```
Your browser should automatically open to http://localhost:8501
If it doesn't open automatically:
- Manually open your web browser
- Navigate to `http://localhost:8501`
You should see: The AI Voice Assistant interface with scenario selection and recording buttons.
Test the complete pipeline:
- Select a scenario from the dropdown (e.g., "Lost Card")
- Click "START RECORDING" button
- Speak clearly: "Hi, I lost my credit card yesterday"
- Click "STOP & PROCESS" button
- Wait for processing:
- Transcribing... (~1 second)
- Thinking... (~1-2 seconds)
- Speaking... (~3-5 seconds)
- Listen to the AI's response
Expected behavior:
- You should see your transcribed text in the conversation
- The AI should respond with an empathetic bank agent response
- You should hear the AI speaking through your speakers
If you hear a proper AI response, congratulations! Setup is complete!
Edit .env file to customize:
```bash
# ==================================================
# GROQ API CONFIGURATION
# ==================================================

# Your Groq API key (REQUIRED)
GROQ_API_KEY=gsk_your_key_here

# ASR Model (optional, default: whisper-large-v3-turbo)
# Options: whisper-large-v3-turbo, whisper-large-v3
GROQ_ASR_MODEL=whisper-large-v3-turbo

# LLM Model (optional, default: openai/gpt-oss-20b)
# Options: openai/gpt-oss-20b, llama-3.1-70b-versatile, mixtral-8x7b-32768
GROQ_LLM_MODEL=openai/gpt-oss-20b

# ==================================================
# KOKORO TTS CONFIGURATION
# ==================================================

# Voice selection (default: af_sky)
# Female: af_sky, af_bella, af_heart, af_nicole, af_sarah
# Male: am_adam, am_michael
KOKORO_VOICE=af_sky

# Language code (default: a = American English)
# Options: a (American), b (British)
KOKORO_LANG_CODE=a

# ==================================================
# ADVANCED SETTINGS
# ==================================================

# Allow fallback to macOS 'say' command if Kokoro fails (0 = disabled, 1 = enabled)
ALLOW_FALLBACK_TTS=0

# LLM seed for reproducibility (0 = random, any int = fixed seed)
SEED=0

# Cost tracking (set to actual prices if needed)
LLM_PRICE_IN_PER_1K=0
LLM_PRICE_OUT_PER_1K=0
```
Female Voices:
| Voice | Description | Use Case |
|---|---|---|
| `af_sky` | Clear, friendly (default) | General purpose, professional |
| `af_bella` | Elegant, sophisticated | Premium services, upscale |
| `af_heart` | Warm, engaging | Empathetic support, care |
| `af_nicole` | Professional, authoritative | Corporate, formal |
| `af_sarah` | Soft, gentle | Calming, reassuring |
Male Voices:
| Voice | Description | Use Case |
|---|---|---|
| `am_adam` | Professional, clear | Business, technical |
| `am_michael` | Deep, authoritative | Leadership, serious topics |
To change voice:
- Edit the `.env` file
- Set `KOKORO_VOICE=af_bella` (or any voice above)
- Restart the application
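If you want to audition voices before committing to one in `.env`, a short sketch like this plays the same sample sentence through several Kokoro voices (voice names from the tables above):

```python
# Sketch: audition several Kokoro voices with the same sentence
# before picking one in .env. Voice names come from the tables above.
import numpy as np
import sounddevice as sd
from kokoro import KPipeline

tts = KPipeline(lang_code="a")  # American English
for voice in ("af_sky", "af_bella", "am_adam"):
    print(f"Playing sample for {voice}...")
    for _, _, audio in tts("Thank you for calling. How can I help?", voice=voice):
        sd.play(np.asarray(audio), samplerate=24000)
        sd.wait()
```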
Start the application:
```bash
streamlit run streamlit_app.py
```
Using the interface:
1. Select Scenario: Choose from the dropdown
   - Lost Card
   - Failed Transfer
   - Locked Account
2. Start Recording: Click the "START RECORDING" button
   - Status changes to "Recording - Speak now..."
   - Speak your question clearly
3. Stop & Process: Click the "STOP & PROCESS" button
   - System transcribes your speech (ASR)
   - Generates intelligent response (LLM)
   - Synthesizes voice (TTS)
   - Plays response
4. Continue Conversation: Repeat steps 2-3 for follow-up questions
5. New Conversation: Click "New Conversation" to reset
Tips for best results:
- Speak clearly and at normal pace
- Wait for "Recording" status before speaking
- Minimize background noise
- Use a good microphone (built-in works fine)
- Click STOP immediately after finishing your question
For automation and scripting:
```bash
python main.py --persona card_lost --turns 3
```
Arguments:
- `--persona`: Scenario selection
  - `card_lost` - Lost card support
  - `transfer_failed` - Failed transfer support
  - `account_locked` - Locked account support
- `--turns`: Number of conversation turns (default: 3)
Example:
```bash
# 5-turn conversation for transfer failure scenario
python main.py --persona transfer_failed --turns 5
```
CLI features:
- Automatic voice activity detection (VAD)
- Streaming transcription with partials
- Real-time conversation
- Performance logging to CSV
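The CSV log can be summarized offline. The sketch below assumes per-stage columns named `asr_ms`, `llm_ms`, and `tts_ms`; check the actual header of `logs/latency_log.csv` and adjust the names:

```python
# Sketch: summarize per-stage latency from logs/latency_log.csv.
# The column names asr_ms/llm_ms/tts_ms are assumptions -- check the
# actual CSV header and adjust.
import csv
from statistics import mean

with open("logs/latency_log.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for stage in ("asr_ms", "llm_ms", "tts_ms"):
    values = [float(row[stage]) for row in rows if row.get(stage)]
    if values:
        print(f"{stage}: avg {mean(values):.0f} ms over {len(values)} turns")
```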
```
Voice_Test_Project/
├── src/                          # Core application modules
│   ├── __init__.py               # Package initialization
│   ├── asr_module.py             # Speech recognition (Groq Whisper)
│   ├── llm_module.py             # Language model (Groq LLM)
│   ├── tts_module.py             # Text-to-speech (Kokoro)
│   ├── simple_voice_handler.py   # Manual recording pipeline
│   ├── voice_client.py           # Auto VAD pipeline (CLI mode)
│   ├── state_manager.py          # Conversation state tracking
│   ├── logger.py                 # Logging utilities
│   └── feedback.py               # Post-conversation evaluation
│
├── config/                       # Configuration files
│   ├── __init__.py
│   └── personas/                 # AI behavior definitions
│       ├── card_lost.json        # Lost card scenario
│       ├── transfer_failed.json  # Failed transfer scenario
│       └── account_locked.json   # Locked account scenario
│
├── logs/                         # Performance logs
│   ├── .gitkeep
│   └── latency_log.csv           # Auto-generated metrics
│
├── streamlit_app.py              # Web UI application (main entry)
├── main.py                       # CLI entry point
├── requirements.txt              # Python dependencies
├── .env.example                  # Environment template
├── .env                          # Your configuration (create this)
├── .gitignore                    # Git ignore rules
├── README.md                     # This file
└── venv/                         # Virtual environment (created by you)
```
Key Files:
- `streamlit_app.py`: Main web interface
- `main.py`: CLI interface for automation
- `src/simple_voice_handler.py`: Core pipeline logic
- `config/personas/*.json`: AI behavior definitions
- `.env`: Your API keys and configuration
The system includes three pre-configured bank support scenarios:
Persona File: config/personas/card_lost.json
AI Behavior:
- Tone: Empathetic and reassuring
- Priority: Security and quick action
- Workflow:
- Acknowledge customer concern warmly
- Ask for security verification (last 4 digits)
- Confirm immediate card blocking
- Explain replacement timeline (5-7 business days)
- Offer digital card alternatives
- Provide fraud monitoring information
Example interaction:
- User: "I lost my credit card yesterday"
- AI: "I'm really sorry to hear that. Let me help you secure your account right away. Can you confirm the last 4 digits of your card for security?"
Persona File: config/personas/transfer_failed.json
AI Behavior:
- Tone: Solution-focused and efficient
- Priority: Fast resolution
- Workflow:
- Acknowledge frustration quickly
- Request transfer details (amount, recipient, time)
- Identify issue (balance, limits, recipient problems)
- Provide immediate solution
- Suggest alternatives if needed
Example interaction:
- User: "My transfer to Sarah didn't go through"
- AI: "I'm sorry you're having troubleβlet's get this sorted quickly. Can you tell me the amount you tried to send and when you attempted the transfer?"
Persona File: config/personas/account_locked.json
AI Behavior:
- Tone: Reassuring and educational
- Priority: Security explanation and unlock
- Workflow:
- Reassure it's a security measure
- Explain trigger (travel, unusual activity)
- Verify customer identity
- Unlock account
- Educate on prevention
Example interaction:
- User: "I can't log into my account anymore"
- AI: "Don't worry, this is a security measure to protect your account. I can help unlock it. Have you traveled recently or made any unusual transactions?"
Edit JSON files in config/personas/ to customize AI behavior:
```json
{
  "name": "Lost Card Support",
  "scenario": "Card Lost",
  "system_prompt": "You are an empathetic bank support agent helping a customer who lost their card. Be warm, security-focused, and provide clear next steps..."
}
```
Fields:
- `name`: Display name for the persona
- `scenario`: Short scenario description
- `system_prompt`: Detailed instructions for the LLM
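To see how a persona feeds the LLM, here is a minimal sketch that loads one of the JSON files and seeds a conversation with its `system_prompt`; the message wiring is illustrative, not the exact handler code:

```python
# Sketch: load a persona file and seed an LLM conversation with it.
# Field names match the JSON schema above; the wiring is illustrative.
import json

with open("config/personas/card_lost.json") as f:
    persona = json.load(f)

messages = [
    {"role": "system", "content": persona["system_prompt"]},
    {"role": "user", "content": "I lost my credit card yesterday"},
]
print(f"Loaded persona: {persona['name']} ({persona['scenario']})")
```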
| Component | Time Range | Average | Notes |
|---|---|---|---|
| Recording | User-controlled | Variable | Until user clicks STOP |
| ASR | 500-1500ms | ~800ms | Depends on audio length |
| LLM | 400-2000ms | ~500ms | Depends on response length |
| TTS | 2000-10000ms | ~5000ms | Depends on response length |
| Total | 3-13 seconds | ~6 seconds | Excluding recording time |
| Resource | Usage | Notes |
|---|---|---|
| Memory | 300-500MB | Includes loaded models |
| CPU | Moderate | Peaks during TTS synthesis |
| Network | ~50-200KB/turn | Only for ASR and LLM API calls |
| Storage | ~500MB | Models and dependencies |
- Faster responses: Use shorter questions
- Better accuracy: Speak clearly with minimal background noise
- Reduce latency: Use a faster internet connection
- Lower memory: Close other applications
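If you want to reproduce the per-stage numbers above on your own machine, a simple timing wrapper is enough. The stage functions in the usage comment are placeholders, not the project's real function names:

```python
# Sketch: a timing wrapper for reproducing the per-stage numbers above.
import time

def timed(label, fn, *args, **kwargs):
    # Run fn, print its wall-clock latency in ms, and return its result.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return result

# Usage -- transcribe/generate/synthesize are placeholders, not the
# project's real function names:
# text  = timed("ASR", transcribe, wav_bytes)
# reply = timed("LLM", generate, text)
# audio = timed("TTS", synthesize, reply)
```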
Cause: Running from wrong directory
Solution:
```bash
# Make sure you're in the project root
cd /path/to/Voice_Test_Project

# Then run
streamlit run streamlit_app.py
```
Causes:
- Microphone not working
- Wrong microphone selected
- Didn't click STOP button
- No permission for microphone
Solutions:
- Check microphone: Test with system recorder
- Check permissions: Allow microphone access in System Preferences/Settings
- Click STOP: Must click STOP & PROCESS after speaking
- Try different mic: Select different device in system settings
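To check what your OS is exposing to the app, `sounddevice` can enumerate audio devices directly:

```python
# Sketch: enumerate audio devices to confirm the microphone is visible.
import sounddevice as sd

print(sd.query_devices())  # all input/output devices
print("Default input:", sd.query_devices(kind="input")["name"])
```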
Cause: Invalid or missing Groq API key
Solution:
```bash
# Check if .env file exists
ls -la .env

# Check if API key is set
cat .env | grep GROQ_API_KEY

# If empty or wrong, edit .env
nano .env
# Add: GROQ_API_KEY=your_actual_key_here
```
Cause: espeak-ng not installed or not in PATH
Solution:
```bash
# macOS
brew install espeak-ng

# Ubuntu/Debian
sudo apt-get install espeak-ng

# Verify
espeak-ng --version
```
Cause: Code mismatch or outdated files
Solution:
```bash
# Make sure all files are up to date
# Check that src/asr_module.py has the transcribe_wav_bytes method
# Restart the application
```
Cause: Another Streamlit app is running
Solution:
```bash
# Kill existing Streamlit process
pkill -f streamlit

# Or use a different port
streamlit run streamlit_app.py --server.port 8502
```
Causes:
- Poor internet connection
- Groq API issues
- Rate limiting
Solutions:
- Check internet connection
- Wait a moment and try again
- Check Groq Status
- Verify API key quota
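For transient failures such as rate limits, a small retry-with-backoff wrapper around the API calls usually helps. This is an illustrative sketch, not the project's built-in error handling:

```python
# Sketch: retry with exponential backoff for transient API errors
# (rate limits, brief network failures). Illustrative only -- not the
# project's built-in error handling.
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if i == attempts - 1:
                raise
            delay = base_delay * (2 ** i)
            print(f"Attempt {i + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Example: reply = with_retries(lambda: client.chat.completions.create(...))
```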