Real-time speech analysis for CEAT tyre sales assistants. Transcribes live audio, analyzes customer sentiment and persona, detects competitor mentions, and surfaces actionable coaching suggestions as the conversation unfolds.
- Real-time transcription with Azure Speech or Gemini 2.0 Flash
- Speaker role detection - Automatically identifies sales rep vs customer
- Role-based processing - Only customer speech triggers agent analysis
- Multi-speaker batching - Combine segments for efficient processing
- Multi-agent analysis running in parallel for low latency:
- Sentiment analysis with confidence scores and signal detection
- Persona inference (6 CEAT customer personas)
- Product analysis with upsell opportunity detection
- Competitor intelligence with counter-positioning
- Contextual sales scripts and objection handlers
- Actionable suggestions consolidated from all agents (2-3 bullets max)
- Conversation context tracking throughout the session
- Per-agent model configuration for optimal performance
- Unified CLI - Single entry point with --file, --backend, --ui options
```
Audio Input → Transcription → Role Assignment → Agents → Consolidator → Display
                                    │              │
                              ┌─────┴─────┐        ├── Sentiment
                              │           │        ├── Persona
                          sales_rep    customer    ├── Product
                              │           │        ├── Competition
                         (log only)  (analyze)     └── Sales Prompts
```
The system automatically assigns roles based on call type and speaker order:
- Outbound calls (default): First speaker is `sales_rep`, second is `customer`
- Inbound calls: First speaker is `customer`, second is `sales_rep`
Role-based processing:
- `sales_rep` speech → Transcript stored only (no agent analysis)
- `customer` speech → Full multi-agent analysis
This prevents wasted computation on sales rep utterances where persona/sentiment analysis doesn't apply.
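A minimal sketch of this gating logic (hypothetical function names for illustration; the actual role handling lives in `context.py` and may differ in detail):

```python
# Hypothetical sketch of role assignment and analysis gating.
# Roles are assigned by call type and speaker order; only customer
# speech is passed on to the multi-agent pipeline.

def assign_role(speaker_index: int, call_type: str = "outbound") -> str:
    """First speaker is the rep on outbound calls, the customer on inbound."""
    first, second = (
        ("sales_rep", "customer")
        if call_type == "outbound"
        else ("customer", "sales_rep")
    )
    return first if speaker_index == 0 else second


def should_analyze(role: str) -> bool:
    # sales_rep utterances are logged only; customer utterances trigger agents.
    return role == "customer"
```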
Batching combines multiple transcript segments for efficient processing (single agent call instead of per-segment):
- Gemini: Automatically detects multiple speakers within each VAD-triggered audio chunk. Additionally, use `--batch-timeout` or `--batch-max` to combine multiple audio chunks.
- Azure: Use `--batch-timeout MS` (time window) or `--batch-max N` (segment count) to combine sequential utterances.
```sh
# Batch within 2-second windows (both backends)
speechai --batch-timeout 2000

# Batch every 3 transcript chunks
speechai --batch-max 3

# Combine both: flush after 3 segments OR 2 seconds, whichever comes first
speechai --batch-timeout 2000 --batch-max 3
```

When batched, all segments are stored in history, but customer text is combined into a single agent analysis call, reducing API costs and providing more context.
Note on speaker detection: If the backend cannot distinguish speakers (returns "Unknown"), the system assumes turn-based alternation between sales rep and customer.
| Agent | Purpose | Default Model |
|---|---|---|
| Sentiment | Detects positive/negative/neutral sentiment with signals | gpt-5-mini |
| Persona | Infers customer persona from 6 CEAT segments | gpt-5-mini |
| Product | Identifies products mentioned, upsell opportunities | gpt-5-mini |
| Competition | Detects competitor mentions, provides counter-positioning | gpt-5-mini |
| Sales Prompts | Retrieves contextual scripts and objection handlers | claude-haiku-4-5 |
| Consolidator | Combines all outputs into 2-3 actionable suggestions | claude-haiku-4-5 |
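The parallel fan-out can be illustrated with a toy `asyncio.gather` sketch (stub agents for illustration; the real agents are classes under `src/speechai/agents/` run by `AgentOrchestrator`):

```python
import asyncio

# Toy stand-ins for the real agents; each would normally call an LLM.
async def sentiment_agent(text: str) -> dict:
    return {"agent": "sentiment", "label": "neutral"}

async def persona_agent(text: str) -> dict:
    return {"agent": "persona", "persona": "Pragmatic Purnima"}

async def analyze(text: str) -> list[dict]:
    # All agents run concurrently, so total latency is roughly that of
    # the slowest agent rather than the sum of all of them.
    return await asyncio.gather(sentiment_agent(text), persona_agent(text))

results = asyncio.run(analyze("I need tyres for highway driving"))
```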
6 Customer Personas:
- Entitled Evan (premium demanding)
- Impatient Ashish (time-sensitive)
- Pragmatic Purnima (safety/hassle-free)
- Thorough Tushar (research-oriented)
- Savvy Sarabh (value-seeking)
- Bindaas Bharat (durability-focused)
8 CEAT Products: SportDrive SUV CALM, SportDrive, CrossDrive AT, SecuraDrive SUV, SecuraDrive, Energy Drive, Milaze X5, Milaze X3
6 Competitors Tracked: MRF, Apollo, JK Tyre, Bridgestone, Michelin, Goodyear
Requires Python 3.12+ and uv.
```sh
# Clone the repository
git clone https://github.com/yourusername/speechAI.git
cd speechAI

# Install dependencies
uv sync
```

- ffmpeg - Required for audio file processing (MP3 conversion)

```sh
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
```

Key libraries (installed automatically via `uv sync`):
- textual - Terminal UI framework for the Textual dashboard
- gradio - Web UI framework for the browser dashboard
- azure-cognitiveservices-speech - Azure Speech SDK
- litellm - LLM routing for agents
- sounddevice / webrtcvad - Audio capture and voice activity detection
Create a .env file with your credentials:
```sh
# Azure Speech (for Azure mode)
AZURE_SPEECH_KEY=your_key
AZURE_SPEECH_ENDPOINT=your_endpoint

# LiteLLM proxy (agents use this)
LITELLM_BASE_URL=http://localhost:4000
LITELLM_API_KEY=sk-1234

# Gemini transcription (for Gemini mode)
GEMINI_MODEL=vertex_ai/gemini-2.0-flash

# Debug mode (optional - enables verbose logging)
SPEECHAI_DEBUG=true
```

Each agent has a default model optimized for its task. Override via environment variables:
```sh
# JSON output agents (reliable structured output)
LITELLM_MODEL_SENTIMENT=gpt-5-mini
LITELLM_MODEL_PERSONA=gpt-5-mini
LITELLM_MODEL_PRODUCT=gpt-5-mini
LITELLM_MODEL_COMPETITION=gpt-5-mini

# Tone-sensitive agents (customer-facing scripts)
LITELLM_MODEL_SALES_PROMPTS=claude-haiku-4-5
LITELLM_MODEL_CONSOLIDATOR=claude-haiku-4-5
```

Model recommendations:
| Model | Best For |
|---|---|
| `gpt-5-mini` | Reliable JSON output, structured data extraction |
| `gemini-2.5-flash-lite` | Large context, cost-effective |
| `deepseek-v3.1` | Strong reasoning, cost-effective |
| `claude-haiku-4-5` | Natural conversational tone, customer-facing text |
Enable verbose logging for troubleshooting:
```sh
SPEECHAI_DEBUG=true
```

Debug output includes:
- Transcription: Buffer processing, API response timing, callback invocations
- Agents: Parallel agent completion with results, consolidator timing
- UI: Analysis updates with metrics (signals, suggestions, latency)
Single unified CLI with all options:
```sh
speechai [OPTIONS]
```

| Option | Description |
|---|---|
| `--file, -f PATH` | Audio file to process (default: microphone) |
| `--backend, -b {gemini,azure}` | Transcription backend (default: gemini) |
| `--ui` | Use Textual terminal dashboard |
| `--web` | Use Gradio web UI at http://localhost:7860 |
| `--no-realtime` | Process file as fast as possible (no pacing) |
| `--batch-timeout MS` | Batch transcripts within time window (ms) |
| `--batch-max N` | Max segments to batch before processing |
```sh
# Live microphone + Gemini + CLI (default)
uv run speechai

# Live microphone + Azure + CLI
uv run speechai --backend azure

# Live microphone + Gemini + Textual UI dashboard
uv run speechai --ui

# Live microphone + Gemini + Web UI (Gradio)
uv run speechai --web

# Stream audio file with Gemini
uv run speechai --file recording.mp3

# Stream file with Azure + Textual UI
uv run speechai --file recording.mp3 --backend azure --ui

# Stream file with Web UI
uv run speechai --file recording.mp3 --web

# Fast file processing (no real-time pacing)
uv run speechai --file recording.mp3 --no-realtime

# Batch Azure results (2-second window)
uv run speechai -b azure --batch-timeout 2000

# Batch every 3 segments
uv run speechai -b azure --batch-max 3
```

Keyboard commands:

- `Ctrl+C` - Quit
- In Textual UI mode: `r` = reset, `m` = mute, `q` = quit
Terminal-based dashboard with fixed panels that update as utterances are detected:
- Fixed panels for Sentiment, Persona, Product, Competitors
- Suggestions panel with 2-3 actionable bullets
- Real-time latency display
- Scrollable history log
Gradio-based web interface at http://localhost:7860:
- Browser-accessible dashboard
- Same panels and layout as Textual UI
- Auto-refreshing display
- Reset and Mute controls
Supported audio formats: .mp3, .wav, .m4a, .ogg, .flac
For batch processing multiple audio files with Azure Speech, translation, and LLM analysis.
Directory structure:
```
data/
├── phrases.txt      # Phrase list for improved recognition
├── transcripts/     # JSON files from Azure Speech
├── translations/    # Formatted English transcripts (_en.txt)
└── analysis/        # Analysis JSON files + combined report
```
Environment variables:
```sh
# Azure Speech & Storage (required for transcription)
AZURE_SPEECH_KEY=your_key
AZURE_SPEECH_REGION=swedencentral
AZURE_STORAGE_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=...

# For Azure translation:
AZURE_TRANSLATOR_KEY=your_key
AZURE_TRANSLATOR_ENDPOINT=https://your-endpoint.cognitiveservices.azure.com

# For LLM translation (--llm flag):
LLM_BASE_URL=http://localhost:4000
LLM_API_KEY=sk-1234
LLM_MODEL_ANALYZE=kimi-2.5
LLM_MODEL_TRANSLATE=kimi-2.5
```

```sh
# Transcribe audio files (auto-detect language)
uv run python scripts/batch_transcribe.py ./data/recordings
```

Converts JSON transcripts to formatted English with timestamps and speaker labels.
```sh
# Using Azure Translator
uv run python scripts/batch_transcribe.py --translate-only

# Using LLM (recommended for better quality)
uv run python scripts/batch_transcribe.py --translate-only --llm
```

Output format (`data/translations/*_en.txt`):
```
[0.0s] Customer: Hello, I'm calling about your service
[3.5s] Sales Rep: Welcome! How can I help you?
[8.2s] Customer: I want to know the pricing details
       (Original hi-IN: मुझे कीमत की जानकारी चाहिए)
```
```sh
# Full analysis: analyze transcripts and generate report
uv run python scripts/analyze_transcripts.py

# Report only: regenerate report from existing analysis files
uv run python scripts/analyze_transcripts.py --report-only
```

Output:

- `data/analysis/*_analysis.json` - Individual analysis per call
- `data/analysis/report_*.md` - Combined executive report (for agents & dealers)
```sh
# Transcribe + translate in one step
uv run python scripts/batch_transcribe.py ./data/recordings --translate --llm

# Download from existing job + translate
uv run python scripts/batch_transcribe.py --job-id <id> --translate --llm

# Use existing Azure container
uv run python scripts/batch_transcribe.py --container <name> --translate --llm
```

Create `data/phrases.txt` to improve recognition of domain-specific terms:

```
# One phrase per line, comments start with #
product name
company name
technical term
```
`batch_transcribe.py`:

| Option | Description |
|---|---|
| `--translate` | Translate after transcription |
| `--translate-only` | Only translate existing JSON files |
| `--llm` | Use LLM for translation (instead of Azure) |
| `--job-id` | Download from existing transcription job |
| `--container` | Use existing Azure container (skip upload) |
| `--phrases` | Phrase list file (default: `./data/phrases.txt`) |
| `--locale` | Language locale (auto-detect if not set) |
`analyze_transcripts.py`:

| Option | Description |
|---|---|
| `--input` | Input directory (default: `./data/translations`) |
| `--output` | Output directory (default: `./data/analysis`) |
| `--model` | LLM model (default: gemini-2.5-pro) |
| `--report-only` | Generate report only from existing analysis JSON files |
Per call: summary, outcome, sentiment, objections, customer pain points, sales rep performance, action items, risk flags
Combined report: executive summary, outcomes breakdown, customer insights, objection patterns, recommendations
```
[14:23:45] Customer │ INTERESTED (78%) │ Pragmatic Purnima
"I'm looking for tyres for my Innova. Safety is important, we do a lot of
 highway driving. But I've heard MRF is more durable..."

Suggestions:
→ Recommend SecuraDrive SUV - excellent wet grip for highway safety
→ Counter MRF: "CEAT offers similar durability with CALM noise reduction technology"
→ Mention run-flat capability for highway peace of mind

Persona: Pragmatic Purnima (safety-focused, hassle-free)
Products: SecuraDrive SUV, Run-flat option
Competitor: MRF mentioned
Upsell: Run-flat tyres (highway safety trigger)

[Azure STT: 1250ms | 5 Agents: 487ms | Consolidator: 312ms | Total: 2049ms]
```
At the end of each session (or on reset), you'll see:
```
────────────────────────────────────
Session Summary:
  Utterances: 12
  Duration: 145s
  Sentiment: {'positive': 4, 'negative': 3, 'neutral': 5}
  Signals: {'budget': 2, 'interest': 3, 'objection': 1}
  Personas detected: Pragmatic Purnima (7x), Thorough Tushar (3x)
  Products discussed: SecuraDrive SUV, SportDrive
  Competitors mentioned: MRF (2x), Apollo (1x)
  Upsell opportunities: run_flat, calm_technology
────────────────────────────────────
```
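The per-session tallies behind a summary like this can be accumulated with `collections.Counter` (a hypothetical sketch; the project's actual tracking lives in `context.py`):

```python
from collections import Counter

# Hypothetical per-utterance results; real ones come from the agents.
utterance_results = [
    {"sentiment": "positive", "persona": "Pragmatic Purnima", "competitors": ["MRF"]},
    {"sentiment": "neutral", "persona": "Pragmatic Purnima", "competitors": []},
]

sentiments: Counter[str] = Counter()
personas: Counter[str] = Counter()
competitors: Counter[str] = Counter()

for result in utterance_results:
    sentiments[result["sentiment"]] += 1
    personas[result["persona"]] += 1
    competitors.update(result["competitors"])  # counts each mention
```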
```
├── data/
│   ├── phrases.txt              # Phrase list for recognition (70+ CEAT terms)
│   ├── transcripts/             # JSON files from Azure Speech
│   ├── translations/            # Formatted English transcripts
│   └── analysis/                # Analysis JSON + reports
│
├── scripts/
│   ├── batch_transcribe.py      # Batch transcription + translation
│   └── analyze_transcripts.py   # LLM-based transcript analysis
│
└── src/speechai/
    ├── main.py                  # Unified entry point (all modes)
    ├── ui.py                    # Textual terminal UI components
    ├── ui_web.py                # Gradio web UI components
    ├── display.py               # CLI terminal output formatting
    ├── context.py               # Conversation context + role assignment
    ├── transcription.py         # Azure Speech transcriber + data types
    ├── transcription_gemini.py  # Gemini VAD transcriber
    ├── transcription_file.py    # File-based transcription utilities
    ├── prompts.yaml             # Agent prompts + CEAT domain knowledge
    └── agents/
        ├── __init__.py          # Exports all agents and constants
        ├── base.py              # BaseAgent with per-agent model config
        ├── sentiment.py         # Sentiment + signal detection
        ├── persona.py           # 6 CEAT personas with triggers
        ├── product.py           # CEAT products + upsell toolkit
        ├── competition.py       # Competitor intelligence database
        ├── sales_prompts.py     # Upsell scripts + objection handlers
        └── consolidator.py      # Consolidator + AgentOrchestrator
```
- Create a new agent in `src/speechai/agents/`:

```python
from speechai.agents.base import AgentResult, BaseAgent

class MyAgent(BaseAgent):
    name = "my_agent"
    default_model = "gpt-5-mini"  # or claude-haiku-4-5 for tone

    async def analyze(self, text: str, context: dict | None = None) -> AgentResult:
        # Your analysis logic
        return AgentResult(agent_name=self.name, success=True, data={...}, latency_ms=0)
```

- Add to `AgentOrchestrator.initialize()` in `consolidator.py`
- Include in the `asyncio.gather()` call in `AgentOrchestrator.process()`
- Update `Consolidator.analyze()` to use the new agent's output
- Add prompts to `prompts.yaml` under your agent's name
Edit src/speechai/prompts.yaml to customize:
- Sentiment detection criteria and signal keywords
- Persona triggers and characteristics
- Product catalog and upsell triggers
- Competitor counter-positioning
- Sales scripts and objection handlers
- Consolidator suggestion generation rules
Each agent embeds CEAT-specific knowledge as fallback:
- `persona.py`: `PERSONAS` dict with 6 customer segments
- `product.py`: `CEAT_PRODUCTS` and `UPSELL_TOOLKIT`
- `competition.py`: `COMPETITORS` and `CEAT_DIFFERENTIATORS`
- `sales_prompts.py`: `UPSELL_SCRIPTS` and `OBJECTION_HANDLERS`
This ensures agents can detect triggers even if LLM parsing fails.
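The fallback idea can be sketched as a plain substring scan over the embedded data (illustrative `COMPETITORS` entries below; the real dict lives in `competition.py` and is larger):

```python
# Sketch of the keyword-fallback idea: even when the LLM returns
# unparseable output, a case-insensitive substring scan over the
# embedded dict still catches competitor triggers.
COMPETITORS = {
    "MRF": "Position CEAT CALM technology for quieter rides",      # illustrative
    "Apollo": "Highlight CEAT's wet-grip performance",             # illustrative
}

def detect_competitors(text: str) -> dict[str, str]:
    """Return {competitor: counter-positioning} for each mention found."""
    lowered = text.lower()
    return {
        name: counter
        for name, counter in COMPETITORS.items()
        if name.lower() in lowered
    }
```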
MIT License - see LICENSE for details.