Live docs: jphein.github.io/speech-to-cli
Voice interface for AI coding assistants — talk to your CLI agent and hear it respond, powered by Azure Speech Services. Works with GitHub Copilot CLI, Claude Code, and Gemini CLI.
This project adds voice input and output to your terminal AI workflow via the Model Context Protocol (MCP):
| Tool | Description |
|---|---|
| MCP Server (`mcp_speech.py`) | Integrates with Copilot CLI, Claude Code, and Gemini CLI via MCP; gives your AI listen, speak, and converse tools |
| Voice Chat (`voice_chat.py`) | Standalone voice chat companion that runs in a second terminal alongside Copilot CLI |
| Speech-to-Text (`speech.py`) | Simple mic → text → clipboard tool |
| Text-to-Speech (`tts.py`) | Simple text → speech tool (reads from args, stdin, or clipboard) |
- Azure HD Voices: Uses high-quality DragonHD voices for natural-sounding speech.
- Thinking Hum: A subtle 150Hz tone that loops while the AI is processing.
- Visual Status: Colorful progress bars with live VU meters and real-time subtitles (🎤/🧠/🔊).
- Audio Feedback: Configurable chimes for ready, processing, speak, and done states.
- Low Latency: Streaming playback and persistent connections for fast responses.
- VAD: Energy-gated voice activity detection that auto-calibrates to your environment.
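To illustrate the Thinking Hum, a 150 Hz tone can be synthesized with the standard library alone. This is a minimal sketch, not the project's code: the helper name `make_hum`, the 16 kHz sample rate, the amplitude, and the one-second duration are all assumptions. A duration that holds a whole number of cycles (150 in one second) lets the buffer loop without a click.

```python
import math
import struct


def make_hum(freq_hz=150.0, seconds=1.0, rate=16000, amplitude=0.1):
    """Return 16-bit little-endian mono PCM bytes for a sine tone.

    Hypothetical helper; all default values are assumptions for this sketch.
    """
    n_samples = int(rate * seconds)
    peak = int(32767 * amplitude)  # keep the tone quiet
    samples = (
        int(peak * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n_samples)
    )
    return struct.pack("<%dh" % n_samples, *samples)


pcm = make_hum()
print(len(pcm))  # 32000 bytes: 16000 samples, 2 bytes each
```

The resulting bytes are raw PCM, so a player would need to be told the format (16-bit signed, 16 kHz, mono) explicitly.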
GitHub Copilot CLI brings agentic AI coding assistance directly to your terminal. It can edit files, run commands, search code, and interact with GitHub — all through natural language.
Copilot CLI gives you access to multiple frontier models via the /model command:
| Model | Tier |
|---|---|
| Claude Sonnet 4.5 | Standard (default) |
| Claude Sonnet 4 | Standard |
| Claude Opus 4.5 | Premium |
| Claude Opus 4.6 | Premium |
| Claude Haiku 4.5 | Fast |
| GPT-5.1 / 5.2 / 5.4 | Standard |
| GPT-5.1-Codex / 5.2-Codex / 5.3-Codex | Standard |
| GPT-5.1-Codex-Max | Standard |
| GPT-5 mini | Fast |
| Gemini 3 Pro (Preview) | Standard |
GitHub Education members get Copilot Pro free for 1 year, which includes Copilot CLI access. Sign up at education.github.com with your school email — no credit card required. Each prompt uses one premium request from your monthly quota.
curl -fsSL https://gh.io/copilot-install | bash
copilot # launch and authenticate

- Linux with ALSA audio (`arecord`/`aplay`)
- Python 3.8+
- An Azure Speech Services API key (free tier: 5 hours STT + 500K characters TTS per month)
This project defaults to Azure HD (DragonHD) voices — specifically en-US-Ava:DragonHDLatestNeural. These are Azure's highest-quality neural voices with natural intonation, breathing, and expressiveness that sounds remarkably human. You can change the voice via the AZURE_SPEECH_VOICE env var or config file.
Browse all available voices in the Azure Voice Gallery. Look for voices tagged HD or DragonHD for the best quality.
- Nonprofits: Microsoft offers up to $3,500/year in free Azure credits through Azure for Nonprofits. This more than covers Speech Services usage for voice-enabled Copilot workflows.
- Students: The Azure for Students program provides $100 in free credits with no credit card required — just verify with your school email.
git clone https://github.com/techempower-org/speech-to-cli.git
cd speech-to-cli
./install.sh

Set your Azure credentials via environment variables:
export AZURE_SPEECH_KEY="your-key-here"
export AZURE_SPEECH_REGION="westus2" # optional, default: westus2
export AZURE_SPEECH_VOICE="en-US-Ava:DragonHDLatestNeural" # optional

Or create a JSON config file at ~/.config/speech-to-cli/config.json. You can create the directory if it doesn't exist:

mkdir -p ~/.config/speech-to-cli

Example config.json:
{
"key": "your-azure-speech-key",
"region": "westus2",
"voice": "en-US-Ava:DragonHDLatestNeural",
"fast_voice": "en-US-AvaNeural",
"chime_hum": true,
"visual_indicator": true
}

| Key | Default | Description |
|---|---|---|
| `key` | None | Your Azure Speech Services API key. |
| `region` | `westus2` | The Azure region for your speech resource. |
| `voice` | `en-US-Ava:DragonHDLatestNeural` | Primary voice for high-quality (HD) synthesis. |
| `fast_voice` | `en-US-AvaNeural` | Low-latency voice for fast responses. |
| `chime_ready` | `true` | Play an ascending tone when the microphone opens. |
| `chime_processing` | `false` | Play a short "blip" when speech is recognized. |
| `chime_hum` | `false` | Start a looping 150Hz tone while thinking. |
| `chime_speak` | `false` | Play a descending tone before starting to speak. |
| `chime_done` | `false` | Play a double-tap tone when the AI is done talking. |
| `visual_indicator` | `true` | Show status icons (🎤/🧠) in the terminal. |
| `live_subtitles` | `true` | Show real-time partial transcription in progress bar. |
| `vu_meter` | `true` | Show live volume meter in progress bar. |
| `silence_timeout` | `3.0` | Seconds of silence after speech before auto-stop. |
| `loop_silence_timeout` | `1.2` | Silence timeout in continuous loop mode (shorter = faster turnaround). |
| `no_speech_timeout` | `7.0` | Max seconds to wait for any speech before giving up. |
| `talk_silence_timeout` | `4.0` | Silence timeout for talk/converse mode. |
| `energy_multiplier` | `2.5` | Noise gate threshold multiplier. Lower = more sensitive. |
| `half_duplex` | `"auto"` | `"auto"` (detect speakers/headphones), `"true"`, or `"false"`. |
| `continuous_dictation` | `false` | Auto-restart listening after each utterance (used by gnome-speaks). |
| `dictation_mode` | `true` | Type transcribed text at cursor (vs clipboard only). |
| `terminal_mode` | `false` | All lowercase, no auto-capitalization. |
| `end_word` | `"over"` | Say this word to immediately stop recording. |
| `max_record_seconds` | `120` | Absolute maximum recording duration. |
| `debug` | `false` | Write detailed logs to /tmp/speech-debug.log. |
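The two configuration sources could be combined as in the sketch below. This is an illustration under stated assumptions, not the project's loader: the function name `load_config` is hypothetical, only four of the documented keys are included, and the precedence (defaults, then the JSON file, then environment variables) is an assumption this README does not confirm.

```python
import json
import os
from pathlib import Path

# Subset of the documented defaults, kept short for the sketch.
DEFAULTS = {
    "key": None,
    "region": "westus2",
    "voice": "en-US-Ava:DragonHDLatestNeural",
    "fast_voice": "en-US-AvaNeural",
}

# Environment variables named in this README and the keys they set.
ENV_MAP = {
    "AZURE_SPEECH_KEY": "key",
    "AZURE_SPEECH_REGION": "region",
    "AZURE_SPEECH_VOICE": "voice",
}


def load_config(path=None, environ=None):
    """Merge defaults, the JSON config file, and environment variables.

    Assumption: environment variables win over the file.
    """
    environ = os.environ if environ is None else environ
    if path is None:
        path = Path.home() / ".config" / "speech-to-cli" / "config.json"
    config = dict(DEFAULTS)
    try:
        with open(path, encoding="utf-8") as fh:
            config.update(json.load(fh))
    except FileNotFoundError:
        pass  # no config file: keep the defaults
    for env_name, key in ENV_MAP.items():
        if environ.get(env_name):
            config[key] = environ[env_name]
    return config


print(load_config(path="/nonexistent/config.json", environ={})["region"])  # westus2
```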
⚠️ Never commit your API key. Use environment variables or the config file (which is in your home directory, outside the repo). See Security below.
The MCP server works with any AI CLI that supports the Model Context Protocol. Paired with a terminal AI such as Gemini CLI, Claude Code, or Copilot CLI, it creates a hands-free voice loop: the AI invokes the listen tool when it needs input, processes your request, and calls the speak tool to respond. Live subtitles and VU meters give real-time feedback in the terminal, and you never need to type a command.
Add to your ~/.copilot/mcp.json:
{
"mcpServers": {
"azure-speech": {
"command": "python3",
"args": ["/path/to/speech-to-cli/mcp_speech.py"]
}
}
}

Option A (Claude Code): Use the included .mcp.json (auto-detected when working in the project directory).

Option B (Claude Code): Add globally via the CLI:

claude mcp add --transport stdio azure-speech -- python3 /path/to/speech-to-cli/mcp_speech.py

Install as a Gemini extension (uses the included gemini-extension.json):

gemini extensions install /path/to/speech-to-cli

Or add manually to your ~/.gemini/settings.json:
{
"mcpServers": {
"azure-speech": {
"command": "python3",
"args": ["/path/to/speech-to-cli/mcp_speech.py"]
}
}
}

Restart your CLI. It will now have listen, speak, multi_speak, and converse tools available. Just say "listen" and your AI will record your voice, transcribe it, and respond. Ask it to "speak" and it'll read its response aloud. Use "converse" for a continuous voice chat loop. Use multi_speak for multi-agent conversations where each agent speaks with a different voice.
MCP Tools:
| Tool | Parameters | Description |
|---|---|---|
| `listen` | `seconds` (1-30), `mode` (streaming/vad/whisper/fixed) | Records from mic, returns transcribed text |
| `speak` | `text` (required), `quality` (fast/hd) | Speaks text aloud via Azure TTS |
| `talk` | `text` (required), `quality` (fast/hd) | Speaks text then listens for a reply: full-duplex TTS+STT in one call |
| `converse` | `seconds`, `mode` | Like listen, but signals conversational intent: Copilot will speak its reply then listen again |
| `multi_speak` | `segments` (array of {text, voice}), `quality` (fast/hd) | Speak multiple text+voice segments in one call; TTS requests fire in parallel, audio plays back-to-back |
| `multi_speak_stream` | `segments`, `quality` | Like multi_speak with streaming progress events |
| `configure` | `key`, `value` | View or change runtime settings |
| `get_voices` | (none) | List available Azure TTS voices |
| `pause` | (none) | Pause current TTS playback |
| `resume` | (none) | Resume paused TTS playback |
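For orientation, this is roughly what a client sends when it invokes one of these tools. MCP uses JSON-RPC 2.0, and over stdio each message is a single line of JSON. The sketch below only builds the message; the helper name `make_tool_call` is hypothetical, and a real session would first complete the MCP `initialize` handshake, which is omitted here.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Build one newline-delimited JSON-RPC 2.0 `tools/call` message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines, so the message stays on one line.
    return json.dumps(message) + "\n"


line = make_tool_call(1, "speak", {"text": "Build finished.", "quality": "fast"})
print(line, end="")
```

In normal use the AI CLI generates these messages itself; building them by hand is mainly useful for debugging the server.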
STT Modes (auto-selected by default):
| Mode | Description |
|---|---|
| `streaming` | Real-time Azure WebSocket + energy-gated VAD (fastest, default) |
| `vad` | Record with VAD, upload on silence |
| `whisper` | Local transcription via faster-whisper (offline, no network) |
| `fixed` | Record for full duration, then upload (fallback) |
TTS Quality:
| Quality | Voice | Azure latency | Best for |
|---|---|---|---|
| `fast` (default) | AvaNeural | ~120ms | Conversation, quick responses |
| `hd` | DragonHD | ~1200ms | High-quality narration |
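The quality setting above amounts to choosing a voice name before the SSML request is built. The sketch below shows one way that could look without an SDK. It is an assumption-laden illustration: `build_tts_request` is a hypothetical name, the endpoint, header names, and output format are taken from Azure's public REST documentation rather than from this project's source, and the request is constructed but never sent.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr

# Voice names are the defaults documented in this README.
VOICES = {
    "fast": "en-US-AvaNeural",
    "hd": "en-US-Ava:DragonHDLatestNeural",
}


def build_tts_request(text, key, region="westus2", quality="fast"):
    """Return an unsent urllib Request carrying SSML for Azure TTS."""
    voice = VOICES[quality]
    ssml = (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        "<voice name=%s>%s</voice></speak>" % (quoteattr(voice), escape(text))
    )
    return urllib.request.Request(
        "https://%s.tts.speech.microsoft.com/cognitiveservices/v1" % region,
        data=ssml.encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/ssml+xml",
            # Assumed MP3 output format; the project may use another one.
            "X-Microsoft-OutputFormat": "audio-24khz-48kbitrate-mono-mp3",
            "User-Agent": "speech-to-cli-sketch",
        },
        method="POST",
    )


req = build_tts_request("Tests passed & build is green.", key="your-key-here")
print(req.full_url)
```

Escaping the text matters because characters such as `&` and `<` would otherwise produce invalid SSML.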
The half_duplex setting controls whether TTS and STT can overlap:

- `"auto"` (default): Auto-detects speakers vs headphones. Headphones get full duplex; speakers get half duplex.
- `"true"` (half duplex): TTS must finish before the mic opens. Prevents the mic from hearing speaker output. A 0.5s drain buffer is added after TTS ends.
- `"false"` (full duplex): TTS and STT can run simultaneously. The recorder is prewarmed during TTS so listening starts immediately. Only works well with headphones.
The talk MCP tool combines TTS and STT in a single call — it speaks the provided text, then immediately listens for the user's reply:
talk(text="What would you like me to do?")
→ speaks the text, records user's reply, returns transcribed text
This is the same mechanism that gnome-speaks exposes over D-Bus as org.gnome.Speaks.Talk(text). When called via the MCP server, it uses speech_tts.talk_fullduplex() which handles the TTS→STT handoff, respecting the duplex setting. The talk_silence_timeout (default 4.0s) controls how long it waits for a reply.
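The interaction between the half_duplex setting and the talk handoff can be sketched as follows. This is not `speech_tts.talk_fullduplex()` itself: the names `resolve_half_duplex` and `talk`, and the injected `speak`, `listen`, and `sleep` callables, are hypothetical stand-ins. Recorder prewarming in full-duplex mode is not modeled.

```python
import time


def resolve_half_duplex(setting, headphones_detected):
    """Return True when TTS must finish before the mic opens."""
    value = str(setting).strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    if value == "auto":
        # Speakers would be picked up by the mic, so they get half duplex.
        return not headphones_detected
    raise ValueError("half_duplex must be 'auto', 'true', or 'false'")


def talk(text, speak, listen, half_duplex, drain_seconds=0.5, sleep=time.sleep):
    """Speak `text`, then return the transcribed reply from `listen()`."""
    speak(text)
    if half_duplex:
        sleep(drain_seconds)  # let speaker output fade before recording
    return listen()


# Usage with stand-in callables instead of real audio I/O.
reply = talk(
    "What would you like me to do?",
    speak=lambda text: None,
    listen=lambda: "run the tests",
    half_duplex=resolve_half_duplex("auto", headphones_detected=True),
)
print(reply)  # run the tests
```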
Run in a separate terminal alongside Copilot CLI:
python3 voice_chat.py

Flow:
- Press Enter → records your voice → transcribes → copies to clipboard
- Paste into Copilot CLI with Ctrl+Shift+V
- Copy Copilot's response with Ctrl+Shift+C
- Press Enter → speaks the response aloud → starts recording again
# Speech-to-text: record and transcribe to clipboard
python3 speech.py
# Text-to-speech: speak text aloud
python3 tts.py "Hello world"
echo "Hello" | python3 tts.py
python3 tts.py # speaks clipboard contents

- Recording: Uses `arecord` (ALSA) to capture 16kHz mono audio from the default input device
- Voice Activity Detection: Energy-gated VAD auto-calibrates to ambient noise, stops recording on silence (~400ms after speech ends)
- Speech-to-Text: Streams audio to Azure via persistent WebSocket for real-time recognition, with local Whisper fallback for offline use
- Text-to-Speech: Sends SSML to Azure TTS REST API with HTTP connection pooling; streams MP3 audio through mpv for immediate playback
- Ready chime: A short ascending tone plays before each recording so you know when to speak
- Performance: Connections are pre-warmed on startup; noise floor is cached between calls; response latency is ~275ms from end of speech to first audio byte
- MCP Protocol: Implements MCP (v2024-11-05) over stdio JSON-RPC for direct Copilot CLI integration
No Azure SDK required — just plain REST/WebSocket API calls.
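The energy-gated VAD step can be illustrated with a small, self-contained sketch. It is not the project's implementation: the function names, the 20 ms frame size, and the ten calibration frames are assumptions, while the 2.5 multiplier and the roughly 400 ms silence window echo values mentioned in this README.

```python
import math
import struct


def rms(frame):
    """Root-mean-square energy of a frame of 16-bit little-endian PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def find_end_of_speech(frames, frame_ms=20, calibration_frames=10,
                       energy_multiplier=2.5, silence_ms=400):
    """Return the index of the frame where recording would stop, or None.

    The first `calibration_frames` frames estimate the ambient noise floor;
    a frame counts as speech when its energy exceeds floor * multiplier.
    """
    frames = list(frames)
    calib = frames[:calibration_frames]
    noise_floor = max(sum(rms(f) for f in calib) / max(len(calib), 1), 1.0)
    threshold = noise_floor * energy_multiplier
    heard_speech = False
    silent_for_ms = 0
    for index in range(calibration_frames, len(frames)):
        if rms(frames[index]) > threshold:
            heard_speech = True
            silent_for_ms = 0  # speech resets the silence timer
        elif heard_speech:
            silent_for_ms += frame_ms
            if silent_for_ms >= silence_ms:
                return index
    return None
```

Calibrating against the first few frames is what lets the same multiplier work in both a quiet room and a noisy one, since the threshold scales with the measured floor.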
Other projects import the core modules from this repository directly rather than going through the MCP server:
- gnome-speaks: a GNOME Shell extension that adds voice control to the desktop. It imports `state.py`, `audio.py`, `stt.py`, and `speech_tts.py` via `sys.path`.
- the-oracle: proxies this MCP server through a FastMCP gateway so multiple clients can share a single speech backend.
The env var SPEECH_ENGINE_PATH can be set to the path of this directory so that downstream projects can locate and import the modules at runtime.
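A downstream project might use that variable along these lines. This is a sketch of the general `sys.path` technique, not code taken from gnome-speaks or the-oracle, and the helper name `add_engine_to_path` is hypothetical.

```python
import os
import sys


def add_engine_to_path(environ=None):
    """Prepend SPEECH_ENGINE_PATH to sys.path; return True if it was usable."""
    environ = os.environ if environ is None else environ
    engine_path = environ.get("SPEECH_ENGINE_PATH")
    if not engine_path or not os.path.isdir(engine_path):
        return False  # variable unset or not pointing at a directory
    if engine_path not in sys.path:
        sys.path.insert(0, engine_path)
    return True


print(add_engine_to_path({}))  # False: the variable is not set
```

Once the function returns True, the core modules (`state`, `audio`, `stt`, `speech_tts`) can be imported by name like any other module on the path.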
File roles:
| File | Role |
|---|---|
| `state.py` | Shared state, constants, config, helpers |
| `audio.py` | Audio I/O, device detection, chimes, UI, VAD |
| `stt.py` | Speech-to-text backends |
| `speech_tts.py` | Text-to-speech (Azure TTS, multi_speak) |
| `mcp_speech.py` | MCP protocol adapter (tool schemas, request routing, stdio) |
| `speech.py` | Standalone CLI: mic to clipboard |
| `tts.py` | Standalone CLI: text to speech |
| `voice_chat.py` | Standalone CLI: interactive voice chat loop |
If you are building on top of these modules, import the four core library files (state, audio, stt, speech_tts) and leave mcp_speech.py and the standalone CLIs out of your import graph.
This project handles audio data and API credentials. Please review:
- Never hardcode your Azure key in source code or commit it to git. The `.gitignore` includes `.env` to help prevent this.
- Store your key via environment variable (`AZURE_SPEECH_KEY`) or in the user-level config file (~/.config/speech-to-cli/config.json).
- Azure keys can be rotated at any time in the Azure Portal → your Speech resource → Keys and Endpoint.
- Consider using a restricted key with only Speech Services access (not a broad subscription key).
- Audio is recorded from your local microphone and sent to Azure Speech Services for processing. No audio is stored locally after transcription (temp files are deleted immediately).
- Azure processes your audio to produce transcriptions. Review the Azure AI Services data privacy policy to understand how Microsoft handles your audio data.
- Text sent to TTS is transmitted to Azure for synthesis. The same privacy policies apply.
- No data is sent anywhere other than Azure Speech Services — there are no analytics, telemetry, or third-party services.
- All Azure API calls use HTTPS (TLS encrypted in transit).
- Audio data and API keys are sent over encrypted connections only.
- The MCP server exposes only its speech tools (`listen`, `speak`, `talk`, `converse`, `multi_speak`, `multi_speak_stream`, `configure`, `get_voices`, `pause`, `resume`). It cannot read files, execute commands, or access anything beyond the microphone and Azure API.
- The server communicates with Copilot CLI over local stdio only; no network listeners are opened.
- Rotate your Azure key periodically.
- Use Azure's free tier to limit potential cost exposure from a leaked key.
- If running on a shared machine, be aware that other users with access to your environment variables or config file can read your API key.
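On a shared machine, one quick check is whether the config file is readable by anyone other than its owner. The sketch below uses only the standard library; the helper names are hypothetical, and restricting the file to mode 600 is a general Unix practice rather than something this project enforces.

```python
import os
import stat


def mode_is_private(mode):
    """Return True if group and other users have no permission bits set."""
    return stat.S_IMODE(mode) & 0o077 == 0


def config_is_private(path):
    """Check the permission bits of an existing file such as config.json."""
    return mode_is_private(os.stat(path).st_mode)


print(mode_is_private(0o600))  # True
print(mode_is_private(0o644))  # False: group and others can read
```

If the check fails, `os.chmod(path, 0o600)` limits access to the owner.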
Use of Azure Speech Services is subject to the Microsoft Azure terms of service and the Azure AI Services terms. You are responsible for your own Azure usage and billing.
Use of GitHub Copilot CLI requires an active Copilot subscription and is subject to the GitHub Copilot terms. Copilot is free for verified students, teachers, and maintainers of popular open-source projects.
This project is independently developed and is not affiliated with, endorsed by, or sponsored by Microsoft or GitHub. It is a third-party integration that connects to their respective APIs.
Licensed under the GNU General Public License v3.0 (GPLv3) — see LICENSE.