speech-to-cli

Live docs: jphein.github.io/speech-to-cli

Voice interface for AI coding assistants — talk to your CLI agent and hear it respond, powered by Azure Speech Services. Works with GitHub Copilot CLI, Claude Code, and Gemini CLI.

What it does

This project adds voice input and output to your terminal AI workflow via the Model Context Protocol (MCP):

Tool	Description
MCP Server (`mcp_speech.py`)	Integrates with Copilot CLI, Claude Code, and Gemini CLI via MCP — gives your AI `listen`, `speak`, and `converse` tools
Voice Chat (`voice_chat.py`)	Standalone voice chat companion — runs in a second terminal alongside Copilot CLI
Speech-to-Text (`speech.py`)	Simple mic → text → clipboard tool
Text-to-Speech (`tts.py`)	Simple text → speech tool (reads from args, stdin, or clipboard)

Features

Azure HD Voices: Uses high-quality DragonHD voices for natural-sounding speech.
Thinking Hum: A subtle 150Hz tone that loops while the AI is processing.
Visual Status: Colorful progress bars with live VU meters and real-time subtitles (🎤/🧠/🔊).
Audio Feedback: Configurable chimes for ready, processing, speak, and done states.
Low Latency: Streaming playback and persistent connections for fast responses.
VAD: Energy-gated voice activity detection that auto-calibrates to your environment.

About GitHub Copilot CLI

GitHub Copilot CLI brings agentic AI coding assistance directly to your terminal. It can edit files, run commands, search code, and interact with GitHub — all through natural language.

Available models

Copilot CLI gives you access to multiple frontier models via the /model command:

Model	Tier
Claude Sonnet 4.5	Standard (default)
Claude Sonnet 4	Standard
Claude Opus 4.5	Premium
Claude Opus 4.6	Premium
Claude Haiku 4.5	Fast
GPT-5.1 / 5.2 / 5.4	Standard
GPT-5.1-Codex / 5.2-Codex / 5.3-Codex	Standard
GPT-5.1-Codex-Max	Standard
GPT-5 mini	Fast
Gemini 3 Pro (Preview)	Standard

Free for students

GitHub Education members get Copilot Pro free for 1 year, which includes Copilot CLI access. Sign up at education.github.com with your school email — no credit card required. Each prompt uses one premium request from your monthly quota.

Install Copilot CLI

curl -fsSL https://gh.io/copilot-install | bash
copilot          # launch and authenticate

Quick start

Prerequisites

Linux with ALSA audio (arecord/aplay)
Python 3.8+
An Azure Speech Services API key (free tier: 5 hours STT + 500K characters TTS per month)

Azure Speech HD voices

This project defaults to Azure HD (DragonHD) voices — specifically en-US-Ava:DragonHDLatestNeural. These are Azure's highest-quality neural voices with natural intonation, breathing, and expressiveness that sounds remarkably human. You can change the voice via the AZURE_SPEECH_VOICE env var or config file.

Browse all available voices in the Azure Voice Gallery. Look for voices tagged HD or DragonHD for the best quality.

Azure for nonprofits and education

Nonprofits: Microsoft offers up to $3,500/year in free Azure credits through Azure for Nonprofits. This more than covers Speech Services usage for voice-enabled Copilot workflows.
Students: The Azure for Students program provides $100 in free credits with no credit card required — just verify with your school email.

Install

git clone https://github.com/techempower-org/speech-to-cli.git
cd speech-to-cli
./install.sh

Configure

Set your Azure credentials via environment variables:

export AZURE_SPEECH_KEY="your-key-here"
export AZURE_SPEECH_REGION="westus2"           # optional, default: westus2
export AZURE_SPEECH_VOICE="en-US-Ava:DragonHDLatestNeural"  # optional

Or create a JSON config file at ~/.config/speech-to-cli/config.json. You can create the directory if it doesn't exist:

mkdir -p ~/.config/speech-to-cli

Example config.json:

{
  "key": "your-azure-speech-key",
  "region": "westus2",
  "voice": "en-US-Ava:DragonHDLatestNeural",
  "fast_voice": "en-US-AvaNeural",
  "chime_hum": true,
  "visual_indicator": true
}

Configuration Settings

Key	Default	Description
`key`	None	Your Azure Speech Services API key.
`region`	`westus2`	The Azure region for your speech resource.
`voice`	`en-US-Ava:DragonHDLatestNeural`	Primary voice for high-quality (HD) synthesis.
`fast_voice`	`en-US-AvaNeural`	Low-latency voice for fast responses.
`chime_ready`	`true`	Play an ascending tone when the microphone opens.
`chime_processing`	`false`	Play a short "blip" when speech is recognized.
`chime_hum`	`false`	Start a looping 150Hz tone while thinking.
`chime_speak`	`false`	Play a descending tone before starting to speak.
`chime_done`	`false`	Play a double-tap tone when the AI is done talking.
`visual_indicator`	`true`	Show status icons (🎤/🧠) in the terminal.
`live_subtitles`	`true`	Show real-time partial transcription in progress bar.
`vu_meter`	`true`	Show live volume meter in progress bar.
`silence_timeout`	`3.0`	Seconds of silence after speech before auto-stop.
`loop_silence_timeout`	`1.2`	Silence timeout in continuous loop mode (shorter = faster turnaround).
`no_speech_timeout`	`7.0`	Max seconds to wait for any speech before giving up.
`talk_silence_timeout`	`4.0`	Silence timeout for talk/converse mode.
`energy_multiplier`	`2.5`	Noise gate threshold multiplier. Lower = more sensitive.
`half_duplex`	`"auto"`	`"auto"` (detect speakers/headphones), `"true"`, or `"false"`.
`continuous_dictation`	`false`	Auto-restart listening after each utterance (used by gnome-speaks).
`dictation_mode`	`true`	Type transcribed text at cursor (vs clipboard only).
`terminal_mode`	`false`	All lowercase, no auto-capitalization.
`end_word`	`"over"`	Say this word to immediately stop recording.
`max_record_seconds`	`120`	Absolute maximum recording duration.
`debug`	`false`	Write detailed logs to `/tmp/speech-debug.log`.

⚠️ Never commit your API key. Use environment variables or the config file (which is in your home directory, outside the repo). See Security below.

Usage

MCP Server (recommended)

The MCP server works with any AI CLI that supports the Model Context Protocol. When paired with a modern terminal AI like Gemini CLI, Claude Code, or Copilot CLI, it creates an incredibly seamless voice loop. The AI automatically invokes the listen tool when it needs input, processes your request, and calls the speak tool to respond—all while providing rich, real-time terminal UI feedback (like live subtitles and VU meters) without you ever needing to type a command.

GitHub Copilot CLI

Add to your ~/.copilot/mcp.json:

{
  "mcpServers": {
    "azure-speech": {
      "command": "python3",
      "args": ["/path/to/speech-to-cli/mcp_speech.py"]
    }
  }
}

Claude Code

Option A — Use the included .mcp.json (auto-detected when working in the project directory).

Option B — Add globally via the CLI:

claude mcp add --transport stdio azure-speech -- python3 /path/to/speech-to-cli/mcp_speech.py

Gemini CLI

Install as a Gemini extension (uses the included gemini-extension.json):

gemini extensions install /path/to/speech-to-cli

Or add manually to your ~/.gemini/settings.json:

{
  "mcpServers": {
    "azure-speech": {
      "command": "python3",
      "args": ["/path/to/speech-to-cli/mcp_speech.py"]
    }
  }
}

Restart your CLI — it will now have listen, speak, multi_speak, and converse tools available. Just say "listen" and your AI will record your voice, transcribe it, and respond. Ask it to "speak" and it'll read its response aloud. Use "converse" for a continuous voice chat loop. Use multi_speak for multi-agent conversations where each agent speaks with a different voice.

MCP Tools:

Tool	Parameters	Description
`listen`	`seconds` (1-30), `mode` (streaming/vad/whisper/fixed)	Records from mic, returns transcribed text
`speak`	`text` (required), `quality` (fast/hd)	Speaks text aloud via Azure TTS
`talk`	`text` (required), `quality` (fast/hd)	Speaks text then listens for a reply — full-duplex TTS+STT in one call
`converse`	`seconds`, `mode`	Like `listen`, but signals conversational intent — Copilot will speak its reply then listen again
`multi_speak`	`segments` (array of {text, voice}), `quality` (fast/hd)	Speak multiple text+voice segments in one call — TTS requests fire in parallel, audio plays back-to-back
`multi_speak_stream`	`segments`, `quality`	Like `multi_speak` with streaming progress events
`configure`	`key`, `value`	View or change runtime settings
`get_voices`	(none)	List available Azure TTS voices
`pause`	(none)	Pause current TTS playback
`resume`	(none)	Resume paused TTS playback

STT Modes (auto-selected by default):

Mode	Description
`streaming`	Real-time Azure WebSocket + energy-gated VAD (fastest, default)
`vad`	Record with VAD, upload on silence
`whisper`	Local transcription via faster-whisper (offline, no network)
`fixed`	Record for full duration, then upload (fallback)

TTS Quality:

Quality	Voice	Azure latency	Best for
`fast` (default)	AvaNeural	~120ms	Conversation, quick responses
`hd`	DragonHD	~1200ms	High-quality narration

Half-Duplex vs Full-Duplex

The half_duplex setting controls whether TTS and STT can overlap:

"auto" (default): Auto-detects speakers vs headphones. Headphones get full duplex; speakers get half duplex.
"true" (half duplex): TTS must finish before the mic opens. Prevents the mic from hearing speaker output. A 0.5s drain buffer is added after TTS ends.
"false" (full duplex): TTS and STT can run simultaneously. The recorder is prewarmed during TTS so listening starts immediately. Only works well with headphones.

Talk Tool

The talk MCP tool combines TTS and STT in a single call — it speaks the provided text, then immediately listens for the user's reply:

talk(text="What would you like me to do?")
→ speaks the text, records user's reply, returns transcribed text

This is the same mechanism that gnome-speaks exposes over D-Bus as org.gnome.Speaks.Talk(text). When called via the MCP server, it uses speech_tts.talk_fullduplex() which handles the TTS→STT handoff, respecting the duplex setting. The talk_silence_timeout (default 4.0s) controls how long it waits for a reply.

Voice Chat (standalone companion)

Run in a separate terminal alongside Copilot CLI:

python3 voice_chat.py

Flow:

Press Enter → records your voice → transcribes → copies to clipboard
Paste into Copilot CLI with Ctrl+Shift+V
Copy Copilot's response with Ctrl+Shift+C
Press Enter → speaks the response aloud → starts recording again

Standalone tools

# Speech-to-text: record and transcribe to clipboard
python3 speech.py

# Text-to-speech: speak text aloud
python3 tts.py "Hello world"
echo "Hello" | python3 tts.py
python3 tts.py  # speaks clipboard contents

How it works

Recording: Uses arecord (ALSA) to capture 16kHz mono audio from the default input device
Voice Activity Detection: Energy-gated VAD auto-calibrates to ambient noise, stops recording on silence (~400ms after speech ends)
Speech-to-Text: Streams audio to Azure via persistent WebSocket for real-time recognition, with local Whisper fallback for offline use
Text-to-Speech: Sends SSML to Azure TTS REST API with HTTP connection pooling; streams MP3 audio through mpv for immediate playback
Ready chime: A short ascending tone plays before each recording so you know when to speak
Performance: Connections are pre-warmed on startup; noise floor is cached between calls; response latency is ~275ms from end of speech to first audio byte
MCP Protocol: Implements MCP (v2024-11-05) over stdio JSON-RPC for direct Copilot CLI integration

No Azure SDK required — just plain REST/WebSocket API calls.

Used as a library

Other projects import the core modules from this repository directly rather than going through the MCP server:

gnome-speaks — a GNOME Shell extension that adds voice control to the desktop. It imports state.py, audio.py, stt.py, and speech_tts.py via sys.path.
the-oracle — proxies this MCP server through a FastMCP gateway so multiple clients can share a single speech backend.

The env var SPEECH_ENGINE_PATH can be set to the path of this directory so that downstream projects can locate and import the modules at runtime.

File roles:

File	Role
`state.py`	Shared state, constants, config, helpers
`audio.py`	Audio I/O, device detection, chimes, UI, VAD
`stt.py`	Speech-to-text backends
`speech_tts.py`	Text-to-speech (Azure TTS, multi_speak)
`mcp_speech.py`	MCP protocol adapter (tool schemas, request routing, stdio)
`speech.py`	Standalone CLI — mic to clipboard
`tts.py`	Standalone CLI — text to speech
`voice_chat.py`	Standalone CLI — interactive voice chat loop

If you are building on top of these modules, import the four core library files (state, audio, stt, speech_tts) and leave mcp_speech.py and the standalone CLIs out of your import graph.

Security

This project handles audio data and API credentials. Please review:

API key management

Never hardcode your Azure key in source code or commit it to git. The .gitignore includes .env to help prevent this.
Store your key via environment variable (AZURE_SPEECH_KEY) or in the user-level config file (~/.config/speech-to-cli/config.json).
Azure keys can be rotated at any time in the Azure Portal → your Speech resource → Keys and Endpoint.
Consider using a restricted key with only Speech Services access (not a broad subscription key).

Audio and data privacy

Audio is recorded from your local microphone and sent to Azure Speech Services for processing. No audio is stored locally after transcription (temp files are deleted immediately).
Azure processes your audio to produce transcriptions. Review the Azure AI Services data privacy policy to understand how Microsoft handles your audio data.
Text sent to TTS is transmitted to Azure for synthesis. The same privacy policies apply.
No data is sent anywhere other than Azure Speech Services — there are no analytics, telemetry, or third-party services.

Network security

All Azure API calls use HTTPS (TLS encrypted in transit).
Audio data and API keys are sent over encrypted connections only.

MCP server scope

The MCP server exposes tools (listen, speak, multi_speak, converse, configure, get_voices, pause, resume). It cannot read files, execute commands, or access anything beyond the microphone and Azure API.
The server communicates with Copilot CLI over local stdio only — no network listeners are opened.

Recommendations

Rotate your Azure key periodically.
Use Azure's free tier to limit potential cost exposure from a leaked key.
If running on a shared machine, be aware that other users with access to your environment variables or config file can read your API key.

Legal

Azure Speech Services

Use of Azure Speech Services is subject to the Microsoft Azure terms of service and the Azure AI Services terms. You are responsible for your own Azure usage and billing.

GitHub Copilot

Use of GitHub Copilot CLI requires an active Copilot subscription and is subject to the GitHub Copilot terms. Copilot is free for verified students, teachers, and maintainers of popular open-source projects.

This project

This project is independently developed and is not affiliated with, endorsed by, or sponsored by Microsoft or GitHub. It is a third-party integration that connects to their respective APIs.

Licensed under the GNU General Public License v3.0 (GPLv3) — see LICENSE.

License

GNU General Public License v3.0 (GPLv3) — see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 112 Commits
.github/workflows		.github/workflows
docs		docs
.claudeignore		.claudeignore
.gitignore		.gitignore
.mcp.json		.mcp.json
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
audio.py		audio.py
bump-version.sh		bump-version.sh
gemini-extension.json		gemini-extension.json
install.sh		install.sh
mcp_speech.py		mcp_speech.py
requirements.txt		requirements.txt
speech.py		speech.py
speech_tts.py		speech_tts.py
state.py		state.py
stt.py		stt.py
tts.py		tts.py
voice_chat.py		voice_chat.py

Folders and files

Latest commit

History

Repository files navigation

speech-to-cli

What it does

Features

About GitHub Copilot CLI

Available models

Free for students

Install Copilot CLI

Quick start

Prerequisites

Azure Speech HD voices

Azure for nonprofits and education

Install

Configure

Configuration Settings

Usage

MCP Server (recommended)

GitHub Copilot CLI

Claude Code

Gemini CLI

Half-Duplex vs Full-Duplex

Talk Tool

Voice Chat (standalone companion)

Standalone tools

How it works

Used as a library

Security

API key management

Audio and data privacy

Network security

MCP server scope

Recommendations

Legal

Azure Speech Services

GitHub Copilot

This project

License

About

Resources

License

Code of conduct

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages