voice-input

Local voice input with screen-aware context. Push-to-talk on Mac, transcribed by Whisper, refined by a local LLM — all running on your own GPU.

No cloud services. Your voice and screen data never leave your network.

Features

  • Push-to-talk — Hold left Option key on Mac to record, release to transcribe
  • Real-time streaming — Audio streamed every 2s with partial transcription results during recording
  • Screen-aware context — Focused window screenshot captured at recording start, analyzed by a vision model (qwen3-vl) to extract active tab text content, and used to inform text refinement
  • LLM text refinement — Removes filler words, adds punctuation, formats lists as bullet points, and fixes recognition errors
  • Multi-language support — Language-specific prompts for Japanese, English, Chinese, and Korean (auto-detected or configurable)
  • Floating HUD — macOS cursor-following status overlay showing recording/transcribing/refining state. Background turns green when vision analysis completes during recording, so you know screen context will be used
  • Auto-paste + Enter — Result pasted via Cmd+V and Enter sent automatically. Hold Ctrl during recording to paste without Enter
  • Voice slash commands — Say "スラッシュ ヘルプ" or "slash help" to input /help directly. Commands auto-loaded from ~/.claude/skills/ at startup. Pasted without Enter so you can review before submitting

How it works

Mac (Push-to-Talk)              Server (GPU)
──────────────────              ────────────
[Hold Option key]
  ├─ Capture screenshot ──────→ Vision analysis (qwen3-vl) ──┐
  ├─ Record audio                                            │
  ├─ Stream chunks (2s) ──────→ Whisper partial results      │
  │                              ↓ (shown in HUD)            │
[Release key]                                                │
  └─ Send final audio ────────→ Whisper final transcribe ────┤
                                                             │
                                 LLM refine (context+text) ←─┘
                                          │
  Auto-paste via Cmd+V  ←─────── Result ──┘

The screenshot analysis runs in parallel with recording. When it finishes, the HUD turns green to signal that screen context is available. Partial transcription results are displayed during recording, giving immediate feedback.
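
A toy asyncio sketch of this concurrency pattern (illustrative only; the coroutine names and timings are stand-ins, not the actual ws_server.py code):

# Vision analysis starts immediately and runs alongside the 2 s chunk loop;
# refinement uses its result only if it finished before the key was released.
import asyncio

async def analyze_screenshot() -> str:
    await asyncio.sleep(3.0)                          # vision model latency
    return "active tab: faster-whisper docs"

async def transcribe_chunk(n: int) -> str:
    await asyncio.sleep(0.3)                          # quick partial Whisper pass
    return f"partial transcript after chunk {n}"

async def handle_recording(num_chunks: int = 5):
    vision = asyncio.create_task(analyze_screenshot())   # kicked off at stream_start
    text, announced = "", False
    for n in range(1, num_chunks + 1):
        await asyncio.sleep(2.0)                      # client streams a chunk every 2 s
        text = await transcribe_chunk(n)
        if vision.done() and not announced:
            announced = True
            print("HUD turns green: screen context ready")
    context = vision.result() if vision.done() else None
    return text, context

print(asyncio.run(handle_recording()))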

Quick start

Server (Linux with NVIDIA GPU)

# Clone
git clone https://github.com/xuiltul/voice-input
cd voice-input

# Setup Python environment
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt

# Install Ollama (https://ollama.com)
ollama pull gpt-oss:20b    # Text refinement (or any model you prefer)
ollama pull qwen3-vl:8b-instruct  # Screenshot analysis (if running vision locally)

# Start WebSocket server (LD_LIBRARY_PATH required for pip-installed CUDA libs)
LD_LIBRARY_PATH=".venv/lib/$(python3 -c 'import sys;print(f"python{sys.version_info.major}.{sys.version_info.minor}")')/site-packages/nvidia/cublas/lib:.venv/lib/$(python3 -c 'import sys;print(f"python{sys.version_info.major}.{sys.version_info.minor}")')/site-packages/nvidia/cudnn/lib" \
  .venv/bin/python ws_server.py

CUDA note: When NVIDIA libraries (cublas, cudnn) are installed via pip into the venv, they are not on the default library search path. You must set LD_LIBRARY_PATH to include the venv's nvidia/*/lib directories, or Whisper will fail with Library libcublas.so.12 is not found. The path includes the Python version directory (e.g., python3.13), so use the shell snippet above to auto-detect it.

VRAM note: Whisper (~3 GB) stays loaded. gpt-oss (~12 GB) loads on demand. For best performance, run the vision model on a separate GPU server via VISION_SERVERS to avoid model swapping.

Docker

# Build
docker build -t voice-input .

# Run with GPU (vision on local Ollama)
docker run --gpus all -p 8991:8991 \
  -e LLM_MODEL=gpt-oss:20b \
  -e VISION_MODEL=qwen3-vl:8b-instruct \
  voice-input

# Run with remote vision server (avoids VRAM contention)
docker run --gpus all -p 8991:8991 \
  -e LLM_MODEL=gpt-oss:20b \
  -e VISION_MODEL=qwen3-vl:8b-instruct \
  -e VISION_SERVERS=http://vision-gpu:11434 \
  voice-input

# Persist Ollama models across restarts
docker run --gpus all -p 8991:8991 \
  -v ollama-data:/root/.ollama \
  voice-input

Mac client

pip3 install sounddevice numpy websockets pynput
scp your-server:~/voice-input/mac_client.py ~/
python3 mac_client.py --server ws://YOUR_SERVER_IP:8991

Grant these permissions in System Settings > Privacy & Security:

  • Microphone → Terminal
  • Accessibility → Terminal (for key monitoring + auto-paste)
  • Screen Recording → Terminal (for screenshot context)

Auto-start (Automator app)

macOS requires a .app bundle to grant privacy permissions (Accessibility, Input Monitoring, etc.). A raw python3 process launched by launchd cannot receive these permissions. The recommended approach is to wrap the client in an Automator application:

  1. Copy the client script:

     mkdir -p ~/voice-input
     cp mac_client.py ~/voice-input/

  2. Create an Automator app:

     • Open Automator.app → choose Application
     • Add a Run Shell Script action
     • Set Shell to /bin/bash and paste:

       cd ~/voice-input && /usr/bin/python3 mac_client.py --server ws://YOUR_SERVER_IP:8991

     • Save as ~/Applications/VoiceInput.app

  3. Grant permissions in System Settings > Privacy & Security:

     • Accessibility → VoiceInput.app
     • Input Monitoring → VoiceInput.app
     • Microphone → VoiceInput.app
     • Screen Recording → VoiceInput.app

  4. Add to Login Items for auto-start:

     • System Settings > General > Login Items → add VoiceInput.app

Double-click VoiceInput.app to launch. The HUD appears at the bottom of the screen when the client is running.

Usage

Push-to-talk

  1. Hold left Option/Alt — Recording starts, screenshot captured, streaming begins
  2. During recording — Partial transcription shown in floating HUD at screen bottom
  3. HUD turns green — Vision analysis of your screen is complete; screen context will be used for refinement
  4. Release — Final audio transcribed with VAD → LLM refines with screen context → result pasted + Enter sent
  5. Hold left Option/Alt + Ctrl — Same as above, but paste only (no Enter) — useful for text editors
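
The trigger logic behind steps 1 and 5 can be sketched with pynput (a simplified sketch; the real mac_client.py also records audio, streams chunks, and drives the HUD):

# Push-to-talk trigger sketch: left Option starts/stops, Ctrl held at release
# switches to paste-without-Enter. Recording and streaming are omitted here.
from pynput import keyboard

pressed = set()

def on_press(key):
    if key == keyboard.Key.alt_l and key not in pressed:
        print("start recording + capture focused-window screenshot")
    pressed.add(key)

def on_release(key):
    if key == keyboard.Key.alt_l:
        no_enter = keyboard.Key.ctrl in pressed or keyboard.Key.ctrl_l in pressed
        print("stop recording; paste", "without Enter" if no_enter else "and send Enter")
    pressed.discard(key)

with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    listener.join()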

HUD indicator colors

Color           Meaning
Dark (default)  Recording in progress, vision analysis not yet complete
Green           Vision analysis complete — screen context will be used for text refinement

The HUD automatically resets to dark at the start of each new recording.

Practical tips

  • Short recordings (under ~10s): Vision analysis may not finish in time. The HUD staying dark means your text will be refined without screen context. This is fine for simple dictation
  • Longer recordings: The HUD will turn green during recording, meaning the LLM will use your screen content to improve accuracy (e.g., recognizing technical terms visible on screen)
  • For best accuracy with technical terms: Wait until the HUD turns green before releasing the key. This gives the vision model time to read your screen and provide context to the LLM
  • Claude Code / chat apps: Use the default Alt mode — text is pasted and Enter is sent automatically, submitting your message instantly
  • Text editors / documents: Hold Alt + Ctrl — text is pasted without pressing Enter, so you can review before submitting
  • Dictation in any language: The system auto-detects the language from your speech. You can also set a language hint with --language en for better accuracy

Voice slash commands

Say "スラッシュ" (or "slash") followed by a command name to input a slash command instead of dictated text. The LLM matches your spoken words to the closest available command.

Examples:

  • 「スラッシュ ヘルプ」("slash help") → /help
  • 「スラッシュ コミット」("slash commit") → /commit
  • 「スラッシュ イシュートゥーピーアール 123」("slash issue-to-pr 123") → /issue-to-pr 123
  • 「スラッシュ ピーディーエフ」("slash PDF") → /pdf
  • 「slash compact」→ /compact

Commands are auto-loaded from ~/.claude/skills/*/SKILL.md at client startup, plus built-in Claude Code commands (/help, /clear, /compact, /cost, /doctor, /init, /fast). Slash commands are always pasted without Enter so you can review before submitting.

If no command matches, the system falls back to normal text refinement.
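
Command discovery could look roughly like the sketch below (an assumption: only the skill directory name is used as the command name; the real client may also read the SKILL.md contents):

# Sketch: collect built-in commands plus one command per skill directory
# found under ~/.claude/skills/. Parsing of SKILL.md itself is omitted.
from pathlib import Path

BUILT_IN = ["/help", "/clear", "/compact", "/cost", "/doctor", "/init", "/fast"]

def load_slash_commands() -> list[str]:
    commands = set(BUILT_IN)
    for skill_md in Path.home().glob(".claude/skills/*/SKILL.md"):
        commands.add("/" + skill_md.parent.name)
    return sorted(commands)

print(load_slash_commands())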

Client options

python3 mac_client.py [options]

  -s, --server URL      WebSocket server (default: ws://YOUR_SERVER_IP:8991)
  -l, --language CODE   Language hint for Whisper (default: ja)
  -m, --model NAME      Ollama model for refinement (default: gpt-oss:20b)
  --raw                 Skip LLM refinement, Whisper output only
  -p, --prompt TEXT     Custom refinement instructions
  --no-paste            Copy to clipboard only, don't auto-paste
  --no-screenshot       Disable screenshot context analysis
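
For reference, the same flags declared with argparse (a sketch mirroring the list above, not the actual mac_client.py source):

# Sketch of the client CLI surface; defaults mirror the option list above.
import argparse

parser = argparse.ArgumentParser(prog="mac_client.py")
parser.add_argument("-s", "--server", default="ws://YOUR_SERVER_IP:8991", help="WebSocket server URL")
parser.add_argument("-l", "--language", default="ja", help="language hint for Whisper")
parser.add_argument("-m", "--model", default="gpt-oss:20b", help="Ollama model for refinement")
parser.add_argument("--raw", action="store_true", help="skip LLM refinement")
parser.add_argument("-p", "--prompt", help="custom refinement instructions")
parser.add_argument("--no-paste", action="store_true", help="copy to clipboard only")
parser.add_argument("--no-screenshot", action="store_true", help="disable screenshot context")
args = parser.parse_args()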

Server CLI

# WebSocket server (for Mac client)
voice-input serve ws --port 8991

# HTTP API server
voice-input serve --port 8990

# Transcribe a file directly
voice-input recording.mp3
voice-input recording.mp3 --raw
voice-input recording.mp3 --output json
voice-input recording.mp3 --language en --model qwen3:30b

HTTP API

# Transcribe + refine
curl -X POST http://localhost:8990/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @recording.wav

# Whisper only (skip LLM)
curl -X POST "http://localhost:8990/transcribe?raw=true" \
  -H "Content-Type: audio/wav" \
  --data-binary @recording.wav

# With language hint and custom prompt
curl -X POST "http://localhost:8990/transcribe?language=en&prompt=Format%20as%20meeting%20notes" \
  -H "Content-Type: audio/wav" \
  --data-binary @recording.wav

Response:

{
  "text": "Refined text here",
  "raw_text": "Raw Whisper output",
  "language": "ja",
  "duration": 5.2,
  "processing_time": {
    "transcribe": 0.3,
    "refine": 4.1,
    "total": 4.4
  }
}
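
The same request from Python with requests (endpoint and parameters as documented above; the server must be reachable on localhost:8990):

# POST a WAV file to the transcribe endpoint and print the refined text.
import requests

with open("recording.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8990/transcribe",
        params={"language": "en"},              # optional, as in the curl examples
        headers={"Content-Type": "audio/wav"},
        data=f.read(),
        timeout=120,
    )
result = resp.json()
print(result["text"], result["processing_time"]["total"])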

Multi-language prompts

Refinement prompts are stored in the prompts/ directory as JSON files, one per language:

prompts/
├── ja.json    # Japanese (default)
├── en.json    # English
├── zh.json    # Chinese
└── ko.json    # Korean

The language is determined by:

  1. Client flag — the --language option passed to the client
  2. Whisper auto-detection — If no language is specified, Whisper detects it and the matching prompt is used

To add a new language, create prompts/{lang_code}.json with this structure:

{
  "system_prompt": "Your system prompt here...",
  "user_template": "Please format the following text.\n```\n{text}\n```",
  "few_shot": [
    {
      "user": "Please format the following text.\n```\nraw input example\n```",
      "assistant": "Formatted output example"
    }
  ],
  "context_prefix": "Context information (from screenshot analysis):",
  "custom_prompt_prefix": "Additional instructions:"
}

If a language has no matching prompt file, it falls back to English, then Japanese.
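
A sketch of that lookup order (illustrative; the actual loader lives in the server code):

# Resolve a prompt file: requested language, then English, then Japanese.
import json
from pathlib import Path

PROMPTS_DIR = Path("prompts")

def load_prompt(lang: str) -> dict:
    for code in (lang, "en", "ja"):
        path = PROMPTS_DIR / f"{code}.json"
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
    raise FileNotFoundError("no prompt files found in prompts/")

print(load_prompt("ko")["system_prompt"][:80])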

Architecture

Component       Role                                                       Tech
voice_input.py  Core pipeline: Whisper + Ollama LLM refinement + Vision    faster-whisper (CUDA), Ollama API
ws_server.py    WebSocket server, orchestrates streaming pipeline          Python, websockets
mac_client.py   Push-to-talk, screenshot, HUD overlay, clipboard paste     Python, pynput, sounddevice, PyObjC
prompts/        Language-specific refinement prompts                       JSON
transcribe.py   Standalone Whisper CLI tool                                faster-whisper

Models

Model                 Purpose                                                  VRAM    Lifecycle
large-v3-turbo        Whisper speech recognition                               ~3 GB   Loaded once at startup, stays in memory
gpt-oss:20b           Text refinement (configurable)                           ~12 GB  Managed by Ollama (load on demand), think: "low" for speed
qwen3-vl:8b-instruct  Active tab text extraction (focused window screenshot)   ~5 GB   Runs on separate GPU server (no local VRAM usage)

WebSocket protocol

Streaming mode (recommended):

Client → Server: {"type": "stream_start", "screenshot": "<base64>"}
Client → Server: <binary WAV chunks> (every 2 seconds, cumulative audio)
Server → Client: {"type": "partial", "text": "..."}  (after each chunk)
Server → Client: {"type": "status", "stage": "vision_ready"}  (when screenshot analysis completes)
Client → Server: <final binary WAV> → {"type": "stream_end"}
Server → Client: {"type": "status", "stage": "refining"}
Server → Client: {"type": "result", "text": "...", "raw_text": "...", ...}

Legacy mode (single-shot):

Client → Server: {"type": "audio_with_screenshot", "screenshot": "<base64>"}
Client → Server: <binary WAV> (complete recording)
Server → Client: {"type": "result", ...}
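
A minimal streaming-mode client using the websockets package (a sketch following the sequence above; screenshot_b64 and wav_chunks are placeholders a real client fills from the screen and microphone, and a real client interleaves sending and receiving instead of sending everything up front):

# Streaming-mode sketch: send stream_start, cumulative chunks, the final WAV,
# then stream_end, and print server messages until the result arrives.
import asyncio, json
import websockets

async def stream(server: str, screenshot_b64: str, wav_chunks: list[bytes]) -> str:
    async with websockets.connect(server) as ws:
        await ws.send(json.dumps({"type": "stream_start", "screenshot": screenshot_b64}))
        for chunk in wav_chunks[:-1]:                 # cumulative audio, every ~2 s
            await ws.send(chunk)
        await ws.send(wav_chunks[-1])                 # final, complete recording
        await ws.send(json.dumps({"type": "stream_end"}))
        async for raw in ws:                          # partial / status / result messages
            msg = json.loads(raw)
            print(msg)
            if msg.get("type") == "result":
                return msg["text"]

# asyncio.run(stream("ws://YOUR_SERVER_IP:8991", screenshot_b64, wav_chunks))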

Environment variables

Variable            Default                   Description
OLLAMA_URL          http://localhost:11434    Ollama server URL for text refinement
LLM_MODEL           gpt-oss:20b               Model for text refinement
VISION_MODEL        qwen3-vl:8b-instruct      Model for active tab text extraction
VISION_SERVERS      (unset = local Ollama)    Comma-separated Ollama URLs for remote vision inference
WHISPER_MODEL       large-v3-turbo            Whisper model name
DEFAULT_LANGUAGE    ja                        Default language for transcription
VOICE_INPUT_SERVER  ws://localhost:8991       Mac client: default WebSocket server URL

Example: separate vision server (recommended for single-GPU setups):

export VISION_MODEL=qwen3-vl:8b-instruct
export VISION_SERVERS=http://gpu-server-1:11434,http://gpu-server-2:11434
python ws_server.py
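
VISION_SERVERS is a plain comma-separated list of URLs; the sketch below shows how it could be parsed and rotated (an assumption about the implementation, shown only to clarify the format):

# Parse VISION_SERVERS and cycle through the URLs; fall back to OLLAMA_URL.
import itertools
import os

servers = [u.strip() for u in os.environ.get("VISION_SERVERS", "").split(",") if u.strip()]
if not servers:
    servers = [os.environ.get("OLLAMA_URL", "http://localhost:11434")]

pick = itertools.cycle(servers)
print(next(pick), next(pick))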

VRAM management

  • Whisper (~3 GB) is loaded once at server startup as a singleton
  • LLM (~12 GB for gpt-oss:20b) stays loaded in Ollama with think: "low" for fast inference (~0.5s)
  • Vision runs on local Ollama by default. Set VISION_SERVERS to offload to separate GPU(s) and avoid model swapping. Uses an active-tab-focused prompt to maximize text extraction from the focused window
  • Screenshots are captured at full resolution (no resize) for best OCR accuracy. Analysis runs in parallel with recording; if not ready when refinement starts, proceeds without context
  • VAD (Voice Activity Detection) is disabled for streaming chunks but enabled for final transcription
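
The VAD split in the last point maps onto faster-whisper's vad_filter flag; a minimal sketch (model name as in the Models table, but not the server's actual code):

# Quick partial passes skip VAD; the final pass enables it.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

def transcribe(path: str, final: bool) -> str:
    segments, _info = model.transcribe(path, language="ja", vad_filter=final)
    return "".join(seg.text for seg in segments)

print(transcribe("chunk.wav", final=False))      # streaming partial
print(transcribe("recording.wav", final=True))   # final transcription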

Troubleshooting

Library libcublas.so.12 is not found or cannot be loaded

This happens when NVIDIA CUDA libraries installed via pip (nvidia-cublas-cu12, nvidia-cudnn-cu12) are not on the library search path. Set LD_LIBRARY_PATH to include the venv's nvidia lib directories:

# Auto-detect Python version in venv
PYVER=$(.venv/bin/python -c 'import sys;print(f"python{sys.version_info.major}.{sys.version_info.minor}")')
export LD_LIBRARY_PATH=".venv/lib/$PYVER/site-packages/nvidia/cublas/lib:.venv/lib/$PYVER/site-packages/nvidia/cudnn/lib"
.venv/bin/python ws_server.py

This is required because pip installs CUDA libraries under .venv/lib/pythonX.Y/site-packages/nvidia/*/lib/ which is not a standard library search path. The exact directory name changes with the Python version (e.g., python3.13, python3.12).

Why?

Cloud voice input services send your audio and screen content to external servers. This tool does everything locally on your own hardware — your voice, screen data, and all processing stay on your network.

License

MIT
