Hold `CapsLock`, speak, release, and the text appears.
CapsWriter-Offline is a fully offline speech input tool built primarily for Windows.
- Initial support for Qwen3-ASR-1.7B
- Works for both microphone dictation and file transcription in this fork
- When the model does not return real timestamps, the app falls back to the existing approximate timeline path for subtitle and JSON generation
- Decoder-side Vulkan acceleration is enabled by default and typically needs about 1.6 GB of VRAM
- If your GPU drops memory clocks while idle, cold-start latency can rise to around 300 ms
- Locking memory clocks with `nvidia-smi -lmc 9000` can reduce short-clip latency to around 100 ms on hardware such as the RTX 5050
- Improved Fun-ASR-Nano-GGUF support with DirectML encoder acceleration and better FP16 defaults
- Server-side Fun-ASR-Nano now uses its own `hot-server.txt` hotword context file
- Spoken punctuation like “comma”, “period”, and “new line” can be converted automatically
- Added decoder temperature handling to avoid edge-case repetition loops
- Improved server-side alphabet spelling merge behavior
- Added Fun-ASR-Nano-GGUF support
- Refactored large-file transcription to async streaming
- Improved Chinese/English spacing cleanup
- Improved server cleanup after abnormal disconnects
- Speech input: hold `CapsLock` or mouse side button `X2`, speak, release, and the text is inserted immediately
- File transcription: drag audio/video files onto the client to generate `.srt`, `.txt`, and `.json`
- ITN formatting: convert spoken number patterns into clean written forms
- Server hotword context: store domain-specific terms in `hot-server.txt` to help Fun-ASR-Nano recognize context
- Hotword replacement: use `hot.txt` for phoneme-based fuzzy matching and forced replacement
- Rule replacement: use `hot-rule.txt` for regex-based or direct text replacement
- Rectify history: keep correction history in `hot-rectify.txt` to help LLM polishing
- LLM roles: route text to roles such as `assistant` or `translate` when the recognized text starts with that role name
- Tray menu: manage hotwords, copy results, or clear LLM memory from the tray icon
- Client/server split: run the model on one machine and the lightweight client on another if needed
- Diary archive: save recognized sentences by date
- Audio archive: save recorded audio locally for privacy and traceability
The project is designed around four ideas: offline, fast, accurate, and highly configurable. The goal is a smooth voice-input workflow that still works without cloud access, installation-heavy deployment, or network connectivity.
LLM roles can use local models through Ollama or remote APIs through providers such as OpenAI-compatible services.
The project is mainly targeted at Windows 10/11 (64-bit).
- Linux: not officially tested or packaged
- macOS: currently unsupported because low-level keyboard-hook support is limited and system permissions are restrictive
- Install the VC++ runtime.
- Download the app from Latest Release.
- Download the model package from Models Release and extract it into the matching subfolder under `models/`.
- Launch `start_server.exe`.
- Launch `start_client.exe`.
- Hold `CapsLock` or mouse side button `X2` and start speaking.
Select the speech model in `config.toml` through `server.model_type`:

- `qwen3_asr`: built-in punctuation, acceptable CPU speed, very fast on discrete GPUs, excellent accuracy
- `fun_asr_nano`: built-in punctuation, fast on CPU, very fast on discrete GPUs, top-tier accuracy
- `sensevoice`: built-in punctuation, extremely fast on CPU, strong multilingual support
- `paraformer`: external punctuation model, extremely fast on CPU, high accuracy
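Switching backends is a single key change. A minimal sketch (only the `server.model_type` key is documented above; surrounding layout may differ in your actual `config.toml`):

```toml
[server]
# one of: "qwen3_asr", "fun_asr_nano", "sensevoice", "paraformer"
model_type = "qwen3_asr"
```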
All runtime settings live in the root `config.toml`.
- Edit `[[client.shortcuts]]` to change keyboard or mouse triggers
- Set `hold_mode = false` for press-once / press-again recording
- Toggle `llm_enabled` to enable or disable LLM post-processing
- Change `server.model_type` to switch ASR backends
- Tune model-specific acceleration flags under `[models.*]`
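A sketch of how these options might fit together. The table and key names listed above (`[[client.shortcuts]]`, `hold_mode`, `llm_enabled`, `server.model_type`, `[models.*]`) come from the document; the specific values and the `key` field are illustrative assumptions, so check your own `config.toml` for the exact layout:

```toml
[[client.shortcuts]]
key = "caps lock"        # illustrative trigger name
hold_mode = true         # false = press once to start, press again to stop

[client]
llm_enabled = false      # disable LLM post-processing

[server]
model_type = "fun_asr_nano"

[models.fun_asr_nano]
dml_enable = true        # model-specific acceleration flag
```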
Q: Why does nothing happen when I press the key?
A: Make sure the `start_client.exe` console process is still running. If you want to type into an elevated application, run the client with administrator privileges too.
Q: Why is there no recognition output?
A: Check the recorded audio inside the dated year/month/assets folder. Make sure the microphone is actually recording and that Windows microphone permissions are enabled.
Q: Can I use GPU acceleration?
A: Fun-ASR-Nano and Qwen3-ASR can use GPU acceleration. If your integrated GPU performs worse than CPU, disable the model-specific `dml_enable` or `vulkan_enable` flags in `config.toml`.
Q: File transcription is too slow on low-end hardware. What can I do?
A: Try these in order:
- Use `sensevoice` or `paraformer` if you need the fastest CPU path.
- Disable `dml_enable` or `vulkan_enable` for Qwen3-ASR / Fun-ASR-Nano if GPU acceleration hurts more than it helps.
- Lower model thread counts where supported.
- Lock GPU memory clocks with `nvidia-smi -lmc 9000` if you are optimizing for short clips on NVIDIA hardware.
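The clock-locking step above can be applied and later reverted like this (requires an NVIDIA driver that ships `nvidia-smi`, typically from an elevated prompt; 9000 MHz is the value suggested above, not a universal setting):

```shell
# Lock GPU memory clocks at 9000 MHz so the card does not downclock while idle
nvidia-smi -lmc 9000

# ... run your short-clip transcriptions ...

# Restore the driver's default memory-clock management afterwards
nvidia-smi -rmc
```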
Q: Fun-ASR-Nano quality is unstable on my integrated GPU. Why?
A: Some integrated GPUs have poor behavior with Vulkan FP16 accumulation in llama.cpp. If you see degraded output, disable `vulkan_enable` for that model and run the decoder on CPU.
Q: How do hotwords work?
A: `hot-server.txt` is used for server-side Fun-ASR-Nano context enhancement. `hot.txt` and `hot-rule.txt` are client-side replacement sources. `hot-rectify.txt` stores correction history used by the LLM pipeline.
Q: How do I use LLM roles?
A: Start your spoken command with the role name. For example, if you have a role named `translate`, saying “translate, the weather is great today” sends the recognized text through the translation role instead of producing direct output.
Q: How do I choose the LLM model behind a role?
A: Role defaults and overrides are defined through `config.toml` plus the `LLM/` role entry modules. You can point roles to local Ollama models or remote API providers.
Q: Can an LLM role read selected text on screen?
A: Yes. If a role enables `enable_read_selection`, the app can capture the current selection with `Ctrl+C` and pass it into the LLM context before processing your voice command.
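As an illustration only: a role entry with selection reading enabled might look like the sketch below. Only the `enable_read_selection` flag and the role-name trigger are documented above; the table name and the file where roles actually live are assumptions, so consult the `LLM/` role entry modules for the real layout:

```toml
[[llm.roles]]                  # hypothetical table name
name = "translate"             # spoken prefix that routes text to this role
enable_read_selection = true   # capture the current selection via Ctrl+C first
```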
Q: How do I hide the console window?
A: Use the tray menu to hide it.
Q: How do I start it automatically on boot?
A: Press `Win+R`, open `shell:startup`, and place shortcuts to the client and server there.
This project builds on several excellent open-source projects:
Special thanks to modern AI coding assistants and to the users who supported the project.

