starfield17/CapsWriter

CapsWriter-Offline (v2.5)

demo

Hold CapsLock, speak, release, and the text appears.

CapsWriter-Offline is a fully offline speech input tool built primarily for Windows.

🚀 What’s New

v2.5-alpha

  • Initial support for Qwen3-ASR-1.7B
    • Works for both microphone dictation and file transcription in this fork
    • When the model does not return real timestamps, the app falls back to the existing approximate timeline path for subtitle and JSON generation
    • Decoder-side Vulkan acceleration is enabled by default and typically needs about 1.6 GB of VRAM
    • If your GPU drops memory clocks while idle, cold-start latency can rise to around 300 ms
    • Locking memory clocks with nvidia-smi -lmc 9000 can reduce short-clip latency to around 100 ms on hardware such as the RTX 5050

v2.4

  • Improved Fun-ASR-Nano-GGUF support with DirectML encoder acceleration and better FP16 defaults
  • Server-side Fun-ASR-Nano now uses its own hot-server.txt hotword context file
  • Spoken punctuation like “comma”, “period”, and “new line” can be converted automatically
  • Added decoder temperature handling to avoid edge-case repetition loops
  • Improved server-side alphabet spelling merge behavior
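The spoken-punctuation conversion added in v2.4 can be pictured as a phrase-to-symbol substitution pass over the recognized text. The sketch below is illustrative only and is not the project's actual converter (the real one also handles Chinese punctuation words and context); the phrase table is an assumption:

```python
# Illustrative sketch of spoken-punctuation conversion
# (not CapsWriter's actual implementation).
import re

SPOKEN_PUNCTUATION = {
    "comma": ",",
    "period": ".",
    "question mark": "?",
    "new line": "\n",
}

def convert_spoken_punctuation(text: str) -> str:
    # Replace longer phrases first so "question mark" is matched
    # before any shorter overlapping phrase.
    for phrase in sorted(SPOKEN_PUNCTUATION, key=len, reverse=True):
        symbol = SPOKEN_PUNCTUATION[phrase]
        # Match the phrase as whole words, absorbing the space before it.
        text = re.sub(rf"\s*\b{re.escape(phrase)}\b", symbol, text)
    return text
```

For example, "hello comma world period" becomes "hello, world.".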

v2.3

  • Added Fun-ASR-Nano-GGUF support
  • Refactored large-file transcription to async streaming
  • Improved Chinese/English spacing cleanup
  • Improved server cleanup after abnormal disconnects

✨ Core Features

  • Speech input: hold CapsLock or mouse side button X2, speak, release, and insert text immediately
  • File transcription: drag audio/video files onto the client and generate .srt, .txt, and .json
  • ITN formatting: convert spoken number patterns into clean written forms
  • Server hotword context: store domain-specific terms in hot-server.txt to help Fun-ASR-Nano recognize context
  • Hotword replacement: use hot.txt for phoneme-based fuzzy matching and forced replacement
  • Rule replacement: use hot-rule.txt for regex-based or direct text replacement
  • Rectify history: keep correction history in hot-rectify.txt to help LLM polishing
  • LLM roles: route text to roles such as assistant or translate when the recognized text starts with that role name
  • Tray menu: manage hotwords, copy results, or clear LLM memory from the tray icon
  • Client/server split: run the model on one machine and the lightweight client on another if needed
  • Diary archive: save recognized sentences by date
  • Audio archive: save recorded audio locally for privacy and traceability
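The hot-rule.txt feature described above (regex-based or direct text replacement) can be pictured as an ordered list of pattern/replacement rules applied to the recognized text. The rule list and function below are hypothetical illustrations; the actual hot-rule.txt file format may differ:

```python
# Illustrative sketch of rule-based replacement
# (hypothetical rules; the real hot-rule.txt format may differ).
import re

# Each rule is (regex_pattern, replacement), applied top to bottom.
RULES = [
    (r"\bcaps writer\b", "CapsWriter"),   # direct text replacement
    (r"(\d+)\s*degrees\b", r"\1°"),       # regex with a capture group
]

def apply_rules(text: str) -> str:
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text
```

For example, "open caps writer at 25 degrees" becomes "open CapsWriter at 25°".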

The project is designed around four ideas: offline, fast, accurate, and highly configurable. The goal is a smooth voice-input workflow that still works without cloud access, installation-heavy deployment, or network connectivity.

LLM roles can use local models through Ollama or remote APIs through providers such as OpenAI-compatible services.

💻 Platform Support

The project is mainly targeted at Windows 10/11 (64-bit).

  • Linux: not officially tested or packaged
  • macOS: currently unsupported because low-level keyboard-hook support is limited and system permissions are restrictive

🎬 Quick Start

  1. Install the VC++ runtime.
  2. Download the app from Latest Release.
  3. Download the model package from Models Release and extract it into the matching subfolder under models/.
  4. Launch start_server.exe.
  5. Launch start_client.exe.
  6. Hold CapsLock or mouse side button X2 and start speaking.

🎤 Model Options

Select the speech model in config.toml through server.model_type:

  • qwen3_asr: built-in punctuation, acceptable CPU speed, very fast on discrete GPUs, excellent accuracy
  • fun_asr_nano: built-in punctuation, fast on CPU, very fast on discrete GPUs, top-tier accuracy
  • sensevoice: built-in punctuation, extremely fast on CPU, strong multilingual support
  • paraformer: external punctuation model, extremely fast on CPU, high accuracy

⚙️ Configuration

All runtime settings live in the root config.toml.

  • Edit [[client.shortcuts]] to change keyboard or mouse triggers
  • Set hold_mode = false for press-once / press-again recording
  • Toggle llm_enabled to enable or disable LLM post-processing
  • Change server.model_type to switch ASR backends
  • Tune model-specific acceleration flags under [models.*]
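Putting the settings above together, a trimmed config.toml might look like the sketch below. Only server.model_type, hold_mode, llm_enabled, [[client.shortcuts]], and the [models.*] tables are named in this README; the other key names and their placement are assumptions, so check the shipped config.toml for the real schema:

```toml
# Trimmed, illustrative config.toml (key names beyond those cited in
# this README are assumptions; consult the shipped file).
[server]
model_type = "qwen3_asr"   # or "fun_asr_nano", "sensevoice", "paraformer"

[client]
llm_enabled = true         # toggle LLM post-processing

[[client.shortcuts]]
key = "caps lock"          # keyboard or mouse trigger
hold_mode = true           # false = press once to start, again to stop

[models.qwen3_asr]
vulkan_enable = true       # model-specific acceleration flag
```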

🛠️ FAQ

Q: Why does nothing happen when I press the key?
A: Make sure the start_client.exe console process is still running. If you want to type into an elevated application, run the client with administrator privileges too.

Q: Why is there no recognition output?
A: Check the recorded audio inside the dated year/month/assets folder. Make sure the microphone is actually recording and that Windows microphone permissions are enabled.

Q: Can I use GPU acceleration?
A: Fun-ASR-Nano and Qwen3-ASR can use GPU acceleration. If your integrated GPU performs worse than CPU, disable the model-specific dml_enable or vulkan_enable flags in config.toml.

Q: File transcription is too slow on low-end hardware. What can I do?
A: Try these in order:

  1. Use sensevoice or paraformer if you need the fastest CPU path.
  2. Disable dml_enable or vulkan_enable for Qwen3-ASR / Fun-ASR-Nano if GPU acceleration hurts more than it helps.
  3. Lower model thread counts where supported.
  4. Lock GPU memory clocks with nvidia-smi -lmc 9000 if you are optimizing for short clips on NVIDIA hardware.

Q: Fun-ASR-Nano quality is unstable on my integrated GPU. Why?
A: Some integrated GPUs have poor behavior with Vulkan FP16 accumulation in llama.cpp. If you see degraded output, disable vulkan_enable for that model and run the decoder on CPU.

Q: How do hotwords work?
A: hot-server.txt is used for server-side Fun-ASR-Nano context enhancement. hot.txt and hot-rule.txt are client-side replacement sources. hot-rectify.txt stores correction history used by the LLM pipeline.

Q: How do I use LLM roles?
A: Start your spoken command with the role name. For example, if you have a role named translate, saying "translate, the weather is great today" sends the recognized text through the translation role instead of producing direct output.
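The role-prefix routing described above can be pictured as a check on the first word of the recognized text. This is a minimal sketch under that assumption, not the project's actual dispatch code, and the role names are examples:

```python
# Illustrative role-prefix routing (hypothetical role set;
# not CapsWriter's actual dispatcher).
ROLES = {"assistant", "translate"}

def route(text: str) -> tuple[str, str]:
    """Return (role, payload); role is "" for direct text output."""
    first, _, rest = text.partition(" ")
    # Strip a trailing comma after the role word, as in
    # "translate, the weather is great today".
    role = first.rstrip(",，").lower()
    if role in ROLES:
        return role, rest.lstrip()
    return "", text
```

For example, route("translate, the weather is great today") yields ("translate", "the weather is great today"), while text with no role prefix passes through unchanged.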

Q: How do I choose the LLM model behind a role?
A: Role defaults and overrides are defined through config.toml plus the LLM/ role entry modules. You can point roles to local Ollama models or remote API providers.

Q: Can an LLM role read selected text on screen?
A: Yes. If a role enables enable_read_selection, the app can capture the current selection with Ctrl+C and pass it into the LLM context before processing your voice command.

Q: How do I hide the console window?
A: Use the tray menu to hide it.

Q: How do I start it automatically on boot?
A: Press Win+R, open shell:startup, and place shortcuts for the client and server there.

❤️ Credits

This project builds on several excellent open-source projects.

Special thanks to modern AI coding assistants and to the users who supported the project.
