# FAQ
🌐 Language: English | Français
Design decisions, scope, and comparisons. For step-by-step fixes, see Troubleshooting instead.
- Why is the ASR engine in Rust?
- Why not use Whisper streaming?
- Why not Vosk for streaming?
- Why ONNX and not PyTorch?
- Why NeMo Parakeet over other models?
- Does dictee run 100% offline?
- Can I use multiple ASR backends at the same time?
- Does dictee work on Wayland?
- Does dictee work in a Flatpak / Snap?
- Is dictee multi-user safe?
- Does LLM correction send my text to the cloud?
- Does the ASR model phone home?
## Why is the ASR engine in Rust?

The ASR engine (Parakeet, Canary, Sortformer, Nemotron) runs in Rust via ONNX Runtime. Three reasons:
- Latency: Rust + ONNX avoids Python interpreter overhead. Warm latency for a 5-second utterance on CPU is ~0.8 s vs ~1.5 s for a pure-Python equivalent.
- Startup time: The Rust daemon cold-starts in < 500 ms (model load excluded). Python equivalents take 2–3 seconds just to import torch + transformers.
- Portability: A single binary with statically-linked ONNX Runtime is easier to package than a Python + venv + multiple wheels.
The UI (setup wizard, tray) remains in Python + PyQt6 because GUI iteration speed matters more there than latency.
## Why not use Whisper streaming?

Whisper is an encoder-decoder model. The encoder needs the full utterance (in 30-second chunks) before the decoder can emit tokens. Various projects approximate streaming by sliding a window and re-running the encoder every ~1 s, which wastes GPU and produces jitter.
dictee's approach: use Parakeet-TDT (transducer, emits tokens incrementally without re-running the encoder) as the default, and reserve Whisper for its strength — wide language coverage on batch transcription where latency doesn't matter.
## Why not Vosk for streaming?

Vosk does stream natively. It's shipped as a backend exactly for that reason. Trade-offs:
- Lower accuracy than Parakeet/Canary/Whisper
- No native punctuation
- Smaller language coverage
If you need truly live transcription with text appearing as you speak, Vosk is the only option. The default Parakeet-TDT flow transcribes after you stop speaking (toggle / push-to-talk style) which most users find more comfortable.
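The difference between the two interaction styles can be sketched in a few lines of plain Python (a toy illustration only, no real ASR involved; `chunks` stands in for audio arriving from the microphone):

```python
def live_partials(chunks):
    """Vosk-style: emit a growing partial result after every audio chunk."""
    heard = []
    for chunk in chunks:
        heard.append(chunk)       # stand-in for incremental decoding
        yield " ".join(heard)     # partial text appears while you speak

def transcribe_after_stop(chunks):
    """Parakeet-TDT flow in dictee: decode once the utterance is complete."""
    return " ".join(chunks)       # one final result after you stop speaking

words = ["hello", "world"]
print(list(live_partials(words)))    # ['hello', 'hello world']
print(transcribe_after_stop(words))  # hello world
```

With live partials the displayed text can still change as context accumulates; the after-stop flow only ever shows the final result, which is why many users find it calmer.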
## Why ONNX and not PyTorch?

Four reasons:

- Deployment simplicity: a compiled ONNX graph doesn't need Python + torch + the CUDA toolkit installed
- Cross-platform: ONNX Runtime has execution providers for CPU, CUDA, TensorRT, CoreML, OpenVINO, DirectML, WebGPU
- Size: ONNX graphs are ~30% smaller than torch checkpoints
- Stability: no version pinning hell with torch API changes
The trade-off: harder to debug edge cases, no autograd for custom fine-tuning. For inference-only shipping, ONNX wins.
## Why NeMo Parakeet over other models?

NVIDIA's NeMo Parakeet family combines four properties that no other open model offers all at once:
- FastConformer encoder (subsampling + conformer blocks = low latency)
- TDT decoder (token-and-duration transducer, frame-skipping = faster inference)
- Native punctuation and capitalization (no post-processing needed)
- 25 European languages in a single 600M model
Alternatives like Parakeet-CTC (same encoder, CTC decoder) are faster but don't punctuate. Canary-1B has better quality but is 5 GB and GPU-only. Whisper has broader language support but higher latency and worse punctuation on short utterances.
## Does dictee run 100% offline?

By default, yes — with Parakeet, Canary, faster-whisper, or Vosk, all inference is local.
You opt into network only if you pick:
- Google Translate or Bing Translator (text sent to Google/Microsoft)
- Ollama or LibreTranslate on a remote host (text sent over your LAN)
Audio never leaves your machine in any configuration.
## Can I use multiple ASR backends at the same time?

No. Systemd service unit files include `Conflicts=` directives — only one of dictee.service / dictee-canary.service / dictee-whisper.service / dictee-vosk.service can be active at a time.
Reason: each backend binds to the same Unix socket. Running two would conflict.
Switching is instant (no service restart latency), so you can alternate in the middle of a session.
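A minimal sketch of how that mutual exclusion looks in a unit file (illustrative only; the unit contents that actually ship with dictee may differ):

```ini
# ~/.config/systemd/user/dictee-whisper.service (sketch)
[Unit]
Description=dictee ASR daemon (faster-whisper backend)
# systemd stops any listed unit before starting this one,
# so at most one backend ever holds the Unix socket
Conflicts=dictee.service dictee-canary.service dictee-vosk.service
```

Each backend's unit lists the other three, so starting any one of them implicitly stops whichever was running.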
## Does dictee work on Wayland?

Yes, fully supported. dictee uses `dotool` to simulate keyboard input, which works on Wayland (unlike `xdotool`). Notifications, plasmoid, and tray all work natively on Wayland compositors.
Tested: KDE Plasma 6 Wayland, GNOME 45+ Wayland, Sway, Hyprland.
## Does dictee work in a Flatpak / Snap?

Not yet. Packaging challenges:
- `dotool` needs `/dev/uinput` access (blocked in most sandboxes)
- `dictee-ptt` needs `/dev/input/event*` access
- Plasmoid installation requires `kpackagetool6` with system permissions
- CUDA libraries are hundreds of MB — not ideal for a sandboxed app
These are solvable (portals exist for input simulation, Flatpak has NVIDIA extensions) but haven't been tackled. Contributions welcome!
## Is dictee multi-user safe?

Yes, since v1.3.0-beta1. State files in /dev/shm are namespaced by UID:
/dev/shm/.dictee_state_1000 # user 1000
/dev/shm/.dictee_state_1001 # user 1001 — separate file, separate state
Two users on the same machine can dictate in parallel without interference.
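The per-user path follows directly from the process UID; a minimal sketch (the `.dictee_state_` prefix is taken from the example paths above, the helper name is illustrative):

```python
import os

def state_file_path(uid=None):
    """Return the UID-namespaced dictee state file path under /dev/shm."""
    if uid is None:
        uid = os.getuid()  # current user's numeric UID on Linux
    return f"/dev/shm/.dictee_state_{uid}"

print(state_file_path(1000))  # /dev/shm/.dictee_state_1000
print(state_file_path(1001))  # /dev/shm/.dictee_state_1001
```

Because the UID is baked into the filename, two sessions never read or write each other's state.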
## Does LLM correction send my text to the cloud?

No — LLM correction uses your local Ollama instance (http://localhost:11434 by default). The text never leaves your machine.
Exception: if you point DICTEE_OLLAMA_HOST at a remote Ollama server, text goes over your LAN. But that's an explicit user choice.
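Endpoint selection boils down to an environment-variable lookup with a localhost default; a hedged sketch (dictee's actual config handling may differ, only the `DICTEE_OLLAMA_HOST` variable and default port come from the text above):

```python
import os

DEFAULT_OLLAMA = "http://localhost:11434"

def ollama_endpoint(env=None):
    """Resolve the Ollama base URL: remote only if explicitly configured."""
    if env is None:
        env = os.environ
    return env.get("DICTEE_OLLAMA_HOST", DEFAULT_OLLAMA)

print(ollama_endpoint({}))  # http://localhost:11434
print(ollama_endpoint({"DICTEE_OLLAMA_HOST": "http://nas.lan:11434"}))
```

With no variable set, everything stays on localhost; pointing the variable at another host is the one explicit opt-in.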
## Does the ASR model phone home?

No. ONNX models are static files. They do not execute arbitrary code and have no network access. The model weights you download from HuggingFace are cryptographically signed by NVIDIA/NeMo.
dictee itself makes no outbound requests except:
- Installer (`install.sh`): fetches release artifacts from `api.github.com`
- Update check (optional, disabled by default): queries `api.github.com/repos/rcspam/dictee/releases`
- Cloud translation (Google / Bing): only if you pick those backends
Nothing else.
## How does dictee compare to nerd-dictation?

nerd-dictation is a Vosk-based dictation tool with an elegant CLI.
| Feature | dictee | nerd-dictation |
|---|---|---|
| ASR backends | 4 | 1 (Vosk) |
| Punctuation | Native (Parakeet/Canary) | Regex-based |
| Translation | 5 backends | None |
| GUI | Setup wizard + plasmoid + tray | None |
| LLM correction | Yes (Ollama) | No |
| Diarization | Yes (Sortformer) | No |
Pick nerd-dictation if: you want a minimal Python-only setup with no desktop integration. Pick dictee if: you want a full dictation suite with GUI and modern models.
## How does dictee compare to Talon?

Talon is a proprietary voice control suite with excellent scripting.
| Feature | dictee | Talon |
|---|---|---|
| Open source | ✅ GPL-3.0 | ❌ Closed source, free tier |
| ASR engines | 4 open | 1 closed (custom) |
| Voice scripting | Basic (regex voice commands) | Advanced (Python scripting) |
| Multi-language | 25+ (Parakeet) | English-first |
| Accessibility | Standard | Strong focus on accessibility |
Pick Talon if: you need voice control for navigation / mouse / complex scripting. Pick dictee if: you need open-source dictation in many languages.
## Why faster-whisper and not whisper.cpp?

Whisper.cpp is a C++ Whisper runtime; dictee uses faster-whisper (CTranslate2), another Whisper runtime.
Both produce comparable Whisper output. Whisper.cpp is faster on CPU for some configurations, faster-whisper is faster on GPU. dictee chose faster-whisper because:
- Better Python integration (our post-processing is Python)
- Word-level timestamps for continuation feature
- `distil-large-v3` support out of the box
## What hardware do I need?

Anything that runs modern Linux with a microphone. Specifically:
- CPU: any x86_64 (or aarch64 with source build) from the last 10 years
- RAM: 4 GB for Vosk, 8 GB for Parakeet, 16 GB for Canary + LLM
- GPU: optional (CPU fallback works for all backends except Canary which is impractical on CPU)
- Disk: 5 GB for all default models, 15 GB if you want all ASR backends + translation models
Tested minimum: Raspberry Pi 4 (4 GB RAM) with Vosk, ~2 s latency per utterance.
## Do AMD or Intel GPUs work?

Not currently. dictee uses the CUDA execution provider of ONNX Runtime. Experimental providers exist for ROCm (AMD) and OpenVINO (Intel) but:
- Not shipped in the pre-built packages (would require rebuilding ONNX Runtime from source per-distro)
- No automated testing on AMD/Intel hardware
- Community contributions welcome!
AMD/Intel GPU users should use the CPU variant — expect ~1 s warm latency for 5-second utterances.
## Does dictee run on Apple Silicon?

Via Asahi Linux: CPU-only. ONNX Runtime has a CoreML provider, but Asahi doesn't expose it to Linux userspace.
Native macOS: not supported. dictee relies heavily on:
- Linux `dotool`/evdev for input
- KDE Plasma / GNOME for desktop integration
- systemd user services
A macOS port would require rewriting large parts. Not planned.
## See also

- Troubleshooting — when things go wrong
- Developer-Guide — how to contribute
- Changelog — what's new and planned