rcspam edited this page Apr 23, 2026 · 5 revisions

🌐 Language: English | Français

Frequently Asked Questions

Design decisions, scope, and comparisons. For step-by-step fixes, see Troubleshooting instead.

Table of Contents

Design

Features

Multi-user / security

Comparison

Hardware


Design

Why Rust?

The ASR engine (Parakeet, Canary, Sortformer, Nemotron) runs in Rust via ONNX Runtime. Three reasons:

  1. Latency: Rust + ONNX avoids Python interpreter overhead. Warm latency for a 5-second utterance on CPU is ~0.8 s vs ~1.5 s for a pure-Python equivalent.
  2. Startup time: The Rust daemon cold-starts in < 500 ms (model load excluded). Python equivalents take 2–3 seconds just to import torch + transformers.
  3. Portability: A single binary with statically-linked ONNX Runtime is easier to package than a Python + venv + multiple wheels.

The UI (setup wizard, tray) remains in Python + PyQt6 because GUI iteration speed matters more there than latency.

Why not Whisper streaming?

Whisper is an encoder-decoder model. The encoder needs the full utterance (in 30-second chunks) before the decoder can emit tokens. Various projects approximate streaming by sliding a window and re-running the encoder every ~1 s — this wastes GPU and produces jitter.

dictee's approach: use Parakeet-TDT (transducer, emits tokens incrementally without re-running the encoder) as the default, and reserve Whisper for its strength — wide language coverage on batch transcription where latency doesn't matter.
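The cost difference is easy to see with back-of-envelope arithmetic. This is an illustrative sketch only (the re-run interval and cost model are simplified assumptions, not profiled numbers): a sliding-window approach re-encodes everything heard so far on every pass, so total encoder work grows quadratically with utterance length, while a transducer encodes each frame exactly once.

```python
# Back-of-envelope comparison: audio-seconds the encoder must process
# for one T-second utterance. Illustrative only; real costs vary with
# model, hardware, and windowing strategy.

def sliding_window_cost(t_seconds: float, rerun_every: float = 1.0) -> float:
    """Sliding-window streaming: re-encode the growing window every pass."""
    cost = 0.0
    elapsed = rerun_every
    while elapsed <= t_seconds:
        cost += elapsed          # re-encode everything heard so far
        elapsed += rerun_every
    return cost

def transducer_cost(t_seconds: float) -> float:
    """Transducer (e.g. Parakeet-TDT): each frame is encoded once."""
    return t_seconds

print(sliding_window_cost(10))  # 55.0 — quadratic in utterance length
print(transducer_cost(10))      # 10.0 — linear
```

For a 10-second utterance the sliding-window approximation does roughly 5.5× the encoder work, and the gap widens with longer speech.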

Why not Vosk for streaming?

Vosk does stream natively. It's shipped as a backend exactly for that reason. Trade-offs:

  • Lower accuracy than Parakeet/Canary/Whisper
  • No native punctuation
  • Smaller language coverage

If you need truly live transcription, with text appearing as you speak, Vosk is the only option. The default Parakeet-TDT flow transcribes after you stop speaking (toggle / push-to-talk style), which most users find more comfortable.

Why ONNX, not PyTorch?

  • Deployment simplicity: a compiled ONNX graph doesn't need Python + torch + CUDA toolkit installed
  • Cross-platform: ONNX Runtime has execution providers for CPU, CUDA, TensorRT, CoreML, OpenVINO, DirectML, WebGPU
  • Size: ONNX graphs are ~30% smaller than torch checkpoints
  • Stability: no version pinning hell with torch API changes

The trade-off: harder to debug edge cases, no autograd for custom fine-tuning. For inference-only shipping, ONNX wins.

Why NeMo Parakeet?

NVIDIA's NeMo Parakeet family combines four properties that no other open model offers together:

  1. FastConformer encoder (subsampling + conformer blocks = low latency)
  2. TDT decoder (token-and-duration transducer, frame-skipping = faster inference)
  3. Native punctuation and capitalization (no post-processing needed)
  4. 25 European languages in a single 600M model

Alternatives like Parakeet-CTC (same encoder, CTC decoder) are faster but don't punctuate. Canary-1B has better quality but is 5 GB and GPU-only. Whisper has broader language support but higher latency and worse punctuation on short utterances.


Features

Is dictee fully offline?

By default, yes — with Parakeet, Canary, faster-whisper, or Vosk, all inference is local.

You opt into network only if you pick:

  • Google Translate or Bing Translator (text sent to Google/Microsoft)
  • Ollama or LibreTranslate on a remote host (text sent to your LAN)

Audio never leaves your machine in any configuration.

Can I run multiple ASR backends at the same time?

No. Systemd service unit files include Conflicts= directives — only one of dictee.service / dictee-canary.service / dictee-whisper.service / dictee-vosk.service can be active at a time.

Reason: each backend binds to the same Unix socket, so running two at once would collide on it.
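The collision is the ordinary Unix-socket behavior: a second `bind()` to a path that is already bound fails with `EADDRINUSE`. A minimal stdlib demonstration (the socket path here is hypothetical; dictee's actual socket path may differ):

```python
# Why two backends can't share one Unix socket: the second bind()
# to the same path fails with "Address already in use" (EADDRINUSE).
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "dictee.sock")  # hypothetical path

first = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
first.bind(path)            # first backend claims the socket

second = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    second.bind(path)       # second backend loses
except OSError as e:
    print(f"second bind failed: {e.strerror}")
finally:
    first.close()
    second.close()
    os.unlink(path)
```

The `Conflicts=` directives just surface this constraint at the systemd level instead of letting the second service crash at startup.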

Switching is instant (no service restart latency), so you can alternate in the middle of a session.

Wayland?

Yes, fully supported. dictee uses dotool to simulate keyboard input, which works on Wayland (unlike xdotool). Notifications, plasmoid, and tray all work natively on Wayland compositors.

Tested: KDE Plasma 6 Wayland, GNOME 45+ Wayland, Sway, Hyprland.

Flatpak / Snap?

Not yet. Packaging challenges:

  • dotool needs /dev/uinput access (blocked in most sandboxes)
  • dictee-ptt needs /dev/input/event* access
  • Plasmoid installation requires kpackagetool6 with system permissions
  • CUDA libraries are hundreds of MB — not ideal for a sandboxed app

These are solvable (portals exist for input simulation, Flatpak has NVIDIA extensions) but haven't been tackled. Contributions welcome!


Multi-user / security

Multi-user safe?

Yes, since v1.3.0-beta1. State files in /dev/shm are namespaced by UID:

/dev/shm/.dictee_state_1000     # user 1000
/dev/shm/.dictee_state_1001     # user 1001 — separate file, separate state

Two users on the same machine can dictate in parallel without interference.

Does LLM correction send text to the cloud?

No — LLM correction uses your local Ollama instance (http://localhost:11434 by default). The text never leaves your machine.

Exception: if you point DICTEE_OLLAMA_HOST at a remote Ollama server, text goes over your LAN. But that's an explicit user choice.
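The local-by-default behavior amounts to a simple environment lookup. The env-var name and default port come from this page; the resolution function and the remote address are illustrative assumptions, not dictee's actual code:

```python
# Resolve the Ollama endpoint: local unless the user explicitly
# sets DICTEE_OLLAMA_HOST to point elsewhere.
import os

def ollama_endpoint() -> str:
    return os.environ.get("DICTEE_OLLAMA_HOST", "http://localhost:11434")

print(ollama_endpoint())  # http://localhost:11434 unless overridden

# Pointing at a remote host is an explicit opt-in (address is made up):
os.environ["DICTEE_OLLAMA_HOST"] = "http://192.168.1.50:11434"
print(ollama_endpoint())
```

Text only crosses the network when the override is set, which is exactly the "explicit user choice" described above.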

Does the ASR model phone home?

No. ONNX models are static files. They do not execute arbitrary code and have no network access. The model weights you download from HuggingFace are cryptographically signed by NVIDIA/NeMo.

dictee itself makes no outbound requests except:

  • Installer (install.sh): fetches release artifacts from api.github.com
  • Update check (optional, disabled by default): queries api.github.com/repos/rcspam/dictee/releases
  • Cloud translation (Google / Bing): only if you pick those backends

Nothing else.


Comparison

vs nerd-dictation

nerd-dictation is a Vosk-based dictation tool with an elegant CLI.

| Feature | dictee | nerd-dictation |
|---|---|---|
| ASR backends | 4 | 1 (Vosk) |
| Punctuation | Native (Parakeet/Canary) | Regex-based |
| Translation | 5 backends | None |
| GUI | Setup wizard + plasmoid + tray | None |
| LLM correction | Yes (Ollama) | No |
| Diarization | Yes (Sortformer) | No |

Pick nerd-dictation if: you want a minimal Python-only setup with no desktop integration. Pick dictee if: you want a full dictation suite with GUI and modern models.

vs Talon

Talon is a proprietary voice control suite with excellent scripting.

| Feature | dictee | Talon |
|---|---|---|
| Open source | ✅ GPL-3.0 | ❌ Closed source, free tier |
| ASR engines | 4 open | 1 closed (custom) |
| Voice scripting | Basic (regex voice commands) | Advanced (Python scripting) |
| Multi-language | 25+ (Parakeet) | English-first |
| Accessibility | Standard | Strong focus on accessibility |

Pick Talon if: you need voice control for navigation / mouse / complex scripting. Pick dictee if: you need open-source dictation in many languages.

vs Whisper.cpp

Whisper.cpp is a C++ Whisper runtime. dictee uses faster-whisper (CTranslate2), another Whisper runtime.

Both produce comparable Whisper output. Whisper.cpp is faster on CPU in some configurations; faster-whisper is faster on GPU. dictee chose faster-whisper because:

  • Better Python integration (our post-processing is Python)
  • Word-level timestamps for continuation feature
  • distil-large-v3 support out of the box

Hardware

Minimum hardware?

Anything that runs modern Linux with a microphone. Specifically:

  • CPU: any x86_64 (or aarch64 with source build) from the last 10 years
  • RAM: 4 GB for Vosk, 8 GB for Parakeet, 16 GB for Canary + LLM
  • GPU: optional (CPU fallback works for all backends except Canary, which is impractical on CPU)
  • Disk: 5 GB for all default models, 15 GB if you want all ASR backends + translation models

Tested minimum: Raspberry Pi 4 (4 GB RAM) with Vosk, ~2 s latency per utterance.

AMD / Intel GPUs?

Not currently. dictee uses the CUDA execution provider of ONNX Runtime. Experimental providers exist for ROCm (AMD) and OpenVINO (Intel) but:

  • Not shipped in the pre-built packages (would require rebuilding ONNX Runtime from source per-distro)
  • No automated testing on AMD/Intel hardware
  • Community contributions welcome!

AMD/Intel GPU users should use the CPU variant — expect ~1 s warm latency for 5-second utterances.

Apple Silicon?

Via Asahi Linux: CPU-only. ONNX Runtime has a CoreML provider but Asahi doesn't expose it to Linux userspace.

Native macOS: not supported. dictee relies heavily on:

  • Linux dotool / evdev for input
  • KDE Plasma / GNOME for desktop integration
  • systemd user services

A macOS port would require rewriting large parts. Not planned.


Next steps

📖 dictee Wiki

🇬🇧 Home · 🇫🇷 Accueil


Getting started / Premiers pas

Speech recognition / ASR

Translation / Traduction

Post-processing / Post-traitement

CLI

Reference / Référence


🏠 Repo · 📦 Releases · 🐛 Issues
