Run a 1-second silent audio transcription after model loading to pre-compile CUDA kernels and warm up memory allocations. This reduces first-transcription latency at the cost of ~1-2s additional startup time. Warmup failure is logged as a warning and does not block startup. Closes #31 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adds a startup warmup inference step after loading the NVIDIA NeMo Parakeet model to reduce first-request latency caused by CUDA JIT/kernel initialization.
Changes:
- Calls a new `_warmup()` method immediately after `self._model.eval()` in `ModelLoader.load()`.
- Implements `_warmup()` by running a 1-second silent transcription using `self._model.cfg.sample_rate`, logging success and treating failure as non-fatal.
    # Warmup inference to pre-compile CUDA kernels
    self._warmup()
Warmup is invoked unconditionally after loading, but the loader allows the device to be configured (e.g., `PARAKEY_DEVICE=cpu`). In that case the warmup adds startup latency and logs misleading "CUDA warmup" messages without actually warming CUDA. Consider gating warmup on `self._device == "cuda"` (and CUDA availability), or renaming the log messages to a device-agnostic warmup when not on CUDA.
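A minimal sketch of the gating suggested above. Note this is illustrative only: `_device`, `_cuda_available`, and the return values are stand-ins for the real loader's attributes and behavior, not the actual project code.

```python
import logging

logger = logging.getLogger(__name__)


class ModelLoader:
    """Sketch: gate the warmup pass on the configured device."""

    def __init__(self, device: str, cuda_available: bool) -> None:
        # In the real loader these would come from PARAKEY_DEVICE and
        # torch.cuda.is_available(); hardcoded here to stay self-contained.
        self._device = device
        self._cuda_available = cuda_available

    def _warmup(self) -> str:
        # Only CUDA benefits from kernel pre-compilation; skip the pass on
        # CPU so startup stays fast and the log messages stay accurate.
        if self._device == "cuda" and self._cuda_available:
            logger.info("Running CUDA warmup inference")
            return "cuda-warmup"
        logger.info("Skipping CUDA warmup (device=%s)", self._device)
        return "skipped"
```

The string return values exist only so the branch taken is observable in this sketch; the real method would run the silent transcription in the CUDA branch instead.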
    warmup_audio = np.zeros(self._model.cfg.sample_rate, dtype="float32")
    warmup_tensor = torch.from_numpy(warmup_audio).float()
    with torch.no_grad():
        self._model.transcribe(audio=[warmup_tensor])
The warmup path calls `self._model.transcribe(audio=[warmup_tensor])` with only the torch-tensor variant. In `transcribe()` you already implement compatibility fallbacks for NeMo versions that don't accept tensors (or that require positional arguments). If warmup hits one of those cases it will always fail, and the intended CUDA kernel warmup won't happen. Consider reusing the same fallback strategy here so warmup remains effective across supported NeMo/PyTorch combinations.
Suggested change (replacing `self._model.transcribe(audio=[warmup_tensor])`):

    try:
        # Preferred path: newer NeMo versions accepting torch tensors
        self._model.transcribe(audio=[warmup_tensor])
    except TypeError:
        # Fallbacks for NeMo versions with different transcribe signatures
        try:
            # Some versions expect positional arguments
            self._model.transcribe([warmup_tensor])
        except TypeError:
            # Some versions may not accept tensors at all; try NumPy input
            self._model.transcribe([warmup_audio])
Summary
- Uses `cfg.sample_rate` to generate the correct length of silence

Test plan
Closes #31
🤖 Generated with Claude Code