
Add CUDA warmup inference on startup#49

Open
wpflueger wants to merge 1 commit into main from fix/cuda-warmup

Conversation

@wpflueger
Owner

Summary

  • Runs a 1-second silent audio transcription after model loading to pre-compile CUDA kernels
  • Reduces first-transcription latency (CUDA JIT compilation, memory allocation patterns)
  • Adds ~1-2s to startup time; failure is non-fatal (logged as warning)
  • Uses the model's own cfg.sample_rate to generate the correct length of silence

Test plan

  • Verify "Running CUDA warmup inference..." and "CUDA warmup complete" appear in backend logs during startup
  • Compare first-dictation latency with and without warmup
  • Verify startup still succeeds if warmup fails for any reason

Closes #31

🤖 Generated with Claude Code

Run a 1-second silent audio transcription after model loading to
pre-compile CUDA kernels and warm up memory allocations. This reduces
first-transcription latency at the cost of ~1-2s additional startup time.
Warmup failure is logged as a warning and does not block startup.

Closes #31

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 9, 2026 03:41

Copilot AI left a comment


Pull request overview

Adds a startup warmup inference step after loading the NVIDIA NeMo Parakeet model to reduce first-request latency caused by CUDA JIT/kernel initialization.

Changes:

  • Calls a new _warmup() method immediately after self._model.eval() in ModelLoader.load().
  • Implements _warmup() by running a 1-second silent transcription using self._model.cfg.sample_rate, logging success and treating failure as non-fatal.


Comment on lines +186 to +188
# Warmup inference to pre-compile CUDA kernels
self._warmup()


Copilot AI Feb 9, 2026


Warmup is invoked unconditionally after loading, but the loader allows the device to be configured (e.g., PARAKEY_DEVICE=cpu). In that case this adds startup latency and logs misleading "CUDA warmup" messages without actually warming CUDA. Consider gating warmup on self._device == "cuda" (and CUDA availability), or rewording the log messages to describe a device-agnostic warmup when not on CUDA.
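The gate suggested here is a one-line predicate. As a sketch (the name `should_warmup` is illustrative, not from the PR, and `cuda_available` stands in for a `torch.cuda.is_available()` check so the example stays dependency-free):

```python
def should_warmup(device: str, cuda_available: bool) -> bool:
    """Return True only when warmup would actually touch CUDA.

    `device` is the configured device string (e.g. from PARAKEY_DEVICE);
    `cuda_available` would come from torch.cuda.is_available() in practice.
    """
    return device == "cuda" and cuda_available
```

With this gate, a CPU-configured loader skips the warmup entirely instead of paying the latency and emitting misleading "CUDA warmup" log lines.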

warmup_audio = np.zeros(self._model.cfg.sample_rate, dtype="float32")
warmup_tensor = torch.from_numpy(warmup_audio).float()
with torch.no_grad():
self._model.transcribe(audio=[warmup_tensor])

Copilot AI Feb 9, 2026


The warmup path calls self._model.transcribe(audio=[warmup_tensor]) with only the torch-tensor variant. In transcribe() you already implement compatibility fallbacks for NeMo versions that don’t accept tensors (or require positional args). If warmup hits one of those cases it will always fail and the intended CUDA kernel warmup won’t happen. Consider reusing the same fallback strategy here so warmup remains effective across supported NeMo/PyTorch combinations.

Suggested change
self._model.transcribe(audio=[warmup_tensor])
try:
    # Preferred path: newer NeMo versions accepting torch tensors
    self._model.transcribe(audio=[warmup_tensor])
except TypeError:
    # Fallbacks for NeMo versions with different transcribe signatures
    try:
        # Some versions expect positional arguments
        self._model.transcribe([warmup_tensor])
    except TypeError:
        # Some versions may not accept tensors at all; try NumPy input
        self._model.transcribe([warmup_audio])
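The nested try/except above can also be flattened into a loop over call variants, which makes adding or reordering fallbacks easier. This is a hypothetical helper, not code from the PR; it assumes each incompatible `transcribe()` signature raises TypeError, as in the suggestion:

```python
def transcribe_with_fallbacks(model, tensor, array):
    """Try transcribe() call variants in order, returning the first success."""
    attempts = (
        lambda: model.transcribe(audio=[tensor]),  # newer keyword API
        lambda: model.transcribe([tensor]),        # positional batch of tensors
        lambda: model.transcribe([array]),         # NumPy-only versions
    )
    last_err = None
    for attempt in attempts:
        try:
            return attempt()
        except TypeError as err:
            # Signature mismatch; fall through to the next variant
            last_err = err
    raise last_err
```

Reusing one such helper in both `transcribe()` and `_warmup()` would keep the warmup effective across the same NeMo/PyTorch combinations the main path already supports.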

