
Add CUDA warmup inference on startup#49

Open
wpflueger wants to merge 1 commit into main from fix/cuda-warmup

Conversation

@wpflueger
Owner

Summary

  • Runs a 1-second silent audio transcription after model loading to pre-compile CUDA kernels
  • Reduces first-transcription latency (CUDA JIT compilation, memory allocation patterns)
  • Adds ~1-2s to startup time; failure is non-fatal (logged as warning)
  • Uses the model's own cfg.sample_rate to generate the correct length of silence

Test plan

  • Verify "Running CUDA warmup inference..." and "CUDA warmup complete" appear in backend logs during startup
  • Compare first-dictation latency with and without warmup
  • Verify startup still succeeds if warmup fails for any reason

Closes #31

🤖 Generated with Claude Code

Run a 1-second silent audio transcription after model loading to
pre-compile CUDA kernels and warm up memory allocations. This reduces
first-transcription latency at the cost of ~1-2s additional startup time.
Warmup failure is logged as a warning and does not block startup.

Closes #31

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 9, 2026 03:41

Copilot AI left a comment


Pull request overview

Adds a startup warmup inference step after loading the NVIDIA NeMo Parakeet model to reduce first-request latency caused by CUDA JIT/kernel initialization.

Changes:

  • Calls a new _warmup() method immediately after self._model.eval() in ModelLoader.load().
  • Implements _warmup() by running a 1-second silent transcription using self._model.cfg.sample_rate, logging success and treating failure as non-fatal.


Comment on lines +186 to +188
# Warmup inference to pre-compile CUDA kernels
self._warmup()


Copilot AI Feb 9, 2026


Warmup is invoked unconditionally after loading, but the loader allows the device to be configured (e.g., PARAKEY_DEVICE=cpu). In that case this adds startup latency and logs misleading "CUDA warmup" messages without actually warming CUDA. Consider gating warmup on self._device == "cuda" (and CUDA availability), or rewording the log messages to describe a device-agnostic warmup when not on CUDA.
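The gate suggested here is a one-line predicate. As a sketch (the name `should_warmup` is illustrative, not from the PR, and `cuda_available` stands in for a `torch.cuda.is_available()` check so the example stays dependency-free):

```python
def should_warmup(device: str, cuda_available: bool) -> bool:
    """Return True only when warmup would actually touch CUDA.

    `device` is the configured device string (e.g. from PARAKEY_DEVICE);
    `cuda_available` would come from torch.cuda.is_available() in practice.
    """
    return device == "cuda" and cuda_available
```

With this gate, a CPU-configured loader skips the warmup entirely instead of paying the latency and emitting misleading "CUDA warmup" log lines.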

warmup_audio = np.zeros(self._model.cfg.sample_rate, dtype="float32")
warmup_tensor = torch.from_numpy(warmup_audio).float()
with torch.no_grad():
self._model.transcribe(audio=[warmup_tensor])

Copilot AI Feb 9, 2026


The warmup path calls self._model.transcribe(audio=[warmup_tensor]) with only the torch-tensor variant. In transcribe() you already implement compatibility fallbacks for NeMo versions that don’t accept tensors (or require positional args). If warmup hits one of those cases it will always fail and the intended CUDA kernel warmup won’t happen. Consider reusing the same fallback strategy here so warmup remains effective across supported NeMo/PyTorch combinations.

Suggested change
self._model.transcribe(audio=[warmup_tensor])
try:
    # Preferred path: newer NeMo versions accepting torch tensors
    self._model.transcribe(audio=[warmup_tensor])
except TypeError:
    # Fallbacks for NeMo versions with different transcribe signatures
    try:
        # Some versions expect positional arguments
        self._model.transcribe([warmup_tensor])
    except TypeError:
        # Some versions may not accept tensors at all; try NumPy input
        self._model.transcribe([warmup_audio])
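The nested try/except above can also be flattened into a loop over call variants, which makes adding or reordering fallbacks easier. This is a hypothetical helper, not code from the PR; it assumes each incompatible `transcribe()` signature raises TypeError, as in the suggestion:

```python
def transcribe_with_fallbacks(model, tensor, array):
    """Try transcribe() call variants in order, returning the first success."""
    attempts = (
        lambda: model.transcribe(audio=[tensor]),  # newer keyword API
        lambda: model.transcribe([tensor]),        # positional batch of tensors
        lambda: model.transcribe([array]),         # NumPy-only versions
    )
    last_err = None
    for attempt in attempts:
        try:
            return attempt()
        except TypeError as err:
            # Signature mismatch; fall through to the next variant
            last_err = err
    raise last_err
```

Reusing one such helper in both `transcribe()` and `_warmup()` would keep the warmup effective across the same NeMo/PyTorch combinations the main path already supports.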

