
feat(api): add POST /dev/unload to release model from GPU VRAM #474

Merged: remsky merged 3 commits into remsky:master from sageframe-no-kaji:feature/model-unload on Jun 3, 2026

Conversation

@sageframe-no-kaji
Contributor

Summary

  • Adds POST /dev/unload endpoint that releases the Kokoro model from GPU VRAM without stopping the container
  • Model reloads lazily on the next inference request (zero cold-start penalty vs. container restart)
  • Lock-protected unload prevents races with concurrent requests during reload

Motivation

Homelab deployments share GPU memory across services (Ollama, Frigate, etc.). Without this endpoint the container holds model weights in VRAM indefinitely. Current workaround is stopping/restarting the container, which incurs a 25–30s cold start on every session.

Changes

api/src/inference/model_manager.py

  • unload(): acquires _lock, calls backend.unload(), nulls _backend, then torch.cuda.empty_cache() if CUDA available
  • generate(): lazy reinit when _backend is None (calls initialize() + load_model()) instead of raising RuntimeError
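The unload sequence described above can be sketched roughly as follows. This is a minimal illustration rather than the project's code: `FakeBackend` and `ModelManagerSketch` are hypothetical stand-ins, and `torch` is imported optionally so the sketch also runs on machines without it.

```python
import asyncio

try:  # torch is optional here so the sketch runs without it (assumption)
    import torch
except ImportError:
    torch = None


class FakeBackend:
    """Hypothetical stand-in for the real Kokoro backend."""

    def __init__(self) -> None:
        self.unloaded = False

    def unload(self) -> None:
        self.unloaded = True


class ModelManagerSketch:
    """Illustrative only; not the project's ModelManager."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._backend = FakeBackend()

    async def unload(self) -> None:
        async with self._lock:
            if self._backend is not None:
                self._backend.unload()  # let the backend drop its weights
                self._backend = None    # remove the manager's reference
            # Return cached allocator blocks to the driver so other
            # processes on the same GPU can use the memory.
            if torch is not None and torch.cuda.is_available():
                torch.cuda.empty_cache()
```

Calling `unload()` a second time is a no-op in this sketch, because the backend reference is already `None` when the lock is acquired.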

api/src/routers/development.py

  • POST /dev/unload: returns {"status": "unloaded"} on success, 503 if manager not initialised, 500 on unexpected error
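The three response paths can be illustrated with a framework-free sketch. The function name `unload_endpoint` and the error bodies are assumptions made for illustration; the real route lives in `api/src/routers/development.py` and may shape its errors differently.

```python
import asyncio
from typing import Any, Optional, Tuple


async def unload_endpoint(manager: Optional[Any]) -> Tuple[int, dict]:
    """Return (status_code, body) following the behaviour described above."""
    if manager is None:
        # Manager was never initialised, so there is nothing to unload.
        return 503, {"detail": "Model manager not initialised"}
    try:
        await manager.unload()
    except Exception as exc:  # any unexpected failure maps to 500
        return 500, {"detail": f"Unload failed: {exc}"}
    return 200, {"status": "unloaded"}
```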

api/src/services/tts_service.py

  • model_manager annotated as Optional[ModelManager] for correct mypy type narrowing at the endpoint

api/tests/test_model_unload.py (new)

  • 11 tests covering all branches: unload() with/without backend, CUDA available/unavailable, lazy reinit in generate(), and all four HTTP response paths

Test plan

  • uv run --extra test pytest api/tests/test_model_unload.py -v — 11/11 pass
  • uv run --with ruff ruff check api/src/inference/model_manager.py api/src/routers/development.py api/src/services/tts_service.py api/tests/test_model_unload.py — clean
  • Container smoke test: curl -X POST http://localhost:8880/dev/unload → {"status":"unloaded"}
  • nvidia-smi shows VRAM released after unload
  • Next /dev/captioned_speech request succeeds (model reloads)
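For scripting the smoke test instead of using curl, a small standard-library client could look like the sketch below. The host and port come from the curl example above; the helper names `build_unload_request` and `unload` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8880"  # host/port taken from the curl example


def build_unload_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        f"{base_url}/dev/unload", data=b"", method="POST"
    )


def unload(base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body returned by the server."""
    req = build_unload_request(base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, `unload()` should return the same body as the curl call; checking `nvidia-smi` afterwards remains a manual step.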

Closes #473

🤖 Generated with Claude Code

Allows homelab deployments to free GPU memory when the TTS service is idle
without stopping the container. The model reloads lazily on the next request.

- ModelManager.unload(): acquires lock, calls backend.unload(), nulls
  _backend, then calls torch.cuda.empty_cache() if CUDA is available
- ModelManager.generate(): lazy reinit when _backend is None (calls
  initialize() + load_model()) instead of raising RuntimeError
- POST /dev/unload: 200 on success, 503 if manager not initialised,
  500 on unexpected error
- TTSService.model_manager annotated as Optional[ModelManager] for
  correct mypy narrowing at the endpoint
- Full test coverage in api/tests/test_model_unload.py (11 tests)

Closes remsky#473 (partial)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@remsky
Owner

remsky commented May 31, 2026

This is great, will spin it up and take a look

@sageframe-no-kaji
Contributor Author

great! let me know if you need anything

Add ensure_backend() to ModelManager which reinitialises the backend and
reloads the model if /dev/unload was called. All three get_backend() call
sites in tts_service now await ensure_backend() first, so the first TTS
request after an unload reloads the model automatically rather than raising
RuntimeError: Backend not initialized.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sageframe-no-kaji
Contributor Author

Bug fix added (commit 750fabc): The original commit described lazy reload on the next request, but generate_audio_stream, _process_chunk, and generate_from_phonemes in tts_service.py all call model_manager.get_backend() directly — which raises RuntimeError: Backend not initialized rather than triggering the reload logic in model_manager.generate(). Tested this in production: calling /dev/unload then making a TTS request returned a 500 until the container was manually restarted.

The fix adds ensure_backend() to ModelManager — a single async method that reinitialises and reloads the model if the backend is None — and replaces the three bare get_backend() calls in tts_service.py with await ensure_backend() before them. Lazy reload now works as described.
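A minimal sketch of the difference between the two call-site styles, assuming a simplified manager: `ManagerSketch`, `call_site_before`, and `call_site_after` are hypothetical names, and the placeholder object stands in for `initialize()` plus `load_model()`.

```python
import asyncio


class ManagerSketch:
    """Simplified stand-in for ModelManager right after /dev/unload."""

    def __init__(self) -> None:
        self._backend = None

    def get_backend(self):
        if self._backend is None:
            raise RuntimeError("Backend not initialized")
        return self._backend

    async def ensure_backend(self) -> None:
        if self._backend is None:
            # Placeholder for initialize() + load_model() in the real code.
            self._backend = object()


async def call_site_before(manager: ManagerSketch):
    # Bare get_backend(): raises once the model has been unloaded.
    return manager.get_backend()


async def call_site_after(manager: ManagerSketch):
    # Reload first, then fetch the backend.
    await manager.ensure_backend()
    return manager.get_backend()
```

Note that this version of `ensure_backend()` takes no lock, which matches the state of the PR at this commit.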

@remsky
Owner

remsky commented Jun 2, 2026

@sageframe-no-kaji

the lazy-reload fix looks good, and the suite passes clean here, 11/11.

before merging though, i think asyncio.Lock only wraps unload(), so the reload path doesn't serialize. the reload block is inlined in generate() (and duplicated in ensure_backend()), and neither holds the lock, so if a burst of requests lands while _backend is None, it could pile up (tested a 5-way concurrent reload, went through with 5 model loads etc)

can add a lock on the reload path, and route generate() through ensure_backend() to get a single source of truth, something like the below?

    async def ensure_backend(self) -> None:
        """Reload the backend if it was unloaded via /dev/unload."""
        if self._backend:
            return
        async with self._lock:
            # Re-checks after acquiring
            if not self._backend:
                await self.initialize()
                await self.load_model(self._config.pytorch_kokoro_v1_file)

    async def generate(self, *args, **kwargs):
        """Generate audio using initialized backend.

        Raises:
            RuntimeError: If generation fails
        """
        await self.ensure_backend()
        assert self._backend is not None  # ensure_backend loaded it or raised

        try:
            async for chunk in self._backend.generate(*args, **kwargs):
                ...

if you can add a commit for this, should be good to merge in, or I can patch it on top.

Addresses review feedback: the original lazy-reload in generate() ran
without a lock, so a burst of requests landing while _backend is None
could trigger multiple concurrent initialize()/load_model() calls.

- ensure_backend(): fast-path check outside lock, then re-check inside
  _lock before initializing (double-checked locking pattern)
- generate(): routes through ensure_backend() instead of inline check,
  eliminating the duplicate code path
- test_ensure_backend_serializes_concurrent_reloads: 5-way concurrent
  gather confirms only one initialize/load_model cycle fires

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sageframe-no-kaji
Contributor Author

Good catch... the reload path wasn't protected. I added double-checked locking to ensure_backend() so concurrent requests while the model is unloaded serialize correctly: first caller loads, the rest wait and skip the reload once the lock clears.
Committed and ready to go.
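The effect of the double-checked lock can be reproduced in isolation with a toy model of the reload path. This is a sketch under stated assumptions, not the PR's test: `ReloadDemo` and `count_loads` are hypothetical, and a short `asyncio.sleep` stands in for the model load.

```python
import asyncio


class ReloadDemo:
    """Toy reload path that counts how many loads actually run."""

    def __init__(self, use_lock: bool) -> None:
        self._backend = None
        self._lock = asyncio.Lock()
        self._use_lock = use_lock
        self.load_count = 0

    async def _load(self) -> None:
        self.load_count += 1
        await asyncio.sleep(0.01)  # yields control, like a slow model load
        self._backend = object()

    async def ensure_backend(self) -> None:
        if self._backend is not None:  # fast path, no lock needed
            return
        if not self._use_lock:
            await self._load()  # unprotected: every waiting caller loads
            return
        async with self._lock:
            if self._backend is None:  # re-check once the lock is held
                await self._load()


async def count_loads(use_lock: bool, callers: int = 5) -> int:
    demo = ReloadDemo(use_lock)
    await asyncio.gather(*(demo.ensure_backend() for _ in range(callers)))
    return demo.load_count
```

Running `asyncio.run(count_loads(False))` performs one load per caller, while `asyncio.run(count_loads(True))` performs a single load and lets the remaining callers skip it after the re-check.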

@remsky
Owner

remsky commented Jun 3, 2026

LGTM, cheers.

@remsky remsky closed this Jun 3, 2026
@remsky remsky reopened this Jun 3, 2026
@remsky remsky merged commit ff6efaf into remsky:master Jun 3, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

Feature proposal: model unload endpoint to release GPU VRAM without stopping container

2 participants