feat(api): add POST /dev/unload to release model from GPU VRAM by sageframe-no-kaji · Pull Request #474 · remsky/Kokoro-FastAPI

sageframe-no-kaji · 2026-05-31T18:29:55Z

Summary

Adds POST /dev/unload endpoint that releases the Kokoro model from GPU VRAM without stopping the container
Model reloads lazily on the next inference request, which avoids the cold start of a full container restart
Lock-protected unload prevents races with concurrent requests during reload

Motivation

Homelab deployments share GPU memory across services (Ollama, Frigate, etc.). Without this endpoint the container holds model weights in VRAM indefinitely. Current workaround is stopping/restarting the container, which incurs a 25–30s cold start on every session.

Changes

api/src/inference/model_manager.py

unload(): acquires _lock, calls backend.unload(), nulls _backend, then torch.cuda.empty_cache() if CUDA available
generate(): lazy reinit when _backend is None (calls initialize() + load_model()) instead of raising RuntimeError
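A minimal sketch of the unload() sequence described above, not the code from this PR. `FakeBackend` and `ModelManagerSketch` are hypothetical stand-ins, and the torch import is made optional only so the sketch runs on machines without it.

```python
import asyncio
from typing import Optional

try:  # optional here so the sketch also runs where torch is not installed
    import torch
except ImportError:
    torch = None


class FakeBackend:
    """Hypothetical stand-in for the real Kokoro backend."""

    def __init__(self) -> None:
        self.unloaded = False

    def unload(self) -> None:
        self.unloaded = True


class ModelManagerSketch:
    """Hypothetical, trimmed-down manager that only models unload()."""

    def __init__(self) -> None:
        self._backend: Optional[FakeBackend] = FakeBackend()
        self._lock = asyncio.Lock()

    async def unload(self) -> None:
        async with self._lock:
            if self._backend is not None:
                self._backend.unload()
                self._backend = None  # drop the manager's reference to the backend
            # Ask the CUDA caching allocator to release unused cached blocks.
            if torch is not None and torch.cuda.is_available():
                torch.cuda.empty_cache()


async def demo() -> bool:
    manager = ModelManagerSketch()
    backend = manager._backend
    await manager.unload()
    await manager.unload()  # second call finds no backend and does nothing
    return backend.unloaded and manager._backend is None
```

Nulling `_backend` before `empty_cache()` matters in this ordering: cached blocks can only be released once nothing references the tensors that occupy them.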

api/src/routers/development.py

POST /dev/unload: returns {"status": "unloaded"} on success, 503 if manager not initialised, 500 on unexpected error
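The three outcomes listed above can be sketched without any web framework. `StubManager`, `unload_response`, and the shape of the error bodies are assumptions made for illustration; only the status codes and the `{"status": "unloaded"}` body come from the description.

```python
import asyncio
from typing import Any, Dict, Optional, Tuple


class StubManager:
    """Hypothetical manager whose unload() can be made to fail."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.unloaded = False

    async def unload(self) -> None:
        if self.fail:
            raise RuntimeError("backend refused to unload")
        self.unloaded = True


async def unload_response(
    manager: Optional[StubManager],
) -> Tuple[int, Dict[str, Any]]:
    """Return (status_code, body) for a POST /dev/unload style request."""
    if manager is None:
        # Service has not finished starting, so there is nothing to unload.
        return 503, {"detail": "Model manager not initialised"}
    try:
        await manager.unload()
    except Exception as exc:  # any unexpected failure maps to 500
        return 500, {"detail": f"Unload failed: {exc}"}
    return 200, {"status": "unloaded"}
```

In a real router the same branching would presumably be expressed by raising the framework's HTTP exception type instead of returning tuples.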

api/src/services/tts_service.py

model_manager annotated as Optional[ModelManager] for correct mypy type narrowing at the endpoint

api/tests/test_model_unload.py (new)

11 tests covering all branches: unload() with/without backend, CUDA available/unavailable, lazy reinit in generate(), and all four HTTP response paths

Test plan

uv run --extra test pytest api/tests/test_model_unload.py -v — 11/11 pass
uv run --with ruff ruff check api/src/inference/model_manager.py api/src/routers/development.py api/src/services/tts_service.py api/tests/test_model_unload.py — clean
Container smoke test: curl -X POST http://localhost:8880/dev/unload → {"status":"unloaded"}
nvidia-smi shows VRAM released after unload
Next /dev/captioned_speech request succeeds (model reloads)

Closes #473

🤖 Generated with Claude Code

Allows homelab deployments to free GPU memory when the TTS service is idle without stopping the container. The model reloads lazily on the next request.

- ModelManager.unload(): acquires lock, calls backend.unload(), nulls _backend, then calls torch.cuda.empty_cache() if CUDA is available
- ModelManager.generate(): lazy reinit when _backend is None (calls initialize() + load_model()) instead of raising RuntimeError
- POST /dev/unload: 200 on success, 503 if manager not initialised, 500 on unexpected error
- TTSService.model_manager annotated as Optional[ModelManager] for correct mypy narrowing at the endpoint
- Full test coverage in api/tests/test_model_unload.py (11 tests)

Closes remsky#473 (partial)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

remsky · 2026-05-31T21:14:27Z

This is great, will spin it up and take a look

sageframe-no-kaji · 2026-05-31T22:27:38Z

great! let me know if you need anything

Add ensure_backend() to ModelManager which reinitialises the backend and reloads the model if /dev/unload was called. All three get_backend() call sites in tts_service now await ensure_backend() first, so the first TTS request after an unload reloads the model automatically rather than raising RuntimeError: Backend not initialized.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sageframe-no-kaji · 2026-06-01T01:05:42Z

Bug fix added (commit 750fabc): The original commit described lazy reload on the next request, but generate_audio_stream, _process_chunk, and generate_from_phonemes in tts_service.py all call model_manager.get_backend() directly — which raises RuntimeError: Backend not initialized rather than triggering the reload logic in model_manager.generate(). Tested this in production: calling /dev/unload then making a TTS request returned a 500 until the container was manually restarted.

The fix adds ensure_backend() to ModelManager — a single async method that reinitialises and reloads the model if the backend is None — and replaces the three bare get_backend() calls in tts_service.py with await ensure_backend() before them. Lazy reload now works as described.
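To make the failure mode concrete, here is a small sketch of a call site before and after the fix. `ManagerSketch` and the two request helpers are hypothetical, the `object()` placeholder stands in for initialize() + load_model(), and locking is left out on purpose since the review below deals with it.

```python
import asyncio


class ManagerSketch:
    """Hypothetical manager in the state left behind by /dev/unload."""

    def __init__(self) -> None:
        self._backend = None
        self.loads = 0

    def get_backend(self):
        if self._backend is None:
            raise RuntimeError("Backend not initialized")
        return self._backend

    async def ensure_backend(self) -> None:
        if self._backend is None:
            self.loads += 1
            self._backend = object()  # placeholder for the real reload


async def request_without_fix(manager: ManagerSketch):
    # Bare get_backend(): bypasses any reload logic living in generate().
    return manager.get_backend()


async def request_with_fix(manager: ManagerSketch):
    # Reload first, then fetch the backend as before.
    await manager.ensure_backend()
    return manager.get_backend()
```

Calling `request_without_fix` on a freshly unloaded manager raises `RuntimeError`, while `request_with_fix` triggers exactly one reload and returns the backend.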

remsky · 2026-06-02T20:39:55Z

@sageframe-no-kaji

the lazy-reload fix looks good, and the suite passes clean here, 11/11.

before merging though, i think asyncio.Lock only wraps unload(), so the reload path doesn't serialize. the reload block is inlined in generate() (and duplicated in ensure_backend()), and neither holds the lock, so if a burst of requests lands while _backend is None, it could pile up (tested a 5-way concurrent reload, went through with 5 model loads etc)

can add a lock on the reload path, and route generate() through ensure_backend() to get a single source of truth, something like the below?

    async def ensure_backend(self) -> None:
        """Reload the backend if it was unloaded via /dev/unload."""
        # Fast path: backend already loaded, no need to take the lock.
        if self._backend:
            return
        async with self._lock:
            # Re-check after acquiring: another caller may have reloaded already.
            if not self._backend:
                await self.initialize()
                await self.load_model(self._config.pytorch_kokoro_v1_file)

    async def generate(self, *args, **kwargs):
        """Generate audio using initialized backend.

        Raises:
            RuntimeError: If generation fails
        """
        await self.ensure_backend()
        assert self._backend is not None  # ensure_backend loaded it or raised

        try:
            async for chunk in self._backend.generate(*args, **kwargs):
                ...  # remainder of generate() unchanged

if you can add a commit for this, should be good to merge in, or I can patch it on top.

Addresses review feedback: the original lazy-reload in generate() ran without a lock, so a burst of requests landing while _backend is None could trigger multiple concurrent initialize()/load_model() calls.

- ensure_backend(): fast-path check outside lock, then re-check inside _lock before initializing (double-checked locking pattern)
- generate(): routes through ensure_backend() instead of inline check, eliminating the duplicate code path
- test_ensure_backend_serializes_concurrent_reloads: 5-way concurrent gather confirms only one initialize/load_model cycle fires

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
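A rough sketch of the kind of check the new test performs, not the test from this PR. `ReloadSketch` and `count_loads` are hypothetical, and `asyncio.sleep` stands in for the slow model load so that concurrent callers get a chance to interleave.

```python
import asyncio


class ReloadSketch:
    """Hypothetical manager that counts how often the model is loaded."""

    def __init__(self) -> None:
        self._backend = None
        self._lock = asyncio.Lock()
        self.load_count = 0

    async def _load(self) -> None:
        self.load_count += 1
        await asyncio.sleep(0.01)  # yield so other callers can run meanwhile
        self._backend = object()

    async def ensure_backend_unlocked(self) -> None:
        # Original behaviour: check and load with no serialization.
        if self._backend is None:
            await self._load()

    async def ensure_backend(self) -> None:
        # Double-checked locking: cheap check, then re-check under the lock.
        if self._backend is not None:
            return
        async with self._lock:
            if self._backend is None:
                await self._load()


async def count_loads(locked: bool, callers: int = 5) -> int:
    manager = ReloadSketch()
    method = manager.ensure_backend if locked else manager.ensure_backend_unlocked
    await asyncio.gather(*(method() for _ in range(callers)))
    return manager.load_count
```

With five concurrent callers, `count_loads(locked=False)` returns 5 and `count_loads(locked=True)` returns 1, which mirrors the 5-way reload observed in review and the single load expected after the fix.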

sageframe-no-kaji · 2026-06-02T22:02:28Z

Good catch... the reload path wasn't protected. I added double-checked locking to ensure_backend() so concurrent requests while the model is unloaded serialize correctly: first caller loads, the rest wait and skip the reload once the lock clears.
Committed and ready to go.

remsky · 2026-06-03T06:34:26Z

LGTM, cheers.

remsky closed this Jun 3, 2026

remsky reopened this Jun 3, 2026

remsky merged commit ff6efaf into remsky:master Jun 3, 2026
1 check passed

remsky merged 3 commits into remsky:master from sageframe-no-kaji:feature/model-unload