feat(api): add POST /dev/unload to release model from GPU VRAM #474
Allows homelab deployments to free GPU memory when the TTS service is idle without stopping the container. The model reloads lazily on the next request.

- ModelManager.unload(): acquires lock, calls backend.unload(), nulls _backend, then calls torch.cuda.empty_cache() if CUDA is available
- ModelManager.generate(): lazy reinit when _backend is None (calls initialize() + load_model()) instead of raising RuntimeError
- POST /dev/unload: 200 on success, 503 if manager not initialised, 500 on unexpected error
- TTSService.model_manager annotated as Optional[ModelManager] for correct mypy narrowing at the endpoint
- Full test coverage in api/tests/test_model_unload.py (11 tests)

Closes remsky#473 (partial)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
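For readers skimming the thread, the unload sequence described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the project's actual ModelManager: FakeBackend, ModelManagerSketch, and the boolean return value are hypothetical, and torch is imported optionally so the sketch also runs on machines without it.

```python
import asyncio
from typing import Optional

try:  # torch is optional here so the sketch runs on CPU-only machines
    import torch
except ImportError:
    torch = None


class FakeBackend:
    """Hypothetical stand-in for the real Kokoro backend."""

    def __init__(self) -> None:
        self.loaded = True

    def unload(self) -> None:
        self.loaded = False


class ModelManagerSketch:
    """Illustrative only; names and return value are assumptions."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._backend: Optional[FakeBackend] = FakeBackend()

    async def unload(self) -> bool:
        """Release the backend; return True if something was unloaded."""
        async with self._lock:
            if self._backend is None:
                return False  # nothing to do, already unloaded
            self._backend.unload()
            self._backend = None  # drop the last reference to the weights
            # Ask the CUDA caching allocator to hand freed blocks back.
            if torch is not None and torch.cuda.is_available():
                torch.cuda.empty_cache()
            return True


async def demo():
    manager = ModelManagerSketch()
    first = await manager.unload()
    second = await manager.unload()
    return first, second, manager._backend
```

Nulling the reference before calling torch.cuda.empty_cache() matters because the allocator can only release blocks that no live tensor still occupies.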
This is great, will spin it up and take a look
great! let me know if you need anything |
Add ensure_backend() to ModelManager, which reinitialises the backend and reloads the model if /dev/unload was called. All three get_backend() call sites in tts_service now await ensure_backend() first, so the first TTS request after an unload reloads the model automatically rather than raising "RuntimeError: Backend not initialized".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug fix added (commit 750fabc): The original commit described lazy reload on the next request, but … The fix adds …
the lazy-reload fix looks good, and the suite passes clean here, 11/11. before merging though, i think we can add a lock on the reload path, and route generate() through ensure_backend():

    async def ensure_backend(self) -> None:
        """Reload the backend if it was unloaded via /dev/unload."""
        if self._backend:
            return
        async with self._lock:
            # Re-check after acquiring the lock
            if not self._backend:
                await self.initialize()
                await self.load_model(self._config.pytorch_kokoro_v1_file)

    async def generate(self, *args, **kwargs):
        """Generate audio using initialized backend.

        Raises:
            RuntimeError: If generation fails
        """
        await self.ensure_backend()
        assert self._backend is not None  # ensure_backend loaded it or raised
        try:
            async for chunk in self._backend.generate(*args, **kwargs):
                ...

if you can add a commit for this, should be good to merge in, or I can patch it on top.
Addresses review feedback: the original lazy-reload in generate() ran without a lock, so a burst of requests landing while _backend is None could trigger multiple concurrent initialize()/load_model() calls.

- ensure_backend(): fast-path check outside lock, then re-check inside _lock before initializing (double-checked locking pattern)
- generate(): routes through ensure_backend() instead of inline check, eliminating the duplicate code path
- test_ensure_backend_serializes_concurrent_reloads: 5-way concurrent gather confirms only one initialize/load_model cycle fires

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Good catch... the reload path wasn't protected. I added double-checked locking to ensure_backend() so concurrent requests while the model is unloaded serialize correctly: the first caller loads, the rest wait and skip the reload once the lock clears.
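The serialization behaviour described here can be reproduced in isolation with a small sketch. It is not the project's test: ReloadingManager, _load(), and burst() are hypothetical names, and a short asyncio.sleep() stands in for the real initialize()/load_model() cycle.

```python
import asyncio


class ReloadingManager:
    """Hypothetical manager that counts how often the model is loaded."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._backend = None
        self.load_count = 0

    async def _load(self) -> None:
        # Simulate a slow load and yield control so other tasks can queue up.
        await asyncio.sleep(0.01)
        self.load_count += 1
        self._backend = object()

    async def ensure_backend(self) -> None:
        if self._backend is not None:  # fast path: no lock when loaded
            return
        async with self._lock:
            if self._backend is None:  # re-check: another task may have loaded
                await self._load()


async def burst(n: int = 5) -> int:
    """Fire n concurrent ensure_backend() calls; return the number of loads."""
    manager = ReloadingManager()
    await asyncio.gather(*(manager.ensure_backend() for _ in range(n)))
    return manager.load_count
```

Without the re-check inside the lock, every task that queued while the first load was in flight would load again after acquiring the lock, so the counter would reach n instead of 1.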
LGTM, cheers. |
Summary
- POST /dev/unload endpoint that releases the Kokoro model from GPU VRAM without stopping the container

Motivation
Homelab deployments share GPU memory across services (Ollama, Frigate, etc.). Without this endpoint the container holds model weights in VRAM indefinitely. The current workaround is stopping and restarting the container, which incurs a 25–30s cold start on every session.

Changes
api/src/inference/model_manager.py
- unload(): acquires _lock, calls backend.unload(), nulls _backend, then torch.cuda.empty_cache() if CUDA available
- generate(): lazy reinit when _backend is None (calls initialize() + load_model()) instead of raising RuntimeError

api/src/routers/development.py
- POST /dev/unload: returns {"status": "unloaded"} on success, 503 if manager not initialised, 500 on unexpected error

api/src/services/tts_service.py
- model_manager annotated as Optional[ModelManager] for correct mypy type narrowing at the endpoint

api/tests/test_model_unload.py (new)
- unload() with/without backend, CUDA available/unavailable, lazy reinit in generate(), and all four HTTP response paths

Test plan
- uv run --extra test pytest api/tests/test_model_unload.py -v: 11/11 pass
- uv run --with ruff ruff check api/src/inference/model_manager.py api/src/routers/development.py api/src/services/tts_service.py api/tests/test_model_unload.py: clean
- curl -X POST http://localhost:8880/dev/unload → {"status":"unloaded"}
- nvidia-smi shows VRAM released after unload
- /dev/captioned_speech request succeeds (model reloads)

Closes #473
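As a rough illustration of the status codes listed under Changes, the mapping can be sketched without any web framework. This is not the code in api/src/routers/development.py: unload_response() and the stub classes are hypothetical, and the error bodies for 503 and 500 are assumptions (only the {"status": "unloaded"} success body comes from this PR).

```python
import asyncio
from typing import Any, Dict, Optional, Tuple


async def unload_response(manager: Optional[Any]) -> Tuple[int, Dict[str, str]]:
    """Map the outcome of an unload attempt to (status_code, body)."""
    if manager is None:
        # Manager not initialised yet; the body text is an assumption.
        return 503, {"detail": "model manager not initialised"}
    try:
        await manager.unload()
    except Exception as exc:  # any unexpected failure becomes a 500
        return 500, {"detail": str(exc)}
    return 200, {"status": "unloaded"}


class OkManager:
    """Hypothetical manager whose unload() succeeds."""

    async def unload(self) -> None:
        return None


class FailingManager:
    """Hypothetical manager whose unload() raises."""

    async def unload(self) -> None:
        raise RuntimeError("boom")
```

Calling asyncio.run(unload_response(None)) exercises the 503 path, OkManager() the 200 path, and FailingManager() the 500 path.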
🤖 Generated with Claude Code