fix(cache): set server-side TTL on model-monitoring zset writes#2390
Open
bigbitbus wants to merge 2 commits into
Open
fix(cache): set server-side TTL on model-monitoring zset writes#2390bigbitbus wants to merge 2 commits into
bigbitbus wants to merge 2 commits into
Conversation
RedisCache.zadd() accepted an `expire` arg but never issued a Redis EXPIRE — it only recorded the intended expiry in the in-process `self.zexpires` dict, reaped by the `_expire()` daemon thread. That bookkeeping lives solely in the writing process' memory, so if the process dies before a member is trimmed (e.g. an autoscaled/serverless inference pod scaling down), the key is orphaned in Redis forever with TTL -1. The model-monitoring buffer in ModelManager writes one record per inference into `inference:<server-id>:<model>` (and `error:<...>`, `models`) via this path with `expire=METRICS_INTERVAL*2`. Because the server id is unique per pod and pods churn constantly under autoscaling, every pod death stranded its keys — observed as ~45k keys / 7GB of dead zsets (e.g. a single `inference:<dead-id>:printed-doc/2` with 282k members / 133MB, last written 46 days prior) filling a shared backing store to its memory ceiling. Issue ZADD + EXPIRE in one pipeline so the key carries a real sliding server-side TTL: a live model keeps re-arming it (and `_expire()` still trims individual members), while a dead pod's keys self-reclaim ~expire seconds after the last write regardless of process lifecycle. The in-process bookkeeping is retained as a best-effort fine-grained trim. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
|
PR looks good, commenting just because it is a common error, ZADD EXPIRE applies to the root key, not for every item. If we keep ZADDing, I think root key will never expire. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
RedisCache.zadd()accepts anexpireargument but never issues a RedisEXPIRE— it only records the intended expiry in the in-processself.zexpiresdict, which is reaped by the_expire()daemon thread:That bookkeeping lives solely in the writing process's memory. If the process dies before a member is trimmed (e.g. an autoscaled/serverless inference pod scaling down), the key is orphaned in Redis forever with
TTL -1.The model-monitoring buffer in
ModelManager(core/managers/base.py) writes one record per inference intoinference:<GLOBAL_INFERENCE_SERVER_ID>:<model_id>(pluserror:<…>and amodelsregistry zset) via this path withexpire=METRICS_INTERVAL * 2. The intended lifetime is ~120s — just long enough for thePingbackInfoscheduler to read the trailing window and POST it to Roboflow. But becauseGLOBAL_INFERENCE_SERVER_IDis unique per pod and pods churn constantly under autoscaling, every pod death strands its keys (a surviving pod only reaps its own ids via its ownzexpires).Impact observed in production
The redis cache got full of the model-monitoring keys and had to be manually cleared.
Fix
Issue
ZADD+EXPIREin a single pipeline so the key carries a real server-side sliding TTL:_expire()still trims individual members), so an actively-served key never expires out from under its readers.~expireseconds after the last write, regardless of process lifecycle — no more orphans.MULTI/EXEC), so there is never a window where the key exists without a TTL — a crash between the two can't reintroduce the leak one key at a time.max(1, int(expire))guards againstEXPIRE 0(immediate delete) for sub-second values.self.zexpiresbookkeeping is retained as a best-effort fine-grained per-member trim (now an optimisation rather than the only safety net).This is semantically what the existing code already intended — it just put the expiry in the wrong place (a Python dict instead of Redis). It applies to every
zaddwriter (model-monitoring, WebRTC session sets, etc.) for free; no caller changes needed.Testing
New unit tests in
tests/inference/unit_tests/core/cache/test_redis_cache.py:expireset → pipelinedZADD+EXPIRE(key, 120)issued, bookkeeping recordedexpirefloored to1(neverEXPIRE 0)expire→ plainZADD, no pipeline / TTL / bookkeepingExisting model-monitoring manager tests pass unchanged (the
zaddsignature is untouched).