fix(cache): set server-side TTL on model-monitoring zset writes #2390

Open
bigbitbus wants to merge 2 commits into main from fix/model-monitoring-zset-ttl-leak

@bigbitbus commented Jun 1, 2026

Problem

RedisCache.zadd() accepts an expire argument but never issues a Redis EXPIRE — it only records the intended expiry in the in-process self.zexpires dict, which is reaped by the _expire() daemon thread:

self.client.zadd(key, {value: score})
if expire:
    self.zexpires[(key, score)] = expire + time.time()   # in-process only

That bookkeeping lives solely in the writing process's memory. If the process dies before a member is trimmed (e.g. an autoscaled/serverless inference pod scaling down), the key is orphaned in Redis forever with TTL -1.
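
The failure mode can be reproduced without a Redis server. The sketch below is illustrative only: `FakeRedis` and `LeakyWriter` are hypothetical stand-ins (not redis-py and not the real `RedisCache`), modelling just enough of ZADD and TTL to show that the key outlives the process that was supposed to trim it.

```python
import time


class FakeRedis:
    """In-memory stand-in modelling only ZADD and TTL (illustrative, not redis-py)."""

    def __init__(self):
        self.zsets = {}      # key -> {member: score}
        self.deadlines = {}  # key -> absolute expiry time; absent means no TTL

    def zadd(self, key, mapping):
        self.zsets.setdefault(key, {}).update(mapping)

    def ttl(self, key):
        # Mirrors the Redis TTL replies: -2 = no such key, -1 = key has no expiry.
        if key not in self.zsets:
            return -2
        if key not in self.deadlines:
            return -1
        return max(0, int(self.deadlines[key] - time.time()))


class LeakyWriter:
    """Pre-fix behaviour: the expiry is recorded only in this object's memory."""

    def __init__(self, client):
        self.client = client
        self.zexpires = {}

    def zadd(self, key, value, score, expire=None):
        self.client.zadd(key, {value: score})
        if expire:
            self.zexpires[(key, score)] = expire + time.time()  # in-process only


server = FakeRedis()
writer = LeakyWriter(server)
writer.zadd("inference:pod-a:model/1", "record", 1.0, expire=120)

# The pod dies: its zexpires dict disappears, so nothing is left to trim the key.
del writer

print(server.ttl("inference:pod-a:model/1"))  # -1: no server-side expiry was ever set
```

The server never learned about the 120-second lifetime, so from its point of view the key is permanent.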

The model-monitoring buffer in ModelManager (core/managers/base.py) writes one record per inference into inference:<GLOBAL_INFERENCE_SERVER_ID>:<model_id> (plus error:<…> and a models registry zset) via this path with expire=METRICS_INTERVAL * 2. The intended lifetime is ~120s — just long enough for the PingbackInfo scheduler to read the trailing window and POST it to Roboflow. But because GLOBAL_INFERENCE_SERVER_ID is unique per pod and pods churn constantly under autoscaling, every pod death strands its keys (a surviving pod only reaps its own ids via its own zexpires).

Impact observed in production

The Redis cache filled up with model-monitoring keys and had to be cleared manually.

Fix

Issue ZADD + EXPIRE in a single pipeline so the key carries a real server-side sliding TTL:

if expire:
    with self.client.pipeline() as pipe:
        pipe.zadd(key, {value: score})
        pipe.expire(key, max(1, int(expire)))
        pipe.execute()
    self.zexpires[(key, score)] = expire + time.time()
else:
    self.client.zadd(key, {value: score})
  • A live model keeps re-arming the TTL on every write (and _expire() still trims individual members), so an actively-served key never expires out from under its readers.
  • A dead pod's keys self-reclaim ~expire seconds after the last write, regardless of process lifecycle — no more orphans.
  • The two commands run in one transactional pipeline (MULTI/EXEC), so there is never a window where the key exists without a TTL — a crash between the two can't reintroduce the leak one key at a time.
  • max(1, int(expire)) guards against EXPIRE 0 (immediate delete) for sub-second values.
  • The in-process self.zexpires bookkeeping is retained as a best-effort fine-grained per-member trim (now an optimisation rather than the only safety net).

This is semantically what the existing code already intended — it just put the expiry in the wrong place (a Python dict instead of Redis). It applies to every zadd writer (model-monitoring, WebRTC session sets, etc.) for free; no caller changes needed.
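
The sliding-TTL behaviour described in the bullets above can be sketched with a manual clock. This is a simulation under stated assumptions, not the real cache: `FakeRedis` and `write` are hypothetical names, the fake runs ZADD and EXPIRE back to back instead of in a MULTI/EXEC pipeline, and a 60-second write cadence is assumed purely for illustration.

```python
class FakeRedis:
    """In-memory stand-in with key-level expiry and a manual clock (illustrative only)."""

    def __init__(self):
        self.now = 0.0
        self.zsets = {}      # key -> {member: score}
        self.deadlines = {}  # key -> absolute expiry time

    def _purge(self):
        # Drop every key whose deadline has passed, as the server would.
        for key in [k for k, t in self.deadlines.items() if t <= self.now]:
            self.zsets.pop(key, None)
            del self.deadlines[key]

    def zadd(self, key, mapping):
        self._purge()
        self.zsets.setdefault(key, {}).update(mapping)

    def expire(self, key, seconds):
        self._purge()
        if key in self.zsets:
            self.deadlines[key] = self.now + seconds  # replaces any earlier deadline

    def exists(self, key):
        self._purge()
        return key in self.zsets

    def zcard(self, key):
        self._purge()
        return len(self.zsets.get(key, {}))


def write(server, key, value, score, expire):
    # Stand-in for the pipelined ZADD + EXPIRE from the fix.
    server.zadd(key, {value: score})
    server.expire(key, max(1, int(expire)))


server = FakeRedis()
key = "inference:pod-a:model/1"

# A live pod writes at t = 0, 60, ..., 300; every write re-arms the 120s TTL.
for i in range(6):
    server.now = i * 60.0
    write(server, key, f"record-{i}", server.now, expire=120)

alive_while_writing = server.exists(key)    # the key survives continuous writes
members_while_writing = server.zcard(key)   # the key-level TTL does not trim members

# The pod dies after its last write at t = 300.
server.now = 300.0 + 119.0
alive_before_deadline = server.exists(key)
server.now = 300.0 + 120.0
alive_after_deadline = server.exists(key)   # reclaimed with no process involved
```

Two things fall out of the simulation: the key is reclaimed only once writes stop, and the key-level TTL never removes individual members, which is why the per-member trim in `_expire()` is still needed for long-lived keys.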

Testing

New unit tests in tests/inference/unit_tests/core/cache/test_redis_cache.py:

  • expire set → pipelined ZADD + EXPIRE(key, 120) issued, bookkeeping recorded
  • sub-second expire clamped up to 1 (never EXPIRE 0)
  • no expire → plain ZADD, no pipeline / TTL / bookkeeping
tests/.../core/cache/test_redis_cache.py ......... 3 passed
tests/.../core/managers/test_base.py + test_metrics.py ... 41 passed

Existing model-monitoring manager tests pass unchanged (the zadd signature is untouched).
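
For readers without the test file at hand, the three cases listed above could look roughly like the following. This is a sketch, not the tests from this PR: `zadd_with_ttl` is a hypothetical standalone copy of the patched write path, and the client is a `unittest.mock.MagicMock`, so no Redis server is required.

```python
import time
from unittest.mock import MagicMock


def zadd_with_ttl(client, zexpires, key, value, score, expire=None):
    """Standalone copy of the patched write path (hypothetical helper name)."""
    if expire:
        with client.pipeline() as pipe:
            pipe.zadd(key, {value: score})
            pipe.expire(key, max(1, int(expire)))
            pipe.execute()
        zexpires[(key, score)] = expire + time.time()
    else:
        client.zadd(key, {value: score})


def make_client():
    """Return a mock client plus the mock yielded by `with client.pipeline()`."""
    client = MagicMock()
    pipe = client.pipeline.return_value.__enter__.return_value
    return client, pipe


def test_expire_issues_pipelined_zadd_and_expire():
    client, pipe = make_client()
    zexpires = {}
    zadd_with_ttl(client, zexpires, "inference:srv:model", "rec", 1.0, expire=120)
    pipe.zadd.assert_called_once_with("inference:srv:model", {"rec": 1.0})
    pipe.expire.assert_called_once_with("inference:srv:model", 120)
    pipe.execute.assert_called_once_with()
    client.zadd.assert_not_called()  # the write went through the pipeline only
    assert ("inference:srv:model", 1.0) in zexpires


def test_sub_second_expire_is_clamped_to_one():
    client, pipe = make_client()
    zadd_with_ttl(client, {}, "k", "rec", 1.0, expire=0.4)
    pipe.expire.assert_called_once_with("k", 1)  # never EXPIRE 0


def test_no_expire_uses_plain_zadd():
    client, pipe = make_client()
    zexpires = {}
    zadd_with_ttl(client, zexpires, "k", "rec", 1.0)
    client.zadd.assert_called_once_with("k", {"rec": 1.0})
    client.pipeline.assert_not_called()
    assert zexpires == {}
```

Mocking the pipeline keeps the tests fast and deterministic, at the cost of not exercising real server-side expiry; an integration test against a live Redis would be needed to cover that.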

RedisCache.zadd() accepted an `expire` arg but never issued a Redis
EXPIRE — it only recorded the intended expiry in the in-process
`self.zexpires` dict, reaped by the `_expire()` daemon thread. That
bookkeeping lives solely in the writing process's memory, so if the
process dies before a member is trimmed (e.g. an autoscaled/serverless
inference pod scaling down), the key is orphaned in Redis forever with
TTL -1.

The model-monitoring buffer in ModelManager writes one record per
inference into `inference:<server-id>:<model>` (and `error:<...>`,
`models`) via this path with `expire=METRICS_INTERVAL*2`. Because the
server id is unique per pod and pods churn constantly under autoscaling,
every pod death stranded its keys — observed as ~45k keys / 7GB of dead
zsets (e.g. a single `inference:<dead-id>:printed-doc/2` with 282k
members / 133MB, last written 46 days prior) filling a shared backing
store to its memory ceiling.

Issue ZADD + EXPIRE in one pipeline so the key carries a real sliding
server-side TTL: a live model keeps re-arming it (and `_expire()` still
trims individual members), while a dead pod's keys self-reclaim ~expire
seconds after the last write regardless of process lifecycle. The
in-process bookkeeping is retained as a best-effort fine-grained trim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@iurisilvio

PR looks good. Commenting only because this is a common mistake: the EXPIRE issued alongside ZADD applies to the root key, not to each member. If we keep ZADDing, I think the root key will never expire.
