fix(cache): set server-side TTL on model-monitoring zset writes by bigbitbus · Pull Request #2390 · roboflow/inference

bigbitbus · 2026-06-01T01:24:42Z

Problem

RedisCache.zadd() accepts an expire argument but never issues a Redis EXPIRE — it only records the intended expiry in the in-process self.zexpires dict, which is reaped by the _expire() daemon thread:

self.client.zadd(key, {value: score})
if expire:
    self.zexpires[(key, score)] = expire + time.time()   # in-process only

That bookkeeping lives solely in the writing process's memory. If the process dies before a member is trimmed (e.g. an autoscaled/serverless inference pod scaling down), the key is orphaned in Redis forever with TTL -1.
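
The failure mode above can be reproduced without a real Redis server. The following is a minimal sketch, assuming a tiny in-memory stand-in: `FakeRedis`, `OldStyleWriter`, and the key name are hypothetical and exist only for illustration. The `ttl()` replies follow the Redis convention of -2 for a missing key and -1 for a key with no expiry.

```python
import time


class FakeRedis:
    """Tiny in-memory stand-in for a Redis server (hypothetical, illustration only)."""

    def __init__(self):
        self.zsets = {}   # key -> {member: score}
        self.expiry = {}  # key -> absolute deadline; never populated in this sketch

    def zadd(self, key, mapping):
        self.zsets.setdefault(key, {}).update(mapping)

    def ttl(self, key):
        # Mirrors Redis TTL replies: -2 = no such key, -1 = key has no expiry.
        if key not in self.zsets:
            return -2
        if key not in self.expiry:
            return -1
        return max(0, int(self.expiry[key] - time.time()))


class OldStyleWriter:
    """Pre-fix behaviour as described above: expiry is tracked in-process only."""

    def __init__(self, client):
        self.client = client
        self.zexpires = {}

    def zadd(self, key, value, score, expire=None):
        self.client.zadd(key, {value: score})
        if expire:
            self.zexpires[(key, score)] = expire + time.time()  # never reaches the server


server = FakeRedis()
writer = OldStyleWriter(server)
writer.zadd("inference:pod-a:model/1", "record", 1.0, expire=120)

del writer  # the pod dies; its zexpires dict dies with it

orphan_ttl = server.ttl("inference:pod-a:model/1")
print(orphan_ttl)  # -1: the key still exists and will never expire
```

Nothing on the server side knows the key was meant to live for 120 seconds, so no other process can reclaim it.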

The model-monitoring buffer in ModelManager (core/managers/base.py) writes one record per inference into inference:<GLOBAL_INFERENCE_SERVER_ID>:<model_id> (plus error:<…> and a models registry zset) via this path with expire=METRICS_INTERVAL * 2. The intended lifetime is ~120s — just long enough for the PingbackInfo scheduler to read the trailing window and POST it to Roboflow. But because GLOBAL_INFERENCE_SERVER_ID is unique per pod and pods churn constantly under autoscaling, every pod death strands its keys (a surviving pod only reaps its own ids via its own zexpires).

Impact observed in production

The Redis cache filled up with model-monitoring keys and had to be cleared manually.

Fix

Issue ZADD + EXPIRE in a single pipeline so the key carries a real server-side sliding TTL:

if expire:
    with self.client.pipeline() as pipe:
        pipe.zadd(key, {value: score})
        pipe.expire(key, max(1, int(expire)))
        pipe.execute()
    self.zexpires[(key, score)] = expire + time.time()
else:
    self.client.zadd(key, {value: score})

A live model keeps re-arming the TTL on every write (and _expire() still trims individual members), so an actively-served key never expires out from under its readers.
A dead pod's keys self-reclaim ~expire seconds after the last write, regardless of process lifecycle — no more orphans.
The two commands run in one transactional pipeline (MULTI/EXEC), so there is never a window where the key exists without a TTL; a crash between the two commands cannot leave a TTL-less key behind.
max(1, int(expire)) guards against EXPIRE 0 (immediate delete) for sub-second values.
The in-process self.zexpires bookkeeping is retained as a best-effort fine-grained per-member trim (now an optimisation rather than the only safety net).
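
The sliding-TTL behaviour in the points above can be sketched with an in-memory stand-in for Redis and a manually advanced clock. `FakeRedis`, `zadd_with_ttl`, and the key name are hypothetical; this illustrates the semantics only, omits the pipeline, and is not the RedisCache implementation.

```python
class FakeRedis:
    """In-memory stand-in with a manually advanced clock (hypothetical)."""

    def __init__(self):
        self.now = 0.0
        self.zsets = {}     # key -> {member: score}
        self.deadline = {}  # key -> absolute expiry time

    def _evict(self, key):
        # Drop the whole key once its deadline has passed, as a server-side TTL would.
        if key in self.deadline and self.now >= self.deadline[key]:
            self.zsets.pop(key, None)
            self.deadline.pop(key, None)

    def zadd(self, key, mapping):
        self._evict(key)
        self.zsets.setdefault(key, {}).update(mapping)

    def expire(self, key, seconds):
        if key in self.zsets:
            self.deadline[key] = self.now + seconds  # each call re-arms the TTL

    def exists(self, key):
        self._evict(key)
        return key in self.zsets


def zadd_with_ttl(client, key, value, score, expire):
    """Write a member and re-arm the key's TTL, clamping sub-second values to 1."""
    client.zadd(key, {value: score})
    client.expire(key, max(1, int(expire)))


server = FakeRedis()
key = "inference:pod-a:model/1"

# A live pod writes at t=0, 60 and 120 with expire=120, then dies.
for t in (0, 60, 120):
    server.now = t
    zadd_with_ttl(server, key, f"record-{t}", float(t), expire=120)

server.now = 200
alive_at_200 = server.exists(key)  # deadline was re-armed to 120 + 120 = 240

server.now = 240
alive_at_240 = server.exists(key)  # deadline reached: the key reclaims itself

print(alive_at_200, alive_at_240)  # True False
```

The key outlives any single write's TTL while the pod is active, and disappears 120 seconds after the last write once the pod is gone.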

This is semantically what the existing code already intended — it just put the expiry in the wrong place (a Python dict instead of Redis). It applies to every zadd writer (model-monitoring, WebRTC session sets, etc.) for free; no caller changes needed.

Testing

New unit tests in tests/inference/unit_tests/core/cache/test_redis_cache.py:

expire set → pipelined ZADD + EXPIRE(key, 120) issued, bookkeeping recorded
sub-second expire clamped to a minimum of 1 (never EXPIRE 0)
no expire → plain ZADD, no pipeline / TTL / bookkeeping

tests/.../core/cache/test_redis_cache.py ......... 3 passed
tests/.../core/managers/test_base.py + test_metrics.py ... 41 passed

Existing model-monitoring manager tests pass unchanged (the zadd signature is untouched).
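
A minimal sketch of how the three cases listed above might be checked with `unittest.mock`. `CacheUnderTest` is a hypothetical stand-in that copies the `zadd()` body from the Fix section; the actual tests in test_redis_cache.py are not shown in this description and may be structured differently.

```python
import time
from unittest.mock import MagicMock


class CacheUnderTest:
    """Hypothetical stand-in mirroring the zadd() body from the Fix section."""

    def __init__(self, client):
        self.client = client
        self.zexpires = {}

    def zadd(self, key, value, score, expire=None):
        if expire:
            with self.client.pipeline() as pipe:
                pipe.zadd(key, {value: score})
                pipe.expire(key, max(1, int(expire)))
                pipe.execute()
            self.zexpires[(key, score)] = expire + time.time()
        else:
            self.client.zadd(key, {value: score})


def make_cache():
    client = MagicMock()
    # `with client.pipeline() as pipe` binds pipe to the __enter__ return value.
    pipe = client.pipeline.return_value.__enter__.return_value
    return CacheUnderTest(client), client, pipe


def test_expire_issues_pipelined_zadd_and_expire():
    cache, client, pipe = make_cache()
    cache.zadd("k", "v", 1.0, expire=120)
    pipe.zadd.assert_called_once_with("k", {"v": 1.0})
    pipe.expire.assert_called_once_with("k", 120)
    pipe.execute.assert_called_once_with()
    client.zadd.assert_not_called()
    assert ("k", 1.0) in cache.zexpires


def test_subsecond_expire_is_clamped_to_one():
    cache, _, pipe = make_cache()
    cache.zadd("k", "v", 1.0, expire=0.4)
    pipe.expire.assert_called_once_with("k", 1)


def test_no_expire_is_plain_zadd():
    cache, client, _ = make_cache()
    cache.zadd("k", "v", 1.0)
    client.zadd.assert_called_once_with("k", {"v": 1.0})
    client.pipeline.assert_not_called()
    assert cache.zexpires == {}


for test in (
    test_expire_issues_pipelined_zadd_and_expire,
    test_subsecond_expire_is_clamped_to_one,
    test_no_expire_is_plain_zadd,
):
    test()
```

Mocking the client keeps the tests independent of a running Redis server; they only verify which commands are issued, not how the server enforces the TTL.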

RedisCache.zadd() accepted an `expire` arg but never issued a Redis EXPIRE; it only recorded the intended expiry in the in-process `self.zexpires` dict, reaped by the `_expire()` daemon thread. That bookkeeping lives solely in the writing process's memory, so if the process dies before a member is trimmed (e.g. an autoscaled/serverless inference pod scaling down), the key is orphaned in Redis forever with TTL -1.

The model-monitoring buffer in ModelManager writes one record per inference into `inference:<server-id>:<model>` (and `error:<...>`, `models`) via this path with `expire=METRICS_INTERVAL*2`. Because the server id is unique per pod and pods churn constantly under autoscaling, every pod death stranded its keys. This was observed as ~45k keys / 7GB of dead zsets (e.g. a single `inference:<dead-id>:printed-doc/2` with 282k members / 133MB, last written 46 days prior) filling a shared backing store to its memory ceiling.

Issue ZADD + EXPIRE in one pipeline so the key carries a real sliding server-side TTL: a live model keeps re-arming it (and `_expire()` still trims individual members), while a dead pod's keys self-reclaim ~expire seconds after the last write regardless of process lifecycle. The in-process bookkeeping is retained as a best-effort fine-grained trim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


iurisilvio · 2026-06-01T13:13:44Z

The PR looks good. I'm commenting only because this is a common mistake: an EXPIRE issued after ZADD applies to the root key, not to each member. If we keep ZADDing, I think the root key will never expire.
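
The distinction raised in this comment can be sketched in memory: a key-level TTL never fires while writes keep re-arming it, so old members leave the set only if something trims them by score. This is a minimal illustration, assuming scores are write timestamps; `trim_older_than` is a hypothetical helper, and whether `_expire()` trims in exactly this way is not shown in this thread.

```python
def trim_older_than(zset, cutoff):
    """In-memory analogue of `ZREMRANGEBYSCORE key -inf (cutoff`.

    Removes every member whose score is strictly below `cutoff` and
    returns how many were removed.
    """
    stale = [member for member, score in zset.items() if score < cutoff]
    for member in stale:
        del zset[member]
    return len(stale)


window = 120  # seconds of history the reader needs (assumed value)
zset = {}

# A live writer adds one record every 10 s for 10 minutes (t = 0 .. 590).
# Each write would also re-arm the key TTL, so the key itself never expires.
for t in range(0, 600, 10):
    zset[f"record-{t}"] = float(t)

untrimmed = len(zset)
removed = trim_older_than(zset, cutoff=590 - window)

print(untrimmed, removed, len(zset))  # 60 47 13
```

Without the per-member trim, the set of a continuously served model would grow without bound even though its key carries a TTL.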

bigbitbus requested review from PawelPeczek-Roboflow, dkosowski87, grzegorz-roboflow, hansent, probicheaux, rafel-roboflow and yeldarby as code owners June 1, 2026 01:24

Merge remote-tracking branch 'origin/main' into fix/model-monitoring-…

d77f043

…zset-ttl-leak

bigbitbus wants to merge 2 commits into main from fix/model-monitoring-zset-ttl-leak
