feat(auth-cache): add hit/miss/eviction metrics to authentication cache #281
Merged
Instrument AsyncTTLCache so cache effectiveness is observable per flow. Reads emit auth_cache.access tagged cache:<flow> + result:hit|miss_expired|miss_absent; capacity evictions emit auth_cache.eviction. Emission mirrors the dual OTel + StatsD pattern in db_metrics.py and no-ops when neither backend is configured. This makes it possible to distinguish a low hit rate caused by a short TTL (miss_expired) from one caused by non-repeating keys or a cold per-worker cache (miss_absent), and to surface capacity-driven eviction.
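The three read outcomes can be illustrated with a minimal sketch. This is not the code in authentication_cache.py: the class name TTLCacheSketch, the record callback, and the injectable clock are hypothetical stand-ins, and whether the real cache deletes an expired entry on read is an assumption.

```python
import asyncio
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical recorder signature: (cache_name, result). It stands in for the
# real metrics helper, which is not reproduced here.
Recorder = Callable[[str, str], None]


class TTLCacheSketch:
    """Toy async TTL cache that classifies every read as one of three outcomes."""

    def __init__(
        self,
        name: str,
        ttl_seconds: float,
        record: Recorder,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.name = name
        self.ttl = ttl_seconds
        self._record = record
        self._clock = clock  # injectable so expiry can be tested deterministically
        self._data: Dict[str, Tuple[Any, float]] = {}  # key -> (value, stored_at)
        self._lock = asyncio.Lock()

    async def get(self, key: str) -> Optional[Any]:
        async with self._lock:
            entry = self._data.get(key)
            if entry is None:
                # Key was never stored (or was already removed).
                self._record(self.name, "miss_absent")
                return None
            value, stored_at = entry
            if self._clock() - stored_at > self.ttl:
                # Key exists but outlived its TTL (assumed: dropped on read).
                del self._data[key]
                self._record(self.name, "miss_expired")
                return None
            self._record(self.name, "hit")
            return value

    async def set(self, key: str, value: Any) -> None:
        async with self._lock:
            self._data[key] = (value, self._clock())
```

Passing a fake clock lets a test move time forward explicitly instead of sleeping, which is one way to make the miss_expired case deterministic.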
- Guard metric emission so instrumentation never raises into the caller. record_cache_access/record_cache_eviction now wrap their full body in try/except (centralized in the emitter rather than at each call site in get()/set()), matching the defensive pattern in db_metrics.py. A StatsD UDP error or OTel SDK fault can no longer propagate up the critical auth path.
- Add a test asserting both emitters swallow backend errors.
NiteshDhanpal
approved these changes
Jun 5, 2026
Summary
The in-process authentication/authorization cache (AsyncTTLCache in authentication_cache.py) currently emits no metrics: cache hits log at debug (invisible in prod), get_cache_stats() is never called, and there is no way to measure hit rate. This PR adds counters so cache effectiveness is observable per flow.
This is instrumentation only; it does not change caching behavior. The goal is to get evidence before tuning anything.
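As an illustration of what "evidence before tuning" could look like, the per-result counts for one cache can be reduced to a hit rate and a dominant miss cause. This is a sketch only: summarize_access and its input shape (a dict keyed by the result tag values) are hypothetical and not part of this PR.

```python
from typing import Dict, Optional


def summarize_access(counts: Dict[str, int]) -> Dict[str, Optional[object]]:
    """Reduce per-result counts for one cache to a hit rate and dominant miss type."""
    hits = counts.get("hit", 0)
    expired = counts.get("miss_expired", 0)
    absent = counts.get("miss_absent", 0)
    total = hits + expired + absent

    # No traffic means no meaningful hit rate.
    hit_rate = hits / total if total else None

    # Only name a dominant miss cause when one type strictly outnumbers the other.
    if expired == absent:
        dominant = None
    else:
        dominant = "miss_expired" if expired > absent else "miss_absent"

    return {"total": total, "hit_rate": hit_rate, "dominant_miss": dominant}
```

A result dominated by miss_expired points at the TTL; one dominated by miss_absent points at non-repeating keys or a cold per-worker cache.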
What it adds
- src/utils/cache_metrics.py: a dual-emit helper mirroring the OTel + StatsD pattern in db_metrics.py. Emits through OpenTelemetry when OTEL_EXPORTER_OTLP_ENDPOINT is set, through StatsD when DD_AGENT_HOST is set, and no-ops when neither is configured.
  - auth_cache.access: counter, tags cache:<flow> + result:hit|miss_expired|miss_absent
  - auth_cache.eviction: counter, tag cache:<flow>
- authentication_cache.py: each AsyncTTLCache now carries a name (agent_identity, agent_api_key, auth_gateway, authorization_check). get() records the three read outcomes; set() records only genuine LRU evictions (not key updates, not TTL expiry).
Why the 3-way miss classification
Splitting misses into miss_expired vs miss_absent is what makes the metric diagnostic rather than just a hit-rate number:
- miss_expired dominating → TTL too short for the request rate.
- miss_absent dominating → keys don't repeat (e.g. unique credentials per request) or the per-worker cache is cold.
- auth_cache.eviction → working set exceeds max_size.
Example hit-rate query once deployed:
sum:auth_cache.access{result:hit} by {cache} / sum:auth_cache.access{*} by {cache}
Testing
- tests/unit/api/test_authentication_cache_metrics.py: outcome classification (hit / miss_expired / miss_absent), deterministic expiry, eviction-vs-update distinction, per-cache name assignment.
- tests/unit/utils/test_cache_metrics.py: emitter no-op safety when unconfigured + StatsD emission path.
- ruff check / ruff format clean.
Greptile Summary
This PR adds hit/miss/eviction counters to the in-process authentication cache. It is pure instrumentation; no caching behavior is changed. Both OTel and StatsD emit paths follow the established db_metrics.py dual-emit pattern, and all metric calls are wrapped in try/except so failures cannot propagate to the critical auth path.
- cache_metrics.py (new): lazily initialises OTel counters and emits auth_cache.access (tagged cache + result) and auth_cache.eviction (tagged cache) through OTel and/or StatsD, no-ops when neither backend is configured.
- authentication_cache.py: AsyncTTLCache gains a name parameter; get() now records one of three outcomes (hit, miss_expired, miss_absent) and set() records genuine LRU evictions only (not key updates or TTL expiry).
Confidence Score: 5/5
Safe to merge — instrumentation only, no caching behaviour changed, and metric failures are fully isolated from the auth path.
All metric calls are wrapped in try/except inside the helper functions, so no emission failure can propagate to callers. Tag values are bounded string constants with no cardinality risk. The LRU-eviction guard (key not in cache) correctly avoids false eviction counts on updates. Tests cover all three read outcomes deterministically and verify the no-op and StatsD paths.
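The eviction guard mentioned above can be sketched as follows. This is a simplified, synchronous illustration: BoundedStoreSketch and on_evict are hypothetical names, the lock is omitted, and ordering here is by last write only, which may differ from the real cache's LRU bookkeeping.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Tuple


class BoundedStoreSketch:
    """Toy bounded store that reports an eviction only when a new key displaces an old one."""

    def __init__(self, name: str, max_size: int, on_evict: Callable[[str], None]) -> None:
        self.name = name
        self.max_size = max_size
        self._on_evict = on_evict  # hypothetical hook standing in for the eviction metric
        self._data: "OrderedDict[str, Tuple[Any, float]]" = OrderedDict()

    def set(self, key: str, value: Any) -> None:
        # Guard: only a brand-new key arriving at a full store causes an eviction.
        # Overwriting an existing key does not shrink the working set.
        if key not in self._data and len(self._data) >= self.max_size:
            self._data.popitem(last=False)  # drop the oldest-written entry
            self._on_evict(self.name)
        self._data[key] = (value, time.monotonic())
        self._data.move_to_end(key)  # mark as most recently written

    def __contains__(self, key: str) -> bool:
        return key in self._data
```

Without the `key not in self._data` check, every update to a full cache would be counted as an eviction and would inflate auth_cache.eviction.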
No files require special attention.
Important Files Changed
name parameter; record_cache_access/eviction calls added in get() and set(); metric functions are self-contained with try/except so no exception can escape through the cache API.
Sequence Diagram
sequenceDiagram
    participant Caller as Auth Middleware
    participant Cache as AsyncTTLCache
    participant Metrics as cache_metrics.py
    participant OTel as OpenTelemetry
    participant StatsD as StatsD (DD Agent)
    Caller->>Cache: get(key)
    activate Cache
    Cache->>Cache: async with _lock
    alt key not in cache
        Cache->>Metrics: record_cache_access(name, "miss_absent")
        Cache-->>Caller: None
    else key expired
        Cache->>Metrics: record_cache_access(name, "miss_expired")
        Cache-->>Caller: None
    else "key present & fresh"
        Cache->>Metrics: record_cache_access(name, "hit")
        Cache-->>Caller: value
    end
    deactivate Cache
    activate Metrics
    Metrics->>Metrics: _ensure_instruments() [lazy init]
    opt OTel configured
        Metrics->>OTel: "counter.add(1, {cache, result})"
    end
    opt DD_AGENT_HOST set
        Metrics->>StatsD: statsd.increment("auth_cache.access", tags)
    end
    Metrics->>Metrics: swallow any Exception
    deactivate Metrics
    Caller->>Cache: set(key, value)
    activate Cache
    Cache->>Cache: async with _lock
    opt cache full AND key is new
        Cache->>Cache: popitem (LRU eviction)
        Cache->>Metrics: record_cache_eviction(name)
    end
    Cache->>Cache: store (value, timestamp)
    Cache-->>Caller: None
    deactivate Cache
Reviews (3): Last reviewed commit: "Merge branch 'main' into feat/auth-cache..."