feat(auth-cache): add hit/miss/eviction metrics to authentication cache #281

Merged: smoreinis merged 3 commits into main from feat/auth-cache-metrics on Jun 5, 2026
Conversation

@smoreinis
Collaborator

smoreinis commented Jun 5, 2026

Summary

The in-process authentication/authorization cache (AsyncTTLCache in authentication_cache.py) currently emits no metrics — cache hits log at debug (invisible in prod), get_cache_stats() is never called, and there is no way to measure hit rate. This PR adds counters so cache effectiveness is observable per flow.

This is instrumentation only — it does not change caching behavior. The goal is to get evidence before tuning anything.

What it adds

  • src/utils/cache_metrics.py — a dual-emit helper mirroring the OTel + StatsD pattern in db_metrics.py. Emits through OpenTelemetry when OTEL_EXPORTER_OTLP_ENDPOINT is set, through StatsD when DD_AGENT_HOST is set, and no-ops when neither is configured.
    • auth_cache.access — counter, tags cache:<flow> + result:hit|miss_expired|miss_absent
    • auth_cache.eviction — counter, tag cache:<flow>
  • authentication_cache.py — each AsyncTTLCache now carries a name (agent_identity, agent_api_key, auth_gateway, authorization_check). get() records the three read outcomes; set() records only genuine LRU evictions (not key updates, not TTL expiry).
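
The backend selection and the StatsD wire format described above can be sketched roughly as follows. This is an illustration only, not the contents of cache_metrics.py: `select_backends` and `dogstatsd_counter_line` are hypothetical names, and the real helper presumably uses the OTel SDK and a StatsD client rather than formatting datagrams by hand.

```python
from typing import Mapping


def select_backends(env: Mapping[str, str]) -> list[str]:
    """Return which backends would emit, based on the two env vars named above.

    Hypothetical helper: the real module's selection logic is not shown in this PR page.
    """
    backends = []
    if env.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        backends.append("otel")
    if env.get("DD_AGENT_HOST"):
        backends.append("statsd")
    return backends  # an empty list means every record call is a no-op


def dogstatsd_counter_line(metric: str, tags: Mapping[str, str], value: int = 1) -> str:
    """Format one counter increment in the DogStatsD datagram format."""
    tag_part = ",".join(f"{k}:{v}" for k, v in tags.items())
    return f"{metric}:{value}|c|#{tag_part}"
```

With neither variable set, `select_backends({})` returns an empty list, which corresponds to the no-op case.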

Why the 3-way miss classification

Splitting misses into miss_expired vs miss_absent is what makes the metric diagnostic rather than just a hit-rate number:

  • miss_expired dominating → TTL too short for the request rate.
  • miss_absent dominating → keys don't repeat (e.g. unique credentials per request) or the per-worker cache is cold.
  • nonzero auth_cache.eviction → working set exceeds max_size.
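
A minimal sketch of how a read path might produce the three outcomes, assuming entries are stored as (value, timestamp) pairs. `SketchTTLCache`, its constructor arguments, the injectable `clock`, and the choice to delete an entry on expiry are all assumptions made for illustration; they are not taken from AsyncTTLCache.

```python
import asyncio
import time
from typing import Any, Callable, Optional


class SketchTTLCache:
    """Illustrative stand-in for AsyncTTLCache; only the read classification matters."""

    def __init__(self, name: str, ttl_seconds: float,
                 record_access: Callable[[str, str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.name = name
        self._ttl = ttl_seconds
        self._record_access = record_access  # called as record_access(cache_name, result)
        self._clock = clock                  # injectable so expiry is deterministic in tests
        self._data: dict = {}
        self._lock = asyncio.Lock()

    async def set(self, key: Any, value: Any) -> None:
        async with self._lock:
            self._data[key] = (value, self._clock())

    async def get(self, key: Any) -> Optional[Any]:
        async with self._lock:
            entry = self._data.get(key)
            if entry is None:
                self._record_access(self.name, "miss_absent")
                return None
            value, stored_at = entry
            if self._clock() - stored_at >= self._ttl:
                del self._data[key]  # assumption: expired entries are dropped on read
                self._record_access(self.name, "miss_expired")
                return None
            self._record_access(self.name, "hit")
            return value


async def demo() -> list:
    now = [0.0]
    seen: list = []
    cache = SketchTTLCache("agent_identity", 10.0,
                           lambda name, result: seen.append(result),
                           clock=lambda: now[0])
    await cache.get("k")       # nothing stored yet
    await cache.set("k", "v")
    await cache.get("k")       # stored and fresh
    now[0] = 11.0              # advance the fake clock past the TTL
    await cache.get("k")       # stored but stale
    return seen
```

Running `asyncio.run(demo())` records one outcome per read, in order: absent, hit, expired.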

Example hit-rate query once deployed:
sum:auth_cache.access{result:hit} by {cache} / sum:auth_cache.access{*} by {cache}
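
The same ratio, and the interpretation of the two miss types, can be worked through offline on a set of counter totals. The functions and the numbers below are made up for illustration and are not part of the PR.

```python
from typing import Mapping, Optional


def hit_rate(access_counts: Mapping[str, int]) -> Optional[float]:
    """hits / all reads, mirroring the query above; None when there were no reads."""
    total = sum(access_counts.values())
    if total == 0:
        return None
    return access_counts.get("hit", 0) / total


def dominant_miss(access_counts: Mapping[str, int]) -> Optional[str]:
    """Return whichever miss type is larger, or None if they are equal."""
    expired = access_counts.get("miss_expired", 0)
    absent = access_counts.get("miss_absent", 0)
    if expired == absent:
        return None
    return "miss_expired" if expired > absent else "miss_absent"


# Hypothetical totals for one cache over some window.
counts = {"hit": 60, "miss_expired": 30, "miss_absent": 10}
```

For these totals the hit rate is 60 / 100, and `miss_expired` dominates, which by the reasoning above would point at the TTL rather than at non-repeating keys.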

Testing

  • tests/unit/api/test_authentication_cache_metrics.py — outcome classification (hit / miss_expired / miss_absent), deterministic expiry, eviction-vs-update distinction, per-cache name assignment.
  • tests/unit/utils/test_cache_metrics.py — emitter no-op safety when unconfigured + StatsD emission path.
  • 9 new tests, all passing. ruff check / ruff format clean.

Greptile Summary

This PR adds hit/miss/eviction counters to the in-process authentication cache. It is pure instrumentation — no caching behavior is changed. Both OTel and StatsD emit paths follow the established db_metrics.py dual-emit pattern, and all metric calls are wrapped in try/except so failures cannot propagate to the critical auth path.

  • cache_metrics.py (new): lazily initialises OTel counters and emits auth_cache.access (tagged cache + result) and auth_cache.eviction (tagged cache) through OTel and/or StatsD, no-ops when neither backend is configured.
  • authentication_cache.py: AsyncTTLCache gains a name parameter; get() now records one of three outcomes (hit, miss_expired, miss_absent) and set() records genuine LRU evictions only (not key updates or TTL expiry).
  • Tests: 9 new unit tests cover all outcome classifications, deterministic expiry, eviction-vs-update distinction, no-op safety, StatsD emission, and per-cache name assignment.

Confidence Score: 5/5

Safe to merge — instrumentation only, no caching behaviour changed, and metric failures are fully isolated from the auth path.

All metric calls are wrapped in try/except inside the helper functions, so no emission failure can propagate to callers. Tag values are bounded string constants with no cardinality risk. The LRU-eviction guard (key not in cache) correctly avoids false eviction counts on updates. Tests cover all three read outcomes deterministically and verify the no-op and StatsD paths.
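
The eviction guard mentioned above can be sketched on a plain `OrderedDict`. This assumes the oldest entry is the one dropped; the function name and signature are hypothetical and the real set() is not reproduced here.

```python
from collections import OrderedDict
from typing import Any, Callable


def set_with_eviction_metric(data: "OrderedDict[Any, Any]", key: Any, value: Any,
                             max_size: int, record_eviction: Callable[[], None]) -> None:
    """Store key/value, counting an eviction only when a NEW key displaces an old one."""
    if key not in data and len(data) >= max_size:
        data.popitem(last=False)  # drop the entry at the front (oldest in this sketch)
        record_eviction()
    data[key] = value
    data.move_to_end(key)  # updates refresh the key's position without counting as evictions


data: "OrderedDict[str, int]" = OrderedDict()
evictions: list = []
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    set_with_eviction_metric(data, k, v, max_size=2,
                             record_eviction=lambda: evictions.append(1))
```

In this run the third call updates an existing key at full capacity and records nothing; only the fourth call, which introduces a new key, records an eviction.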

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| agentex/src/utils/cache_metrics.py | New metrics helper; follows db_metrics.py pattern with lazy OTel init, StatsD fallback, and full try/except error isolation. Tag values are bounded constants so no cardinality risk. |
| agentex/src/api/authentication_cache.py | AsyncTTLCache gains a name parameter; record_cache_access/eviction calls added in get() and set(). Metric functions are self-contained with try/except so no exception can escape through the cache API. |
| agentex/tests/unit/api/test_authentication_cache_metrics.py | Comprehensive tests for all three read outcomes, deterministic expiry backdating, eviction vs update distinction, and per-cache name assignment. |
| agentex/tests/unit/utils/test_cache_metrics.py | Tests the no-op and StatsD emission paths; correctly patches module-level state to isolate each test. |

Sequence Diagram

sequenceDiagram
    participant Caller as Auth Middleware
    participant Cache as AsyncTTLCache
    participant Metrics as cache_metrics.py
    participant OTel as OpenTelemetry
    participant StatsD as StatsD (DD Agent)

    Caller->>Cache: get(key)
    activate Cache
    Cache->>Cache: async with _lock
    alt key not in cache
        Cache->>Metrics: record_cache_access(name, "miss_absent")
        Cache-->>Caller: None
    else key expired
        Cache->>Metrics: record_cache_access(name, "miss_expired")
        Cache-->>Caller: None
    else "key present & fresh"
        Cache->>Metrics: record_cache_access(name, "hit")
        Cache-->>Caller: value
    end
    deactivate Cache

    activate Metrics
    Metrics->>Metrics: _ensure_instruments() [lazy init]
    opt OTel configured
        Metrics->>OTel: "counter.add(1, {cache, result})"
    end
    opt DD_AGENT_HOST set
        Metrics->>StatsD: statsd.increment("auth_cache.access", tags)
    end
    Metrics->>Metrics: swallow any Exception
    deactivate Metrics

    Caller->>Cache: set(key, value)
    activate Cache
    Cache->>Cache: async with _lock
    opt cache full AND key is new
        Cache->>Cache: popitem (LRU eviction)
        Cache->>Metrics: record_cache_eviction(name)
    end
    Cache->>Cache: store (value, timestamp)
    Cache-->>Caller: None
    deactivate Cache

Reviews (3): Last reviewed commit: "Merge branch 'main' into feat/auth-cache..."

Instrument AsyncTTLCache so cache effectiveness is observable per flow.
Reads emit auth_cache.access tagged cache:<flow> + result:hit|miss_expired|
miss_absent; capacity evictions emit auth_cache.eviction. Emission mirrors the
dual OTel + StatsD pattern in db_metrics.py and no-ops when neither backend is
configured.

This makes it possible to distinguish a low hit rate caused by a short TTL
(miss_expired) from one caused by non-repeating keys or a cold per-worker cache
(miss_absent), and to surface capacity-driven eviction.
smoreinis requested a review from a team as a code owner June 5, 2026 17:54
Comment thread agentex/src/api/authentication_cache.py
Comment thread agentex/src/api/authentication_cache.py
- Guard metric emission so instrumentation never raises into the caller.
  record_cache_access/record_cache_eviction now wrap their full body in
  try/except (centralized in the emitter rather than at each call site in
  get()/set()), matching the defensive pattern in db_metrics.py. A StatsD UDP
  error or OTel SDK fault can no longer propagate up the critical auth path.
- Add a test asserting both emitters swallow backend errors.
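
A rough sketch of what guarding the emitter's full body looks like. The `backends` parameter and the `_sketch` suffix are hypothetical, introduced so the example is self-contained; the real record_cache_access is described as taking the cache name and result only.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def record_cache_access_sketch(cache: str, result: str,
                               backends: Iterable[Callable[[str, dict], None]]) -> None:
    """Emit one access count; never raise into the caller."""
    try:
        tags = {"cache": cache, "result": result}
        for emit in backends:
            emit("auth_cache.access", tags)
    except Exception:
        # Swallow everything: a metrics fault must not fail an auth request.
        logger.debug("auth cache metric emission failed", exc_info=True)


def failing_backend(metric: str, tags: dict) -> None:
    raise OSError("simulated StatsD UDP error")


received: list = []
record_cache_access_sketch("auth_gateway", "hit", [failing_backend])
record_cache_access_sketch("auth_gateway", "hit",
                           [lambda metric, tags: received.append((metric, tags))])
```

Because the guard wraps the whole body, a failure in one backend also skips any backends after it in this sketch; wrapping each emit call separately would avoid that at the cost of more boilerplate.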
smoreinis enabled auto-merge (squash) June 5, 2026 20:13
smoreinis merged commit 61611e1 into main Jun 5, 2026
30 checks passed
smoreinis deleted the feat/auth-cache-metrics branch June 5, 2026 20:18

2 participants