401 auth-failure handling: bring lifecycle to full Node parity (post recost-dev/middleware-node#16) #32

@AndresL230

Description

Severity: Medium
Affected repos: `middleware-python` (this), `middleware-node` (reference)
Component boundary: middleware cloud transport / 401 lifecycle parity

Context

`recost-dev/middleware-node#16` (closed by PR #32) added the full 401-auth-failure lifecycle to the Node SDK: typed `RecostAuthError` / `RecostFatalAuthError` through `onError`, configurable threshold (`maxConsecutiveAuthFailures`, default 5), one-time stderr warning on first 401, second distinct stderr line at fatal-suspend, suspended-state silent no-op, and counter reset on every non-401 outcome (success, non-401 4xx, 5xx-after-retries, network throw).

Python's `recost/_types.py` (lines 189–209) already declares `RecostError` / `RecostAuthError` / `RecostFatalAuthError` with the same constructor shape. `recost/_transport.py` already wires partial escalation: it increments `_consecutive_auth_failures`, fires a one-time stderr warning, dispatches the typed errors through `on_error`, sets `_suspended` at threshold, and short-circuits `send()` when suspended. So the classes are wired — the remaining work is bringing the lifecycle to full Node parity.

Gaps vs Node (the actual work)

  1. Threshold is hardcoded to 5 (`recost/_transport.py:437`). Node exposes it as the configurable `maxConsecutiveAuthFailures` (default 5). Mirror that in Python as `max_consecutive_auth_failures: Optional[int] = 5` on `RecostConfig`, threaded through `_resolve_config`.
  2. No second stderr line at fatal-suspend. Node emits a distinct `[recost] cloud transport suspended after N consecutive auth failures. Restart the process after rotating apiKey.` line at the threshold; Python emits only the first-401 line. Add the second line.
  3. Counter does not reset on every non-401 outcome. Currently resets only on 2xx success (`_transport.py:349`). Must also reset on:
    • non-401 4xx (403/404/422) — fall-through path after `_handle_cloud_result` returns False
    • 5xx after retries-exhausted — `_post_cloud` may return error result OR the catch path runs
    • Network throw — the `except` block (`_transport.py:374-375`)
      The literal reading of "consecutive 401s" requires that any non-401 outcome resets, so transient outages do not accumulate toward the threshold.
  4. stderr text format diverges from Node, and cross-SDK log-grep parity matters. Python says `"Recost: API rejected key (401). Telemetry will be dropped."` while Node says `"[recost] HTTP 401 — API key rejected. Telemetry will stop after N consecutive failures."` Align Python to the Node format: `[recost] HTTP 401 — API key rejected. Telemetry will stop after {N} consecutive failures. Check your api_key at https://recost.dev/dashboard/account.`
  5. Dead 401 branch in `_report_rejection` (`_transport.py:395-396`): never reached because `_handle_cloud_result` returns True for 401. Remove it as cleanup, mirroring Node PR #32, Task 5 step 6.
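
A minimal sketch of the target lifecycle for gaps 1 to 4, to make the reset and suspend rules concrete. This is not the existing `_transport.py` code: `AuthFailureTracker`, `record`, and the `stderr` parameter are hypothetical names for illustration, and treating `None` as "use the default of 5" is an assumption. The two message strings are copied from the text above.

```python
import sys
from typing import Optional, TextIO

DEFAULT_MAX_AUTH_FAILURES = 5  # default cited for the Node SDK


class AuthFailureTracker:
    """Hypothetical helper: counts consecutive 401s, suspends at a threshold."""

    def __init__(
        self,
        max_consecutive_auth_failures: Optional[int] = DEFAULT_MAX_AUTH_FAILURES,
        stderr: Optional[TextIO] = None,
    ) -> None:
        # Assumption: None falls back to the default threshold.
        if max_consecutive_auth_failures is None:
            max_consecutive_auth_failures = DEFAULT_MAX_AUTH_FAILURES
        self.threshold = max_consecutive_auth_failures
        self.consecutive = 0
        self.suspended = False
        self._warned = False
        self._stderr = stderr if stderr is not None else sys.stderr

    def record(self, status: Optional[int]) -> None:
        """Record the final outcome of one send.

        `status` is the final HTTP status (after retries), or None when the
        request raised before any response arrived (network throw).
        """
        if self.suspended:
            return  # suspended state is a silent no-op

        if status != 401:
            # Success, non-401 4xx, 5xx after retries, or network throw:
            # every non-401 outcome resets the counter.
            self.consecutive = 0
            return

        self.consecutive += 1
        if not self._warned:
            # One-time warning on the first 401 seen by this process.
            self._warned = True
            print(
                "[recost] HTTP 401 — API key rejected. Telemetry will stop "
                f"after {self.threshold} consecutive failures. Check your "
                "api_key at https://recost.dev/dashboard/account.",
                file=self._stderr,
            )
        if self.consecutive >= self.threshold:
            # Second, distinct line at fatal-suspend.
            self.suspended = True
            print(
                "[recost] cloud transport suspended after "
                f"{self.threshold} consecutive auth failures. "
                "Restart the process after rotating apiKey.",
                file=self._stderr,
            )
```

In the real transport the same decisions would presumably live next to `_consecutive_auth_failures` and `_suspended`, with the typed errors still dispatched through `on_error`; the sketch leaves that dispatch out to keep the counter rules visible.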

Reference

The Node spec's "Decisions and rationale" table and "Lifecycle table" carry over directly; the Python work is a translation, not a redesign.

Out of scope (file separately if relevant)
