fix(auth, cli): sliding session window + actionable check errors by hashedone · Pull Request #212 · softwaremill/tracevault

hashedone · 2026-05-28T10:21:17Z

Fixes the bug reported today: an agent's plan ended in git push --draft-pr, the push got blocked because the TraceVault session token had expired, and Claude Code's only signal was an opaque "Stream failed" — so it hallucinated commands until a human stepped in with tracevault login.

Two complementary changes. They make the bug nearly impossible and, when it does happen, ensure the next agent can self-recover.

1. Server: sliding `auth_sessions` window

Today auth_sessions rows have expires_at = login_time + 30 days. The expiry is a hard cliff: it doesn't matter whether you used TV every day for 30 days or didn't use it at all, you get logged out the same way.

Replace the SELECT-only validity check in AuthUser and OrgAuth extractors with a single atomic statement that bumps the expiry forward on every successful authentication:

UPDATE auth_sessions
SET expires_at = NOW() + INTERVAL '30 days'
WHERE token_hash = $1 AND expires_at > NOW()
RETURNING user_id

Net effect: as long as a user touches TV once per 30 days, the token never expires. Genuinely-dormant tokens still hit the cliff and require tracevault login — the right tradeoff for inactive credentials.

No new endpoint, no schema change, no breaking change.

2. CLI: actionable error messages on `tracevault check` failure

A TV-tracked repo MUST be evaluated by TV on every push — that contract doesn't change. The fix here is purely the quality of the message the user/agent reads when the block fires.

connectivity_message() sniffs the raw error and appends a specific next step:

Error pattern	Action surfaced
401 / Unauthorized	`Session token may be expired. Run \`tracevault login --server-url=<server_url>` to refresh.`
403 / Forbidden	`Your token lacks access to this repo's policies. Check your org membership and rerun \`tracevault login`.`
DNS / connect / timeout / connection refused	`Could not reach the TraceVault server. Check network and \`server_url` in .tracevault/config.toml.`
5xx Server Error / Internal	`TraceVault server returned an error. The team has likely been paged; retry shortly.`
Unrecognized	Pass through the raw error; no fake advice.

An agent reading "run tracevault login --server-url=<server_url>" has a real chance of self-repair. The previous "Stream failed (500)" gave it nothing to work with.

Verification

Layer	Status
`cargo fmt --check`	✅
`cargo clippy --workspace --tests --all-targets -- -D warnings`	✅
`cargo test --workspace`	✅ all green (30 → 30 CLI lib tests, 128 server lib tests, 16 enterprise + all repo tests)
New `sliding_session_test`	✅ 3/3 — extends-on-hit, no-revive-expired, no-op-on-unknown
New `check::tests::connectivity_message_*`	✅ 6/6 — covers 401, unauthorized-text, DNS, connection-refused, 403, fallback

What this PR does NOT change

The hook still exits non-zero on any TV-side failure. Repos opted into TV are evaluated on every push, full stop.
No new config field, no opt-out switch. Considered enforce_check = false to fail-open on connectivity errors; rejected during review as inconsistent with the enforcement contract.
Real policy violations (result.blocked from the server) still exit 1 with their existing detailed output. That path is untouched.

What's missing (future work)

Proper OAuth-style refresh tokens (short-lived access + long-lived refresh). The sliding window handles the active-user case correctly but inactive users still hit the 30-day cliff. Refresh tokens are the standard fix and remain on the roadmap; they're a half-day-plus slice that needs its own design pass.

Previously the only path to recover from an expired `auth_sessions` row was `tracevault login`. Tokens hard-expire at 30 days regardless of how active the user is — daily users hit the cliff just as easily as inactive ones, with no signal until a push gets blocked. Replace the SELECT-only validity check in both `AuthUser` and `OrgAuth` extractors with a single `UPDATE auth_sessions SET expires_at = NOW() + INTERVAL '30 days' WHERE token_hash = AND expires_at > NOW() RETURNING user_id`. Atomic: one round-trip, no race between the validity check and the extension. Net effect: * Active users (anything that talks to TV once per 30 days) never see a token expiry again — the window slides forward on every hit. * Inactive users still hit the 30-day expiry cliff and must re-login, which is the right security tradeoff for genuinely-dormant tokens. * No new endpoint, no client change, no breaking-change to existing callers. Three integration tests against a real Postgres pool exercise the SQL directly: extend-on-hit, no-revive-of-already-expired, no-op-on- unknown-token.

The pre-push hook runs `tracevault check` and exits 1 on any failure. Previously that 1 came with an opaque message — typically "Stream failed (500 Internal Server Error)" or "Server returned 401: Invalid or expired token". Both an agent and a human reader were left guessing what to do. Adds `connectivity_message()` which sniffs the raw error string and appends a specific next step: * 401 / Unauthorized: "Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh." * 403 / Forbidden: org-membership hint (not a re-login). * DNS / connect / timeout / connection refused: network + server_url hint. * 5xx Server Error / Internal: "team has been paged, retry shortly." * Unknown shapes: pass through the raw error, no fake advice. All connectivity-class errors still propagate as Err so the pre-push hook exits non-zero — a TV-tracked repo MUST be evaluated by TV on every push, that contract doesn't change. The fix here is purely the quality of the message the user/agent reads after the block. Six unit tests pin the message shapes for the common error patterns the api_client surfaces today.

Same lockfile regeneration as PR #211 carried — release-plz updated deps in v0.16.0 that this branch picks up. CI runs `cargo check --locked`, so commit the regenerated lock.

aikido-pr-checks · 2026-05-28T10:21:47Z

+        // it, (c) read back the user_id — one round-trip, no race.
        let session_row = sqlx::query_as::<_, (Uuid,)>(
-            "SELECT user_id FROM auth_sessions WHERE token_hash = $1 AND expires_at > NOW()",
+            "UPDATE auth_sessions


Replacing SELECT with UPDATE on every auth turns a read into a write for each request, increasing DB write load and contention; consider the performance cost of frequent writes.

Details

✨ AI Reasoning
The code swaps a SELECT-only check for an UPDATE..RETURNING that extends expires_at on every successful auth. While functionally intended, UPDATE issues a write to the auth_sessions table on every authenticated request. For high-throughput endpoints or many short-lived requests, this converts cheap reads into writes, increasing I/O, WAL generation, and potential contention/latency. The pattern is visible and the impact scales with authentication frequency.

🔧 How do I fix it?
Move constant work outside loops. Use StringBuilder instead of string concatenation in loops. Cache compiled regex patterns. Use hash-based lookups instead of nested loops. Batch database operations instead of N+1 queries.

_{Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.}
_{Reply @AikidoSec ignore: [REASON] to ignore this issue.}
_{More info}

aikido-pr-checks · 2026-05-28T10:22:12Z

+fn connectivity_message(raw: &str) -> String {
+    let lower = raw.to_ascii_lowercase();
+    let action = if lower.contains("401") || lower.contains("unauthorized") {
+        Some("Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh.")
+    } else if lower.contains("403") || lower.contains("forbidden") {
+        Some("Your token lacks access to this repo's policies. Check your org membership and rerun `tracevault login`.")
+    } else if lower.contains("dns")
+        || lower.contains("connect")
+        || lower.contains("timed out")
+        || lower.contains("timeout")
+        || lower.contains("connection refused")
+    {
+        Some("Could not reach the TraceVault server. Check network and `server_url` in .tracevault/config.toml.")
+    } else if lower.contains("5") && (lower.contains("server error") || lower.contains("internal"))
+    {
+        Some("TraceVault server returned an error. The team has likely been paged; retry shortly.")
+    } else {
+        None
+    };
+
+    match action {
+        Some(a) => format!("Policy check could not run: {raw}.\n  → {a}"),
+        None => format!("Policy check could not run: {raw}."),
+    }


connectivity_message builds an intermediate Option via a long if/else chain; return the formatted message immediately from each branch to flatten control flow (use early returns instead of action variable + match).

Show fix

Suggested change

fn connectivity_message(raw: &str) -> String {

let lower = raw.to_ascii_lowercase();

let action = if lower.contains("401") || lower.contains("unauthorized") {

Some("Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh.")

} else if lower.contains("403") || lower.contains("forbidden") {

Some("Your token lacks access to this repo's policies. Check your org membership and rerun `tracevault login`.")

} else if lower.contains("dns")

|| lower.contains("connect")

|| lower.contains("timed out")

|| lower.contains("timeout")

|| lower.contains("connection refused")

{

Some("Could not reach the TraceVault server. Check network and `server_url` in .tracevault/config.toml.")

} else if lower.contains("5") && (lower.contains("server error") || lower.contains("internal"))

{

Some("TraceVault server returned an error. The team has likely been paged; retry shortly.")

} else {

None

};

match action {

Some(a) => format!("Policy check could not run: {raw}.\n → {a}"),

None => format!("Policy check could not run: {raw}."),

}

fn connectivity_message(raw: &str) -> String {

if lower.contains("401") || lower.contains("unauthorized") {

return format!("Policy check could not run: {raw}.\n → Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh.");

}

if lower.contains("403") || lower.contains("forbidden") {

return format!("Policy check could not run: {raw}.\n → Your token lacks access to this repo's policies. Check your org membership and rerun `tracevault login`.");

}

if lower.contains("dns")

|| lower.contains("connect")

|| lower.contains("timed out")

|| lower.contains("timeout")

|| lower.contains("connection refused")

{

return format!("Policy check could not run: {raw}.\n → Could not reach the TraceVault server. Check network and `server_url` in .tracevault/config.toml.");

}

if lower.contains("5") && (lower.contains("server error") || lower.contains("internal")) {

return format!("Policy check could not run: {raw}.\n → TraceVault server returned an error. The team has likely been paged; retry shortly.");

}

format!("Policy check could not run: {raw}.")

Details

✨ AI Reasoning
The function connectivity_message builds an intermediate Option via a long if/else-if chain and then matches that Option to format the output. This adds an extra mental indirection; using early returns (return formatted string as soon as a pattern matches) would flatten control flow and be clearer. The change introduced this new function and its conditional chain, increasing nesting and indirection in the codebase.

_{Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.}
_{Reply @AikidoSec ignore: [REASON] to ignore this issue.}
_{More info}

hashedone added 3 commits May 28, 2026 12:20

chore: sync Cargo.lock after v0.16.0 dep churn

fc1ba1b

Same lockfile regeneration as PR #211 carried — release-plz updated deps in v0.16.0 that this branch picks up. CI runs `cargo check --locked`, so commit the regenerated lock.

aikido-pr-checks Bot reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(auth, cli): sliding session window + actionable check errors#212

fix(auth, cli): sliding session window + actionable check errors#212
hashedone wants to merge 3 commits into
mainfrom
fix/check-resilience-and-sliding-session

hashedone commented May 28, 2026

Uh oh!

aikido-pr-checks Bot May 28, 2026

Uh oh!

aikido-pr-checks Bot May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hashedone commented May 28, 2026

1. Server: sliding auth_sessions window

2. CLI: actionable error messages on tracevault check failure

Verification

What this PR does NOT change

What's missing (future work)

Uh oh!

aikido-pr-checks Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

aikido-pr-checks Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Server: sliding `auth_sessions` window

2. CLI: actionable error messages on `tracevault check` failure