Skip to content

fix(auth, cli): sliding session window + actionable check errors#212

Open
hashedone wants to merge 3 commits into
mainfrom
fix/check-resilience-and-sliding-session
Open

fix(auth, cli): sliding session window + actionable check errors#212
hashedone wants to merge 3 commits into
mainfrom
fix/check-resilience-and-sliding-session

Conversation

@hashedone
Copy link
Copy Markdown
Contributor

Fixes the bug reported today: an agent's plan ended in git push --draft-pr, the push got blocked because the TraceVault session token had expired, and Claude Code's only signal was an opaque "Stream failed" — so it hallucinated commands until a human stepped in with tracevault login.

Two complementary changes. They make the bug nearly impossible and, when it does happen, ensure the next agent can self-recover.

1. Server: sliding auth_sessions window

Today auth_sessions rows have expires_at = login_time + 30 days. The expiry is a hard cliff: it doesn't matter whether you used TV every day for 30 days or didn't use it at all, you get logged out the same way.

Replace the SELECT-only validity check in AuthUser and OrgAuth extractors with a single atomic statement that bumps the expiry forward on every successful authentication:

UPDATE auth_sessions
SET expires_at = NOW() + INTERVAL '30 days'
WHERE token_hash = $1 AND expires_at > NOW()
RETURNING user_id

Net effect: as long as a user touches TV once per 30 days, the token never expires. Genuinely-dormant tokens still hit the cliff and require tracevault login — the right tradeoff for inactive credentials.

No new endpoint, no schema change, no breaking change.

2. CLI: actionable error messages on tracevault check failure

A TV-tracked repo MUST be evaluated by TV on every push — that contract doesn't change. The fix here is purely the quality of the message the user/agent reads when the block fires.

connectivity_message() sniffs the raw error and appends a specific next step:

Error pattern Action surfaced
401 / Unauthorized Session token may be expired. Run \tracevault login --server-url=<server_url>` to refresh.`
403 / Forbidden Your token lacks access to this repo's policies. Check your org membership and rerun \tracevault login`.`
DNS / connect / timeout / connection refused Could not reach the TraceVault server. Check network and \server_url` in .tracevault/config.toml.`
5xx Server Error / Internal TraceVault server returned an error. The team has likely been paged; retry shortly.
Unrecognized Pass through the raw error; no fake advice.

An agent reading "run tracevault login --server-url=<server_url>" has a real chance of self-repair. The previous "Stream failed (500)" gave it nothing to work with.

Verification

Layer Status
cargo fmt --check
cargo clippy --workspace --tests --all-targets -- -D warnings
cargo test --workspace ✅ all green (30 → 30 CLI lib tests, 128 server lib tests, 16 enterprise + all repo tests)
New sliding_session_test ✅ 3/3 — extends-on-hit, no-revive-expired, no-op-on-unknown
New check::tests::connectivity_message_* ✅ 6/6 — covers 401, unauthorized-text, DNS, connection-refused, 403, fallback

What this PR does NOT change

  • The hook still exits non-zero on any TV-side failure. Repos opted into TV are evaluated on every push, full stop.
  • No new config field, no opt-out switch. Considered enforce_check = false to fail-open on connectivity errors; rejected during review as inconsistent with the enforcement contract.
  • Real policy violations (result.blocked from the server) still exit 1 with their existing detailed output. That path is untouched.

What's missing (future work)

Proper OAuth-style refresh tokens (short-lived access + long-lived refresh). The sliding window handles the active-user case correctly but inactive users still hit the 30-day cliff. Refresh tokens are the standard fix and remain on the roadmap; they're a half-day-plus slice that needs its own design pass.

hashedone added 3 commits May 28, 2026 12:20
Previously the only path to recover from an expired `auth_sessions` row
was `tracevault login`. Tokens hard-expire at 30 days regardless of
how active the user is — daily users hit the cliff just as easily as
inactive ones, with no signal until a push gets blocked.

Replace the SELECT-only validity check in both `AuthUser` and `OrgAuth`
extractors with a single `UPDATE auth_sessions SET expires_at = NOW() +
INTERVAL '30 days' WHERE token_hash =  AND expires_at > NOW()
RETURNING user_id`. Atomic: one round-trip, no race between the
validity check and the extension.

Net effect:
* Active users (anything that talks to TV once per 30 days) never see
  a token expiry again — the window slides forward on every hit.
* Inactive users still hit the 30-day expiry cliff and must re-login,
  which is the right security tradeoff for genuinely-dormant tokens.
* No new endpoint, no client change, no breaking-change to existing
  callers.

Three integration tests against a real Postgres pool exercise the SQL
directly: extend-on-hit, no-revive-of-already-expired, no-op-on-
unknown-token.
The pre-push hook runs `tracevault check` and exits 1 on any failure.
Previously that 1 came with an opaque message — typically
"Stream failed (500 Internal Server Error)" or "Server returned 401:
Invalid or expired token". Both an agent and a human reader were left
guessing what to do.

Adds `connectivity_message()` which sniffs the raw error string and
appends a specific next step:

* 401 / Unauthorized: "Session token may be expired. Run
  `tracevault login --server-url=<server_url>` to refresh."
* 403 / Forbidden: org-membership hint (not a re-login).
* DNS / connect / timeout / connection refused: network + server_url
  hint.
* 5xx Server Error / Internal: "team has been paged, retry shortly."
* Unknown shapes: pass through the raw error, no fake advice.

All connectivity-class errors still propagate as Err so the pre-push
hook exits non-zero — a TV-tracked repo MUST be evaluated by TV on
every push, that contract doesn't change. The fix here is purely the
quality of the message the user/agent reads after the block.

Six unit tests pin the message shapes for the common error patterns
the api_client surfaces today.
Same lockfile regeneration as PR #211 carried — release-plz updated
deps in v0.16.0 that this branch picks up. CI runs `cargo check
--locked`, so commit the regenerated lock.
// it, (c) read back the user_id — one round-trip, no race.
let session_row = sqlx::query_as::<_, (Uuid,)>(
"SELECT user_id FROM auth_sessions WHERE token_hash = $1 AND expires_at > NOW()",
"UPDATE auth_sessions
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing SELECT with UPDATE on every auth turns a read into a write for each request, increasing DB write load and contention; consider the performance cost of frequent writes.

Details

✨ AI Reasoning
​The code swaps a SELECT-only check for an UPDATE..RETURNING that extends expires_at on every successful auth. While functionally intended, UPDATE issues a write to the auth_sessions table on every authenticated request. For high-throughput endpoints or many short-lived requests, this converts cheap reads into writes, increasing I/O, WAL generation, and potential contention/latency. The pattern is visible and the impact scales with authentication frequency.

🔧 How do I fix it?
Move constant work outside loops. Use StringBuilder instead of string concatenation in loops. Cache compiled regex patterns. Use hash-based lookups instead of nested loops. Batch database operations instead of N+1 queries.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

Comment on lines +243 to +266
fn connectivity_message(raw: &str) -> String {
let lower = raw.to_ascii_lowercase();
let action = if lower.contains("401") || lower.contains("unauthorized") {
Some("Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh.")
} else if lower.contains("403") || lower.contains("forbidden") {
Some("Your token lacks access to this repo's policies. Check your org membership and rerun `tracevault login`.")
} else if lower.contains("dns")
|| lower.contains("connect")
|| lower.contains("timed out")
|| lower.contains("timeout")
|| lower.contains("connection refused")
{
Some("Could not reach the TraceVault server. Check network and `server_url` in .tracevault/config.toml.")
} else if lower.contains("5") && (lower.contains("server error") || lower.contains("internal"))
{
Some("TraceVault server returned an error. The team has likely been paged; retry shortly.")
} else {
None
};

match action {
Some(a) => format!("Policy check could not run: {raw}.\n → {a}"),
None => format!("Policy check could not run: {raw}."),
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connectivity_message builds an intermediate Option via a long if/else chain; return the formatted message immediately from each branch to flatten control flow (use early returns instead of action variable + match).

Show fix
Suggested change
fn connectivity_message(raw: &str) -> String {
let lower = raw.to_ascii_lowercase();
let action = if lower.contains("401") || lower.contains("unauthorized") {
Some("Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh.")
} else if lower.contains("403") || lower.contains("forbidden") {
Some("Your token lacks access to this repo's policies. Check your org membership and rerun `tracevault login`.")
} else if lower.contains("dns")
|| lower.contains("connect")
|| lower.contains("timed out")
|| lower.contains("timeout")
|| lower.contains("connection refused")
{
Some("Could not reach the TraceVault server. Check network and `server_url` in .tracevault/config.toml.")
} else if lower.contains("5") && (lower.contains("server error") || lower.contains("internal"))
{
Some("TraceVault server returned an error. The team has likely been paged; retry shortly.")
} else {
None
};
match action {
Some(a) => format!("Policy check could not run: {raw}.\n → {a}"),
None => format!("Policy check could not run: {raw}."),
}
fn connectivity_message(raw: &str) -> String {
if lower.contains("401") || lower.contains("unauthorized") {
return format!("Policy check could not run: {raw}.\n → Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh.");
}
if lower.contains("403") || lower.contains("forbidden") {
return format!("Policy check could not run: {raw}.\n → Your token lacks access to this repo's policies. Check your org membership and rerun `tracevault login`.");
}
if lower.contains("dns")
|| lower.contains("connect")
|| lower.contains("timed out")
|| lower.contains("timeout")
|| lower.contains("connection refused")
{
return format!("Policy check could not run: {raw}.\n → Could not reach the TraceVault server. Check network and `server_url` in .tracevault/config.toml.");
}
if lower.contains("5") && (lower.contains("server error") || lower.contains("internal")) {
return format!("Policy check could not run: {raw}.\n → TraceVault server returned an error. The team has likely been paged; retry shortly.");
}
format!("Policy check could not run: {raw}.")
Details

✨ AI Reasoning
​The function connectivity_message builds an intermediate Option via a long if/else-if chain and then matches that Option to format the output. This adds an extra mental indirection; using early returns (return formatted string as soon as a pattern matches) would flatten control flow and be clearer. The change introduced this new function and its conditional chain, increasing nesting and indirection in the codebase.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant