Skip to content

Wire authserver DCR resolver and add structured logs#5044

Open
tgrunnagle wants to merge 7 commits intomainfrom
issue_5039_dcr-2c
Open

Wire authserver DCR resolver and add structured logs#5044
tgrunnagle wants to merge 7 commits intomainfrom
issue_5039_dcr-2c

Conversation

@tgrunnagle
Copy link
Copy Markdown
Contributor

@tgrunnagle tgrunnagle commented Apr 23, 2026

Summary

  • Why: resolveDCRCredentials was wired up and unit-tested in the prior sub-issue, but EmbeddedAuthServer never called it, so any upstream configured with DCRConfig still booted without a client. This PR makes the resolver load-bearing and adds the operator-visible structured logs the Phase 2 spec calls for (cache-hit age, stale-registration warning, per-step errors, and a remediation hint when an upstream rejects a cached client as 4xx).
  • What:
    • EmbeddedAuthServer now owns an in-memory DCRCredentialStore and calls resolveDCRCredentials + consumeResolution for every OAuth2 upstream whose config requests DCR. buildPureOAuth2Config deliberately keeps its pure, network-free signature; the DCR-resolved ClientSecret is layered on afterwards via a new applyResolutionToOAuth2Config helper so the two application sites stay side-by-side.
    • Each upstream RunConfig element is shallow-copied and its OAuth2 sub-config is deep-copied before mutation, preserving the caller's RunConfig.Upstreams slice.
    • resolveDCRCredentials returns a new dcrStepError wrapping the originating step (metadata discovery, DCR call, store read, store write) instead of logging-and-returning at each branch. The boundary caller (buildUpstreamConfigs) emits one slog.Error via logDCRStepError, so there is a single error line per failure. A panic inside the singleflight closure is converted to the same dcrStepError carrying the captured debug.Stack(), so the boundary log surfaces it without an in-defer duplicate.
    • Cache-hit path emits slog.Debug with dcr_age_days and additionally slog.Warn with remediation text when the cached registration exceeds the 90-day dcrStaleAgeThreshold.
    • The /oauth/register handler success log emits at Debug with issuer, client_id, software_id, token_endpoint_auth_method, and scopes — silent at INFO+ per the project's logging convention. SoftwareID is threaded through ValidateDCRRequest, which now also caps it to 256 printable-ASCII characters.
    • MonitoredTokenSource accepts upstream and client_id at construction (replacing the earlier SetUpstreamContext setter, which had an unsynchronized writer). On the first transition to Unauthenticated driven by a permanent token-endpoint 4xx, it emits a single slog.Warn with a DCR/CIMD remediation hint (gated by stopOnce so a tight Token() loop cannot spam it, and suppressed when no client_id context is available). The classifier itself (isTransientNetworkError) is side-effect free.
    • NewHandler now fails loudly when AuthorizationServerConfig or its embedded *fosite.Config is nil; issuer() no longer carries a runtime nil-guard.
    • Error strings are run through a sanitizer that strips URL userinfo, query, and fragment components before logging. The regex matches http(s):// case-insensitively per RFC 3986 §3.1, and trailing sentence punctuation around a URL is preserved so prose is not mangled.
    • pkg/oauthproto/dcr.go handleHTTPResponse caps the upstream /register error body at 8 KiB via io.LimitReader so a non-conformant or malicious upstream cannot inflate operator log volume.

Closes #5039

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

New coverage added:

  • pkg/authserver/runner/embeddedauthserver_test.go — full DCR boot against a mock AS (metadata + /register), cache-hit short-circuit (zero additional HTTP requests on the second buildUpstreamConfigs), copy-before-mutate assertion on the caller's RunConfig.Upstreams slice, and TestNewEmbeddedAuthServer_DCRBoot driving the real constructor to assert dcrStore is populated.
  • pkg/authserver/runner/dcr_test.go — cache-hit Debug log with dcr_age_days, stale-age Warn above threshold, one dcrStepError-returning case per failure step, and the panic-recovery test now asserts the wrapped error is a *dcrStepError(dcrStepRegister, …, Stack: <captured>).
  • pkg/authserver/server/handlers/handler_chain_test.goTestHandler_issuer covers both populated and zero-value AccessTokenIssuer. TestNewHandler_ErrorsOnNilConfig (in authorize_test.go) pins the new constructor invariant.
  • pkg/authserver/server/registration/dcr_test.gosoftware_id length cap and non-printable-ASCII rejection.
  • pkg/auth/monitored_token_source_test.go — permanent-4xx branch emits the remediation Warn with workload/upstream/client_id populated via the new constructor parameters.
  • pkg/runner/runner_test.goTestRemoteAuthLogContext covers the full cached CIMD > cached DCR > static precedence.
  • Sanitizer unit tests cover commas, periods, closing parens, mixed runs, a Go http.Client-style quoted URL, embedded user:pass@host userinfo (with and without query), implicit-flow #access_token=… fragments, and uppercase HTTPS:// schemes.

Changes

File Change
pkg/authserver/runner/embeddedauthserver.go Add dcrStore field and wire resolver into buildUpstreamConfigs; introduce applyResolutionToOAuth2Config; document the per-instance store vs. process-wide singleflight asymmetry.
pkg/authserver/runner/dcr.go Introduce dcrStepError / logDCRStepError; emit cache-hit Debug, stale-age Warn, and per-step error context; carry panic stack on the wrapped error so a single boundary record surfaces it; URL-secret-stripping sanitizer (userinfo + query + fragment, case-insensitive scheme match).
pkg/authserver/runner/dcr_store.go Inline doc rationale (no path citations) for resource-leak, copy-before-mutate, and constructor-validation reasoning.
pkg/authserver/server/handlers/dcr.go DCR success log emits at Debug, drop redundant upstream attribute, thread software_id through.
pkg/authserver/server/handlers/handler.go NewHandler validates config and config.Config; remove the runtime nil-guard from issuer().
pkg/authserver/server/registration/dcr.go Cap software_id to 256 printable-ASCII characters; export MaxSoftwareIDLength.
pkg/auth/monitored_token_source.go Accept upstream/client_id as constructor parameters; emit DCR/CIMD remediation Warn from markAsUnauthenticated (once per state transition, gated by stopOnce, suppressed when clientID == ""). The classifier is side-effect free.
pkg/runner/runner.go Populate new MonitoredTokenSource fields from RemoteAuthConfig with cached CIMD > cached DCR > static precedence (mirrors remote.Handler.resolveClientCredentials).
pkg/oauthproto/dcr.go Cap the upstream /register error body at 8 KiB via io.LimitReader before embedding it in the returned error string.
pkg/authserver/runner/embeddedauthserver_test.go DCR boot, cache-hit, and copy-before-mutate integration tests.
pkg/authserver/runner/dcr_test.go Log and per-step error tests, sanitizer userinfo/fragment/uppercase cases, augmented panic-recovery assertions.
pkg/authserver/server/handlers/handler_chain_test.go, authorize_test.go TestHandler_issuer and TestNewHandler_ErrorsOnNilConfig.
pkg/authserver/server/handlers/dcr_test.go, pkg/authserver/server/registration/dcr_test.go Assertions for new log fields and software_id validation.
pkg/auth/monitored_token_source_test.go, pkg/runner/runner_test.go New constructor plumbing, remediation Warn, and the CIMD-wins precedence case (closes review finding F26).

Large PR Justification

The PR exceeds the 400-LOC / 10-file guideline (1342+/104- across 18 files), but the change cannot be split without leaving the codebase in an intermediate non-functional state:

  • ~660 lines are tests (*_test.go) — table-driven coverage for the resolver wiring, the new error taxonomy, the staleness + remediation logs, and the sanitizer.
  • The remaining ~510 production lines across 9 files implement one logical change (wire the resolver into the auth-server boot path) plus four tightly-coupled sub-concerns: resolver wiring, dcrStepError taxonomy + boundary logging, monitored-token-source remediation log, and software_id validation. Splitting the resolver wiring without the matching error taxonomy would land a regression in observability; splitting the error taxonomy without the wiring would orphan the new types.

Does this introduce a user-facing change?

Operators gain two new diagnostic signals on DCR-enabled upstreams:

  1. A Warn when a cached DCR registration is older than 90 days, hinting that the operator may want to rotate.
  2. A Warn on the first transition to Unauthenticated driven by a permanent token-endpoint 4xx, hinting that the workload's cached DCR or CIMD credentials may be stale and that deleting them and restarting will trigger re-registration. The Warn fires at most once per state transition and is suppressed entirely for workloads with no client_id context.

The MonitoredTokenSource constructor signature has changed to require upstream and client_id, but it is an internal API with no public surface, and the prior SetUpstreamContext setter is removed.

Successful DCR registrations log at Debug, not Info, in keeping with the project's silent-success-at-INFO+ convention.

Special notes for reviewers

  • The purity gate on buildPureOAuth2Config is load-bearing: it takes no context.Context and performs no network I/O. DCR resolution is deliberately sequenced around it, not inside it, so the pure config builder stays trivially testable. Please flag any future change that threads ctx into it.
  • dcrStepError moves the "one error line per failure" invariant from the resolver to the boundary caller. If a future caller forgets to invoke logDCRStepError, the error will still propagate normally but the structured log will be missing. Worth keeping in mind if we add more resolver call sites.
  • consumeResolution (formerly applyResolution) is a one-shot state transition, not an idempotent default-fill: a second call after the first is a no-op only because the first cleared DCRConfig. The companion applyResolutionToOAuth2Config writes the inline-only ClientSecret onto the built *upstream.OAuth2Config; any future output path from OAuth2UpstreamRunConfig to upstream.OAuth2Config must call both helpers to get a fully-resolved DCR client.
  • The package-level dcrFlight singleflight is intentionally process-wide while EmbeddedAuthServer.dcrStore is per-instance — see the doc comments on each. The asymmetry is load-bearing for concurrent EmbeddedAuthServer instances in the same process.

Generated with Claude Code

@tgrunnagle tgrunnagle changed the base branch from main to issue_5038_dcr-2b April 23, 2026 20:47
@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label Apr 23, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 23, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 74.11168% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.56%. Comparing base (b532156) to head (0073ad7).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pkg/authserver/runner/dcr.go 68.53% 26 Missing and 2 partials ⚠️
pkg/auth/monitored_token_source.go 71.79% 10 Missing and 1 partial ⚠️
pkg/runner/runner.go 0.00% 9 Missing ⚠️
pkg/authserver/runner/embeddedauthserver.go 83.33% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5044      +/-   ##
==========================================
+ Coverage   67.50%   67.56%   +0.05%     
==========================================
  Files         601      601              
  Lines       61055    61262     +207     
==========================================
+ Hits        41218    41391     +173     
- Misses      16721    16745      +24     
- Partials     3116     3126      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 27, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 28, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
tgrunnagle added a commit that referenced this pull request Apr 29, 2026
Closes the 14 review threads from review #4199898531 that are scoped to
this PR; #3 and #5 are intentionally deferred to PR #5044 which is the
PR that consumes the ClientSecret plumbing and RedirectURI propagation.

Behaviour fixes:

- #2 synthesiseRegistrationEndpoint and resolveUpstreamRedirectURI now
  preserve the issuer's path component, so multi-tenant upstreams that
  ship DCR without advertising it (or this auth server when it itself
  has a tenant prefix in its issuer) keep their tenant prefix in the
  synthesised URL and the defaulted redirect URI.
- #7 resolveDCRCredentials wraps registration in a package-level
  singleflight.Group keyed by DCRKey so concurrent callers for the same
  upstream coalesce into a single RFC 7591 registration. The leader's
  closure rechecks the cache before any network I/O so followers that
  arrive just after a Put returns observe the fresh entry.
- #9 applyResolution writes ClientID only when the run-config copy
  leaves it empty, treating it symmetrically with the endpoint writes.
  This is defence-in-depth for callers that bypass the resolver's
  validateResolveInputs precondition check.
- #10 the cfg.RegistrationEndpoint short-circuit branch in
  resolveDCREndpoints validates URL well-formedness and HTTPS-or-
  loopback locally, so a non-HTTPS endpoint fails before
  performRegistration constructs a bearer-token transport for it.
- #13 endpointsFromMetadata validates discovered authorization_endpoint
  and token_endpoint before populating dcrEndpoints, so a self-
  consistent metadata document with http:// URLs cannot flow through
  to the auth-code or token-exchange path.

Doc / scope:

- #6 trust-assumption note added on DCRUpstreamConfig with a link to
  the follow-up SSRF-hardening issue (#5135) per the
  "Document Architectural Constraints" rule.
- #8 NewInMemoryDCRCredentialStore doc now explicitly calls out what
  the in-memory store does and does NOT solve (cross-replica sharing,
  durability, cross-process write coordination).

Test hygiene:

- #4 deleted the literal-equals-literal dcrStaleAgeThreshold test that
  added no meaningful coverage.
- #11 atomic.AddInt32 / LoadInt32 for wellKnownHits / customHits in
  TestFetchAuthorizationServerMetadataFromURL, matching the sibling
  pattern in dcr_test.go.
- #12 new failingDCRStore test double + two cases asserting the
  cache.Get / cache.Put wrap messages ("dcr: cache lookup", "dcr: cache
  put") that operators see when a future Redis backend errors.
- #14 renamed the TestNeedsDCR case from
  "client_id with dcr (invalid but returns false)" to
  "client_id wins over dcr_config (defensive AND semantic)" so the
  case label describes the intentional defensive behaviour rather
  than reading like a bug report.
- #15 removed a //nolint:lll directive that was suppressing a non-
  existent violation (the line is 108 chars; the limit is 130).
- #16 new TestInMemoryDCRCredentialStore_ConcurrentAccess fans out
  16 goroutines doing alternating Put/Get against overlapping and
  disjoint keys, with a fail-fast deadline, so go test -race actually
  exercises the sync.RWMutex guard.

New tests cover each behaviour change:

- TestResolveDCRCredentials_SingleflightCoalescesConcurrentCallers
  pins exactly-one-registration under N concurrent callers.
- TestSynthesiseRegistrationEndpoint_PreservesIssuerPath /
  TestResolveUpstreamRedirectURI_PreservesIssuerPath cover the path-
  preservation fix.
- TestApplyResolution_DoesNotOverwritePreProvisionedClientID covers
  the symmetric ClientID write.
- TestResolveDCREndpoints_DirectRegistrationEndpointValidated covers
  the local registration-endpoint guard.
- TestEndpointsFromMetadata_RejectsInsecureDiscoveredEndpoints covers
  the metadata authz/token endpoint guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tgrunnagle added a commit that referenced this pull request Apr 29, 2026
Closes the 14 review threads from review #4199898531 that are scoped to
this PR; #3 and #5 are intentionally deferred to PR #5044 which is the
PR that consumes the ClientSecret plumbing and RedirectURI propagation.

Behaviour fixes:

- #2 synthesiseRegistrationEndpoint and resolveUpstreamRedirectURI now
  preserve the issuer's path component, so multi-tenant upstreams that
  ship DCR without advertising it (or this auth server when it itself
  has a tenant prefix in its issuer) keep their tenant prefix in the
  synthesised URL and the defaulted redirect URI.
- #7 resolveDCRCredentials wraps registration in a package-level
  singleflight.Group keyed by DCRKey so concurrent callers for the same
  upstream coalesce into a single RFC 7591 registration. The leader's
  closure rechecks the cache before any network I/O so followers that
  arrive just after a Put returns observe the fresh entry.
- #9 applyResolution writes ClientID only when the run-config copy
  leaves it empty, treating it symmetrically with the endpoint writes.
  This is defence-in-depth for callers that bypass the resolver's
  validateResolveInputs precondition check.
- #10 the cfg.RegistrationEndpoint short-circuit branch in
  resolveDCREndpoints validates URL well-formedness and HTTPS-or-
  loopback locally, so a non-HTTPS endpoint fails before
  performRegistration constructs a bearer-token transport for it.
- #13 endpointsFromMetadata validates discovered authorization_endpoint
  and token_endpoint before populating dcrEndpoints, so a self-
  consistent metadata document with http:// URLs cannot flow through
  to the auth-code or token-exchange path.

Doc / scope:

- #6 trust-assumption note added on DCRUpstreamConfig with a link to
  the follow-up SSRF-hardening issue (#5135) per the
  "Document Architectural Constraints" rule.
- #8 NewInMemoryDCRCredentialStore doc now explicitly calls out what
  the in-memory store does and does NOT solve (cross-replica sharing,
  durability, cross-process write coordination).

Test hygiene:

- #4 deleted the literal-equals-literal dcrStaleAgeThreshold test that
  added no meaningful coverage.
- #11 atomic.AddInt32 / LoadInt32 for wellKnownHits / customHits in
  TestFetchAuthorizationServerMetadataFromURL, matching the sibling
  pattern in dcr_test.go.
- #12 new failingDCRStore test double + two cases asserting the
  cache.Get / cache.Put wrap messages ("dcr: cache lookup", "dcr: cache
  put") that operators see when a future Redis backend errors.
- #14 renamed the TestNeedsDCR case from
  "client_id with dcr (invalid but returns false)" to
  "client_id wins over dcr_config (defensive AND semantic)" so the
  case label describes the intentional defensive behaviour rather
  than reading like a bug report.
- #15 removed a //nolint:lll directive that was suppressing a non-
  existent violation (the line is 108 chars; the limit is 130).
- #16 new TestInMemoryDCRCredentialStore_ConcurrentAccess fans out
  16 goroutines doing alternating Put/Get against overlapping and
  disjoint keys, with a fail-fast deadline, so go test -race actually
  exercises the sync.RWMutex guard.

New tests cover each behaviour change:

- TestResolveDCRCredentials_SingleflightCoalescesConcurrentCallers
  pins exactly-one-registration under N concurrent callers.
- TestSynthesiseRegistrationEndpoint_PreservesIssuerPath /
  TestResolveUpstreamRedirectURI_PreservesIssuerPath cover the path-
  preservation fix.
- TestApplyResolution_DoesNotOverwritePreProvisionedClientID covers
  the symmetric ClientID write.
- TestResolveDCREndpoints_DirectRegistrationEndpointValidated covers
  the local registration-endpoint guard.
- TestEndpointsFromMetadata_RejectsInsecureDiscoveredEndpoints covers
  the metadata authz/token endpoint guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label May 1, 2026
Copy link
Copy Markdown
Contributor Author

@tgrunnagle tgrunnagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Agent Consensus Review

Agents consulted: security, concurrency, error-handling, tests, architecture, go-conventions, general-quality, review-friendliness. Codex cross-review skipped (codex CLI not installed).

Reviewer note: Posting as COMMENT because the authenticated reviewer is the PR author and the PR is a draft. The recommendation underneath is REQUEST CHANGES — there are three correctness/security issues (F3, F1, F2 below) that should land before this PR is review-ready.

Consensus Summary (29 retained findings)

# Finding Score Severity Action
F3 remoteAuthLogContext misses CIMD precedence; doc claim is wrong 10/10 HIGH Fix
F1 sanitizeErrorForLog does not strip URL userinfo 9/10 HIGH Fix
F2 Upstream /register error body is io.ReadAll-ed unbounded then logged 8/10 HIGH Fix or follow-up
F6 Handler.issuer() will panic on nil config — validation belongs in NewHandler 9/10 LOW Fix
F7 DCR success log promoted Debug→Info contradicts silent-success-at-INFO+ rule 9/10 LOW Fix or amend rule
F4 Singleflight panic recovery double-logs (in-defer Error + boundary Error) 8/10 MEDIUM Fix
F8 Permanent-4xx Warn has no test asserting fire/contents 8/10 MEDIUM Add test
F9 Cache-hit dcr_age_days Debug + stale-threshold Warn unexercised 8/10 MEDIUM Add test
F10 TestBuildUpstreamConfigs_DCR lacks failure-path subtest 8/10 MEDIUM Add test
F11 applyResolution is non-idempotent state transition; rename it 8/10 LOW Rename
F12 applyResolution + applyResolutionToOAuth2Config paired-call invariant unenforced 8/10 MEDIUM Collapse or document
F13 DCRStepError, LogDCRStepError, newDCRStepError exported but no external consumer 8/10 MEDIUM Lowercase
F23 trimURLTrailingPunctuation uses ContainsRune on byte input 8/10 LOW IndexByte instead
F19 Sanitizer doesn't strip URL fragments 7/10 MEDIUM One-line fix
F5 Permanent-4xx Warn risk of log-spam from Token() entry 7/10 MEDIUM Hoist out of classifier
F14 dcr.go is 1,186 lines; sanitizer is independent of DCR 7/10 MEDIUM Split file
F15 New private helpers placed above the // Private helpers divider 7/10 MEDIUM Reorder
F16 PR exceeds 400-LOC / 10-file limit; lacks Large PR Justification 7/10 MEDIUM Add section
F18 Sanitizer regex case-sensitive (HTTPS:// escapes) 7/10 LOW (?i) flag
F20 LogDCRStepError(nil) emits misleading record 7/10 LOW Nil guard
F21 dcrStore lifetime / interface lacks Close 7/10 LOW Document
F22 Process-wide dcrFlight vs per-instance dcrStore undocumented 7/10 LOW Document
F24 Production comments cite .claude/rules/... paths 7/10 LOW Inline principle
F25 Production comment names a specific test 7/10 LOW Drop ref
F26 TestRemoteAuthLogContext doesn't cover CIMD path (ties to F3) 8/10 MEDIUM Add case w/ F3
F27 TestHandler_issuer doesn't assert nil-config invariant 7/10 LOW Add assertion
F28 Mock-AS request count uses Greater(0) not exact 7/10 LOW Tighten
F29 Panic-recovery test doesn't errors.As(*DCRStepError) 7/10 LOW Augment
F30 Step-wrap tests use substring instead of errors.As 7/10 LOW Augment

Overall

The PR's structural shape is sound. The wiring of resolveDCRCredentials into EmbeddedAuthServer.NewEmbeddedAuthServer correctly observes the copy-before-mutate rule (deep-copying both the outer UpstreamRunConfig and the inner *OAuth2UpstreamRunConfig before passing to the resolver), and TestBuildUpstreamConfigs_DCR + TestNewEmbeddedAuthServer_DCRBoot pin those guarantees explicitly via assert.Same on the caller's pointer. The DCRStepError step taxonomy + boundary LogDCRStepError is a genuinely good observability pattern for a multi-step resolver. The new sanitizer's trailing-punctuation handling is well-tested with regression cases.

However, three correctness/security items should land before merge:

  1. F3 (HIGH, score 10/10)remoteAuthLogContext claims to mirror the precedence of pkg/auth/remote/handler.go::resolveClientCredentials, but the actual precedence in that function is CIMD > DCR > static, while the new helper handles only DCR > static. Workloads using cached CIMD client_ids will see a stale or empty client_id in the new DCR remediation Warn — defeating the operator-correlation the field exists for. Either fix the precedence or rewrite the comment.
  2. F1 (HIGH, score 9/10)sanitizeErrorForLog only clears u.RawQuery. URLs with embedded credentials (https://user:pass@host) have their userinfo preserved; URLs with fragments (e.g., implicit-flow callbacks #access_token=...) keep them too (F19). The function is documented as defense-in-depth, so the gap is precisely what the function exists to prevent. Two-line fix in the same block: u.User = nil; u.Fragment = "".
  3. F2 (HIGH, score 8/10)pkg/oauthproto/dcr.go:307 reads the upstream's error body via io.ReadAll(resp.Body) with no LimitReader (the success path correctly caps at 1 MiB), then embeds the body verbatim in fmt.Errorf("...status %d: %s", code, body). That error flows into LogDCRStepError, which only sanitizes URL query strings. A misbehaving or compromised upstream can echo arbitrary content into operator logs unbounded. The bug is technically in oauthproto/dcr.go (outside this PR's diff), but this PR is what surfaces the risk operationally — worth either fixing in this PR or filing a follow-up that blocks merge.

Beyond the HIGH items, the test-coverage cluster (F8, F9, F10, F26, F27–F30) is the next-most-important bucket: the PR's central observability contributions (the 4xx Warn, the staleness Warn, the boundary Error log) are not asserted by any test. A regression that drops or changes those records would not fail CI. Adding a slog-handler-tee fixture and exercising the three branches would close the gap.

The architecture cluster (F11–F13) is polish but worth considering before merge: the API surface widening (exported DCRStepError, etc.) and the paired-helpers contract (applyResolution + applyResolutionToOAuth2Config) are both maintenance hazards if left as comment-enforced invariants.

F14–F16 are review-experience findings; the github-actions bot has already left a CHANGES_REQUESTED for the >1000-line size, and adding a Large PR Justification section addresses that mechanically.

Notes for items that don't have an inline comment

  • F2 (unbounded upstream error body) — the affected file pkg/oauthproto/dcr.go is not in this PR's diff. Fix at pkg/oauthproto/dcr.go:307: replace io.ReadAll(resp.Body) with io.ReadAll(io.LimitReader(resp.Body, 8*1024)) and consider parsing as RFC 7591 JSON instead of %s-interpolating the raw body.
  • F8/F9/F10/F26–F30 (test gaps) — see the per-finding lines in the consensus table; rather than 7 tiny inline comments, group these into one PR-level test plan: (1) capture slog records via a tee handler in monitored_token_source_test.go, (2) drive the cache-hit + stale-threshold path in dcr_test.go with a pre-populated old CreatedAt, (3) add a failure-path subtest to TestBuildUpstreamConfigs_DCR with a 500 mock, (4) add a CIMD case to TestRemoteAuthLogContext, (5) augment the existing wrapped-error tests with errors.As(*DCRStepError) assertions.
  • F14 (dcr.go file size) — split sanitizeErrorForLog, trimURLTrailingPunctuation, queryStrippingPattern, and TestSanitizeErrorForLog into a new pkg/authserver/runner/logsanitize.go and logsanitize_test.go. Optionally also split DCRStepError + LogDCRStepError + the dcrStep* constants into pkg/authserver/runner/dcr_errors.go.
  • F15 (file org) — move newDCRStepError, sanitizeErrorForLog, trimURLTrailingPunctuation, and queryStrippingPattern below the // ----- Private helpers ----- divider in dcr.go (line ~573).
  • F16 (PR meta) — add a ## Large PR Justification section to the PR description: ~620 of the 1133 additions are tests, the remaining ~510 across 7 production files implement one logical change, and the four sub-concerns (resolver wiring, step taxonomy, monitored-token-source remediation log, software_id validation) cannot be split without orphaned-code intermediate states. Alternatively, run /split-pr to evaluate carving out the URL sanitizer + tests as a standalone PR.
  • F22 (dcrFlight lifetime) — the var declaration is package-level pre-existing code (not in any hunk). Add a doc comment explaining the intentional process-wide lifetime, contrasting with the per-instance dcrStore.
  • F26 is structurally identical to F3; fixing F3 closes both.

Generated with Claude Code

Comment thread pkg/runner/runner.go Outdated
Comment thread pkg/authserver/runner/dcr.go
Comment thread pkg/authserver/runner/dcr.go Outdated
Comment thread pkg/authserver/runner/dcr.go Outdated
Comment thread pkg/authserver/runner/dcr.go Outdated
Comment thread pkg/auth/monitored_token_source.go Outdated
Comment thread pkg/authserver/server/handlers/dcr.go Outdated
Comment thread pkg/authserver/server/handlers/handler.go
Comment thread pkg/authserver/server/handlers/handler.go Outdated
Comment thread pkg/authserver/runner/embeddedauthserver.go
tgrunnagle and others added 5 commits May 1, 2026 09:42
A misbehaving or malicious authorization server could echo arbitrary
content into the upstream's /register error response. handleHTTPResponse
read that body via io.ReadAll with no LimitReader, then embedded it
verbatim in the returned error — which downstream callers log. Cap the
read at 8 KiB (far larger than any conformant RFC 7591 error response)
so operator log volume cannot be inflated by a non-conformant upstream.

Addresses #5044 review finding F2 (HIGH).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The doc comment claimed remoteAuthLogContext mirrored the precedence of
remote.Handler.resolveClientCredentials, but the implementation skipped
the CachedCIMDClientID check entirely. For CIMD-authenticated workloads
the new DCR remediation Warn would have reported a stale or empty
client_id rather than the CIMD URL actually being sent on token refresh,
defeating the operator-correlation the field exists for.

Restore the documented precedence (CachedCIMDClientID >
CachedClientID > ClientID) and add a TestRemoteAuthLogContext case
covering the CIMD-wins path.

Addresses #5044 review findings F3 (HIGH) and F26
(MEDIUM, closed by the new test case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
isTransientNetworkError previously emitted the cached-DCR-client
remediation Warn from inside its classifier on every permanent 4xx.
A tight Token() loop hitting the same condition would spam the same
record on every call before the workload's Unauthenticated status
propagated. The same branch also fired for non-DCR workloads, which
saw a remediation telling them to "delete cached credentials" they
never had.

Strip the side effect from the classifier and emit the Warn from
markAsUnauthenticated, which already gates the close-monitoring
transition through stopOnce. The Warn now fires at most once per
state transition, is suppressed when no client_id context is
available, and reads honestly about the variability ("if this
workload uses cached DCR or CIMD credentials they may be stale").

Addresses #5044 review finding F5 (MEDIUM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related polish items in the authserver handlers package:

NewHandler did not validate that AuthorizationServerConfig (or its
embedded *fosite.Config) was non-nil. issuer() reached into the
config at request time, so a misconfigured caller would panic deep
inside an HTTP handler instead of failing at startup. Add the nil
check to the constructor and simplify issuer() to rely on the
constructor invariant. Pin the new invariant with
TestNewHandler_ErrorsOnNilConfig.

The /oauth/register success log was promoted to Info even though the
operation is neither long-running nor exceptional. Demote back to
Debug; an audit-log path is the right home for the audit signal if
that becomes a requirement.

Addresses #5044 review findings F6 and F7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles ten review-feedback items in pkg/authserver/runner; each
addresses internal-API hygiene or doc-comment drift in the DCR
resolver, no behaviour change for callers.

Sanitizer hardening (F1, F19, F18, F23): sanitizeErrorForLog now
strips userinfo and fragment in addition to the query, since either
can carry credentials or tokens (https://user:pass@host,
implicit-flow #access_token=...). queryStrippingPattern matches
http/https case-insensitively per RFC 3986 §3.1 so HTTPS://...
cannot escape sanitisation. trimURLTrailingPunctuation switches to
strings.IndexByte to match the ASCII-only terminator set without
the rune-decoding overhead. Test cases added for each.

Resolver error API (F4, F13, F20): the dcrStepRegister panic-recovery
branch no longer emits a duplicate slog.Error; the captured stack
travels with the wrapped *dcrStepError to the boundary log so a
single panic produces a single record. DCRStepError /
LogDCRStepError lowercased to dcrStepError / logDCRStepError since
no caller lives outside the package. logDCRStepError now no-ops on
nil err so the unknown-step branch cannot fire on a missing failure.

Resolution helpers (F11, F12, F25): applyResolution renamed to
consumeResolution to communicate that it is a one-shot state
transition (clearing DCRConfig is unconditional), not an idempotent
defaulting step. The applyResolutionToOAuth2Config doc now states
the paired-call invariant explicitly without referencing a specific
test.

Lifecycle docs (F21, F22): the per-instance dcrStore vs.
process-wide dcrFlight asymmetry is now stated on both sides, and
EmbeddedAuthServer.Close documents the future-Close hook for a
backend with handles.

Inline rules-file rationale (F24): production comments no longer
cite .claude/rules/... by path; the principle is inlined.

Addresses #5044 review findings F1, F4, F11, F12,
F13, F18, F19, F20, F21, F22, F23, F24, F25.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tgrunnagle
Copy link
Copy Markdown
Contributor Author

Body-only review items follow-up:

F2 (HIGH) — Addressed in 4d504fe. pkg/oauthproto/dcr.go was outside this PR's diff; the unbounded io.ReadAll on the upstream's /register error body is now wrapped in io.LimitReader(resp.Body, 8*1024), so a non-conformant upstream cannot inflate operator log volume. I kept the existing %s-into-fmt.Errorf shape rather than parsing as RFC 7591 JSON; the cap is the load-bearing fix here.

F8 / F9 / F10 / F26-F30 (test-gap cluster) — Deferred. F26 was closed mechanically by the new CIMD case in TestRemoteAuthLogContext that landed with the F3 fix. The remaining items (slog-handler tee, failure-path subtest, panic-recovery errors.As(*dcrStepError) augmentation, etc.) I'd like to land as a follow-up so this PR stays focused on the correctness/security fixes. Happy to open a tracking issue if useful.

Generated with Claude Code

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 1, 2026
@github-actions github-actions Bot dismissed their stale review May 1, 2026 16:54

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 1, 2026
@tgrunnagle tgrunnagle marked this pull request as ready for review May 1, 2026 16:54
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes — none blocking.

Comment thread pkg/auth/monitored_token_source.go Outdated
Comment thread pkg/oauthproto/dcr.go
Comment thread pkg/runner/runner.go Outdated
Three follow-up items from review:

isPermanentTokenEndpointError used to treat every non-5xx
RetrieveError as permanent, which meant the DCR remediation Warn
fired on transient back-pressure (408 Request Timeout, 429 Too Many
Requests). Narrow the classifier to an explicit 400 / 401 / 403
allowlist so the "delete the cached credentials and restart" hint
fires only on responses where stale cached client credentials are a
plausible cause. Retry behaviour is unaffected — the retry decision
is gated entirely by isTransientNetworkError, which already returns
false for the whole RetrieveError branch.

handleHTTPResponse in pkg/oauthproto/dcr.go now drains the response
body after the LimitReader-bounded ReadAll. Without this, an upstream
returning more than 8 KiB on /register error left the connection mid-
stream when the deferred Close fired, preventing TCP connection
reuse. Mirrors what the non-JSON content-type branch a few lines
down already does.

remoteAuthLogContext moved out of pkg/runner into pkg/auth/remote as
*Config.LogContext(), collocating it with resolveClientCredentials
so the cached-CIMD > cached-DCR > static precedence has one home.
Adding a fourth cached field in future updates both call sites in
one place. Tests moved to pkg/auth/remote/config_test.go.

Addresses #5044 review #4212656411 comments
3174540081, 3174540082, and 3174540086.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 1, 2026
@tgrunnagle tgrunnagle requested a review from jhrozek May 1, 2026 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Authserver DCR: wire resolver into authserver and add structured logs (Phase 2, Steps 2d/2g)

2 participants