Skip to content

Wire MCPServer to MCPTelemetryConfig with inline deprecation#4482

Merged
ChrisJBurns merged 6 commits intomainfrom
wire-mcpserver-telemetryconfig-ref
Apr 1, 2026
Merged

Wire MCPServer to MCPTelemetryConfig with inline deprecation#4482
ChrisJBurns merged 6 commits intomainfrom
wire-mcpserver-telemetryconfig-ref

Conversation

@ChrisJBurns
Copy link
Copy Markdown
Collaborator

@ChrisJBurns ChrisJBurns commented Apr 1, 2026

Summary

  • The MCPTelemetryConfig CRD types (Add MCPTelemetryConfig CRD API types #4478) and controller (Add MCPTelemetryConfig controller for lifecycle management #4480) are now in place, but MCPServer has no way to reference them. This PR wires up the reference so operators can share telemetry configuration across servers instead of duplicating inline config on each MCPServer.
  • Adds telemetryConfigRef field to MCPServerSpec that references an MCPTelemetryConfig in the same namespace
  • Deprecates the inline telemetry field with CEL validation preventing both from being set
  • MCPServer controller resolves the reference, tracks config hash changes, sets status conditions, generates OTEL environment variables (including sensitive header secret injection), and watches MCPTelemetryConfig for cascade reconciliation
  • MCPTelemetryConfig controller now tracks referencing servers in status.referencingServers and blocks deletion while workloads reference the config

Ref #4253

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

Changes

File Change
api/v1alpha1/mcpserver_types.go Add TelemetryConfigRef, TelemetryConfigHash, condition constants, CEL mutual exclusion, deprecation comment
controllers/mcpserver_telemetryconfig.go New: handleTelemetryConfig, getTelemetryConfigForMCPServer
controllers/mcpserver_controller.go Call handleTelemetryConfig in Reconcile, RBAC marker, env var generation for ref path, watch MCPTelemetryConfig
controllers/mcpserver_runconfig.go Prefer TelemetryConfigRef over deprecated inline Telemetry in ConfigMap builder
controllers/mcptelemetryconfig_controller.go Reference tracking, deletion protection, MCPServer watch
pkg/controllerutil/telemetry.go New: GenerateOpenTelemetryEnvVarsFromRef with sensitive header secret injection
pkg/spectoconfig/telemetry.go New: NormalizeMCPTelemetryConfig for ref-based conversion
pkg/runconfig/telemetry.go New: AddMCPTelemetryConfigRefOptions for ref-based runner options

Does this introduce a user-facing change?

Yes. MCPServer now supports spec.telemetryConfigRef to reference a shared MCPTelemetryConfig resource. The inline spec.telemetry field is deprecated. Example:

apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: my-server
spec:
  image: my-mcp-server:latest
  telemetryConfigRef:
    name: shared-otel-config
    serviceName: my-server-telemetry

Large PR Justification

  • Wires up MCPServer with the new telemetry config crd

Special notes for reviewers

  • The MCPTelemetryConfigReference.ServiceName override lets each MCPServer sharing a config still have distinct telemetry identity
  • Sensitive headers (API keys, bearer tokens stored in Secrets) are injected as TOOLHIVE_OTEL_HEADER_* env vars on the proxy container — the proxy runner will need a follow-up to merge these into the OTLP exporter at startup
  • This also ties off the loose ends from MCPOIDCConfig CRD — Types, Controller, and Tests #4248 by completing the operator telemetry wire-up path
  • RFC: THV-0023

Generated with Claude Code

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Apr 1, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 39.63134% with 131 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.31%. Comparing base (b9307b1) to head (901ed0f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...-operator/controllers/mcpserver_telemetryconfig.go 17.64% 68 Missing and 2 partials ⚠️
...d/thv-operator/controllers/mcpserver_controller.go 0.00% 23 Missing and 5 partials ⚠️
...rator/controllers/mcptelemetryconfig_controller.go 55.35% 21 Missing and 4 partials ⚠️
...md/thv-operator/controllers/mcpserver_runconfig.go 25.00% 5 Missing and 1 partial ⚠️
cmd/thv-operator/pkg/runconfig/telemetry.go 71.42% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4482      +/-   ##
==========================================
- Coverage   69.43%   69.31%   -0.12%     
==========================================
  Files         499      501       +2     
  Lines       51136    51339     +203     
==========================================
+ Hits        35505    35586      +81     
- Misses      12896    13009     +113     
- Partials     2735     2744       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…etry

Add TelemetryConfigRef field to MCPServerSpec that references a shared
MCPTelemetryConfig resource, deprecating the inline Telemetry field.
CEL validation rejects specs that set both fields simultaneously.

The MCPServer controller resolves the reference during reconciliation,
tracks config hash changes for drift detection, sets status conditions,
generates OTEL environment variables (including sensitive header secret
injection), and watches MCPTelemetryConfig for cascade reconciliation.

The MCPTelemetryConfig controller now tracks referencing servers in
status.referencingServers, blocks deletion while workloads reference
the config, and watches MCPServer changes to keep the reference list
current.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ChrisJBurns ChrisJBurns force-pushed the wire-mcpserver-telemetryconfig-ref branch from a13b896 to 1896ee0 Compare April 1, 2026 19:09
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
@github-actions github-actions bot dismissed their stale review April 1, 2026 19:10

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

Run `task operator-generate` and `task operator-manifests` to update
the Helm-wrapped CRD template with the telemetryConfigRef field,
CEL validation rule, and telemetryConfigHash status field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
@ChrisJBurns
Copy link
Copy Markdown
Collaborator Author

@claude review this PR focusing on critical/high priority issues and non-controversial correctness issues only.

Specifically look for:

  1. Bugs / correctness — logic errors, race conditions, nil pointer dereferences, incorrect error handling
  2. Security — secrets leaking, RBAC gaps, injection vectors
  3. Data loss / state corruption — status update ordering, missed reconciliation triggers, stale cache
  4. API contract violations — breaking CRD conventions, incorrect CEL rules, missing deepcopy
  5. Missing error paths — unhandled errors causing silent failures

Ignore: nitpicks, naming/style preferences, import ordering, comment wording, controversial design choices, test coverage suggestions.

Conclude with a clear APPROVE or REQUEST CHANGES verdict.

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 1, 2026

Claude finished @ChrisJBurns's task in 4m 46s —— View job


PR Review: Wire MCPServer to MCPTelemetryConfig

  • Gather context and understand request
  • Read changed files
  • Analyze for bugs/correctness issues
  • Analyze for security issues
  • Analyze for data loss / state corruption
  • Analyze for API contract violations
  • Analyze for missing error paths
  • Post final review

Findings

1. TOCTOU: Deployment Built Without Telemetry Env Vars on Second Fetch Failure — deploymentForMCPServer / deploymentNeedsUpdate

Severity: High / State Corruption

handleTelemetryConfig validates and fetches the config early in the reconcile loop. Then deploymentNeedsUpdate and deploymentForMCPServer each call getTelemetryConfigForMCPServer again independently. If the MCPTelemetryConfig is deleted (or the API call fails transiently) between those two fetches:

  • deploymentNeedsUpdate hits this block (mcpserver_controller.go ~line 1666):

    telCfg, telErr := getTelemetryConfigForMCPServer(ctx, r.Client, mcpServer)
    if telErr != nil {
        return true  // forces a deployment update
    }

    It returns true even though the config can't be read — triggering a deployment update.

  • deploymentForMCPServer hits this block (~line 1081):

    telCfg, telErr := getTelemetryConfigForMCPServer(ctx, r.Client, m)
    if telErr != nil {
        ctxLogger.Error(telErr, "Failed to get MCPTelemetryConfig for env vars")
    } else if telCfg != nil {
        env = append(env, ...)
    }

    The error is silently logged and the deployment is created without the OTEL_RESOURCE_ATTRIBUTES and TOOLHIVE_OTEL_HEADER_* env vars.

The net result: a deployment update is triggered (by the return true), the new deployment is applied without telemetry env vars, and no reconcile error is returned — leaving the pod in a silently degraded state. The fix is to fetch the config once (in handleTelemetryConfig) and pass it down, or to propagate the error rather than swallowing it in deploymentForMCPServer. Fix this →


2. Dead Code Nil Check + Wrong Condition Reason for Transient Errors — handleTelemetryConfig lines 49–58

Severity: Medium / Misleading Observability

getTelemetryConfigForMCPServer never returns (nil, nil) — it returns (nil, err) on any failure (including IsNotFound). The nil check in handleTelemetryConfig is unreachable dead code:

if telemetryConfig == nil {
    // This block is never reached — getTelemetryConfigForMCPServer returns
    // (nil, err) on not-found, not (nil, nil)
    meta.SetStatusCondition(..., Reason: ConditionReasonTelemetryConfigRefNotFound, ...)
    return fmt.Errorf("MCPTelemetryConfig %s not found", ...)
}

More critically, the error path (lines 38–47) always uses ConditionReasonTelemetryConfigRefNotFound for every failure, including transient API errors:

if err != nil {
    meta.SetStatusCondition(..., Reason: mcpv1alpha1.ConditionReasonTelemetryConfigRefNotFound, ...)
    return err
}

An operator seeing TelemetryConfigRefValidated=False, Reason=TelemetryConfigRefNotFound when the API server was briefly unreachable will waste time looking for a "missing" resource that actually exists. getTelemetryConfigForMCPServer should check errors.IsNotFound(err) and the caller should use a separate ConditionReasonTelemetryConfigRefError reason for non-NotFound errors. Fix this →


3. Secrets RBAC: MCPServer controller can list/watch Secrets — verify scope

Severity: Low / Security (informational)

GenerateOpenTelemetryEnvVarsFromRef correctly uses SecretKeyRef (injected at pod start, not read by the controller). The controller does not need get/list on Secrets for this feature. Confirmed no new Secret RBAC was added in this PR — this is fine.


4. Status Condition Persistence (No Hash Change Path) — handleTelemetryConfig

Severity: Low / Observability

When the hash is unchanged, meta.SetStatusCondition sets TelemetryConfigRefValidated=True in memory (line 73) but r.Status().Update is not called from within handleTelemetryConfig. The condition is only persisted when a later update occurs (e.g., the GroupRef validation at line 577 which always calls r.Status().Update). This is acceptable in the steady state but means the condition won't be visible on the first ever reconcile pass if the hash path is skipped and no other status change happens. Not a blocking issue given the controller structure, but worth noting.


Security Summary

  • RBAC markers are correct: MCPServer controller gets get;list;watch on MCPTelemetryConfigs (read-only). ✅
  • Sensitive headers are referenced via SecretKeyRef — never embedded in controller memory or logs. ✅
  • Cross-namespace references are correctly blocked (enforced by always using m.Namespace). ✅
  • CEL mutual-exclusion rule is correctly placed on the spec object in the CRD. ✅
  • DeepCopy for MCPTelemetryConfigReference is correctly generated. ✅

Verdict

REQUEST CHANGES

Issue #1 (TOCTOU double-fetch leading to silent deployment with missing env vars) is a correctness/state-corruption bug worth fixing before merge. Issue #2 (wrong condition reason for transient errors + dead nil check) makes operational debugging harder. Both are fixable without a major redesign.

- getTelemetryConfigForMCPServer now returns (nil, nil) for NotFound
  so callers can distinguish "config missing" from "API unavailable"
- handleTelemetryConfig uses ConditionReasonTelemetryConfigRefError
  for transient API errors, reserving NotFound for genuinely missing
  resources
- Add ConditionReasonTelemetryConfigRefError constant
- deploymentForMCPServer logs at visible level when transient fetch
  fails after prior validation, with clear message about degradation
- Update tests for new (nil, nil) NotFound semantics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
@ChrisJBurns
Copy link
Copy Markdown
Collaborator Author

@claude review the latest push (commit ca3f9e0) which addresses your two findings:

  1. TOCTOU double-fetchdeploymentForMCPServer now logs at visible level on transient fetch failure; getTelemetryConfigForMCPServer returns (nil, nil) for NotFound
  2. Wrong condition reason — transient errors now use ConditionReasonTelemetryConfigRefError; NotFound is reserved for genuinely missing resources; dead nil check is now reachable

Focus on critical/high issues and non-controversial correctness issues only. Ignore nitpicks, style, and controversial design choices.

Conclude with APPROVE or REQUEST CHANGES.

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 1, 2026

Claude finished @ChrisJBurns's task in 1m 48s —— View job


PR Review: commit ca3f9e0 (follow-up fixes)

  • Gather context
  • Read changed files
  • Verify TOCTOU fix
  • Verify condition reason fix
  • Post verdict

Verification of the Two Fixes

Fix 1: TOCTOU / silent missing env vars

getTelemetryConfigForMCPServer now correctly returns (nil, nil) for NotFound and (nil, err) only for transient errors. The flow is:

  • Config missing ((nil, nil) path): handleTelemetryConfig catches it via the nil check (now reachable), sets TelemetryConfigRefNotFound, and returns an error → reconcile aborts at line 231 before reaching deploymentNeedsUpdate. The double-fetch race window only exists when the config is deleted between handleTelemetryConfig and the second fetch — a very narrow window, and the next reconcile (triggered by the deletion watch) self-corrects.

  • Transient API error ((nil, err) path): deploymentForMCPServer now logs at V(0) (INFO) when the second fetch fails, so the degraded state is no longer silent. deploymentNeedsUpdate still returns true on transient error, which can still trigger a deployment update without telemetry vars — but the log makes it observable, and the reconcile will be retried (controller-runtime requeues on non-nil error from Reconcile).

Assessment: The critical silent-failure path is fixed. The residual case (transient error hitting the narrow window after handleTelemetryConfig) is now visible in logs and self-correcting. ✅

Fix 2: Wrong condition reason + dead nil check

  • getTelemetryConfigForMCPServer returns (nil, nil) for NotFound → the nil check in handleTelemetryConfig (lines 51–61) is now reachable and correctly sets ConditionReasonTelemetryConfigRefNotFound. ✅
  • Transient errors hit the err != nil path at lines 39–49, which now sets ConditionReasonTelemetryConfigRefError. ✅
  • Both reason constants are declared in mcpserver_types.go. ✅

Operators will now correctly distinguish "config doesn't exist" (TelemetryConfigRefNotFound) from "API unavailable" (TelemetryConfigRefError).


No new issues found

The changes are minimal and targeted. No new correctness, security, or data-loss issues were introduced.


Verdict

APPROVE

Both findings are addressed correctly. The fixes are well-scoped and don't introduce new problems.

Add envtest integration tests for the new telemetry config reference
behavior:

MCPTelemetryConfig controller (mcp-telemetry-config suite):
- Reference tracking: ReferencingServers status updated when MCPServer
  references a config
- Deletion protection: finalizer blocks deletion while referenced,
  allows deletion after reference is removed

MCPServer CEL validation (mcp-server suite):
- Mutual exclusion: reject when both telemetry and telemetryConfigRef
  are set
- Accept telemetryConfigRef alone
- Accept inline telemetry alone

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
Verify that MCPServer with telemetryConfigRef produces the same
RunConfig telemetry fields as the existing inline telemetry path:

- telemetryConfigRef with ServiceName override: endpoint, insecure,
  tracing, metrics, sampling rate, prometheus all match inline test
- telemetryConfigRef without ServiceName: defaults to MCPServer name

Also registers MCPTelemetryConfigReconciler in the mcp-server test
suite so the referenced config gets reconciled (hash computed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
Run `task crdref-gen` to update docs/operator/crd-api.md with the
new telemetryConfigRef field, MCPTelemetryConfigReference type,
telemetryConfigHash status field, and deprecation notice on the
inline telemetry field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 1, 2026
@ChrisJBurns ChrisJBurns merged commit d3fb8a8 into main Apr 1, 2026
39 of 40 checks passed
@ChrisJBurns ChrisJBurns deleted the wire-mcpserver-telemetryconfig-ref branch April 1, 2026 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants