Skip to content

feat: Administrate supernode via MCP#29

Merged
gonzalezzfelipe merged 12 commits into
mainfrom
feat/mcp
May 18, 2026
Merged

feat: Administrate supernode via MCP#29
gonzalezzfelipe merged 12 commits into
mainfrom
feat/mcp

Conversation

@gonzalezzfelipe
Copy link
Copy Markdown
Contributor

@gonzalezzfelipe gonzalezzfelipe commented May 15, 2026

Summary by CodeRabbit

  • New Features

    • Introduced MCP (Model Context Protocol) server enabling automation of Supernode workload deployment and management through AI tools
    • Added extension catalog supporting Cardano relay, Dolos, and Hydra node deployment
    • Implemented workload lifecycle tools for installing, upgrading, and deleting extensions
    • Added Vault integration for runtime secret management
    • Enabled Kubernetes resource discovery and workload inspection via MCP tools
  • Documentation

    • Added comprehensive guides for MCP-driven extension deployment workflows
    • Documented Hydra node deployment and operations procedures
    • Updated control-plane configuration documentation with MCP options
  • Chores

    • Added GitHub Actions CI/CD pipelines for MCP server builds and testing
    • Updated Helm charts with MCP integration and secret management enhancements

Review Change Stack

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

📝 Walkthrough

Walkthrough

Adds a new Rust MCP server with policy/audit, Kubernetes and Helm integration, Vault client, SQLite sessions, tool router and workload tools; introduces CI workflows and Dockerfiles; wires control-plane Helm templates and post-install Vault token flow; updates Hydra-node chart/templates/metrics; and expands documentation and skills.

Changes

End-to-end MCP server, control-plane wiring, Hydra updates, and docs

Layer / File(s) Summary
Complete change set
.github/..., mcp-server/..., extensions/control-plane/..., extensions/hydra-node/..., extensions/cardano-node/..., bootstrap/kind/README.md, skills/*
All code/config/docs in this PR including MCP server crate and tools, CI/image builds, control-plane Helm resources and script, Hydra-node chart and metrics, and docs/skills updates.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Axum
  participant ToolRouter
  participant K8s
  participant Helm
  participant Vault
  Client->>Axum: /mcp call_tool (workloads.install)
  Axum->>ToolRouter: dispatch
  ToolRouter->>K8s: discover releases/services
  ToolRouter->>Helm: upgrade --install (if not dryRun)
  ToolRouter-->>Axum: result (plan or execution)
  Axum-->>Client: JSON response
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • txpipe/metis#14 — Prior Hydra Node Helm extension work; this PR updates the same chart/templates.
  • txpipe/metis#25 — Cardano node metrics flow touched; this PR adds the role field and MCP metrics consumption.
  • txpipe/metis#9 — Control-plane chart foundation; this PR adds MCP templates/values and post-install flow.

Suggested reviewers

  • mduthey
  • scarmuega

Poem

In burrows of code where YAMLs grow,
A rabbit ships an MCP to go.
Helm bells ring, K8s sings low,
Vault keeps secrets none may know.
Hydra wakes, extensions bloom—
Hop, deploy—reviews resume! 🐇🚀

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/mcp

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (20)
extensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yaml-10-36 (1)

10-36: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Cluster-wide RBAC is over-privileged for a “readonly” role.

This role currently allows mutating cluster resources (including full secret CRUD) across all namespaces. That’s a high-risk permission set for the default deployment path and doesn’t match the role intent.

Please split permissions into:

  • a namespace-scoped Role/RoleBinding for mutable workload operations in the target namespace, and
  • a minimal ClusterRole only for truly cluster-scoped reads (if required).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yaml`
around lines 10 - 36, The ClusterRole is over-privileged: remove all mutating
verbs (create, update, patch, delete) and secret CRUD from the cluster-scoped
role and instead create a namespace-scoped Role + RoleBinding for mutable
workload ops. Concretely, change the ClusterRole entries that currently list
resources like
"serviceaccounts","services","persistentvolumeclaims","configmaps","secrets","deployments","statefulsets","podmonitors"
and verbs ["create","get","list","watch","update","patch","delete"] to a minimal
ClusterRole that only has get/list/watch for truly cluster-scoped reads (e.g.,
"namespaces","storageclasses","pods" as needed) and ensure "pods/log" and
"pods/exec" remain namespace-scoped; then add a Role (target namespace) that
grants create/get/list/watch/update/patch/delete for
"serviceaccounts","services","persistentvolumeclaims","configmaps","secrets","deployments","statefulsets","podmonitors"
and bind it to the same subject via a RoleBinding. Use the resource names and
verbs in the diff to locate the rules to modify and add corresponding
Role/RoleBinding objects.
extensions/control-plane/values.yaml-166-174 (1)

166-174: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden defaults: avoid enabling MCP with auth.mode: trusted by default.

Shipping enabled + trusted auth as defaults exposes an unauthenticated admin surface to any in-cluster caller unless users override values.

Consider defaulting to one of:

  • enabled: false, or
  • a non-trusted auth mode with explicit credentials configuration.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/values.yaml` around lines 166 - 174, Change the
insecure default values in the Helm values file: do not ship enabled: true with
auth.mode: trusted. Update the defaults (the keys referenced: enabled,
auth.mode, and related image/replicaCount) so that either enabled: false is the
default OR auth.mode is a non-trusted mode that requires explicit credentials
(and add placeholder credential configuration fields) so the MCP does not expose
an unauthenticated admin surface out-of-the-box; adjust values.yaml accordingly
and document the required credentials for any non-disabled default.
extensions/cardano-node/templates/configmap-metrics.yaml-459-459 (1)

459-459: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not hardcode role to relay in emitted metrics.

The script computes node_role from METIS_NODE_ROLE, but the output now always reports "relay", which mislabels block-producer nodes.

Suggested fix
-    jq -n \
+    jq -n \
+      --arg nodeRole "$node_role" \
@@
-          role: "relay",
+          role: (nullable_string($nodeRole) // "relay"),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/cardano-node/templates/configmap-metrics.yaml` at line 459, The
emitted metrics hardcode role: "relay" instead of using the computed node role;
update the template in configmap-metrics.yaml so the role field uses the
computed variable (node_role / METIS_NODE_ROLE) rather than the literal "relay".
Locate the metrics emission block that sets role: "relay" and replace that
hardcoded string with the template variable that holds the result of
METIS_NODE_ROLE (e.g., use the node_role/template variable used earlier) so
block-producer nodes are labeled correctly.
extensions/control-plane/templates/stage4-00-pvc-supernode-mcp.yaml-1-24 (1)

1-24: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the standalone PVC; it’s unused with the current StatefulSet flow.

When persistence.enabled=true and existingClaim is empty, the StatefulSet already provisions session-store via volumeClaimTemplates. This template creates a second PVC (supernode-mcp) that is never mounted and can still consume storage.

Suggested fix
-{{- if and .Values.supernodeMcp.enabled .Values.supernodeMcp.persistence.enabled (not .Values.supernodeMcp.persistence.existingClaim) }}
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: supernode-mcp
-  namespace: '{{ .Values.namespace }}'
-  labels:
-    app.kubernetes.io/name: supernode-mcp
-    app.kubernetes.io/component: mcp
-{{- with .Values.supernodeMcp.persistence.annotations }}
-  annotations:
-{{ toYaml . | indent 4 }}
-{{- end }}
-spec:
-  accessModes:
-{{ toYaml .Values.supernodeMcp.persistence.accessModes | indent 4 }}
-{{- $storageClass := default .Values.storageClass .Values.supernodeMcp.persistence.storageClass }}
-{{- if $storageClass }}
-  storageClassName: {{ $storageClass }}
-{{- end }}
-  resources:
-    requests:
-      storage: {{ .Values.supernodeMcp.persistence.size }}
-{{- end }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/templates/stage4-00-pvc-supernode-mcp.yaml` around
lines 1 - 24, This template creates an unused PersistentVolumeClaim named
"supernode-mcp" which duplicates storage when the StatefulSet already provisions
"session-store" via volumeClaimTemplates; remove this PersistentVolumeClaim
resource (the apiVersion: v1 kind: PersistentVolumeClaim block that references
.Values.supernodeMcp.persistence and creates name: supernode-mcp) so PVCs are
only created by the StatefulSet's volumeClaimTemplates, or alternatively gate
rendering behind a new flag (e.g., .Values.supernodeMcp.useStandalonePVC) and
update the top conditional to require that flag before emitting the
supernode-mcp PVC.
.github/image/Dockerfile-5-5 (1)

5-5: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid piping mutable remote Helm installer scripts into bash; pin and verify Helm artifact instead.

Line 5 introduces a supply-chain and reproducibility gap. The Helm installer script cannot verify checksums (requires openssl, which is not installed on line 4), and no version is pinned—the script always fetches from the main branch. This creates a non-deterministic, unverified artifact in your build.

Use the suggested diff approach: download and verify a pinned Helm tarball explicitly, then extract it. This ensures reproducibility and eliminates reliance on a mutable remote script.

Suggested hardening diff
 RUN apt-get update \
     && apt-get install -y --no-install-recommends ca-certificates curl \
-    && curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash \
+    && HELM_VERSION="v3.19.0" \
+    && curl -fsSLo /tmp/helm.tgz "https://get.helm.sh/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz" \
+    && curl -fsSLo /tmp/helm.tgz.sha256 "https://get.helm.sh/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz.sha256" \
+    && echo "$(cat /tmp/helm.tgz.sha256)  /tmp/helm.tgz" | sha256sum -c - \
+    && tar -xzf /tmp/helm.tgz -C /tmp \
+    && mv "/tmp/linux-${TARGETARCH}/helm" /usr/local/bin/helm \
+    && chmod +x /usr/local/bin/helm \
+    && rm -rf /tmp/helm.tgz /tmp/helm.tgz.sha256 /tmp/linux-${TARGETARCH} \
     && apt-get purge -y --auto-remove curl \
     && rm -rf /var/lib/apt/lists/*
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/image/Dockerfile at line 5, Replace the unsafe piped installer
command ("curl -fsSL
https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash")
with a pinned, verifiable Helm release flow: choose a specific Helm version
(e.g., vX.Y.Z), download the corresponding release tarball from GitHub Releases,
fetch the official sha256 checksum for that exact version, verify the downloaded
tarball checksum with a checksum tool (openssl/sha256sum), extract the helm
binary and install it to the target location, and remove the downloaded
artifacts; ensure any required tooling for verification (openssl or coreutils)
is installed in the image before verification to make the process reproducible
and supply-chain safe.
extensions/control-plane/scripts/post_install.sh-34-41 (1)

34-41: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make MCP Vault provisioning opt-in.

Line 34 defaults MCP_VAULT_ENABLED to true, so a normal post-install now mints a long-lived orphan token with write access to the runtime KV even when supernode-mcp is not enabled. That silently expands the credential footprint for existing clusters.

Suggested fix
-MCP_VAULT_ENABLED="${MCP_VAULT_ENABLED:-true}"
+MCP_VAULT_ENABLED="${MCP_VAULT_ENABLED:-false}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/scripts/post_install.sh` around lines 34 - 41, The
script currently defaults MCP_VAULT_ENABLED to true causing automatic Vault
token provisioning; change the default to opt-in by setting
MCP_VAULT_ENABLED="${MCP_VAULT_ENABLED:-false}" and ensure any conditional
checks that rely on MCP_VAULT_ENABLED (the Vault provisioning branch that uses
MCP_VAULT_POLICY_NAME, MCP_VAULT_TOKEN_SECRET_NAME, MCP_VAULT_TOKEN_SECRET_KEY,
MCP_VAULT_TOKEN_DISPLAY_NAME, MCP_VAULT_TOKEN_TTL, MCP_VAULT_TOKEN_PERIOD,
MCP_VAULT_TOKEN_ORPHAN) respect the new default so provisioning only runs when
MCP_VAULT_ENABLED is explicitly enabled.
extensions/control-plane/scripts/post_install.sh-212-237 (1)

212-237: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Revoke the MCP token if Secret creation fails.

Lines 229-237 create the Vault token before the kubectl apply pipeline. If that pipeline fails, the script exits and cleanup only removes the temp file, leaving a live orphan token behind.

Suggested fix
+MCP_VAULT_SECRET_APPLIED="false"
+
 cleanup() {
   if [[ -n "${PORT_FORWARD_PID}" ]] && kill -0 "${PORT_FORWARD_PID}" >/dev/null 2>&1; then
     kill "${PORT_FORWARD_PID}" >/dev/null 2>&1 || true
     wait "${PORT_FORWARD_PID}" >/dev/null 2>&1 || true
   fi
   rm -f "${PORT_FORWARD_LOG}"
   if [[ -n "${MCP_VAULT_TOKEN_FILE}" ]]; then
+    if [[ "${MCP_VAULT_SECRET_APPLIED}" != "true" ]]; then
+      vault token revoke "$(cat "${MCP_VAULT_TOKEN_FILE}")" >/dev/null 2>&1 || true
+    fi
     rm -f "${MCP_VAULT_TOKEN_FILE}"
   fi
 }
 ...
   vault token create "${token_args[@]}" >"$MCP_VAULT_TOKEN_FILE"

   kubectl "${KUBECTL_ARGS[@]}" -n "$NAMESPACE" create secret generic "$MCP_VAULT_TOKEN_SECRET_NAME" \
     "--from-file=${MCP_VAULT_TOKEN_SECRET_KEY}=${MCP_VAULT_TOKEN_FILE}" \
     --dry-run=client \
     -o yaml \
     | kubectl "${KUBECTL_ARGS[@]}" -n "$NAMESPACE" apply -f -
+  MCP_VAULT_SECRET_APPLIED="true"
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/scripts/post_install.sh` around lines 212 - 237, The
script currently creates the Vault token via the vault token create call and
writes it to MCP_VAULT_TOKEN_FILE but doesn't revoke it if the kubectl apply
pipeline fails; change the flow so after running kubectl ... apply -f - you
check the pipeline exit status and if it failed, read the token from
MCP_VAULT_TOKEN_FILE and revoke it with the vault token revoke command (using
the token string read from MCP_VAULT_TOKEN_FILE), then remove the temp file;
alternatively add a local failure handler/trap that revokes the token on error
using the same MCP_VAULT_TOKEN_FILE token contents so no live orphan token
remains when kubectl apply fails.
.github/workflows/test_mcp_server.yml-25-25 (1)

25-25: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update to actions/checkout@v4 for runner compatibility.

The actions/checkout@v3 action runner is outdated. As per static analysis, update to v4 for compatibility with current GitHub Actions runners.

🔄 Proposed fix
       - name: Checkout
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test_mcp_server.yml at line 25, Replace the outdated
checkout action reference "uses: actions/checkout@v3" with the newer
"actions/checkout@v4" in the workflow so the runner uses the v4-compatible
checkout action; locate the line containing the "uses: actions/checkout@v3"
entry and update it to reference v4.
.github/workflows/check_mcp_server.yml-25-25 (1)

25-25: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update to actions/checkout@v4 for runner compatibility.

The actions/checkout@v3 action runner is outdated. As per static analysis, update to v4 for compatibility with current GitHub Actions runners.

🔄 Proposed fix
       - name: Checkout
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/check_mcp_server.yml at line 25, Update the GitHub Actions
workflow step that currently uses actions/checkout@v3 to actions/checkout@v4:
locate the workflow step with the uses entry "actions/checkout@v3" and change it
to "actions/checkout@v4" to ensure runner compatibility with current GitHub
Actions environments.
mcp-server/Dockerfile-11-15 (1)

11-15: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin Helm version instead of using the latest script.

Downloading and executing the Helm install script via curl | bash without version pinning is a security risk. If the script or Helm upstream is compromised, or if breaking changes are introduced, builds could fail or include malicious code.

🔒 Proposed fix with pinned version
 RUN apt-get update \
-    && apt-get install -y --no-install-recommends ca-certificates curl \
-    && curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash \
-    && apt-get purge -y --auto-remove curl \
+    && apt-get install -y --no-install-recommends ca-certificates curl wget \
+    && wget https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz \
+    && tar -zxvf helm-v3.14.0-linux-amd64.tar.gz \
+    && mv linux-amd64/helm /usr/local/bin/helm \
+    && rm -rf helm-v3.14.0-linux-amd64.tar.gz linux-amd64 \
+    && apt-get purge -y --auto-remove curl wget \
     && rm -rf /var/lib/apt/lists/*

Note: Update the version number to match your requirements. Verify the latest stable release.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/Dockerfile` around lines 11 - 15, Replace the unpinned "curl |
bash" Helm install in the RUN line with a pinned-version installation: add a
HELM_VERSION build-arg or ENV, download the specific release tarball for that
version (from
github.com/helm/helm/releases/download/v${HELM_VERSION}/helm-v${HELM_VERSION}-linux-amd64.tar.gz),
verify it against a known SHA256 checksum, extract the binary and move it to
/usr/local/bin, and only then remove the tarball and cleanup apt lists; do not
pipe the upstream install script to bash and ensure checksum verification to
prevent supply-chain risks (update the existing RUN command that currently
invokes the Helm install script).
.github/workflows/build_mcp_server.yml-14-14 (1)

14-14: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove continue-on-error from the build job.

Allowing the build job to fail silently defeats the purpose of CI and can mask critical build failures. If the build fails, the workflow should fail.

🔧 Proposed fix
   build:
-    continue-on-error: true

     strategy:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml at line 14, The CI build job
currently sets the YAML key continue-on-error: true which allows failures to be
ignored; remove that key (or set it to false) from the build job definition so
the workflow fails on build errors—look for the build job block that contains
continue-on-error and delete or change that setting to ensure the job's failures
surface.
.github/workflows/build_mcp_server.yml-41-43 (1)

41-43: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update to a supported Rust toolchain action.

The actions-rs/toolchain@v1 action is deprecated and its runner is too old for current GitHub Actions. As per static analysis, update to a maintained alternative.

🔄 Proposed fix using dtolnay/rust-toolchain
       - name: Install stable toolchain
-        uses: actions-rs/toolchain@v1
+        uses: dtolnay/rust-toolchain@stable
         with:
           toolchain: stable
+          targets: ${{ matrix.target }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml around lines 41 - 43, Replace the
deprecated actions-rs/toolchain@v1 usage with a maintained Rust toolchain action
(e.g., dtolnay/rust-toolchain) in the workflow: update the uses entry that
currently references actions-rs/toolchain@v1 and keep the intended
channel/toolchain configuration (toolchain: stable) in the new action's inputs,
adjusting the input key names if the alternative action expects different field
names; then run the workflow to ensure the runner and toolchain install step
succeed.
mcp-server/src/k8s/helm_releases.rs-61-67 (1)

61-67: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

A single malformed Helm secret should not fail the entire discovery response.

collect::<Result<Vec<_>, _>>()? makes discovery brittle: one bad secret blocks all valid releases.

Proposed fix (best-effort decode)
-        Ok(latest_releases(
-            secrets
-                .items
-                .iter()
-                .map(decode_helm_release_secret)
-                .collect::<Result<Vec<_>, _>>()?,
-            include_control_plane,
-        ))
+        let decoded = secrets
+            .items
+            .iter()
+            .filter_map(|secret| decode_helm_release_secret(secret).ok())
+            .collect::<Vec<_>>();
+        Ok(latest_releases(decoded, include_control_plane))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/k8s/helm_releases.rs` around lines 61 - 67, The current use of
collect::<Result<Vec<_>, _>>()? over secrets.items with
decode_helm_release_secret makes the whole discovery fail on one bad secret;
change the flow so you best-effort decode each item: iterate secrets.items,
attempt decode_helm_release_secret for each (calling the same function), collect
only the successful Ok values into a Vec (e.g., via filter_map or flat_map
equivalent) and skip or log Err results, then pass that Vec into latest_releases
along with include_control_plane so a single malformed secret does not break the
entire response.
mcp-server/src/k8s/helm_releases.rs-255-259 (1)

255-259: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not exclude all releases just because they are in the control-plane namespace.

The namespace check causes false positives and hides legitimate non-control-plane releases in that namespace.

Proposed fix
 fn is_control_plane_release(release: &HelmReleaseSummary) -> bool {
     release.name == CONTROL_PLANE_RELEASE_NAME
         || release.chart.name.as_deref() == Some(CONTROL_PLANE_RELEASE_NAME)
-        || release.namespace == CONTROL_PLANE_RELEASE_NAME
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/k8s/helm_releases.rs` around lines 255 - 259, The function
is_control_plane_release in helm_releases.rs is incorrectly treating any release
in the CONTROL_PLANE_RELEASE_NAME namespace as the control plane, causing false
positives; remove the namespace equality check (the condition "||
release.namespace == CONTROL_PLANE_RELEASE_NAME") and only consider release.name
and release.chart.name.as_deref() == Some(CONTROL_PLANE_RELEASE_NAME) when
determining control-plane releases (adjust or add tests for HelmReleaseSummary
if needed).
mcp-server/src/helm.rs-60-60 (1)

60-60: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden Helm values-file handling for secrets.

Lines 84-97 write raw values to disk with default permissions, and Line 60 ignores cleanup failures. If values contain secrets, this risks leakage on shared nodes. Create files with restrictive perms (0600), prefer secure temp-file primitives, and surface/log delete failures.

Also applies to: 84-97

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/helm.rs` at line 60, The code currently writes raw Helm values
to disk with default permissions and silently ignores cleanup
(std::fs::remove_file(&values_path)); fix by creating the values file with
restrictive permissions and using a secure temp-file primitive: replace the raw
write at the "write raw values to disk" block (lines 84-97) with a
tempfile::NamedTempFile (or tempfile::Builder) to create an atomic,
securely-permissioned temporary file (or explicitly create with OpenOptions +
set_mode(0o600) on Unix), write the values into that temp file, then atomically
persist or move it to values_path if needed; also change the remove_file call
that currently ignores errors to check the Result and log or return the error
(e.g., using processLogger/tracing::error) instead of discarding it so deletion
failures are surfaced.
mcp-server/src/helm.rs-59-62 (1)

59-62: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a process-level timeout around Helm execution.

Line 59 can block indefinitely if the Helm process hangs; --timeout does not guarantee subprocess termination in all failure modes. Wrap command execution with tokio::time::timeout and return a structured timeout error.

Suggested change
+use tokio::time::{timeout, Duration};
...
-    let output = Command::new(&helm_bin).args(&args).output().await;
+    let output = timeout(
+        Duration::from_secs(15 * 60),
+        Command::new(&helm_bin).args(&args).output(),
+    )
+    .await
+    .map_err(|_| HelmInstallError::Execute(std::io::Error::new(
+        std::io::ErrorKind::TimedOut,
+        "helm command timed out",
+    )))?;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/helm.rs` around lines 59 - 62, The Helm command execution can
hang indefinitely; wrap the await on
Command::new(&helm_bin).args(&args).output().await in tokio::time::timeout with
a reasonable Duration and map a timeout result to a new structured
HelmInstallError variant (e.g., HelmInstallError::Timeout) so callers get a
clear error; ensure you still remove the temp values file (values_path) and
convert the successful output via output.map_err(HelmInstallError::Execute) as
before, returning the timeout error when tokio::time::timeout returns Err.
mcp-server/src/policy/scopes.rs-23-40 (1)

23-40: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Prevent maintenance errors when adding new scopes.

The all() method manually lists every enum variant. When a new Scope is added to the enum (lines 7-20), the developer must remember to also add it to this array. The compiler won't catch omissions, creating a subtle bug where the new scope won't be included in policy checks that rely on all().

🔧 Consider using a macro or const array

One approach is to use a declarative macro that generates both the enum and the all() method from a single source of truth:

macro_rules! define_scopes {
    ($($variant:ident),* $(,)?) => {
        #[derive(Debug, Clone, Copy, Eq, PartialEq, Ord, PartialOrd, Serialize)]
        #[serde(rename_all = "kebab-case")]
        pub enum Scope {
            $($variant,)*
        }

        impl Scope {
            pub fn all() -> BTreeSet<Self> {
                [$(Self::$variant,)*].into_iter().collect()
            }
        }
    };
}

define_scopes! {
    Discover,
    Debug,
    WorkloadsInstall,
    WorkloadsUpgrade,
    WorkloadsDelete,
    VaultRuntimeMetadata,
    VaultRuntimeRead,
    VaultRuntimeWrite,
    VaultOperatorMetadata,
    VaultOperatorWrite,
    VaultOperatorRead,
    Admin,
}

Alternatively, use the strum crate with EnumIter to generate iteration automatically.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/policy/scopes.rs` around lines 23 - 40, The current
Scope::all() method manually lists every Scope variant so newly added variants
can be omitted; change the design to generate the enum and its all() collection
from a single source of truth (e.g., introduce a define_scopes! macro or use
strum::EnumIter) so adding a variant updates both enum and all() automatically;
update the Scope declaration and the impl Scope::all() to be produced by that
macro (or derive EnumIter and implement all() by iterating over Variant::iter())
and remove the hand-maintained array of variants in all().
mcp-server/src/tools/router.rs-772-777 (1)

772-777: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clamp user-provided limit to a bounded maximum.

list_params accepts any u32 limit, so callers can request very large lists and create avoidable API pressure.

Proposed fix
 fn list_params(arguments: Option<&JsonObject>, default_limit: Option<u32>) -> ResourceListParams {
+    const MAX_LIMIT: u32 = 200;
+    let limit = optional_u32(arguments, "limit")
+        .map(|v| v.min(MAX_LIMIT))
+        .or(default_limit);
+
     ResourceListParams {
         label_selector: optional_string(arguments, "labelSelector"),
         field_selector: optional_string(arguments, "fieldSelector"),
-        limit: optional_u32(arguments, "limit").or(default_limit),
+        limit,
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/router.rs` around lines 772 - 777, list_params currently
accepts any u32 for limit which can cause unbounded requests; change its limit
construction to clamp user-provided values to a safe maximum: introduce a
MAX_LIST_LIMIT constant (e.g., 1000) and compute limit as
optional_u32(arguments, "limit").map(|l|
l.min(MAX_LIST_LIMIT)).or(default_limit) (or equivalent), keeping the rest of
ResourceListParams fields the same; reference the function name list_params, the
types ResourceListParams and optional_u32, and the default_limit parameter when
making this change.
mcp-server/src/tools/router.rs-869-877 (1)

869-877: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Schema type validation is incomplete for JSON Schema primitive types.

Unknown type values currently pass validation (_ => true), so declared array/null constraints are silently ignored.

Proposed fix
         let matches = match expected_type {
             "boolean" => value.is_boolean(),
             "integer" => value.as_i64().is_some(),
             "number" => value.as_f64().is_some(),
             "object" => value.is_object(),
             "string" => value.is_string(),
-            _ => true,
+            "array" => value.is_array(),
+            "null" => value.is_null(),
+            _ => false,
         };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/router.rs` around lines 869 - 877, The type-check branch
around schema.get("type").and_then(Value::as_str) currently treats unknown types
as valid (the _ => true arm) and omits "array" and "null"; update the match for
expected_type in the same block to include "array" => value.is_array() and
"null" => value.is_null(), and change the fallback arm from _ => true to _ =>
false so unknown/unsupported type strings fail validation; keep the rest of the
arms ("boolean", "integer", "number", "object", "string") as-is and ensure you
update the variable referenced as matches to use the new logic.
mcp-server/src/tools/dynamic.rs-33-48 (1)

33-48: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

On error, keep existing definitions instead of clearing them.

When discover_definitions() fails (e.g., Kubernetes temporarily unavailable), the current code returns an empty Vec, which causes all dynamic tools to disappear from the tool list. This degrades user experience—during transient failures, users see tools vanish and may think they're permanently gone. The function should preserve the existing definitions when refresh fails.

🔧 Proposed fix to preserve definitions on error
 pub(crate) async fn refresh(&self) -> bool {
     let definitions = match discover_definitions().await {
         Ok(definitions) => definitions,
         Err(error) => {
             tracing::debug!(%error, "failed to refresh dynamic MCP tools");
-            Vec::new()
+            return false;
         }
     };

     let mut current = self.definitions.write().await;
     let changed = tool_signature(current.as_ref()) != tool_signature(&definitions);
     if changed {
         *current = Arc::new(definitions);
     }
     changed
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/dynamic.rs` around lines 33 - 48, The refresh function
currently replaces definitions with an empty Vec on discover_definitions()
error; instead, when discover_definitions() returns Err, log the error and
return false without modifying self.definitions so existing dynamic tools are
preserved. Update the logic in refresh: call discover_definitions().await, and
on Err(error) do tracing::debug!(%error, "failed to refresh dynamic MCP tools")
(or similar) and return false; only compute tool_signature and update *current =
Arc::new(definitions) when discover_definitions() returns Ok. Reference:
refresh, discover_definitions, self.definitions, and tool_signature.
🟡 Minor comments (7)
mcp-server/src/resources/uri.rs-29-31 (1)

29-31: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add validation to ensure consistency with parser.

The function creates URIs without validating that extension_id is non-empty and contains no / characters. The parser at line 23 rejects such URIs, creating an inconsistency where this function can produce URIs that parse() would reject.

🛡️ Proposed fix to add validation
 pub fn extension_catalog_entry_uri(extension_id: &str) -> String {
+    assert!(!extension_id.is_empty() && !extension_id.contains('/'),
+        "extension_id must be non-empty and cannot contain '/'");
     format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}")
 }

Alternatively, return Option<String> to handle invalid input gracefully:

-pub fn extension_catalog_entry_uri(extension_id: &str) -> String {
+pub fn extension_catalog_entry_uri(extension_id: &str) -> Option<String> {
+    if extension_id.is_empty() || extension_id.contains('/') {
+        return None;
+    }
-    format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}")
+    Some(format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}"))
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/resources/uri.rs` around lines 29 - 31, The
extension_catalog_entry_uri function can produce URIs that the parser rejects;
change its signature to return Option<String>, validate that extension_id is
non-empty and does not contain '/' (e.g., if extension_id.is_empty() ||
extension_id.contains('/') return None), and otherwise return
Some(format!("{EXTENSION_CATALOG_ENTRY_PREFIX}{extension_id}")); also update any
callers to handle the Option and keep behavior consistent with the parse()
function used elsewhere.
extensions/control-plane/scripts/post_install.sh-263-271 (1)

263-271: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Print the correct controller kind in the restart instructions.

Line 271 tells users to restart a Deployment, but this PR deploys supernode-mcp as a StatefulSet. The command shown here will fail as written.

Suggested fix
-Wire the Secret into the Helm-managed MCP Deployment with:
+Wire the Secret into the Helm-managed MCP StatefulSet with:
   helm upgrade ${RELEASE_NAME} . \\
     --namespace ${NAMESPACE} \\
     --reuse-values \\
     --set supernodeMcp.vault.tokenSecretRef.name=${MCP_VAULT_TOKEN_SECRET_NAME} \\
     --set supernodeMcp.vault.tokenSecretRef.key=${MCP_VAULT_TOKEN_SECRET_KEY}

-If the MCP Deployment is already running, restart it after the Helm upgrade:
-  kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart deployment/supernode-mcp
+If the MCP StatefulSet is already running, restart it after the Helm upgrade:
+  kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart statefulset/supernode-mcp
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/scripts/post_install.sh` around lines 263 - 271, The
restart instruction currently references a Deployment but the controller is a
StatefulSet; update the kubectl rollback/restart command shown in the
post-install instructions so it targets the correct kind by replacing "rollout
restart deployment/supernode-mcp" with "rollout restart
statefulset/supernode-mcp" (the line that contains the kubectl ... rollout
restart command for supernode-mcp in post_install.sh).
.github/workflows/test_mcp_server.yml-18-18 (1)

18-18: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename job from lint to test for clarity.

The job is named lint but actually runs tests via cargo test. This is misleading and inconsistent with the workflow name.

📝 Proposed fix
 jobs:
-  lint:
+  test:
     runs-on: ubuntu-latest
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test_mcp_server.yml at line 18, The job currently declared
as lint should be renamed to test to reflect that it runs cargo test: update the
job identifier and any job name fields from "lint" to "test" in the workflow
(the job block starting with "lint:"), and adjust any references or dependent
job names accordingly so the job id and display name match the actual command
(cargo test).
bootstrap/kind/README.md-112-112 (1)

112-112: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix workload type references from Deployment to StatefulSet.

The rollout status commands reference deployment/supernode-mcp (lines 112 and 126), but the control-plane Helm chart deploys supernode-mcp as a StatefulSet (defined in stage4-04-statefulset-supernode-mcp.yaml). These commands should use statefulset/supernode-mcp instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bootstrap/kind/README.md` at line 112, The README currently references
deployment/supernode-mcp but the chart actually creates a StatefulSet; update
both occurrences that mention deployment/supernode-mcp to
statefulset/supernode-mcp so commands like kubectl -n control-plane rollout
status target the correct resource (supernode-mcp) and ensure any other
rollout/status examples use statefulset/... instead of deployment/....
mcp-server/src/config.rs-72-77 (1)

72-77: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject non-positive session TTL values.

Line 76 currently accepts 0 and negative TTLs. Validate TTL as strictly positive during config parsing to avoid silent misconfiguration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/config.rs` around lines 72 - 77, The session_ttl_seconds()
parser currently accepts zero and negative values; after parsing the env var in
session_ttl_seconds(), validate that the parsed i64 is strictly > 0 and return
an Err(ConfigError::InvalidSessionTtl(value)) (or appropriate ConfigError
variant) when value <= 0 instead of returning Ok(Some(value)); implement this by
mapping the parsed result to check the positivity (e.g., replace the direct
parse mapping with a closure that parses then checks value > 0 and returns an
Err on invalid values) while keeping the existing transpose() flow.
mcp-server/src/tools/router.rs-81-82 (1)

81-82: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the invoked tool name when mapping vault write/patch errors.

vault.runtime.patch currently reports failures under "vault.runtime.write", which mislabels audit/error details and complicates debugging.

Proposed fix
-            "vault.runtime.write" => vault_runtime_write(arguments, WriteMode::Replace).await,
-            "vault.runtime.patch" => vault_runtime_write(arguments, WriteMode::Patch).await,
+            "vault.runtime.write" => {
+                vault_runtime_write("vault.runtime.write", arguments, WriteMode::Replace).await
+            }
+            "vault.runtime.patch" => {
+                vault_runtime_write("vault.runtime.patch", arguments, WriteMode::Patch).await
+            }

 async fn vault_runtime_write(
+    tool_name: &str,
     arguments: Option<&JsonObject>,
     default_mode: WriteMode,
 ) -> CallToolResult {
@@
-        Err(error) => return vault_error("vault.runtime.write", error),
+        Err(error) => return vault_error(tool_name, error),
@@
-        Err(error) => return vault_error("vault.runtime.write", error),
+        Err(error) => return vault_error(tool_name, error),
@@
-        Err(error) => vault_error("vault.runtime.write", error),
+        Err(error) => vault_error(tool_name, error),
     }
 }

Also applies to: 116-145

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/router.rs` around lines 81 - 82, The error/audit entries
for the vault write/patch branches are mislabelled because both match arms call
vault_runtime_write(arguments, WriteMode::...) but still record the tool as
"vault.runtime.write"; update the router match arms for "vault.runtime.write"
and "vault.runtime.patch" so the invoked tool name is passed into or set on the
call site (e.g., pass an extra parameter like tool_name or set a context) before
calling vault_runtime_write, or make vault_runtime_write accept and use the tool
name when composing errors/audit logs; ensure both branches
("vault.runtime.write" and "vault.runtime.patch") supply their correct string so
failures are mapped to the actual invoked tool name.
mcp-server/src/tools/workloads/mod.rs-24-25 (1)

24-25: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

workloads.list input schema is missing supported filter fields.

The handler accepts namespace, labelSelector, fieldSelector, and limit, but schema currently rejects/omits them. This creates avoidable client-contract drift.

Proposed fix
-            input_schema: r#"{"type":"object","properties":{"includeControlPlane":{"type":"boolean"}},"additionalProperties":false}"#,
+            input_schema: r#"{"type":"object","properties":{"namespace":{"type":"string"},"includeControlPlane":{"type":"boolean"},"labelSelector":{"type":"string"},"fieldSelector":{"type":"string"},"limit":{"type":"integer","minimum":1,"maximum":200}},"additionalProperties":false}"#,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/workloads/mod.rs` around lines 24 - 25, The input_schema
for the workloads.list handler currently omits supported filter fields, causing
client/schema drift; update the input_schema JSON (the value assigned to
input_schema in the workloads.list handler) to include properties for
"namespace" (string), "labelSelector" (string), "fieldSelector" (string) and
"limit" (integer), keep additionalProperties as false and ensure the root type
remains "object" so the handler's accepted params (namespace, labelSelector,
fieldSelector, limit) validate against the schema.
🧹 Nitpick comments (8)
mcp-server/src/catalog/mod.rs (1)

42-44: ⚡ Quick win

Add is_empty() method to complement len().

Rust convention recommends providing an is_empty() method whenever len() is exposed, which avoids the clippy::len_without_is_empty lint and improves API ergonomics.

✨ Proposed addition
     pub fn len(&self) -> usize {
         self.extensions.len()
     }
+
+    pub fn is_empty(&self) -> bool {
+        self.extensions.is_empty()
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/catalog/mod.rs` around lines 42 - 44, Add a complementary
is_empty() method to the same impl that exposes len(): implement pub fn
is_empty(&self) -> bool that forwards to self.extensions.is_empty() (or returns
self.extensions.len() == 0) so callers can check emptiness and to satisfy the
clippy::len_without_is_empty lint; add it alongside the existing pub fn
len(&self) -> usize.
mcp-server/Dockerfile (1)

3-7: ⚡ Quick win

Optimize Docker layer caching by separating dependency and source builds.

The current approach copies all source files before building, which invalidates the cache on every source change. Building dependencies in a separate layer first would significantly speed up rebuilds.

⚡ Proposed optimization
 WORKDIR /app
 COPY Cargo.toml Cargo.lock ./
-COPY src ./src

+# Build dependencies first (creates a cache layer)
+RUN mkdir src && echo "fn main() {}" > src/main.rs \
+    && cargo build --release \
+    && rm -rf src
+
+# Now copy real source and build
+COPY src ./src
 RUN cargo build --release

This technique creates a dummy main.rs, builds dependencies, then replaces with real source. Dependencies are cached until Cargo.toml or Cargo.lock change.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/Dockerfile` around lines 3 - 7, The Dockerfile copies the entire
src before building which busts the layer cache on any source change; to fix,
implement a multi-step build: first COPY Cargo.toml and Cargo.lock and add a
minimal/dummy src/main.rs (or touch a placeholder) then RUN cargo build
--release to cache dependency compilation, then COPY the real src/ (replacing
the dummy) and RUN cargo build --release again to produce the final artifact;
update the Dockerfile steps that reference COPY Cargo.toml Cargo.lock ./, COPY
src ./src and RUN cargo build --release to follow this dependency-first pattern
so rebuilding only invalidates when Cargo.toml or Cargo.lock change.
.github/workflows/build_mcp_server.yml (1)

94-95: ⚡ Quick win

Consider removing the unmaintained Docker layer caching action.

The satackey/action-docker-layer-caching action is deprecated and unmaintained. Docker Buildx (already in use at line 78) provides built-in caching via BuildKit that is more reliable and actively maintained.

♻️ Proposed fix
       - uses: satackey/action-docker-layer-caching@v0.0.11
-        continue-on-error: true

Remove these lines entirely. The docker/build-push-action@v5 already uses BuildKit caching.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml around lines 94 - 95, Remove the
deprecated satackey/action-docker-layer-caching step: delete the action entry
"uses: satackey/action-docker-layer-caching@v0.0.11" and its "continue-on-error:
true" configuration from the workflow; rely on the existing
docker/build-push-action@v5 / Docker Buildx caching already configured (the
buildx step referenced in the workflow) instead of adding this unmaintained
action.
.github/workflows/test_mcp_server.yml (1)

23-28: ⚡ Quick win

Consider adding Rust caching to speed up CI.

The build workflow uses Swatinem/rust-cache@v2 to cache dependencies. Adding the same caching here would speed up test runs.

⚡ Proposed addition
     steps:
       - name: Checkout
         uses: actions/checkout@v4
+
+      - uses: Swatinem/rust-cache@v2
+        with:
+          shared-key: "test"
+          workspaces: mcp-server -> target

       - name: Test
         run: cargo test
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test_mcp_server.yml around lines 23 - 28, Add a Rust
dependency cache step before the existing "Test" step to speed up CI: insert a
step (e.g., name: "Setup Rust cache") that uses Swatinem/rust-cache@v2 and
configures caching for cargo (registry, index, and target/build artifacts) keyed
on Cargo.lock (or equivalent) so subsequent "Test" (cargo test) runs reuse
dependencies/build outputs; reference the new step name "Setup Rust cache" and
the existing "Test" step when placing it in the workflow.
.github/workflows/check_mcp_server.yml (1)

23-28: ⚡ Quick win

Consider adding Rust caching to speed up CI.

The build workflow uses Swatinem/rust-cache@v2 to cache dependencies. Adding the same caching here would speed up the clippy checks.

⚡ Proposed addition
     steps:
       - name: Checkout
         uses: actions/checkout@v4
+
+      - uses: Swatinem/rust-cache@v2
+        with:
+          shared-key: "lint"
+          workspaces: mcp-server -> target

       - name: Clippy check lints
         run: cargo clippy -- -D warnings
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/check_mcp_server.yml around lines 23 - 28, The workflow
currently runs the "Clippy check lints" step without restoring a Rust/Cargo
cache, which slows CI; add a step using Swatinem/rust-cache@v2 before the clippy
run to restore and save Cargo cache (registry, git index, and target) so
subsequent runs are faster — insert a new step named e.g. "Restore Rust cache"
that uses Swatinem/rust-cache@v2 with appropriate inputs (rust-version or
default) and ensure the cache step runs before the existing "Clippy check lints"
step and will also save the cache after the job completes.
mcp-server/src/audit/event.rs (1)

60-60: ⚡ Quick win

Consider whether secret_values_included should be parameterized.

The field is currently hardcoded to false in the factory method. If there are scenarios where secret values should be included in audit events (e.g., in non-production environments or with special audit flags), you may want to add a parameter to from_policy_outcome rather than hardcoding this value.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/audit/event.rs` at line 60, The factory currently hardcodes
secret_values_included: false in from_policy_outcome; make it configurable by
adding a parameter (e.g., secret_values_included: bool) to the
from_policy_outcome signature, use that parameter when constructing the Event
(replace the hardcoded false), and update all call sites to pass the appropriate
flag (or provide an overloaded/optional default wrapper that supplies false for
existing callers); ensure the field and any docs/tests reflect the new
parameter.
mcp-server/src/session/sqlite.rs (1)

77-94: ⚖️ Poor tradeoff

Expired session cleanup runs on every load.

The delete_expired call on line 80 runs for every load() invocation. If expired sessions accumulate, this could add latency to session lookups. Consider whether periodic background cleanup (via a separate task or cron job) would be more efficient for production workloads with high session churn.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/session/sqlite.rs` around lines 77 - 94, Expired-session
cleanup is being executed on every load (delete_expired inside load), which can
add latency; remove the delete_expired call from async fn load(&self, ...) and
instead perform cleanup from a dedicated background mechanism that invokes
delete_expired via the existing with_connection(...) helper: either spawn a
periodic async task (e.g., tokio::spawn) at store initialization to call
delete_expired on an interval, or implement a lightweight probabilistic/interval
guard inside the session store (track last_cleanup timestamp in the store struct
and only run delete_expired when enough time has passed) so load() only performs
the SELECT logic and returns state_json without running delete_expired each
call.
mcp-server/src/tools/workloads/dolos.rs (1)

150-155: ⚡ Quick win

Consider logging scale-back failures during error recovery.

When wait_for_dolos_pods_absent times out or PVC deletion fails, the code attempts to restore the original replica count but silently ignores errors with let _ = .... If this recovery fails, the StatefulSet remains scaled to zero, and there's no indication in the logs. While returning the original error to the user is correct, logging recovery failures would aid debugging.

📊 Add recovery failure logging
 if let Err(error) = wait_for_dolos_pods_absent(&client, &namespace, &release_name).await {
-    let _ = client
+    if let Err(recovery_error) = client
         .scale_stateful_set(&namespace, &stateful_set_name, original_replicas)
-        .await;
+        .await
+    {
+        tracing::warn!(
+            %recovery_error,
+            namespace,
+            stateful_set = stateful_set_name,
+            "failed to restore StatefulSet replicas after timeout"
+        );
+    }
     return error;
 }

Also applies to: 161-165

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/workloads/dolos.rs` around lines 150 - 155, The recovery
path that calls client.scale_stateful_set(&namespace, &stateful_set_name,
original_replicas).await currently ignores errors; change it to capture the
result and, on Err(e), log a clear recovery-failure message including the
namespace, stateful_set_name and original_replicas along with the recovery error
before returning the original error from wait_for_dolos_pods_absent; apply the
same change to the other identical block (lines 161-165) so scale-back failures
are recorded for debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5f3be41-6308-4a52-b115-be6598276fb5

📥 Commits

Reviewing files that changed from the base of the PR and between 637efa2 and da7019c.

⛔ Files ignored due to path filters (1)
  • mcp-server/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (70)
  • .github/image/Dockerfile
  • .github/workflows/build_mcp_server.yml
  • .github/workflows/check_mcp_server.yml
  • .github/workflows/test_mcp_server.yml
  • .gitignore
  • bootstrap/kind/README.md
  • extensions/cardano-node/templates/configmap-metrics.yaml
  • extensions/control-plane/README.md
  • extensions/control-plane/scripts/post_install.sh
  • extensions/control-plane/templates/stage4-00-pvc-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-01-serviceaccount-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-03-clusterrolebinding-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-04-statefulset-supernode-mcp.yaml
  • extensions/control-plane/templates/stage4-05-service-supernode-mcp.yaml
  • extensions/control-plane/values.yaml
  • mcp-server/.dockerignore
  • mcp-server/Cargo.toml
  • mcp-server/Dockerfile
  • mcp-server/src/audit/event.rs
  • mcp-server/src/audit/mod.rs
  • mcp-server/src/audit/sink.rs
  • mcp-server/src/auth/mod.rs
  • mcp-server/src/catalog/cardano_node_relay.rs
  • mcp-server/src/catalog/dolos.rs
  • mcp-server/src/catalog/extension.rs
  • mcp-server/src/catalog/mod.rs
  • mcp-server/src/catalog/schema.rs
  • mcp-server/src/config.rs
  • mcp-server/src/errors.rs
  • mcp-server/src/helm.rs
  • mcp-server/src/k8s/client.rs
  • mcp-server/src/k8s/helm_releases.rs
  • mcp-server/src/k8s/mod.rs
  • mcp-server/src/main.rs
  • mcp-server/src/mcp.rs
  • mcp-server/src/policy/approvals.rs
  • mcp-server/src/policy/mod.rs
  • mcp-server/src/policy/roles.rs
  • mcp-server/src/policy/scopes.rs
  • mcp-server/src/prompts/catalog.rs
  • mcp-server/src/prompts/mod.rs
  • mcp-server/src/resources/mod.rs
  • mcp-server/src/resources/router.rs
  • mcp-server/src/resources/uri.rs
  • mcp-server/src/server.rs
  • mcp-server/src/session/mod.rs
  • mcp-server/src/session/sqlite.rs
  • mcp-server/src/tools/common.rs
  • mcp-server/src/tools/dynamic.rs
  • mcp-server/src/tools/k8s_summaries.rs
  • mcp-server/src/tools/mod.rs
  • mcp-server/src/tools/router.rs
  • mcp-server/src/tools/supernode.rs
  • mcp-server/src/tools/vault.rs
  • mcp-server/src/tools/workloads/dolos.rs
  • mcp-server/src/tools/workloads/install.rs
  • mcp-server/src/tools/workloads/logs.rs
  • mcp-server/src/tools/workloads/metrics.rs
  • mcp-server/src/tools/workloads/mod.rs
  • mcp-server/src/tools/workloads/outputs.rs
  • mcp-server/src/tools/workloads/registry.rs
  • mcp-server/src/vault/client.rs
  • mcp-server/src/vault/mod.rs
  • mcp-server/src/vault/paths.rs
  • mcp-server/src/vault/redaction.rs
  • skills/README.md
  • skills/cardano-relay-setup.md
  • skills/dolos-supernode-deployment.md
  • skills/workload-output-port-forward.md

Comment on lines +18 to +27
include:
- release_for: Linux-x86_64
build_on: ubuntu-22.04
target: x86_64-unknown-linux-gnu
args: "--locked --release"

- release_for: Linux-arm64
build_on: ubuntu-22.04-arm
target: "aarch64-unknown-linux-gnu"
args: "--locked --release"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Add missing ext property to matrix entries.

Lines 50-51 and 57 reference matrix.ext, but the matrix entries do not define this property, causing a workflow syntax error.

🐛 Proposed fix
         include:
           - release_for: Linux-x86_64
             build_on: ubuntu-22.04
             target: x86_64-unknown-linux-gnu
             args: "--locked --release"
+            ext: ""

           - release_for: Linux-arm64
             build_on: ubuntu-22.04-arm
             target: "aarch64-unknown-linux-gnu"
             args: "--locked --release"
+            ext: ""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
include:
- release_for: Linux-x86_64
build_on: ubuntu-22.04
target: x86_64-unknown-linux-gnu
args: "--locked --release"
- release_for: Linux-arm64
build_on: ubuntu-22.04-arm
target: "aarch64-unknown-linux-gnu"
args: "--locked --release"
include:
- release_for: Linux-x86_64
build_on: ubuntu-22.04
target: x86_64-unknown-linux-gnu
args: "--locked --release"
ext: ""
- release_for: Linux-arm64
build_on: ubuntu-22.04-arm
target: "aarch64-unknown-linux-gnu"
args: "--locked --release"
ext: ""
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml around lines 18 - 27, The matrix
entries under include (the objects with keys release_for, build_on, target,
args) are missing the ext property referenced elsewhere as matrix.ext; add an
ext field to each included matrix entry (e.g., add ext: "<appropriate
extension>" to the Linux-x86_64 and Linux-arm64 include objects) so matrix.ext
resolves correctly where used on lines referencing matrix.ext.

Comment on lines +63 to +68
strategy:
matrix:
include:
- tags: ghcr.io/txpipe/metis-supernode-mcp,ghcr.io/txpipe/metis-supernode-mcp:${{ github.sha }}
binary: supernode-mcp

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Move dynamic tag generation out of the matrix definition.

GitHub Actions does not interpolate expressions like ${{ github.sha }} inside matrix definitions. The SHA tag will be literal text, not the actual commit SHA.

🔧 Proposed fix

Move tag generation to a step that sets an output or environment variable, then reference it in the build step:

     strategy:
       matrix:
         include:
-          - tags: ghcr.io/txpipe/metis-supernode-mcp,ghcr.io/txpipe/metis-supernode-mcp:${{ github.sha }}
-            binary: supernode-mcp
+          - binary: supernode-mcp

     permissions:
       contents: read
       packages: write

     steps:
       - name: Checkout repository
         uses: actions/checkout@v4
+
+      - name: Prepare tags
+        id: tags
+        run: |
+          echo "tags=ghcr.io/txpipe/metis-supernode-mcp,ghcr.io/txpipe/metis-supernode-mcp:${{ github.sha }}" >> $GITHUB_OUTPUT

       - name: Set up Docker Buildx
         uses: docker/setup-buildx-action@v3

Then update line 107:

-          tags: ${{ matrix.tags }}
+          tags: ${{ steps.tags.outputs.tags }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build_mcp_server.yml around lines 63 - 68, The matrix
currently hardcodes tags including the expression `${{ github.sha }}` which
won’t be interpolated; move dynamic tag generation out of the matrix by
replacing the matrix "tags" entry with a static placeholder (or remove the
SHA-specific tag) and add a prior step that computes the tags string using `${{
github.sha }}` and exposes it as an output or env var (e.g., set-output/tag or
echo to GITHUB_ENV). Then update the build/publish step that uses the matrix
`tags` (the step that references `tags`/`binary`) to instead consume the
computed output/env variable so the actual commit SHA is included in the runtime
tag list.

Comment thread mcp-server/src/server.rs
Comment on lines +52 to +55
let auth_context = match auth_mode {
AuthMode::Trusted => AuthContext::trusted(),
AuthMode::OAuth => anyhow::bail!("MCP_AUTH_MODE=oauth is not implemented yet"),
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fail closed when secure auth is unavailable.

Line 54 rejects OAuth, so the server effectively runs only in trusted mode. Combined with non-loopback binds, this creates a privileged unauthenticated control surface. Add a startup guard that forbids trusted mode on non-loopback addresses (or require explicit insecure opt-in).

Suggested mitigation
 fn router(config: Config, cancellation_token: CancellationToken) -> anyhow::Result<Router> {
     let auth_mode = config.auth_mode;
+    if matches!(auth_mode, AuthMode::Trusted) && !config.bind_addr.ip().is_loopback() {
+        anyhow::bail!(
+            "trusted auth mode is only allowed on loopback interfaces; configure secure auth before exposing the service"
+        );
+    }
     let auth_context = match auth_mode {
         AuthMode::Trusted => AuthContext::trusted(),
         AuthMode::OAuth => anyhow::bail!("MCP_AUTH_MODE=oauth is not implemented yet"),
     };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/server.rs` around lines 52 - 55, The code allows
AuthMode::Trusted unconditionally, which creates an unauthenticated control
surface when the server is bound to non-loopback addresses; update startup to
detect when auth_mode == AuthMode::Trusted and the server bind/listen address
(e.g., bind_addr or listen_addr used when starting the server) is not a loopback
IP, and then fail-fast with an error (or require an explicit insecure opt-in
flag/env var such as INSECURE_ALLOW_TRUSTED) instead of proceeding; ensure this
guard lives alongside the existing auth_mode match where AuthContext::trusted()
is created so the process exits on unsafe configuration.

Comment on lines +66 to +70
pub async fn runtime_metadata(
&self,
path: &VaultPath,
) -> Result<VaultSecretMetadata, VaultError> {
let metadata_response = self.get_metadata(path).await?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Enforce Runtime path kind in runtime-scoped Vault APIs.

These methods currently accept operator paths too, which breaks the runtime/operator boundary and can allow unintended secret domain access.

Proposed fix
 use crate::vault::paths::VaultPath;
+use crate::vault::paths::VaultPathKind;
@@
     pub async fn runtime_metadata(
         &self,
         path: &VaultPath,
     ) -> Result<VaultSecretMetadata, VaultError> {
+        ensure_runtime_path(path)?;
         let metadata_response = self.get_metadata(path).await?;
         let secret_response = self.get_secret(path).await?;
@@
     pub async fn write_runtime_secret(
         &self,
         path: &VaultPath,
         secret: &SecretObject,
         mode: WriteMode,
     ) -> Result<VaultWriteReceipt, VaultError> {
+        ensure_runtime_path(path)?;
         let data = match mode {
@@
 }
+
+fn ensure_runtime_path(path: &VaultPath) -> Result<(), VaultError> {
+    if path.kind() != VaultPathKind::Runtime {
+        return Err(VaultError::Path(crate::vault::paths::VaultPathError::NotAllowed));
+    }
+    Ok(())
+}

Also applies to: 87-92

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/vault/client.rs` around lines 66 - 70, The runtime-scoped
Vault APIs (e.g., runtime_metadata) currently accept any VaultPath; update these
functions to validate that the provided VaultPath.kind == VaultPathKind::Runtime
(or equivalent enum) and return a VaultError (or Err) if not Runtime to prevent
operator paths from being used; add the same validation at the start of other
runtime methods referenced (e.g., runtime_get, runtime_put around lines 87–92)
so each function checks the path kind before calling
get_metadata/get_secret/put_secret and aborts with a clear error when the path
is not Runtime.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
extensions/hydra-node/templates/podmonitor.yaml (1)

1-15: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate PodMonitor rendering behind CRD availability (and ideally a values flag).

Rendering this unconditionally can fail Helm installs when monitoring.coreos.com/v1 CRDs are not installed.

💡 Suggested patch
+{{- if and (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1/PodMonitor") (default true .Values.monitoring.podMonitor.enabled) }}
 apiVersion: monitoring.coreos.com/v1
 kind: PodMonitor
 metadata:
   name: {{ include "hydra-node.fullname" . }}
@@
   podMetricsEndpoints:
     - port: monitoring
       path: /metrics
+{{- end }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/hydra-node/templates/podmonitor.yaml` around lines 1 - 15, Wrap
the PodMonitor template in a conditional so it only renders when the CRD is
available and when the chart value enables it: check
.Capabilities.APIVersions.Has "monitoring.coreos.com/v1/PodMonitor" and an
opt-in flag like .Values.monitoring.enabled (or similar) and surround the entire
PodMonitor manifest (the resource with kind: PodMonitor and name from include
"hydra-node.fullname") with if/else/end so Helm skips rendering when the CRD or
flag is absent.
🧹 Nitpick comments (1)
skills/hydra-node-troubleshooting.md (1)

47-47: ⚡ Quick win

Make internal doc references clickable for faster navigation.

These two references are currently plain inline code; converting them to relative Markdown links improves operator workflow in GitHub/docs viewers.

✍️ Suggested doc tweak
-If deeper API state is needed, use `workloads.get` to discover the `api` output and use `workload-output-port-forward.md` to expose it locally. Then query Hydra's API through the port-forward:
+If deeper API state is needed, use `workloads.get` to discover the `api` output and use [workload-output-port-forward.md](./workload-output-port-forward.md) to expose it locally. Then query Hydra's API through the port-forward:
...
-Sideloading the last confirmed snapshot can recover a stuck local state, but it must be coordinated by participants. See `hydra-head-operations.md` for the API procedure.
+Sideloading the last confirmed snapshot can recover a stuck local state, but it must be coordinated by participants. See [hydra-head-operations.md](./hydra-head-operations.md) for the API procedure.

Also applies to: 91-91

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/hydra-node-troubleshooting.md` at line 47, Convert the inline code
mentions of workloads.get and workload-output-port-forward.md into relative
Markdown links so readers can click through; update the sentence that references
`workloads.get` and `workload-output-port-forward.md` to use link syntax
pointing to the corresponding internal docs (e.g.
[workloads.get](relative/path/to/workloads.get) and
[workload-output-port-forward.md](relative/path/to/workload-output-port-forward.md)),
and apply the same change to the other occurrence noted at lines 91-91 to make
both references clickable in GitHub/docs viewers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@extensions/control-plane/scripts/post_install.sh`:
- Line 277: The rollout restart command is targeting the wrong resource type:
change the kubectl invocation that currently restarts "deployment/supernode-mcp"
to restart the StatefulSet instead by replacing "deployment/supernode-mcp" with
"statefulset/supernode-mcp" in the line containing "kubectl
${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart deployment/supernode-mcp"
so the command matches the actual resource defined in
stage4-04-statefulset-supernode-mcp.yaml.

In `@extensions/hydra-node/templates/configmap-metrics.yaml`:
- Around line 28-31: The http_get() helper makes blocking HTTP calls without
timeouts; update the http_get function to add explicit network timeouts/retries
for both curl and wget (e.g., set connect and overall timeouts and avoid
infinite retries) so a stalled localhost endpoint can't hang metrics
collection—modify the curl invocation in http_get to include appropriate
--connect-timeout and --max-time flags and modify the wget invocation to include
--timeout and --tries (or equivalent) flags and preserve the existing
silent/fail behavior.

In `@extensions/hydra-node/templates/statefulset.yaml`:
- Around line 149-157: The template currently skips invalid entries in
.Values.node.peers silently; modify the range over $peer (range $peer :=
.Values.node.peers) to add a final else branch that calls the Helm fail function
with a clear message (e.g. fail (printf "invalid .Values.node.peers entry: %v"
$peer)) so rendering stops on malformed entries instead of continuing silently;
reference the existing branches that check kindIs "string" $peer and the else if
with $peer.host $peer.port and add the failing else to surface the bad config.

In `@mcp-server/src/catalog/hydra_node.rs`:
- Around line 58-62: The required_when expression for the "tlsKey" schema entry
(and the other TLS-related entries in the same hydra node section) incorrectly
references "api.tls == true" which does not exist; update the required_when
expressions for tlsKey and the related fields between lines 94-240 to reference
the actual config field that controls the unauthenticated Hydra API TLS (replace
"api.tls == true" with the real schema path used in the codebase, e.g., the
correct top-level or nested field like "tls == true" or "unauthenticated_api.tls
== true"), and ensure the boolean equality syntax matches other required_when
usages in this file.
- Around line 158-169: The "peers" schema currently only accepts object entries,
but deployment templates accept both "host:port" strings and {host, port}
objects; update the "peers" items definition to allow either form by replacing
the single object type with a oneOf that permits a string (matching "host:port"
via a regex/pattern) or the existing object with required "host" and "port"
properties (keeping additionalProperties: false), and update the "peers"
description accordingly so both string and object representations are validated;
refer to the "peers" items, "host", and "port" symbols when making this change.

In `@mcp-server/src/helm.rs`:
- Around line 211-218: The code uses a timestamp-based 'unique' to build 'path'
and write the values file, which races under concurrent requests; replace the
timestamp filename scheme with an atomic temp-file creation (e.g.
tempfile::Builder::tempfile_in(dir) or use a UUID like Uuid::new_v4() for the
filename) and write into that temp file instead of constructing 'path' manually
so creation is atomic and cannot collide; update the logic that currently calls
std::fs::write(&path, payload) and the error mapping
(HelmValuesFileError::WriteValues) to operate on the tempfile handle or the
chosen unique UUID filename to avoid overwrites and races.
- Around line 113-203: The Helm invocations in install, upgrade, and uninstall
run Command::output().await with no timeout; wrap each call in
tokio::time::timeout(...) using a configurable Duration constant (e.g.
HELM_CMD_TIMEOUT) so a hung Helm process returns quickly; handle timeout by
mapping the timeout error into the function's error type (add a Timeout variant
to HelmInstallError/HelmUpgradeError/HelmUninstallError or return an appropriate
Execute/Timeout mapping), keep the existing mapping for Command::output() errors
(currently mapped via Helm*Error::Execute), and ensure the temporary values file
removal (values_path) still happens on timeout/failure; update the calls that
assign to output (the variables named output after
Command::new(&helm_bin).args(&args).output().await) to handle the timeout result
before converting stdout/stderr and building the command vector.

In `@mcp-server/src/tools/hydra.rs`:
- Around line 58-79: The current flow does a check-then-write
(client.runtime_metadata(...) then later patch) which has a race; replace it
with an atomic conditional write instead: when overwrite==false, fetch the
current metadata once (runtime_metadata) to obtain the metadata version or
existence info and then call the client's write/patch method with an expectation
(CAS/if-match/if-none-match) so the operation will fail if the keys were created
concurrently; reference the existing symbols runtime_metadata, signing_key_name,
verification_key_name and the later patch/write call (the code path around the
current write at lines ~106–108) and use the client's
conditional/compare-and-set option (or pass the retrieved version) to perform a
single atomic update that errors if conflicting keys appeared between the read
and write.
- Around line 129-137: The code currently generates 16 bytes and hashes them in
generate_hydra_key_pair/hydra_key_pair_from_entropy which reduces effective
entropy; change generate_hydra_key_pair to fill a [u8; 32] with CSPRNG
(getrandom) and pass that 32-byte array directly into
hydra_key_pair_from_entropy (or rename accordingly) so the SigningKey is
constructed from the raw 32-byte seed instead of Sha256(entropy); alternatively,
if you must expand limited entropy, use a KDF (e.g., HKDF) explicitly rather
than a plain hash—update function signatures (entropy types) and remove the
Sha256-based stretching in hydra_key_pair_from_entropy and call
SigningKey::from_bytes with the 32-byte seed.

In `@mcp-server/src/tools/workloads/upgrade.rs`:
- Around line 250-258: The schema type check block that matches on expected_type
(using variables schema, expected_type, value and the local matches binding)
currently omits the "array" branch, allowing non-array values to pass; add an
"array" arm to the match that returns value.is_array() so arrays are validated
correctly (i.e., update the match in upgrade.rs where expected_type is matched
to include "array" => value.is_array()).

---

Outside diff comments:
In `@extensions/hydra-node/templates/podmonitor.yaml`:
- Around line 1-15: Wrap the PodMonitor template in a conditional so it only
renders when the CRD is available and when the chart value enables it: check
.Capabilities.APIVersions.Has "monitoring.coreos.com/v1/PodMonitor" and an
opt-in flag like .Values.monitoring.enabled (or similar) and surround the entire
PodMonitor manifest (the resource with kind: PodMonitor and name from include
"hydra-node.fullname") with if/else/end so Helm skips rendering when the CRD or
flag is absent.

---

Nitpick comments:
In `@skills/hydra-node-troubleshooting.md`:
- Line 47: Convert the inline code mentions of workloads.get and
workload-output-port-forward.md into relative Markdown links so readers can
click through; update the sentence that references `workloads.get` and
`workload-output-port-forward.md` to use link syntax pointing to the
corresponding internal docs (e.g.
[workloads.get](relative/path/to/workloads.get) and
[workload-output-port-forward.md](relative/path/to/workload-output-port-forward.md)),
and apply the same change to the other occurrence noted at lines 91-91 to make
both references clickable in GitHub/docs viewers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c3302455-7556-4088-9f99-bb5162b3e3ec

📥 Commits

Reviewing files that changed from the base of the PR and between da7019c and fc88182.

⛔ Files ignored due to path filters (1)
  • mcp-server/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
  • extensions/control-plane/scripts/post_install.sh
  • extensions/control-plane/templates/stage4-02-clusterrole-supernode-mcp.yaml
  • extensions/hydra-node/Chart.yaml
  • extensions/hydra-node/README.md
  • extensions/hydra-node/ci/values-offline-existing.yaml
  • extensions/hydra-node/ci/values-offline-inline.yaml
  • extensions/hydra-node/ci/values-online-inline.yaml
  • extensions/hydra-node/ci/values-vault-static-secret.yaml
  • extensions/hydra-node/templates/_helpers.tpl
  • extensions/hydra-node/templates/configmap-metrics.yaml
  • extensions/hydra-node/templates/podmonitor.yaml
  • extensions/hydra-node/templates/secret-cardano-signing.yaml
  • extensions/hydra-node/templates/secret-hydra-signing.yaml
  • extensions/hydra-node/templates/serviceaccount-vault-auth.yaml
  • extensions/hydra-node/templates/statefulset.yaml
  • extensions/hydra-node/templates/vaultstaticsecrets.yaml
  • extensions/hydra-node/values.yaml
  • mcp-server/Cargo.toml
  • mcp-server/src/catalog/extension.rs
  • mcp-server/src/catalog/hydra_node.rs
  • mcp-server/src/catalog/mod.rs
  • mcp-server/src/helm.rs
  • mcp-server/src/tools/hydra.rs
  • mcp-server/src/tools/mod.rs
  • mcp-server/src/tools/router.rs
  • mcp-server/src/tools/workloads/delete.rs
  • mcp-server/src/tools/workloads/install.rs
  • mcp-server/src/tools/workloads/metrics.rs
  • mcp-server/src/tools/workloads/mod.rs
  • mcp-server/src/tools/workloads/outputs.rs
  • mcp-server/src/tools/workloads/registry.rs
  • mcp-server/src/tools/workloads/upgrade.rs
  • skills/README.md
  • skills/hydra-head-operations.md
  • skills/hydra-node-deployment.md
  • skills/hydra-node-troubleshooting.md
💤 Files with no reviewable changes (2)
  • extensions/hydra-node/templates/secret-hydra-signing.yaml
  • extensions/hydra-node/templates/secret-cardano-signing.yaml
✅ Files skipped from review due to trivial changes (1)
  • extensions/hydra-node/Chart.yaml
🚧 Files skipped from review as they are similar to previous changes (8)
  • mcp-server/src/tools/mod.rs
  • mcp-server/Cargo.toml
  • mcp-server/src/tools/workloads/registry.rs
  • mcp-server/src/tools/workloads/metrics.rs
  • mcp-server/src/tools/workloads/outputs.rs
  • mcp-server/src/tools/workloads/mod.rs
  • mcp-server/src/tools/router.rs
  • mcp-server/src/tools/workloads/install.rs

--set supernodeMcp.vault.tokenSecretRef.key=${MCP_VAULT_TOKEN_SECRET_KEY}

If the MCP Deployment is already running, restart it after the Helm upgrade:
kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart deployment/supernode-mcp
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Incorrect resource type: MCP is a StatefulSet, not a Deployment.

The rollout restart command references deployment/supernode-mcp, but according to stage4-04-statefulset-supernode-mcp.yaml in this PR, the MCP is deployed as a StatefulSet. This command will fail with a "not found" error.

Proposed fix
-  kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart deployment/supernode-mcp
+  kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart statefulset/supernode-mcp
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart deployment/supernode-mcp
kubectl ${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart statefulset/supernode-mcp
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/control-plane/scripts/post_install.sh` at line 277, The rollout
restart command is targeting the wrong resource type: change the kubectl
invocation that currently restarts "deployment/supernode-mcp" to restart the
StatefulSet instead by replacing "deployment/supernode-mcp" with
"statefulset/supernode-mcp" in the line containing "kubectl
${KUBECTL_CONTEXT_TEXT}-n ${NAMESPACE} rollout restart deployment/supernode-mcp"
so the command matches the actual resource defined in
stage4-04-statefulset-supernode-mcp.yaml.

Comment on lines +28 to +31
http_get() {
url="$1"
curl -s --fail "$url" 2>/dev/null || wget -qO- "$url" 2>/dev/null || true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add network timeouts to prevent hanging metrics collection.

Line 30 performs blocking HTTP calls with no timeout. If localhost endpoints stall, this script can hang and break scrape freshness.

💡 Suggested patch
 http_get() {
   url="$1"
-  curl -s --fail "$url" 2>/dev/null || wget -qO- "$url" 2>/dev/null || true
+  curl -sS --fail --connect-timeout 2 --max-time 5 "$url" 2>/dev/null \
+    || wget -qO- --timeout=5 "$url" 2>/dev/null \
+    || true
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/hydra-node/templates/configmap-metrics.yaml` around lines 28 - 31,
The http_get() helper makes blocking HTTP calls without timeouts; update the
http_get function to add explicit network timeouts/retries for both curl and
wget (e.g., set connect and overall timeouts and avoid infinite retries) so a
stalled localhost endpoint can't hang metrics collection—modify the curl
invocation in http_get to include appropriate --connect-timeout and --max-time
flags and modify the wget invocation to include --timeout and --tries (or
equivalent) flags and preserve the existing silent/fail behavior.

Comment on lines +149 to +157
{{- range $peer := .Values.node.peers }}
{{- if kindIs "string" $peer }}
{{- $args = append $args "--peer" }}
{{- $args = append $args $peer }}
{{- else if and $peer.host $peer.port }}
{{- $args = append $args "--peer" }}
{{- $args = append $args (printf "%v:%v" $peer.host $peer.port) }}
{{- end }}
{{- end }}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast on malformed node.peers entries instead of silently skipping.

At Line 153-156 only valid shapes are appended; any other entry is ignored with no signal. That can mask config mistakes and leave nodes disconnected.

💡 Suggested patch
   {{- range $peer := .Values.node.peers }}
     {{- if kindIs "string" $peer }}
       {{- $args = append $args "--peer" }}
       {{- $args = append $args $peer }}
     {{- else if and $peer.host $peer.port }}
       {{- $args = append $args "--peer" }}
       {{- $args = append $args (printf "%v:%v" $peer.host $peer.port) }}
+    {{- else }}
+      {{- fail "Each node.peers entry must be either 'host:port' string or object with host/port" }}
     {{- end }}
   {{- end }}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{{- range $peer := .Values.node.peers }}
{{- if kindIs "string" $peer }}
{{- $args = append $args "--peer" }}
{{- $args = append $args $peer }}
{{- else if and $peer.host $peer.port }}
{{- $args = append $args "--peer" }}
{{- $args = append $args (printf "%v:%v" $peer.host $peer.port) }}
{{- end }}
{{- end }}
{{- range $peer := .Values.node.peers }}
{{- if kindIs "string" $peer }}
{{- $args = append $args "--peer" }}
{{- $args = append $args $peer }}
{{- else if and $peer.host $peer.port }}
{{- $args = append $args "--peer" }}
{{- $args = append $args (printf "%v:%v" $peer.host $peer.port) }}
{{- else }}
{{- fail "Each node.peers entry must be either 'host:port' string or object with host/port" }}
{{- end }}
{{- end }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/hydra-node/templates/statefulset.yaml` around lines 149 - 157, The
template currently skips invalid entries in .Values.node.peers silently; modify
the range over $peer (range $peer := .Values.node.peers) to add a final else
branch that calls the Helm fail function with a clear message (e.g. fail (printf
"invalid .Values.node.peers entry: %v" $peer)) so rendering stops on malformed
entries instead of continuing silently; reference the existing branches that
check kindIs "string" $peer and the else if with $peer.host $peer.port and add
the failing else to surface the bad config.

Comment on lines +58 to +62
"tlsKey",
"Optional Hydra API TLS private key when TLS is enabled for the unauthenticated Hydra API endpoint.",
false,
Some("api.tls == true"),
"runtime",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix tlsKey.required_when expression: it references a schema field that does not exist.

Line 61 uses api.tls == true, but the configuration schema does not define api. This makes the condition misleading and effectively non-actionable.

💡 Suggested patch
-        ExtensionSecretDefinition::new(
+        ExtensionSecretDefinition::new(
             "tlsKey",
             "Optional Hydra API TLS private key when TLS is enabled for the unauthenticated Hydra API endpoint.",
             false,
-            Some("api.tls == true"),
+            Some("tls.enabled == true"),
             "runtime",
             "tls-private-key",
             true,
             vec!["vaultStaticSecret"],
         ),
         "properties": {
+            "tls": {
+                "type": "object",
+                "properties": {
+                    "enabled": { "type": "boolean", "default": false }
+                },
+                "additionalProperties": false
+            },
             "namespace": {

Also applies to: 94-240

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/catalog/hydra_node.rs` around lines 58 - 62, The required_when
expression for the "tlsKey" schema entry (and the other TLS-related entries in
the same hydra node section) incorrectly references "api.tls == true" which does
not exist; update the required_when expressions for tlsKey and the related
fields between lines 94-240 to reference the actual config field that controls
the unauthenticated Hydra API TLS (replace "api.tls == true" with the real
schema path used in the codebase, e.g., the correct top-level or nested field
like "tls == true" or "unauthenticated_api.tls == true"), and ensure the boolean
equality syntax matches other required_when usages in this file.

Comment on lines +158 to +169
"peers": {
"type": "array",
"description": "Static Hydra peer endpoints. All participants must agree on topology.",
"items": {
"type": "object",
"required": ["host", "port"],
"properties": {
"host": { "type": "string", "minLength": 1 },
"port": { "type": "integer", "minimum": 1, "maximum": 65535 }
},
"additionalProperties": false
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align peers schema with deployment behavior (string and object forms).

The schema only allows object peers, but the Helm template accepts both "host:port" strings and {host,port} objects. This contract mismatch can reject valid inputs upstream.

💡 Suggested patch
             "peers": {
                 "type": "array",
                 "description": "Static Hydra peer endpoints. All participants must agree on topology.",
                 "items": {
-                    "type": "object",
-                    "required": ["host", "port"],
-                    "properties": {
-                        "host": { "type": "string", "minLength": 1 },
-                        "port": { "type": "integer", "minimum": 1, "maximum": 65535 }
-                    },
-                    "additionalProperties": false
+                    "oneOf": [
+                        { "type": "string", "minLength": 1 },
+                        {
+                            "type": "object",
+                            "required": ["host", "port"],
+                            "properties": {
+                                "host": { "type": "string", "minLength": 1 },
+                                "port": { "type": "integer", "minimum": 1, "maximum": 65535 }
+                            },
+                            "additionalProperties": false
+                        }
+                    ]
                 },
                 "default": []
             },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"peers": {
"type": "array",
"description": "Static Hydra peer endpoints. All participants must agree on topology.",
"items": {
"type": "object",
"required": ["host", "port"],
"properties": {
"host": { "type": "string", "minLength": 1 },
"port": { "type": "integer", "minimum": 1, "maximum": 65535 }
},
"additionalProperties": false
},
"peers": {
"type": "array",
"description": "Static Hydra peer endpoints. All participants must agree on topology.",
"items": {
"oneOf": [
{ "type": "string", "minLength": 1 },
{
"type": "object",
"required": ["host", "port"],
"properties": {
"host": { "type": "string", "minLength": 1 },
"port": { "type": "integer", "minimum": 1, "maximum": 65535 }
},
"additionalProperties": false
}
]
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/catalog/hydra_node.rs` around lines 158 - 169, The "peers"
schema currently only accepts object entries, but deployment templates accept
both "host:port" strings and {host, port} objects; update the "peers" items
definition to allow either form by replacing the single object type with a oneOf
that permits a string (matching "host:port" via a regex/pattern) or the existing
object with required "host" and "port" properties (keeping additionalProperties:
false), and update the "peers" description accordingly so both string and object
representations are validated; refer to the "peers" items, "host", and "port"
symbols when making this change.

Comment thread mcp-server/src/helm.rs
Comment on lines +113 to +203
pub async fn install(plan: &HelmInstallPlan) -> Result<HelmInstallResult, HelmInstallError> {
let values_path = write_values_file(&plan.values).map_err(HelmInstallError::from)?;
let helm_bin = std::env::var("MCP_HELM_BIN").unwrap_or_else(|_| DEFAULT_HELM_BIN.to_string());
let args = install_args(plan, &values_path);

let output = Command::new(&helm_bin).args(&args).output().await;
let _ = std::fs::remove_file(&values_path);
let output = output.map_err(HelmInstallError::Execute)?;

let stdout = String::from_utf8_lossy(&output.stdout).to_string();
let stderr = String::from_utf8_lossy(&output.stderr).to_string();

if !output.status.success() {
return Err(HelmInstallError::Failed {
status: output.status.code().unwrap_or(-1),
stdout,
stderr,
});
}

let mut command = vec![helm_bin];
command.extend(args);

Ok(HelmInstallResult {
command,
stdout,
stderr,
})
}

pub async fn upgrade(plan: &HelmUpgradePlan) -> Result<HelmUpgradeResult, HelmUpgradeError> {
let values_path = write_values_file(&plan.values).map_err(HelmUpgradeError::from)?;
let helm_bin = std::env::var("MCP_HELM_BIN").unwrap_or_else(|_| DEFAULT_HELM_BIN.to_string());
let args = upgrade_args(plan, &values_path);

let output = Command::new(&helm_bin).args(&args).output().await;
let _ = std::fs::remove_file(&values_path);
let output = output.map_err(HelmUpgradeError::Execute)?;

let stdout = String::from_utf8_lossy(&output.stdout).to_string();
let stderr = String::from_utf8_lossy(&output.stderr).to_string();

if !output.status.success() {
return Err(HelmUpgradeError::Failed {
status: output.status.code().unwrap_or(-1),
stdout,
stderr,
});
}

let mut command = vec![helm_bin];
command.extend(args);

Ok(HelmUpgradeResult {
command,
stdout,
stderr,
})
}

pub async fn uninstall(
plan: &HelmUninstallPlan,
) -> Result<HelmUninstallResult, HelmUninstallError> {
let helm_bin = std::env::var("MCP_HELM_BIN").unwrap_or_else(|_| DEFAULT_HELM_BIN.to_string());
let args = uninstall_args(plan);
let output = Command::new(&helm_bin)
.args(&args)
.output()
.await
.map_err(HelmUninstallError::Execute)?;

let stdout = String::from_utf8_lossy(&output.stdout).to_string();
let stderr = String::from_utf8_lossy(&output.stderr).to_string();

if !output.status.success() {
return Err(HelmUninstallError::Failed {
status: output.status.code().unwrap_or(-1),
stdout,
stderr,
});
}

let mut command = vec![helm_bin];
command.extend(args);

Ok(HelmUninstallResult {
command,
stdout,
stderr,
})
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit timeouts to Helm command execution.

Line 118, Line 148, and Line 178 execute Helm without a timeout. A stuck Helm process can block the tool indefinitely and degrade service reliability.

Proposed fix
+const DEFAULT_HELM_TIMEOUT_SECS: u64 = 300;
+
+fn helm_timeout_secs() -> u64 {
+    std::env::var("MCP_HELM_TIMEOUT_SECS")
+        .ok()
+        .and_then(|v| v.parse::<u64>().ok())
+        .filter(|v| *v > 0)
+        .unwrap_or(DEFAULT_HELM_TIMEOUT_SECS)
+}
+
-    let output = Command::new(&helm_bin).args(&args).output().await;
+    let output = tokio::time::timeout(
+        std::time::Duration::from_secs(helm_timeout_secs()),
+        Command::new(&helm_bin).args(&args).output(),
+    )
+    .await
+    .map_err(|_| HelmInstallError::Execute(std::io::Error::new(std::io::ErrorKind::TimedOut, "helm command timed out")))?;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/helm.rs` around lines 113 - 203, The Helm invocations in
install, upgrade, and uninstall run Command::output().await with no timeout;
wrap each call in tokio::time::timeout(...) using a configurable Duration
constant (e.g. HELM_CMD_TIMEOUT) so a hung Helm process returns quickly; handle
timeout by mapping the timeout error into the function's error type (add a
Timeout variant to HelmInstallError/HelmUpgradeError/HelmUninstallError or
return an appropriate Execute/Timeout mapping), keep the existing mapping for
Command::output() errors (currently mapped via Helm*Error::Execute), and ensure
the temporary values file removal (values_path) still happens on
timeout/failure; update the calls that assign to output (the variables named
output after Command::new(&helm_bin).args(&args).output().await) to handle the
timeout result before converting stdout/stderr and building the command vector.

Comment thread mcp-server/src/helm.rs
Comment on lines +211 to +218
let unique = SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap_or_default()
.as_nanos();
let path = dir.join(format!("values-{unique}.json"));
let payload =
serde_json::to_vec_pretty(values).map_err(HelmValuesFileError::SerializeValues)?;
std::fs::write(&path, payload).map_err(HelmValuesFileError::WriteValues)?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Values file uniqueness is racy under concurrent requests.

Line 211–Line 216 rely on timestamp nanos for unique file names. Concurrent calls can collide and lead to file overwrite or deleting another request’s values file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/helm.rs` around lines 211 - 218, The code uses a
timestamp-based 'unique' to build 'path' and write the values file, which races
under concurrent requests; replace the timestamp filename scheme with an atomic
temp-file creation (e.g. tempfile::Builder::tempfile_in(dir) or use a UUID like
Uuid::new_v4() for the filename) and write into that temp file instead of
constructing 'path' manually so creation is atomic and cannot collide; update
the logic that currently calls std::fs::write(&path, payload) and the error
mapping (HelmValuesFileError::WriteValues) to operate on the tempfile handle or
the chosen unique UUID filename to avoid overwrites and races.

Comment on lines +58 to +79
if !overwrite {
match client.runtime_metadata(&path).await {
Ok(metadata) => {
let conflicting_keys = metadata
.key_names
.iter()
.filter(|key| *key == &signing_key_name || *key == &verification_key_name)
.cloned()
.collect::<Vec<_>>();
if !conflicting_keys.is_empty() {
return tool_error(
"vault_secret_key_exists",
"target Vault path already contains one or more Hydra key fields; set overwrite=true to replace them",
json!({
"path": path.as_str(),
"conflictingKeys": conflicting_keys,
}),
);
}
}
Err(error) => return vault_error(TOOL_NAME, error),
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

overwrite=false has a check-then-write race.

Line 58–Line 79 checks existing keys, but Line 106–Line 108 writes later with patch semantics. A concurrent writer can still create/replace the same fields between those steps.

Also applies to: 106-108

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/hydra.rs` around lines 58 - 79, The current flow does a
check-then-write (client.runtime_metadata(...) then later patch) which has a
race; replace it with an atomic conditional write instead: when
overwrite==false, fetch the current metadata once (runtime_metadata) to obtain
the metadata version or existence info and then call the client's write/patch
method with an expectation (CAS/if-match/if-none-match) so the operation will
fail if the keys were created concurrently; reference the existing symbols
runtime_metadata, signing_key_name, verification_key_name and the later
patch/write call (the code path around the current write at lines ~106–108) and
use the client's conditional/compare-and-set option (or pass the retrieved
version) to perform a single atomic update that errors if conflicting keys
appeared between the read and write.

Comment on lines +129 to +137
fn generate_hydra_key_pair() -> Result<HydraKeyPair, getrandom::Error> {
let mut entropy = [0u8; 16];
getrandom::getrandom(&mut entropy)?;
Ok(hydra_key_pair_from_entropy(entropy))
}

fn hydra_key_pair_from_entropy(entropy: [u8; 16]) -> HydraKeyPair {
let signing_seed = Sha256::digest(entropy);
let signing_key = SigningKey::from_bytes(&signing_seed.into());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For ed25519-dalek 2.x, is generating a SigningKey from only 16 bytes of entropy (then SHA-256) recommended, or should 32 bytes of CSPRNG output be used directly?

💡 Result:

For ed25519-dalek 2.x, it is strongly recommended to use 32 bytes of high-quality CSPRNG output directly to generate a SigningKey [1]. The standard and secure way to generate a SigningKey in ed25519-dalek is to use the provided SigningKey::generate method, which accepts a cryptographically secure pseudorandom number generator (CSPRNG) [2][1]. This method ensures that the resulting key is generated correctly according to the Ed25519 specification [3][1]. Using only 16 bytes of entropy is insufficient for a 128-bit security level, as it provides only 128 bits of entropy, which is the absolute minimum and leaves no margin for error [4]. Furthermore, simply hashing 16 bytes of entropy with SHA-256 to produce a 32-byte key does not increase the entropy of the input; it merely stretches the 128 bits of entropy into a 256-bit string [5]. This approach is not a standard practice for Ed25519 key generation and introduces unnecessary complexity and potential security risks compared to using 32 bytes of direct CSPRNG output [1][5]. If you have a source of 32 bytes of high-quality entropy, you should use those bytes directly to construct the SigningKey using SigningKey::from_bytes [1]. If your entropy source is smaller than 32 bytes, you should use a proper Key Derivation Function (KDF), such as HKDF, to expand that entropy into a 32-byte seed before using it to generate the key, rather than relying on a simple hash function [6][5].

Citations:


🏁 Script executed:

# Verify the file exists and check the relevant lines
if [ -f "mcp-server/src/tools/hydra.rs" ]; then
  echo "=== File exists, checking lines 129-137 ==="
  sed -n '125,140p' mcp-server/src/tools/hydra.rs | cat -n
else
  echo "File not found, searching for hydra.rs"
  find . -name "hydra.rs" -type f
fi

Repository: txpipe/metis

Length of output: 722


Use 32 bytes of CSPRNG output directly for Ed25519 signing keys instead of hashing 16 bytes of entropy.

Lines 130–131 generate only 16 random bytes and stretch them with SHA-256 to create a 32-byte seed. This approach is insufficient: hashing 16 bytes of entropy does not increase the underlying entropy (it remains 128 bits); it merely expands the output size. Ed25519 keys should be derived from 32 bytes of direct CSPRNG output, or if entropy is limited, use a proper Key Derivation Function like HKDF rather than a simple hash function.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/hydra.rs` around lines 129 - 137, The code currently
generates 16 bytes and hashes them in
generate_hydra_key_pair/hydra_key_pair_from_entropy which reduces effective
entropy; change generate_hydra_key_pair to fill a [u8; 32] with CSPRNG
(getrandom) and pass that 32-byte array directly into
hydra_key_pair_from_entropy (or rename accordingly) so the SigningKey is
constructed from the raw 32-byte seed instead of Sha256(entropy); alternatively,
if you must expand limited entropy, use a KDF (e.g., HKDF) explicitly rather
than a plain hash—update function signatures (entropy types) and remove the
Sha256-based stretching in hydra_key_pair_from_entropy and call
SigningKey::from_bytes with the 32-byte seed.

Comment on lines +250 to +258
if let Some(expected_type) = schema.get("type").and_then(Value::as_str) {
let matches = match expected_type {
"boolean" => value.is_boolean(),
"integer" => value.as_i64().is_some(),
"number" => value.as_f64().is_some(),
"object" => value.is_object(),
"string" => value.is_string(),
_ => true,
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Schema type validation currently skips array.

Line 250–Line 258 do not validate "array" types, so invalid values can pass schema validation and fail later in planning/execution.

Proposed fix
         let matches = match expected_type {
             "boolean" => value.is_boolean(),
             "integer" => value.as_i64().is_some(),
             "number" => value.as_f64().is_some(),
             "object" => value.is_object(),
             "string" => value.is_string(),
+            "array" => value.is_array(),
+            "null" => value.is_null(),
             _ => true,
         };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcp-server/src/tools/workloads/upgrade.rs` around lines 250 - 258, The schema
type check block that matches on expected_type (using variables schema,
expected_type, value and the local matches binding) currently omits the "array"
branch, allowing non-array values to pass; add an "array" arm to the match that
returns value.is_array() so arrays are validated correctly (i.e., update the
match in upgrade.rs where expected_type is matched to include "array" =>
value.is_array()).

@gonzalezzfelipe gonzalezzfelipe merged commit e825e4e into main May 18, 2026
16 checks passed
@gonzalezzfelipe gonzalezzfelipe deleted the feat/mcp branch May 18, 2026 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant