fix: harden infrastructure and CLI for production readiness #34

Merged
stepandel merged 2 commits into main from fix/production-hardening
Feb 10, 2026

@stepandel stepandel commented Feb 10, 2026

Summary

Production hardening fixes identified during a full audit of the repo. Addresses 7 issues across security, resilience, and UX:

  • Path traversal prevention — Workspace file injection now validates paths, rejecting .. and absolute paths
  • SSH locked down by default — Security groups no longer expose port 22 to 0.0.0.0/0; SSH requires explicit allowedSshCidrs (matching the Hetzner component's existing approach)
  • IMDSv2 enforced — EC2 instances require token-based metadata access, preventing secrets from being read via curl to the metadata endpoint
  • Silent errors eliminated — 6 catch blocks across 5 files now log warnings instead of silently swallowing failures
  • Graceful shutdown — New process.ts module tracks child processes; SIGINT/SIGTERM are forwarded before exit so pulumi up/destroy aren't orphaned
  • Retry with backoff — Tailscale API calls retry 3 times with exponential backoff (1s → 2s → 4s)
  • Hetzner hidden — Unimplemented Hetzner provider is no longer selectable in agent-army init
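
The SSH and IMDSv2 defaults described above can be sketched as plain data builders. This is an illustration only, not the code in this PR: `buildSshIngress` and `IMDSV2_METADATA_OPTIONS` are hypothetical names, and the field names are assumed to mirror the Pulumi AWS provider's security group ingress and instance `metadataOptions` arguments.

```typescript
// Hypothetical shape for a security group ingress rule (assumed to match
// the Pulumi AWS provider's ingress argument names).
interface IngressRule {
  protocol: string;
  fromPort: number;
  toPort: number;
  cidrBlocks: string[];
  description?: string;
}

// Returns no SSH rule at all unless the caller lists CIDRs explicitly,
// so port 22 is closed by default instead of open to 0.0.0.0/0.
function buildSshIngress(allowedSshCidrs: string[] = []): IngressRule[] {
  if (allowedSshCidrs.length === 0) {
    return [];
  }
  return [
    {
      protocol: "tcp",
      fromPort: 22,
      toPort: 22,
      cidrBlocks: [...allowedSshCidrs],
      description: "SSH from explicitly allowed CIDRs",
    },
  ];
}

// Requiring a session token for metadata requests is what "IMDSv2 enforced" means.
const IMDSV2_METADATA_OPTIONS = {
  httpEndpoint: "enabled",
  httpTokens: "required",
} as const;
```

With this shape, `buildSshIngress()` yields an empty ingress list, which is what the `pulumi preview` check in the test plan would be expected to show when `allowedSshCidrs` is not set.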

Test plan

  • pnpm run build passes for both root and CLI packages (verified locally)
  • agent-army init skips Hetzner provider selection, shows "AWS (Hetzner coming soon)"
  • pulumi preview shows metadataOptions on EC2 instances and empty SSH ingress on security groups
  • Interrupt agent-army deploy with Ctrl+C — child pulumi process should be killed cleanly
  • agent-army destroy logs warnings on Tailscale API failures instead of silently continuing

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Graceful shutdown handling for child processes.
    • Cloud provider auto-selection when AWS is the only available option; Hetzner marked "coming soon".
    • Retry mechanism with exponential backoff for Tailscale API calls.
    • SSH access now restricted to specified CIDR blocks via new allowedSshCidrs parameter.
    • IMDSv2 enforcement for improved instance metadata security.
  • Improvements

    • Enhanced error logging and validation across modules.

- Prevent directory traversal in workspace file injection (cloud-init.ts)
- Restrict SSH to explicit CIDRs, default to no SSH ingress (shared-vpc.ts, openclaw-agent.ts)
- Enforce IMDSv2 on EC2 instances to block unauthenticated metadata access
- Replace silent catch blocks with warning logs across 6 files
- Add graceful shutdown: track child processes, forward SIGINT/SIGTERM on exit
- Add exponential backoff retry (3 attempts) to Tailscale API calls
- Hide unimplemented Hetzner provider from CLI init wizard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai bot commented Feb 10, 2026

Walkthrough

The pull request introduces graceful shutdown for child processes, adds exponential backoff retry logic for Tailscale API calls, enhances error logging across modules, implements runtime validation for workspace file paths, restricts SSH access via configurable CIDR blocks, and enforces IMDSv2 for instance metadata security.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Child Process Management: cli/bin.ts, cli/adapters/cli-adapter.ts, cli/lib/exec.ts, cli/lib/process.ts | New module process.ts introduces trackChild() and setupGracefulShutdown() to centralize child process cleanup. Both CLI and exec modules now register spawned children for tracking; graceful shutdown forwards SIGINT/SIGTERM to all tracked children before exit. |
| Error Handling & Logging Enhancements: cli/lib/config.ts, cli/tools/deploy.ts, cli/tools/destroy.ts | Enhanced catch blocks to capture and log error messages instead of silent failures, preserving control flow while improving observability. |
| Provider Selection Logic: cli/commands/init.ts, cli/lib/constants.ts | Cloud provider selection now auto-selects AWS when only one provider is implemented; Hetzner label updated to "(coming soon)" with disabled hint to reflect incomplete implementation. |
| Tailscale API Resilience: cli/lib/tailscale.ts | Added exponential backoff retry logic (max 3 attempts, 1s initial delay) to listTailscaleDevices() and deleteTailscaleDevice(); logs warnings with error details on exhaustion. |
| SSH Access Control: shared-vpc.ts, src/components/openclaw-agent.ts | New allowedSshCidrs property enables restricted SSH ingress rules; replaces static 0.0.0.0/0 access. OpenClawAgent also enforces IMDSv2 via metadataOptions. |
| Workspace Path Validation: src/components/cloud-init.ts | Runtime validation in generateWorkspaceFilesScript() rejects paths containing .., leading slashes, or null characters to prevent injection. |

Sequence Diagram(s)

sequenceDiagram
    participant OS as OS/Shell
    participant CLI as CLI Process
    participant SigHandler as Signal Handler
    participant Children as Child Processes
    participant Cleanup as Process Cleanup
    
    OS->>CLI: User sends SIGINT/SIGTERM
    CLI->>SigHandler: Receives signal
    SigHandler->>SigHandler: Get tracked children from Set
    
    loop For each tracked child
        SigHandler->>Children: Forward SIGINT/SIGTERM
        Children->>Children: Graceful shutdown
        Children->>Cleanup: Emit close/error event
        Cleanup->>SigHandler: Auto-unregister from tracked Set
    end
    
    SigHandler->>CLI: Exit with code (130 for SIGINT, 143 for SIGTERM)
    CLI->>OS: Process terminates
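
The flow in the diagram can be sketched roughly as follows. The names `trackChild` and `setupGracefulShutdown` come from the walkthrough; their bodies here, the `trackedChildren` set, and the `runTracked` helper are assumptions for illustration, and the actual `process.ts` is not shown in this conversation. This sketch exits right after forwarding the signal; whether the real module waits for children to close first is not stated here.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Children that are still running and should receive forwarded signals.
const trackedChildren = new Set<ChildProcess>();

// Register a child and auto-unregister it once it closes or fails to start.
function trackChild(child: ChildProcess): ChildProcess {
  trackedChildren.add(child);
  const untrack = (): void => {
    trackedChildren.delete(child);
  };
  child.once("close", untrack);
  child.once("error", untrack);
  return child;
}

// Hypothetical convenience wrapper: spawn a command and track it in one step.
function runTracked(command: string, args: string[]): ChildProcess {
  return trackChild(spawn(command, args, { stdio: "inherit" }));
}

// Forward SIGINT/SIGTERM to every tracked child, then exit with the
// conventional 128 + signal number code (130 for SIGINT, 143 for SIGTERM).
function setupGracefulShutdown(): void {
  const exitCodes = { SIGINT: 130, SIGTERM: 143 } as const;
  for (const signal of ["SIGINT", "SIGTERM"] as const) {
    process.once(signal, () => {
      for (const child of trackedChildren) {
        child.kill(signal);
      }
      process.exit(exitCodes[signal]);
    });
  }
}
```

A CLI entry point would call `setupGracefulShutdown()` once at startup and route every spawn through `trackChild`, for example `runTracked("pulumi", ["up"])`.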

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A child runs free, but now must heed the call—
When signals ring, we gather them all.
Retry, retry, with patience we wait,
SSH gates close, no wider than fate.
Validation stands guard, paths pure and bright,
Infrastructure strong, security tight!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: harden infrastructure and CLI for production readiness' accurately summarizes the main objectives of the changeset, which focus on security hardening, resilience improvements, and production-ready changes across infrastructure and CLI components. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.91% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@cli/lib/tailscale.ts`:
- Around line 37-56: The code interpolates apiKey directly into a shell command
passed to execSync (e.g., in the device-list loop) which risks command
injection; change the calls that build the curl command (the execSync usage that
includes the Authorization header) to avoid shell interpolation by either using
child_process.execFileSync/spawnSync with the curl binary and an args array or
by passing the key via an environment variable (e.g., set options.env = {
...process.env, TS_API_KEY: apiKey } and use the header "Authorization: Bearer
$TS_API_KEY"), and apply the same change to the deleteTailscaleDevice call so
neither function ever injects apiKey directly into a shell string.

In `@src/components/cloud-init.ts`:
- Around line 328-339: The validation uses the normalized variable but the code
later writes using the original filePath, which leaves Windows backslashes in
filenames; update the write logic in the workspaceFiles loop to use the
normalized path (the normalized variable) as the target for directory creation
and file writes—convert back to platform-specific paths as needed (e.g., using
path.join or path.resolve with workspace root) and ensure all checks
(startsWith, includes(".."), null byte) are performed on normalized before
creating directories or calling the file write functions so files are created in
the proper nested directories rather than as backslash-containing filenames.
🧹 Nitpick comments (2)
cli/lib/tailscale.ts (1)

16-17: Minor: Naming could be clearer for retry semantics.

MAX_RETRIES = 3 with loop attempt <= MAX_RETRIES yields 4 total attempts. The warning message correctly says "4 attempts", but the constant name suggests 3 retries (implying 1 initial + 3 retries = 4 total). Consider renaming to MAX_ATTEMPTS = 4 or RETRY_COUNT = 3 with loop attempt < RETRY_COUNT + 1 for clarity.

This is a minor readability nit — the behavior is correct.
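
To make the attempt counting concrete, here is a minimal sketch of a synchronous retry helper using the `MAX_ATTEMPTS` naming suggested above. `retrySync`, `backoffDelays`, and this `sleepSync` implementation are hypothetical; the PR's own helper bodies are not shown in this thread.

```typescript
const MAX_ATTEMPTS = 4; // 1 initial try + 3 retries
const BASE_DELAY_MS = 1000;

// Delays slept between attempts: one fewer than the number of attempts.
function backoffDelays(maxAttempts: number, baseDelayMs: number): number[] {
  return Array.from({ length: maxAttempts - 1 }, (_, i) => baseDelayMs * Math.pow(2, i));
}

// Assumed blocking sleep; Atomics.wait blocks the calling thread in Node.js.
function sleepSync(ms: number): void {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

// Calls fn until it succeeds or maxAttempts is reached, doubling the delay
// after each failure. Rethrows the last error once attempts are exhausted.
function retrySync<T>(
  fn: () => T,
  maxAttempts: number = MAX_ATTEMPTS,
  baseDelayMs: number = BASE_DELAY_MS,
  sleep: (ms: number) => void = sleepSync
): T {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        sleep(baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  throw lastError;
}
```

With the defaults, `backoffDelays(MAX_ATTEMPTS, BASE_DELAY_MS)` gives 1000, 2000, and 4000 ms, matching the 1s → 2s → 4s schedule in the PR description. The `sleep` parameter is injectable only so the helper can be exercised without real delays.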

Also applies to: 37-37

cli/tools/destroy.ts (1)

50-73: Consider consolidating with cli/lib/tailscale.ts.

This file has its own listTailscaleDevices and deleteTailscaleDevice implementations that don't have the retry logic added in cli/lib/tailscale.ts. The destroy flow would benefit from the same exponential backoff resilience.

This can be deferred, but worth tracking to avoid divergent implementations.

Also applies to: 78-90

Comment on lines +37 to 56
for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
  try {
    const result = execSync(
      `curl -sf -H "Authorization: Bearer ${apiKey}" "https://api.tailscale.com/api/v2/tailnet/${tailnet}/devices?fields=default"`,
      { encoding: "utf-8", stdio: ["pipe", "pipe", "pipe"], timeout: 15000 }
    );
    const data = JSON.parse(result);
    if (!data.devices || !Array.isArray(data.devices)) return null;
    return data.devices.map((d: Record<string, unknown>) => ({
      id: d.id as string,
      name: (d.name as string) ?? "",
      hostname: (d.hostname as string) ?? "",
    }));
  } catch (err) {
    lastError = err;
    if (attempt < MAX_RETRIES) {
      sleepSync(BASE_DELAY_MS * Math.pow(2, attempt));
    }
  }
}

⚠️ Potential issue | 🟠 Major

Potential command injection via API key.

The apiKey is interpolated directly into a shell command string passed to execSync. If the API key contains shell metacharacters (unlikely for legitimate Tailscale keys, but possible with malicious input), this could lead to command injection.

Consider using an environment variable or writing to a temp file to pass the authorization header safely.

🔒 Proposed fix using environment variable
 export function listTailscaleDevices(
   apiKey: string,
   tailnet: string
 ): TailscaleDevice[] | null {
   let lastError: unknown;

   for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
     try {
       const result = execSync(
-        `curl -sf -H "Authorization: Bearer ${apiKey}" "https://api.tailscale.com/api/v2/tailnet/${tailnet}/devices?fields=default"`,
-        { encoding: "utf-8", stdio: ["pipe", "pipe", "pipe"], timeout: 15000 }
+        `curl -sf -H "Authorization: Bearer $TAILSCALE_API_KEY" "https://api.tailscale.com/api/v2/tailnet/${encodeURIComponent(tailnet)}/devices?fields=default"`,
+        { encoding: "utf-8", stdio: ["pipe", "pipe", "pipe"], timeout: 15000, env: { ...process.env, TAILSCALE_API_KEY: apiKey } }
       );

Apply similar fix to deleteTailscaleDevice.

Also applies to: 77-90
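
As an alternative to the environment-variable approach, the args-array option can be sketched with `execFileSync`, which runs `curl` without a shell so no argument is ever parsed as shell syntax. `buildDevicesCurlArgs` and `fetchDevicesJson` are hypothetical helper names, not functions from this PR.

```typescript
import { execFileSync } from "node:child_process";

// Builds curl arguments as discrete argv entries; the API key stays inside
// a single argument no matter which characters it contains.
function buildDevicesCurlArgs(apiKey: string, tailnet: string): string[] {
  return [
    "-sf",
    "-H",
    `Authorization: Bearer ${apiKey}`,
    `https://api.tailscale.com/api/v2/tailnet/${encodeURIComponent(tailnet)}/devices?fields=default`,
  ];
}

// Runs curl directly (no shell) and returns the raw JSON response body.
function fetchDevicesJson(apiKey: string, tailnet: string): string {
  return execFileSync("curl", buildDevicesCurlArgs(apiKey, tailnet), {
    encoding: "utf-8",
    stdio: ["pipe", "pipe", "pipe"],
    timeout: 15000,
  });
}
```

Note that in both variants the key still ends up in curl's argument list, so this addresses shell injection rather than visibility of the key in the process table.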

Comment on lines 328 to +339
for (const [filePath, content] of Object.entries(workspaceFiles)) {
  // Validate path to prevent directory traversal
  const normalized = filePath.replace(/\\/g, "/");
  if (
    normalized.includes("..") ||
    normalized.startsWith("/") ||
    normalized.includes("\0")
  ) {
    throw new Error(
      `Invalid workspace file path: "${filePath}". Paths must be relative and cannot contain "..".`
    );
  }

⚠️ Potential issue | 🟡 Minor

Use the normalized path for the actual write target.
Right now validation uses normalized, but the write path uses the original filePath, so Windows-style paths become filenames with backslashes instead of directories.

🛠️ Proposed fix
-    const fullPath = `/home/ubuntu/.openclaw/workspace/${filePath}`;
+    const safePath = normalized;
+    const fullPath = `/home/ubuntu/.openclaw/workspace/${safePath}`;
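
Put together, validation and path construction could live in one helper so the checked value and the written value can never diverge. This is a sketch under that assumption; `resolveWorkspacePath` and `WORKSPACE_ROOT` are hypothetical names, and the checks are copied from the snippet under review.

```typescript
const WORKSPACE_ROOT = "/home/ubuntu/.openclaw/workspace";

// Normalizes backslashes, rejects traversal, absolute paths, and null bytes,
// and returns the target path built from the normalized value.
function resolveWorkspacePath(filePath: string): string {
  const normalized = filePath.replace(/\\/g, "/");
  if (
    normalized.includes("..") ||
    normalized.startsWith("/") ||
    normalized.includes("\0")
  ) {
    throw new Error(
      `Invalid workspace file path: "${filePath}". Paths must be relative and cannot contain "..".`
    );
  }
  return `${WORKSPACE_ROOT}/${normalized}`;
}
```

For example, `resolveWorkspacePath("docs\\notes.md")` targets `docs/notes.md` under the workspace root rather than a file whose name contains a backslash. The `includes("..")` check is deliberately conservative and would also reject a name such as `notes..md`.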
The main stack now fully supports HetznerOpenClawAgent, so the provider
selection and label should remain active instead of filtering to AWS-only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@stepandel stepandel merged commit bbe955c into main Feb 10, 2026
1 check passed
@stepandel stepandel deleted the fix/production-hardening branch February 10, 2026 04:05