Pipeline Design 22

Seth Ford edited this page Feb 13, 2026 · 2 revisions

Design: Add multi-repo fleet auto-discovery from GitHub org

Context

Shipwright's fleet orchestrator (scripts/sw-fleet.sh) currently requires manual configuration of repos in .claude/fleet-config.json. Users managing GitHub organizations with many repos must hand-edit this config for each repo, which is error-prone and doesn't adapt as repos are created, archived, or change activity levels.

Constraints from the codebase:

  • All scripts must be Bash 3.2 compatible (no associative arrays, no readarray, no ${var,,})
  • Scripts use set -euo pipefail, atomic file writes (tmp + mv), and jq --arg for JSON
  • GitHub API calls must respect $NO_GITHUB env var (existing pattern across all modules)
  • The fleet already has a background loop pattern: fleet_rebalance() runs on an interval via sleep in a backgrounded subshell — the rediscovery loop should follow the same pattern
  • gh api is the standard GitHub client (not raw curl), used throughout sw-github-graphql.sh and sw-fleet.sh
  • Fleet config lives at .claude/fleet-config.json with a known schema (repos[], worker_pool, etc.)
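
As an illustration of these constraints, here is a minimal sketch of a config update written for Bash 3.2. The helper names (config_add_repo, to_lower) and the path key inside each repos[] entry are assumptions made for this example, not existing functions or a confirmed schema in sw-fleet.sh.

```bash
# Hypothetical helper: append a repo entry to repos[] and replace the config atomically.
# $1 = config file, $2 = repo path (the "path" key is an assumed field name)
config_add_repo() {
    local config="$1" repo_path="$2"
    local tmp="${config}.tmp"
    # jq --arg passes the value as a JSON string, so shell quoting never leaks into the JSON
    if jq --arg path "$repo_path" '.repos += [{path: $path}]' "$config" > "$tmp"; then
        mv "$tmp" "$config"      # rename is atomic on the same filesystem
    else
        rm -f "$tmp"             # leave the original config untouched on failure
        return 1
    fi
}

# Bash 3.2 has no ${var,,}; tr is a portable way to lowercase a value
to_lower() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}
```

Readers of the config only ever see the old file or the complete new one, because the partially written data lives in the .tmp file until mv runs.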

Decision

Extend sw-fleet.sh with a discover subcommand and fleet_rediscover_loop() background process.

Data Flow

gh api /orgs/{org}/repos --paginate
  → JSON array of repos
  → Filter: language, pushed_at within the last activity_days days, topics, open issues (open_issues_count > 0), !archived, !disabled, !fork (unless --include-forks)
  → Opt-out check: skip repos with "shipwright-ignore" topic
  → Opt-out check: skip repos where gh api /repos/{owner}/{repo}/contents/.shipwright-ignore returns 200
  → Generate fleet-config.json entries (or merge with existing)
  → Output summary / dry-run report
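
The filter stage above can be sketched as a single jq program over the repo listing. This is a sketch under assumptions: filter_repos is a hypothetical name, the field names (language, pushed_at, topics, archived, disabled, fork, full_name) are those of the GitHub REST repository object, and the --topic and --has-issues filters are left out for brevity.

```bash
# Reads the repo listing (JSON array) on stdin and prints the full_name of each match.
# $1 = language ("" means any), $2 = activity days, $3 = include forks ("true" or "false")
filter_repos() {
    jq -r --arg lang "$1" --arg days "$2" --arg forks "$3" '
        (now - ($days | tonumber) * 86400) as $cutoff
        | .[]
        | select(.archived | not)
        | select(.disabled | not)
        | select($forks == "true" or (.fork | not))
        | select($lang == "" or ((.language // "") | ascii_downcase) == ($lang | ascii_downcase))
        # pushed_at can be null for empty repos; treat those as very old
        | select(((.pushed_at // "1970-01-01T00:00:00Z") | fromdateiso8601) >= $cutoff)
        # topic-based opt-out costs nothing extra: topics are already in the listing
        | select(((.topics // []) | index("shipwright-ignore")) | not)
        | .full_name
    '
}
```

A call such as `gh api --paginate "orgs/$org/repos" | filter_repos go 90 false` would then yield the candidates that still need the per-repo .shipwright-ignore file check.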

CLI Interface

shipwright fleet discover --org <org> [flags]

Flags:

  • --org <name> — GitHub org (required)
  • --language <lang> — filter by primary language
  • --activity-days <N> — only repos pushed within N days (default: 90)
  • --topic <topic> — require this topic (pass several topics as a comma-separated list)
  • --has-issues — only repos with open issues
  • --include-forks — include forked repos (excluded by default)
  • --merge — merge discovered repos into existing config rather than overwriting
  • --dry-run — print what would be added, don't write config
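
The flag set above could be parsed with a plain while/case loop, which works in Bash 3.2. This is only a sketch: parse_discover_args and the DISC_* variable names are hypothetical and chosen for this example.

```bash
# Hypothetical parser for "shipwright fleet discover"; results land in DISC_* globals.
parse_discover_args() {
    DISC_ORG=""; DISC_LANGUAGE=""; DISC_ACTIVITY_DAYS=90; DISC_TOPICS=""
    DISC_HAS_ISSUES=false; DISC_INCLUDE_FORKS=false; DISC_MERGE=false; DISC_DRY_RUN=false
    while [ $# -gt 0 ]; do
        case "$1" in
            --org|--language|--activity-days|--topic)
                # Flags that take a value must be followed by one
                if [ $# -lt 2 ]; then
                    echo "ERROR: $1 requires a value" >&2
                    return 1
                fi
                case "$1" in
                    --org)           DISC_ORG="$2" ;;
                    --language)      DISC_LANGUAGE="$2" ;;
                    --activity-days) DISC_ACTIVITY_DAYS="$2" ;;
                    --topic)         DISC_TOPICS="$2" ;;   # comma-separated list
                esac
                shift 2
                ;;
            --has-issues)    DISC_HAS_ISSUES=true; shift ;;
            --include-forks) DISC_INCLUDE_FORKS=true; shift ;;
            --merge)         DISC_MERGE=true; shift ;;
            --dry-run)       DISC_DRY_RUN=true; shift ;;
            *)
                echo "ERROR: unknown flag: $1" >&2
                return 1
                ;;
        esac
    done
    if [ -z "$DISC_ORG" ]; then
        echo "ERROR: --org is required" >&2
        return 1
    fi
}
```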

Config Schema Addition

{
  "repos": [...],
  "worker_pool": {...},
  "auto_discover": {
    "enabled": false,
    "org": "my-org",
    "interval_seconds": 3600,
    "filters": {
      "language": null,
      "activity_days": 90,
      "has_issues": false,
      "topics": [],
      "include_forks": false
    }
  }
}
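
One way load_fleet_config() might read the new block is a small jq lookup with a fallback, so configs written before this change keep working. The function name auto_discover_get and its dotted-key argument are assumptions for this sketch.

```bash
# Print one auto_discover setting, falling back to a default when the key is absent.
# $1 = config file, $2 = dotted key under .auto_discover, $3 = default as raw JSON
# Note: jq's // operator also replaces an explicit false, which is harmless here
# because every boolean in the schema defaults to false.
auto_discover_get() {
    jq -r --arg key "$2" --argjson fallback "$3" \
        '(.auto_discover // {}) | getpath($key | split(".")) // $fallback' "$1"
}
```

For example, `auto_discover_get .claude/fleet-config.json filters.activity_days 90` prints the configured value, or 90 when the config has no auto_discover block.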

Background Re-discovery

fleet_rediscover_loop() follows the same pattern as fleet_rebalance():

  1. Spawned as a backgrounded subshell from fleet_start()
  2. Sleeps for interval_seconds, then calls fleet_discover --org "$org" --merge
  3. On new repos found, writes a fleet-rediscover.flag file
  4. Main fleet loop checks for this flag file and calls fleet_add_repo() to hot-add repos to running daemons
  5. Flag file removed after processing
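
Steps 1 to 3 can be sketched as below. The sketch assumes fleet_discover accepts the flags shown in this design, and it detects "new repos found" by comparing a checksum of the config before and after the merge, which is a simplification chosen for this example. The split into fleet_rediscover_once and the positional parameters are also hypothetical.

```bash
# One rediscovery pass: merge newly discovered repos, raise the flag if the config changed.
# $1 = org, $2 = config file, $3 = flag file
fleet_rediscover_once() {
    local org="$1" config="$2" flag="$3" before after
    before=$(cksum < "$config")
    if ! fleet_discover --org "$org" --merge; then
        # Background mode: warn and let the next interval retry
        echo "WARN: rediscovery failed, will retry next interval" >&2
        return 0
    fi
    after=$(cksum < "$config")
    if [ "$before" != "$after" ]; then
        : > "$flag"   # signal the main loop that repos were added
    fi
}

# $4 = interval in seconds (interval_seconds from the auto_discover block)
fleet_rediscover_loop() {
    local org="$1" config="$2" flag="$3" interval="$4"
    while true; do
        sleep "$interval"
        fleet_rediscover_once "$org" "$config" "$flag"
    done
}

# Spawned from fleet_start() as a backgrounded subshell, for example:
#   ( fleet_rediscover_loop "$org" "$config" "$flag" 3600 ) &
```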

Hot-Add Mechanism

fleet_add_repo() adds a repo entry to the in-memory config, starts a daemon for the new repo (following existing fleet_start_repo() patterns), and updates the fleet status file. This avoids restarting the entire fleet for newly discovered repos.
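
The consumer side of the flag file could look like the sketch below. Everything here is illustrative: fleet_add_repo is reduced to a stand-in that only records the repo, and RUNNING_REPOS, repo_is_running, and process_rediscover_flag are hypothetical names. A newline-separated string replaces the associative array that Bash 3.2 lacks.

```bash
RUNNING_REPOS=""   # newline-separated list of repos that already have a daemon

repo_is_running() {
    printf '%s\n' "$RUNNING_REPOS" | grep -Fxq -- "$1"
}

# Stand-in for the real fleet_add_repo(), which would also start a daemon
# and update the fleet status file.
fleet_add_repo() {
    if [ -z "$RUNNING_REPOS" ]; then
        RUNNING_REPOS="$1"
    else
        RUNNING_REPOS="$RUNNING_REPOS
$1"
    fi
}

# $1 = flag file; configured repo paths are read from stdin, one per line.
process_rediscover_flag() {
    local flag="$1" repo
    [ -e "$flag" ] || return 0
    while IFS= read -r repo; do
        [ -n "$repo" ] || continue
        repo_is_running "$repo" || fleet_add_repo "$repo"
    done
    rm -f "$flag"   # flag is cleared only after every repo was considered
}
```

The function should be fed through a redirect, for example `process_rediscover_flag "$flag" < <(jq -r '.repos[].path' "$config")`, rather than a pipe: a pipe would run it in a subshell and the updated RUNNING_REPOS would be lost.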

Topology in fleet_status()

Extend the existing status output with a topology section:

  • Repos grouped by machine (local vs. each remote)
  • Workers allocated per repo
  • Active/queued job counts
  • Auto-discover: enabled/disabled, last scan timestamp, next scan ETA

Error Handling

  • $NO_GITHUB set → fleet_discover() prints warning and exits 0 (no-op, consistent with other modules)
  • gh api failures → logged via warn(), discovery aborted for that run, next interval retries
  • Invalid org / 404 → error() + exit 1 for CLI, warn() + continue for background loop
  • Rate limiting → a rate-limited gh api call is handled like any other gh api failure; if pagination fails mid-stream, partial results are discarded (no partial writes)
  • .shipwright-ignore file check failure (network error) → repo is included (fail-open, user can always add shipwright-ignore topic as the reliable opt-out)
  • Atomic config writes: write to fleet-config.json.tmp, then mv into place
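
Two of the rules above, the $NO_GITHUB no-op and the fail-open file check, can be sketched as follows. The function names are hypothetical. The file check relies on gh api --include printing the HTTP status line before the body, and treats anything other than a 200 (a 404, a network error, an empty response) as "not opted out".

```bash
# Returns 0 (and warns) when GitHub access is disabled, so callers can no-op:
#   github_disabled && return 0
github_disabled() {
    if [ -n "${NO_GITHUB:-}" ]; then
        echo "WARN: NO_GITHUB is set, skipping discovery" >&2
        return 0
    fi
    return 1
}

# Returns 0 only when the .shipwright-ignore file demonstrably exists.
# $1 = owner/repo
repo_opted_out_by_file() {
    local status
    # First line of --include output looks like "HTTP/2.0 200 OK"
    status=$(gh api --include "repos/$1/contents/.shipwright-ignore" 2>/dev/null \
        | head -n 1 | awk '{print $2}') || true
    [ "$status" = "200" ]
}
```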

Pagination

gh api --paginate handles GitHub's Link-header pagination automatically. For orgs with 1000+ repos, this produces a single concatenated JSON array. We pipe through jq filters in a single pass.
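
A defensive way to consume the paginated output is to slurp it through jq, which yields one array whether the pages arrive already merged or as several arrays back to back. The helper names and the per_page=100 query parameter are choices made for this sketch.

```bash
# Collapse one or more JSON arrays on stdin into a single array.
merge_pages() {
    jq -c -s 'add // []'
}

# $1 = org. With "set -o pipefail" (as the fleet scripts use), a gh failure
# mid-stream makes the whole pipeline fail, so the caller can discard the output.
list_org_repos() {
    gh api --paginate "orgs/$1/repos?per_page=100" | merge_pages
}
```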

Alternatives Considered

  1. GitHub GraphQL API via sw-github-graphql.sh — Pros: single request for all data including topics, richer filtering server-side, lower API call count. Cons: sw-github-graphql.sh is designed for per-repo queries within a known repo context, not org-wide scans; would require new query templates and caching logic; REST gh api --paginate is simpler and already used in fleet for health checks; GraphQL org queries require different auth scopes. Rejected: unnecessary complexity for the use case.

  2. Separate discovery script (sw-fleet-discover.sh) — Pros: smaller files, clear separation. Cons: discovery is tightly coupled to fleet config schema and hot-add; a separate file would need to import fleet internals or duplicate them; the existing fleet script already handles config loading, status, and rebalancing — discovery is a natural extension. Rejected: would create coupling issues without meaningful separation benefit.

  3. GitHub App / webhook-based discovery — Pros: real-time repo creation events, no polling. Cons: requires a running server or Lambda, GitHub App setup, dramatically increases infrastructure complexity; polling at 1-hour intervals is sufficient for fleet management where daemon startup itself takes seconds. Rejected: over-engineered for the use case.

Implementation Plan

  • Files to create: None
  • Files to modify:
    • scripts/sw-fleet.sh — Add fleet_discover(), fleet_rediscover_loop(), fleet_add_repo(), topology in fleet_status(), CLI parsing for discover subcommand, load_fleet_config() updates for auto_discover block
    • scripts/sw-fleet-test.sh — 13 new test cases covering discover, filters, opt-out, merge, dry-run, rediscovery loop, hot-add, topology display, NO_GITHUB handling
    • .claude/CLAUDE.md — Document fleet discover command, auto_discover config keys, topology status output
  • Dependencies: None new. Uses existing gh, jq, standard POSIX tools.
  • Risk areas:
    • Pagination memory for large orgs: gh api --paginate concatenates all pages into memory. For orgs with 5000+ repos, this could be several MB of JSON. Acceptable for bash fleet management; not a realistic bottleneck.
    • .shipwright-ignore file checks: One API call per discovered repo to check for the file. For 100 repos, that's 100 sequential API calls. Mitigate by checking the shipwright-ignore topic first (free, already in the repo listing response) and only checking the file for repos that pass all other filters. Consider caching results in the rediscovery loop.
    • Race condition on hot-add: rediscovery and the rebalancer may both modify the config at the same time. Mitigate with atomic writes and flag-file signaling (the rebalancer processes the flag after its current cycle).
    • --merge correctness: Must match repos by path field (local repos) or remote URL. Repos already in config should not be duplicated. Use jq to deduplicate by a canonical key.
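
The --merge deduplication named in the last risk item could be sketched with a jq reduce that appends only repos whose key is not already present. The sketch uses path as the canonical key and assumes the discovered repos arrive as a JSON array in a file; both are assumptions, and merge_repos is a hypothetical name.

```bash
# Print the merged config on stdout. Existing entries keep their content and order;
# discovered entries are appended only when their path is new.
# $1 = existing config file, $2 = file holding a JSON array of discovered repo entries
merge_repos() {
    jq --slurpfile found "$2" '
        reduce ($found[0][]) as $r (.;
            if any(.repos[]; .path == $r.path) then .
            else .repos += [$r]
            end)
    ' "$1"
}
```

Writing the result would still go through the tmp + mv pattern, for example `merge_repos "$config" "$found" > "$config.tmp" && mv "$config.tmp" "$config"`.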

Validation Criteria

  • shipwright fleet discover --org test-org --dry-run lists repos without modifying config
  • shipwright fleet discover --org test-org generates valid fleet-config.json with discovered repos
  • --language, --activity-days, --topic, --has-issues, --include-forks filters reduce the repo list correctly
  • Repos with shipwright-ignore topic are excluded from discovery
  • Repos with .shipwright-ignore file are excluded from discovery
  • --merge adds new repos to existing config without duplicating or removing existing entries
  • auto_discover config block is parsed by load_fleet_config() and drives fleet_rediscover_loop()
  • Background rediscovery loop fires at configured interval and hot-adds new repos via flag file
  • fleet_status() displays topology with repos grouped by machine
  • NO_GITHUB=1 causes discover to no-op with a warning
  • All 13 new test cases pass in sw-fleet-test.sh
  • All 22 existing test suites continue to pass (npm test)
  • No Bash 3.2 incompatibilities (no associative arrays, no readarray, no ${var,,})
