Pipeline Design 22

Design: Add multi-repo fleet auto-discovery from GitHub org

Context

Shipwright's fleet orchestrator (scripts/sw-fleet.sh) currently requires manual configuration of repos in .claude/fleet-config.json. Users managing GitHub organizations with many repos must hand-edit this config for each repo, which is error-prone and doesn't adapt as repos are created, archived, or change activity levels.

Constraints from the codebase:

All scripts must be Bash 3.2 compatible (no associative arrays, no readarray, no ${var,,})
Scripts use set -euo pipefail, atomic file writes (tmp + mv), and jq --arg for JSON
GitHub API calls must respect $NO_GITHUB env var (existing pattern across all modules)
The fleet already has a background loop pattern: fleet_rebalance() runs on an interval via sleep in a backgrounded subshell — the rediscovery loop should follow the same pattern
gh api is the standard GitHub client (not raw curl), used throughout sw-github-graphql.sh and sw-fleet.sh
Fleet config lives at .claude/fleet-config.json with a known schema (repos[], worker_pool, etc.)
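
The Bash 3.2 constraint rules out a few common idioms. As a rough illustration, the helpers below show portable stand-ins for ${var,,} and for associative-array lookups. The names to_lower and list_contains are hypothetical and do not come from the Shipwright codebase.

```shell
# Lowercase a string without ${var,,} (unavailable in Bash 3.2).
to_lower() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Membership test without associative arrays.
# $1 = needle, $2 = newline-delimited list of items.
list_contains() {
  # -F fixed string, -x whole-line match, -q quiet
  printf '%s\n' "$2" | grep -Fqx -- "$1"
}
```

A newline-delimited string stands in for a set here; that is adequate for repo names, which cannot contain newlines.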

Decision

Extend sw-fleet.sh with a discover subcommand and fleet_rediscover_loop() background process.

Data Flow

gh api /orgs/{org}/repos --paginate
  → JSON array of repos
  → Filter: language, pushed_at within the last activity_days days, topics, open issues (when --has-issues), !archived, !disabled, !fork (unless --include-forks)
  → Opt-out check: skip repos with "shipwright-ignore" topic
  → Opt-out check: skip repos where gh api /repos/{owner}/{repo}/contents/.shipwright-ignore returns 200
  → Generate fleet-config.json entries (or merge with existing)
  → Output summary / dry-run report
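
The filter and topic opt-out stages above could plausibly be a single jq program. The sketch below assumes the repo objects carry the GitHub REST fields full_name, language, pushed_at, topics, open_issues_count, archived, disabled, and fork. The function name filter_repos and its positional arguments are hypothetical. The .shipwright-ignore file check is left out because it needs one extra API call per repo.

```shell
# filter_repos LANGUAGE ACTIVITY_DAYS TOPICS_CSV HAS_ISSUES INCLUDE_FORKS
# Reads a JSON array of repo objects on stdin and prints the full_name of
# every repo that survives the filters. Empty LANGUAGE or TOPICS_CSV means
# "no filter"; HAS_ISSUES and INCLUDE_FORKS are the literals true or false.
filter_repos() {
  local language="$1" days="$2" topics="$3" has_issues="$4" include_forks="$5"
  jq -r \
    --arg language "$language" \
    --arg topics "$topics" \
    --argjson days "$days" \
    --argjson has_issues "$has_issues" \
    --argjson include_forks "$include_forks" '
    (now - ($days * 86400)) as $cutoff
    | ($topics | split(",") | map(select(length > 0))) as $required
    | .[]
    | select(.archived | not)
    | select(.disabled | not)
    | select($include_forks or (.fork | not))
    | select($language == ""
             or ((.language // "") | ascii_downcase) == ($language | ascii_downcase))
    | select(((.pushed_at // "1970-01-01T00:00:00Z") | fromdateiso8601) >= $cutoff)
    | select(($has_issues | not) or ((.open_issues_count // 0) > 0))
    | (.topics // []) as $t
    # Topic-based opt-out: free, because topics are already in the listing.
    | select(($t | index("shipwright-ignore")) == null)
    # Every required topic must be present.
    | select(all($required[]; . as $r | ($t | index($r)) != null))
    | .full_name
  '
}
```

Note that open_issues_count in the REST listing also counts open pull requests, so --has-issues may be slightly more permissive than its name suggests.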

CLI Interface

shipwright fleet discover --org <org> [flags]

Flags:

--org <name> — GitHub org (required)
--language <lang> — filter by primary language
--activity-days <N> — only repos pushed within N days (default: 90)
--topic <topic> — require this topic (repeatable via comma-separated)
--has-issues — only repos with open issues
--include-forks — include forked repos (excluded by default)
--merge — merge discovered repos into existing config rather than overwriting
--dry-run — print what would be added, don't write config
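
One way to parse these flags under Bash 3.2 is a plain while/case loop, sketched below. The function name parse_discover_args and the DISCOVER_* variables are hypothetical placeholders, not names from sw-fleet.sh.

```shell
# Parse `fleet discover` flags into DISCOVER_* globals.
# Returns 1 on an unknown flag, a missing value, or a missing --org.
parse_discover_args() {
  DISCOVER_ORG=""
  DISCOVER_LANGUAGE=""
  DISCOVER_ACTIVITY_DAYS=90
  DISCOVER_TOPICS=""
  DISCOVER_HAS_ISSUES=false
  DISCOVER_INCLUDE_FORKS=false
  DISCOVER_MERGE=false
  DISCOVER_DRY_RUN=false
  while [ $# -gt 0 ]; do
    case "$1" in
      --org|--language|--activity-days|--topic)
        # Value-taking flags: fail early if the value is missing.
        if [ $# -lt 2 ]; then
          echo "error: $1 requires a value" >&2
          return 1
        fi
        case "$1" in
          --org) DISCOVER_ORG="$2" ;;
          --language) DISCOVER_LANGUAGE="$2" ;;
          --activity-days) DISCOVER_ACTIVITY_DAYS="$2" ;;
          --topic) DISCOVER_TOPICS="$2" ;;  # comma-separated list
        esac
        shift 2
        ;;
      --has-issues) DISCOVER_HAS_ISSUES=true; shift ;;
      --include-forks) DISCOVER_INCLUDE_FORKS=true; shift ;;
      --merge) DISCOVER_MERGE=true; shift ;;
      --dry-run) DISCOVER_DRY_RUN=true; shift ;;
      *)
        echo "error: unknown flag: $1" >&2
        return 1
        ;;
    esac
  done
  if [ -z "$DISCOVER_ORG" ]; then
    echo "error: --org is required" >&2
    return 1
  fi
  # Reject anything that is not a non-negative integer.
  case "$DISCOVER_ACTIVITY_DAYS" in
    ''|*[!0-9]*)
      echo "error: --activity-days must be a non-negative integer" >&2
      return 1
      ;;
  esac
}
```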

Config Schema Addition

{
  "repos": [...],
  "worker_pool": {...},
  "auto_discover": {
    "enabled": false,
    "org": "my-org",
    "interval_seconds": 3600,
    "filters": {
      "language": null,
      "activity_days": 90,
      "has_issues": false,
      "topics": [],
      "include_forks": false
    }
  }
}
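
Existing configs will not have the auto_discover block, so a loader probably needs to fall back to the defaults shown above. The sketch below does this with jq; the function name read_auto_discover and the tab-separated output format are assumptions made for illustration.

```shell
# read_auto_discover CONFIG_PATH
# Prints enabled, org, interval_seconds and activity_days, tab-separated,
# substituting the schema defaults for any key that is absent.
read_auto_discover() {
  jq -r '
    (.auto_discover // {}) as $ad
    | [ ($ad | .enabled // false),
        ($ad | .org // ""),
        ($ad | .interval_seconds // 3600),
        ($ad | .filters.activity_days // 90) ]
    | map(tostring)
    | join("\t")
  ' "$1"
}
```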

Background Re-discovery

fleet_rediscover_loop() follows the same pattern as fleet_rebalance():

Spawned as a backgrounded subshell from fleet_start()
Sleeps for interval_seconds, then calls fleet_discover --org "$org" --merge
On new repos found, writes a fleet-rediscover.flag file
Main fleet loop checks for this flag file and calls fleet_add_repo() to hot-add repos to running daemons
Flag file removed after processing
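
The loop and flag-file handoff described above might look roughly like the following. This is a sketch under stated assumptions: the discover command is passed in as arguments and is assumed to print one newly found repo per line, and process_rediscover_flag together with its add-callback argument is a hypothetical name for the main-loop side.

```shell
# fleet_rediscover_loop INTERVAL FLAG_FILE DISCOVER_CMD [ARGS...]
# Every INTERVAL seconds, run DISCOVER_CMD; if it prints any repos,
# publish them through FLAG_FILE using an atomic tmp + mv write.
fleet_rediscover_loop() {
  local interval="$1" flag_file="$2" found
  shift 2
  while true; do
    sleep "$interval"
    # A failed discovery run is skipped; the next interval retries.
    found=$("$@" 2>/dev/null) || continue
    if [ -n "$found" ]; then
      printf '%s\n' "$found" > "$flag_file.tmp"
      mv "$flag_file.tmp" "$flag_file"
    fi
  done
}

# process_rediscover_flag FLAG_FILE ADD_FN
# Main-loop side: call ADD_FN once per repo listed in FLAG_FILE,
# then remove the flag file.
process_rediscover_flag() {
  local flag_file="$1" add_fn="$2" repo
  [ -f "$flag_file" ] || return 0
  while IFS= read -r repo; do
    if [ -n "$repo" ]; then
      "$add_fn" "$repo"
    fi
  done < "$flag_file"
  rm -f "$flag_file"
}
```

Under these assumptions, fleet_start() would launch the loop with something like `fleet_rediscover_loop "$interval" "$flag" fleet_discover --org "$org" --merge &` and the main loop would call `process_rediscover_flag "$flag" fleet_add_repo` on each pass.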

Hot-Add Mechanism

fleet_add_repo() adds a repo entry to the in-memory config, starts a daemon for the new repo (following existing fleet_start_repo() patterns), and updates the fleet status file. This avoids restarting the entire fleet for newly discovered repos.

Topology in `fleet_status()`

Extend the existing status output with a topology section:

Repos grouped by machine (local vs. each remote)
Workers allocated per repo
Active/queued job counts
Auto-discover: enabled/disabled, last scan timestamp, next scan ETA

Error Handling

$NO_GITHUB set → fleet_discover() prints warning and exits 0 (no-op, consistent with other modules)
gh api failures → logged via warn(), discovery aborted for that run, next interval retries
Invalid org / 404 → error() + exit 1 for CLI, warn() + continue for background loop
Rate limiting → gh api handles retry headers natively; if pagination fails mid-stream, partial results are discarded (no partial writes)
.shipwright-ignore file check failure (network error) → repo is included (fail-open, user can always add shipwright-ignore topic as the reliable opt-out)
Atomic config writes: write to fleet-config.json.tmp, then mv into place
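
Two of these rules are small enough to sketch directly: the NO_GITHUB guard and the atomic config write. The function names github_enabled and write_config_atomic are hypothetical, and treating any non-empty NO_GITHUB value as "set" is an assumption about the existing convention.

```shell
# github_enabled: returns 1 (with a warning) when NO_GITHUB is non-empty.
github_enabled() {
  if [ -n "${NO_GITHUB:-}" ]; then
    echo "warn: NO_GITHUB is set; skipping GitHub discovery" >&2
    return 1
  fi
  return 0
}

# write_config_atomic TARGET
# Reads the new config from stdin into TARGET.tmp, then renames it into
# place so readers never observe a half-written file.
write_config_atomic() {
  local target="$1" tmp="$1.tmp"
  cat > "$tmp" || { rm -f "$tmp"; return 1; }
  # Refuse to install empty output, e.g. from a failed upstream stage.
  if [ ! -s "$tmp" ]; then
    rm -f "$tmp"
    return 1
  fi
  mv "$tmp" "$target"
}
```

A caller would typically write `github_enabled || return 0` at the top of fleet_discover() to get the no-op behavior, and pipe jq output into write_config_atomic.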

Pagination

gh api --paginate handles GitHub's Link-header pagination automatically. For orgs with 1000+ repos, this produces a single concatenated JSON array. We pipe through jq filters in a single pass.
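
In case the pages ever arrive as several arrays back to back rather than as one array, a defensive sketch can normalize both shapes with jq's slurp mode before filtering. Capturing the full output first also makes it straightforward to discard partial results when gh fails mid-pagination. The helper names merge_pages and fetch_org_repos are hypothetical.

```shell
# merge_pages: reads one or more JSON arrays from stdin and prints a
# single merged array. Empty input yields [].
merge_pages() {
  jq -s 'add // []'
}

# fetch_org_repos ORG
# Prints all repos of ORG as one JSON array, or prints nothing and
# returns 1 if any page fails (no partial results).
fetch_org_repos() {
  local org="$1" raw
  raw=$(gh api "orgs/$org/repos?per_page=100" --paginate) || return 1
  printf '%s' "$raw" | merge_pages
}
```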

Alternatives Considered

GitHub GraphQL API via sw-github-graphql.sh — Pros: single request for all data including topics, richer filtering server-side, lower API call count. Cons: sw-github-graphql.sh is designed for per-repo queries within a known repo context, not org-wide scans; would require new query templates and caching logic; REST gh api --paginate is simpler and already used in fleet for health checks; GraphQL org queries require different auth scopes. Rejected: unnecessary complexity for the use case.
Separate discovery script (sw-fleet-discover.sh) — Pros: smaller files, clear separation. Cons: discovery is tightly coupled to fleet config schema and hot-add; a separate file would need to import fleet internals or duplicate them; the existing fleet script already handles config loading, status, and rebalancing — discovery is a natural extension. Rejected: would create coupling issues without meaningful separation benefit.
GitHub App / webhook-based discovery — Pros: real-time repo creation events, no polling. Cons: requires a running server or Lambda, GitHub App setup, dramatically increases infrastructure complexity; polling at 1-hour intervals is sufficient for fleet management where daemon startup itself takes seconds. Rejected: over-engineered for the use case.

Implementation Plan

Files to create: None
Files to modify:
- scripts/sw-fleet.sh — Add fleet_discover(), fleet_rediscover_loop(), fleet_add_repo(), topology in fleet_status(), CLI parsing for discover subcommand, load_fleet_config() updates for auto_discover block
- scripts/sw-fleet-test.sh — 13 new test cases covering discover, filters, opt-out, merge, dry-run, rediscovery loop, hot-add, topology display, NO_GITHUB handling
- .claude/CLAUDE.md — Document fleet discover command, auto_discover config keys, topology status output
Dependencies: None new. Uses existing gh, jq, standard POSIX tools.
Risk areas:
- Pagination memory for large orgs: gh api --paginate concatenates all pages into memory. For orgs with 5000+ repos, this could be several MB of JSON. Acceptable for bash fleet management; not a realistic bottleneck.
- .shipwright-ignore file checks: One API call per discovered repo to check for the file. For 100 repos, that's 100 sequential API calls. Mitigate by checking the shipwright-ignore topic first (free, already in the repo listing response) and only checking the file for repos that pass all other filters. Consider caching results in the rediscovery loop.
- Race condition on hot-add: rediscovery and the rebalancer may both modify the config at the same time. Mitigate with atomic writes and flag-file signaling (the rebalancer processes the flag after its current cycle).
- --merge correctness: Must match repos by path field (local repos) or remote URL. Repos already in config should not be duplicated. Use jq to deduplicate by a canonical key.
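
The --merge deduplication could be done with a jq reduce that keeps the first entry seen for each canonical key, so existing config entries win over discovered ones and the repo order is preserved. This is a sketch only: the url field name is an assumption about how remote repos are stored, and merge_repos is a hypothetical helper.

```shell
# merge_repos CONFIG_PATH DISCOVERED_PATH
# CONFIG_PATH holds the existing fleet config; DISCOVERED_PATH holds a JSON
# array of candidate repo entries. Prints the merged config on stdout.
merge_repos() {
  jq --slurpfile found "$2" '
    # Canonical key: local path if present, else remote URL (assumed field).
    def key: (.path // .url // "");
    .repos = (
      reduce ((.repos // []) + $found[0])[] as $r
        ({seen: {}, out: []};
         ($r | key) as $k
         | if ($k | length) > 0 and .seen[$k]
           then .                                  # duplicate: keep first entry
           else .seen[$k] = true | .out += [$r]
           end)
      | .out
    )
  ' "$1"
}
```

Entries with neither field are passed through untouched rather than collapsed into one, which errs on the side of not dropping repos.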

