Skip to content

v0.8.0

@tlamadon tlamadon tagged this 04 Jun 00:42
Job submission has been server-mediated since 0.6.x (`workflow run`
goes through `/api/v1/sources/{name}/run`), but stack management was
still client-local: `stack install` SSHed to the backend from the
operator's machine using local backend configs. A thin remote agent
(cli_server + cli_auth, no `backends:` locally, no SSH route to the
cluster) could submit jobs but couldn't install the stacks those
jobs depend on. v0.8.0 makes the two flows symmetric.

API endpoints (all server-mediated, single-backend per call):
- `GET /api/v1/stacks/{name}/check?backend=<b>&source=<src>` —
  resolve the stack from server-global config or, when `source` is
  set, overlay the source's project-YAML stacks on top (source wins
  on collision, mirroring `_overlay_source_stacks`). 404 when the
  name isn't in either layer, 422 on invalid project YAML, 503 when
  the backend has no SSH on the server. Returns the `StackStatus`
  shape (state/hash/path/last_built/size_bytes/error).
- `POST /api/v1/stacks/{name}/install?backend=<b>&source=<src>&rebuild=<bool>`
  — same resolution; awaits `StackManager.install()` end-to-end on
  the server's SSH connection, with `scheduler="slurm"` set when the
  backend is Slurm (so prep runs via srun on a worker, not the login
  node) and `"pbs"`/None otherwise. Idempotent: skips when `.ready`
  exists at the hash unless `rebuild=true`.
- `DELETE /api/v1/stacks/{name}?backend=<b>&source=<src>` — `rm -rf`
  the cache for every hash of that stack name on that backend.

CLI routing:
- When `_resolve_server(args)` returns a URL, `stack install/check/
  delete` dispatch to the new endpoints instead of opening local
  SSH. `stack list` keeps using the local-overlay path (it's just a
  config read).
- `--backend` is required in server mode — v1 doesn't iterate over
  backends across separate API calls because server-side installs
  are long-running and silently fanning out N concurrent prep jobs
  would surprise the user. The error message explicitly says how to
  pick (`scripthut backend list --json`).
- `--server local` still forces local mode (existing escape hatch).
- The remote helper uses a 30-min default timeout; Cloudflare-Access
  tunnels generally hold for the duration since prep emits output.
  Long-install streaming/async is the natural v0.8.1 follow-up if
  the edge cuts mid-build.

Behavior changes for existing users:
- A user who today runs `scripthut stack install <name>` with
  `--server` configured AND a local `scripthut.yaml` carrying
  backends will now route through the API instead of going local.
  The flow still works because the server has the same backends —
  but it's worth flagging because the API requires `--backend` to
  be explicit and the server does the SSH, not the client.

Agent prompt:
- The Stacks section was rewritten to document both modes side-by-
  side, with the routing rule (server set → API; `--server local`
  → local), the `--backend`-required note, and the long-installs
  caveat. The `stack install` example now shows both `--source` and
  `--backend` together because that's the common flow for remote
  agents driving scripthut.

Tests:
- `test_api_v1.py` — 12 new tests across all three endpoints:
  resolves from server-global, 404 on unknown name, 404 on unknown
  source, 503 when backend has no SSH, source overlay wins on
  collision (the repo's prep is what `StackManager.check` sees),
  install runs and returns ready, rebuild flag forwarded, slurm vs
  pbs scheduler forwarded.
- `test_cli.py` — 7 new tests covering CLI routing: install/check/
  delete dispatch to the API when a server is resolvable; check
  returns 1 when state != ready; install requires `--backend` in
  server mode; check requires `name` in server mode (URL needs it);
  local-mode path unchanged when no server is set.
- All 520/520 in the broad sweep + 156 in the backend-specific
  suites pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Assets 2
Loading