Skip to content

feat(buildrunner): noop impl, poll-driven buildsignal, pipeline wiring#177

Merged
behinddwalls merged 1 commit into
mainfrom
preetam/build-ext
Jun 3, 2026
Merged

feat(buildrunner): noop impl, poll-driven buildsignal, pipeline wiring#177
behinddwalls merged 1 commit into
mainfrom
preetam/build-ext

Conversation

@behinddwalls
Copy link
Copy Markdown
Collaborator

@behinddwalls behinddwalls commented Jun 2, 2026

Summary

Why?

The orchestrator's build stage needs to drive the BuildRunner contract
end-to-end: trigger the runner, persist the result, and poll Status
until terminal so the batch state machine can react. Polling has to
behave like the rest of the pipeline (queue-driven, partition-isolated,
restart-safe) rather than running as an in-process timer loop.

What?

Stacks on top of the BuildRunner interface, noop, and PublishAfter
PRs. Wires the runner into the orchestrator pipeline.

The build poll loop runs as queue traffic inside the existing
buildsignal consumer (no separate stage). On each delivery it loads the
Build from storage, calls BuildRunner.Status, persists the result via
BuildStore.UpdateStatus, publishes the batch ID to speculate so the
state machine re-evaluates, and re-publishes itself via
Publisher.PublishAfter until the build reaches a terminal state. A
webhook-capable backend can publish into the same topic — the consumer
cannot tell a poll-driven message from a push.

Only the build ID travels on the queue (entity.BuildID); the
consumer reloads the full Build from BuildStore, keeping the message
small and storage the single source of truth — the same ID-on-the-queue,
load-from-storage pattern the rest of the pipeline already uses for
batches and requests. The controllers consume the runner's
entity.BuildID signatures (Trigger returns one; Status takes one).

Pieces:

  • orchestrator/controller/build: assembles base from
    batch.Dependencies and head from batch.Contains, calls
    Trigger, persists the initial Build{Accepted} via
    BuildStore.Create (ErrAlreadyExists is swallowed for redelivery),
    publishes the build ID to buildsignal.
  • orchestrator/controller/buildsignal: the polling consumer described
    above. It loads the Build by ID, then polls. PollDelayAcceptedMs=5000,
    PollDelayRunningMs=2000 by default (vars so tests can override; a TODO
    notes these should move into the queueconfig extension). Error
    classification: only the PublishAfter re-schedule is wrapped retryable
    (errs.NewRetryableError) — it is the poll loop's heartbeat, so a
    transient enqueue blip nacks and replays (up to MaxAttempts) rather
    than rejecting the loop's only live message straight to DLQ. Deserialize,
    the Build load, Status, UpdateStatus, and the speculate publish
    stay non-retryable and reject to DLQ on first failure, where an
    operational republish is the recovery path.
  • example/server/orchestrator/main.go: passes the BuildRunner to
    both build.NewController and buildsignal.NewController; pipeline
    diagram updated.
  • root BUILD.bazel: adds # gazelle:exclude .claude so gazelle does
    not index nested worktrees as duplicate rule definitions and corrupt
    the canonical BUILD files.

Test Plan

  • bazel test //extension/buildrunner/... //orchestrator/controller/build/... //orchestrator/controller/buildsignal/... //extension/queue/... — all pass.
  • make fmt lint check-tidy check-gazelle check-mocks — clean.
  • make build — all targets compile.
  • New coverage: build controller persist+publish path (with
    ErrAlreadyExists swallow), buildsignal poll loop (terminal forwards
    to speculate, non-terminal re-publishes via PublishAfter with
    per-status delay, retryable re-publish failure, non-retryable build-load
    / Status / UpdateStatus failures reject to DLQ).

@behinddwalls behinddwalls force-pushed the preetam/queue-publish-after branch from 42a34ea to 2dd6d25 Compare June 2, 2026 18:28
@behinddwalls behinddwalls changed the title feat(buildrunner): noop impl, PublishAfter queue primitive, poll-driven buildsignal feat(buildrunner): noop impl, poll-driven buildsignal, pipeline wiring Jun 2, 2026
@behinddwalls behinddwalls force-pushed the preetam/queue-publish-after branch from 2dd6d25 to a7d0c2a Compare June 2, 2026 18:47
@behinddwalls behinddwalls force-pushed the preetam/queue-publish-after branch from a7d0c2a to d83c3a5 Compare June 2, 2026 18:57
@behinddwalls behinddwalls force-pushed the preetam/queue-publish-after branch from d83c3a5 to d735e8f Compare June 2, 2026 22:04
@behinddwalls behinddwalls force-pushed the preetam/queue-publish-after branch from d735e8f to c368626 Compare June 2, 2026 22:29
@behinddwalls behinddwalls force-pushed the preetam/build-ext branch 2 times, most recently from e298f94 to de8932c Compare June 2, 2026 22:53
@behinddwalls behinddwalls force-pushed the preetam/queue-publish-after branch from c368626 to 64d1a6e Compare June 2, 2026 22:53
@behinddwalls behinddwalls force-pushed the preetam/queue-publish-after branch from 64d1a6e to 1311592 Compare June 2, 2026 23:08
@behinddwalls behinddwalls marked this pull request as ready for review June 2, 2026 23:11
@behinddwalls behinddwalls requested review from a team and sbalabanov as code owners June 2, 2026 23:11
behinddwalls added a commit that referenced this pull request Jun 3, 2026
)

## Summary

### Why?

The build stage needs a vendor-agnostic abstraction for talking to a
Build Runner — one that fits the existing extension family (storage,
queue, conflict) and that does not have to change when the first real
backend lands. Design rationale lives in `doc/rfc/build-runner.md`.

### What?

Introduces the `BuildRunner` contract — three verbs, all keyed by an
`entity.BuildID`: `Trigger` submits a build (ordered `base` + `head`
change sets plus a free-form metadata map) and returns its ID; `Status`
fetches the current `BuildStatus` and runner-defined metadata; `Cancel`
requests cancellation. See `extension/buildrunner/build_runner.go` for
the signatures.

Highlights:

- `Trigger` takes ordered `base` (assumed-good prefix from dependency
  batches) and `head` (changes being verified) — a runner can cache or
  short-circuit a prefix it has validated before, and attribute terminal
  failures to base vs head via `BuildMetadata`.
- Build identifiers are the typed `entity.BuildID` (a lightweight
  `{ID string}` value, also the queue payload envelope) rather than a
  bare `string`, so the runner contract and the on-queue messages share
  one identifier type and the compiler distinguishes a build ID from
  other string IDs.
- `metadata` is a free-form `map[string]string` for caller-supplied
  attributes (requester, ticket ID, trace ID) the runner MAY persist or
  echo back via `Status`.
- `Trigger` and `Cancel` return promptly; `Status` MAY be lengthy.
- The `BuildStatus` enum is narrowed to `Accepted` / `Running` /
  `Succeeded` / `Failed` / `Cancelled` so every backend can map its
  native lifecycle without leaking runner-specific stages.
- `BuildMetadata` is added as a free-form map for round-tripping
  runner-defined attributes (build URL, duration, SHA, etc.).

Naming: the interface is `BuildRunner` (not `BuildManager` — the
"Manager" suffix doesn't describe what it does, and "Build" alone would
collide cognitively with `entity.Build`). The package is
`extension/buildrunner` to follow the repo convention used by
`mergechecker`, `changeprovider`, `pusher`, `counter`, and `storage`.

Implementation (noop backend, queue `PublishAfter` primitive, controller
wiring, and the poll-driven `buildsignal` loop) lands in the follow-up
PRs stacked on top of this one.

## Test Plan

- ✅ `make test` — entity/build tests pass; mock-only consumers of the
  interface compile and pass.
- ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean.
- ✅ `make build` — all targets compile.


## Stack
1. @ #175
1. #179
1. #176
1. #177
behinddwalls added a commit that referenced this pull request Jun 3, 2026
## Summary
### Why?

Wiring tests for the orchestrator's build stage need a `BuildRunner`
implementation, but real backends (BuildKite, Jenkins, etc.) won't land
until later. A noop that immediately reports success is also useful as
the local default in `make local-start` and as a best-case baseline
where every build passes.

### What?

Adds `extension/buildrunner/noop` — a `BuildRunner` that does no real
work. Every `Trigger` returns a unique `noop-<n>` build ID with no
error; every `Status` returns `BuildStatusSucceeded` with no metadata;
`Cancel` is a no-op. The atomic counter keeps IDs unique under
concurrent use.

## Test Plan
- ✅ `make test` — 36 unit tests pass, including the new
  `extension/buildrunner/noop:noop_test` (Trigger uniqueness, Status,
  Cancel, interface satisfaction).
- ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean.
- ✅ `make build` — all targets compile.

## Issues


## Stack
1. #175
1. @ #179
1. #176
1. #177
behinddwalls added a commit that referenced this pull request Jun 3, 2026
## Summary
### Why?

The orchestrator's build poll loop needs to space `Status` calls out
without consuming `retry_count` / DLQ retry slots. `Delivery.Nack` does
schedule redelivery after a delay, but it overloads `retry_count` —
which the operator relies on to flag real failures. The fix is a
separate "postpone this work" primitive, distinct from "this delivery
failed, try again", so both signals stay meaningful.

### What?

Adds `Publisher.PublishAfter(ctx, topic, msg, delayMs)` to the queue
extension: a fresh message inserted into the topic, made visible to
subscribers only after `delayMs` from now. Distinct from
`Delivery.Nack(requeueAfterMillis)`:

- `Nack` is "this delivery failed, try again" — it bumps `retry_count`
  and eventually trips DLQ.
- `PublishAfter` is "postpone this work" — `retry_count` resets to 0,
  DLQ stays available for true failures.

SQL backing in `extension/queue/mysql`:

- New `visible_after BIGINT UNSIGNED NOT NULL DEFAULT 0` column on
  `queue_messages`. Default 0 means immediately visible — back-compat
  for any existing rows.
- `messageStore.InsertDelayed(ctx, topic, messages, visibleAfterMs)` is
  the underlying primitive. `Insert` is now a thin wrapper that passes
  `visibleAfterMs = 0`.
- `FetchByOffset` gains a `nowMs` parameter and an
  `AND visible_after <= ?` predicate so subscribers skip rows whose
  delivery is still deferred.
- `MoveToDLQ` writes `visible_after = 0` explicitly: any delay on the
  original message has already been consumed by the time it failed.

The first consumer of `PublishAfter` is the orchestrator's poll-driven
`buildsignal` loop, which lands in the stacked PR on top of this one.

## Test Plan
- ✅ `bazel test //extension/queue/...` — all pass, including new
  coverage for `PublishAfter` (positive / zero / negative delay,
  closed-publisher) and for `FetchByOffset` skipping deferred rows via
  the new `visible_after` predicate.
- ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean.
- ✅ `make build` — all targets compile.

## Issues


## Stack
1. #175
1. #179
1. @ #176
1. #177
Base automatically changed from preetam/queue-publish-after to main June 3, 2026 02:01
## Summary

### Why?

The orchestrator's build stage needs to drive the `BuildRunner` contract
end-to-end: trigger the runner, persist the result, and poll `Status`
until terminal so the batch state machine can react. Polling has to
behave like the rest of the pipeline (queue-driven, partition-isolated,
restart-safe) rather than running as an in-process timer loop.

### What?

Stacks on top of the BuildRunner interface, noop, and `PublishAfter`
PRs. Wires the runner into the orchestrator pipeline.

The build poll loop runs as queue traffic inside the existing
`buildsignal` consumer (no separate stage). On each delivery it loads the
`Build` from storage, calls `BuildRunner.Status`, persists the result via
`BuildStore.UpdateStatus`, publishes the batch ID to `speculate` so the
state machine re-evaluates, and re-publishes itself via
`Publisher.PublishAfter` until the build reaches a terminal state. A
webhook-capable backend can publish into the same topic — the consumer
cannot tell a poll-driven message from a push.

Only the build **ID** travels on the queue (`entity.BuildID`); the
consumer reloads the full `Build` from `BuildStore`, keeping the message
small and storage the single source of truth — the same ID-on-the-queue,
load-from-storage pattern the rest of the pipeline already uses for
batches and requests. The controllers consume the runner's
`entity.BuildID` signatures (`Trigger` returns one; `Status` takes one).

Pieces:

- `orchestrator/controller/build`: assembles `base` from
  `batch.Dependencies` and `head` from `batch.Contains`, calls
  `Trigger`, persists the initial `Build{Accepted}` via
  `BuildStore.Create` (`ErrAlreadyExists` is swallowed for redelivery),
  publishes the build ID to `buildsignal`.
- `orchestrator/controller/buildsignal`: the polling consumer described
  above. It loads the `Build` by ID, then polls. `PollDelayAcceptedMs=5000`,
  `PollDelayRunningMs=2000` by default (vars so tests can override; a TODO
  notes these should move into the `queueconfig` extension). Error
  classification: only the `PublishAfter` re-schedule is wrapped retryable
  (`errs.NewRetryableError`) — it is the poll loop's heartbeat, so a
  transient enqueue blip nacks and replays (up to `MaxAttempts`) rather
  than rejecting the loop's only live message straight to DLQ. Deserialize,
  the `Build` load, `Status`, `UpdateStatus`, and the speculate publish
  stay non-retryable and reject to DLQ on first failure, where an
  operational republish is the recovery path.
- `example/server/orchestrator/main.go`: passes the `BuildRunner` to
  both `build.NewController` and `buildsignal.NewController`; pipeline
  diagram updated.
- root `BUILD.bazel`: adds `# gazelle:exclude .claude` so gazelle does
  not index nested worktrees as duplicate rule definitions and corrupt
  the canonical BUILD files.

## Test Plan

- ✅ `bazel test //extension/buildrunner/... //orchestrator/controller/build/... //orchestrator/controller/buildsignal/... //extension/queue/...` — all pass.
- ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean.
- ✅ `make build` — all targets compile.
- New coverage: build controller persist+publish path (with
  `ErrAlreadyExists` swallow), buildsignal poll loop (terminal forwards
  to speculate, non-terminal re-publishes via `PublishAfter` with
  per-status delay, retryable re-publish failure, non-retryable build-load
  / `Status` / `UpdateStatus` failures reject to DLQ).
@behinddwalls behinddwalls merged commit 1106427 into main Jun 3, 2026
14 checks passed
@behinddwalls behinddwalls deleted the preetam/build-ext branch June 3, 2026 03:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants