Skip to content

rpadmin: add PreRestartProbe and PostRestartProbe#170

Merged
david-yu merged 5 commits into
mainfrom
dyu/rpadmin-pre-post-restart-probe
May 21, 2026
Merged

rpadmin: add PreRestartProbe and PostRestartProbe#170
david-yu merged 5 commits into
mainfrom
dyu/rpadmin-pre-post-restart-probe

Conversation

@david-yu
Copy link
Copy Markdown
Contributor

@david-yu david-yu commented May 20, 2026

Summary

Adds Go bindings for the broker-local restart probes introduced in
Redpanda 25.1
(src/v/redpanda/admin/api-doc/broker.json):

Endpoint New method Response
GET /v1/broker/pre_restart_probe PreRestartProbe(ctx, limit) PreRestartCheckResult{Risks}
GET /v1/broker/post_restart_probe PostRestartProbe(ctx, limit) PostRestartCheckResult{LoadReclaimedPercent}

PreRestartProbe asks the broker whether restarting it right now would
affect partitions, and groups the affected partitions into four risk
categories:

  • rf1_offline — RF=1 partitions go offline (generally acceptable risk)
  • full_acks_produce_unavailableacks=-1 produce would be rejected
  • unavailable — both produce and consume rejected
  • acks1_data_lossacks=1 produce could lose data

PostRestartProbe reports how much load this broker has reclaimed
since the most recent restart as a percentage of in-sync replicas.

Both probes answer for the broker that handles the request, so callers
typically scope the client to a specific broker via ForHost before
invoking. The optional limit parameter caps the number of partitions
returned per risk category (server default is 128).

Why

The K8s operator's rolling-restart gate is the immediate consumer
(redpanda-operator#1537):
today it relies on the cluster-wide health overview, which lags pod
state and conflates "this broker isn't safe to restart" with "the
cluster has any under-replicated partition." PreRestartProbe gives
the operator a precise per-decision signal; PostRestartProbe lets it
defer the next pod delete until the just-restarted broker has actually
caught up rather than just passing a liveness check.

Tracked in Redpanda's ENG-222.

Testing

Per review feedback, the probes are now exercised end-to-end against a
real Redpanda broker rather than an httptest mock. The integration
tests in rpadmin/api_broker_test.go spin up
redpandadata/redpanda-nightly:latest via testcontainers-go's redpanda
module

(same dependency already used by kvstore and redpanda-otel-exporter)
and drive the rpadmin client against the container's admin API.

  • TestPreRestartProbe
    • creates a 4-partition RF=1 topic via kadm
    • asserts each partition surfaces in rf1_offline (polled, since the
      probe trails topic creation briefly)
    • asserts the other three risk-category fields decode (catching silent
      JSON-tag drift between client and server)
    • asserts limit=1 caps every category to ≤1 entry
  • TestPostRestartProbe
    • polls the probe until it succeeds against a freshly-started broker
    • asserts load_reclaimed_pc ∈ [0, 100]
    • asserts limit propagates without the broker rejecting the request

Both tests skip under go test -short and gate behind testcontainers'
Docker discovery so contributors without Docker can still run the rest
of the rpadmin suite. The probe doc comments now also link directly to
the admin API spec (broker.json) for callers who want to verify the
response shape without spelunking the rpadmin source.

CI: the Test workflow already runs the rpadmin matrix entry on every
PR touching **.go; this PR also broadens its trigger paths to include
**/go.mod and **/go.sum so dependency-only updates run the tests too.

Test plan

  • go test -run "TestPreRestartProbe|TestPostRestartProbe" ./...
    against redpandadata/redpanda-nightly:latest — both green
  • go test -short ./... — integration tests skip cleanly
  • go vet ./...
  • golangci-lint run --new-from-rev=main ./... — 0 issues

🤖 Generated with Claude Code

Adds Go bindings for the broker-local restart probes introduced in
Redpanda 25.1 (src/v/redpanda/admin/api-doc/broker.json):

  GET /v1/broker/pre_restart_probe  -> PreRestartCheckResult{Risks}
  GET /v1/broker/post_restart_probe -> PostRestartCheckResult{LoadReclaimedPercent}

PreRestartProbe lets a caller ask the broker whether restarting it is
safe right now. The response groups the partitions that would be
affected into four risk categories:

  - rf1_offline                   (RF=1 partitions go offline; usually OK)
  - full_acks_produce_unavailable (acks=-1 produce would be rejected)
  - unavailable                   (both produce and consume rejected)
  - acks1_data_loss               (acks=1 produce could lose data)

PostRestartProbe reports how much load the broker has reclaimed
since the most recent restart, as a percentage of in-sync replicas.

Both probes answer for the broker that handles the request, so callers
typically scope the client to a specific broker via ForHost before
invoking. The optional limit parameter caps the number of partitions
returned per risk category (server default is 128).

The K8s operator's rolling-restart gate is the immediate consumer:
today it relies on the cluster-wide health overview, which lags pod
state and conflates "this broker isn't safe to restart" with "the
cluster has any under-replicated partition". The per-broker probe
gives the operator a precise per-decision signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@secpanda
Copy link
Copy Markdown

secpanda commented May 20, 2026

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

CI flagged the table-driven test cases: when the longest key in a map
literal forces aligned spacing, shorter keys need padding to match.
Pre-existing lint issues in other files on main are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread rpadmin/api_broker_test.go Outdated
Copy link
Copy Markdown
Contributor

@andrewstucki andrewstucki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you either swap this to run a real broker via something like testcontainers or at least link to the restart probe API docs?

@david-yu
Copy link
Copy Markdown
Contributor Author

david-yu commented May 20, 2026

Can you either swap this to run a real broker via something like testcontainers or at least link to the restart probe API docs?

Will do, will run a real broker with testcontainers. Thank you.

david-yu and others added 2 commits May 20, 2026 11:12
Replace the httptest mocks for PreRestartProbe/PostRestartProbe with
integration tests that drive a real Redpanda broker via testcontainers
(redpandadata/redpanda-nightly:latest). The pre-restart test creates an
RF=1 topic and asserts the partitions surface in rf1_offline; both tests
also exercise the limit query-param round-trip. The probe doc comments
now also link directly to the admin API spec in src/v/redpanda/admin/
api-doc/broker.json, addressing both halves of the reviewer's suggestion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Test workflow already lists every module in its matrix but only
fires on *.go path changes, so a PR that touches just go.mod/go.sum
(e.g. a dependency bump) bypassed CI. Mirror the Lint workflow's
trigger so module-only updates run the tests too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu david-yu requested a review from andrewstucki May 20, 2026 18:16
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu david-yu merged commit 92a92fa into main May 21, 2026
27 checks passed
@david-yu david-yu deleted the dyu/rpadmin-pre-post-restart-probe branch May 21, 2026 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants