Enable PGO and add marshalling benchmarks #44

Merged: randomizedcoder merged 2 commits into main from pgo-profile on Jun 22, 2026

@randomizedcoder

Summary

Gathers a representative CPU profile and commits it as cmd/xtcp2/default.pgo. Go's default -pgo=auto (and the nix buildGoModule) then applies profile-guided optimization to every build, with no build-flag change required. Also adds the missing pkg/recordfmt marshalling benchmarks so the win is measurable, documents PGO + profiling, and writes up the top profiled hot spots.

How the profile was captured

  • Workload: the host's sockets plus ~2,000 synthetic localhost sockets (tools/tcp_server + tools/tcp_client), so the parse + marshal loops dominate.
  • Two 45 s steady-state CPU windows via the :9088/debug/pprof/profile endpoint with -dest null (so marshalling still runs but terminal I/O doesn't skew the profile), one for -marshal protoJson and one for -marshal protobufList, merged with go tool pprof -proto.

Measured effect (benchstat, -count=8)

pkg/recordfmt (the marshalling hot path — the daemon's CPU-dominant work after the netlink syscalls):

| Benchmark | sec/op (off → on) |
| --- | --- |
| MarshalEnvelopeProtobufList (Kafka path) | −12.9% |
| AppendEnvelopeProtobufList (pooled flush) | −13.8% |
| MarshalEnvelopeJSONL | −11.3% |
| MarshalJSON (per record) | −10.4% |
| MarshalText | −6.7% |
| geomean | −7.3% |

pkg/xtcpnl typed parsers are already sub-10 ns and memory-bound, so PGO is mostly noise there (geomean ≈ −1%); the protoreflect fallbacks improve ~3–6%.

These gains are above the usual low-single-digit PGO result precisely because the profile targets these paths. Refresh the profile from production traffic for best results — see docs/performance.md.
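
For readers unfamiliar with how such numbers are produced, here is a minimal sketch of a per-record marshalling benchmark. It is not the content of pkg/recordfmt/bench_test.go: the `record` type and its fields are hypothetical, and encoding/json stands in for the project's marshallers. It uses testing.Benchmark so it runs as an ordinary program.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// record is a hypothetical stand-in for one socket record.
type record struct {
	SrcPort uint32 `json:"src_port"`
	DstPort uint32 `json:"dst_port"`
	RTTus   uint32 `json:"rtt_us"`
}

// benchMarshalJSON measures the cost of marshalling a single record.
func benchMarshalJSON(b *testing.B) {
	rec := record{SrcPort: 443, DstPort: 51234, RTTus: 180}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(&rec); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	// testing.Benchmark picks b.N automatically, like `go test -bench` does.
	res := testing.Benchmark(benchMarshalJSON)
	fmt.Printf("%d iterations, %d ns/op, %d allocs/op\n",
		res.N, res.NsPerOp(), res.AllocsPerOp())
}
```

In a real `_test.go` file the function would be named `BenchmarkXxx` and run with `go test -bench . -count=8`, once with `-pgo=off` and once with the default, and the two outputs compared with benchstat.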

Quick-wins analysis (from the profile)

go tool pprof -top on the merged profile (18.06 s of samples):

| flat% | cum% | function |
| --- | --- | --- |
| 25.9% | 25.9% | syscall.Syscall6 (netlink recvmsg/sendmsg; io_uring path addresses this) |
| 21.4% | 40.5% | protobuf/internal/impl.(*MessageInfo).sizePointerSlow |
| | 64.3% | pkg/xtcp.(*XTCP).Deserialize (poll-loop parent) |

Headline finding — the envelope size-cap is the #1 CPU win. proto.Size (called from processInetDiagRecord) is 40% of non-idle samples. The size-cap in pkg/xtcp/deserialize.go calls proto.Size(x.currentEnvelope) every envelopeSizeCheckModulus = 64 appends, and each call re-walks the entire growing envelope (all rows). Over a batch of up to 10,000 rows that's roughly O(rows² / 64). Maintaining a running byte accumulator — adding each record's proto.Size once at append time — makes it O(rows). Tracked as a follow-up (no hot-path rewrite in this PR).

Allocation profile (captured on the protoJson window) is dominated by protojson reflection (order.RangeFields, protoreflect.Value.Interface, protojson.marshalMessage, base64) — a generated/vtprotobuf-style encoder is the natural follow-up for the JSON destinations. The new benchmarks also flag MarshalHumanizedJSON at 234 allocs/record (~6× plain JSON) and the reflection CSV path at 1,543 allocs/envelope.

Changes

  • New cmd/xtcp2/default.pgo — committed merged CPU profile (auto-applied by Go + nix).
  • New pkg/recordfmt/bench_test.go — per-record and 64-row-envelope marshalling benchmarks.
  • docs/performance.md — "PGO & profiling" section: how PGO is wired, measured deltas, how to refresh the profile, and the size-cap hot spot.
  • nix/lib/mkGoBinary.nix — comment noting PGO is auto-applied from default.pgo.
  • Removed stale 2024 cmd/xtcp2/cpu.pprof* / mem.pprof* leftovers (not PGO files).

Verification

  • nix build .#xtcp2 .#checks.x86_64-linux.golangci-lint .#test-go-unit .#test-pkg-xtcp — all green with default.pgo present.
  • go build -pgo=auto -x confirms preprofile runs and -pgoprofile= is passed to the compiler.
  • benchstat off-vs-on table above.

Out of scope

  • No hot-path rewrites — PGO + analysis only. The size-cap accumulator, a generated JSON encoder, and the humanize-JSON allocation cuts are follow-ups informed by this profile.
  • The profile is a dev-box starting point; refresh periodically from representative production traffic (same GOARCH).

🤖 Generated with Claude Code

Capture a representative CPU profile and commit it as cmd/xtcp2/default.pgo so
Go's default -pgo=auto (and the nix buildGoModule) optimize every build. The
profile blends the protoJson and protobufList marshallers under a ~2,000-socket
synthetic load, covering both the local JSON path and the production Kafka path.

Add pkg/recordfmt/bench_test.go so the marshalling win is measurable. benchstat
(off vs on, -count=8) shows the recordfmt geomean improving 7.3%, with the
protobufList Kafka path at -12.9%/-13.8% and the JSON paths at -10% to -11%.
The pkg/xtcpnl typed parsers are already sub-10ns so PGO is noise there.

Document PGO + profiling in docs/performance.md, including how to refresh the
profile from production traffic and the top profiled hot spot: proto.Size in the
envelope size-cap re-walks the whole growing envelope every 64 appends (~40% of
non-idle CPU; a running byte accumulator would make it O(rows)).

Remove the stale 2024 cpu.pprof*/mem.pprof* leftovers (not PGO files).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The original profile was captured from the pre-optimization daemon, where
reflective proto.Size/marshal dominated CPU. After the envelope size-cap
accumulator (O(1) per append) and vtprotobuf marshalling land, that cost is
gone and the daemon is netlink-I/O-bound. Recapture default.pgo from the
optimized binary (protoJson + protobufList blended, same synthetic load) so PGO
matches the code.

Update docs/performance.md to match reality: replace the now-obsolete PGO
"measured gains" table with the honest picture (profile is ~46% netlink Syscall6;
the largest remaining Go hot path is protojson on the JSON formats; PGO's
residual benefit is small and the protobufList/Kafka path is already
reflection-free). Reframe the "known hot spot: proto.Size ~40%" note as resolved
by the size-cap accumulator + vtprotobuf.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@randomizedcoder

Refreshed cmd/xtcp2/default.pgo and corrected the docs to the post-optimization reality.

Merge order: this PR should merge after #46 and #47, so the committed profile matches the code in main. (It's PGO-safe either way — Go ignores profile entries for absent functions — but merging last keeps main self-consistent.)
