Enable PGO and add marshalling benchmarks #44
Merged
Capture a representative CPU profile and commit it as cmd/xtcp2/default.pgo so Go's default -pgo=auto (and the nix buildGoModule) optimize every build. The profile blends the protoJson and protobufList marshallers under a ~2,000-socket synthetic load, covering both the local JSON path and the production Kafka path.

Add pkg/recordfmt/bench_test.go so the marshalling win is measurable. benchstat (off vs on, -count=8) shows the recordfmt geomean improving 7.3%, with the protobufList Kafka path at -12.9%/-13.8% and the JSON paths at -10% to -11%. The pkg/xtcpnl typed parsers are already sub-10 ns, so PGO is noise there.

Document PGO + profiling in docs/performance.md, including how to refresh the profile from production traffic and the top profiled hot spot: proto.Size in the envelope size-cap re-walks the whole growing envelope every 64 appends (~40% of non-idle CPU; a running byte accumulator would make it O(rows)).

Remove the stale 2024 cpu.pprof*/mem.pprof* leftovers (not PGO files).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This was referenced Jun 22, 2026
The original profile was captured from the pre-optimization daemon, where reflective proto.Size/marshal dominated CPU. After the envelope size-cap accumulator (O(1)) and vtprotobuf marshalling land, that cost is gone and the daemon is netlink-I/O-bound. Recapture default.pgo from the optimized binary (protoJson + protobufList blended, same synthetic load) so PGO matches the code.

Update docs/performance.md to match reality: replace the now-obsolete PGO "measured gains" table with the honest picture (profile is ~46% netlink Syscall6; the largest remaining Go hot path is protojson on the JSON formats; PGO's residual benefit is small and the protobufList/Kafka path is already reflection-free). Reframe the "known hot spot: proto.Size ~40%" note as resolved by the size-cap accumulator + vtprotobuf.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Refreshed
Merge order: this PR should merge after #46 and #47, so the committed profile matches the code in |
Summary
Gathers a representative CPU profile, commits it as cmd/xtcp2/default.pgo, and relies on Go's default -pgo=auto (and the nix buildGoModule) to apply profile-guided optimization to every build, with no build-flag change required. Adds the missing pkg/recordfmt marshalling benchmarks so the win is measurable, documents PGO + profiling, and writes up the top profiled hot spots.

How the profile was captured

- … tools/tcp_server + tools/tcp_client), so the parse + marshal loops dominate.
- … :9088/debug/pprof/profile endpoint with -dest null (marshalling runs; terminal IO doesn't skew it), one for -marshal protoJson, one for -marshal protobufList, merged with go tool pprof -proto.

Measured effect (benchstat, -count=8)

pkg/recordfmt (the marshalling hot path, the daemon's CPU-dominant work after the netlink syscalls):

- MarshalEnvelopeProtobufList (Kafka path)
- AppendEnvelopeProtobufList (pooled flush)
- MarshalEnvelopeJSONL
- MarshalJSON (per record)
- MarshalText

pkg/xtcpnl typed parsers are already sub-10 ns and memory-bound, so PGO is mostly noise there (geomean ≈ −1%); the protoreflect fallbacks improve ~3–6%.

These gains are above the usual low-single-digit PGO result precisely because the profile targets these paths. Refresh the profile from production traffic for best results; see docs/performance.md.

Quick-wins analysis (from the profile)

go tool pprof -top on the merged profile (18.06 s of samples):

- syscall.Syscall6 (netlink recvmsg/sendmsg; io_uring path addresses this)
- protobuf/internal/impl.(*MessageInfo).sizePointerSlow
- pkg/xtcp.(*XTCP).Deserialize (poll-loop parent)

Headline finding: the envelope size-cap is the #1 CPU win. proto.Size (called from processInetDiagRecord) is 40% of non-idle samples. The size-cap in pkg/xtcp/deserialize.go calls proto.Size(x.currentEnvelope) every envelopeSizeCheckModulus = 64 appends, and each call re-walks the entire growing envelope (all rows). Over a batch of up to 10,000 rows that's roughly O(rows² / 64). Maintaining a running byte accumulator (adding each record's proto.Size once at append time) makes it O(rows). Tracked as a follow-up (no hot-path rewrite in this PR).

Allocation profile (captured on the protoJson window) is dominated by protojson reflection (order.RangeFields, protoreflect.Value.Interface, protojson.marshalMessage, base64); a generated/vtprotobuf-style encoder is the natural follow-up for the JSON destinations. The new benchmarks also flag MarshalHumanizedJSON at 234 allocs/record (~6× plain JSON) and the reflection CSV path at 1,543 allocs/envelope.

Changes

- cmd/xtcp2/default.pgo: committed merged CPU profile (auto-applied by Go + nix).
- pkg/recordfmt/bench_test.go: per-record and 64-row-envelope marshalling benchmarks.
- docs/performance.md: "PGO & profiling" section covering how PGO is wired, measured deltas, how to refresh the profile, and the size-cap hot spot.
- nix/lib/mkGoBinary.nix: comment noting PGO is auto-applied from default.pgo.
- … cmd/xtcp2/cpu.pprof*/mem.pprof* leftovers (not PGO files).

Verification

- nix build .#xtcp2 .#checks.x86_64-linux.golangci-lint .#test-go-unit .#test-pkg-xtcp: all green with default.pgo present.
- go build -pgo=auto -x confirms preprofile runs and -pgoprofile= is passed to the compiler.

Out of scope

- … GOARCH).

🤖 Generated with Claude Code