Enable PGO and add marshalling benchmarks by randomizedcoder · Pull Request #44 · randomizedcoder/xtcp2

randomizedcoder · 2026-06-21T23:47:32Z

Summary

Gathers a representative CPU profile, commits it as cmd/xtcp2/default.pgo, and relies on Go's default -pgo=auto (and the nix buildGoModule) to apply profile-guided optimization to every build — no build-flag change required. Adds the missing pkg/recordfmt marshalling benchmarks so the win is measurable, documents PGO + profiling, and writes up the top profiled hot spots.

How the profile was captured

Workload: the host's sockets plus ~2,000 synthetic localhost sockets (tools/tcp_server + tools/tcp_client), so the parse + marshal loops dominate.
Two 45 s steady-state CPU windows captured from the :9088/debug/pprof/profile endpoint with -dest null (marshalling still runs, but terminal I/O doesn't skew the samples): one with -marshal protoJson and one with -marshal protobufList, merged with go tool pprof -proto.

Measured effect (benchstat, `-count=8`)

pkg/recordfmt (the marshalling hot path — the daemon's CPU-dominant work after the netlink syscalls):

| Benchmark | sec/op (off → on) |
|---|---|
| `MarshalEnvelopeProtobufList` (Kafka path) | −12.9% |
| `AppendEnvelopeProtobufList` (pooled flush) | −13.8% |
| `MarshalEnvelopeJSONL` | −11.3% |
| `MarshalJSON` (per record) | −10.4% |
| `MarshalText` | −6.7% |
| geomean | −7.3% |

pkg/xtcpnl typed parsers are already sub-10 ns and memory-bound, so PGO is mostly noise there (geomean ≈ −1%); the protoreflect fallbacks improve ~3–6%.

These gains are above the usual low-single-digit PGO result precisely because the profile targets these paths. Refresh the profile from production traffic for best results — see docs/performance.md.

Quick-wins analysis (from the profile)

go tool pprof -top on the merged profile (18.06 s of samples):

| flat% | cum% | function |
|---|---|---|
| 25.9% | 25.9% | `syscall.Syscall6` (netlink recvmsg/sendmsg; the io_uring path addresses this) |
| 21.4% | 40.5% | `protobuf/internal/impl.(*MessageInfo).sizePointerSlow` |
| - | 64.3% | `pkg/xtcp.(*XTCP).Deserialize` (poll-loop parent) |

Headline finding — the envelope size-cap is the #1 CPU win. proto.Size (called from processInetDiagRecord) is 40% of non-idle samples. The size-cap in pkg/xtcp/deserialize.go calls proto.Size(x.currentEnvelope) every envelopeSizeCheckModulus = 64 appends, and each call re-walks the entire growing envelope (all rows). Over a batch of up to 10,000 rows that's roughly O(rows² / 64). Maintaining a running byte accumulator — adding each record's proto.Size once at append time — makes it O(rows). Tracked as a follow-up (no hot-path rewrite in this PR).

Allocation profile (captured on the protoJson window) is dominated by protojson reflection (order.RangeFields, protoreflect.Value.Interface, protojson.marshalMessage, base64) — a generated/vtprotobuf-style encoder is the natural follow-up for the JSON destinations. The new benchmarks also flag MarshalHumanizedJSON at 234 allocs/record (~6× plain JSON) and the reflection CSV path at 1,543 allocs/envelope.

Changes

New cmd/xtcp2/default.pgo — committed merged CPU profile (auto-applied by Go + nix).
New pkg/recordfmt/bench_test.go — per-record and 64-row-envelope marshalling benchmarks.
docs/performance.md — "PGO & profiling" section: how PGO is wired, measured deltas, how to refresh the profile, and the size-cap hot spot.
nix/lib/mkGoBinary.nix — comment noting PGO is auto-applied from default.pgo.
Removed stale 2024 cmd/xtcp2/cpu.pprof* / mem.pprof* leftovers (not PGO files).

Verification

nix build .#xtcp2 .#checks.x86_64-linux.golangci-lint .#test-go-unit .#test-pkg-xtcp — all green with default.pgo present.
go build -pgo=auto -x confirms preprofile runs and -pgoprofile= is passed to the compiler.
benchstat off-vs-on table above.

Out of scope

No hot-path rewrites — PGO + analysis only. The size-cap accumulator, a generated JSON encoder, and the humanize-JSON allocation cuts are follow-ups informed by this profile.
The profile is a dev-box starting point; refresh periodically from representative production traffic (same GOARCH).

🤖 Generated with Claude Code

Capture a representative CPU profile and commit it as cmd/xtcp2/default.pgo so Go's default -pgo=auto (and the nix buildGoModule) optimize every build. The profile blends the protoJson and protobufList marshallers under a ~2,000-socket synthetic load, covering both the local JSON path and the production Kafka path.

Add pkg/recordfmt/bench_test.go so the marshalling win is measurable. benchstat (off vs on, -count=8) shows the recordfmt geomean improving 7.3%, with the protobufList Kafka path at -12.9%/-13.8% and the JSON paths at -10% to -11%. The pkg/xtcpnl typed parsers are already sub-10ns so PGO is noise there.

Document PGO + profiling in docs/performance.md, including how to refresh the profile from production traffic and the top profiled hot spot: proto.Size in the envelope size-cap re-walks the whole growing envelope every 64 appends (~40% of non-idle CPU; a running byte accumulator would make it O(rows)).

Remove the stale 2024 cpu.pprof*/mem.pprof* leftovers (not PGO files).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The original profile was captured from the pre-optimization daemon, where reflective proto.Size/marshal dominated CPU. After the envelope size-cap accumulator (O(1)) and vtprotobuf marshalling land, that cost is gone and the daemon is netlink-I/O-bound. Recapture default.pgo from the optimized binary (protoJson + protobufList blended, same synthetic load) so PGO matches the code.

Update docs/performance.md to match reality: replace the now-obsolete PGO "measured gains" table with the honest picture (profile is ~46% netlink Syscall6; the largest remaining Go hot path is protojson on the JSON formats; PGO's residual benefit is small and the protobufList/Kafka path is already reflection-free). Reframe the "known hot spot: proto.Size ~40%" note as resolved by the size-cap accumulator + vtprotobuf.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

randomizedcoder · 2026-06-22T20:23:56Z

Refreshed cmd/xtcp2/default.pgo and corrected the docs to the post-optimization reality.

The profile is recaptured from a daemon that includes the size-cap accumulator (#46, "pkg/xtcp: O(1) envelope size-cap via running byte accumulator") and vtprotobuf marshalling (#47, "recordfmt/xtcp: reflection-free protobufList via vtprotobuf"). With those in, the reflective proto.Size/marshal cost that PGO originally targeted is gone: the daemon is netlink-I/O-bound (~46% Syscall6 in the capture; the largest remaining Go hot path is protojson on the JSON formats at ~22%).
docs/performance.md: dropped the now-obsolete PGO "measured gains" table (those gains came from optimizing code that #46 and #47 deleted) and reframed the "proto.Size ~40% hot spot" note as resolved. PGO is kept as free, auto-applied hygiene, not a primary speedup.

Merge order: this PR should merge after #46 and #47, so the committed profile matches the code in main. (It's PGO-safe either way — Go ignores profile entries for absent functions — but merging last keeps main self-consistent.)

This was referenced Jun 22, 2026

docs: performance optimization roadmap #45

Merged

pkg/xtcp: O(1) envelope size-cap via running byte accumulator #46

Merged

recordfmt/xtcp: reflection-free protobufList via vtprotobuf #47

Merged

randomizedcoder merged commit bcf2a5d into main Jun 22, 2026

randomizedcoder mentioned this pull request Jun 22, 2026

recordfmt/xtcp: reflection-free protobufList via vtprotobuf (re-land #47) #48

Merged

randomizedcoder deleted the pgo-profile branch June 22, 2026 21:37

randomizedcoder merged 2 commits into main from pgo-profile
