v2: self-host & codegen performance + fix v -profile #27387
Merged
Three cold self-host speedups for `v2 -backend cleanc cmd/v2/v2.v`:
- Durable object cache + pre-parse fast-relink gate. The bundle objects + main.o are mirrored to ~/.cache/v2cleanc_persist so a cold build (the /tmp obj cache wiped) restores them. A new gate hoisted to the top of build() relinks directly when all sources are fresh, skipping the ~825ms front-end (parse/typecheck/transform) that otherwise always runs. Cold self-host with unchanged sources drops from ~2.4s to ~60ms; warm from ~0.9s to ~50ms. Conservative: any staleness fails a freshness check and falls through to a normal build, so it can never emit a stale binary (verified byte-equivalent to a fresh full build).
- Sub-file Pass 5 splitting. Parallel cleanc Pass 5 previously split work per file, so the single biggest file (ssa/builder.v, ~590ms) pinned the whole phase. Large files are now split into contiguous FnDecl-index slices across workers; the v2compiler.o global symbol set is byte-identical.
- Skip dead v2compiler .vh header generation. Those headers are only read by the disabled v2compiler header parse-reuse path, so generating them on every cold self-build was ~230ms of pure overhead (now gated behind V2_V2COMPILER_VH).
Validated: builder/markused/transformer suites pass; generated C is unchanged for the .vh and persist changes.
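The sub-file Pass 5 splitting can be pictured with a small model. This is a minimal sketch in Python, not the compiler's V code: `split_into_slices` is a hypothetical helper, and the even split is an assumption, since the commit does not state the size threshold or slicing policy it uses.

```python
def split_into_slices(n_fns: int, n_workers: int) -> list[tuple[int, int]]:
    """Split FnDecl indices 0..n_fns-1 into contiguous half-open (start, end) ranges.

    Assumption: slices are made as even as possible; the real policy may differ.
    """
    if n_fns <= 0 or n_workers <= 0:
        return []
    n_slices = min(n_fns, n_workers)  # never create empty slices
    base, extra = divmod(n_fns, n_slices)
    slices = []
    start = 0
    for i in range(n_slices):
        size = base + (1 if i < extra else 0)  # the first `extra` slices get one more fn
        slices.append((start, start + size))
        start += size
    return slices
```

Because the ranges are contiguous and cover every index exactly once, concatenating the per-slice output in slice order reproduces the order a single worker would have emitted.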
The profiler emitted three globals per function — `vpc_<cfn>` (double, time), `vpc_<cfn>_only_current` (double), `vpc_<cfn>_calls` (u64, calls). When two functions' mangled C names differ only by a `_calls`/`_only_current` suffix (e.g. `..._lower` and `..._lower_calls`), one's call counter aliases the other's time accumulator, producing `redefinition of 'vpc_...' with a different type: double vs u64` and failing the C compile. Prefix the counter base name with a unique per-function index so no derived name can ever alias another function's base. `v -profile` now builds and runs on large codebases (e.g. cmd/v2/v2.v).
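The aliasing can be reproduced with a small model of the naming scheme. This is a sketch in Python rather than the profiler's V code; the function names `ssa__Builder_lower` and `ssa__Builder_lower_calls` are illustrative, and the exact position of the index in the new names (`vpc_<index>_<cfn>`) is an assumption.

```python
# Hypothetical model of the profiler's global naming; the real generator is V code.
SUFFIXES = {"": "double", "_only_current": "double", "_calls": "u64"}

def old_globals(cfn: str) -> dict[str, str]:
    """Old scheme: derive all three globals directly from the mangled C name."""
    return {f"vpc_{cfn}{suffix}": ctype for suffix, ctype in SUFFIXES.items()}

def new_globals(cfn: str, index: int) -> dict[str, str]:
    """New scheme (assumed layout): a unique per-function index precedes the name."""
    return {f"vpc_{index}_{cfn}{suffix}": ctype for suffix, ctype in SUFFIXES.items()}

def conflicting_names(per_fn_globals: list[dict[str, str]]) -> set[str]:
    """Return global names declared more than once with different C types."""
    seen: dict[str, str] = {}
    conflicts: set[str] = set()
    for decls in per_fn_globals:
        for name, ctype in decls.items():
            if name in seen and seen[name] != ctype:
                conflicts.add(name)
            seen.setdefault(name, ctype)
    return conflicts

# Illustrative function names that differ only by a `_calls` suffix.
fns = ["ssa__Builder_lower", "ssa__Builder_lower_calls"]
old_conflicts = conflicting_names([old_globals(f) for f in fns])
new_conflicts = conflicting_names([new_globals(f, i) for i, f in enumerate(fns)])
```

In this model `old_conflicts` holds the one name that is declared both as `u64` and as `double`, while `new_conflicts` is empty because no two functions share a prefix.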
resolve_specialized_receiver_method scanned the entire fn_return_types + fn_param_is_ptr tables for each (receiver_type, method) key — O(unique_keys × total_fns). The `v -profile` call counts showed it at 42k calls / ~1.2s self on a full self-compile. Replace the scan with a one-time (base|method) index; the per-key result is unchanged (byte-identical generated C). C Gen ~8% faster.
imported_symbol_c_type was the hottest codegen function (~897k calls, ~1.6s self per `v -profile`): each call scanned all ~280 g.files (string-comparing file.name) to find the current file, then its imports' symbols. Build a flat "file\x01symbol -> module" index once in collect_source_module_names (g.files is stable) and make it an O(1) lookup. Generated C is byte-identical; C Gen ~8% faster (on top of the specialized-method index).
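The flat-key index can be sketched as below. This is a simplified Python model, not the codegen's V code: `SourceFile`, `lookup_by_scan`, and `build_flat_index` are hypothetical names, each file is assumed to carry a ready-made symbol-to-module map, and file names are assumed to be unique.

```python
from dataclasses import dataclass, field

@dataclass
class SourceFile:
    name: str
    # Assumed shape: imported symbol -> module that provides it.
    imported_symbols: dict[str, str] = field(default_factory=dict)

def lookup_by_scan(files: list[SourceFile], cur_file: str, symbol: str):
    """Old shape: walk every file, comparing names, on each call."""
    for f in files:
        if f.name == cur_file:
            return f.imported_symbols.get(symbol)
    return None

def build_flat_index(files: list[SourceFile]) -> dict[str, str]:
    """New shape: build 'file\\x01symbol -> module' once; the file list must not change."""
    index: dict[str, str] = {}
    for f in files:
        for symbol, module in f.imported_symbols.items():
            index[f"{f.name}\x01{symbol}"] = module
    return index

def lookup_by_index(index: dict[str, str], cur_file: str, symbol: str):
    """Single dictionary lookup per call."""
    return index.get(f"{cur_file}\x01{symbol}")
```

The `\x01` separator keeps the composite key unambiguous as long as it cannot occur in a file name or symbol, which is why a single string key can replace the two-level lookup.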
unique_v_method_return_type scanned all of v_fn_return_types per call (715 calls / ~680ms self in `v -profile`). Index it by method short-name once after collect_fn_signatures_to_fixed_point (v_fn_return_types is final there). Byte-identical C; ~3% faster C Gen.
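A rough model of the short-name index follows, in Python rather than V. Both the `Type.method` key format and the uniqueness rule (a result only when every function with that short name has the same return type) are assumptions made for the sketch; the commit only says the table is indexed by method short-name once it is final.

```python
def short_name(fn_name: str) -> str:
    """Assumed key format 'Type.method'; the short name is the part after the last dot."""
    return fn_name.rsplit(".", 1)[-1]

def unique_return_type_scan(table: dict[str, str], method: str):
    """Old shape: scan the whole table on every call."""
    found = None
    for name, ret in table.items():
        if short_name(name) != method:
            continue
        if found is not None and found != ret:
            return None  # more than one distinct return type: not unique
        found = ret
    return found

def build_short_name_index(table: dict[str, str]) -> dict[str, object]:
    """New shape: group return types by short name once, after the table is final."""
    grouped: dict[str, set[str]] = {}
    for name, ret in table.items():
        grouped.setdefault(short_name(name), set()).add(ret)
    return {m: (next(iter(rets)) if len(rets) == 1 else None) for m, rets in grouped.items()}
```

The index is only valid if it is built after the table stops changing, which matches the commit's note that it is built after `collect_fn_signatures_to_fixed_point`.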
qualify_ierror_concrete_base did `fn_return_types.keys()` + `.sort()` on every call (543 calls / ~540ms self in `v -profile`) — allocating and sorting all fn names each time. Index the `*__base__msg` functions by base once (smallest per base = the original sorted-first match); keep the emitted_types/pending scans as fallbacks for those growing maps. Byte-identical C.
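The "smallest per base" equivalence can be checked with a small model. This Python sketch is not the V implementation: the way `base_of` extracts the base from a `*__<base>__msg` name is an assumption, and the function names in the usage are made up.

```python
def base_of(fn_name: str):
    """Assumed naming: '<prefix>__<Base>__msg'; return '<Base>', or None if no match."""
    suffix = "__msg"
    if not fn_name.endswith(suffix):
        return None
    stem = fn_name[: -len(suffix)]
    return stem.rsplit("__", 1)[-1]

def first_match_by_sort(fn_names: list[str], base: str):
    """Old shape: sort all names on every call and take the first match."""
    for name in sorted(fn_names):
        if base_of(name) == base:
            return name
    return None

def build_base_index(fn_names: list[str]) -> dict[str, str]:
    """New shape: keep the lexicographically smallest name per base, built once."""
    index: dict[str, str] = {}
    for name in fn_names:
        base = base_of(name)
        if base is None:
            continue
        if base not in index or name < index[base]:
            index[base] = name
    return index
```

Keeping the minimum per base gives the same answer as "first match in sorted order" without allocating or sorting the key list on each call.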
…r/flag changes
The pre-parse self-host fast relink keyed only on the input file (is_cmd_v2_self_build), so a warm-cache `-o foo.c cmd/v2/v2.v` linked an executable into foo.c instead of writing C source. Mirror gen_cleanc()'s generation-only decision (.c output / cannot-compile-locally / shared lib) and fall through to normal generation in those cases.
It also trusted the cc/cc_flags/cc_link_flags recorded in main.stamp without checking them against the current invocation, so changing the compiler or env CFLAGS while sources were unchanged relinked with stale flags. Record a pre-parse flag fingerprint (compiler choice, prod/shared mode, V2CFLAGS) in the stamp and re-check it on the fast path; a mismatch falls through to a normal build. Source-derived #flag directives are excluded: they change only when a source changes, which the existing freshness checks already catch.
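The fast-path decision described above can be sketched as follows. This is a Python model, not the V code: the SHA-256 hash, the field encoding, and the `can_fast_relink` helper are assumptions. Only the inputs to the fingerprint (compiler choice, prod/shared mode, V2CFLAGS) come from the commit message.

```python
import hashlib

def preparse_flag_fingerprint(cc: str, prod: bool, shared: bool, v2cflags: str) -> str:
    """Hash the invocation settings that are known before parsing (assumed encoding)."""
    parts = [cc, f"prod={int(prod)}", f"shared={int(shared)}", v2cflags]
    # NUL-separated so that field boundaries cannot be confused.
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()

def can_fast_relink(stamp_fingerprint: str, current_fingerprint: str,
                    sources_fresh: bool, generation_only: bool) -> bool:
    """Hypothetical gate: relink only if nothing that affects the link has changed."""
    if generation_only:   # e.g. `.c` output or shared lib: must run normal generation
        return False
    if not sources_fresh:  # any stale source falls through to a normal build
        return False
    return stamp_fingerprint == current_fingerprint
```

Every failed check returns `False`, so the only way to reach the relink is for all conditions to hold; that mirrors the conservative fall-through behaviour the commit describes.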
…mbiguity
Pass 5 splits large files into per-slice work items across workers, but the cross-worker dedup guard (blocked_fn_keys) was keyed at file level via fn_owner_file. When a file was split across workers every slice's worker considered the whole file owned, so neither blocked the file's lazily or transitively emitted fns, a latent duplicate/reorder hole. Now only the worker that emits a file's globals takes file-level ownership; a split file's other slices block the file's fns and emit their explicit slice through an owner-scoped bypass (explicit_slice_emit_allows) in gen_file_range. Generated self-host C is byte-identical.
Also fix resolve_specialized_receiver_method: the one-time lazy specialized_index snapshotted the signature tables at first lookup and went stale, so a second specialization registered afterwards (which makes a (base, method) pair ambiguous) was missed. Consult the always-fresh incremental specialized_receiver_methods / _ambiguous index that remember_specialized_fn_base maintains, and drop the redundant snapshot index. Restores pass5_worker_test; self-host C unchanged.
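The stale-snapshot problem in resolve_specialized_receiver_method can be shown with a small model. This Python sketch is an assumption-laden stand-in for the V code: the `SpecializedIndex` class and its method names are hypothetical, and the function names in the usage are made up. It models only the idea of an index updated at registration time, with a separate set of ambiguous keys.

```python
class SpecializedIndex:
    """Hypothetical incremental index: (base, method) -> specialized fn name."""

    def __init__(self) -> None:
        self.methods: dict[tuple[str, str], str] = {}
        self.ambiguous: set[tuple[str, str]] = set()

    def remember(self, base: str, method: str, fn_name: str) -> None:
        """Called on every registration, so the index is never stale."""
        key = (base, method)
        if key in self.ambiguous:
            return
        prev = self.methods.get(key)
        if prev is None:
            self.methods[key] = fn_name
        elif prev != fn_name:
            # A second, different specialization makes the pair ambiguous.
            del self.methods[key]
            self.ambiguous.add(key)

    def resolve(self, base: str, method: str):
        """Return the unique specialization, or None if unknown or ambiguous."""
        return self.methods.get((base, method))

index = SpecializedIndex()
index.remember("Stack", "push", "Stack_T_int__push")
snapshot = dict(index.methods)  # models the old one-time lazy index
index.remember("Stack", "push", "Stack_T_string__push")

stale_answer = snapshot.get(("Stack", "push"))  # still the first specialization
fresh_answer = index.resolve("Stack", "push")   # None: the pair is now ambiguous
```

The snapshot keeps answering with the first specialization after the second one is registered, which is the missed-ambiguity case the fix addresses.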
…ice ownership
Add regression coverage for the fast-relink and pass5-split review fixes:
- fast_relink_test.v: a `.c` output and a shared lib are generation-only and must never be relinked, and preparse_flag_fingerprint changes when the compiler / prod / shared mode / V2CFLAGS change (and is stable otherwise).
- pass5_worker_test.v: the owner-scoped blocked_fn_keys bypass is scoped to the slice's own file and never unblocks fns owned by another file.
Extract fast_relink_output_is_generation_only() so the decision is unit-testable without a warm object cache. Output-preserving: the inline and extracted forms emit byte-identical self-host C.
Performance work on the v2 self-host pipeline, plus a fix to mainline `v -profile`.
`v -profile` fix
The profiler emitted `vpc_<fn>` (time), `vpc_<fn>_only_current`, and `vpc_<fn>_calls` globals. When two functions' mangled C names differed only by a `_calls`/`_only_current` suffix, one's `u64` counter aliased the other's `double` accumulator, producing `redefinition of 'vpc_...' with a different type` and a failed C compile. Counter names are now prefixed with a unique per-function index, so `v -profile` builds and runs on large codebases (e.g. `cmd/v2/v2.v`).
v2 codegen: eliminate O(n²) resolution scans
Guided by the now-working profiler's call counts, four per-call linear scans over the signature tables were replaced with one-time indexes. Generated C is byte-identical; C Gen ~26% faster on a full self-compile:
- `resolve_specialized_receiver_method`: was 42k calls / ~1.2s self
- `imported_symbol_c_type`: was 897k calls / ~1.6s self (scanned all ~280 files per call)
- `unique_v_method_return_type`: per-call `v_fn_return_types` scan
- `qualify_ierror_concrete_base`: per-call `keys().sort()`
v2 self-host caching
Durable object cache + a pre-parse fast-relink gate make a cold cleanc self-host with unchanged sources ~60ms, plus other rebuild/codegen speedups.
Validation
Generated C verified byte-identical to the prior compiler (bundled `v2compiler.c` + `main.c` md5-identical); transformer test suite passes. The pre-existing arm64 v3 self-host SIGSEGV is unaffected: the pre-change v3 crashes identically.