Skip to content

v2: speed up the transform stage ~15x (42s -> ~2.85s on self-compile)#27333

Merged
medvednikov merged 8 commits into
masterfrom
v2-transform-perf
Jun 3, 2026
Merged

v2: speed up the transform stage ~15x (42s -> ~2.85s on self-compile)#27333
medvednikov merged 8 commits into
masterfrom
v2-transform-perf

Conversation

@medvednikov
Copy link
Copy Markdown
Member

What

Speeds up the v2 transform stage (cmd/v2/v2 v2.v self-compile) from ~42s (serial / loaded) — ~26s parallel — down to ~2.85s, with no change to output.

The stage was pathologically slow, not fundamentally expensive (V1 compiles the whole program in ~1s). A sampling profiler (sample <pid>) showed ~70% of the time in string_index_kmp/_u8, almost all from O(n²) method resolution.

Changes

  • Fix the O(n²) method lookup (the big one). lookup_method_cached linearly scanned every method of a type and recomputed its generic base name — via allocating string.contains/.index (each builds a KMP failure table) — per method, per call site. Now an O(1) lookup against a precomputed type#base -> FnType index, and generic_base_name_without_specialization uses hand-rolled byte scans (no per-call allocation). This path runs in both the serial monomorphize scan and the parallel per-file transform, so both collapsed.
  • Memoize lookup_struct_field_type (the largest remaining source of types.Type-by-value copies), keyed by (cur_module, struct, field).
  • Index the method-key fuzzy fallback loops by short name (a match requires equal short names), and drop per-call allocations in method_key_matches_type_name.
  • Monomorphize fixpoint: skip the redundant confirmation scan and rescan only freshly-materialized clones instead of re-walking all files (prepare ~19s -> ~7s before the O(n²) fix landed).
  • Parallel fan-out load balancing: distribute files largest-first (LPT) instead of contiguous chunks, so the few huge files (transformer.v, the cleanc gen files, ...) no longer pin one worker.

Also adds gated profiling hooks used to find all of the above (V2_TTIME per-subphase timing, V2_TDUMP monomorphization-spec snapshot, plus the existing V2_MEM); all are no-ops unless the env var is set.

Result

before after
Transform total ~26s (parallel) / ~42s (serial) ~2.85s
monomorphize prepare ~19s ~1.3s
per-file fan-out ~10.7s ~1.2s

Correctness

Validated at every commit:

  • Monomorphization output byte-identical (V2_TDUMP dump of sorted monomorphized_specs / generic_struct_specs / generic_types).
  • vlib/v2/transformer/ test suite unchanged: 17 pass / 1 pre-existing failure (same on master — test_transform_eventbus_receiver_generic_method_call_is_monomorphized).
  • Self-compiles to a working compiler; compiled-program output identical.

Each fix also had to remain self-hostable through v2's own cleanc/arm64 backends (flat not nested maps; keys()+index not for k,v over map[string][]&Fn; .move() on map assigns; no capturing-closure comparators; interior-mutability via the same voidptr(&Transformer) cast the worker threads use).

Remaining

At ~2.85s the profile is flat — no hotspot left. The rest is aggregate types.Type-by-value copying + map hashing + fan-out malloc-lock contention under -gc none. Going under ~2s would need structural work (parallelize the serial collect scan, or reduce the pervasive type-by-value copies).

The generic-monomorphization fixpoint in prepare_files_for_transform ran the
whole-program collect_generic_call_specs walk up to 3 times per self-compile
(~6s / ~700MB each under -gc none). For v2.v the first walk already discovers
every spec; the rest re-walk an unchanged program and find nothing new.

- Opt A: track prepared_dirty (via inject_changed_files) and skip the leading
  full scan when the program did not change since the last scan.
- Opt B: after monomorphize_pass, rescan only the freshly-materialized clones
  (collect_generic_call_specs_in_new_clones over last_mono_clones) instead of
  re-walking all files.

Monomorphization output is byte-identical (validated via V2_TDUMP spec dump and
the vlib/v2/transformer test suite: same pass/fail set as before). Prepare
17s -> 7.1s; Transform total 26s -> 18.3s on the C-backend self-compile.

Also adds V2_TTIME per-subphase timing and V2_TDUMP spec-snapshot diagnostics
(both gated, no cost when unset).
The parallel per-file transform split files into contiguous chunks, so the few
huge files (transformer.v, monomorphize.v, the cleanc gen files, ...) clustered
into 2-3 adjacent chunks. Those workers ran ~10s while the other 7 finished in
<0.5s and idled: fan-out wall time was pinned to the heaviest chunk (~10.7s on
a 10-core self-compile) instead of total_work / n_jobs.

Distribute files largest-first to the least-loaded worker (LPT heuristic, cost
proxy = top-level stmt count), then scatter each worker's results back to their
original file index after the join. Assignment is deterministic and workers
still each fill their own result slot (no concurrent writes); only the file->
worker grouping changes.

lpt_buckets uses a plain insertion sort, not sort_with_compare: this file must
self-host through the cleanc and arm64 backends, which do not reliably codegen
capturing closures / pointer comparators (cleanc miscompiles `*a - *b`).

Fan-out ~10.7s -> ~6.5s; monomorphization output byte-identical (V2_TDUMP), and
the C-backend self-compile still produces a working compiler. Only affects the
parallel path; --no-parallel (arm64 self-host) is unchanged.
The LPT cost proxy used top-level statement count, which scores a file of a few
huge functions the same as one of many tiny ones. Weight each FnDecl by its body
statement count so the heavyweight files (transformer.v, the cleanc gen files)
rank higher and land on separate workers. Deterministic, one level deep.

Tightens per-worker spread (most workers now ~3.3-4.9s vs 3.3-7.2s). Self-compile
still produces a working compiler and monomorphization output is byte-identical
(V2_TDUMP M/S/G).
lookup_method_cached linearly scanned every method of a type and, for each,
recomputed its generic base name via generic_base_name_without_specialization
on every call site -> O(call_sites x methods) with a string-search inner loop.
generic_base_name_without_specialization itself used string.contains('_T_') /
.index_u8('['), and V's string.index builds a fresh KMP failure-table heap
allocation per call. Profiling the v2 self-compile showed ~70% of the collect
scan in string_index_kmp/_u8, almost all from this path, allocating ~700MB.

Two fixes:
- generic_base_name_without_specialization: hand-rolled byte scans instead of
  .contains/.index/.all_before (no per-call KMP allocation). Behaviour-identical.
- Precompute cached_method_base_index (flat "type#base" -> Type map, first FnType
  per base) once in cache_env_maps; lookup_method_cached is now an O(1) map get.
  type_has_cached_method and the per-file transform (which also resolve methods
  per call site) get the same speedup.

This path runs in BOTH the serial monomorphize scan and the parallel per-file
transform, so both collapse:
  collect scan ~7.8s -> ~2.3s, per-file fan-out ~6.5s -> ~2.5s,
  Transform ~14.5s -> ~5.1s (and the --no-parallel / loaded ~42s case with it).

Monomorphization output byte-identical (V2_TDUMP M/S/G), transformer test suite
unchanged (17 pass / 1 pre-existing eventbus fail), self-compiles to a working
compiler, compiled-program output identical.

Self-host notes: the index is a FLAT map (nested map copies the inner map per
lookup, which V flags as "cannot copy map" and is slow); built via keys()+index
(a `for k,v` over map[string][]&Fn miscodegens as []Fn); stored as types.Type
not the FnType variant (v2 codegen mishandles a map valued by a bare variant).
method_key_matches_type_name runs inside the O(method_keys) fuzzy-match fallback
loops (per call site) and allocated on every call: two string.replace('.','__')
(replace always allocates) plus up to four .contains('__') (each builds a KMP
failure table). Most keys have no '.', and `__` can be located with a plain
byte scan.

Only run replace when a '.' is actually present (index_u8, no allocation) and
add last_double_underscore() to replace .contains('__')/.all_after_last('__').
Behaviour-identical. Transform ~5.1s -> ~4.4s; monomorphization output
byte-identical (V2_TDUMP), self-compiles.
Three fallback loops (lookup_method_return_type, lookup_method_exists,
resolve_method_call_name) scanned every cached method key and ran
method_key_matches_type_name against each, per call site -> O(call_sites x
all_keys). But that match can only succeed when the key and receiver share the
same short (final `__`-segment) name. Bucket the keys by short name once
(cached_method_keys_by_short) and scan only the matching bucket via
candidate_method_keys(); the full method_key_matches check still runs on the
candidates so behaviour is unchanged.

Transform ~4.4s -> ~3.7s (prepare ~1.8s, fan-out ~1.6s); monomorphization output
byte-identical (V2_TDUMP), transformer tests unchanged (17 pass / 1 pre-existing
eventbus fail), self-compiles.
lookup_struct_field_type is called per field-access expression and re-derives
the field type every time: scope-map lookups, a types.Type-by-value Struct copy,
a field scan, and a scan-all-scopes fallback on miss. It was the largest single
source of memmove (Type copies) left in the transform.

Memoize it. The result is a pure function of (cur_module, struct_name,
field_name) — cur_module only affects the unqualified-name path, but keying on it
unconditionally is always correct. A Void value marks a resolved-to-none entry
(struct fields are never void). The cache is written through an unsafe const->mut
cast (benign interior mutability — same voidptr(&Transformer) pattern the worker
threads already use); each parallel worker gets its own empty cache, so there is
no cross-thread sharing.

Transform ~3.7s -> ~2.85s (prepare ~1.3s, fan-out ~1.3s). Monomorphization output
byte-identical (V2_TDUMP), transformer tests unchanged (17 pass / 1 pre-existing
eventbus fail), self-compiles, compiled-program output identical.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6bb7ee8a5b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


// transform_files transforms all files and returns transformed copies
pub fn (mut t Transformer) transform_files(files []ast.File) []ast.File {
t_print_mem('enter')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Provide a non-Darwin t_print_mem implementation

On Linux/Windows builds this new call is unresolved because t_print_mem is only defined in vlib/v2/transformer/mem_darwin.c.v, whose _darwin.c.v suffix is compiled only on macOS. Since transform_files is compiled for every platform, any non-Darwin V2 build that includes the transformer now fails before it can run; add a default no-op implementation or guard these calls with $if macos.

Useful? React with 👍 / 👎.

t_print_mem was defined only in mem_darwin.c.v (a _darwin.c.v file, compiled on
macOS only) but transform_files calls it on every platform, so non-macOS builds
of the transformer failed to compile (unresolved symbol). Reported by the Codex
PR review.

Move t_print_mem to a platform-neutral mem.v and guard the macOS malloc-statistics
probe behind `$if macos` (other platforms print just the stage marker). The C
declarations + the probe helper stay in mem_darwin.c.v, now only reached from the
macOS branch.
@medvednikov medvednikov merged commit de365a1 into master Jun 3, 2026
79 of 83 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant