v2: speed up the transform stage ~15x (42s -> ~2.85s on self-compile) by medvednikov · Pull Request #27333 · vlang/v

medvednikov · 2026-06-03T20:11:04Z

What

Speeds up the v2 transform stage (cmd/v2/v2 v2.v self-compile) from ~42s (serial / loaded) — ~26s parallel — down to ~2.85s, with no change to output.

The stage was pathologically slow, not fundamentally expensive (V1 compiles the whole program in ~1s). A sampling profiler (sample <pid>) showed ~70% of the time in string_index_kmp/_u8, almost all from O(n²) method resolution.

Changes

Fix the O(n²) method lookup (the big one). lookup_method_cached linearly scanned every method of a type and recomputed its generic base name — via allocating string.contains/.index (each builds a KMP failure table) — per method, per call site. Now an O(1) lookup against a precomputed type#base -> FnType index, and generic_base_name_without_specialization uses hand-rolled byte scans (no per-call allocation). This path runs in both the serial monomorphize scan and the parallel per-file transform, so both collapsed.
Memoize lookup_struct_field_type (the largest remaining source of types.Type-by-value copies), keyed by (cur_module, struct, field).
Index the method-key fuzzy fallback loops by short name (a match requires equal short names), and drop per-call allocations in method_key_matches_type_name.
Monomorphize fixpoint: skip the redundant confirmation scan and rescan only freshly-materialized clones instead of re-walking all files (prepare ~19s -> ~7s before the O(n²) fix landed).
Parallel fan-out load balancing: distribute files largest-first (LPT) instead of contiguous chunks, so the few huge files (transformer.v, the cleanc gen files, ...) no longer pin one worker.

Also adds gated profiling hooks used to find all of the above (V2_TTIME per-subphase timing, V2_TDUMP monomorphization-spec snapshot, plus the existing V2_MEM); all are no-ops unless the env var is set.

Result

	before	after
Transform total	~26s (parallel) / ~42s (serial)	~2.85s
monomorphize prepare	~19s	~1.3s
per-file fan-out	~10.7s	~1.2s

Correctness

Validated at every commit:

Monomorphization output byte-identical (V2_TDUMP dump of sorted monomorphized_specs / generic_struct_specs / generic_types).
vlib/v2/transformer/ test suite unchanged: 17 pass / 1 pre-existing failure (same on master — test_transform_eventbus_receiver_generic_method_call_is_monomorphized).
Self-compiles to a working compiler; compiled-program output identical.

Each fix also had to remain self-hostable through v2's own cleanc/arm64 backends (flat not nested maps; keys()+index not for k,v over map[string][]&Fn; .move() on map assigns; no capturing-closure comparators; interior-mutability via the same voidptr(&Transformer) cast the worker threads use).

Remaining

At ~2.85s the profile is flat — no hotspot left. The rest is aggregate types.Type-by-value copying + map hashing + fan-out malloc-lock contention under -gc none. Going under ~2s would need structural work (parallelize the serial collect scan, or reduce the pervasive type-by-value copies).

The generic-monomorphization fixpoint in prepare_files_for_transform ran the whole-program collect_generic_call_specs walk up to 3 times per self-compile (~6s / ~700MB each under -gc none). For v2.v the first walk already discovers every spec; the rest re-walk an unchanged program and find nothing new. - Opt A: track prepared_dirty (via inject_changed_files) and skip the leading full scan when the program did not change since the last scan. - Opt B: after monomorphize_pass, rescan only the freshly-materialized clones (collect_generic_call_specs_in_new_clones over last_mono_clones) instead of re-walking all files. Monomorphization output is byte-identical (validated via V2_TDUMP spec dump and the vlib/v2/transformer test suite: same pass/fail set as before). Prepare 17s -> 7.1s; Transform total 26s -> 18.3s on the C-backend self-compile. Also adds V2_TTIME per-subphase timing and V2_TDUMP spec-snapshot diagnostics (both gated, no cost when unset).

The parallel per-file transform split files into contiguous chunks, so the few huge files (transformer.v, monomorphize.v, the cleanc gen files, ...) clustered into 2-3 adjacent chunks. Those workers ran ~10s while the other 7 finished in <0.5s and idled: fan-out wall time was pinned to the heaviest chunk (~10.7s on a 10-core self-compile) instead of total_work / n_jobs. Distribute files largest-first to the least-loaded worker (LPT heuristic, cost proxy = top-level stmt count), then scatter each worker's results back to their original file index after the join. Assignment is deterministic and workers still each fill their own result slot (no concurrent writes); only the file-> worker grouping changes. lpt_buckets uses a plain insertion sort, not sort_with_compare: this file must self-host through the cleanc and arm64 backends, which do not reliably codegen capturing closures / pointer comparators (cleanc miscompiles `*a - *b`). Fan-out ~10.7s -> ~6.5s; monomorphization output byte-identical (V2_TDUMP), and the C-backend self-compile still produces a working compiler. Only affects the parallel path; --no-parallel (arm64 self-host) is unchanged.

The LPT cost proxy used top-level statement count, which scores a file of a few huge functions the same as one of many tiny ones. Weight each FnDecl by its body statement count so the heavyweight files (transformer.v, the cleanc gen files) rank higher and land on separate workers. Deterministic, one level deep. Tightens per-worker spread (most workers now ~3.3-4.9s vs 3.3-7.2s). Self-compile still produces a working compiler and monomorphization output is byte-identical (V2_TDUMP M/S/G).

lookup_method_cached linearly scanned every method of a type and, for each, recomputed its generic base name via generic_base_name_without_specialization on every call site -> O(call_sites x methods) with a string-search inner loop. generic_base_name_without_specialization itself used string.contains('_T_') / .index_u8('['), and V's string.index builds a fresh KMP failure-table heap allocation per call. Profiling the v2 self-compile showed ~70% of the collect scan in string_index_kmp/_u8, almost all from this path, allocating ~700MB. Two fixes: - generic_base_name_without_specialization: hand-rolled byte scans instead of .contains/.index/.all_before (no per-call KMP allocation). Behaviour-identical. - Precompute cached_method_base_index (flat "type#base" -> Type map, first FnType per base) once in cache_env_maps; lookup_method_cached is now an O(1) map get. type_has_cached_method and the per-file transform (which also resolve methods per call site) get the same speedup. This path runs in BOTH the serial monomorphize scan and the parallel per-file transform, so both collapse: collect scan ~7.8s -> ~2.3s, per-file fan-out ~6.5s -> ~2.5s, Transform ~14.5s -> ~5.1s (and the --no-parallel / loaded ~42s case with it). Monomorphization output byte-identical (V2_TDUMP M/S/G), transformer test suite unchanged (17 pass / 1 pre-existing eventbus fail), self-compiles to a working compiler, compiled-program output identical. Self-host notes: the index is a FLAT map (nested map copies the inner map per lookup, which V flags as "cannot copy map" and is slow); built via keys()+index (a `for k,v` over map[string][]&Fn miscodegens as []Fn); stored as types.Type not the FnType variant (v2 codegen mishandles a map valued by a bare variant).

method_key_matches_type_name runs inside the O(method_keys) fuzzy-match fallback loops (per call site) and allocated on every call: two string.replace('.','__') (replace always allocates) plus up to four .contains('__') (each builds a KMP failure table). Most keys have no '.', and `__` can be located with a plain byte scan. Only run replace when a '.' is actually present (index_u8, no allocation) and add last_double_underscore() to replace .contains('__')/.all_after_last('__'). Behaviour-identical. Transform ~5.1s -> ~4.4s; monomorphization output byte-identical (V2_TDUMP), self-compiles.

Three fallback loops (lookup_method_return_type, lookup_method_exists, resolve_method_call_name) scanned every cached method key and ran method_key_matches_type_name against each, per call site -> O(call_sites x all_keys). But that match can only succeed when the key and receiver share the same short (final `__`-segment) name. Bucket the keys by short name once (cached_method_keys_by_short) and scan only the matching bucket via candidate_method_keys(); the full method_key_matches check still runs on the candidates so behaviour is unchanged. Transform ~4.4s -> ~3.7s (prepare ~1.8s, fan-out ~1.6s); monomorphization output byte-identical (V2_TDUMP), transformer tests unchanged (17 pass / 1 pre-existing eventbus fail), self-compiles.

lookup_struct_field_type is called per field-access expression and re-derives the field type every time: scope-map lookups, a types.Type-by-value Struct copy, a field scan, and a scan-all-scopes fallback on miss. It was the largest single source of memmove (Type copies) left in the transform. Memoize it. The result is a pure function of (cur_module, struct_name, field_name) — cur_module only affects the unqualified-name path, but keying on it unconditionally is always correct. A Void value marks a resolved-to-none entry (struct fields are never void). The cache is written through an unsafe const->mut cast (benign interior mutability — same voidptr(&Transformer) pattern the worker threads already use); each parallel worker gets its own empty cache, so there is no cross-thread sharing. Transform ~3.7s -> ~2.85s (prepare ~1.3s, fan-out ~1.3s). Monomorphization output byte-identical (V2_TDUMP), transformer tests unchanged (17 pass / 1 pre-existing eventbus fail), self-compiles, compiled-program output identical.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6bb7ee8a5b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-03T20:13:10Z


 // transform_files transforms all files and returns transformed copies
 pub fn (mut t Transformer) transform_files(files []ast.File) []ast.File {
+	t_print_mem('enter')


Provide a non-Darwin t_print_mem implementation

On Linux/Windows builds this new call is unresolved because t_print_mem is only defined in vlib/v2/transformer/mem_darwin.c.v, whose _darwin.c.v suffix is compiled only on macOS. Since transform_files is compiled for every platform, any non-Darwin V2 build that includes the transformer now fails before it can run; add a default no-op implementation or guard these calls with $if macos.

Useful? React with 👍 / 👎.

t_print_mem was defined only in mem_darwin.c.v (a _darwin.c.v file, compiled on macOS only) but transform_files calls it on every platform, so non-macOS builds of the transformer failed to compile (unresolved symbol). Reported by the Codex PR review. Move t_print_mem to a platform-neutral mem.v and guard the macOS malloc-statistics probe behind `$if macos` (other platforms print just the stage marker). The C declarations + the probe helper stay in mem_darwin.c.v, now only reached from the macOS branch.

medvednikov added 7 commits June 3, 2026 20:35

chatgpt-codex-connector Bot reviewed Jun 3, 2026

View reviewed changes

medvednikov merged commit de365a1 into master Jun 3, 2026
79 of 83 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v2: speed up the transform stage ~15x (42s -> ~2.85s on self-compile)#27333

v2: speed up the transform stage ~15x (42s -> ~2.85s on self-compile)#27333
medvednikov merged 8 commits into
masterfrom
v2-transform-perf

medvednikov commented Jun 3, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

medvednikov commented Jun 3, 2026

What

Changes

Result

Correctness

Remaining

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant