v2: speed up the transform stage ~15x (42s -> ~2.85s on self-compile)#27333
Conversation
The generic-monomorphization fixpoint in prepare_files_for_transform ran the whole-program collect_generic_call_specs walk up to 3 times per self-compile (~6s / ~700MB each under -gc none). For v2.v the first walk already discovers every spec; the rest re-walk an unchanged program and find nothing new. - Opt A: track prepared_dirty (via inject_changed_files) and skip the leading full scan when the program did not change since the last scan. - Opt B: after monomorphize_pass, rescan only the freshly-materialized clones (collect_generic_call_specs_in_new_clones over last_mono_clones) instead of re-walking all files. Monomorphization output is byte-identical (validated via V2_TDUMP spec dump and the vlib/v2/transformer test suite: same pass/fail set as before). Prepare 17s -> 7.1s; Transform total 26s -> 18.3s on the C-backend self-compile. Also adds V2_TTIME per-subphase timing and V2_TDUMP spec-snapshot diagnostics (both gated, no cost when unset).
The parallel per-file transform split files into contiguous chunks, so the few huge files (transformer.v, monomorphize.v, the cleanc gen files, ...) clustered into 2-3 adjacent chunks. Those workers ran ~10s while the other 7 finished in <0.5s and idled: fan-out wall time was pinned to the heaviest chunk (~10.7s on a 10-core self-compile) instead of total_work / n_jobs. Distribute files largest-first to the least-loaded worker (LPT heuristic, cost proxy = top-level stmt count), then scatter each worker's results back to their original file index after the join. Assignment is deterministic and workers still each fill their own result slot (no concurrent writes); only the file-> worker grouping changes. lpt_buckets uses a plain insertion sort, not sort_with_compare: this file must self-host through the cleanc and arm64 backends, which do not reliably codegen capturing closures / pointer comparators (cleanc miscompiles `*a - *b`). Fan-out ~10.7s -> ~6.5s; monomorphization output byte-identical (V2_TDUMP), and the C-backend self-compile still produces a working compiler. Only affects the parallel path; --no-parallel (arm64 self-host) is unchanged.
The LPT cost proxy used top-level statement count, which scores a file of a few huge functions the same as one of many tiny ones. Weight each FnDecl by its body statement count so the heavyweight files (transformer.v, the cleanc gen files) rank higher and land on separate workers. Deterministic, one level deep. Tightens per-worker spread (most workers now ~3.3-4.9s vs 3.3-7.2s). Self-compile still produces a working compiler and monomorphization output is byte-identical (V2_TDUMP M/S/G).
lookup_method_cached linearly scanned every method of a type and, for each,
recomputed its generic base name via generic_base_name_without_specialization
on every call site -> O(call_sites x methods) with a string-search inner loop.
generic_base_name_without_specialization itself used string.contains('_T_') /
.index_u8('['), and V's string.index builds a fresh KMP failure-table heap
allocation per call. Profiling the v2 self-compile showed ~70% of the collect
scan in string_index_kmp/_u8, almost all from this path, allocating ~700MB.
Two fixes:
- generic_base_name_without_specialization: hand-rolled byte scans instead of
.contains/.index/.all_before (no per-call KMP allocation). Behaviour-identical.
- Precompute cached_method_base_index (flat "type#base" -> Type map, first FnType
per base) once in cache_env_maps; lookup_method_cached is now an O(1) map get.
type_has_cached_method and the per-file transform (which also resolve methods
per call site) get the same speedup.
This path runs in BOTH the serial monomorphize scan and the parallel per-file
transform, so both collapse:
collect scan ~7.8s -> ~2.3s, per-file fan-out ~6.5s -> ~2.5s,
Transform ~14.5s -> ~5.1s (and the --no-parallel / loaded ~42s case with it).
Monomorphization output byte-identical (V2_TDUMP M/S/G), transformer test suite
unchanged (17 pass / 1 pre-existing eventbus fail), self-compiles to a working
compiler, compiled-program output identical.
Self-host notes: the index is a FLAT map (nested map copies the inner map per
lookup, which V flags as "cannot copy map" and is slow); built via keys()+index
(a `for k,v` over map[string][]&Fn miscodegens as []Fn); stored as types.Type
not the FnType variant (v2 codegen mishandles a map valued by a bare variant).
method_key_matches_type_name runs inside the O(method_keys) fuzzy-match fallback
loops (per call site) and allocated on every call: two string.replace('.','__')
(replace always allocates) plus up to four .contains('__') (each builds a KMP
failure table). Most keys have no '.', and `__` can be located with a plain
byte scan.
Only run replace when a '.' is actually present (index_u8, no allocation) and
add last_double_underscore() to replace .contains('__')/.all_after_last('__').
Behaviour-identical. Transform ~5.1s -> ~4.4s; monomorphization output
byte-identical (V2_TDUMP), self-compiles.
Three fallback loops (lookup_method_return_type, lookup_method_exists, resolve_method_call_name) scanned every cached method key and ran method_key_matches_type_name against each, per call site -> O(call_sites x all_keys). But that match can only succeed when the key and receiver share the same short (final `__`-segment) name. Bucket the keys by short name once (cached_method_keys_by_short) and scan only the matching bucket via candidate_method_keys(); the full method_key_matches check still runs on the candidates so behaviour is unchanged. Transform ~4.4s -> ~3.7s (prepare ~1.8s, fan-out ~1.6s); monomorphization output byte-identical (V2_TDUMP), transformer tests unchanged (17 pass / 1 pre-existing eventbus fail), self-compiles.
lookup_struct_field_type is called per field-access expression and re-derives the field type every time: scope-map lookups, a types.Type-by-value Struct copy, a field scan, and a scan-all-scopes fallback on miss. It was the largest single source of memmove (Type copies) left in the transform. Memoize it. The result is a pure function of (cur_module, struct_name, field_name) — cur_module only affects the unqualified-name path, but keying on it unconditionally is always correct. A Void value marks a resolved-to-none entry (struct fields are never void). The cache is written through an unsafe const->mut cast (benign interior mutability — same voidptr(&Transformer) pattern the worker threads already use); each parallel worker gets its own empty cache, so there is no cross-thread sharing. Transform ~3.7s -> ~2.85s (prepare ~1.3s, fan-out ~1.3s). Monomorphization output byte-identical (V2_TDUMP), transformer tests unchanged (17 pass / 1 pre-existing eventbus fail), self-compiles, compiled-program output identical.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6bb7ee8a5b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| // transform_files transforms all files and returns transformed copies | ||
| pub fn (mut t Transformer) transform_files(files []ast.File) []ast.File { | ||
| t_print_mem('enter') |
There was a problem hiding this comment.
Provide a non-Darwin t_print_mem implementation
On Linux/Windows builds this new call is unresolved because t_print_mem is only defined in vlib/v2/transformer/mem_darwin.c.v, whose _darwin.c.v suffix is compiled only on macOS. Since transform_files is compiled for every platform, any non-Darwin V2 build that includes the transformer now fails before it can run; add a default no-op implementation or guard these calls with $if macos.
Useful? React with 👍 / 👎.
t_print_mem was defined only in mem_darwin.c.v (a _darwin.c.v file, compiled on macOS only) but transform_files calls it on every platform, so non-macOS builds of the transformer failed to compile (unresolved symbol). Reported by the Codex PR review. Move t_print_mem to a platform-neutral mem.v and guard the macOS malloc-statistics probe behind `$if macos` (other platforms print just the stage marker). The C declarations + the probe helper stay in mem_darwin.c.v, now only reached from the macOS branch.
What
Speeds up the v2 transform stage (
cmd/v2/v2 v2.vself-compile) from ~42s (serial / loaded) — ~26s parallel — down to ~2.85s, with no change to output.The stage was pathologically slow, not fundamentally expensive (V1 compiles the whole program in ~1s). A sampling profiler (
sample <pid>) showed ~70% of the time instring_index_kmp/_u8, almost all from O(n²) method resolution.Changes
lookup_method_cachedlinearly scanned every method of a type and recomputed its generic base name — via allocatingstring.contains/.index(each builds a KMP failure table) — per method, per call site. Now an O(1) lookup against a precomputedtype#base -> FnTypeindex, andgeneric_base_name_without_specializationuses hand-rolled byte scans (no per-call allocation). This path runs in both the serial monomorphize scan and the parallel per-file transform, so both collapsed.lookup_struct_field_type(the largest remaining source oftypes.Type-by-value copies), keyed by(cur_module, struct, field).method_key_matches_type_name.Also adds gated profiling hooks used to find all of the above (
V2_TTIMEper-subphase timing,V2_TDUMPmonomorphization-spec snapshot, plus the existingV2_MEM); all are no-ops unless the env var is set.Result
Correctness
Validated at every commit:
V2_TDUMPdump of sorted monomorphized_specs / generic_struct_specs / generic_types).vlib/v2/transformer/test suite unchanged: 17 pass / 1 pre-existing failure (same on master —test_transform_eventbus_receiver_generic_method_call_is_monomorphized).Each fix also had to remain self-hostable through v2's own cleanc/arm64 backends (flat not nested maps;
keys()+index notfor k,vovermap[string][]&Fn;.move()on map assigns; no capturing-closure comparators; interior-mutability via the samevoidptr(&Transformer)cast the worker threads use).Remaining
At ~2.85s the profile is flat — no hotspot left. The rest is aggregate
types.Type-by-value copying + map hashing + fan-out malloc-lock contention under-gc none. Going under ~2s would need structural work (parallelize the serial collect scan, or reduce the pervasive type-by-value copies).