perf(codegen): cache per-arm coerced-arg fragments in compile_poly_method_call#99
perf(codegen): cache per-arm coerced-arg fragments in compile_poly_method_call#99ryanseys wants to merge 1 commit into
Conversation
2d4e7fa to
b4c39f4
Compare
…thod_call
The inner per-arm loop in compile_poly_method_call rebuilt the
', arg0, arg1, ...' coerced-argument fragment once per (call_site,
arm). For workloads where many classes implement the same method
signature (typical when an include'd module provides a shared
method set), the fragment is identical across arms and across
call sites that share the same arg expressions and types.
Cache the fragment in a Compiler-wide Hash{String=>String} keyed by
(mname, arg_compiled, arg_types, per-arg poly-rbval flags, arm
ptypes). Hits skip the per-pk box/unbox decision loop entirely.
Also hoist the per-arg expr_emits_poly_rb_val probe out of the
per-arm loop into a pre-computed StrArray; the flag depends only
on the source AST node, not on the dispatch arm, so the original
code was recomputing it N times per call site.
Note: this cache trims allocation in the per-arm path but does
NOT close the large-workload codegen OOM on its own. The
surrounding per-arm 'if (recv.cls_id == X) tmp = ...;' emit
string and the call_expr build are still per-arm. Identifying
the remaining dominant allocator is follow-up work.
Verification: make bootstrap (round 2 vs round 3 byte-identical
on analyze.rb and codegen.rb), make test (588/588), make
optcarrot (checksum 59662 unchanged).
b4c39f4 to
d2a3aae
Compare
|
Closing as not merge-worthy. The implementation is correct (bootstrap clean, byte-identical output) but the perf hypothesis is wrong: -38% wall on self-compile (10.23s -> 14.16s median, 3 runs each) and peak RSS on the large-workload codegen is unmoved (11.6 GB, still OOM-kills). The cost of building the per-call-site key prefix unconditionally overwhelms the cache savings for workloads where most poly call sites have low arm counts. Two prior PRs (PR 4 @out_lines streaming, this one) each ruled out a hypothesized OOM driver. Pivoting to instrumentation (--emit-stats + Vernier --profile) before any further OOM-targeted PR. The diff stays on the branch as research record; the type-inference workaround ( |
Summary
@arm_argstr_cache(Hash{String=>String}) that memoizes the', arg0, arg1, ...'coerced-argument fragment built per arm incompile_poly_method_call. Keyed by mname + arg expressions + arg types + per-arg poly-rbval flags + arm ptypes.expr_emits_poly_rb_valprobe out of the per-arm loop into a pre-computedarg_unbox_pkStrArray. The probe depends only on the source AST node, not on the dispatch arm.Correctness
make bootstrap(round 2 vs 3 byte-identical)make testmake optcarrotmake benchbm_ruby_xoris a pre-existing failure on master, unrelated)Performance — gen1 self-compile (
ruby spinel_codegen.rb build/codegen.ast build/codegen.ir /tmp/out.c)3 runs each,
/usr/bin/time -p:This PR is ~38% slower on self-compile. The cache machinery is net-negative for this workload.
Performance — large-workload codegen (the OOM target)
Single run,
/usr/bin/time -lagainst pre-baked AST+IR cached from prior investigation:This PR does not close the OOM. The plan's hypothesis — that the per-arm
arm_arg_strsconcat incompile_poly_method_callwas the dominant allocator — is wrong. Caching that fragment doesn't move peak RSS. The surrounding per-arm emit string and thecall_exprbuild are still rebuilt every arm; those are the more likely drivers.Why the regression on self-compile
The cache helps high-arm-count call sites (cohorts of N classes that all
includea shared module — many arms with identical signatures within one call site, so 1 build + N-1 hits). spinel's own self-compile is dominated by low-arm-count poly call sites, where the per-call-site key-prefix build (mname + ";|" + arg_compiled.join(",") + ";|" + arg_types.join(",") + ";|" + arg_unbox_pk.join(",")) and thearg_unbox_pkStrArray allocation run unconditionally for every poly call site and dominate any savings.Recommendation
Do not merge as-is. Two paths forward:
--emit-stats+ Vernier--profileso we can identify the actual dominant allocator on the large workload with hard data, then design a targeted PR.Leaving this PR open as research output that documents the experiment and saves the type-inference workaround below.
Type-inference workaround documented in the diff
Spinel's analyze pass seeds the empty-hash ivar type from the first
[]=site's inferred key type. The key here starts with the method parametermname, whose type hasn't propagated on the first analyze iteration — spinel was committing@arm_argstr_cachetoint_str_hash(sp_IntStrHash) instead ofstr_str_hash(sp_StrStrHash) and the build failed withincompatible pointer to integer conversion. Prepending an empty string literal ("" + mname + ...) at both key-construction sites pins the leftmost receiver tostring, which propagates through the chain. This is the kind of non-obvious self-host invariant worth documenting even if this PR doesn't ship.🤖 Generated with Claude Code