Skip to content

perf(codegen): cache per-arm coerced-arg fragments in compile_poly_method_call#99

Closed
ryanseys wants to merge 1 commit into
masterfrom
rs-perf-arm-argstr-cache
Closed

perf(codegen): cache per-arm coerced-arg fragments in compile_poly_method_call#99
ryanseys wants to merge 1 commit into
masterfrom
rs-perf-arm-argstr-cache

Conversation

@ryanseys
Copy link
Copy Markdown
Owner

@ryanseys ryanseys commented May 20, 2026

Summary

  • Adds a Compiler-wide @arm_argstr_cache (Hash{String=>String}) that memoizes the ', arg0, arg1, ...' coerced-argument fragment built per arm in compile_poly_method_call. Keyed by mname + arg expressions + arg types + per-arg poly-rbval flags + arm ptypes.
  • Hoists the per-arg expr_emits_poly_rb_val probe out of the per-arm loop into a pre-computed arg_unbox_pk StrArray. The probe depends only on the source AST node, not on the dispatch arm.

Correctness

Check Result
make bootstrap (round 2 vs 3 byte-identical) ✓ pass on both analyze.rb and codegen.rb
make test ✓ 588 / 588 pass
make optcarrot ✓ checksum 59662 (unchanged)
make bench ✓ 56 / 57 pass (bm_ruby_xor is a pre-existing failure on master, unrelated)
Gen1 self-compile output ✓ byte-identical to master (3,208,666 bytes, 0-line diff)

Performance — gen1 self-compile (ruby spinel_codegen.rb build/codegen.ast build/codegen.ir /tmp/out.c)

3 runs each, /usr/bin/time -p:

Run master this PR
1 10.19s real / 9.94s user / 0.23s sys 14.22s / 13.91s / 0.30s
2 10.23s / 9.98s / 0.24s 13.74s / 13.45s / 0.28s
3 10.41s / 10.16s / 0.24s 14.16s / 13.84s / 0.31s
median 10.23s real / 9.98s user 14.16s real / 13.84s user

This PR is ~38% slower on self-compile. The cache machinery is net-negative for this workload.

Performance — large-workload codegen (the OOM target)

Single run, /usr/bin/time -l against pre-baked AST+IR cached from prior investigation:

Metric master baseline (from prior session) this PR
Wall 189s 207s
Peak RSS 11.09 GB 11.64 GB
Output written 0 bytes (OOM-killed) 0 bytes (OOM-killed)

This PR does not close the OOM. The plan's hypothesis — that the per-arm arm_arg_strs concat in compile_poly_method_call was the dominant allocator — is wrong. Caching that fragment doesn't move peak RSS. The surrounding per-arm emit string and the call_expr build are still rebuilt every arm; those are the more likely drivers.

Why the regression on self-compile

The cache helps high-arm-count call sites (cohorts of N classes that all include a shared module — many arms with identical signatures within one call site, so 1 build + N-1 hits). spinel's own self-compile is dominated by low-arm-count poly call sites, where the per-call-site key-prefix build (mname + ";|" + arg_compiled.join(",") + ";|" + arg_types.join(",") + ";|" + arg_unbox_pk.join(",")) and the arg_unbox_pk StrArray allocation run unconditionally for every poly call site and dominate any savings.

Recommendation

Do not merge as-is. Two paths forward:

  1. Make the cache adaptive — only pay the key-build cost when the dispatch is going to iterate enough arms to amortize it. Would require predicting arm count cheaply (the narrow-set hoist from a sibling perf PR exposes this, but isn't on master yet).
  2. Pivot to instrumentation first — land --emit-stats + Vernier --profile so we can identify the actual dominant allocator on the large workload with hard data, then design a targeted PR.

Leaving this PR open as research output that documents the experiment and saves the type-inference workaround below.

Type-inference workaround documented in the diff

Spinel's analyze pass seeds the empty-hash ivar type from the first []= site's inferred key type. The key here starts with the method parameter mname, whose type hasn't propagated on the first analyze iteration — spinel was committing @arm_argstr_cache to int_str_hash (sp_IntStrHash) instead of str_str_hash (sp_StrStrHash) and the build failed with incompatible pointer to integer conversion. Prepending an empty string literal ("" + mname + ...) at both key-construction sites pins the leftmost receiver to string, which propagates through the chain. This is the kind of non-obvious self-host invariant worth documenting even if this PR doesn't ship.

🤖 Generated with Claude Code

@ryanseys ryanseys force-pushed the rs-perf-arm-argstr-cache branch from 2d4e7fa to b4c39f4 Compare May 20, 2026 00:13
…thod_call

The inner per-arm loop in compile_poly_method_call rebuilt the
', arg0, arg1, ...' coerced-argument fragment once per (call_site,
arm). For workloads where many classes implement the same method
signature (typical when an include'd module provides a shared
method set), the fragment is identical across arms and across
call sites that share the same arg expressions and types.

Cache the fragment in a Compiler-wide Hash{String=>String} keyed by
(mname, arg_compiled, arg_types, per-arg poly-rbval flags, arm
ptypes). Hits skip the per-pk box/unbox decision loop entirely.

Also hoist the per-arg expr_emits_poly_rb_val probe out of the
per-arm loop into a pre-computed StrArray; the flag depends only
on the source AST node, not on the dispatch arm, so the original
code was recomputing it N times per call site.

Note: this cache trims allocation in the per-arm path but does
NOT close the large-workload codegen OOM on its own. The
surrounding per-arm 'if (recv.cls_id == X) tmp = ...;' emit
string and the call_expr build are still per-arm. Identifying
the remaining dominant allocator is follow-up work.

Verification: make bootstrap (round 2 vs round 3 byte-identical
on analyze.rb and codegen.rb), make test (588/588), make
optcarrot (checksum 59662 unchanged).
@ryanseys ryanseys force-pushed the rs-perf-arm-argstr-cache branch from b4c39f4 to d2a3aae Compare May 20, 2026 00:18
@ryanseys
Copy link
Copy Markdown
Owner Author

Closing as not merge-worthy. The implementation is correct (bootstrap clean, byte-identical output) but the perf hypothesis is wrong: -38% wall on self-compile (10.23s -> 14.16s median, 3 runs each) and peak RSS on the large-workload codegen is unmoved (11.6 GB, still OOM-kills). The cost of building the per-call-site key prefix unconditionally overwhelms the cache savings for workloads where most poly call sites have low arm counts.

Two prior PRs (PR 4 @out_lines streaming, this one) each ruled out a hypothesized OOM driver. Pivoting to instrumentation (--emit-stats + Vernier --profile) before any further OOM-targeted PR.

The diff stays on the branch as research record; the type-inference workaround ("" + mname + ... to pin a Hash{String=>String} ivar's seeded type) is documented in the body above and saved to memory for future reference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant