perf(codegen): cache per-arm coerced-arg fragments in compile_poly_method_call by ryanseys · Pull Request #99 · ryanseys/spinel

ryanseys · 2026-05-20T00:10:54Z

Summary

Adds a Compiler-wide @arm_argstr_cache (Hash{String=>String}) that memoizes the ', arg0, arg1, ...' coerced-argument fragment built per arm in compile_poly_method_call. Keyed by mname + arg expressions + arg types + per-arg poly-rbval flags + arm ptypes.
Hoists the per-arg expr_emits_poly_rb_val probe out of the per-arm loop into a pre-computed arg_unbox_pk StrArray. The probe depends only on the source AST node, not on the dispatch arm.

Correctness

Check	Result
`make bootstrap` (round 2 vs 3 byte-identical)	✓ pass on both analyze.rb and codegen.rb
`make test`	✓ 588 / 588 pass
`make optcarrot`	✓ checksum 59662 (unchanged)
`make bench`	✓ 56 / 57 pass (`bm_ruby_xor` is a pre-existing failure on master, unrelated)
Gen1 self-compile output	✓ byte-identical to master (3,208,666 bytes, 0-line diff)

Performance — gen1 self-compile (`ruby spinel_codegen.rb build/codegen.ast build/codegen.ir /tmp/out.c`)

3 runs each, /usr/bin/time -p:

Run	master	this PR
1	10.19s real / 9.94s user / 0.23s sys	14.22s / 13.91s / 0.30s
2	10.23s / 9.98s / 0.24s	13.74s / 13.45s / 0.28s
3	10.41s / 10.16s / 0.24s	14.16s / 13.84s / 0.31s
median	10.23s real / 9.98s user	14.16s real / 13.84s user

This PR is ~38% slower on self-compile. The cache machinery is net-negative for this workload.

Performance — large-workload codegen (the OOM target)

Single run, /usr/bin/time -l against pre-baked AST+IR cached from prior investigation:

Metric	master baseline (from prior session)	this PR
Wall	189s	207s
Peak RSS	11.09 GB	11.64 GB
Output written	0 bytes (OOM-killed)	0 bytes (OOM-killed)

This PR does not close the OOM. The plan's hypothesis — that the per-arm arm_arg_strs concat in compile_poly_method_call was the dominant allocator — is wrong. Caching that fragment doesn't move peak RSS. The surrounding per-arm emit string and the call_expr build are still rebuilt every arm; those are the more likely drivers.

Why the regression on self-compile

The cache helps high-arm-count call sites (cohorts of N classes that all include a shared module — many arms with identical signatures within one call site, so 1 build + N-1 hits). spinel's own self-compile is dominated by low-arm-count poly call sites, where the per-call-site key-prefix build (mname + ";|" + arg_compiled.join(",") + ";|" + arg_types.join(",") + ";|" + arg_unbox_pk.join(",")) and the arg_unbox_pk StrArray allocation run unconditionally for every poly call site and dominate any savings.

Recommendation

Do not merge as-is. Two paths forward:

Make the cache adaptive — only pay the key-build cost when the dispatch is going to iterate enough arms to amortize it. Would require predicting arm count cheaply (the narrow-set hoist from a sibling perf PR exposes this, but isn't on master yet).
Pivot to instrumentation first — land --emit-stats + Vernier --profile so we can identify the actual dominant allocator on the large workload with hard data, then design a targeted PR.

Leaving this PR open as research output that documents the experiment and saves the type-inference workaround below.

Type-inference workaround documented in the diff

Spinel's analyze pass seeds the empty-hash ivar type from the first []= site's inferred key type. The key here starts with the method parameter mname, whose type hasn't propagated on the first analyze iteration — spinel was committing @arm_argstr_cache to int_str_hash (sp_IntStrHash) instead of str_str_hash (sp_StrStrHash) and the build failed with incompatible pointer to integer conversion. Prepending an empty string literal ("" + mname + ...) at both key-construction sites pins the leftmost receiver to string, which propagates through the chain. This is the kind of non-obvious self-host invariant worth documenting even if this PR doesn't ship.

🤖 Generated with Claude Code

…thod_call The inner per-arm loop in compile_poly_method_call rebuilt the ', arg0, arg1, ...' coerced-argument fragment once per (call_site, arm). For workloads where many classes implement the same method signature (typical when an include'd module provides a shared method set), the fragment is identical across arms and across call sites that share the same arg expressions and types. Cache the fragment in a Compiler-wide Hash{String=>String} keyed by (mname, arg_compiled, arg_types, per-arg poly-rbval flags, arm ptypes). Hits skip the per-pk box/unbox decision loop entirely. Also hoist the per-arg expr_emits_poly_rb_val probe out of the per-arm loop into a pre-computed StrArray; the flag depends only on the source AST node, not on the dispatch arm, so the original code was recomputing it N times per call site. Note: this cache trims allocation in the per-arm path but does NOT close the large-workload codegen OOM on its own. The surrounding per-arm 'if (recv.cls_id == X) tmp = ...;' emit string and the call_expr build are still per-arm. Identifying the remaining dominant allocator is follow-up work. Verification: make bootstrap (round 2 vs round 3 byte-identical on analyze.rb and codegen.rb), make test (588/588), make optcarrot (checksum 59662 unchanged).

ryanseys · 2026-05-20T00:24:20Z

Closing as not merge-worthy. The implementation is correct (bootstrap clean, byte-identical output) but the perf hypothesis is wrong: -38% wall on self-compile (10.23s -> 14.16s median, 3 runs each) and peak RSS on the large-workload codegen is unmoved (11.6 GB, still OOM-kills). The cost of building the per-call-site key prefix unconditionally overwhelms the cache savings for workloads where most poly call sites have low arm counts.

Two prior PRs (PR 4 @out_lines streaming, this one) each ruled out a hypothesized OOM driver. Pivoting to instrumentation (--emit-stats + Vernier --profile) before any further OOM-targeted PR.

The diff stays on the branch as research record; the type-inference workaround ("" + mname + ... to pin a Hash{String=>String} ivar's seeded type) is documented in the body above and saved to memory for future reference.

ryanseys force-pushed the rs-perf-arm-argstr-cache branch from 2d4e7fa to b4c39f4 Compare May 20, 2026 00:13

ryanseys force-pushed the rs-perf-arm-argstr-cache branch from b4c39f4 to d2a3aae Compare May 20, 2026 00:18

ryanseys closed this May 20, 2026

ryanseys mentioned this pull request May 20, 2026

feat(codegen): --stats=csv / --stats=json poly-dispatch instrumentation #100

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(codegen): cache per-arm coerced-arg fragments in compile_poly_method_call#99

perf(codegen): cache per-arm coerced-arg fragments in compile_poly_method_call#99
ryanseys wants to merge 1 commit into
masterfrom
rs-perf-arm-argstr-cache

ryanseys commented May 20, 2026 •

edited

Loading

Uh oh!

ryanseys commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ryanseys commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Correctness

Performance — gen1 self-compile (ruby spinel_codegen.rb build/codegen.ast build/codegen.ir /tmp/out.c)

Performance — large-workload codegen (the OOM target)

Why the regression on self-compile

Recommendation

Type-inference workaround documented in the diff

Uh oh!

ryanseys commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ryanseys commented May 20, 2026 •

edited

Loading

Performance — gen1 self-compile (`ruby spinel_codegen.rb build/codegen.ast build/codegen.ir /tmp/out.c`)