[cc] regalloc coloring work by jgarzik · Pull Request #579 · rustcoreutils/posixutils-rs

jgarzik · 2026-05-31T06:01:49Z

No description provided.

Replace x86_64's linear-scan allocator with MCS-ordered greedy coloring on the SSA interference graph (Hack/Braun 2009) plus Belady next-use eviction when greedy fails. Aarch64 still on M5 linear-scan; port follows.

Shared infrastructure in cc/arch/regalloc.rs:
- InterferenceGraph using BTreeMap/BTreeSet for deterministic iteration (the design rule established in the determinism work)
- build_interference_graph with per-instruction backward walk, def-before-remove + def-vs-live edges; an add_def_src_edges parameter so GP gets the cmov pattern edge and XMM doesn't (forcing def/src to differ over-constrains FP and the linker loses information)
- mcs_ordering + greedy_color + ColoringResult
- compute_use_positions + next_use_distance for Belady

x86_64 backend (cc/arch/x86_64/regalloc.rs):
- run_chordal_color replaces run_linear_scan
- color_gp_bank: def_src_edges=true, hard-forbid caller-saved for cross-call ranges, soft-prefer callee-saved in loops
- color_xmm_bank: def_src_edges=false, spill-on-fail (Belady GP-only for now; XMM splitting is a follow-up)
- expire_old_intervals deleted

Codegen fixes: chordal coloring exposed every codegen helper that used Xmm0 as an implicit scratch. Linear scan happened to keep Xmm0 cold; chordal coloring assigns it freely, so the clobbers became real miscompilations. Symptom: _Py_dg_strtod infinite loop on subnormal floats (e.g. float('9e-324')), because the correction loop's intermediate doubles got silently destroyed.

Switched implicit-scratch sites from Xmm0 (allocatable) to Xmm15/Xmm14 (reserved scratch, same pattern as R10/R11 on the GP side):
- emit_fp_binop, emit_fp_neg, emit_fp_compare, emit_int_to_float, emit_float_to_int, emit_float_to_float, emit_fp_const_load (cc/arch/x86_64/float.rs)
- emit_select_fp (cc/arch/x86_64/codegen.rs, earlier fix)
- regular FP copy fallback in codegen.rs

This reserves 2 of 16 XMM regs from the allocator's palette. A follow-up milestone (per-instruction scratch-clobber constraints, shared with inline asm) can free them back.

Verified:
- cc test suite 914+948+204 pass
- clippy clean
- CPython 3.12.9 -O2 449/449 (40,817 individual tests, 30m23s)
- float('9e-324') returns 1e-323 correctly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
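As a rough illustration of the MCS-ordered greedy coloring described in this commit, the sketch below builds a tiny interference graph and colors it. It is not the code from cc/arch/regalloc.rs: the `u32` pseudo ids, the `usize` colors, and all signatures are assumptions made for the example; only the names InterferenceGraph, mcs_ordering, and greedy_color and the use of BTreeMap/BTreeSet come from the commit message.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Undirected interference graph keyed by pseudo id (hypothetical u32 ids).
/// BTreeMap/BTreeSet keep iteration order deterministic.
#[derive(Default)]
pub struct InterferenceGraph {
    adj: BTreeMap<u32, BTreeSet<u32>>,
}

impl InterferenceGraph {
    /// Record that `a` and `b` are live at the same time.
    pub fn add_edge(&mut self, a: u32, b: u32) {
        if a == b {
            return;
        }
        self.adj.entry(a).or_default().insert(b);
        self.adj.entry(b).or_default().insert(a);
    }
}

/// Maximum cardinality search: repeatedly visit the unvisited node that has
/// the most already-visited neighbors. Ties go to the smallest id.
pub fn mcs_ordering(g: &InterferenceGraph) -> Vec<u32> {
    let mut weight: BTreeMap<u32, usize> = g.adj.keys().map(|&n| (n, 0)).collect();
    let mut order = Vec::with_capacity(weight.len());
    while !weight.is_empty() {
        let mut best: Option<(u32, usize)> = None;
        for (&n, &w) in &weight {
            // Strict `>` over ascending keys keeps the smallest id on ties.
            if best.map_or(true, |(_, bw)| w > bw) {
                best = Some((n, w));
            }
        }
        let (n, _) = best.unwrap();
        weight.remove(&n);
        order.push(n);
        for m in &g.adj[&n] {
            if let Some(w) = weight.get_mut(m) {
                *w += 1;
            }
        }
    }
    order
}

/// Give each node, in `order`, the lowest of `k` colors not used by an
/// already-colored neighbor. `None` marks a node that needs spill/eviction.
pub fn greedy_color(
    g: &InterferenceGraph,
    order: &[u32],
    k: usize,
) -> BTreeMap<u32, Option<usize>> {
    let mut colors: BTreeMap<u32, Option<usize>> = BTreeMap::new();
    for &n in order {
        let used: BTreeSet<usize> = g.adj[&n]
            .iter()
            .filter_map(|m| colors.get(m).copied().flatten())
            .collect();
        colors.insert(n, (0..k).find(|c| !used.contains(c)));
    }
    colors
}
```

On a triangle 0-1-2 with an extra edge 2-3, this sketch visits the nodes in the order 0, 1, 2, 3; with three colors every node is colored, and with two colors node 2 comes back as `None`, which is the point where an allocator of this shape would fall back to eviction or spilling.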

Port the x86_64 chordal coloring to aarch64. Same shape as the x86_64 implementation:
- run_chordal_color replaces run_linear_scan
- color_gp_bank: def_src_edges=true (aarch64 csel has the same source-vs-target clobber concern as x86_64 cmov), hard-forbid caller-saved for cross-call ranges, soft-prefer callee-saved in loops
- color_vreg_bank: def_src_edges=false, spill-on-fail (Belady GP-only; V-bank splitting follows the same plan as XMM)
- Ord/PartialOrd added to Reg and VReg for BTreeMap keys (deterministic iteration, the design rule)

aarch64 lacks x86_64's idiv/shift register constraints, so the constraint-point machinery sees no input today. The plumbing is kept identical to x86_64 so that when the constraint system grows (M7.5: per-instruction operand/clobber constraints, shared with inline asm), it lights up here without further structural change.

Multi-reg call returns (complex, struct, union) and __int128 remain forced to stack slots in Phase 1, matching prior aarch64 behavior.

cc/arch/regalloc.rs: delete now-unused expire_intervals, expire_stack_intervals, find_conflicting_registers. Both backends moved off the linear-scan core; these had no remaining callers.

Known latent risk (not blocking, mirrors x86_64): inline libc calls emitted by cc/arch/aarch64/features.rs (__signbitf, fabs, signbit) bypass the allocator's call-position walk. Linear scan hid this by always popping from the end of the palette (V31 first); chordal coloring will use V0 freely. CPython exercises this rarely enough that x86_64 tests pass; if aarch64 CI flags an FP-builtin failure, this is where to look. The constraint system milestone resolves it cleanly.

Verified:
- cc test suite 914+948+204 pass on x86_64 host
- cargo build --release clean
- cargo clippy --all-targets clean
- cargo fmt --check clean
- aarch64 behavior verification deferred to macOS CI (no local cross-compiler)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR replaces the existing linear-scan register allocator in the cc compiler with an SSA-aware chordal graph coloring allocator (M6) plus Belady-style eviction (M7), implemented for both x86_64 and aarch64 backends. It also reserves XMM14/XMM15 as codegen scratch on x86_64 so they cannot be picked by the new allocator, which otherwise would have silently clobbered FP pseudos in several float.rs/codegen.rs helpers that previously assumed XMM0 was free.

Changes:

- Adds a shared chordal-coloring core in cc/arch/regalloc.rs (interference graph, MCS ordering, greedy color, use-position/next-use distance helpers) and removes the now-unused linear-scan helpers (expire_intervals, expire_stack_intervals, find_conflicting_registers).
- Rewrites run_linear_scan → run_chordal_color in both cc/arch/x86_64/regalloc.rs and cc/arch/aarch64/regalloc.rs as a three-phase pipeline (pre-pass routing, per-bank chordal coloring with constraint/forbidden sets and loop-aware preferred palettes, Belady eviction, commit), and derives PartialOrd/Ord on Reg/XmmReg/VReg to support deterministic BTreeMap/BTreeSet keys.
- Switches x86_64 FP codegen scratch from XMM0 to XMM15 (and XMM14 in the few spots where XMM15 would collide with the destination) across float.rs (emit_fp_binop, emit_fp_cmp, emit_fp_to_int, FP-FP conversion, FP-const load) and codegen.rs (FP select, FP copy fallback).
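
The Belady-style eviction mentioned above can be pictured with a small sketch: when no register is free, evict the candidate whose next use lies farthest ahead. The code is illustrative only. The `UsePositions` layout, the `u32` pseudo ids, and `pick_eviction_victim` are hypothetical; `next_use_distance` borrows its name from the PR, but its signature here is an assumption.

```rust
use std::collections::BTreeMap;

/// Sorted instruction positions at which each pseudo is used
/// (hypothetical layout, assumed for this sketch).
pub type UsePositions = BTreeMap<u32, Vec<usize>>;

/// Distance from `pos` to the next use of `pseudo` at or after `pos`.
/// `None` means the pseudo is not used again.
pub fn next_use_distance(uses: &UsePositions, pseudo: u32, pos: usize) -> Option<usize> {
    let positions = uses.get(&pseudo)?;
    // First index whose position is >= pos (the list is sorted).
    let idx = positions.partition_point(|&p| p < pos);
    positions.get(idx).map(|&p| p - pos)
}

/// Belady's rule: evict the candidate whose next use is farthest away.
/// A pseudo with no further use counts as infinitely far.
/// On equal distances the last candidate in the slice is returned.
pub fn pick_eviction_victim(uses: &UsePositions, candidates: &[u32], pos: usize) -> Option<u32> {
    candidates
        .iter()
        .copied()
        .max_by_key(|&c| next_use_distance(uses, c, pos).unwrap_or(usize::MAX))
}
```

With uses `{1: [2, 10], 2: [5], 3: [1]}` and the current position 3, pseudo 3 has no further use and is the preferred victim; among pseudos 1 and 2 alone, pseudo 1 is chosen because its next use is 7 instructions away versus 2.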

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| cc/arch/regalloc.rs | Adds `InterferenceGraph`, `build_interference_graph`, `mcs_ordering`, `greedy_color`, `compute_use_positions`, `next_use_distance`; drops `expire_intervals`/`expire_stack_intervals`/`find_conflicting_registers`. |
| cc/arch/x86_64/regalloc.rs | Replaces linear scan with chordal coloring + Belady eviction; adds `Ord` to `Reg`/`XmmReg`; routes long-double/`__int128`/cross-call FP to stack in the pre-pass. |
| cc/arch/aarch64/regalloc.rs | Mirrors the x86_64 chordal/Belady rewrite for the GP and V banks; removes active-list state; adds `Ord` to `Reg`/`VReg`. |
| cc/arch/x86_64/float.rs | Reserves XMM15 (and XMM14) as codegen scratch in FP binop/cmp/cvt helpers so the new allocator's XMM picks aren't clobbered. |
| cc/arch/x86_64/codegen.rs | Switches FP select and the FP-copy fallback from XMM0 to XMM15 for the same reason. |

M6+M7 deleted expire_stack_intervals when it deleted the linear scan core, since neither chordal backend called it. That removed stack slot reuse: every alloc_stack_slot request now created a fresh slot because the free_stack_slots pool was never populated.

On x86_64 this is wasteful but harmless: 32-bit addressing absorbs huge frames. On aarch64 it broke 16 int128 codegen tests on CI: stp/ldp accept signed 7-bit immediate offsets (range [-512, 504] in multiples of 8), and int128-heavy tests pushed offsets to #3472 (217 * 16, i.e. 217 unique 16-byte slots where 20-40 would have sufficed with reuse).

Fix: re-introduce expire_stack_intervals in cc/arch/regalloc.rs and call it from both chordal allocators:
- Phase 1: at the top of each iteration with interval.start. Intervals come pre-sorted by start position, so this monotonic sweep matches the previous linear-scan-era expiration shape.
- Phase 3: with usize::MAX just before spill commits, draining all remaining slots so the chordal-spilled pseudos can reuse any non-interfering slot.

The existing try_reuse_stack_slot uses interference checks (via pseudos_interfere over live_in/live_out from the dataflow fixpoint), so reuse is correct for any iteration order.

Latent issue noted (not fixed here, separate concern): aarch64 codegen helpers like emit_int128_move_to_stack push raw stp/ldp without large-offset fallback. Even with reuse, frames close to the [-512, 504] threshold will still hit this. The fix is an add-scratch-then-stp pattern in the affected emitters; tracked for a follow-up rather than bundled into this regression fix.

Verified:
- cc test suite 914+948+204 pass on x86_64 host
- cargo build / clippy / fmt clean
- aarch64 verification deferred to macOS CI

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
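
To make the effect of the missing expiry concrete, here is a simplified sketch of a slot pool with and without an expiry sweep. It is an assumption-laden model, not the PR's code: `StackSlots`, `ActiveSlot`, the fixed 16-byte slot size, and `frame_bytes` are invented for the example, and reuse here is decided purely by end position, whereas the PR's try_reuse_stack_slot relies on interference checks.

```rust
/// A slot that stays occupied until instruction position `end` (inclusive).
/// Hypothetical type for this sketch.
struct ActiveSlot {
    offset: i32,
    end: usize,
}

/// Minimal slot pool: 16-byte slots, reuse driven only by end positions.
#[derive(Default)]
pub struct StackSlots {
    active: Vec<ActiveSlot>,
    free: Vec<i32>,
    next_offset: i32,
}

impl StackSlots {
    /// Move every slot whose owner ended before `pos` into the free pool.
    pub fn expire_stack_intervals(&mut self, pos: usize) {
        let free = &mut self.free;
        self.active.retain(|s| {
            if s.end < pos {
                free.push(s.offset);
                false
            } else {
                true
            }
        });
    }

    /// Reuse a free slot if one exists, otherwise grow the frame.
    pub fn alloc_stack_slot(&mut self, end: usize) -> i32 {
        let offset = self.free.pop().unwrap_or_else(|| {
            let fresh = self.next_offset;
            self.next_offset += 16;
            fresh
        });
        self.active.push(ActiveSlot { offset, end });
        offset
    }

    /// Total bytes of frame handed out so far.
    pub fn frame_bytes(&self) -> i32 {
        self.next_offset
    }
}
```

For four back-to-back intervals (0..1, 2..3, 4..5, 6..7), calling `expire_stack_intervals(start)` before each allocation keeps the frame at 16 bytes, while skipping the sweep grows it to 64 bytes. The same growth pattern, at larger scale, is what pushed the aarch64 offsets out of stp/ldp range.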

Aarch64 stp/ldp accept signed 7-bit immediate offsets scaled by element size: [-512, +504] step 8 for B64, [-256, +252] step 4 for B32. Body-emitted pair instructions (int128 moves, spilled arg stores, int128 call args) bypassed this constraint and emitted raw stp/ldp with whatever offset stack_mem returned. On deep frames the assembler rejected them ("index must be a multiple of 8 in range [-512, 504]"). The prologue/epilogue already had hand-rolled large-frame splits; the body sites did not.

Centralized fix: a pair-address legalization helper on the Aarch64CodeGen. Three new internal methods, two new public emitters:

    fn pair_offset_fits(offset, size) -> bool
    fn emit_add_offset(dst, base, offset)
    fn legalize_pair_addr(size, addr) -> MemAddr
    pub fn emit_stp_legalized(size, src1, src2, addr)
    pub fn emit_ldp_legalized(size, addr, dst1, dst2)

`legalize_pair_addr` is a no-op for in-range BaseOffset and for PreIndex / PostIndex (these forms are exclusive to the prologue / epilogue, which retain their own split logic). For out-of-range BaseOffset it emits `add X16, base, #offset` and rewrites the addr to `[X16]`.

Scratch register choice: X16 (AAPCS64 IP0). Never in the allocator palette, never used by other codegen helpers as a *data* shuttle (they use x9–x11). The legalization convention is documented next to the helper: X16 is clobbered iff the legalizer fires; callers must not rely on X16 being live past the emit call. In practice every site follows the pattern "compute source addr (may use X16) → load into X9/X10 → legalize destination addr (may reuse X16) → store": by the time the destination's legalization runs, the source's use of X16 is dead.

Migrated body emit sites:

cc/arch/aarch64/codegen.rs:
- emit_int128_move_to_stack (3 sites, the int128 store pattern that actually triggered the CI failure)
- emit_int128_imm_store (1 site)
- emit_load int128 path (2 sites)
- emit_cbr int128 path (1 site)
- emit_return int128 lowering (1 site)
- int128 GP-pair arg storage (1 site)

cc/arch/aarch64/expression.rs:
- load_int128, store_int128 (2 sites)

cc/arch/aarch64/call.rs:
- int128 call-arg setup (2 sites)

Left as raw push_lir, all justified:
- Prologue/epilogue PreIndex / PostIndex stp/ldp: already handle their own large-frame splits via emit_sub_sp_imm / emit_add_sp_imm.
- zero_stack_frame: only enters the stp branch when offsets fit (lines guard `max_stp_offset <= 504`).
- Callee-saved save/restore: offsets bounded by the callee-saved set (≤288 bytes max).
- Sites where the address is `MemAddr::Base(X16)` after emit_load_addr: no offset to legalize.

FP pair helpers (emit_stp_fp_legalized / emit_ldp_fp_legalized) intentionally not introduced: every current StpFp / LdpFp site is bounded. A code comment marks where to add them when an FP pair instruction grows a body emission with possibly-large offset.

Verified:
- cc test suite 914+948+204 pass on x86_64 host
- cargo build / clippy / fmt clean
- aarch64 verification deferred to macOS CI

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
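
The range rule and the legalization step described in this commit can be sketched as follows. This is a standalone model under stated assumptions, not the Aarch64CodeGen code: the two-variant `MemAddr`, the `AddOffset` record, the `u8` register numbers, and the tuple return type are invented for the example (the PR's `legalize_pair_addr` returns a MemAddr and emits the add itself), and how `emit_add_offset` materializes large or negative offsets is not modeled.

```rust
/// Hypothetical address forms; the PR's MemAddr also has PreIndex / PostIndex.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemAddr {
    Base(u8),
    BaseOffset { base: u8, offset: i64 },
}

/// Request to compute `dst = base + offset` before the pair instruction.
/// Stands in for the PR's emit_add_offset call (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AddOffset {
    pub dst: u8,
    pub base: u8,
    pub offset: i64,
}

/// Register number of X16 (AAPCS64 IP0), the scratch named in the commit.
pub const X16: u8 = 16;

/// True when `offset` fits the signed 7-bit immediate of stp/ldp, scaled by
/// the element `size` in bytes (8 for 64-bit, 4 for 32-bit registers).
pub fn pair_offset_fits(offset: i64, size: i64) -> bool {
    debug_assert!(size == 4 || size == 8);
    offset % size == 0 && (-64 * size..=63 * size).contains(&offset)
}

/// In-range addresses pass through unchanged. An out-of-range BaseOffset is
/// rewritten to `[X16]`, with an add request the caller must emit first.
pub fn legalize_pair_addr(size: i64, addr: MemAddr) -> (Option<AddOffset>, MemAddr) {
    match addr {
        MemAddr::BaseOffset { base, offset } if !pair_offset_fits(offset, size) => (
            Some(AddOffset { dst: X16, base, offset }),
            MemAddr::Base(X16),
        ),
        in_range => (None, in_range),
    }
}
```

Under this model, offset 504 with 8-byte elements passes through untouched, while the #3472 offset from the failing int128 tests is rewritten to `[X16]` with an accompanying add request.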

jgarzik and others added 2 commits May 31, 2026 03:47

jgarzik requested a review from Copilot May 31, 2026 06:01

jgarzik self-assigned this May 31, 2026

Copilot started reviewing on behalf of jgarzik May 31, 2026 06:01

Copilot AI reviewed May 31, 2026


jgarzik and others added 2 commits May 31, 2026 06:18

jgarzik merged commit 3a61979 into main May 31, 2026
9 checks passed

jgarzik deleted the updates branch May 31, 2026 09:53

jgarzik merged 4 commits into main from updates
