[cc] regalloc coloring work #579

Merged
jgarzik merged 4 commits into main from updates on May 31, 2026

@jgarzik jgarzik commented May 31, 2026

No description provided.

jgarzik and others added 2 commits May 31, 2026 03:47
Replace x86_64's linear-scan allocator with MCS-ordered greedy
coloring on the SSA interference graph (Hack/Braun 2009) plus
Belady next-use eviction when greedy fails. Aarch64 still on M5
linear-scan; port follows.

Shared infrastructure in cc/arch/regalloc.rs:

- InterferenceGraph using BTreeMap/BTreeSet for deterministic
  iteration (the design rule established in the determinism work)
- build_interference_graph with per-instruction backward walk,
  def-before-remove + def-vs-live edges; an add_def_src_edges
  parameter so GP gets the cmov pattern edge and XMM doesn't
  (forcing def/src to differ over-constrains FP and the linker
  loses information)
- mcs_ordering + greedy_color + ColoringResult
- compute_use_positions + next_use_distance for Belady
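
The list above can be illustrated with a minimal sketch of MCS ordering followed by greedy coloring. It is not the code in cc/arch/regalloc.rs: `Graph`, `mcs_order`, the `u32` pseudo ids, and the `Option<usize>` color encoding are all hypothetical names chosen for the example. It assumes the graph is chordal (as SSA interference graphs are), in which case coloring in MCS visit order uses the minimum number of colors.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for an interference graph keyed by pseudo id.
/// BTreeMap/BTreeSet give a stable, sorted iteration order.
#[derive(Default)]
pub struct Graph {
    adj: BTreeMap<u32, BTreeSet<u32>>,
}

impl Graph {
    /// Record that `a` and `b` are live at the same time.
    pub fn add_edge(&mut self, a: u32, b: u32) {
        if a != b {
            self.adj.entry(a).or_default().insert(b);
            self.adj.entry(b).or_default().insert(a);
        }
    }
}

/// Maximum cardinality search: repeatedly visit the unvisited node with the
/// most already-visited neighbours, breaking ties by smallest id.
pub fn mcs_order(g: &Graph) -> Vec<u32> {
    let mut weight: BTreeMap<u32, usize> = g.adj.keys().map(|&n| (n, 0)).collect();
    let mut order = Vec::with_capacity(weight.len());
    while !weight.is_empty() {
        let mut best: Option<(u32, usize)> = None;
        for (&n, &w) in &weight {
            // Strict `>` keeps the smallest id on ties (ascending iteration).
            if best.map_or(true, |(_, bw)| w > bw) {
                best = Some((n, w));
            }
        }
        let (n, _) = best.expect("weight map is non-empty");
        weight.remove(&n);
        order.push(n);
        for m in &g.adj[&n] {
            if let Some(w) = weight.get_mut(m) {
                *w += 1;
            }
        }
    }
    order
}

/// Greedy coloring with `k` colors: each node takes the lowest color not used
/// by an already-colored neighbour, or `None` (spill candidate) if none is free.
pub fn greedy_color(g: &Graph, order: &[u32], k: usize) -> BTreeMap<u32, Option<usize>> {
    let mut colors: BTreeMap<u32, Option<usize>> = BTreeMap::new();
    for &n in order {
        let used: BTreeSet<usize> = g.adj[&n]
            .iter()
            .filter_map(|m| colors.get(m).copied().flatten())
            .collect();
        colors.insert(n, (0..k).find(|c| !used.contains(c)));
    }
    colors
}
```

In this sketch a `None` color is where a spill or eviction step would take over; the tie-break by smallest id is an arbitrary choice made only to keep the output deterministic.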

x86_64 backend (cc/arch/x86_64/regalloc.rs):

- run_chordal_color replaces run_linear_scan
- color_gp_bank: def_src_edges=true, hard-forbid caller-saved
  for cross-call ranges, soft-prefer callee-saved in loops
- color_xmm_bank: def_src_edges=false, spill-on-fail (Belady
  GP-only for now — XMM splitting is a follow-up)
- expire_old_intervals deleted
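
As a rough illustration of the Belady fallback mentioned above, the sketch below picks an eviction victim by next-use distance. The function bodies, the `pick_victim` name, and the sorted `Vec<usize>` of use positions are assumptions made for the example, not the PR's actual data structures; only the name `next_use_distance` comes from the commit message.

```rust
use std::collections::BTreeMap;

/// Distance from `pos` to the first use at or after `pos`.
/// `uses` is assumed to be sorted ascending; `None` means no further use.
pub fn next_use_distance(uses: &[usize], pos: usize) -> Option<usize> {
    let idx = uses.partition_point(|&u| u < pos);
    uses.get(idx).map(|&u| u - pos)
}

/// Belady-style choice: evict the candidate whose next use is farthest away.
/// A candidate with no further use counts as infinitely far.
pub fn pick_victim(
    use_positions: &BTreeMap<u32, Vec<usize>>,
    candidates: &[u32],
    pos: usize,
) -> Option<u32> {
    candidates.iter().copied().max_by_key(|c| {
        let uses = use_positions.get(c).map(Vec::as_slice).unwrap_or(&[]);
        next_use_distance(uses, pos).unwrap_or(usize::MAX)
    })
}
```

For example, at position 4 a pseudo used next at position 10 would be evicted before one used next at position 5, and a pseudo with no remaining uses would be evicted before either.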

Codegen fixes — chordal coloring exposed every codegen helper
that used Xmm0 as an implicit scratch. Linear scan happened to
keep Xmm0 cold; chordal coloring assigns it freely, so the
clobbers became real miscompilations. Symptom: _Py_dg_strtod
infinite loop on subnormal floats (e.g. float('9e-324')) — the
correction loop's intermediate doubles got silently destroyed.

Switched implicit-scratch sites from Xmm0 (allocatable) to
Xmm15/Xmm14 (reserved scratch, same pattern as R10/R11 on the
GP side):

- emit_fp_binop, emit_fp_neg, emit_fp_compare, emit_int_to_float,
  emit_float_to_int, emit_float_to_float, emit_fp_const_load
  (cc/arch/x86_64/float.rs)
- emit_select_fp (cc/arch/x86_64/codegen.rs — earlier fix)
- regular FP copy fallback in codegen.rs

This reserves 2 of 16 XMM regs from the allocator's palette. A
follow-up milestone (per-instruction scratch-clobber constraints,
shared with inline asm) can free them back.

Verified:
- cc test suite 914+948+204 pass
- clippy clean
- CPython 3.12.9 -O2 449/449 (40,817 individual tests, 30m23s)
- float('9e-324') returns 1e-323 correctly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Port the x86_64 chordal coloring to aarch64. Same shape as the
x86_64 implementation:

- run_chordal_color replaces run_linear_scan
- color_gp_bank: def_src_edges=true (aarch64 csel has the same
  source-vs-target clobber concern as x86_64 cmov), hard-forbid
  caller-saved for cross-call ranges, soft-prefer callee-saved
  in loops
- color_vreg_bank: def_src_edges=false, spill-on-fail (Belady
  GP-only; V-bank splitting follows the same plan as XMM)
- Ord/PartialOrd added to Reg and VReg for BTreeMap keys
  (deterministic iteration — the design rule)
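
One way to read the hard-forbid / soft-prefer rules for the GP bank is as a candidate ordering applied before the color is chosen. The sketch below is an assumption about how such a policy could look, not the PR's implementation: `Palette`, `pick_reg`, the `u8` register ids, and the choice to try caller-saved registers first outside loops are all hypothetical.

```rust
use std::collections::BTreeSet;

/// Hypothetical register id; the real backends use Reg/XmmReg/VReg enums.
pub type RegId = u8;

/// Hypothetical split of one bank's allocatable registers.
pub struct Palette {
    pub caller_saved: Vec<RegId>,
    pub callee_saved: Vec<RegId>,
}

/// Pick the first free register under the two rules:
/// - hard: a range that crosses a call never gets a caller-saved register;
/// - soft: a range inside a loop tries callee-saved registers first.
pub fn pick_reg(
    p: &Palette,
    taken: &BTreeSet<RegId>,
    crosses_call: bool,
    in_loop: bool,
) -> Option<RegId> {
    let mut candidates: Vec<RegId> = Vec::new();
    if crosses_call {
        // Hard constraint: caller-saved registers are not candidates at all.
        candidates.extend(&p.callee_saved);
    } else if in_loop {
        // Soft preference: callee-saved first, caller-saved as a fallback.
        candidates.extend(&p.callee_saved);
        candidates.extend(&p.caller_saved);
    } else {
        // Assumed default order outside loops.
        candidates.extend(&p.caller_saved);
        candidates.extend(&p.callee_saved);
    }
    candidates.into_iter().find(|r| !taken.contains(r))
}
```

The difference between the two rules shows up when the callee-saved registers are exhausted: the cross-call case returns `None` (a spill candidate), while the in-loop case falls back to a caller-saved register.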

aarch64 lacks x86_64's idiv/shift register constraints, so the
constraint-point machinery sees no input today. The plumbing is
kept identical to x86_64 so when the constraint system grows
(M7.5 — per-instruction operand/clobber constraints, shared with
inline asm), it lights up here without further structural change.

Multi-reg call returns (complex, struct, union) and __int128
remain forced to stack slots in Phase 1, matching prior aarch64
behavior.

cc/arch/regalloc.rs — delete now-unused expire_intervals,
expire_stack_intervals, find_conflicting_registers. Both backends
moved off the linear-scan core; these had no remaining callers.

Known latent risk (not blocking, mirrors x86_64): inline libc
calls emitted by cc/arch/aarch64/features.rs (__signbitf, fabs,
signbit) bypass the allocator's call-position walk. Linear scan
hid this by always popping from the end of the palette (V31
first); chordal coloring will use V0 freely. CPython exercises
this rarely enough that x86_64 tests pass; if aarch64 CI flags
an FP-builtin failure, this is where to look. The constraint
system milestone resolves it cleanly.

Verified:
- cc test suite 914+948+204 pass on x86_64 host
- cargo build --release clean
- cargo clippy --all-targets clean
- cargo fmt --check clean
- aarch64 behavior verification deferred to macOS CI (no local
  cross-compiler)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jgarzik jgarzik requested a review from Copilot May 31, 2026 06:01
@jgarzik jgarzik self-assigned this May 31, 2026

Copilot AI left a comment

Pull request overview

This PR replaces the existing linear-scan register allocator in the cc compiler with an SSA-aware chordal graph coloring allocator (M6) plus Belady-style eviction (M7), implemented for both x86_64 and aarch64 backends. It also reserves XMM14/XMM15 as codegen scratch on x86_64 so they cannot be picked by the new allocator, which otherwise would have silently clobbered FP pseudos in several float.rs/codegen.rs helpers that previously assumed XMM0 was free.

Changes:

  • Adds a shared chordal-coloring core in cc/arch/regalloc.rs (interference graph, MCS ordering, greedy color, use-position/next-use distance helpers) and removes the now-unused linear-scan helpers (expire_intervals, expire_stack_intervals, find_conflicting_registers).
  • Rewrites run_linear_scan → run_chordal_color in both cc/arch/x86_64/regalloc.rs and cc/arch/aarch64/regalloc.rs as a three-phase pipeline (pre-pass routing, per-bank chordal coloring with constraint/forbidden sets and loop-aware preferred palettes, Belady eviction, commit), and derives PartialOrd/Ord on Reg/XmmReg/VReg to support deterministic BTreeMap/BTreeSet keys.
  • Switches x86_64 FP codegen scratch from XMM0 to XMM15 (and XMM14 in the few spots where XMM15 would collide with the destination) across float.rs (emit_fp_binop, emit_fp_cmp, emit_fp_to_int, FP-FP conversion, FP-const load) and codegen.rs (FP select, FP copy fallback).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| cc/arch/regalloc.rs | Adds InterferenceGraph, build_interference_graph, mcs_ordering, greedy_color, compute_use_positions, next_use_distance; drops expire_intervals/expire_stack_intervals/find_conflicting_registers. |
| cc/arch/x86_64/regalloc.rs | Replaces linear scan with chordal coloring + Belady eviction; adds Ord to Reg/XmmReg; routes long-double/__int128/cross-call FP to stack in the pre-pass. |
| cc/arch/aarch64/regalloc.rs | Mirrors the x86_64 chordal/Belady rewrite for the GP and V banks; removes active-list state; adds Ord to Reg/VReg. |
| cc/arch/x86_64/float.rs | Reserves XMM15 (and XMM14) as codegen scratch in FP binop/cmp/cvt helpers so the new allocator's XMM picks aren't clobbered. |
| cc/arch/x86_64/codegen.rs | Switches FP select and the FP-copy fallback from XMM0 to XMM15 for the same reason. |

jgarzik and others added 2 commits May 31, 2026 06:18
M6+M7 deleted expire_stack_intervals when it deleted the linear
scan core, since neither chordal backend called it. That removed
stack slot reuse: every alloc_stack_slot request now created a
fresh slot because the free_stack_slots pool was never populated.

On x86_64 this is wasteful but harmless: 32-bit displacements
absorb huge frames. On aarch64 it broke 16 int128 codegen tests
on CI: stp/ldp accept signed 7-bit immediate offsets (range
[-512, 504] in multiples of 8), and int128-heavy tests pushed
offsets to #3472 (217 * 16 bytes, i.e. 217 unique 16-byte slots,
where 20-40 would have sufficed with reuse).

Fix: re-introduce expire_stack_intervals in cc/arch/regalloc.rs
and call it from both chordal allocators:

- Phase 1: at the top of each iteration with interval.start.
  Intervals come pre-sorted by start position, so this monotonic
  sweep matches the previous linear-scan-era expiration shape.
- Phase 3: with usize::MAX just before spill commits, draining
  all remaining slots so the chordal-spilled pseudos can reuse
  any non-interfering slot.

The existing try_reuse_stack_slot uses interference checks (via
pseudos_interfere over live_in/live_out from the dataflow
fixpoint), so reuse is correct for any iteration order.
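
The effect of the expiry sweep on frame size can be sketched with a toy slot pool. This is a simplification under stated assumptions: `SlotPool`, `frame_size`, and the fixed 16-byte slot are hypothetical, and reuse here is decided purely by interval end position, whereas the commit message says the real `try_reuse_stack_slot` also checks interference.

```rust
/// Hypothetical pool of 16-byte stack slots.
#[derive(Default)]
pub struct SlotPool {
    /// (end position, slot offset) for intervals that still hold a slot.
    active: Vec<(usize, i32)>,
    /// Offsets of slots whose interval has expired.
    free: Vec<i32>,
    /// Total bytes handed out so far.
    next_offset: i32,
}

impl SlotPool {
    /// Return the slots of intervals that ended strictly before `pos`.
    pub fn expire(&mut self, pos: usize) {
        let mut i = 0;
        while i < self.active.len() {
            if self.active[i].0 < pos {
                let (_, slot) = self.active.swap_remove(i);
                self.free.push(slot);
            } else {
                i += 1;
            }
        }
    }

    /// Hand out a freed slot if one exists, otherwise grow the frame.
    pub fn alloc(&mut self, end: usize) -> i32 {
        let slot = match self.free.pop() {
            Some(s) => s,
            None => {
                let s = self.next_offset;
                self.next_offset += 16;
                s
            }
        };
        self.active.push((end, slot));
        slot
    }
}

/// Frame bytes needed for `(start, end)` intervals sorted by start,
/// with or without the expiry sweep before each allocation.
pub fn frame_size(intervals: &[(usize, usize)], reuse: bool) -> i32 {
    let mut pool = SlotPool::default();
    for &(start, end) in intervals {
        if reuse {
            pool.expire(start);
        }
        pool.alloc(end);
    }
    pool.next_offset
}
```

With the intervals `(0,3)`, `(1,4)`, `(5,8)`, `(6,9)`, skipping the sweep hands out four slots, while sweeping at each start position lets the last two intervals reuse the first two slots.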

Latent issue noted (not fixed here, separate concern): aarch64
codegen helpers like emit_int128_move_to_stack push raw stp/ldp
without large-offset fallback. Even with reuse, frames close to
the [-512, 504] threshold will still hit this. The fix is an
add-scratch-then-stp pattern in the affected emitters; tracked
for a follow-up rather than bundled into this regression fix.

Verified:
- cc test suite 914+948+204 pass on x86_64 host
- cargo build / clippy / fmt clean
- aarch64 verification deferred to macOS CI

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aarch64 stp/ldp accept signed 7-bit immediate offsets scaled by
element size: [-512, +504] step 8 for B64, [-256, +252] step 4
for B32. Body-emitted pair instructions (int128 moves, spilled
arg stores, int128 call args) bypassed this constraint and
emitted raw stp/ldp with whatever offset stack_mem returned. On
deep frames the assembler rejected them ("index must be a
multiple of 8 in range [-512, 504]"). The prologue/epilogue
already had hand-rolled large-frame splits; the body sites did
not.

Centralized fix: a pair-address legalization helper on the
Aarch64CodeGen. Three new internal methods, two new public
emitters:

  fn pair_offset_fits(offset, size) -> bool
  fn emit_add_offset(dst, base, offset)
  fn legalize_pair_addr(size, addr) -> MemAddr
  pub fn emit_stp_legalized(size, src1, src2, addr)
  pub fn emit_ldp_legalized(size, addr, dst1, dst2)

`legalize_pair_addr` is a no-op for in-range BaseOffset and for
PreIndex / PostIndex (both are exclusive to the prologue /
epilogue, which retain their own split logic). For out-of-range
BaseOffset it emits `add X16, base, #offset` and rewrites the
addr to `[X16]`.
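
The range check and the rewrite described above can be sketched as follows. This is an illustration under assumptions, not the Aarch64CodeGen code: the `MemAddr` variants carry plain register numbers, the scale is passed as a byte count (8 for B64, 4 for B32) rather than a size enum, instructions are collected as text in a `Vec<String>`, and only a single positive `add` immediate is modelled.

```rust
/// Hypothetical address forms, loosely modelled on the variants named above.
#[derive(Debug, Clone, PartialEq)]
pub enum MemAddr {
    Base(u8),
    BaseOffset(u8, i64),
    PreIndex(u8, i64),
    PostIndex(u8, i64),
}

/// X16 (AAPCS64 IP0), used as the address scratch register.
pub const SCRATCH: u8 = 16;

/// True when `offset` is a multiple of `scale` and `offset / scale`
/// fits the signed 7-bit immediate of stp/ldp (-64..=63).
pub fn pair_offset_fits(offset: i64, scale: i64) -> bool {
    offset % scale == 0 && (-64..=63).contains(&(offset / scale))
}

/// Return an address that stp/ldp can encode. Out-of-range BaseOffset
/// is materialized into the scratch register; everything else passes through.
pub fn legalize_pair_addr(scale: i64, addr: MemAddr, out: &mut Vec<String>) -> MemAddr {
    match addr {
        MemAddr::BaseOffset(base, off) if !pair_offset_fits(off, scale) => {
            // Simplification: a real emitter would also have to handle negative
            // offsets and immediates too large for a single `add`.
            out.push(format!("add x{}, x{}, #{}", SCRATCH, base, off));
            MemAddr::Base(SCRATCH)
        }
        other => other,
    }
}
```

For the failing case quoted earlier, an offset of 3472 from x29 would become `add x16, x29, #3472` followed by a pair access through `[x16]`, while an offset of 504 is left untouched.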

Scratch register choice: X16 (AAPCS64 IP0). Never in the
allocator palette, never used by other codegen helpers as a
*data* shuttle (they use x9–x11). The legalization convention is
documented next to the helper: X16 is clobbered iff the
legalizer fires; callers must not rely on X16 staying live past
the emit call. In practice every site follows the pattern
"compute source addr (may use X16) → load into X9/X10 →
legalize destination addr (may reuse X16) → store" — by the
time the destination's legalization runs, the source's use of
X16 is dead.

Migrated body emit sites:
  cc/arch/aarch64/codegen.rs:
    - emit_int128_move_to_stack (3 sites — the int128 store
      pattern that actually triggered the CI failure)
    - emit_int128_imm_store (1 site)
    - emit_load int128 path (2 sites)
    - emit_cbr int128 path (1 site)
    - emit_return int128 lowering (1 site)
    - int128 GP-pair arg storage (1 site)
  cc/arch/aarch64/expression.rs:
    - load_int128, store_int128 (2 sites)
  cc/arch/aarch64/call.rs:
    - int128 call-arg setup (2 sites)

Left as raw push_lir, all justified:
  - Prologue/epilogue PreIndex / PostIndex stp/ldp — already
    handle their own large-frame splits via emit_sub_sp_imm /
    emit_add_sp_imm.
  - zero_stack_frame — only enters the stp branch when offsets
    fit (the code guards on `max_stp_offset <= 504`).
  - Callee-saved save/restore — offsets bounded by the callee-
    saved set (≤288 bytes max).
  - Sites where the address is `MemAddr::Base(X16)` after
    emit_load_addr — no offset to legalize.

FP pair helpers (emit_stp_fp_legalized / emit_ldp_fp_legalized)
intentionally not introduced — every current StpFp / LdpFp site
is bounded. A code comment marks where to add them when an FP
pair instruction grows a body emission with possibly-large
offset.

Verified:
- cc test suite 914+948+204 pass on x86_64 host
- cargo build / clippy / fmt clean
- aarch64 verification deferred to macOS CI

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jgarzik jgarzik merged commit 3a61979 into main May 31, 2026
9 checks passed
@jgarzik jgarzik deleted the updates branch May 31, 2026 09:53