Skip irrelevant foreign impls when building the specialization graph #157281

Draft
xmakro wants to merge 1 commit into rust-lang:main from xmakro:perf/spec-graph-skip-foreign-impls

xmakro commented Jun 2, 2026

What this does

specialization_graph_provider enumerates every impl of a trait, including all foreign impls, and records each one into the graph. For a leaf crate this is dominated by foreign impls: a crate that locally defines a handful of Debug/Clone impls still pulls in every foreign impl Debug for <std type>, decoding each one's trait ref via impl_trait_ref and reading impl_parent.

This skips foreign non-blanket impls that cannot affect local coherence. The work list is built from trait_impls_of and keeps:

  • all blanket impls, and
  • every non-blanket impl in a simplified-self bucket that contains at least one local impl.

Foreign impls whose bucket holds no local impl are never recorded.
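
The selection rule above can be sketched on a toy model. This is a simplified illustration, not the compiler's code: `ImplId`, `TraitImpls` and `work_list` are hypothetical names, the simplified self type is reduced to a string, and only the two rules listed above are modelled.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an impl's identity; `local` marks impls
/// defined in the crate being compiled.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ImplId {
    pub index: u32,
    pub local: bool,
}

/// Toy model of the per-trait impl table: blanket impls, plus non-blanket
/// impls bucketed by simplified self type (a plain string in this sketch).
pub struct TraitImpls {
    pub blanket: Vec<ImplId>,
    pub buckets: BTreeMap<&'static str, Vec<ImplId>>,
}

/// Keep every blanket impl, and every impl in a bucket that contains at
/// least one local impl. Buckets holding only foreign impls are skipped.
pub fn work_list(impls: &TraitImpls) -> Vec<ImplId> {
    let mut kept = impls.blanket.clone();
    for bucket in impls.buckets.values() {
        if bucket.iter().any(|i| i.local) {
            // The whole bucket is kept, foreign members included.
            kept.extend(bucket.iter().copied());
        }
    }
    kept
}
```

With one foreign blanket impl, a bucket of two foreign impls, and a bucket mixing one local and one foreign impl, `work_list` returns the blanket impl plus both members of the mixed bucket and drops the all-foreign bucket.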

Why it is sound

Overlap checking of a local impl only consults blanket impls and the non-blanket impls sharing its simplified self type (filtered_children). The orphan rules forbid a local blanket impl of a foreign trait, so a foreign non-blanket impl can only matter to local coherence if its bucket also contains a local impl, and those buckets are kept in full.

The graph's parent map is read in exactly one place, Ancestors::next. For a foreign impl the parent is now read lazily from crate metadata via impl_parent instead of from the graph, consistent with how foreign parent chains are already resolved for impls reachable only as ancestors. Specialization preserves the head constructor (a specializing impl shares the parent's simplified self type, or the parent is blanket), so the parent of any kept non-blanket impl is itself always kept and no graph node becomes unreachable.
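
The lazy parent lookup can be illustrated with a small model. This is a sketch under stated assumptions, not the actual `Ancestors` implementation: the types are hypothetical, the `foreign_parent` closure stands in for the `impl_parent` metadata read, and the real iterator's final trait node is omitted.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an impl's identity.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ImplId {
    pub index: u32,
    pub local: bool,
}

/// Toy graph: `parent` only has entries for impls that were recorded.
pub struct Graph {
    pub parent: HashMap<ImplId, ImplId>,
}

/// Walks from an impl up through its specialization parents.
pub struct Ancestors<'a, F: Fn(ImplId) -> Option<ImplId>> {
    graph: &'a Graph,
    /// Assumed stand-in for reading `impl_parent` from crate metadata.
    foreign_parent: F,
    current: Option<ImplId>,
}

impl<'a, F: Fn(ImplId) -> Option<ImplId>> Ancestors<'a, F> {
    pub fn new(graph: &'a Graph, start: ImplId, foreign_parent: F) -> Self {
        Ancestors { graph, foreign_parent, current: Some(start) }
    }
}

impl<'a, F: Fn(ImplId) -> Option<ImplId>> Iterator for Ancestors<'a, F> {
    type Item = ImplId;

    fn next(&mut self) -> Option<ImplId> {
        let cur = self.current.take()?;
        self.current = if cur.local {
            // Local impls are always recorded, so the graph has the answer.
            self.graph.parent.get(&cur).copied()
        } else {
            // Foreign impls may have been skipped: ask metadata instead.
            (self.foreign_parent)(cur)
        };
        Some(cur)
    }
}
```

In this model a chain that starts at a local impl and continues through foreign parents is still walked in full, even when none of the foreign parents appear in `Graph::parent`.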

What was measured

Two stage1 librustc_driver shared objects were built from this tree, one at the parent commit (baseline) and one with the change. Only the .so is swapped between runs, so nothing else differs. The metric is callgrind instruction reads (Ir), which is deterministic (no run-to-run noise). Ir is attributed to the rustc process with the largest Ir, which is the crate's own rustc --crate-name <crate> invocation; build scripts and proc-macro host compiles are excluded. Two scenarios per crate:

  • From scratch (CARGO_INCREMENTAL=0): dependencies prebuilt, then the crate compiled cold (cargo check to warm deps, touch the crate's own sources, cargo check under callgrind). This models clean builds and CI, where every crate is compiled from source.
  • Incremental, unchanged (CARGO_INCREMENTAL=1): build once to populate the on-disk incremental cache, then bump source mtimes only (no content change) and rebuild under callgrind. This models the steady-state warm edit loop where the previous session's caches are reused.

From scratch

| crate | baseline Ir | with change Ir | delta Ir | delta (%) |
|---|---|---|---|---|
| tokio 1.38 | 810,555,980 | 754,459,060 | -56,096,920 | -6.92% |
| regex 1.10 | 1,247,973,341 | 1,183,361,228 | -64,612,113 | -5.18% |
| ripgrep 14.1 | 3,125,603,677 | 3,058,667,865 | -66,935,812 | -2.14% |
| syn 2.0 | 5,532,483,754 | 5,481,826,160 | -50,657,594 | -0.92% |
| rayon 1.10 | 9,131,456,998 | 9,073,720,082 | -57,736,916 | -0.63% |
| serde 1.0 | 12,074,820,787 | 12,027,900,440 | -46,920,347 | -0.39% |
| wasmi 0.35 | 24,118,407,063 | 24,052,136,488 | -66,270,575 | -0.28% |

The absolute saving is similar across crates (roughly 47M to 67M Ir); the percentage tracks how foreign-impl-heavy a crate is relative to its own code.
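
The percentage column is the Ir delta divided by the baseline, so a near-constant absolute saving shrinks as a fraction of a larger compile. A minimal sketch of that arithmetic; `percent_delta` is a hypothetical helper name used only for illustration:

```rust
/// Relative change in percent; negative when the changed build executes
/// fewer instructions than the baseline.
pub fn percent_delta(baseline_ir: u64, changed_ir: u64) -> f64 {
    (changed_ir as f64 - baseline_ir as f64) / baseline_ir as f64 * 100.0
}
```

Applied to the tokio row (810,555,980 to 754,459,060) this gives about -6.92%, and applied to the serde row (12,074,820,787 to 12,027,900,440) about -0.39%, although the two absolute savings differ by less than 10M Ir.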

Incremental, unchanged

| crate | baseline Ir | with change Ir | delta Ir | delta (%) |
|---|---|---|---|---|
| regex 1.10 | 429,085,543 | 381,953,940 | -47,131,603 | -10.98% |
| tokio 1.38 | 382,531,263 | 342,721,889 | -39,809,374 | -10.41% |
| ripgrep 14.1 | 1,027,211,095 | 974,595,117 | -52,615,978 | -5.12% |

The warm rebuild shows a larger percentage even though specialization_graph_of is cache_on_disk and its result is reloaded rather than recomputed. The saving here is not from the provider: recording each foreign impl creates an impl_parent and an impl_trait_header dep-graph node, and skipping them leaves the serialized dependency graph many thousands of nodes smaller, so try_mark_previous_green has less to validate on every incremental session. The absolute saving is comparable to the from-scratch case, against a much smaller total, hence the higher percentage.

Wall clock

The Ir figures above are the deterministic signal. Native cargo check wall time was also measured the same way (same .so hot-swap, baseline and with-change trials interleaved so thermal drift cancels), reporting the median and best (min) per crate. Wall time carries real run-to-run noise, so it is weaker evidence than Ir, but it confirms the direction and rough size.
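
For reference, the two summary statistics reported for wall time can be computed as follows. This is a generic sketch, not the script used for the measurements; `median` and `best` are hypothetical helper names.

```rust
/// Median of a non-empty slice of timings in seconds.
pub fn median(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty(), "median needs at least one sample");
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        // Even count: average the two middle samples.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    }
}

/// Best (minimum) timing, the sample least disturbed by system noise.
pub fn best(samples: &[f64]) -> f64 {
    samples.iter().copied().fold(f64::INFINITY, f64::min)
}
```

The median resists occasional slow outliers, while the minimum approximates the undisturbed run, which is why both are useful when the trial count is odd and small.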

Incremental, unchanged (re-check after bumping source mtimes, 21 trials):

| crate | baseline median | with change median | median delta | best delta |
|---|---|---|---|---|
| regex 1.10 | 0.0700 s | 0.0623 s | -11.0% | -7.4% |
| tokio 1.38 | 0.0718 s | 0.0690 s | -3.9% | -6.2% |
| ripgrep 14.1 | 0.1586 s | 0.1519 s | -4.2% | -3.9% |
| serde 1.0 | 0.3562 s | 0.3506 s | -1.6% | -1.7% |
| syn 2.0 | 0.2327 s | 0.2346 s | +0.8% | -0.2% |

From scratch (clean check of the crate plus its dependencies, 7 trials):

| crate | baseline median | with change median | median delta | best delta |
|---|---|---|---|---|
| tokio 1.38 | 0.1734 s | 0.1661 s | -4.2% | -3.8% |
| ripgrep 14.1 | 3.3812 s | 3.3373 s | -1.3% | -1.3% |
| regex 1.10 | 2.4611 s | 2.4494 s | -0.5% | -0.4% |

The incremental, unchanged wall time tracks the Ir saving on the foreign-impl-heavy crates (regex about -11%, ripgrep and tokio in the -4 to -6% range) and falls into the noise floor on the largest crates (serde about -1.6%, syn indistinguishable from zero). This matches the deterministic Ir: the absolute saving is a fixed few tens of millions of Ir, so it shrinks as a fraction of a larger rebuild.

The from-scratch wall time is the whole-project check, so the saving on any single crate's front end is diluted by recompiling every dependency and by cargo's own overhead. tokio, whose default-feature check is essentially just the tokio crate, keeps most of the effect (about -4%); regex and ripgrep, which pull in several dependency crates, dilute to roughly -0.5 to -1.3%.

Check vs build

The work removed is front-end coherence and trait analysis, so a full cargo build sees the same absolute saving, with codegen then diluting the percentage. The figures above are cargo check, so they are the upper bound on the proportional effect.

These are local measurements (deterministic callgrind Ir plus native wall clock); marking the PR as draft so a perf run can confirm on the full suite.

Testing

  • The specialization, coherence, traits, negative-impls, associated-types, impl-trait, auto-traits and marker_trait_attr UI suites pass locally with no failures (3264 tests).
  • The standard library, which uses min_specialization, builds cleanly with the modified compiler.
  • cargo check diagnostics are byte-identical to baseline on regex, syn, serde and ripgrep.

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 2, 2026

The specialization graph is built to overlap-check local impls and to
walk specialization parent chains. A local non-blanket impl is only
overlap-checked against blanket impls and against non-blanket impls that
share its simplified self type, and the orphan rules forbid local
blanket impls of foreign traits. A foreign non-blanket impl is therefore
irrelevant to local coherence unless its simplified-self bucket also
contains a local impl.

Build the work list from `trait_impls_of` and skip foreign non-blanket
impls whose bucket holds no local impl, instead of enumerating every
impl of the trait. For a typical leaf crate this skips the large
majority of impls (for example the many foreign `Debug` and `Clone`
impls), avoiding the `impl_trait_ref` decode and `impl_parent` read that
recording each one would otherwise require.

A local blanket impl is the one exception: it is overlap-checked against
every child rather than just its own bucket, so the foreign non-blanket
impls are kept whenever one is present. Orphan rules forbid local
blanket impls of foreign traits, so this only retains them for crates
that are already going to be rejected, where it avoids reporting a
spurious extra overlap on top of the orphan error.

Specialization parents of foreign impls are now resolved lazily from
crate metadata in `Ancestors::next`, since a skipped foreign impl is no
longer present in the graph's `parent` map. This is sound because
specialization preserves the head constructor, so the parent of any kept
non-blanket impl is itself always kept.
@xmakro xmakro force-pushed the perf/spec-graph-skip-foreign-impls branch from 290c594 to 52e8f33 Compare June 2, 2026 16:55