Skip to content

perf: compute LogUp fingerprints in parallel chunks#548

Merged
MauroToscano merged 6 commits into
mainfrom
logup-chunked-fingerprints
Apr 24, 2026
Merged

perf: compute LogUp fingerprints in parallel chunks#548
MauroToscano merged 6 commits into
mainfrom
logup-chunked-fingerprints

Conversation

@nicole-graus
Copy link
Copy Markdown
Collaborator

Replaces the per-column parallel dispatch in LogUp fingerprint computation with par_chunks_mut(1024), processing fingerprints, batch-inversion, and term accumulation per chunk for better L2 cache locality.

Optimization extracted from PR #518.

@github-actions
Copy link
Copy Markdown

Codex Code Review

  • Medium - Performance regression in parallel builds at crypto/stark/src/lookup.rs and crypto/stark/src/lookup.rs. The old code used into_par_iter() over committed interaction pairs, so many pairs could run concurrently even when the trace was short. The new path serializes pairs and only parallelizes inside each pair by fixed LOGUP_CHUNK_SIZE = 1024. For tables with many interaction pairs but trace_len <= 1024 (or only a few chunks), this effectively collapses the work to one pair at a time and can be a substantial throughput regression. Keep pair-level parallelism, or switch between pair-parallel and chunk-parallel based on trace_len / number of pairs.

No security or correctness issues stood out in the diff.

I couldn’t run a full cargo check here because the sandbox blocks rustup from creating temp files under /home/runner/.rustup.

Comment thread crypto/stark/src/lookup.rs
Comment thread crypto/stark/src/lookup.rs Outdated
@claude
Copy link
Copy Markdown

claude Bot commented Apr 23, 2026

Review: perf: compute LogUp fingerprints in parallel chunks

Overview: Replaces per-column into_par_iter dispatch with par_chunks_mut(1024), so each Rayon task processes a contiguous block of rows (fingerprint computation → batch inversion → term accumulation) for better L2 cache locality. The change is architecturally sound and the algorithm is mathematically equivalent to the sequential path.


Medium — Dead code rather than removal

The original compute_logup_term_column and compute_logup_batched_term_column are now unreachable when parallel is enabled, silenced with #[cfg_attr(feature = "parallel", allow(dead_code))]. This leaves ~300 lines of production code that can silently diverge from the new chunked versions. The right fix is to drop the allow(dead_code) attributes, use those functions as correctness oracles in unit tests (assert parallel output == sequential output on a small trace), then delete them once confidence is established. See inline comment at line 1397.

Low — No test for parallel ↔ sequential equivalence

There are no unit tests verifying that compute_logup_batched_term_column_chunked / compute_logup_term_column_chunked produce field-element-identical output to their sequential counterparts. Given this is cryptographic arithmetic, even a small trace (e.g. 16 rows) test would give good confidence and is cheap to write.

Nit — _table_name computed unconditionally

let _table_name = self.name.as_deref().unwrap_or("UNKNOWN"); is always evaluated but only referenced in the #[cfg(not(feature = "parallel"))] branch. Harmless (cheap string slice), but a matching #[cfg(not(feature = "parallel"))] guard on the binding would make the intent explicit.


Overall the implementation looks correct: per-row fingerprint computation is independent so chunk-local batch inversion is equivalent to the global one, disjoint par_chunks_mut slices are race-free, and Send + Sync bounds are correctly propagated. The main ask is replacing the dead-code suppression with an actual equivalence test.

@nicole-graus
Copy link
Copy Markdown
Collaborator Author

/bench 10

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 23, 2026

Benchmark — fib_iterative_8M (median of 10)

Table parallelism: 32 (auto = cores / 3)

Metric main PR Δ
Peak heap 62014 MB 63562 MB +1548 MB (+2.5%) ⚪
Prove time 30.985s 30.088s -0.897s (-2.9%) ⚪

✅ No significant change.

⚠️ Heap spread: 7.8% (60766 MB / 62303 MB / 65744 MB / 65377 MB / 62343 MB / 63562 MB / 63657 MB / 61191 MB / 64536 MB / 63726 MB)
⚠️ Baseline heap spread: 7.3% (62014 MB / 60897 MB / 65412 MB) — comparison may be less reliable
Consider re-running /bench

Commit: 7b91793 · Baseline: cached · Runner: self-hosted bench

@nicole-graus nicole-graus marked this pull request as ready for review April 24, 2026 15:11
@github-actions
Copy link
Copy Markdown

Codex Code Review

  1. Medium - build_auxiliary_trace() now disables parallelism across committed pairs whenever trace_len > LOGUP_CHUNK_SIZE (crypto/stark/src/lookup.rs), and relies only on the per-pair chunked worker (crypto/stark/src/lookup.rs). That assumption is too coarse: for traces only slightly above 1024 rows, each pair produces only 2-3 Rayon tasks, so a table with many committed pairs can regress sharply versus the old into_par_iter() over pairs. This is a real throughput regression for medium-size, interaction-heavy tables. The scheduling heuristic should consider both trace_len and num_committed_pairs (or total chunk count), not just trace_len.

No security or correctness issues stood out in the diff beyond that. I couldn’t run cargo test here because cargo/rustup tries to write under a read-only home in this environment.

Comment thread crypto/stark/src/lookup.rs
Comment thread crypto/stark/src/lookup.rs Outdated
Comment thread crypto/stark/src/lookup.rs Outdated
@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

Review: perf: compute LogUp fingerprints in parallel chunks

Correctness: pass. The chunked functions are mathematically equivalent to the originals. Per-chunk batch inversion is independent between chunks (each chunk's fingerprints are self-contained), and the two new tests confirm output parity.

Architecture is sound. The two-regime strategy (parallelize across pairs when trace fits one chunk; parallelize within each pair via par_chunks_mut when trace is large) is a reasonable trade-off for cache locality.


Issues

Medium – duplicate Multiplicity match arms (see inline at line 1791)
compute_multiplicity_for_row is a third copy of the same exhaustive match multiplicity { … } that already lives in compute_multiplicities and inside compute_logup_batched_term_column. Adding a new Multiplicity variant requires updating all three independently; the compiler won't catch a mismatch. A single shared helper (or a method on Multiplicity) would eliminate this risk.

Low – missing debug-checks instrumentation in batched chunked path (see inline at line 1796)
compute_logup_batched_term_column_chunked has no bus-debugger logging when debug-checks is enabled, while the single-interaction chunked variant does. Batch-absorbed interactions become invisible to the bus debugger in parallel builds.

Low – allow(dead_code) on the reference implementations (see inline at line 1422)
The original compute_logup_term_column / compute_logup_batched_term_column now exist only as test oracles in parallel builds. Consider marking them explicitly as reference implementations, or ensuring they are also exercised in non-parallel CI.


Minor notes

  • LOGUP_CHUNK_SIZE = 1024 has no documented tuning rationale. A one-line comment (e.g. field element size × 2 fingerprint vectors × 1024 rows ≈ ~96 KB — fits within typical L2) would help future tuning.
  • LinearTerm::Column and LinearTerm::ColumnUnsigned arms in compute_multiplicity_for_row look identical but are correct: the From<i64>/From<u64> dispatch handles signed negation transparently. A brief comment would prevent a future reader from incorrectly collapsing them.

@MauroToscano
Copy link
Copy Markdown
Contributor

/bench 10

@MauroToscano MauroToscano added this pull request to the merge queue Apr 24, 2026
Merged via the queue into main with commit 61a1eca Apr 24, 2026
10 checks passed
@MauroToscano MauroToscano deleted the logup-chunked-fingerprints branch April 24, 2026 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants