
SHA-256: Initial constraint optimizations#247

Merged
Bisht13 merged 20 commits into main from rs/sha_optimization
Jan 19, 2026

Collaborator

@rose2221 rose2221 commented Dec 25, 2025

Summary

Optimizes SHA256 compression constraint generation through byte-level operations and fused constraints.

Results:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Constraints | 432,976 | 152,699 | -64.73% |
| Constraints (per SHA256 compression call) | 39,293 | 20,907 | -46.79% |

SHA256 Compression – R1CS Cost Breakdown

| Component | Constraints | Witnesses |
| --- | --- | --- |
| Direct | 3,600 | 13,364 |
| AND + XOR | 144,389 | 212,998 |
| RANGE | 4,651 | 5,145 |
| Total | 152,640 | 231,507 |

Marginal SHA256 Call

| Component | Constraints / call | Witnesses / call |
| --- | --- | --- |
| Direct | 3,600 | 13,364 |
| AND + XOR | 13,312 | 16,384 |
| RANGE | 3,995 | 4,169 |
| Total per SHA256 | 20,907 | 33,917 |

Key Optimizations:

1. New Type System for Byte-Level Operations

  • Introduced U8 and U32 wrapper types to replace raw usize witness indices
  • U32 represents a 32-bit word as 4 little-endian U8 bytes
  • Each U8 tracks whether it has been range-checked via a range_checked flag
  • Enables byte-level operations that avoid expensive 32-bit digital decomposition
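A minimal sketch of what such wrapper types could look like. Only the names U8, U32, and range_checked come from the description above; the `idx` and `bytes` fields and the `rotr_bytes` helper are hypothetical and may differ from the actual code in this PR.

```rust
/// Hypothetical wrapper around the index of a witness that holds one byte.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct U8 {
    /// Witness index of the byte (assumed field name).
    pub idx: usize,
    /// True once the witness is known to lie in [0, 255].
    pub range_checked: bool,
}

/// Hypothetical 32-bit word stored as four little-endian bytes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct U32 {
    pub bytes: [U8; 4],
}

impl U32 {
    /// Rotating right by whole bytes only reorders witness indices,
    /// so it needs no constraints and no new witnesses.
    pub fn rotr_bytes(self, n: usize) -> U32 {
        let mut out = self.bytes;
        for i in 0..4 {
            // Output byte i takes the input byte n positions higher (mod 4).
            out[i] = self.bytes[(i + n) % 4];
        }
        U32 { bytes: out }
    }
}
```

Because the word is already held as bytes, a rotation by 8, 16, or 24 bits costs nothing; only the sub-byte part of a rotation needs constraints.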

2. ROTR with Fused Constraints

  • Implemented ROTR using free byte reordering, byte partitioning into lo and hi via a new BytePartition witness builder, and a single fused constraint per output byte: result * 2^k + lo = byte + lo_next * 256
    This achieves 4 constraints per ROTR (one per byte) instead of multiple decomposition constraints.
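The fused relation can be checked at the value level. The sketch below works on plain integers rather than witnesses; the function name `rotr_bytes_checked` is illustrative and not taken from the PR. It rotates a little-endian byte array right by `r` bits and asserts result * 2^k + lo = byte + lo_next * 256 for each output byte.

```rust
/// Rotate four little-endian bytes right by `r` bits (0 < r < 32, r % 8 != 0),
/// asserting the fused relation for every output byte.
fn rotr_bytes_checked(bytes: [u8; 4], r: u32) -> [u8; 4] {
    let q = (r / 8) as usize; // whole-byte part: free reordering
    let k = r % 8;            // sub-byte part: needs the fused constraint
    assert!(k != 0, "pure byte rotations need no constraint at all");
    let mask = (1u32 << k) - 1;
    let mut out = [0u8; 4];
    for i in 0..4 {
        // Source byte and its upper neighbour after the byte reordering.
        let byte = bytes[(i + q) % 4] as u32;
        let next = bytes[(i + q + 1) % 4] as u32;
        let lo = byte & mask;      // low k bits of the source byte
        let lo_next = next & mask; // low k bits of the neighbour
        // Output byte: high bits of `byte` placed below the low bits of `next`.
        let result = (byte >> k) + (lo_next << (8 - k));
        // Fused constraint from the description above.
        assert_eq!(result * (1u32 << k) + lo, byte + lo_next * 256);
        out[i] = result as u8;
    }
    out
}
```

The relation holds because byte = lo + hi * 2^k and result = hi + lo_next * 2^(8-k); multiplying the second equation by 2^k and substituting hi * 2^k = byte - lo gives the fused form.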

3. SHR Implementation with Fused Constraints

  • Similar optimization to ROTR but handles zero-fill for shifted-out bits.
  • Byte-level shift is free (reordering + zero padding)
  • Uses same fused constraint approach for intra-byte bit shifting
  • MSB byte has special handling: result * 2^k + lo = byte (no next byte term)
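A value-level sketch of the SHR case, under the same caveat that it uses plain integers and an illustrative function name (`shr_bytes_checked`) rather than the PR's witness-based code. It shows the zero-filled bytes and the MSB special case with no next-byte term.

```rust
/// Shift four little-endian bytes right by `s` bits (0 < s < 32, s % 8 != 0),
/// asserting the fused relation for every non-zero-filled output byte.
fn shr_bytes_checked(bytes: [u8; 4], s: u32) -> [u8; 4] {
    let q = (s / 8) as usize;
    let k = s % 8;
    assert!(s < 32 && k != 0);
    let mask = (1u32 << k) - 1;
    let mut out = [0u8; 4];
    for i in 0..4 {
        let src = i + q;
        if src >= 4 {
            continue; // shifted past the top: output byte stays zero
        }
        let byte = bytes[src] as u32;
        let lo = byte & mask;
        let result = if src + 1 < 4 {
            // Interior byte: same relation as ROTR.
            let lo_next = (bytes[src + 1] as u32) & mask;
            let result = (byte >> k) + (lo_next << (8 - k));
            assert_eq!(result * (1u32 << k) + lo, byte + lo_next * 256);
            result
        } else {
            // MSB byte: nothing above it, so there is no lo_next term.
            let result = byte >> k;
            assert_eq!(result * (1u32 << k) + lo, byte);
            result
        };
        out[i] = result as u8;
    }
    out
}
```

SHA-256's message schedule uses right shifts by 3 and 10 bits, which exercise both the pure sub-byte case and the byte-plus-bits case.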

4. Range Check Optimization

  • Only range-checks lo during byte partitioning. Soundness is preserved because the partitioned high bits (hi) contribute to a derived output byte via fused constraints, and that output byte is subsequently constrained to [0,255] by downstream byte-level lookup tables (e.g. AND/XOR). This reduces the number of explicit range checks in this context without sacrificing soundness.

5. New BytePartition WitnessBuilder

  • Added dedicated witness builder variant for splitting an 8-bit value at a bit boundary.
  • Computes lo (lower k bits) and hi (upper 8-k bits) such that: x = lo + hi * 2^k
  • Produces 2 witness values per invocation and is reused across ROTR and SHR operations.
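The split itself is simple bit arithmetic. A sketch of the computation, assuming 1 ≤ k ≤ 7; the free function `byte_partition` is a stand-in for illustration and is not the WitnessBuilder variant's actual signature.

```rust
/// Split a byte at bit boundary k so that x = lo + hi * 2^k.
/// Returns (lo, hi): the lower k bits and the upper 8 - k bits.
fn byte_partition(x: u8, k: u32) -> (u8, u8) {
    assert!(k >= 1 && k <= 7, "k must split the byte into two non-empty parts");
    let lo = x & ((1u8 << k) - 1); // keep the low k bits
    let hi = x >> k;               // drop the low k bits
    (lo, hi)
}
```

For example, splitting 0b1011_0110 (182) at k = 3 gives lo = 0b110 (6) and hi = 0b10110 (22), and 6 + 22 * 8 = 182.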

6. Chained U32 Addition with Fused Constraints

  • Replaced chained pairwise additions in T1 computation (h + Σ₁(e) + Ch(e,f,g) + K[i] + W[i]) with a single variadic addition using the fused constraint: a + b + c + d + e = result + carry * 2^32
  • Reduces to 1 constraint + 2 witnesses, with carry range-checks sized precisely based on input count.
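A value-level sketch of the variadic addition. The helper names `add_many` and `carry_bits` are hypothetical, and the carry sizing shown (enough bits for a carry of at most n - 1) is one plausible reading of "sized precisely based on input count", not a confirmed detail of the PR.

```rust
/// Add several 32-bit words at once.
/// Returns (result, carry) with sum(terms) = result + carry * 2^32.
fn add_many(terms: &[u32]) -> (u32, u64) {
    let sum: u64 = terms.iter().map(|&t| t as u64).sum();
    let result = sum as u32; // low 32 bits
    let carry = sum >> 32;   // everything above bit 31
    (result, carry)
}

/// Bits needed to range-check the carry of an n-term addition.
/// Each term is below 2^32, so the carry is at most n - 1.
fn carry_bits(n: usize) -> u32 {
    assert!(n >= 1);
    let max_carry = (n - 1) as u64;
    64 - max_carry.leading_zeros()
}
```

With five inputs the carry is at most 4, so a 3-bit range check suffices, whereas four chained pairwise additions would each need their own result and carry.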

7. Byte-Level Binary Operations (AND/XOR)

  • Kept SHA256 AND/XOR operations at the 8-bit level by introducing and_ops_byte and xor_ops_byte tracking and an is_byte_level flag in add_binop_constraints, avoiding unnecessary 32-bit decomposition while preserving the full decomposition path for general Noir blackbox ops.

8. Range Check & LogUp Optimizations

  • Introduced fused constraints for LogUp inverse computation, replacing the two-step denominator + inverse verification.
    • Before: compute the denominator (X - c·v) as its own witness, then verify denominator * inverse = 1 (2 constraints + 2 witnesses).
    • After: a single fused constraint, (X - c·v) * inverse = 1 (1 constraint + 1 witness).
  • Added a new WitnessBuilder::LogUpInverse variant that computes the inverse directly during batch inversion, eliminating the intermediate denominator witness.
  • Fused the log-derivative multiset equality check into a single constraint.
    • Before: Separate constraints for table sum, witness sum, and equality check (3 constraints).
    • After: Single constraint: (Σ table_terms - Σ witness_terms) = 0. Removes 2 constraints and 2 intermediate sum witnesses per range-check lookup.
  • Extended batch inversion support to include LogUpInverse.
    • LogUpInverse is deferred and batched alongside existing inverse operations.
    • Denominators (X - c·v) are computed inline during batch inversion.
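The batching step can be illustrated with Montgomery's batch-inversion trick over a toy prime field. Everything here is an assumption made for illustration: the modulus 2^61 - 1 stands in for the real proof-system field, and `logup_inverses` is a hypothetical name, not the WitnessBuilder API.

```rust
const P: u128 = (1u128 << 61) - 1; // toy prime modulus, not the real field

fn sub(a: u128, b: u128) -> u128 { (a + P - b) % P }
fn mul(a: u128, b: u128) -> u128 { a * b % P }

/// Modular exponentiation by squaring.
fn pow(mut base: u128, mut exp: u128) -> u128 {
    let mut acc = 1u128;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Invert every denominator (x - c*v) using a single field inversion.
/// All inputs must be reduced mod P and all denominators non-zero.
fn logup_inverses(x: u128, c: u128, values: &[u128]) -> Vec<u128> {
    // Denominators are computed inline and never stored as witnesses.
    let denoms: Vec<u128> = values.iter().map(|&v| sub(x, mul(c, v))).collect();
    // prefix[i] = product of denoms[0..i].
    let mut prefix = Vec::with_capacity(denoms.len());
    let mut acc = 1u128;
    for &d in &denoms {
        prefix.push(acc);
        acc = mul(acc, d);
    }
    // One inversion of the full product (Fermat's little theorem).
    let mut inv_acc = pow(acc, P - 2);
    let mut out = vec![0u128; denoms.len()];
    for i in (0..denoms.len()).rev() {
        out[i] = mul(inv_acc, prefix[i]);  // 1 / denoms[i]
        inv_acc = mul(inv_acc, denoms[i]); // drop denoms[i] from the product
    }
    out
}
```

Each returned value satisfies the fused constraint (X - c·v) * inverse = 1 directly, so no separate denominator witness is required.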

8. BinOp LogUp Constraint Fusion (binops.rs)

  • Replaced separate binop-side and table-side summations with a single fused sum using negated coefficients.
  • Enforces equality via one constraint: (Σ binop_terms − Σ table_terms) = 0.
  • Removes intermediate sum witnesses and reduces constraints per AND/XOR lookup.

9. Combined AND/XOR Lookup Table

  • Replaced separate lookup tables for AND and XOR operations with a single combined table.
  • Before: Two independent lookup tables (~196K constraints each).
  • After: One combined table using a 4-term encoding: sz − (lhs + rs·rhs + rs²·and + rs³·xor). Eliminates one full lookup table.
  • Savings: ~147K constraints.
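A sketch of the 4-term encoding over a toy prime field. The modulus 2^61 - 1 and the function names `encode` and `build_table` are assumptions for illustration; sz and rs play the role of the challenges in the expression above, and rows are ordered so that the row index is (lhs << 8) + rhs.

```rust
const P: u128 = (1u128 << 61) - 1; // toy prime modulus, not the real field

/// Evaluate sz - (lhs + rs*rhs + rs^2*and + rs^3*xor) mod P for one row.
/// `sz` and `rs` must already be reduced mod P.
fn encode(sz: u128, rs: u128, lhs: u8, rhs: u8) -> u128 {
    let and = (lhs & rhs) as u128;
    let xor = (lhs ^ rhs) as u128;
    let rs2 = rs * rs % P;
    let rs3 = rs2 * rs % P;
    let packed = (lhs as u128 + rs * (rhs as u128) + rs2 * and + rs3 * xor) % P;
    (sz + P - packed) % P
}

/// One combined table with 2^16 rows, covering both AND and XOR.
fn build_table(sz: u128, rs: u128) -> Vec<u128> {
    let mut table = Vec::with_capacity(1 << 16);
    for lhs in 0..=255u8 {
        for rhs in 0..=255u8 {
            table.push(encode(sz, rs, lhs, rhs));
        }
    }
    table
}
```

Since each row carries both the AND and the XOR result, a single lookup into this table can serve either operation, which is what removes the second table.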

10. Byte-Level Binary Operations (SHA256)

  • Avoided 32-bit decomposition for SHA256 AND/XOR by operating directly on U8 byte types.
  • Before: 32-bit AND/XOR required decomposition into 4 bytes followed by 4 separate 8-bit lookups.
  • After: SHA256 binops are executed directly at the byte level using U8, skipping decomposition entirely.
  • Eliminates decomposition constraints and intermediate witnesses per binary operation.

11. Inlined T1 / T2 Computation in SHA256 Rounds

  • Removed intermediate T1 and T2 values in SHA256 round computation.
  • Before:
      T1 = h + Σ₁(e) + Ch(e,f,g) + K[i] + W[i]
      T2 = Σ₀(a) + Maj(a,b,c)
      new_e = d + T1
      new_a = T1 + T2
  • After: new_e and new_a are computed directly with all terms inlined.
  • Savings: 12 witnesses per round (result, carry, and unpacked bytes for T1/T2) × 64 rounds = 768 witnesses per SHA256 compression.
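A value-level check that inlining gives the same round outputs. The Σ₀, Σ₁, Ch, and Maj definitions are the standard SHA-256 ones; the function names are illustrative, and the sketch uses plain integers instead of the PR's constraint-generating code.

```rust
fn big_sigma0(a: u32) -> u32 { a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22) }
fn big_sigma1(e: u32) -> u32 { e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25) }
fn ch(e: u32, f: u32, g: u32) -> u32 { (e & f) ^ (!e & g) }
fn maj(a: u32, b: u32, c: u32) -> u32 { (a & b) ^ (a & c) ^ (b & c) }

/// Round step with explicit T1 and T2; returns (new_a, new_e).
fn round_with_temps(s: [u32; 8], k: u32, w: u32) -> (u32, u32) {
    let [a, b, c, d, e, f, g, h] = s;
    let t1 = h
        .wrapping_add(big_sigma1(e))
        .wrapping_add(ch(e, f, g))
        .wrapping_add(k)
        .wrapping_add(w);
    let t2 = big_sigma0(a).wrapping_add(maj(a, b, c));
    (t1.wrapping_add(t2), d.wrapping_add(t1))
}

/// Same step with T1 and T2 inlined into two multi-term sums mod 2^32.
fn round_inlined(s: [u32; 8], k: u32, w: u32) -> (u32, u32) {
    let [a, b, c, d, e, f, g, h] = s;
    // The five T1 terms appear in both sums but are never reduced on their own.
    let t1_terms = [h, big_sigma1(e), ch(e, f, g), k, w];
    let base: u64 = t1_terms.iter().map(|&t| t as u64).sum();
    let new_e = (base + d as u64) as u32;
    let new_a = (base + big_sigma0(a) as u64 + maj(a, b, c) as u64) as u32;
    (new_a, new_e)
}
```

Reducing mod 2^32 once at the end is equivalent to reducing after every addition, which is why the intermediate T1 and T2 results (and their carries and unpacked bytes) can be dropped.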

12. Inline Table Entry Inverse (CombinedTableEntryInverse)

  • Optimized lookup inverse handling by inlining the denominator directly into the inverse constraint.
  • Before: Create denominator witness, then verify denominator × inverse = 1 (2 witnesses per entry).
  • After: Single fused constraint: (sz − lhs − rs·rhs − rs²·and − rs³·xor) × inv = 1. Eliminates the intermediate denominator witness.
  • Savings: 65,536 witnesses (one per table entry).

13. Pack Cache for U32 Operations

  • Introduced memoization for U32 packing operations.
  • Before: Identical U32 byte sequences could be repacked multiple times within a compression.
  • After: pack_cached() caches pack results keyed by byte indices and reuses them.
  • Avoids duplicate pack witnesses and constraints across SHA256 rounds.
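A minimal sketch of the memoization idea. Only the name pack_cached() and the "keyed by byte indices" detail come from the description; the `PackCache` struct, its fields, and the witness allocation scheme are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical compiler state that allocates witnesses and memoizes packs.
struct PackCache {
    next_witness: usize,
    packs_emitted: usize,
    cache: HashMap<[usize; 4], usize>,
}

impl PackCache {
    fn new(first_free_witness: usize) -> Self {
        PackCache {
            next_witness: first_free_witness,
            packs_emitted: 0,
            cache: HashMap::new(),
        }
    }

    /// Return the witness index of the packed 32-bit word for these four
    /// byte witnesses, creating it only on the first request.
    fn pack_cached(&mut self, byte_idxs: [usize; 4]) -> usize {
        if let Some(&idx) = self.cache.get(&byte_idxs) {
            return idx; // reuse: no new witness, no new constraint
        }
        let idx = self.next_witness;
        self.next_witness += 1;
        self.packs_emitted += 1; // a real compiler would emit the pack constraint here
        self.cache.insert(byte_idxs, idx);
        idx
    }
}
```

Keying on the ordered byte indices means a rotated word (same bytes, different order) is treated as a distinct pack, while a repeated request for the same word is free.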

@rose2221 rose2221 marked this pull request as draft December 25, 2025 01:27
@rose2221 rose2221 force-pushed the rs/sha_optimization branch 2 times, most recently from 6fd8fd0 to 5b78f63 Compare January 7, 2026 19:38
@rose2221 rose2221 marked this pull request as ready for review January 7, 2026 19:41
Comment thread provekit/r1cs-compiler/src/binops.rs Outdated
Comment thread provekit/r1cs-compiler/src/binops.rs Outdated
Comment thread provekit/r1cs-compiler/src/sha256_compression.rs
Comment thread provekit/common/src/witness/scheduling/dependency.rs Outdated
Comment thread provekit/r1cs-compiler/src/sha256_compression.rs
Collaborator

Bisht13 commented Jan 16, 2026

Nit: Remove unused import.

Rose Jethani and others added 16 commits January 17, 2026 14:49
…ion, skip zero-carry range checks, remove unused variants
…e-level ops and optimizing witness decomposition
@rose2221 rose2221 force-pushed the rs/sha_optimization branch from a834994 to e331ec2 Compare January 17, 2026 09:58
Bisht13 and others added 4 commits January 18, 2026 19:18
Use range_ops_total from R1CS breakdown instead of ACIR RANGE opcode
count when calculating non-SHA256 range checks. The previous code
incorrectly used ACIR-level counts which don't match R1CS-level
range operations.

The combined AND/XOR lookup table operates on 8-bit operands. When handling
BlackBoxFuncCall::AND/XOR with constant operands, the code was pushing
full 32-bit constants directly to the ops vectors instead of decomposing
them into 4 bytes first.

This caused index out of bounds panics in the witness builder when
computing multiplicities: the index was computed as (lhs << 8) + rhs,
expecting 8-bit values (max index 65535), but receiving 32-bit values
(producing indices like 3389742323).

Fix: Decompose constant operands into [u8; 4] bytes and push byte-level
constant operations to the lookup table, matching the expected byte-level
semantics.

When handling AND/XOR operations where lhs is a constant and rhs is a
witness, the code was using the raw 32-bit rhs witness directly instead
of decomposing it into bytes. This caused index out of bounds panics in
the witness builder when computing multiplicities for the 8-bit lookup
table (max index 65535, but receiving indices like 2282366754).

Fix: Add explicit handling for the (lhs=constant, rhs=witness) case that
decomposes the rhs witness into bytes via add_digital_decomposition
before pushing to the binop ops vectors.
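The two fixes above share one idea: both operands must be reduced to bytes before the table index is formed. A value-level sketch, with hypothetical function names and plain integers standing in for witnesses:

```rust
/// Row index into the 8-bit table; both operands must already be bytes.
fn table_index(lhs: u8, rhs: u8) -> usize {
    ((lhs as usize) << 8) + rhs as usize
}

/// Split two 32-bit operands into little-endian bytes and return the four
/// per-byte lookup indices, each guaranteed to be at most 65535.
fn byte_level_indices(lhs: u32, rhs: u32) -> [usize; 4] {
    let l = lhs.to_le_bytes();
    let r = rhs.to_le_bytes();
    let mut out = [0usize; 4];
    for i in 0..4 {
        out[i] = table_index(l[i], r[i]);
    }
    out
}
```

Taking `u8` parameters makes the out-of-range case unrepresentable at the value level; feeding full 32-bit operands into the same formula is what produced indices far beyond the 65,536-row table.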
@Bisht13 Bisht13 merged commit f8728db into main Jan 19, 2026
4 of 5 checks passed
@Bisht13 Bisht13 deleted the rs/sha_optimization branch January 19, 2026 06:51
dcbuild3r pushed a commit that referenced this pull request May 16, 2026
SHA-256: Initial constraint optimizations
2 participants