Complex M31 field extension and NTT#19
Merged
Conversation
recmo
approved these changes
May 1, 2025
let b5_j = b5.mul_j();
let b7_j = b7.mul_j();
let b6_w8 = b6 * W_8;
let b7_j_w8 = b7_j * W_8;
Contributor
I'm surprised we are not exploiting the special structure here. Does the compiler turn this into bitshifts?
Contributor
Author
Resolved in commit 496dbce, though benchmarks don't show a significant improvement.
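For readers following this thread, here is a minimal sketch of the "special structure" in question. It assumes that W_8 is the 8th root of unity 2^15 * (1 + i) and that mul_j() multiplies by the imaginary unit; neither is confirmed by the diff shown here. It also uses canonical u32 values in [0, p) rather than the redundant RF representation, and all function names are hypothetical. Under those assumptions, multiplying by W_8 needs one addition, one subtraction, and two 31-bit rotations instead of a full complex multiplication.

```rust
// Sketch only: canonical M31 values in [0, P), not the PR's redundant RF type.
const P: u32 = (1 << 31) - 1;

fn add(a: u32, b: u32) -> u32 {
    let s = a + b; // a, b < 2^31, so this cannot overflow u32
    if s >= P { s - P } else { s }
}

fn sub(a: u32, b: u32) -> u32 {
    if a >= b { a - b } else { a + P - b }
}

fn neg(a: u32) -> u32 {
    if a == 0 { 0 } else { P - a }
}

fn mul(a: u32, b: u32) -> u32 {
    ((a as u64 * b as u64) % P as u64) as u32
}

/// Multiply by 2^15 mod P. Because 2^31 = 1 (mod P), this is a rotation
/// of the 31-bit value to the left by 15 bits.
fn mul_2_15(a: u32) -> u32 {
    ((a << 15) & P) | (a >> 16)
}

/// Generic complex multiplication, used as the reference.
fn cmul(a: (u32, u32), b: (u32, u32)) -> (u32, u32) {
    let re = sub(mul(a.0, b.0), mul(a.1, b.1));
    let im = add(mul(a.0, b.1), mul(a.1, b.0));
    (re, im)
}

/// Multiply by i: (x + yi) * i = -y + xi. No field multiplication needed.
fn mul_j(z: (u32, u32)) -> (u32, u32) {
    (neg(z.1), z.0)
}

/// Multiply by the assumed W_8 = 2^15 * (1 + i):
/// (x + yi) * W_8 = 2^15 * ((x - y) + (x + y)i).
fn mul_w8(z: (u32, u32)) -> (u32, u32) {
    (mul_2_15(sub(z.0, z.1)), mul_2_15(add(z.0, z.1)))
}
```

Whether the compiler finds this on its own depends on how W_8 and the multiplication are written, so the explicit form is the safer way to get the rotations.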
/// A radix-8 NTT butterfly.
#[inline]
pub fn ntt_block_8(
Contributor
@xrvdg We will want to aggressively optimize this function.
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Contributor
Author
Thanks for the feedback and suggestions! I'll incorporate them into a new commit.
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 2, 2025
Dzejkop
reviewed
May 3, 2025
Dzejkop
reviewed
May 3, 2025
Dzejkop
reviewed
May 3, 2025
Dzejkop
reviewed
May 3, 2025
Dzejkop
reviewed
May 3, 2025
Dzejkop
reviewed
May 3, 2025
Contributor
Author
Thank you @Dzejkop! Your suggestions are super helpful. Things look much cleaner and more consistent now!
Contributor
Thank you @Dzejkop and @weijiekoh. This looks good to merge (we can always follow up with further PRs).
dcbuild3r
pushed a commit
that referenced
this pull request
May 16, 2026
dcbuild3r
pushed a commit
that referenced
this pull request
May 16, 2026
dcbuild3r
pushed a commit
that referenced
this pull request
May 16, 2026
Complex M31 field extension and NTT
This draft PR adds a cm31_ntt member to ProveKit.

cm31_ntt is an implementation of the complex Mersenne-31 field extension (CM31) as well as the number-theoretic transform (NTT) algorithm for polynomials with CM31 coefficients. It is optimised for the ARM platform and has been benchmarked on a Raspberry Pi 5.

RF and CF

CM31 builds upon Mersenne-31 field elements whose arithmetic is implemented using a redundant representation. We represent these redundant M31 field elements as 32-bit unsigned integers wrapped in the RF type defined in src/rm31.rs. Please refer to this note by Solbert and Domb to learn more about the algorithms used.

Each CM31 field element consists of a real RF part and an imaginary RF part. The CF type and complex arithmetic operations are defined in src/cm31.rs.

NTT

ntt.rs contains an optimised implementation of the NTT algorithm; the entry point is ntt().

Optimisations

The first optimisation is simply to precompute twiddle factors.

The most important optimisation is to combine an approach that allocates memory at each level of recursion with one that performs the NTT in-place.

The traditional divide-and-conquer approach allocates memory at each level of recursion. It is very fast, but leaves some room for improvement. Take a look at ntt_r8_vec() and ntt_r8_vec_p() in ntt.rs. They are straightforward implementations of the divide-and-conquer algorithm (the former does not precompute twiddle factors, and the latter does), but they allocate new Vecs with each recursive call, resulting in some performance overhead.

An in-place algorithm (ntt_r8_ip and ntt_r8_ip_p), which avoids memory allocation altogether, is however extremely slow, especially past 32768 elements. This is likely due to the small CPU cache leading to costly cache misses.

We had a breakthrough when we found that for NTTs of sizes lower than 262144, ntt_r8_vec_p() is slower than its in-place counterpart ntt_r8_ip_p. This led us to develop our most efficient algorithm by combining the two approaches. It is implemented in ntt_r8_hybrid_p(), which uses the divide-and-conquer approach via recursion but switches to the in-place algorithm when the NTT size is NTT_BLOCK_SIZE_FOR_CACHE (hardcoded to 32768).

Benchmark results can be found in README.md.
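
To make the hybrid strategy concrete, here is a minimal sketch of the idea. It is not the code in ntt.rs: it uses radix-2 rather than radix-8, canonical u64 pairs rather than RF/CF, and no precomputed twiddles. All names are hypothetical, and the cutoff parameter plays the role of NTT_BLOCK_SIZE_FOR_CACHE. The recursion allocates while the problem is large and hands any block at or below the cutoff to an in-place routine.

```rust
// Sketch only: canonical CM31 elements as (re, im) pairs in [0, P).
const P: u64 = (1 << 31) - 1;
type C = (u64, u64);

fn cadd(a: C, b: C) -> C { ((a.0 + b.0) % P, (a.1 + b.1) % P) }
fn csub(a: C, b: C) -> C { ((a.0 + P - b.0) % P, (a.1 + P - b.1) % P) }
fn cmul(a: C, b: C) -> C {
    // P * P is added so the subtraction cannot underflow.
    ((a.0 * b.0 + P * P - a.1 * b.1) % P, (a.0 * b.1 + a.1 * b.0) % P)
}
fn cpow(mut b: C, mut e: u64) -> C {
    let mut r = (1, 0);
    while e > 0 {
        if e & 1 == 1 { r = cmul(r, b); }
        b = cmul(b, b);
        e >>= 1;
    }
    r
}

/// Reference O(n^2) transform: out[k] = sum_j a[j] * w^(j*k).
fn naive_dft(a: &[C], w: C) -> Vec<C> {
    (0..a.len()).map(|k| {
        let mut acc = (0, 0);
        for (j, &x) in a.iter().enumerate() {
            acc = cadd(acc, cmul(x, cpow(w, (j * k) as u64)));
        }
        acc
    }).collect()
}

/// In-place iterative radix-2 NTT; `w` must be a root of unity of order a.len().
fn ntt_in_place(a: &mut [C], w: C) {
    let n = a.len();
    // Bit-reversal permutation so the butterflies yield natural order.
    let mut j = 0;
    for i in 1..n {
        let mut bit = n >> 1;
        while j & bit != 0 { j ^= bit; bit >>= 1; }
        j |= bit;
        if i < j { a.swap(i, j); }
    }
    let mut len = 2;
    while len <= n {
        let half = len / 2;
        let w_len = cpow(w, (n / len) as u64); // root of unity of order `len`
        for start in (0..n).step_by(len) {
            let mut t = (1, 0);
            for k in 0..half {
                let u = a[start + k];
                let v = cmul(a[start + k + half], t);
                a[start + k] = cadd(u, v);
                a[start + k + half] = csub(u, v);
                t = cmul(t, w_len);
            }
        }
        len <<= 1;
    }
}

/// Hybrid: allocate and recurse above `cutoff`, run in-place at or below it.
fn ntt_hybrid(a: &[C], w: C, cutoff: usize) -> Vec<C> {
    let n = a.len();
    if n <= cutoff.max(1) {
        let mut out = a.to_vec();
        ntt_in_place(&mut out, w);
        return out;
    }
    let even: Vec<C> = a.iter().step_by(2).copied().collect();
    let odd: Vec<C> = a.iter().skip(1).step_by(2).copied().collect();
    let w2 = cmul(w, w);
    let (e, o) = (ntt_hybrid(&even, w2, cutoff), ntt_hybrid(&odd, w2, cutoff));
    let mut out = vec![(0, 0); n];
    let mut t = (1, 0);
    for k in 0..n / 2 {
        let v = cmul(o[k], t);
        out[k] = cadd(e[k], v);
        out[k + n / 2] = csub(e[k], v);
        t = cmul(t, w);
    }
    out
}
```

A quick way to try it is a size-8 input with w = (1 << 15, 1 << 15), which has order 8 in CM31 because its square is i. The hybrid result should match naive_dft for any cutoff.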