# LogJump vs. Classic Montgomery Reduction  
*measuring raw throughput for 256-bit moduli (512-bit inputs)*

In this notebook we **define reference implementations** of  

* `mont_redc` &nbsp;– classic word-by-word Montgomery reduction  
* `mul_logjumps_sos` &nbsp;– the LogJump/SOS optimisation  

then benchmark them side-by-side.

**What we do**

1. pre-generate a list of 10 000 random integers in $[0, p \cdot R)$ (`xs`), i.e. inputs of up to 512 bits,  
2. run each algorithm over that fixed list multiple times with
   `timeit.repeat`,  
3. take the **best** run to minimise Python-overhead noise,  
4. report average nanoseconds per call and the resulting speed-up.
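
The timing recipe in steps 2 to 4 can be sketched as a small helper. The name `best_ns_per_call` and the toy workload are hypothetical and only illustrate the pattern; any `int -> int` callable can stand in for a reducer.

```python
import timeit

def best_ns_per_call(fn, inputs, repeats=5):
    """Run fn over every input `repeats` times; return ns per call of the fastest pass."""
    def one_pass():
        for x in inputs:
            fn(x)
    # timeit.repeat accepts a callable; number=1 means one full pass per measurement
    best = min(timeit.repeat(one_pass, repeat=repeats, number=1))
    return best * 1e9 / len(inputs)

# Toy workload standing in for a reducer (assumption, not one of the notebook's functions).
inputs = list(range(1, 1001))
ns = best_ns_per_call(lambda x: x * x % 97, inputs)
print(f"{ns:.1f} ns per call")
```

Taking the minimum rather than the mean discards passes that were slowed down by unrelated system activity.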


In [214]:
# ──────────────────────────────────────────────────────────────────────────────
#  Constants cell 2.0  – pick your field with a one-liner switch
#      Supported:  'secp256k1' (default) · 'p256' · 'bn254'
#      NOTE: all downstream code is hard-wired for n = 4 limbs (≤ 256 bits)
# ──────────────────────────────────────────────────────────────────────────────
import math
from typing import List, Tuple

WORD_BITS = 64
MASK      = (1 << WORD_BITS) - 1           # 0xFFFF_FFFF_FFFF_FFFF

# ⇩ Toggle the field you want to benchmark
FIELD = "secp256k1"                        # {'secp256k1', 'p256', 'bn254'}

# ----------------------------------------------------------------------------- 
# Prime table (hex strings for readability)
# -----------------------------------------------------------------------------
_PRIMES = {
    "secp256k1": "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F",
    "p256"     : "FFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF",
    "bn254"    : "30644E72E131A029B85045B68181585D97816A916871CA8D3C208C16D87CFD47",
}

try:
    P_HEX = _PRIMES[FIELD.lower()]
except KeyError:
    raise ValueError(
        f"Unsupported FIELD tag {FIELD!r}. Choose from {sorted(_PRIMES)}"
    ) from None

P = int(P_HEX, 16)                         # modulus as int
N = 4                                      # limb count (n=4 hard-wired below)

# ----------------------------------------------------------------------------- 
# Pack / unpack helpers (little-endian limb order)
# -----------------------------------------------------------------------------
def to_words(x: int, n_words: int = N) -> List[int]:
    return [(x >> (WORD_BITS * i)) & MASK for i in range(n_words)]

def from_words(words: List[int]) -> int:
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

# ----------------------------------------------------------------------------- 
# Multi-precision helpers for mont_redc (gte & sub stay unchanged)
# -----------------------------------------------------------------------------
def gte(a: List[int], b: List[int]) -> bool:
    for i in reversed(range(len(a))):
        if a[i] != b[i]:
            return a[i] > b[i]
    return True                              # equal

def sub(a: List[int], b: List[int]) -> List[int]:
    out, borrow = [], 0
    for ai, bi in zip(a, b):
        t = ai - bi - borrow
        if t < 0:
            t += 1 << WORD_BITS
            borrow = 1
        else:
            borrow = 0
        out.append(t & MASK)
    return out

# ----------------------------------------------------------------------------- 
# Montgomery constants  μ  and  ρ   (re-computed for the chosen field)
# -----------------------------------------------------------------------------
MU  = (-pow(P, -1, 1 << WORD_BITS)) & MASK          # −p⁻¹  (mod 2⁶⁴)
RHO = pow(2, -WORD_BITS, P)                         # 2⁻⁶⁴ (mod p)

# ρ and p as limb arrays (pad to 5 limbs to satisfy calc_m / mont_redc loops)
RHO_WORDS = to_words(RHO, N) + [0]                  # length 5
P_WORDS   = to_words(P,   N) + [0]                  # length 5

# ----------------------------------------------------------------------------- 
# Friendly banner so you *know* which field the notebook is using
# -----------------------------------------------------------------------------
print(f"[FIELD]  {FIELD}  prime (bit-length = {P.bit_length()})")
print(f"μ  (−p⁻¹ mod 2^{WORD_BITS}): {hex(MU)}")
print("ρ (inverse 2⁶⁴ mod p) limbs:", [hex(w) for w in RHO_WORDS[:N]])


[FIELD]  secp256k1  prime (bit-length = 256)
μ  (−p⁻¹ mod 2^64): 0xd838091dd2253531
ρ (inverse 2⁶⁴ mod p) limbs: ['0xffffffff27c7f3a9', '0xffffffffffffffff', '0xffffffffffffffff', '0xd838091dd2253530']
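
As a quick sanity check, the two constants printed above can be verified against their defining congruences. This is a standalone sketch that recomputes them for secp256k1 (written here as $2^{256} - 2^{32} - 977$) rather than reusing the cell's globals; it needs Python 3.8+ for `pow` with a negative exponent.

```python
WORD_BITS = 64
P = 2**256 - 2**32 - 977                  # secp256k1 prime

MU  = (-pow(P, -1, 1 << WORD_BITS)) % (1 << WORD_BITS)   # -p^-1 mod 2^64
RHO = pow(2, -WORD_BITS, P)                               # 2^-64 mod p

# mu * p ≡ -1 (mod 2^64)
assert (MU * P + 1) % (1 << WORD_BITS) == 0
# rho * 2^64 ≡ 1 (mod p)
assert (RHO << WORD_BITS) % P == 1
# The two constants are linked: rho = (1 + mu * p) / 2^64, an exact division
assert RHO == (1 + MU * P) >> WORD_BITS
```

The last identity explains why the top limb of ρ is so close to μ in the output above.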


## Classic Montgomery REDC 

Montgomery reduction rewrites a 2·*n*-word integer

$$
c = t = \sum_{i=0}^{2n-1} t_i \cdot 2^{64i}
$$

into an *n*-word residue  
$t \cdot R^{-1} \bmod p$ with  

- $R = 2^{64n}$ (so $\gcd(R, p) = 1$ because $p$ is odd, and $R^{-1} \bmod p$ exists),
- a single-word constant $\mu = -p^{-1} \pmod{2^{64}}$.

The outer loop runs **once per limb** (`i = 0 … n−1`):

1. Pick $q = (t[i] \cdot \mu) \bmod 2^{64}$ → forces  
   $t[i] + q \cdot p \equiv 0 \pmod{2^{64}}$;
2. Add $q \cdot p$ into the running array (two inner loops);
3. After the loop, the first *n* limbs are guaranteed zero  
   → drop them (divide by $R$);
4. Final conditional subtraction ensures the result $< p$.
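
The four steps can be sketched on plain Python integers before moving to limb arrays. This is a minimal illustration, assuming the secp256k1 prime and an input below $p \cdot R$; `redc_int` is a name chosen for this sketch, not one of the notebook's functions.

```python
WORD_BITS, N = 64, 4
MASK = (1 << WORD_BITS) - 1
P  = 2**256 - 2**32 - 977                 # secp256k1 prime (any odd modulus works)
R  = 1 << (WORD_BITS * N)
MU = (-pow(P, -1, 1 << WORD_BITS)) & MASK

def redc_int(t: int) -> int:
    """Word-by-word Montgomery reduction on a Python int, for 0 <= t < P*R."""
    for _ in range(N):
        q = ((t & MASK) * MU) & MASK      # step 1: choose q so the low word cancels
        t = (t + q * P) >> WORD_BITS      # steps 2-3: add q*p, drop the zero word
    return t - P if t >= P else t         # step 4: t < 2p here, one subtraction suffices

t = 0x1234_5678_9ABC_DEF0 * P + 42        # arbitrary test value below P*R
assert redc_int(t) == (t * pow(R, -1, P)) % P
```

The shift in each iteration is an exact division because the choice of `q` makes the low 64 bits of `t + q * P` zero.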

The code below is a plain-Python reference implementation, parameterised by the constants we set up in the previous cell.

In [215]:
def mont_redc(c_words: List[int]) -> List[int]:
    """
    Classic Montgomery reduction, limb-for-limb.
    Expects c_words to have length 2*N (little-endian), value < p * R.
    Returns an N-word little-endian list < p.
    """
    assert len(c_words) == 2 * N
    t = c_words.copy() + [0]                 # spare top limb: t + m*p can reach 2*p*R

    for i in range(N):
        q = (t[i] * MU) & MASK               # makes t[i] + q*p[0] ≡ 0 (mod 2^64)

        # ---- multiply  p * q  ----
        pq   = [0] * (N + 1)
        carry = 0
        for j in range(N):
            prod   = q * P_WORDS[j] + carry
            pq[j]  = prod & MASK
            carry  = prod >> WORD_BITS
        pq[N] = carry

        # ---- add pq into t[i + ..] ----
        carry = 0
        for j in range(N + 1):
            s        = t[i + j] + pq[j] + carry
            t[i + j] = s & MASK
            carry    = s >> WORD_BITS

        # ---- propagate carry further if needed (up to the spare limb) ----
        k = i + N + 1
        while carry and k <= 2 * N:
            s     = t[k] + carry
            t[k]  = s & MASK
            carry = s >> WORD_BITS
            k    += 1

    # t[0:N] is now zero; keep the high half (plus the spare limb t[2N])
    lhs = t[N : 2 * N]

    # the quotient is < 2p; a set spare limb means it is >= 2^256 > p
    if t[2 * N] or gte(lhs, P_WORDS[:N]):
        lhs = sub(lhs, P_WORDS[:N])
    return lhs


## From REDC to **LogJump/SOS** — the big idea

Classic Montgomery spends **one full outer loop per limb**.  
LogJump replaces *n − 1* of those loops (three, for *n* = 4) with cheaper *ρ-jumps* and leaves **only a single** Montgomery iteration at the end.



### 1 Pre-compute a “magic” vector ρ

For an $n = 4$ limb modulus, let

$$
\rho = 2^{-64} \bmod p, \qquad
\rho = (\rho_0, \rho_1, \rho_2, \rho_3)_{\text{le}} .
$$

Because $2^{64} \cdot \rho \equiv 1 \pmod{p}$, any value $T = t_0 + 2^{64} T'$ with low word $t_0$ satisfies

$$
T \cdot 2^{-64} \equiv T' + t_0 \cdot \rho \pmod{p} .
$$

So multiplying the **low word** by $\rho$ and adding the product to the remaining words both:

- removes the low word, **and**
- shifts the whole number one limb to the right  
  (the result is one limb shorter, up to a carry bit).

This has the same effect as one Montgomery outer loop, but it needs no  
multiplication by $\mu$ once $\rho$ is available.
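
A single jump is easy to check on Python integers. The sketch below assumes the secp256k1 prime; `rho_jump` is a hypothetical helper name used only here.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1
P   = 2**256 - 2**32 - 977                # secp256k1 prime
RHO = pow(2, -WORD_BITS, P)               # rho = 2^-64 mod p

def rho_jump(t: int) -> int:
    """One jump: drop the low word and add low * rho to what remains."""
    return (t >> WORD_BITS) + (t & MASK) * RHO

t = (P - 1) * (P - 2)                     # a sample 512-bit product
j = rho_jump(t)

# The jump divides by 2^64 modulo p ...
assert (j << WORD_BITS) % P == t % P
# ... and the result is one word shorter than the input (up to a carry bit).
assert j.bit_length() <= t.bit_length() - WORD_BITS + 1
```

Note that the jump does not produce a reduced value; it only shortens the number while preserving its residue class scaled by $2^{-64}$.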


### 2 Do three jumps instead of three REDC loops

For a 4-limb (256-bit) modulus the input has eight limbs, and we need to remove the lowest **three** with ρ-jumps:

```text
c = [ c0 c1 c2 c3 c4 c5 c6 c7 ]
     ↓
step1: ρ·c0 added one limb up → shift 1  
step2: ρ·(new)low added one limb up → shift 1  
step3: ρ·(new)low added one limb up → shift 1
```

After those three jumps the array looks like

```text
[ 0 0 0 r0 r1 r2 r3 r4 ]
```

So we have already divided by $2^{64 \cdot 3}$.
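
The effect of the three jumps can be traced on a Python integer. This is a sketch under the same assumptions as the rest of the notebook (64-bit words, secp256k1 prime); it prints the bit length after each jump and checks the division by $2^{192}$ modulo $p$.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1
P   = 2**256 - 2**32 - 977                # secp256k1 prime
RHO = pow(2, -WORD_BITS, P)               # rho = 2^-64 mod p

c = (P - 1) * (P - 2)                     # sample 8-limb input
t = c
for step in range(3):
    t = (t >> WORD_BITS) + (t & MASK) * RHO   # one rho-jump
    print(f"after jump {step + 1}: {t.bit_length()} bits")

# Three jumps divide by 2^(64*3) modulo p ...
assert (t << (3 * WORD_BITS)) % P == c % P
# ... and leave five limbs plus at most one carry bit.
assert t.bit_length() <= 5 * WORD_BITS + 1
```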

---

### 3 Finish with **one** standard Montgomery iteration

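Five words remain, and the lowest of them still has to be cleared.  
One ordinary REDC loop (with the usual constant $\mu$) clears it and divides by the final $2^{64}$, leaving four words plus a possible overflow into a fifth.  
A compare-and-subtract with *p* is the last step; because every jump adds a multiple of $\rho$, the value at this point can reach about $2p + \rho$, so a single subtraction is not always enough.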
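
Putting the three parts together, a big-integer sketch of the whole reduction looks as follows. It assumes the secp256k1 prime and inputs below $p \cdot R$; `logjump_redc_int` is a name chosen for this illustration, and the final loop is written as `while` so that it stays correct when more than one subtraction is needed.

```python
import random

WORD_BITS, N = 64, 4
MASK = (1 << WORD_BITS) - 1
P   = 2**256 - 2**32 - 977                # secp256k1 prime
R   = 1 << (WORD_BITS * N)
MU  = (-pow(P, -1, 1 << WORD_BITS)) & MASK
RHO = pow(2, -WORD_BITS, P)

def logjump_redc_int(t: int) -> int:
    """Three rho-jumps, one Montgomery step, then reduce into [0, p)."""
    for _ in range(N - 1):
        t = (t >> WORD_BITS) + (t & MASK) * RHO   # jump: divide by 2^64 mod p
    q = ((t & MASK) * MU) & MASK                  # single Montgomery iteration
    t = (t + q * P) >> WORD_BITS
    while t >= P:                                 # value can reach about 2p + rho
        t -= P
    return t

rng = random.Random(1)                            # fixed seed for reproducibility
R_INV = pow(R, -1, P)
for _ in range(1000):
    c = rng.randrange(P * R)
    assert logjump_redc_int(c) == (c * R_INV) % P
```

The check against `c * R_INV % P` confirms that the jumps and the final Montgomery step together divide by $R = 2^{256}$ modulo $p$.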

**Result: multiplies saved**

- Classic REDC → $n^2 + n$ word-multiplies  
- LogJump/SOS → $n^2 + 1$ word-multiplies  

For secp256k1 (*n = 4*), that is:  
20 → 17 multiplies — a ~15% cut.
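
One way to arrive at these counts, assuming each ρ-jump costs *n* word-multiplies and each Montgomery loop costs *n* + 1 (one for *q*, *n* for *q · p*), is sketched below. The function names are hypothetical.

```python
def redc_multiplies(n: int) -> int:
    """n outer loops, each with 1 multiply for q and n for q*p."""
    return n * (n + 1)

def logjump_multiplies(n: int) -> int:
    """(n-1) jumps of n multiplies each, plus one full Montgomery loop (n+1)."""
    return (n - 1) * n + (n + 1)

for n in (4, 6, 8):
    a, b = redc_multiplies(n), logjump_multiplies(n)
    print(f"n={n}: {a} -> {b}  ({(a - b) / a:.0%} fewer)")
```

The absolute saving is always *n* − 1 multiplies, so the relative gain shrinks as the limb count grows.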


The next cells turn this description into code.

## LogJump/SOS implementation (64-bit limbs, n = 4)

Below we define:

1. **`calc_m(low)`** – multiplies a single limb `low` by the pre-computed
   ρ-vector and returns the product as a 6-entry limb list (five significant words, zero top entry).
2. **`mul_logjumps_sos(c)`** – performs three ρ-jumps followed by one
   classic Montgomery iteration, returning a 4-limb residue < p.

All constants (`ρ`, `μ`, `p`) come from the setup cell so that the
comparison with `mont_redc` is apples-to-apples.


In [216]:
def calc_m(low: int) -> list[int]:
    carry, m = 0, [0]*6
    for i in range(5):
        prod   = low * RHO_WORDS[i] + carry
        m[i]   = prod & MASK
        carry  = prod >> WORD_BITS
    m[5] = carry
    return m

def mul_logjumps_sos(c_words: list[int]) -> list[int]:
    """Three ρ-jumps plus one Montgomery iteration; expects 8 limbs, value < p * R."""
    R = [0]*8

    # jump #1:  R = (c >> 64) + c[0] * ρ
    m, carry = calc_m(c_words[0]), 0
    for i in range(5):
        s = c_words[i+1] + m[i] + carry
        R[i], carry = s & MASK, s >> WORD_BITS
    s = c_words[6] + carry
    R[5], carry = s & MASK, s >> WORD_BITS
    s = c_words[7] + carry
    R[6], carry = s & MASK, s >> WORD_BITS
    R[7] = carry

    # jump #2:  R = (R >> 64) + R[0] * ρ
    m, carry = calc_m(R[0]), 0
    for i in range(5):
        s = R[i+1] + m[i] + carry
        R[i], carry = s & MASK, s >> WORD_BITS
    s = R[6] + carry
    R[5], carry = s & MASK, s >> WORD_BITS
    R[6] = R[7] + carry                  # keep the old top limb instead of dropping it
    R[7] = 0

    # jump #3:  R = (R >> 64) + R[0] * ρ
    m, carry = calc_m(R[0]), 0
    for i in range(5):
        s = R[i+1] + m[i] + carry
        R[i], carry = s & MASK, s >> WORD_BITS
    R[5] = R[6] + carry
    R[6] = R[7] = 0

    # one Montgomery iteration
    q = (R[0] * MU) & MASK
    pq, carry = [0]*6, 0
    for i in range(5):
        prod      = q * P_WORDS[i] + carry   # P_WORDS is p padded to 5 limbs
        pq[i], carry = prod & MASK, prod >> WORD_BITS
    pq[5] = carry

    carry = 0
    for i in range(6):
        s = R[i] + pq[i] + carry
        R[i], carry = s & MASK, s >> WORD_BITS
    idx = 6
    while carry and idx < 8:
        s = R[idx] + carry
        R[idx], carry = s & MASK, s >> WORD_BITS
        idx += 1

    # R[0] is now zero; R[1:6] holds a value that can reach about 2p + ρ,
    # so compare on 5 limbs and subtract p until the result is below p
    out = R[1:6]
    while gte(out, P_WORDS):
        out = sub(out, P_WORDS)
    return out[:4]


In [217]:
# ──────────────────────────────────────────────────────────────────────────────
#  ⚙️  Set-up: wrap the **word-array** reducers so the benchmark can call them
#     with a single Python int (exactly like before).
# ──────────────────────────────────────────────────────────────────────────────
WORD_BITS = 64
MASK      = (1 << WORD_BITS) - 1      # already in scope, re-used here
N         = 4                         # 256-bit prime → 4 limbs
R         = 1 << (WORD_BITS * N)      # Montgomery radix 2²⁵⁶ (bounds the input range below)
# --- NOTE:  mont_redc()   and   mul_logjumps_sos()  must already be defined. ---

def _int_to_words(x: int, n_words: int = 2 * N) -> list[int]:
    return [(x >> (WORD_BITS * i)) & MASK for i in range(n_words)]

def _words_to_int(ws: list[int]) -> int:
    return sum(w << (WORD_BITS * i) for i, w in enumerate(ws))

# Alias names expected by the existing benchmark ------------------------------
def mont_reduce_int(x: int) -> int:          # ⇢ classic Montgomery (word-wise)
    out_words = mont_redc(_int_to_words(x, 2 * N))
    return _words_to_int(out_words)

def logjump_reduce_int(x: int) -> int:       # ⇢ 3 Logjumps + 1 Montgomery
    out_words = mul_logjumps_sos(_int_to_words(x, 2 * N))
    return _words_to_int(out_words)

# ──────────────────────────────────────────────────────────────────────────────
#  🏁  Benchmark block  (unchanged API, now runs the correct reducers)
# ──────────────────────────────────────────────────────────────────────────────
import random, timeit, dis, statistics

# ─── config ------------------------------------------------------------------
N_SAMPLES   = 10_000     # inputs in the vector
MICRO_CALLS = 100_000    # per-scalar micro bench
REPEATS     = 10         # best-of repeats
rng         = random.SystemRandom()

# ─── test data ---------------------------------------------------------------
xs = [rng.randrange(0, P * R) for _ in range(N_SAMPLES)]
print(f"[INFO] Generated {N_SAMPLES:,} random scalars in 0 … P·R\n")

# ─── micro timing (single scalar) -------------------------------------------
sample = xs[0]
t_m = timeit.timeit("mont_reduce_int(sample)", globals=globals(), number=MICRO_CALLS)
t_l = timeit.timeit("logjump_reduce_int(sample)", globals=globals(), number=MICRO_CALLS)

print("[MICRO] Single-call loop over the SAME scalar")
print(f"        Montgomery : {t_m*1e6/MICRO_CALLS:8.3f} µs/op")
print(f"        LogJump    : {t_l*1e6/MICRO_CALLS:8.3f} µs/op\n")

# ─── byte-code quick-peek ----------------------------------------------------
def _bytecode_size(fn):
    c = fn.__code__
    return (f"{fn.__name__:<20}"
            f" stack={c.co_stacksize:2d}"
            f"  locals={c.co_nlocals:2d}"
            f"  bytecode={len(c.co_code):3d}")

print("[BCODE] Tiny disassembly stats (smaller → more C, less Python)")
print(_bytecode_size(mont_reduce_int))
print(_bytecode_size(logjump_reduce_int))
print()

# ─── vector benchmark --------------------------------------------------------
setup = "from __main__ import mont_reduce_int, logjump_reduce_int, xs"
stmt_m = "for x in xs: mont_reduce_int(x)"
stmt_l = "for x in xs: logjump_reduce_int(x)"

best_m = min(timeit.repeat(stmt_m, setup=setup, repeat=REPEATS, number=1))
best_l = min(timeit.repeat(stmt_l, setup=setup, repeat=REPEATS, number=1))

ns_per_m = best_m * 1e9 / N_SAMPLES
ns_per_l = best_l * 1e9 / N_SAMPLES

# ─── results -----------------------------------------------------------------
print("═" * 66)
print(f" Benchmark over {N_SAMPLES:,} pre-generated inputs (best of {REPEATS})")
print("═" * 66)
print(f"{'':<24}{'Montgomery':>12}{'LogJump':>12}")
print(f"{'best total time':<24}{best_m:11.3f}s{best_l:12.3f}s")
print(f"{'ns per call (avg)':<24}{ns_per_m:11.1f}{ns_per_l:12.1f}")
print(f"{'speed-up (Mont/LJ)':<24}{ns_per_m/ns_per_l:11.2f}×")
print("═" * 66)


[INFO] Generated 10,000 random scalars in 0 … P·R

[MICRO] Single-call loop over the SAME scalar
        Montgomery :   12.240 µs/op
        LogJump    :   13.159 µs/op

[BCODE] Tiny disassembly stats (smaller → more C, less Python)
mont_reduce_int      stack= 7  locals= 2  bytecode= 80
logjump_reduce_int   stack= 7  locals= 2  bytecode= 80

══════════════════════════════════════════════════════════════════
 Benchmark over 10,000 pre-generated inputs (best of 10)
══════════════════════════════════════════════════════════════════
                          Montgomery     LogJump
best total time               0.121s       0.134s
ns per call (avg)           12148.8     13402.5
speed-up (Mont/LJ)             0.91×
══════════════════════════════════════════════════════════════════


In [227]:
# ──────────────────────────────────────────────────────────────────────────────
#  ONE-STOP BENCH CELL ── BN-254 · pure-int LogJump vs. Montgomery
#      * optional gmpy2 (GMP) backend is commented out below; built-in int is used
#      * zero Python-level limb loops: both reducers are tight integer ops
#      * reuses the benchmark scaffold you already have
# ──────────────────────────────────────────────────────────────────────────────
import random, timeit, math, statistics, importlib.util
from typing import List

# # ─── big-int backend: fallback = built-in int, fast-path = gmpy2.mpz ----------
# if importlib.util.find_spec("gmpy2"):
#     import gmpy2 as _g
#     _int = _g.mpz
#     print("[INFO] gmpy2 detected – using GMP backend\n")
# else:
_int = int

# ─── field constants ----------------------------------------------------------
WORD_BITS = 64
MASK      = (1 << WORD_BITS) - 1

# BN-254 prime
P = _int(0x30644e72e131a029b85045b68181585d97816a916871ca8d3c208c16d87cfd47)
N = 4                # limbs (hard-wired)
R = _int(1) << (WORD_BITS * N)

MU0 = (-pow(int(P), -1, 1 << WORD_BITS)) & MASK     # −p⁻¹ mod 2⁶⁴
# ρ = 2⁻⁶⁴ (mod p) – not needed in bigint variant

print(f"[FIELD] BN-254  (bitlen = {P.bit_length()})")
print(f"μ₀ (low-limb const) = {hex(MU0)}\n")

# ─── bigint Montgomery (CIOS-style but as one tight loop) ---------------------
def mont_reduce_int(T: _int) -> _int:
    """n-word Montgomery reduction (n = 4).  Result < p."""
    for _ in range(N):
        m = (T & MASK) * MU0 & MASK      # low limb × μ₀  (≈ q)
        T = (T + m * P) >> WORD_BITS     # exact division by 2⁶⁴
    if T >= P:
        T -= P
    return T

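# ─── bigint LogJump / SOS  (n−1 jumps + 1 Mont step) --------------------------
# NOTE: the "jumps" below reuse the μ₀·p step, so this function performs the same
# arithmetic as mont_reduce_int and the timings should differ only by noise.
# A ρ-based jump would instead compute  (T >> WORD_BITS) + (T & MASK) * ρ.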
def logjump_reduce_int(T: _int) -> _int:
    for _ in range(N - 1):
        m = (T & MASK) * MU0 & MASK      # low limb × μ₀
        T = (T + m * P) >> WORD_BITS     # jump: kills one limb
    # single Montgomery iteration to finish
    m = (T & MASK) * MU0 & MASK
    T = (T + m * P) >> WORD_BITS
    if T >= P:
        T -= P
    return T

# ─── quick functional sanity (50 k random inputs) -----------------------------
_rng = random.SystemRandom()
for _ in range(50_000):
    x = _int(_rng.randrange(0, P * R))
    assert mont_reduce_int(x) == logjump_reduce_int(x)
print("[PASS] 50 000-sample equivalence test\n")

# ─── benchmarking scaffold (unchanged) ---------------------------------------
N_SAMPLES   = 10_000
MICRO_CALLS = 100_000
REPEATS     = 10
xs = [_int(_rng.randrange(0, P * R)) for _ in range(N_SAMPLES)]

# micro timing on one scalar
sample = xs[0]
t_m = timeit.timeit("mont_reduce_int(sample)", globals=globals(), number=MICRO_CALLS)
t_l = timeit.timeit("logjump_reduce_int(sample)", globals=globals(), number=MICRO_CALLS)

print("[MICRO] same-scalar loop")
print(f"        Montgomery : {t_m*1e6/MICRO_CALLS:8.3f} µs/op")
print(f"        LogJump    : {t_l*1e6/MICRO_CALLS:8.3f} µs/op\n")

# full-vector benchmark
setup = "from __main__ import mont_reduce_int, logjump_reduce_int, xs"
stmt_m = "for x in xs: mont_reduce_int(x)"
stmt_l = "for x in xs: logjump_reduce_int(x)"

best_m = min(timeit.repeat(stmt_m, setup=setup, repeat=REPEATS, number=1))
best_l = min(timeit.repeat(stmt_l, setup=setup, repeat=REPEATS, number=1))
ns_m   = best_m * 1e9 / N_SAMPLES
ns_l   = best_l * 1e9 / N_SAMPLES

print("═"*64)
print(f" Benchmark over {N_SAMPLES:,} samples (best of {REPEATS})")
print("═"*64)
print(f"{'':<22}{'Montgomery':>12}{'LogJump':>12}")
print(f"{'total time':<22}{best_m:11.3f}s{best_l:12.3f}s")
print(f"{'ns per call':<22}{ns_m:11.1f}{ns_l:12.1f}")
print(f"{'speed-up (M/LJ)':<22}{ns_m/ns_l:11.2f}×")
print("═"*64)


[FIELD] BN-254  (bitlen = 254)
μ₀ (low-limb const) = 0x87d20782e4866389

[PASS] 50 000-sample equivalence test

[MICRO] same-scalar loop
        Montgomery :    1.419 µs/op
        LogJump    :    1.415 µs/op

════════════════════════════════════════════════════════════════
 Benchmark over 10,000 samples (best of 10)
════════════════════════════════════════════════════════════════
                        Montgomery     LogJump
total time                  0.014s       0.014s
ns per call                1446.1      1406.3
speed-up (M/LJ)              1.03×
════════════════════════════════════════════════════════════════
