ter

Balanced-ternary substrate simulator (F0–F11) that runs Llama 3.2 1B faster than llama.cpp Q8_0 and reproduces BitNet b1.58 2B-4T bit-exact at 1.58 bits/weight on consumer GPUs. The substrate simulator is the research vehicle; the CUDA port is an iteration aid for the matmul fabric.

TL;DR results (two engineering claims, defendable end-to-end)

Llama 3.2 1B end-to-end: 425 t/s on RTX 3090, 1.08× faster than llama.cpp Q8_0 (395 t/s reference). 28.9× over the v1 baseline.
BitNet b1.58 2B-4T end-to-end: 214 t/s, 8/8 BOS-greedy tokens bit-exact against the microsoft/BitNet llama-cli reference.
Kernel correctness validated three-way for Llama 1B: scalar NumPy reference vs CUDA forward 16/16 bit-exact; fp32 weights (no ternarization) vs Q8_0 golden 7/8; ternarized vs Q8_0 golden 0/16 (confirms H1: post-training ternarization breaks non-QAT models).
Op-count parity holds: 1.002× lane-MACs vs analytical fp16 baseline (measured by the CPU simulator on Llama 1B).
Memory: 1.58 bits/weight (8× compression vs fp16, 5× vs Q8_0).

Underlying CUDA matmul-fabric measurement (with honest attribution)

Packed-trit kernel beats cuBLAS INT8 TC by 1.90× on Llama 1B at single-token gen (M=1), per-shape wins up to 4.18× on Wdown. Attribution: this advantage is bandwidth-bound (one byte yields 16 MACs; ratio 0.94 vs 0.50 MACs/byte for INT8 ≈ 1.88×), not ternary-arithmetic-bound. A generic INT2 packed kernel feeding __dp4a would in principle obtain similar throughput. See paper Sec. 7.3 "What is and isn't ternary about this work" and Cuevas (2026) §5.2.
Our own ADD-only kernel (mm_bitnet_addonly) confirms: eliminating multiplications matches but does not exceed __dp4a; the TVMAC=0 architectural advantage of ternary needs custom silicon to be observable, not commodity GPU.

Full results: docs/baselines/2026-05-10-cuda-synthesis.md; paper paper/ter_paper.pdf.

Full results: docs/baselines/2026-05-10-cuda-synthesis.md.

Build

cmake -S . -B build
cmake --build build
ctest --test-dir build --output-on-failure

The matmul phase-gate test requires numpy. A local venv is provided at .venv/ (use uv venv .venv && uv pip install --python .venv/bin/python numpy to recreate it).

The CUDA bench targets need a separate build with nvcc; see cuda/.

Status (F0–F12)

Foundation (F0–F4)

F0 Trit, Tryte, Word27, Word54 primitives, packing, Memory.
F1 Scalar ISA, assembler, simulator, sum(1..5) smoke test.
F2 SIMD extension (tvadd, tvmac, tvsum, ...), 64×64 matmul gate.
F3 Format B quantizer (bf16/float ↔ fixed-point trit + per-tensor scale).
F4 All kernels (matmul, rmsnorm, softmax, silu, rope) + attention via host-orchestrated composition; complete K3 transformer building blocks.

Inference bring-up (F5)

F5.1–F5.3 Vendor ntransformer infra; single-layer forward; multi-token attention with KV cache; all 4 transcendental kernels plumbed.
F5.4a–g Brandon-tiny f16 GGUF loader, SPM tokenizer, value_residual, DWA, register prefill, multi-token generation.
F5.4h Calibration A/B (9-trit vs 12-trit greedy decoding).
F5.4i TinyStories Q4_K_M (Q4_K + Q6_K dequantizers).

Llama 3.2 1B (F6)

F6.1 Q8_0 dequantizer.
F6.2 TritTensor payload Word27→int32 (27× memory reduction).
F6.3 End-to-end Llama 1B forward, 508s/forward (pre-AVX2, kernel-routed mm_row), finite logits in [-10.29, 8.83]. Brought down to 25.6s/forward by F11.
F6.4 Op-count instrumentation; F8 honest TVMAC counter.
F6.5 Multi-token generation (greedy from BOS).

Hypothesis closure (F7–F9)

F7 Format A (tfloat: 1t sign + 5t exp + N t mantissa); 11-trit (1+5+5) sweet spot preserves argmax on Llama 1B with RMSE 0.05.
F8 Pure-scalar honest TVMAC counter (analytical bump for host-side ops).
F9 BitNet b1.58 substrate-data alignment:
- Analytical: TinyStories layer matmuls 121856 → 0 TVMAC (100% elimination)
- Real GGUF: dequant_i2_s + 100% ternary purity verified on attn_q
- Forward end-to-end: BitNet 2B-4T 18 min/forward, finite logits, sub_norm wired
- Post-training quality: ternarizing Llama 1B Q8_0 (no QAT) breaks quality (argmax flips, 0/5 top-5 overlap, RMSE 2.39)

Engineering (F10–F11)

F10 K4 ring-0 backend (libter_k4.a):
- C ABI freestanding-clean (-fno-exceptions -fno-rtti -DTER_FREESTANDING)
- OpCounters → flat std::array<uint64_t, 256>
- KernelTable → flat array (32 slots), no std::unordered_map
F11 AVX2 GEMV in mm_row + -O3 on ter library: 21× speedup on Llama 1B (508 s/forward kernel-routed → 25.6 s/forward AVX2 GEMV).

CUDA acceleration (F12)

F12.1 mm_row CUDA naive: 21× speedup over Mac AVX2.
F12.2 Microbench INT8 TC, sgemm fp32, naive packed (memory baseline).
F12.3 End-to-end Llama 1B forward in CUDA (INT8 TC matmuls): 14.7 t/s.
F12.4 Packed-trit end-to-end forward: 4.2 t/s, 33% memory savings.
F12.5 RTX 3090 production baseline (llama.cpp Q8_0 Llama 3.2 1B, clean GPU): 395 t/s reference (recalibrated 2026-05-15).
F12.6 Real-weight quality validation on Llama 1B (BitNet post-quant breaks).
F12.7 Packed-trit kernels on consumer GPU:
- v4 wide → v6 dp4a → v7 colmaj → v8 warp → v10 unroll → v11 warp4 → v13
- Best-per-shape dispatch: 6.67 ms / 186 GMAC/s, 1.90× over cuBLAS INT8 TC at M=1
- Honest attribution (per Cuevas 2026 §5.2 + our own ADD-only experiment): this advantage is bandwidth-bound (sub-byte packing density), not ternary-arithmetic-bound. An INT2 packed kernel feeding __dp4a would in principle obtain the same. Ternary-arithmetic energy advantage requires custom silicon (FPGA/ASIC) to be observable.
F12.8 End-to-end forward optimization (round 2): 14.7 → 425 t/s (28.9×)
- Fix 1: device-pointer scale eliminates 65 D2H syncs/token → 28.9 t/s
- Fix 2: forward kernel upgrade v4 → v11 warp-coop → 200.6 t/s
- Fix 3: cudaGraph capture (all state on-device, entire forward as one graph) → 412 t/s
- Fix 4: fused rmsnorm+quant + silu+quant (49 launches/token saved) → 425 t/s
- Result: 1.08× faster than llama.cpp Q8_0 on Llama 1B, RTX 3090, 1.58 bits/weight.
- Kernel correctness validated: GGUF→packed converter + NumPy reference forward + CUDA load_model_from_bin path. CUDA matches NumPy reference 16/16 BOS-greedy tokens bit-exact. Diverges from Llama Q8_0 golden (0/16) per H1 (ternarization breaks non-QAT models). Architecture sanity: with fp32 weights (no ternarization) reference matches Q8_0 golden 7/8.
F12.9 BitNet 2B-4T real-weight end-to-end forward: 214 t/s, 8/8 BOS-greedy tokens match reference
- GGUF i2_s converter: channel-interleaved 128-block, code mapping, per-tensor scales
- Architectural adapters: sub_norms, ReLU², NEOX RoPE, fp16 weight-tied lm_head
- fp16 overflow bug fixed in FFN (gate²·up exceeds 65504 → inf → zero contribution)
- Stage 2 (fp32 residual stream): 6/8 → 8/8 token match (bit-exact vs microsoft llama-cli)
- Stage 3 (fused rmsnorm+quant): +7% throughput (193 → 207 t/s)
- Stage 1 (ADD-only kernel) brings TVMAC=0 from bench to production
- Substrate runs real ternary-trained transformer with bit-exact reference reproduction.

Headline numbers (Llama 1B Q8_0, RTX 3090)

Backend	ms / forward (matmul)	GMAC/s
Mac AVX2 sim (host scalar)	25,600	38
ter sim CUDA naive port	26.06	47.5
ter sim CUDA cuBLAS sgemm	6.87	180
ter sim CUDA packed-trit (best dispatch)	6.67	186
cuBLAS INT8 TC (production reference)	12.67	98

Total session-cumulative speedup vs Mac AVX2 baseline: 3,840×. vs cuBLAS INT8 TC at M=1 gen on Llama 1B shapes: 1.90× faster — attributable to packed sub-byte storage density rather than ternary arithmetic semantics (see paper Sec. 7.3 and Cuevas 2026 §5.2).

What's proven vs what's engineering

Engineering results defended end-to-end

Llama 3.2 1B: 425 t/s, 1.08× over llama.cpp Q8_0 on RTX 3090. Kernel correctness validated bit-exact vs NumPy reference (16/16).
BitNet b1.58 2B-4T: 214 t/s, 8/8 BOS-greedy tokens bit-exact vs microsoft/BitNet llama-cli — this is where substrate-data alignment is semantically meaningful (QAT-trained ternary model).

Architecturally proven (substrate-level)

Op count parity 1.002× lane-MACs vs fp16 (CPU simulator, analytical).
Memory compression: 1.58 bits/weight; 8× vs fp16, 5× vs Q8_0.
Format A trit-budget tradeoff: 1+5+5 preserves Llama 1B argmax with logit RMSE 0.05.
BitNet substrate-data alignment: TVMAC=0 in the matmul fabric (analytical + real GGUF forward verified bit-exact).

What this paper does NOT claim (honest acknowledgments)

Throughput superiority of ternary arithmetic on binary silicon. The 1.90× matmul win is a sub-byte packing density result; a generic INT2 kernel would in principle match it. Our own ADD-only kernel empirically matches __dp4a but does not exceed it.
Production wins at prefill regimes (M ≥ 16) — tensor cores saturate; hybrid dispatch margin shrinks to ~1.04× at M=64.
Quality under post-training ternarization — 0/16 vs Llama Q8_0 golden confirms H1. Substrate-data alignment is genuine only for QAT-trained models (BitNet).
Per-op energy advantage in deployed silicon — literature (Horowitz 2014, CUTIE 2020, Cuevas 2026) supports the bound, but we did not measure it. FPGA/ASIC prototype is the next experiment.
Novelty of the LUT/packed-low-bit principle — T-MAC (Wei et al. EuroSys 2025), LUTMUL (Xie et al. ASPDAC 2025), Platinum (Shan et al. 2025), TeLLMe (Qiao et al. 2025) all precede this work. Our contribution is the specific GPU __dp4a packed-trit kernel design
- end-to-end engine integration + bit-exact validation.

Acknowledgment

Felipe Cuevas Araneda's Análisis de Eficiencia Energética en Multiplicación de Matrices con Pesos Discretizados (mayo 2026) provides the formal energy-cost framework (Prop. 2: ρ₃ ≥ 3.8× bound in 45nm; Prop. 5: k=3 efficiency optimum; Prop. 6: MAC-to-LUT reduction) and a critical review that motivated the honest framing adopted in this revision.

Building blocks

Component	Purpose	File
`tk_matmul_b_9t`	27-lane integer dot product (one tile of GEMM)	`src/kernels/tk_matmul_b_9t.tasm`
`tk_rmsnorm`	RMSNorm via rsqrt LUT	`src/kernels/tk_rmsnorm.tasm`
`tk_softmax`	Softmax via per-lane exp LUT + recip	`src/kernels/tk_softmax.tasm`
`tk_silu`	SiLU(x) = x · sigmoid(x); SwiGLU via host composition	`src/kernels/tk_silu.tasm`
`tk_rope`	Pure-SIMD paired rotation (host-prepared inputs)	`src/kernels/tk_rope.tasm`
`mm_row` (AVX2 GEMV)	Host-side matmul fabric, F8 honest TVMAC accounting	`src/tx/forward.cpp`
`mm_packed_v11/v13` (CUDA)	Ternary-optimized packed kernel, 4 cols per warp + dp4a	`cuda/ter_cuda_packed_v6.cu`
`forward_token`	End-to-end transformer forward (Llama, brandon, BitNet)	`src/tx/transformer.cpp`
`libter_k4.a`	C ABI for ring-0 osito-k integration	`src/k4/ter_k4.cpp`, `include/ter_k4/ter_k4.h`

Reference docs

Design: docs/superpowers/specs/2026-05-08-ter-design.md
Plans: docs/superpowers/plans/
ISA: docs/isa.md
Number formats (B, A): docs/number-formats.md
Kernel patterns: docs/kernel-patterns.md
RTX 3090 + CUDA full results: docs/baselines/2026-05-10-cuda-synthesis.md
RTX 3090 production baseline (llama.cpp Q8_0): docs/baselines/2026-05-10-rtx3090.md

Name		Name	Last commit message	Last commit date
Latest commit History 197 Commits
.claude		.claude
Testing/Temporary		Testing/Temporary
cuda		cuda
docs		docs
examples		examples
include		include
paper		paper
src		src
tests		tests
tools		tools
v2		v2
vendor/ntransformer		vendor/ntransformer
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
STATE-2026-05-15.md		STATE-2026-05-15.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ter

TL;DR results (two engineering claims, defendable end-to-end)

Underlying CUDA matmul-fabric measurement (with honest attribution)

Build

Status (F0–F12)

Foundation (F0–F4)

Inference bring-up (F5)

Llama 3.2 1B (F6)

Hypothesis closure (F7–F9)

Engineering (F10–F11)

CUDA acceleration (F12)

Headline numbers (Llama 1B Q8_0, RTX 3090)

What's proven vs what's engineering

Engineering results defended end-to-end

Architecturally proven (substrate-level)

What this paper does NOT claim (honest acknowledgments)

Acknowledgment

Building blocks

Reference docs

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ter

TL;DR results (two engineering claims, defendable end-to-end)

Underlying CUDA matmul-fabric measurement (with honest attribution)

Build

Status (F0–F12)

Foundation (F0–F4)

Inference bring-up (F5)

Llama 3.2 1B (F6)

Hypothesis closure (F7–F9)

Engineering (F10–F11)

CUDA acceleration (F12)

Headline numbers (Llama 1B Q8_0, RTX 3090)

What's proven vs what's engineering

Engineering results defended end-to-end

Architecturally proven (substrate-level)

What this paper does NOT claim (honest acknowledgments)

Acknowledgment

Building blocks

Reference docs

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages