feat: Parallel ternary conversion + rayon matmul#62

Merged
shift merged 1 commit into main from feat/simd-ternary-speedup
Apr 22, 2026
Conversation

@shift
Owner

@shift shift commented Apr 22, 2026

Parallel Ternary Speedup

Changes

1. Parallel layer conversion (rayon)

gemma4_to_block_attnres() now processes all 35 layers in parallel using rayon's par_iter(). Each layer's weight quantization is independent, so no synchronization between layers is needed.

Before: 472 s (7.9 min), sequential
After: ~120 s (~2 min), estimated on a 4-core Skylake

2. Parallel ternary matmul

ternary_matmul_parallel(): processes sequence positions (or output rows for single-token) in parallel.

CpuLinear::forward_parallel(): multi-threaded forward for large matrices.

Expected speedup: ~4-8x on Skylake (4 cores/8 threads)

3. Ternary matmul optimization

Split the inner loop into two running sums, pos_sum and neg_sum, instead of computing weight as f32 * input per element. Since the weights are ternary, each input only needs to be added to one of the two accumulators (or skipped), with a single pos_sum - neg_sum at the end. This is more branch-predictor friendly and enables better SIMD auto-vectorization.
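Side by side, the two inner-loop formulations look like this (function names are illustrative; both compute the same dot product):

```rust
// Before: per-element cast and multiply; the product value depends on the weight.
fn dot_cast(w: &[i8], x: &[f32]) -> f32 {
    w.iter().zip(x).map(|(&wi, &xi)| wi as f32 * xi).sum()
}

// After: ternary weights only select an accumulator. The two running sums are
// independent, so the compiler can keep each in a vector register, and the
// subtract happens once outside the loop.
fn dot_split(w: &[i8], x: &[f32]) -> f32 {
    let (mut pos, mut neg) = (0.0f32, 0.0f32);
    for (&wi, &xi) in w.iter().zip(x) {
        if wi == 1 {
            pos += xi;
        } else if wi == -1 {
            neg += xi;
        }
    }
    pos - neg
}

fn main() {
    let w = [1i8, -1, 0, 1];
    let x = [0.5f32, 2.0, 3.0, 1.5];
    // 0.5 - 2.0 + 0 + 1.5 = 0.0 either way
    assert_eq!(dot_cast(&w, &x), dot_split(&w, &x));
}
```

Note the two versions can differ in the last float bit for long rows, since the split form changes the order of additions; for ternary inference that reordering is typically acceptable.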

Dependencies

  • Added rayon = "1.10" to Cargo.toml

Tests

  • 1596 passing (0 new failures)

[e6e5afb8]

gemma4_to_block_attnres() now parallelizes layer conversion with rayon.
Expected: 8 min → ~2 min on 4-core Skylake.
ternary_matmul_parallel(): processes seq positions (or output rows) in parallel.
CpuLinear::forward_parallel(): multi-threaded forward for large matrices.
Added rayon dependency.

@shift shift merged commit a9d826e into main Apr 22, 2026
4 checks passed
@shift shift deleted the feat/simd-ternary-speedup branch April 22, 2026 10:14
