x86_64 CPU parity: AVX2+FMA quantized matmul kernels and kernel-machine routing by gabewillen · Pull Request #87 · stateforward/emel.cpp

gabewillen · 2026-07-02T13:14:58Z

Summary

Add AVX2+FMA x86_64 kernels: Q4_K×Q8_K matmul, legacy quants Q4_0/Q4_1/Q5_0/Q8_0×Q8_0, FMA F32 blocked GEMM, and a dedicated f32 matrix×vector kernel (the GEMM degenerates to scalar for n==1)
Fix latent scalar bug: run_mul_mat validated q4_0/q4_1 but had no compute branch (reported done with untouched dst); scalar branches added
Route embeddings, whisper encoder/decoder, and sortformer matmuls through the kernel machine (emel::kernel::any) so sm guards do all arch/ISA routing; per-domain #if aarch64 / scalar dispatch duplicates deleted
Cover all whisper linear weight-variant template instantiations with machine-level tests (changed-line coverage gate: 65% → 100%)
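The latent scalar bug in the second bullet can be pictured with a minimal sketch. The names below (`dtype`, `status`, `run_buggy`, `run_fixed`) are hypothetical and are not emel's actual `run_mul_mat` signature; the integer `dst` stands in for the real output tensor. The sketch only shows how a validation list that is wider than the set of compute branches lets a request report done without writing its destination.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; the real emel types are not shown in this PR page.
enum class dtype : std::uint8_t { f32, q4_0, q4_1, q8_0 };
enum class status : std::uint8_t { done, unsupported };

// Buggy shape: validation accepts q4_0/q4_1, but no compute branch handles
// them, so the call falls through and still reports done with dst untouched.
inline status run_buggy(dtype t, int & dst) {
  const bool valid = t == dtype::f32 || t == dtype::q4_0 ||
                     t == dtype::q4_1 || t == dtype::q8_0;
  if (!valid) {
    return status::unsupported;
  }
  if (t == dtype::f32 || t == dtype::q8_0) {
    dst = 1;  // stand-in for the real compute
  }
  return status::done;
}

// Fixed shape: done is only returned from a branch that actually computed.
inline status run_fixed(dtype t, int & dst) {
  switch (t) {
    case dtype::f32:
    case dtype::q8_0:
    case dtype::q4_0:
    case dtype::q4_1:
      dst = 1;  // stand-in for the real compute
      return status::done;
  }
  return status::unsupported;
}
```

Returning the status from inside each compute branch, rather than after a separate validation step, makes it harder for the two lists to drift apart.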

Performance (Ryzen 9 5950X)

Sortformer AMI diarization bench: 20.07s → 2.24s (9×), output checksum unchanged
Kernel bench vs llama.cpp reference lane: q2_k 0.78x, q3_k 1.09x, f32 mul_mat 0.39x (lower = emel faster)

Gates

kernel parity ok, lint clean, changed-line coverage 100%, full test suite green
Known red (documented, pre-existing/provenance): diarization bench baseline in snapshots/bench/benchmarks.txt was recorded on the aarch64 host and is unreachable on x86 (needs per-arch baselines — separate decision); snapshots/parity/generation_lfm2_5_230m_q8_0_* baselines were never committed (parent-branch gap)

Note

Includes parent-branch commit 20d43c7 (view-sliced parallel matmul cutover) from co-sm-graph-processor-plan, on which this work is stacked.

Note

Low Risk
Changes are planning, generated architecture diagrams, and local tool paths—no runtime src/ edits in this diff. Residual risk is doc/planning drift if implementation and snapshots diverge from the archived audit claims.

Overview
Closes and archives milestone v1.27 (Ryzen AVX2/FMA kernel support) in active planning: PROJECT, STATE, ROADMAP, MILESTONES, and RETROSPECTIVE now record v1.27 as shipped (2026-06-25), no open milestone, and v1.26 as the prior I/O staged-read slice. Adds or relocates milestone artifacts under .planning/milestones/ (v1.26 audit/roadmap, v1.27 roadmap/requirements/audit).

Regenerated x86_64 kernel architecture docs (.planning/architecture/kernel_x86_64.md and mermaid) to match the kernel state machine: explicit dispatch_op_mul_mat transitions for q2_K / q3_K / q6_K × q8_K SIMD guards, and flash attention split into an AVX2/FMA one-chunk path vs shared fallback.

Portable Codex agent config: .codex/config.toml config_file paths move from a user-specific tree to /shared/stateforward/emel.cpp/....

Reviewed by Cursor Bugbot for commit 86c7808. Bugbot is set up for automated code reviews on this repo.

…n lanes

Parallel matmul cutover:
- Remove ggml-inherited ith/nth thread-partition fields from all kernel op events; tensor views are now the only slice descriptor, with partition policy living solely at the orchestration fork site.
- Add 8-lane view-sliced fork/join matmul routes for prefill (GEMM) and decode (GEMV) behind explicit guards; lane kernels and thread pool are constructed once at backend init, dispatch stays allocation-free.
- Fix thread-pool scheduler fork/join lost-wakeup (Dekker race and destroy-during-release) in the join latch.

Decode wavefront:
- New text/generator/decode_wavefront component (co_sm lane orchestration) with lifecycle tests, focused bench, and eval tool.

Cross-engine comparison lanes:
- parallel_matmul bench gains a ggml reference lane (warm 8-thread ggml threadpool, plain GGUF-native blocks both sides). Evidence at dim 2048: EMEL wins prefill GEMM 0.843x; ggml leads GEMV (q8_0 2.7x, q6_k 2.5x, q4_k 4.2x, f32 9.6x) on per-kernel arithmetic, not orchestration.
- generation bench gains EMEL_BENCH_REFERENCE_THREADS for matched-core end-to-end compares; maintained publication rows stay at 1 thread.
- LFM2.5-230M-Q8_0 fixture wired end to end: fixture registry, workload manifests, preserve_thinking ChatML formatter contract (both EMEL and reference resolvers), 230M strict model contract.
- LFM2 attention-layer layout is now metadata-driven from per-layer lfm2.attention.head_count_kv instead of the hardcoded 1.2B block list, with lifecycle tests for the pattern layout and contradiction rejection.

Matched-thread evidence (230M, 8 threads both sides): steady-state decode within ~1.15x of llama.cpp; end-to-end gap concentrated in a ~443 ms EMEL first-token path (top follow-up), full decomposition in coroutine-plan.md.

Also: coverage lane fixes for gcovr 8.6 (merge-mode-functions separate, atomic profile updates, negative-hits tolerance); benchmark and lint snapshot refreshes pre-approved for this changeset.
Known open items: changed-line coverage for the new parallel decode action structs is below gate (needs lifecycle tests driving those routes); strict LFM2 1.2B x86 lane still blocked on the optimized plain-Q4 kernel.
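As an illustration of "tensor views are the only slice descriptor", here is a minimal sketch of splitting a matmul's output rows into at most 8 contiguous lane views at the fork site. `row_view`, `row_slices`, and `slice_rows` are hypothetical names, and the even-split policy is an assumption; the commit message does not say how emel partitions rows. A fixed-size array is used so the sketch, like the dispatch described above, does not allocate.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t k_max_lanes = 8;  // matches the 8-lane routes above

// Hypothetical slice descriptor: a contiguous run of output rows.
struct row_view {
  std::size_t first_row;
  std::size_t row_count;
};

struct row_slices {
  std::array<row_view, k_max_lanes> views{};
  std::size_t count = 0;  // number of non-empty views
};

// Split `rows` into at most `lanes` contiguous views whose sizes differ by
// at most one row. Lanes that would receive zero rows are omitted.
inline row_slices slice_rows(std::size_t rows, std::size_t lanes) {
  assert(lanes > 0 && lanes <= k_max_lanes);
  row_slices out{};
  const std::size_t base = rows / lanes;
  const std::size_t extra = rows % lanes;
  std::size_t first = 0;
  for (std::size_t lane = 0; lane < lanes; ++lane) {
    const std::size_t count = base + (lane < extra ? 1u : 0u);
    if (count == 0) {
      break;
    }
    out.views[out.count++] = row_view{first, count};
    first += count;
  }
  return out;
}
```

With this shape each lane kernel only sees its own view, so no thread index or thread count has to travel inside the kernel op event.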

…Q8_0) and FMA F32 GEMM

- Port reference AVX2 arithmetic for Q4_K x Q8_K mul_mat with guard-routed sm dispatch and optimized/shared counters
- Add AVX2+FMA row-dot kernels and execute paths for legacy quants Q4_0/Q4_1/Q5_0/Q8_0 x Q8_0
- Add FMA variant of the blocked F32 GEMM, selected by explicit guard when fma_available; AVX2-only path remains as fallback
- Fix latent scalar gap: run_mul_mat validated q4_0/q4_1 but had no compute branch (requests reported done with untouched dst); add scalar branches
- clang-format src/emel/model/data.{cpp,hpp} (unformatted additions from parent branch commit; lint lane had been skipped on hosts without clang-format)

Gates: kernel parity ok, bench snapshot ok, changed-line coverage 97.6%, lint clean. Paritychecker generation-baseline failures are pre-existing (missing snapshots/parity/generation_lfm2_5_230m_q8_0_*).
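For readers unfamiliar with the row-dot shape these kernels compute, the following is a scalar sketch of a Q8_0 x Q8_0 dot product, the kind of reference a SIMD kernel is typically checked against. It assumes the ggml-style Q8_0 layout of 32 signed 8-bit quants per block with one scale per block; the scale is held as `float` here for simplicity, whereas ggml stores it as fp16. `block_q8` and `dot_q8_q8` are hypothetical names, not emel identifiers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t k_block = 32;  // quants per Q8_0 block

// Simplified block: float scale (assumption; ggml uses fp16) plus 32 int8 quants.
struct block_q8 {
  float d;
  std::array<std::int8_t, k_block> qs;
};

// Scalar row dot: per block, accumulate the integer products in 32 bits,
// then scale once by the product of the two block scales.
inline float dot_q8_q8(const block_q8 * a, const block_q8 * b,
                       std::size_t n_blocks) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n_blocks; ++i) {
    std::int32_t isum = 0;
    for (std::size_t j = 0; j < k_block; ++j) {
      isum += static_cast<std::int32_t>(a[i].qs[j]) *
              static_cast<std::int32_t>(b[i].qs[j]);
    }
    acc += a[i].d * b[i].d * static_cast<float>(isum);
  }
  return acc;
}
```

An AVX2+FMA version would vectorize the inner integer loop and fold the per-block scaling into a fused multiply-add; the arithmetic per block stays the same.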

…tor AVX2+FMA kernel

- Embeddings, whisper encoder/decoder, and sortformer contexts own an emel::kernel::sm (kernel::any) and dispatch op_mul_mat via process_event; per-domain #if-aarch64/scalar dispatch duplicates deleted. Kernel machine guards now do arch/ISA routing for all these domains.
- Add shared emel::kernel::detect_host_kind() helper in any.hpp
- Fix dense src0 views carrying quantized byte-stride (nb[0]) that kernel validation would reject
- Add execute_avx2_fma_mul_mat_f32_vector_unchecked: dedicated matrix x vector (n==1) AVX2+FMA kernel with guard-routed sm row; the blocked GEMM degenerates to scalar tail for n==1
- x86 sortformer AMI bench: 20.07s (pre-refactor scalar) -> 2.24s, output checksum unchanged vs baseline

Known outstanding (documented, needs user decision):
- bench lane: snapshots/bench/benchmarks.txt baselines were recorded on the aarch64 host (contains kernel/aarch64 entries); x86 cannot meet ARM absolute-ns baselines for diarization_sortformer
- coverage lane: whisper decoder forward-pass lines are a pre-existing coverage hole now gating because the refactor touched them
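The guard ordering described in these commits (dedicated n==1 vector kernel first, then FMA GEMM, then AVX2-only GEMM, then scalar) can be sketched as a plain function. This is an illustration only: emel expresses the routing as state-machine guards and transitions, and `host_caps`, `route`, and `select_f32_route` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical capability flags, e.g. filled in once at backend init.
struct host_caps {
  bool avx2;
  bool fma;
};

enum class route { avx2_fma_vector, avx2_fma_gemm, avx2_gemm, scalar };

// Guards are tested from most specific to least specific, so the n == 1
// case never reaches the blocked GEMM, where it would run as a scalar tail.
inline route select_f32_route(host_caps caps, std::size_t n_cols) {
  if (caps.avx2 && caps.fma && n_cols == 1) {
    return route::avx2_fma_vector;
  }
  if (caps.avx2 && caps.fma) {
    return route::avx2_fma_gemm;
  }
  if (caps.avx2) {
    return route::avx2_gemm;  // AVX2-only fallback
  }
  return route::scalar;
}
```

Keeping this decision in one place is what allows the per-domain `#if` dispatch copies to be deleted: callers submit the op and the guards pick the kernel.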

…matrix matmul path

- Encoder/decoder lifecycle tests now run q4_0, q4_1, and q8_0+f32-aux weight variants end-to-end through the machines with the tiny fixture, exercising compute_decoder_cross_cache, run_decoder_layer_sequence, the logits paths, and the kernel-machine dispatch wiring per variant
- Root cause: the changed-line coverage gate keys records by line number, so never-executed template instantiations shadowed covered q8_0 records
- Add embeddings matmul_f32_matrix multi-column test and pointwise lane-rejection test
- sortformer_bench: thread the kernel machine into the stage-profile compute_speaker_probabilities call (one-time static construction)

Coverage gate: changed-line 83/83 (100%), branches 55.8%, exit 0. Speech shard 50/50 cases green.
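The root cause named above can be illustrated with a small sketch of how per-line coverage records from several template instantiations might collapse. The exact mechanics of emel's gate are not shown in this PR, so the "last record wins" merge below is an assumption chosen to reproduce the described symptom; `record`, `merge_last_wins`, and `merge_max_hits` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// One coverage record: a source line and how often one instantiation hit it.
struct record {
  int line;
  long hits;
};

// Assumed failure mode: keyed by line only, a later zero-hit record from a
// never-executed instantiation overwrites an earlier covered record.
inline std::map<int, long> merge_last_wins(const std::vector<record> & rs) {
  std::map<int, long> merged;
  for (const record & r : rs) {
    merged[r.line] = r.hits;
  }
  return merged;
}

// Alternative merge: a line counts as covered if any instantiation hit it.
inline std::map<int, long> merge_max_hits(const std::vector<record> & rs) {
  std::map<int, long> merged;
  for (const record & r : rs) {
    auto it = merged.find(r.line);
    if (it == merged.end()) {
      merged[r.line] = r.hits;
    } else {
      it->second = std::max(it->second, r.hits);
    }
  }
  return merged;
}
```

The commit resolves the problem from the other side: instead of changing how records merge, it adds tests that execute every weight-variant instantiation, so no zero-hit record exists to shadow a covered one.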

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

cursor · 2026-07-02T13:16:05Z

 [agents.gsd-verifier]
 description = "Verifies phase goal achievement through goal-backward analysis. Checks codebase delivers what phase promised, not just that tasks completed. Creates VERIFICATION.md report."
-config_file = "/Users/gabrielwillen/VSCode/stateforward/emel/emel.cpp/.codex/agents/gsd-verifier.toml"
+config_file = "/shared/stateforward/emel.cpp/.codex/agents/gsd-verifier.toml"


Committed absolute Codex agent paths

Low Severity

Every config_file entry now points at /shared/stateforward/emel.cpp/.codex/agents/..., a host-specific absolute tree. That replaces the prior /Users/... paths with another machine-bound layout, so Codex agent configs fail on typical clones unless that exact mount exists.

Triggered by learned rule: No absolute local filesystem paths in committed files

gabewillen added 6 commits June 25, 2026 16:35

feat: ship v1.27 ryzen avx2 fma support

2c86ae5

chore: refresh quality gates timing data

86c7808

Copilot AI review requested due to automatic review settings July 2, 2026 13:14

Copilot AI reviewed Jul 2, 2026

gabewillen merged commit ee26ae8 into main Jul 2, 2026
1 check passed

cursor Bot reviewed Jul 2, 2026

gabewillen mentioned this pull request Jul 2, 2026

Revert PR #87 merge #88

Merged

gabewillen added a commit that referenced this pull request Jul 2, 2026

Merge pull request #88 from stateforward/revert-pr-87

8d1c232

Revert PR #87 merge

gabewillen mentioned this pull request Jul 2, 2026

x86_64 CPU parity: AVX2+FMA quantized matmul kernels and kernel-machine routing #89

Merged

gabewillen merged 6 commits into main from x86-avx2-quant-kernels
