x86_64 CPU parity: AVX2+FMA quantized matmul kernels and kernel-machine routing #87
…n lanes

Parallel matmul cutover:
- Remove ggml-inherited ith/nth thread-partition fields from all kernel op events; tensor views are now the only slice descriptor, with partition policy living solely at the orchestration fork site.
- Add 8-lane view-sliced fork/join matmul routes for prefill (GEMM) and decode (GEMV) behind explicit guards; lane kernels and thread pool are constructed once at backend init, dispatch stays allocation-free.
- Fix thread-pool scheduler fork/join lost-wakeup (Dekker race and destroy-during-release) in the join latch.

Decode wavefront:
- New text/generator/decode_wavefront component (co_sm lane orchestration) with lifecycle tests, focused bench, and eval tool.

Cross-engine comparison lanes:
- parallel_matmul bench gains a ggml reference lane (warm 8-thread ggml threadpool, plain GGUF-native blocks both sides). Evidence at dim 2048: EMEL wins prefill GEMM 0.843x; ggml leads GEMV (q8_0 2.7x, q6_k 2.5x, q4_k 4.2x, f32 9.6x) on per-kernel arithmetic, not orchestration.
- generation bench gains EMEL_BENCH_REFERENCE_THREADS for matched-core end-to-end compares; maintained publication rows stay at 1 thread.
- LFM2.5-230M-Q8_0 fixture wired end to end: fixture registry, workload manifests, preserve_thinking ChatML formatter contract (both EMEL and reference resolvers), 230M strict model contract.
- LFM2 attention-layer layout is now metadata-driven from per-layer lfm2.attention.head_count_kv instead of the hardcoded 1.2B block list, with lifecycle tests for the pattern layout and contradiction rejection.

Matched-thread evidence (230M, 8 threads both sides): steady-state decode within ~1.15x of llama.cpp; end-to-end gap concentrated in a ~443 ms EMEL first-token path (top follow-up), full decomposition in coroutine-plan.md.

Also: coverage lane fixes for gcovr 8.6 (merge-mode-functions separate, atomic profile updates, negative-hits tolerance); benchmark and lint snapshot refreshes pre-approved for this changeset.

Known open items: changed-line coverage for the new parallel decode action structs is below gate (needs lifecycle tests driving those routes); strict LFM2 1.2B x86 lane still blocked on the optimized plain-Q4 kernel.
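
For reviewers unfamiliar with the view-sliced cutover, here is a minimal sketch of the idea: the fork site decides the partition and hands each lane a view, so lane kernels need no ith/nth fields. The names `row_view`, `slice_rows`, `gemv_lane`, and `gemv_sliced` are hypothetical and not taken from the EMEL sources. The lanes run sequentially here, where the real route would fork them onto the thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view type: a contiguous run of rows of a row-major matrix.
struct row_view {
  const float* data;  // first element of the first row in the slice
  std::size_t rows;   // number of rows in the slice
  std::size_t cols;   // row length
  std::size_t row0;   // index of the first row in the parent matrix
};

// Partition policy lives only here, at the fork site: split `rows` into at
// most `lanes` contiguous slices whose sizes differ by at most one row.
inline std::vector<row_view> slice_rows(const float* a, std::size_t rows,
                                        std::size_t cols, std::size_t lanes) {
  assert(lanes > 0);
  std::vector<row_view> views;
  const std::size_t base = rows / lanes;
  const std::size_t extra = rows % lanes;
  std::size_t row0 = 0;
  for (std::size_t lane = 0; lane < lanes && row0 < rows; ++lane) {
    const std::size_t n = base + (lane < extra ? 1 : 0);
    views.push_back({a + row0 * cols, n, cols, row0});
    row0 += n;
  }
  return views;
}

// Lane kernel: it sees only its view and writes only its own output rows.
inline void gemv_lane(const row_view& v, const float* x, float* y) {
  for (std::size_t r = 0; r < v.rows; ++r) {
    float acc = 0.0f;
    for (std::size_t c = 0; c < v.cols; ++c) {
      acc += v.data[r * v.cols + c] * x[c];
    }
    y[v.row0 + r] = acc;
  }
}

// Fork/join stand-in: the lanes run one after another in this sketch.
inline std::vector<float> gemv_sliced(const std::vector<float>& a,
                                      std::size_t rows, std::size_t cols,
                                      const std::vector<float>& x,
                                      std::size_t lanes) {
  assert(a.size() == rows * cols && x.size() == cols);
  std::vector<float> y(rows, 0.0f);
  for (const row_view& v : slice_rows(a.data(), rows, cols, lanes)) {
    gemv_lane(v, x.data(), y.data());
  }
  return y;
}
```

Because the slices are disjoint row ranges, lanes never write the same output element, so the join only has to wait for completion.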
…Q8_0) and FMA F32 GEMM
- Port reference AVX2 arithmetic for Q4_K x Q8_K mul_mat with guard-routed
sm dispatch and optimized/shared counters
- Add AVX2+FMA row-dot kernels and execute paths for legacy quants
Q4_0/Q4_1/Q5_0/Q8_0 x Q8_0
- Add FMA variant of the blocked F32 GEMM, selected by explicit guard when
fma_available; AVX2-only path remains as fallback
- Fix latent scalar gap: run_mul_mat validated q4_0/q4_1 but had no compute
branch (requests reported done with untouched dst); add scalar branches
- clang-format src/emel/model/data.{cpp,hpp} (unformatted additions from
parent branch commit; lint lane had been skipped on hosts without
clang-format)
Gates: kernel parity ok, bench snapshot ok, changed-line coverage 97.6%,
lint clean. Paritychecker generation-baseline failures are pre-existing
(missing snapshots/parity/generation_lfm2_5_230m_q8_0_*).
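
As a point of reference for the Q8_0 x Q8_0 row-dot kernels, the scalar arithmetic that a SIMD kernel has to reproduce can be sketched as below. This is a simplified illustration, not the EMEL or ggml code: `block_q8_0_sketch` and `dot_q8_0_q8_0` are hypothetical names, and the per-block scale is held as `float` here, whereas GGUF stores it as fp16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t k_q8_0_block = 32;  // quantized values per Q8_0 block

// Simplified Q8_0 block (assumption: float scale instead of fp16).
struct block_q8_0_sketch {
  float d;                           // per-block scale
  std::int8_t qs[k_q8_0_block];      // quantized values
};

// Scalar row dot: integer multiply-accumulate inside each block, then one
// floating-point scale per block pair.
inline float dot_q8_0_q8_0(const block_q8_0_sketch* a,
                           const block_q8_0_sketch* b,
                           std::size_t nblocks) {
  assert(nblocks == 0 || (a != nullptr && b != nullptr));
  float sum = 0.0f;
  for (std::size_t i = 0; i < nblocks; ++i) {
    std::int32_t acc = 0;
    for (std::size_t j = 0; j < k_q8_0_block; ++j) {
      acc += static_cast<std::int32_t>(a[i].qs[j]) *
             static_cast<std::int32_t>(b[i].qs[j]);
    }
    sum += a[i].d * b[i].d * static_cast<float>(acc);
  }
  return sum;
}
```

A parity check would compare a vectorized kernel against a reference of this shape. The order of floating-point accumulation can differ between the two, so such a comparison usually needs a tolerance rather than bit equality.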
…tor AVX2+FMA kernel

- Embeddings, whisper encoder/decoder, and sortformer contexts own an emel::kernel::sm (kernel::any) and dispatch op_mul_mat via process_event; per-domain #if-aarch64/scalar dispatch duplicates deleted. Kernel machine guards now do arch/ISA routing for all these domains.
- Add shared emel::kernel::detect_host_kind() helper in any.hpp
- Fix dense src0 views carrying quantized byte-stride (nb[0]) that kernel validation would reject
- Add execute_avx2_fma_mul_mat_f32_vector_unchecked: dedicated matrix x vector (n==1) AVX2+FMA kernel with guard-routed sm row; the blocked GEMM degenerates to scalar tail for n==1
- x86 sortformer AMI bench: 20.07s (pre-refactor scalar) -> 2.24s, output checksum unchanged vs baseline

Known outstanding (documented, needs user decision):
- bench lane: snapshots/bench/benchmarks.txt baselines were recorded on the aarch64 host (contains kernel/aarch64 entries); x86 cannot meet ARM absolute-ns baselines for diarization_sortformer
- coverage lane: whisper decoder forward-pass lines are a pre-existing coverage hole now gating because the refactor touched them
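
The guard-routed selection among the F32 mul_mat paths described in this commit could look roughly like the sketch below. The guard order is an assumption drawn from the commit text (a dedicated n==1 vector kernel, an FMA GEMM when FMA is available, an AVX2-only fallback, then scalar). `host_features`, `f32_route`, and `select_f32_route` are hypothetical names, not the kernel machine's actual transition table.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host description; the real helper is detect_host_kind().
struct host_features {
  bool avx2;
  bool fma;
};

enum class f32_route { avx2_fma_vector, avx2_fma_gemm, avx2_gemm, scalar };

// Explicit guards, evaluated in priority order. `n` is the number of
// columns of the right-hand operand; n == 1 is the matrix x vector case.
inline bool guard_avx2_fma_vector(host_features h, std::size_t n) {
  return h.avx2 && h.fma && n == 1;
}
inline bool guard_avx2_fma_gemm(host_features h) { return h.avx2 && h.fma; }
inline bool guard_avx2_gemm(host_features h) { return h.avx2; }

inline f32_route select_f32_route(host_features h, std::size_t n) {
  assert(n > 0);
  if (guard_avx2_fma_vector(h, n)) return f32_route::avx2_fma_vector;
  if (guard_avx2_fma_gemm(h)) return f32_route::avx2_fma_gemm;
  if (guard_avx2_gemm(h)) return f32_route::avx2_gemm;
  return f32_route::scalar;
}
```

Routing n==1 ahead of the blocked GEMM matters because, per the commit message, the blocked kernel would otherwise fall through to its scalar tail for a single column.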
…matrix matmul path

- Encoder/decoder lifecycle tests now run q4_0, q4_1, and q8_0+f32-aux weight variants end-to-end through the machines with the tiny fixture, exercising compute_decoder_cross_cache, run_decoder_layer_sequence, the logits paths, and the kernel-machine dispatch wiring per variant
- Root cause: the changed-line coverage gate keys records by line number, so never-executed template instantiations shadowed covered q8_0 records
- Add embeddings matmul_f32_matrix multi-column test and pointwise lane-rejection test
- sortformer_bench: thread the kernel machine into the stage-profile compute_speaker_probabilities call (one-time static construction)

Coverage gate: changed-line 83/83 (100%), branches 55.8%, exit 0. Speech shard 50/50 cases green.
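
The root cause named in this commit can be illustrated with a small sketch. It assumes, hypothetically, that a gate keyed by line number keeps whichever record it reads last. The actual gate logic is not shown in this PR, so `line_record`, `keyed_overwrite`, and `keyed_merge` are illustrative only.

```cpp
#include <cassert>
#include <map>
#include <vector>

// One coverage record per (template instantiation, source line).
struct line_record {
  int line;
  long hits;
};

// Assumed failure mode: keying by line number alone lets a later record
// replace an earlier one, so a zero-hit instantiation can hide a covered one.
inline std::map<int, long> keyed_overwrite(const std::vector<line_record>& recs) {
  std::map<int, long> by_line;
  for (const line_record& r : recs) {
    assert(r.line > 0);
    by_line[r.line] = r.hits;
  }
  return by_line;
}

// Alternative shown for contrast: summing hits per line keeps a line covered
// when any instantiation executed it.
inline std::map<int, long> keyed_merge(const std::vector<line_record>& recs) {
  std::map<int, long> by_line;
  for (const line_record& r : recs) {
    assert(r.line > 0);
    by_line[r.line] += r.hits;
  }
  return by_line;
}
```

The commit takes the other way out: it drives the q4_0 and q4_1 instantiations through the lifecycle tests so that every record for a changed line has hits.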
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 86c7808.
[agents.gsd-verifier]
description = "Verifies phase goal achievement through goal-backward analysis. Checks codebase delivers what phase promised, not just that tasks completed. Creates VERIFICATION.md report."
- config_file = "/Users/gabrielwillen/VSCode/stateforward/emel/emel.cpp/.codex/agents/gsd-verifier.toml"
+ config_file = "/shared/stateforward/emel.cpp/.codex/agents/gsd-verifier.toml"
Committed absolute Codex agent paths
Low Severity
Every config_file entry now points at /shared/stateforward/emel.cpp/.codex/agents/..., a host-specific absolute tree. That replaces the prior /Users/... paths with another machine-bound layout, so Codex agent configs fail on typical clones unless that exact mount exists.
Triggered by learned rule: No absolute local filesystem paths in committed files


Summary
emel::kernel::any) so sm guards do all arch/ISA routing; per-domain #if aarch64 / scalar dispatch duplicates deleted
Performance (Ryzen 9 5950X)
Gates
snapshots/bench/benchmarks.txt was recorded on the aarch64 host and is unreachable on x86 (needs per-arch baselines, a separate decision); snapshots/parity/generation_lfm2_5_230m_q8_0_* baselines were never committed (parent-branch gap)
Note
Includes parent-branch commit 20d43c7 (view-sliced parallel matmul cutover) from co-sm-graph-processor-plan, on which this work is stacked.
Note
Low Risk
Changes are planning, generated architecture diagrams, and local tool paths; no runtime src/ edits in this diff. Residual risk is doc/planning drift if implementation and snapshots diverge from the archived audit claims.
Overview
Closes and archives milestone v1.27 (Ryzen AVX2/FMA kernel support) in active planning: PROJECT, STATE, ROADMAP, MILESTONES, and RETROSPECTIVE now record v1.27 as shipped (2026-06-25), no open milestone, and v1.26 as the prior I/O staged-read slice. Adds or relocates milestone artifacts under .planning/milestones/ (v1.26 audit/roadmap, v1.27 roadmap/requirements/audit).
Regenerated x86_64 kernel architecture docs (.planning/architecture/kernel_x86_64.md and mermaid) to match the kernel state machine: explicit dispatch_op_mul_mat transitions for q2_K / q3_K / q6_K × q8_K SIMD guards, and flash attention split into an AVX2/FMA one-chunk path vs shared fallback.
Portable Codex agent config: .codex/config.toml config_file paths move from a user-specific tree to /shared/stateforward/emel.cpp/....