x86_64 CPU parity: AVX2+FMA quantized matmul kernels and kernel-machine routing #87

Merged
gabewillen merged 6 commits into main from x86-avx2-quant-kernels on Jul 2, 2026

@gabewillen gabewillen commented Jul 2, 2026

Summary

  • Add AVX2+FMA x86_64 kernels: Q4_K×Q8_K matmul, legacy quants Q4_0/Q4_1/Q5_0/Q8_0×Q8_0, FMA F32 blocked GEMM, and a dedicated f32 matrix×vector kernel (the GEMM degenerates to scalar for n==1)
  • Fix latent scalar bug: run_mul_mat validated q4_0/q4_1 but had no compute branch (reported done with untouched dst); scalar branches added
  • Route embeddings, whisper encoder/decoder, and sortformer matmuls through the kernel machine (emel::kernel::any) so sm guards do all arch/ISA routing; per-domain #if aarch64 / scalar dispatch duplicates deleted
  • Cover all whisper linear weight-variant template instantiations with machine-level tests (changed-line coverage gate: 65% → 100%)
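
The latent scalar bug in the second bullet is a mismatch between validation and compute. A minimal C++ sketch of that failure class and one way to catch it follows. The names `dtype`, `is_supported`, `run_scalar`, and `writes_all_outputs` are hypothetical and not taken from the emel sources, and the "compute" is a stand-in rather than a real dot product.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical type tags; the real emel enum and signatures may differ.
enum class dtype { f32, q4_0, q4_1, q8_0 };

// Validation accepts every listed type.
inline bool is_supported(dtype t) {
  switch (t) {
    case dtype::f32:
    case dtype::q4_0:
    case dtype::q4_1:
    case dtype::q8_0:
      return true;
  }
  return false;
}

// Returns true only when a compute branch actually wrote dst.
// The bug class described above: a validated type with no case here
// would skip the switch and still be reported as done.
inline bool run_scalar(dtype t, std::vector<float>& dst) {
  assert(is_supported(t));
  switch (t) {
    case dtype::f32:
    case dtype::q4_0:  // branch of the kind added by the fix
    case dtype::q4_1:  // branch of the kind added by the fix
    case dtype::q8_0:
      for (float& v : dst) v = 1.0f;  // stand-in for the real row dot products
      return true;
  }
  return false;  // unhandled type: report failure instead of "done"
}

// Pre-fill dst with NaN, run, and check that every element was overwritten.
inline bool writes_all_outputs(dtype t, std::size_t n) {
  std::vector<float> dst(n, std::numeric_limits<float>::quiet_NaN());
  if (!run_scalar(t, dst)) return false;
  for (float v : dst) {
    if (std::isnan(v)) return false;
  }
  return true;
}
```

A sentinel check like `writes_all_outputs` is one way a test could detect a request that reports done while leaving `dst` untouched.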

Performance (Ryzen 9 5950X)

  • Sortformer AMI diarization bench: 20.07s → 2.24s (≈9×), output checksum unchanged
  • Kernel bench, emel time relative to the llama.cpp reference lane: q2_k 0.78x, q3_k 1.09x, f32 mul_mat 0.39x (below 1.0x means emel is faster)

Gates

  • kernel parity ok, lint clean, changed-line coverage 100%, full test suite green
  • Known red (documented; pre-existing or provenance-related): the diarization bench baseline in snapshots/bench/benchmarks.txt was recorded on the aarch64 host and is unreachable on x86 (needs per-arch baselines, a separate decision); the snapshots/parity/generation_lfm2_5_230m_q8_0_* baselines were never committed (parent-branch gap)

Note

Includes parent-branch commit 20d43c7 (view-sliced parallel matmul cutover) from co-sm-graph-processor-plan, on which this work is stacked.


Note

Low Risk
Changes are planning, generated architecture diagrams, and local tool paths—no runtime src/ edits in this diff. Residual risk is doc/planning drift if implementation and snapshots diverge from the archived audit claims.

Overview
Closes and archives milestone v1.27 (Ryzen AVX2/FMA kernel support) in active planning: PROJECT, STATE, ROADMAP, MILESTONES, and RETROSPECTIVE now record v1.27 as shipped (2026-06-25), no open milestone, and v1.26 as the prior I/O staged-read slice. Adds or relocates milestone artifacts under .planning/milestones/ (v1.26 audit/roadmap, v1.27 roadmap/requirements/audit).

Regenerated x86_64 kernel architecture docs (.planning/architecture/kernel_x86_64.md and mermaid) to match the kernel state machine: explicit dispatch_op_mul_mat transitions for q2_K / q3_K / q6_K × q8_K SIMD guards, and flash attention split into an AVX2/FMA one-chunk path vs shared fallback.
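
The guard-based transitions described above can be pictured as an ordered set of capability checks with a shared fallback. The following C++ sketch is illustrative only: `host_caps`, `route`, and `select_mul_mat_route` are hypothetical names, and the real state machine expresses this as guards on transitions rather than as a plain function.

```cpp
#include <cassert>

// Hypothetical capability flags and route tags; not the emel types.
struct host_caps {
  bool avx2;
  bool fma;
};
enum class route { avx2_fma, avx2, shared };

// Guards are checked in priority order; the shared path is the fallback.
inline route select_mul_mat_route(host_caps caps) {
  if (caps.avx2 && caps.fma) return route::avx2_fma;
  if (caps.avx2) return route::avx2;
  return route::shared;
}

// Example checks of the priority order.
inline void check_routes() {
  assert(select_mul_mat_route({true, true}) == route::avx2_fma);
  assert(select_mul_mat_route({true, false}) == route::avx2);
  assert(select_mul_mat_route({false, false}) == route::shared);
}
```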

Portable Codex agent config: .codex/config.toml config_file paths move from a user-specific tree to /shared/stateforward/emel.cpp/....

Reviewed by Cursor Bugbot for commit 86c7808. Bugbot is set up for automated code reviews on this repo.

…n lanes

Parallel matmul cutover:
- Remove ggml-inherited ith/nth thread-partition fields from all kernel op
  events; tensor views are now the only slice descriptor, with partition
  policy living solely at the orchestration fork site.
- Add 8-lane view-sliced fork/join matmul routes for prefill (GEMM) and
  decode (GEMV) behind explicit guards; lane kernels and thread pool are
  constructed once at backend init, dispatch stays allocation-free.
- Fix thread-pool scheduler fork/join lost-wakeup (Dekker race and
  destroy-during-release) in the join latch.
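
The view-sliced routes above move the partition decision to the fork site instead of passing `ith`/`nth` into each kernel. A minimal C++ sketch of such a partitioning step follows, assuming a hypothetical `row_view` type and an even row split. The actual emel partition policy is not shown in this commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical slice descriptor: a half-open row range [begin, end).
struct row_view {
  std::size_t begin;
  std::size_t end;
};

// Split `rows` output rows into at most `lanes` contiguous views.
// Earlier lanes take one extra row when rows is not divisible by lanes.
inline std::vector<row_view> slice_rows(std::size_t rows, std::size_t lanes) {
  assert(lanes > 0);
  std::vector<row_view> views;
  const std::size_t base = rows / lanes;
  const std::size_t extra = rows % lanes;
  std::size_t begin = 0;
  for (std::size_t lane = 0; lane < lanes && begin < rows; ++lane) {
    const std::size_t count = base + (lane < extra ? 1 : 0);
    views.push_back({begin, begin + count});
    begin += count;
  }
  return views;
}
```

The sketch returns a `std::vector` for brevity; an allocation-free dispatch path would write the views into storage prepared at init time instead.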

Decode wavefront:
- New text/generator/decode_wavefront component (co_sm lane orchestration)
  with lifecycle tests, focused bench, and eval tool.

Cross-engine comparison lanes:
- parallel_matmul bench gains a ggml reference lane (warm 8-thread ggml
  threadpool, plain GGUF-native blocks both sides). Evidence at dim 2048:
  EMEL wins prefill GEMM 0.843x; ggml leads GEMV (q8_0 2.7x, q6_k 2.5x,
  q4_k 4.2x, f32 9.6x) on per-kernel arithmetic, not orchestration.
- generation bench gains EMEL_BENCH_REFERENCE_THREADS for matched-core
  end-to-end compares; maintained publication rows stay at 1 thread.
- LFM2.5-230M-Q8_0 fixture wired end to end: fixture registry, workload
  manifests, preserve_thinking ChatML formatter contract (both EMEL and
  reference resolvers), 230M strict model contract.
- LFM2 attention-layer layout is now metadata-driven from per-layer
  lfm2.attention.head_count_kv instead of the hardcoded 1.2B block list,
  with lifecycle tests for the pattern layout and contradiction rejection.
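
As an illustration of a metadata-driven layout, here is a minimal C++ sketch. It assumes that a per-layer KV head count of zero marks a non-attention layer, and it treats a length mismatch between the metadata array and the block count as one possible contradiction. Both are assumptions for the example, and `layer_kind` and `layout_from_metadata` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class layer_kind { conv, attention };

// Assumption: head_count_kv[i] == 0 marks a non-attention layer.
// Returns false (leaving `out` empty) when the metadata length disagrees
// with block_count.
inline bool layout_from_metadata(const std::vector<std::uint32_t>& head_count_kv,
                                 std::size_t block_count,
                                 std::vector<layer_kind>& out) {
  out.clear();
  if (head_count_kv.size() != block_count) return false;
  out.reserve(block_count);
  for (std::uint32_t kv : head_count_kv) {
    out.push_back(kv == 0 ? layer_kind::conv : layer_kind::attention);
  }
  assert(out.size() == block_count);
  return true;
}
```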

Matched-thread evidence (230M, 8 threads both sides): steady-state decode
within ~1.15x of llama.cpp; end-to-end gap concentrated in a ~443 ms EMEL
first-token path (top follow-up), full decomposition in coroutine-plan.md.

Also: coverage lane fixes for gcovr 8.6 (merge-mode-functions separate,
atomic profile updates, negative-hits tolerance); benchmark and lint
snapshot refreshes pre-approved for this changeset.

Known open items: changed-line coverage for the new parallel decode action
structs is below gate (needs lifecycle tests driving those routes); strict
LFM2 1.2B x86 lane still blocked on the optimized plain-Q4 kernel.
…Q8_0) and FMA F32 GEMM

- Port reference AVX2 arithmetic for Q4_K x Q8_K mul_mat with guard-routed
  sm dispatch and optimized/shared counters
- Add AVX2+FMA row-dot kernels and execute paths for legacy quants
  Q4_0/Q4_1/Q5_0/Q8_0 x Q8_0
- Add FMA variant of the blocked F32 GEMM, selected by explicit guard when
  fma_available; AVX2-only path remains as fallback
- Fix latent scalar gap: run_mul_mat validated q4_0/q4_1 but had no compute
  branch (requests reported done with untouched dst); add scalar branches
- clang-format src/emel/model/data.{cpp,hpp} (unformatted additions from
  parent branch commit; lint lane had been skipped on hosts without
  clang-format)
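
For orientation, a scalar reference for a Q8_0 x Q8_0 row dot can be sketched as below. This is not the AVX2+FMA kernel from the commit. It is a simplified C++ model in which each block holds 32 int8 quants and one scale, with the scale stored as `float` for brevity (GGUF stores it as fp16). `block_q8` and `row_dot_q8` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified Q8_0-style block: 32 int8 quants sharing one scale.
// Assumption: float scale here; the on-disk format uses fp16.
constexpr std::size_t kBlock = 32;
struct block_q8 {
  float d;
  std::int8_t qs[kBlock];
};

// Scalar reference for one row dot product over n_blocks blocks.
inline float row_dot_q8(const block_q8* a, const block_q8* b, std::size_t n_blocks) {
  assert(a != nullptr && b != nullptr);
  float sum = 0.0f;
  for (std::size_t i = 0; i < n_blocks; ++i) {
    std::int32_t acc = 0;  // integer accumulation inside the block
    for (std::size_t j = 0; j < kBlock; ++j) {
      acc += static_cast<std::int32_t>(a[i].qs[j]) * static_cast<std::int32_t>(b[i].qs[j]);
    }
    // Apply both block scales once per block.
    sum += a[i].d * b[i].d * static_cast<float>(acc);
  }
  return sum;
}
```

A reference of this shape is what a SIMD row-dot kernel is typically checked against in a parity test.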

Gates: kernel parity ok, bench snapshot ok, changed-line coverage 97.6%,
lint clean. Paritychecker generation-baseline failures are pre-existing
(missing snapshots/parity/generation_lfm2_5_230m_q8_0_*).
…tor AVX2+FMA kernel

- Embeddings, whisper encoder/decoder, and sortformer contexts own an
  emel::kernel::sm (kernel::any) and dispatch op_mul_mat via
  process_event; per-domain #if-aarch64/scalar dispatch duplicates deleted.
  Kernel machine guards now do arch/ISA routing for all these domains.
- Add shared emel::kernel::detect_host_kind() helper in any.hpp
- Fix dense src0 views carrying quantized byte-stride (nb[0]) that kernel
  validation would reject
- Add execute_avx2_fma_mul_mat_f32_vector_unchecked: dedicated matrix x
  vector (n==1) AVX2+FMA kernel with guard-routed sm row; the blocked GEMM
  degenerates to scalar tail for n==1
- x86 sortformer AMI bench: 20.07s (pre-refactor scalar) -> 2.24s, output
  checksum unchanged vs baseline
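
The reason a dedicated n==1 kernel helps can be sketched in portable C++. The example assumes the blocked GEMM tiles over output columns with a fixed width (`kColBlock` is a hypothetical value), so a single column falls entirely into the tail. The matrix x vector routine instead takes its parallelism from the k dimension, and `std::fma` with four independent accumulators stands in for the AVX2+FMA intrinsics.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Columns handled by a column-blocked GEMM path; the rest go to a scalar tail.
// Hypothetical block width; the real kernel's tiling is not shown here.
constexpr std::size_t kColBlock = 8;
inline std::size_t blocked_columns(std::size_t n) { return (n / kColBlock) * kColBlock; }

// y = W x for a row-major W of shape rows x k.
// Four independent accumulators mimic what SIMD lanes would do.
inline void matvec_f32(const float* w, const float* x, float* y,
                       std::size_t rows, std::size_t k) {
  assert(w != nullptr && x != nullptr && y != nullptr);
  for (std::size_t r = 0; r < rows; ++r) {
    const float* row = w + r * k;
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= k; i += 4) {
      for (std::size_t l = 0; l < 4; ++l) {
        acc[l] = std::fma(row[i + l], x[i + l], acc[l]);
      }
    }
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < k; ++i) sum = std::fma(row[i], x[i], sum);  // remainder elements
    y[r] = sum;
  }
}
```

With `kColBlock` set to 8, `blocked_columns(1)` is 0, which is the degenerate case the commit message describes.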

Known outstanding (documented, needs user decision):
- bench lane: snapshots/bench/benchmarks.txt baselines were recorded on the
  aarch64 host (contains kernel/aarch64 entries); x86 cannot meet ARM
  absolute-ns baselines for diarization_sortformer
- coverage lane: whisper decoder forward-pass lines are a pre-existing
  coverage hole now gating because the refactor touched them
…matrix matmul path

- Encoder/decoder lifecycle tests now run q4_0, q4_1, and q8_0+f32-aux
  weight variants end-to-end through the machines with the tiny fixture,
  exercising compute_decoder_cross_cache, run_decoder_layer_sequence, the
  logits paths, and the kernel-machine dispatch wiring per variant
- Root cause: the changed-line coverage gate keys records by line number,
  so never-executed template instantiations shadowed covered q8_0 records
- Add embeddings matmul_f32_matrix multi-column test and pointwise
  lane-rejection test
- sortformer_bench: thread the kernel machine into the stage-profile
  compute_speaker_probabilities call (one-time static construction)
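
The root-cause bullet above can be illustrated with a small C++ model. How the actual gate merges records is not shown in this commit message, so the last-writer-wins map below is an assumption, and `record`, `key_by_line`, and `sum_by_line` are hypothetical names. Each template instantiation emits its own record for the same source line, and a later zero-hit record can hide an earlier covered one.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// One coverage record: (source line, execution count). A templated function
// emits one record per instantiation, all with the same line number.
using record = std::pair<int, std::uint64_t>;

// Last-writer-wins keyed by line: a later never-executed instantiation
// overwrites an earlier covered one.
inline std::map<int, std::uint64_t> key_by_line(const std::vector<record>& recs) {
  std::map<int, std::uint64_t> hits;
  for (const record& r : recs) {
    assert(r.first > 0);  // line numbers are 1-based
    hits[r.first] = r.second;
  }
  return hits;
}

// Summing instead keeps a line covered if any instantiation ran it.
inline std::map<int, std::uint64_t> sum_by_line(const std::vector<record>& recs) {
  std::map<int, std::uint64_t> hits;
  for (const record& r : recs) {
    assert(r.first > 0);
    hits[r.first] += r.second;
  }
  return hits;
}
```

Per the bullets above, the remedy in this commit was to execute every weight-variant instantiation in tests rather than to change how records are merged.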

Coverage gate: changed-line 83/83 (100%), branches 55.8%, exit 0.
Speech shard 50/50 cases green.
Copilot AI review requested due to automatic review settings July 2, 2026 13:14

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@gabewillen gabewillen merged commit ee26ae8 into main Jul 2, 2026
1 check passed

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 86c7808.

Comment thread .codex/config.toml
[agents.gsd-verifier]
description = "Verifies phase goal achievement through goal-backward analysis. Checks codebase delivers what phase promised, not just that tasks completed. Creates VERIFICATION.md report."
- config_file = "/Users/gabrielwillen/VSCode/stateforward/emel/emel.cpp/.codex/agents/gsd-verifier.toml"
+ config_file = "/shared/stateforward/emel.cpp/.codex/agents/gsd-verifier.toml"

Committed absolute Codex agent paths

Low Severity

Every config_file entry now points at /shared/stateforward/emel.cpp/.codex/agents/..., a host-specific absolute tree. That replaces the prior /Users/... paths with another machine-bound layout, so Codex agent configs fail on typical clones unless that exact mount exists.

Triggered by learned rule: No absolute local filesystem paths in committed files
