ggml-cpu: add repack GEMM and GEMV for floating-point by taimur-10x · Pull Request #4 · riseproject-dev/llama.cpp

taimur-10x · 2025-12-05T11:49:28Z

Summary

This PR adds repacking and GEMM/GEMV kernels for floating-point types (FP16 and FP32) on RVV (with the Zvfh extension).

Key Changes

- Added RVV kernels for GEMM with tiling:
  - 7 x {16, 32, 64, 128} (selected based on VLEN)
- Added RVV kernels for GEMV with tiling:
  - 1 x {16, 32, 64, 128} (selected based on VLEN)
- Added scalar functions for repacking. They support arbitrary tile sizes.
- Added generic scalar fallbacks for GEMM/GEMV operations.
- Refactored ggml_quantize_mat_t to ggml_repack_mat_t to allow a common interface for both quantization and floating-point repacking.
- Added a template parameter NB_ROWS to select the number of rows to interleave for repacking. Previously, this was fixed at 4.

Tile Sizes

The repack operation interleaves N rows of activations with an interleave size of K, and M columns of weights with an interleave size of K.
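The interleaving described above can be modeled in a few lines of Python. This is a reference sketch of the layout idea only, not the code in this PR: `repack_rows` and its parameters are hypothetical names, and it assumes the row count is a multiple of the group size and the row length is a multiple of the interleave size.

```python
def repack_rows(rows, group, k):
    """Interleave `group` rows at a time in chunks of `k` elements.

    rows  : list of equal-length lists (activation rows or weight columns)
    group : number of rows interleaved together (N or M in the text)
    k     : interleave size (K in the text)
    Returns one flat list per group of rows.
    """
    if len(rows) % group != 0:
        raise ValueError("row count must be a multiple of the group size")
    length = len(rows[0])
    if any(len(r) != length for r in rows) or length % k != 0:
        raise ValueError("rows must share a length that is a multiple of k")

    packed = []
    for g in range(0, len(rows), group):
        tile = []
        for start in range(0, length, k):      # walk along the shared dimension
            for r in rows[g:g + group]:        # take k elements from each row in turn
                tile.extend(r[start:start + k])
        packed.append(tile)
    return packed


# Two rows interleaved with k = 1: elements alternate between the rows.
example = repack_rows([[1, 2, 3, 4], [10, 20, 30, 40]], group=2, k=1)
print(example)
```

Under this layout with K = 1, each step along the shared dimension yields `group` contiguous values, one per row, so a single contiguous load can fetch one value for every interleaved weight column.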

NxK is fixed at 7x1. This introduces 7 accumulators with LMUL=4 (7 x 4 = 28 of the 32 vector registers), each accumulating M results.
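A scalar model of one N x M x 1 tile may help picture the accumulators. It is a sketch under assumptions: both inputs are taken to be interleaved with K = 1 (N activation values, then M weight values, per step of the shared dimension), `gemm_tile` is a hypothetical name, and the mapping of each accumulator row to a vector register group is a reading of the description rather than the actual RVV kernel.

```python
def gemm_tile(packed_act, packed_wt, n_rows, m_cols):
    """Compute one n_rows x m_cols output tile from K = 1 interleaved inputs.

    packed_act holds n_rows values per step of the shared dimension,
    packed_wt holds m_cols values per step.
    """
    depth = len(packed_act) // n_rows
    if len(packed_act) != depth * n_rows or len(packed_wt) != depth * m_cols:
        raise ValueError("packed inputs do not match the tile shape")

    # One accumulator per activation row; each holds m_cols partial sums.
    acc = [[0.0] * m_cols for _ in range(n_rows)]
    for step in range(depth):
        a = packed_act[step * n_rows:(step + 1) * n_rows]  # n_rows scalars
        w = packed_wt[step * m_cols:(step + 1) * m_cols]   # m_cols contiguous weights
        for i in range(n_rows):
            for j in range(m_cols):                         # scalar times vector, accumulated
                acc[i][j] += a[i] * w[j]
    return acc


# Activation rows [1, 2] and [3, 4]; weight columns [5, 6] and [7, 8].
tile = gemm_tile([1, 3, 2, 4], [5, 7, 6, 8], n_rows=2, m_cols=2)
print(tile)
```

With n_rows = 7, the weights for each step are read once and reused for all seven activation rows, which is the reuse the tiling is built around.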

M is varied based on the available VLEN:

VLEN	Tile Size (N x M x K)
128	7 x 16 x 1
256	7 x 32 x 1
512	7 x 64 x 1
1024	7 x 128 x 1

M is the maximum number of values that can be loaded into one vector register group (LMUL=2 for F16, LMUL=4 for F32), that is, M = VLEN x LMUL / element width in bits.
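The table can be reproduced from the LMUL settings quoted above. The helper name `tile_width` is hypothetical, and the count of 32 architectural vector registers comes from the RVV specification, not from this PR.

```python
def tile_width(vlen_bits, elem_bits, lmul):
    """Number of elements that fit in one register group: VLEN * LMUL / element width."""
    return vlen_bits * lmul // elem_bits


ACCUMULATORS = 7       # rows per tile (N)
ACC_LMUL = 4           # registers per accumulator, as stated in the text
VECTOR_REGISTERS = 32  # architectural vector registers in RVV

for vlen in (128, 256, 512, 1024):
    m_f16 = tile_width(vlen, 16, lmul=2)
    m_f32 = tile_width(vlen, 32, lmul=4)
    print(f"VLEN={vlen}: M(F16)={m_f16}, M(F32)={m_f32}")

registers_used = ACCUMULATORS * ACC_LMUL
registers_left = VECTOR_REGISTERS - registers_used
print(f"accumulators use {registers_used} registers, {registers_left} remain")
```

Both element types give the same M for a given VLEN, and the 7 accumulators leave 4 registers free, which is room for exactly one more LMUL=4 group.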

Testing

Kernels were functionally tested on QEMU with VLENs of 128, 256, 512 and 1024 bits across a range of input sizes.

Benchmarking Results

End-to-end benchmarking on a Banana Pi BPI-F3 (VLEN=256)

Prefill / Prompt Processing (GEMM)

Tokens / Second

Model	Prompt Size	Repack GEMM (7x32)	Vec Dot
Tinyllama F16 1.1B	28	24.72	8.31
Tinyllama F16 1.1B	32	16.72	8.42
Tinyllama F16 1.1B	64	22.55	8.57
Tinyllama F16 1.1B	128	22.78	8.78
Tinyllama F16 1.1B	256	21.82	8.57
Tinyllama F16 1.1B	512	21.81	8.68

Model	Prompt Size	Repack GEMM (7x32)	Vec Dot
Tinyllama F32 1.1B	28	11.45	3.72
Tinyllama F32 1.1B	32	7.13	3.75
Tinyllama F32 1.1B	64	10.76	3.74
Tinyllama F32 1.1B	128	10.86	3.73
Tinyllama F32 1.1B	256	10.94	3.68
Tinyllama F32 1.1B	512	11.12	3.79

Model	Prompt Size	Repack GEMM (7x32)	Vec Dot
BERT Large Uncased F16	28	82.25	32.13
BERT Large Uncased F16	32	63.16	27.33
BERT Large Uncased F16	64	79.16	30.01
BERT Large Uncased F16	128	76.20	30.80
BERT Large Uncased F16	256	66.09	28.01
BERT Large Uncased F16	512	50.57	24.26

Model	Prompt Size	Repack GEMM (7x32)	Vec Dot
BERT Large Uncased F32	28	43.29	11.16
BERT Large Uncased F32	32	29.17	11.08
BERT Large Uncased F32	64	37.31	11.35
BERT Large Uncased F32	128	39.70	11.16
BERT Large Uncased F32	256	35.62	11.95
BERT Large Uncased F32	512	31.60	11.18

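Result: roughly 2x-3x speedup over vec_dot (from about 1.9x at prompt=32 on Tinyllama up to about 3.9x on BERT Large F32 at prompt=28).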
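The per-row speedups can be checked directly from the tables. The snippet below copies only the prompt=28 and prompt=32 rows, which mark the best and worst cases for each model; the dictionary layout is just for illustration.

```python
# (model, prompt size): (repack GEMM tokens/s, vec_dot tokens/s), copied from the tables above
prefill = {
    ("Tinyllama F16", 28): (24.72, 8.31),
    ("Tinyllama F16", 32): (16.72, 8.42),
    ("Tinyllama F32", 28): (11.45, 3.72),
    ("Tinyllama F32", 32): (7.13, 3.75),
    ("BERT Large F16", 28): (82.25, 32.13),
    ("BERT Large F16", 32): (63.16, 27.33),
    ("BERT Large F32", 28): (43.29, 11.16),
    ("BERT Large F32", 32): (29.17, 11.08),
}

speedups = {key: repack / vec_dot for key, (repack, vec_dot) in prefill.items()}
for (model, prompt), ratio in speedups.items():
    print(f"{model:15s} prompt={prompt:3d}  speedup={ratio:.2f}x")
```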

Decode (GEMV)

Tokens / Second

Model	Decode Size (Prompt=32)	Repack GEMV (1x32)	Vec Dot
Tinyllama F16 1.1B	10	3.37	3.11
Tinyllama F16 1.1B	16	3.29	3.45
Tinyllama F16 1.1B	32	3.12	3.25
Tinyllama F16 1.1B	64	3.23	3.27
Tinyllama F16 1.1B	100	3.04	3.15
Tinyllama F16 1.1B	128	3.09	3.20
Tinyllama F16 1.1B	256	3.15	3.19

Model	Decode Size (Prompt=32)	Repack GEMV (1x32)	Vec Dot
Tinyllama F32 1.1B	10	1.66	1.74
Tinyllama F32 1.1B	16	1.73	1.63
Tinyllama F32 1.1B	32	1.81	1.68
Tinyllama F32 1.1B	64	1.61	1.69
Tinyllama F32 1.1B	100	1.72	1.75
Tinyllama F32 1.1B	128	1.76	1.72
Tinyllama F32 1.1B	256	1.75	1.69

Result: No noticeable improvement, as decode remains memory-bound.

Additional Notes

- The current fallback model requires every architecture to provide a scalar fallback for each implementation. This clutters arch-fallback.h, since the 7xMx1 tiling is very RVV-specific and should not be used by other architectures.
- GEMM reaches peak performance when the prompt length is a multiple of 7 (for example, prompt=28). Leftover rows fall back to GEMV, which hurts performance. Ideally, there should be leftover Nx32 kernels, one for each case of 2 to 6 leftover tokens.
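The leftover effect can be made concrete by splitting each benchmarked prompt length into full 7-row tiles and remaining rows. This sketch assumes the whole prompt is processed as a single batch and that every leftover row goes through GEMV, as described above; `split_prompt` is a hypothetical helper.

```python
def split_prompt(n_tokens, tile_rows=7):
    """Return (full GEMM tiles, leftover rows handled one at a time by GEMV)."""
    if n_tokens < 0 or tile_rows <= 0:
        raise ValueError("n_tokens must be non-negative and tile_rows positive")
    return divmod(n_tokens, tile_rows)


for n in (28, 32, 64, 128, 256, 512):
    tiles, leftover = split_prompt(n)
    share = 100.0 * leftover / n
    print(f"prompt={n:3d}: {tiles:2d} GEMM tiles + {leftover} GEMV rows ({share:.1f}% of rows)")
```

At prompt=32, 4 of the 32 rows (12.5%) take the GEMV path, which is consistent with the dip at that size in the prefill tables; at prompt=256 the same 4 leftover rows are under 2% of the work.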

Future Work

Subsequent PRs plan to add RVV kernels for quantization types, as well as extend existing quantization support to other VLENs.

luhenry · 2025-12-11T13:03:48Z

@xctan it would be lovely to have your review on that as it will go to upstream next.

xctan · 2025-12-11T14:53:16Z

Changing ncols_interleaved from a template parameter to a regular argument would offer a couple of advantages: it would allow vectorized handling of leftover elements and, more importantly, it would significantly reduce the number of specialized kernels in arch-fallback.h (a file created due to Mach-O limitations). The logic for selecting the vector length could then be moved into the inner kernels to avoid changing the unified function signatures. Furthermore, the NB_COL template parameter in the outer wrapper structs might need a new special value to represent an indefinite vector length, which would affect the related index calculations. Ultimately, this approach would lead to a cleaner implementation and provide flexibility for other variable-vector-length instruction sets like SVE.

Since my suggestion effectively refactors the repack logic to accommodate variable vector length architectures, I think it's perfectly acceptable to implement this in a follow-up PR after discussing it with other upstream maintainers.
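One way to picture the runtime-width idea is a scalar GEMV whose tile width is an ordinary argument. This is only an illustration of the suggestion as read here, not a proposed implementation: Python has no template parameters, `gemv_tiled` is a hypothetical name, and the weights are left unpacked to keep the sketch short.

```python
def gemv_tiled(activation, weight_cols, ncols_interleaved):
    """Compute y[j] = dot(activation, weight_cols[j]) in tiles of columns.

    ncols_interleaved is a regular argument, so the final tile can simply be
    narrower instead of requiring a separately specialized kernel.
    """
    if ncols_interleaved <= 0:
        raise ValueError("ncols_interleaved must be positive")
    out = []
    for first in range(0, len(weight_cols), ncols_interleaved):
        tile = weight_cols[first:first + ncols_interleaved]
        width = len(tile)              # may be smaller for the last tile
        acc = [0.0] * width
        for step, a in enumerate(activation):
            for j in range(width):
                acc[j] += a * tile[j][step]
        out.extend(acc)
    return out


# Three weight columns with a tile width of 2: one full tile, then a tile of width 1.
y = gemv_tiled([1, 2], [[1, 0], [0, 1], [1, 1]], ncols_interleaved=2)
print(y)
```

The result does not depend on the tile width, so one kernel body covers every width, including the leftover case.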

taimur-10x · 2025-12-23T10:10:59Z

> Since my suggestion effectively refactors the repack logic to accommodate variable vector length architectures, I think it's perfectly acceptable to implement this in a follow-up PR after discussing it with other upstream maintainers.

@xctan, I'll move this to upstream then to further discuss the structural changes around the code.

taimur-10x marked this pull request as draft December 5, 2025 11:49

github-actions Bot added the ggml label Dec 5, 2025

taimur-10x changed the base branch from 10x-repack-fp to master December 5, 2025 12:21

taimur-10x marked this pull request as ready for review December 7, 2025 22:34

taimur-10x force-pushed the rvv-repack-floating branch from cfc1da0 to 5b5b054 Compare December 8, 2025 20:00

taimur-10x self-assigned this Dec 8, 2025

taimur-10x requested review from david-baker-808 and luhenry December 9, 2025 09:30

luhenry approved these changes Dec 11, 2025

taimur-10x changed the base branch from master to 10x-repack-fp December 22, 2025 14:46

taimur-10x force-pushed the rvv-repack-floating branch 3 times, most recently from 41d77be to aab5b5a Compare December 23, 2025 09:59

ggml-cpu: add repack GEMM and GEMV for floating-point

6ddaa7d

taimur-10x force-pushed the rvv-repack-floating branch from aab5b5a to 6ddaa7d Compare December 23, 2025 10:04

Merge branch '10x-repack-fp' into rvv-repack-floating

880929d

taimur-10x merged commit f96d154 into 10x-repack-fp Dec 23, 2025
53 of 70 checks passed

taimur-10x added a commit that referenced this pull request Dec 23, 2025

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

7e97d30

taimur-10x added a commit that referenced this pull request Jan 9, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

12a1b52

taimur-10x added a commit that referenced this pull request Jan 27, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

0d9caad

taimur-10x added a commit that referenced this pull request Feb 14, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

d09f5df

taimur-10x added a commit that referenced this pull request Mar 4, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

d6fdaf4

taimur-10x added a commit that referenced this pull request Mar 4, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

8a438ba

taimur-10x added a commit that referenced this pull request Mar 4, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

b28b4c5

taimur-10x added a commit that referenced this pull request Mar 4, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

2db2e9f

taimur-10x added a commit that referenced this pull request Apr 2, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

fd94e4c

taimur-10x added a commit that referenced this pull request Apr 24, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

8856f8a

taimur-10x added a commit that referenced this pull request May 24, 2026

ggml-cpu: add repack GEMM and GEMV for floating-point (#4)

e897076

taimur-10x merged 2 commits into 10x-repack-fp from rvv-repack-floating
