ggml-cpu: add repack GEMM and GEMV for floating-point #4
@xctan it would be lovely to have your review on that as it will go to upstream next.
Changing Since my suggestion effectively refactors the |
@xctan, I'll move this upstream then, so we can further discuss the structural changes around the code.
FlashAttention (#13)
* Add inplace softmax
* Move rms_norm to split row approach
* Update debug for supports_op
* clean up debug statements
* neg f16xf32xip builds and runs, haven't actually run a model that uses neg kernel yet though
* neg passes backend test
* unary operators pass ggml tests
* rms_norm double declaration bug atoned
* abides by editor-config
* removed vestigial files
* fixed autoconfig
* All operators (including xielu) working
* removed unnecessary checking if node->src[1] exists for unary operators
* responded and dealt with PR comments
* implemented REPL_Template support and removed bug in unary operators kernel
* formatted embed wgsl and ggml-webgpu.cpp
* Faster tensors (#8)
  Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Refactored pipelines and workgroup calculations (#10)
* refactored pipelines
* refactored workgroup calculation
* removed commented out block of prior maps
* Clean up ceiling division pattern
---------
Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on flash attention
* Shader structure set up (many bugs still)
* debugging
* Working first test
* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32
* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling
* Start work on integrating pre-wgsl
* Separate structs/initial shader compilation library into separate files
* Work on compilation choices for flashattention
* Work on subgroup matrix/tile size portability
* subgroup size agnostic online softmax
* Cleanups, quantization types
* more cleanup
* fix wasm build
* Refactor flashattention to increase parallelism, use direct loads for KV in some cases
* Checkpoint
* formatting
* Update to account for default kv cache padding
* formatting shader
* Add workflow for ggml-ci webgpu
* Try passing absolute path to dawn in ggml-ci
* Avoid error on device destruction, add todos for proper cleanup
* Fix unused warning
* Forgot one parameter unused
* Move some flashattn computation to f32 for correctness
Summary
This PR adds repacking and GEMM/GEMV kernels for floating-point types (FP16 and FP32) on RVV (with the `zvfh` extension).
Key Changes
* GEMM tiles: 7 x {16, 32, 64, 128} (selected based on VLEN)
* GEMV tiles: 1 x {16, 32, 64, 128} (selected based on VLEN)
* `ggml_quantize_mat_t` is refactored to `ggml_repack_mat_t` to provide a common interface for both quantization and floating-point repacking.
* `NB_ROWS` added to select the number of rows to interleave for repacking. Previously, this was fixed at `4`.
Tile Sizes
The repack operation interleaves `N` rows of `activations` with an interleave size of `K`, and `M` columns of `weights` with an interleave size of `K`. `NxK` is fixed at `7x1`. This introduces 7 accumulators with `LMUL=4` (7 x 4 = 28 registers), each accumulating `M` results. `M` varies with the available VLEN: it is the maximum number of values that can be loaded at once (LMUL=2 for F16, LMUL=4 for F32).
Testing
Kernels were functionally tested on QEMU at VLENs of 128, 256, 512 and 1024 bits, over a range of input sizes.
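For the four VLENs listed above, the tile width `M` described under Tile Sizes can be checked with a little arithmetic. The sketch below is not taken from the PR: `tile_width` and `accumulator_registers` are hypothetical helper names, and the formula assumes `M` is simply the number of elements that fit in one register group at the stated LMUL.

```c
#include <assert.h>

/* Hypothetical helper (not from the PR): elements per vector register group.
 * vlen_bits: vector register length in bits
 * lmul:      registers per group (2 for F16 loads, 4 for F32 loads, per the PR)
 * elem_bits: element width in bits (16 for F16, 32 for F32) */
static int tile_width(int vlen_bits, int lmul, int elem_bits) {
    assert(vlen_bits > 0 && lmul > 0 && elem_bits > 0);
    return vlen_bits * lmul / elem_bits;
}

/* Hypothetical helper: vector registers held by the accumulators,
 * assuming one LMUL-wide register group per interleaved activation row. */
static int accumulator_registers(int n_rows, int acc_lmul) {
    assert(n_rows > 0 && acc_lmul > 0);
    return n_rows * acc_lmul;
}
```

Under these assumptions, `tile_width(256, 2, 16)` and `tile_width(256, 4, 32)` both evaluate to 32, and the four VLENs give 16, 32, 64 and 128, which agrees with the tile sizes listed under Key Changes. `accumulator_registers(7, 4)` evaluates to 28, leaving 4 of the 32 RVV vector registers for everything else.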
Benchmarking Results
End-to-end benchmarking on BananaPI-BPI F3 (VLEN=256)
Prefill / Prompt Processing (GEMM)
Tokens / Second
Result: ~2x-3x speedup over `vec_dot`
Decode (GEMV)
Tokens / Second
Result: No noticeable improvement, as decode remains memory-bound.
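One rough way to see why decode stays memory-bound while prefill speeds up is to compare how much arithmetic each loaded weight feeds. The sketch below is a back-of-the-envelope model, not a measurement from the PR: `flops_per_weight_byte` is a hypothetical helper, and it assumes weight loads dominate memory traffic (activation and output traffic are ignored).

```c
#include <assert.h>

/* Hypothetical helper (not from the PR): approximate FLOPs performed per byte
 * of weight data loaded, when each loaded weight is reused across `n_rows`
 * activation rows (1 for GEMV, 7 for the 7-row GEMM tile).
 * Assumption: only weight traffic is counted. */
static double flops_per_weight_byte(int n_rows, int bytes_per_weight) {
    assert(n_rows > 0 && bytes_per_weight > 0);
    /* one multiply and one add per weight per activation row */
    return 2.0 * n_rows / bytes_per_weight;
}
```

With F16 weights (2 bytes), this model gives 1 FLOP per byte for GEMV and 7 FLOPs per byte for the 7-row GEMM tile. A faster GEMV kernel therefore has little room to help if memory bandwidth is already the limit.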
Additional Notes
* `arch-fallback.h`: the `7xMx1` tiling is very RVV-specific and should not be used by other architectures.
* `GEMM` reaches peak performance when the prompt length is a multiple of 7 (for example, `prompt=28`). Leftover tokens fall back to `GEMV`, which hurts performance. Ideally, there should be leftover `Nx32` kernels that handle each case of `2-6` leftover tokens.
Future Work
Subsequent PRs plan to add RVV kernels for quantization types, as well as extend existing quantization support to other VLENs.
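The leftover behaviour described under Additional Notes can be sketched as a simple split of the prompt length into full 7-row tiles and remaining rows. This is an illustration only: `row_split` and `split_rows` are hypothetical names, and the actual dispatch code in the PR may be organised differently.

```c
#include <assert.h>

#define GEMM_ROWS 7 /* rows per GEMM tile, as stated in the PR */

/* Hypothetical bookkeeping struct, for illustration only. */
struct row_split {
    int gemm_tiles; /* full 7-row tiles handled by the GEMM kernel */
    int gemv_rows;  /* leftover rows, each handled by the GEMV kernel */
};

/* Split n_tokens activation rows into full GEMM tiles plus GEMV leftovers. */
static struct row_split split_rows(int n_tokens) {
    assert(n_tokens >= 0);
    struct row_split s;
    s.gemm_tiles = n_tokens / GEMM_ROWS;
    s.gemv_rows  = n_tokens % GEMM_ROWS;
    return s;
}
```

For `prompt=28` this yields 4 GEMM tiles and no leftovers, while a prompt of 30 tokens yields 4 GEMM tiles plus 2 rows that go through the slower GEMV path.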