Vulkan: Remove dedicated aligned matrix matrix multiplication shaders #12515

0cc4m · 2025-03-22T14:58:13Z

My motivation here was simplifying the matmul shader code, because I learned a lot since I initially wrote it. It's a lot easier to read and work with this way.

I would like to get rid of dedicated aligned shaders, this seems to have worked mostly, but performance is up in same cases and down in others. @jeffbolznv since you know the most about shader optimization and it also affects coopmat2, could you take a look if what I'm doing with the mul_mm shader seems sensible?

jeffbolznv · 2025-03-22T16:51:06Z

ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp

+#if QUANT_K > 1
+    const uint load_vec_a = LOAD_VEC_QUANT;
+#else
+    const uint load_vec_a =


Dividing by a value that's not a compile time constant probably won't get turned into a shift, so it'll be better if you can do that manually.

Sadly this didn't fix the performance drop, maybe the aligned shader variants can't easily be replaced. But this attempt did have some positive cases, too.

Maybe I should revert the aligned shader and just fix the fp32 and fp16 performance of the original shader, apart from refactoring the code to be more readable.

jeffbolznv · 2025-03-22T16:55:59Z

ggml/src/ggml-vulkan/ggml-vulkan.cpp

@@ -1802,9 +1793,6 @@ static void ggml_vk_load_shaders(vk_device& device) {
        ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->l, #NAMELC #F16ACC "_l", NAMELC ## F16ACC ## _cm2_len, NAMELC ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), l_ ## WG_DENOMS, l_ ## WARPTILE, 1);   \
        ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->m, #NAMELC #F16ACC "_m", NAMELC ## F16ACC ## _cm2_len, NAMELC ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), m_ ## WG_DENOMS, m_ ## WARPTILE, 1);   \
        ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->s, #NAMELC #F16ACC "_s", NAMELC ## F16ACC ## _cm2_len, NAMELC ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), s_ ## WG_DENOMS, s_ ## WARPTILE, 1);   \
-        ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->a_l, #NAMELC #F16ACC "_aligned_l", NAMELC ## _aligned ## F16ACC ## _cm2_len, NAMELC ## _aligned ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), l_ ## WG_DENOMS, l_ ## WARPTILE, l_align);   \


The mul_mm_cm2 shader has this use of the ALIGNED define:

// Hint to the compiler that values are aligned (want 16B alignment). // Quants are always block-aligned, no alignment needed. #if ALIGNED #if QUANT_K == 1 stride_a &= ~7; #endif stride_b &= ~7; #endif

This is still needed for perf. You could probably turn the #ifdef into a branch on a spec constant and then vulkan-shaders-gen won't need to generate both variants, but here we'll still need to compile both of them.

In the long run it would be nice to be able to more easily generate more spec constant combinations on-demand, without having to have code here that enumerates all combinations.

Maybe we can add a simple type of device-specific tuning file, like a CSV that enumerates shader variants and their spec constants. But we would have to also encode when which version gets used.

I didn't mean per-device tuning values, I just meant being able to more easily specialize things like control flow, alignment, etc. by adding more spec constants.

jeffbolznv · 2025-03-22T16:58:14Z

ggml/src/ggml-vulkan/ggml-vulkan.cpp

@@ -4265,10 +4225,7 @@ static void ggml_vk_mul_mat_q_f16(ggml_backend_vk_context * ctx, vk_context& sub
    // Not implemented
    GGML_ASSERT(y_non_contig || !qy_needs_dequant);  // NOLINT

-    const uint32_t kpad = ggml_vk_align_size(ne10, ggml_vk_guess_matmul_pipeline_align(ctx, mmp, ne01, ne11, qx_needs_dequant ? GGML_TYPE_F16 : src0->type));
-    const bool aligned = ne10 == kpad && ne01 > 8 && ne11 > 8;


Here, too, I think coopmat2 needs to keep some logic to detect alignment. But it probably only needs to check that the K dimension is a multiple of 8.

github-actions bot added testing Vulkan ggml labels Mar 22, 2025

Vulkan: Remove dedicated aligned matrix matrix multiplication shaders

7d0fde1

0cc4m force-pushed the 0cc4m/vulkan-mm-remove-aligned branch from 5f0f724 to 7d0fde1 Compare March 22, 2025 15:00

jeffbolznv reviewed Mar 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vulkan: Remove dedicated aligned matrix matrix multiplication shaders #12515

Vulkan: Remove dedicated aligned matrix matrix multiplication shaders #12515

0cc4m commented Mar 22, 2025

jeffbolznv Mar 22, 2025

0cc4m Mar 22, 2025

jeffbolznv Mar 22, 2025

0cc4m Mar 22, 2025

jeffbolznv Mar 22, 2025

jeffbolznv Mar 22, 2025

Vulkan: Remove dedicated aligned matrix matrix multiplication shaders #12515

Are you sure you want to change the base?

Vulkan: Remove dedicated aligned matrix matrix multiplication shaders #12515

Conversation

0cc4m commented Mar 22, 2025

jeffbolznv Mar 22, 2025

Choose a reason for hiding this comment

0cc4m Mar 22, 2025

Choose a reason for hiding this comment

jeffbolznv Mar 22, 2025

Choose a reason for hiding this comment

0cc4m Mar 22, 2025

Choose a reason for hiding this comment

jeffbolznv Mar 22, 2025

Choose a reason for hiding this comment

jeffbolznv Mar 22, 2025

Choose a reason for hiding this comment