Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vulkan: Remove dedicated aligned matrix matrix multiplication shaders #12515

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

0cc4m
Copy link
Collaborator

@0cc4m 0cc4m commented Mar 22, 2025

My motivation here was simplifying the matmul shader code, because I learned a lot since I initially wrote it. It's a lot easier to read and work with this way.

I would like to get rid of dedicated aligned shaders, this seems to have worked mostly, but performance is up in same cases and down in others. @jeffbolznv since you know the most about shader optimization and it also affects coopmat2, could you take a look if what I'm doing with the mul_mm shader seems sensible?

@github-actions github-actions bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Mar 22, 2025
@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mm-remove-aligned branch from 5f0f724 to 7d0fde1 Compare March 22, 2025 15:00
#if QUANT_K > 1
const uint load_vec_a = LOAD_VEC_QUANT;
#else
const uint load_vec_a =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dividing by a value that's not a compile time constant probably won't get turned into a shift, so it'll be better if you can do that manually.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly this didn't fix the performance drop, maybe the aligned shader variants can't easily be replaced. But this attempt did have some positive cases, too.

Maybe I should revert the aligned shader and just fix the fp32 and fp16 performance of the original shader, apart from refactoring the code to be more readable.

@@ -1802,9 +1793,6 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->l, #NAMELC #F16ACC "_l", NAMELC ## F16ACC ## _cm2_len, NAMELC ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), l_ ## WG_DENOMS, l_ ## WARPTILE, 1); \
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->m, #NAMELC #F16ACC "_m", NAMELC ## F16ACC ## _cm2_len, NAMELC ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), m_ ## WG_DENOMS, m_ ## WARPTILE, 1); \
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->s, #NAMELC #F16ACC "_s", NAMELC ## F16ACC ## _cm2_len, NAMELC ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), s_ ## WG_DENOMS, s_ ## WARPTILE, 1); \
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->a_l, #NAMELC #F16ACC "_aligned_l", NAMELC ## _aligned ## F16ACC ## _cm2_len, NAMELC ## _aligned ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), l_ ## WG_DENOMS, l_ ## WARPTILE, l_align); \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mul_mm_cm2 shader has this use of the ALIGNED define:

    // Hint to the compiler that values are aligned (want 16B alignment).
    // Quants are always block-aligned, no alignment needed.
#if ALIGNED
#if QUANT_K == 1
    stride_a &= ~7;
#endif
    stride_b &= ~7;
#endif

This is still needed for perf. You could probably turn the #ifdef into a branch on a spec constant and then vulkan-shaders-gen won't need to generate both variants, but here we'll still need to compile both of them.

In the long run it would be nice to be able to more easily generate more spec constant combinations on-demand, without having to have code here that enumerates all combinations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add a simple type of device-specific tuning file, like a CSV that enumerates shader variants and their spec constants. But we would have to also encode when which version gets used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean per-device tuning values, I just meant being able to more easily specialize things like control flow, alignment, etc. by adding more spec constants.

@@ -4265,10 +4225,7 @@ static void ggml_vk_mul_mat_q_f16(ggml_backend_vk_context * ctx, vk_context& sub
// Not implemented
GGML_ASSERT(y_non_contig || !qy_needs_dequant); // NOLINT

const uint32_t kpad = ggml_vk_align_size(ne10, ggml_vk_guess_matmul_pipeline_align(ctx, mmp, ne01, ne11, qx_needs_dequant ? GGML_TYPE_F16 : src0->type));
const bool aligned = ne10 == kpad && ne01 > 8 && ne11 > 8;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, too, I think coopmat2 needs to keep some logic to detect alignment. But it probably only needs to check that the K dimension is a multiple of 8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ggml changes relating to the ggml tensor library for machine learning testing Everything test related Vulkan Issues specific to the Vulkan backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants