-
Notifications
You must be signed in to change notification settings - Fork 11.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vulkan: Remove dedicated aligned matrix matrix multiplication shaders #12515
base: master
Are you sure you want to change the base?
Conversation
5f0f724
to
7d0fde1
Compare
#if QUANT_K > 1 | ||
const uint load_vec_a = LOAD_VEC_QUANT; | ||
#else | ||
const uint load_vec_a = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dividing by a value that's not a compile time constant probably won't get turned into a shift, so it'll be better if you can do that manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly this didn't fix the performance drop, maybe the aligned shader variants can't easily be replaced. But this attempt did have some positive cases, too.
Maybe I should revert the aligned shader and just fix the fp32 and fp16 performance of the original shader, apart from refactoring the code to be more readable.
@@ -1802,9 +1793,6 @@ static void ggml_vk_load_shaders(vk_device& device) { | |||
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->l, #NAMELC #F16ACC "_l", NAMELC ## F16ACC ## _cm2_len, NAMELC ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), l_ ## WG_DENOMS, l_ ## WARPTILE, 1); \ | |||
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->m, #NAMELC #F16ACC "_m", NAMELC ## F16ACC ## _cm2_len, NAMELC ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), m_ ## WG_DENOMS, m_ ## WARPTILE, 1); \ | |||
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->s, #NAMELC #F16ACC "_s", NAMELC ## F16ACC ## _cm2_len, NAMELC ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), s_ ## WG_DENOMS, s_ ## WARPTILE, 1); \ | |||
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->a_l, #NAMELC #F16ACC "_aligned_l", NAMELC ## _aligned ## F16ACC ## _cm2_len, NAMELC ## _aligned ## F16ACC ## _cm2_data, "main", PARAMCOUNT, sizeof(PUSHCONST), l_ ## WG_DENOMS, l_ ## WARPTILE, l_align); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mul_mm_cm2 shader has this use of the ALIGNED define:
// Hint to the compiler that values are aligned (want 16B alignment).
// Quants are always block-aligned, no alignment needed.
#if ALIGNED
#if QUANT_K == 1
stride_a &= ~7;
#endif
stride_b &= ~7;
#endif
This is still needed for perf. You could probably turn the #ifdef into a branch on a spec constant and then vulkan-shaders-gen won't need to generate both variants, but here we'll still need to compile both of them.
In the long run it would be nice to be able to more easily generate more spec constant combinations on-demand, without having to have code here that enumerates all combinations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add a simple type of device-specific tuning file, like a CSV that enumerates shader variants and their spec constants. But we would have to also encode when which version gets used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mean per-device tuning values, I just meant being able to more easily specialize things like control flow, alignment, etc. by adding more spec constants.
@@ -4265,10 +4225,7 @@ static void ggml_vk_mul_mat_q_f16(ggml_backend_vk_context * ctx, vk_context& sub | |||
// Not implemented | |||
GGML_ASSERT(y_non_contig || !qy_needs_dequant); // NOLINT | |||
|
|||
const uint32_t kpad = ggml_vk_align_size(ne10, ggml_vk_guess_matmul_pipeline_align(ctx, mmp, ne01, ne11, qx_needs_dequant ? GGML_TYPE_F16 : src0->type)); | |||
const bool aligned = ne10 == kpad && ne01 > 8 && ne11 > 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, too, I think coopmat2 needs to keep some logic to detect alignment. But it probably only needs to check that the K dimension is a multiple of 8.
My motivation here was simplifying the matmul shader code, because I learned a lot since I initially wrote it. It's a lot easier to read and work with this way.
I would like to get rid of dedicated aligned shaders, this seems to have worked mostly, but performance is up in same cases and down in others. @jeffbolznv since you know the most about shader optimization and it also affects coopmat2, could you take a look if what I'm doing with the mul_mm shader seems sensible?