vulkan: Implement grouped query attention in the coopmat2 FA shader #12559

jeffbolznv · 2025-03-25T03:41:38Z

When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each; now we run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.
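
As a rough illustration of the arithmetic in the example above, here is a minimal sketch, not the shader's actual dispatch logic. The function name `fa_workgroups` and the fallback for head counts that do not divide evenly are assumptions made for illustration; the 32 and 8 are the head counts taken from the third dimension of `q` and `k`/`v` in the example.

```python
def fa_workgroups(q_heads: int, kv_heads: int) -> tuple[int, int]:
    """Return (workgroups, results_per_workgroup) along the head dimension.

    Hypothetical helper: it models only the grouping described in the PR text,
    not the real Vulkan dispatch code.
    """
    if q_heads % kv_heads != 0:
        # Assumed fallback: without a clean ratio, keep one Q head per workgroup.
        return q_heads, 1
    # Number of adjacent Q heads that share one K/V head.
    gqa_ratio = q_heads // kv_heads
    # One workgroup per K/V head, each producing gqa_ratio results.
    return kv_heads, gqa_ratio


# q(128,1,32,1) with k/v(128,16640,8,1): 32 Q heads, 8 K/V heads.
print(fa_workgroups(32, 8))  # (8, 4)
```

With equal head counts (no grouping) the sketch returns one result per workgroup, which corresponds to the previous behavior.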

jeffbolznv requested a review from 0cc4m March 25, 2025 03:41

github-actions bot added Vulkan ggml labels Mar 25, 2025
