
vulkan: workaround for #10710 and #12147 16 bit unpack8 bug #12472

Merged

merged 1 commit into ggml-org:master from vec2_unpack8 on Mar 21, 2025

Conversation

netrunnereve
Collaborator

This is a temporary workaround for the AMD unpack8 driver bug reported in #10710 and #12147. I managed to trigger this bug with the offline compiler in older versions of RGA, and I could see that it only happens when unpack8 is given a 16-bit input. If I make the unpack8 use a 32-bit input instead it works fine.

Most compilers should be smart enough to optimize out the two extra unpacks, and performance is pretty much the same on my 470.

PR:

  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   38 runs - 27400.34 us/run -  60.13 GFLOP/run -   2.19 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                30 runs - 33384.53 us/run -  60.13 GFLOP/run -   1.80 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 32 runs - 31894.81 us/run -  60.13 GFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  28 runs - 37669.79 us/run -  60.13 GFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                32 runs - 32807.97 us/run -  60.13 GFLOP/run -   1.83 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  32 runs - 32698.22 us/run -  60.13 GFLOP/run -   1.84 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   349.42 us/run - 117.44 MFLOP/run - 336.10 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  4260 runs -   238.07 us/run - 117.44 MFLOP/run - 493.31 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  5112 runs -   210.89 us/run - 117.44 MFLOP/run - 556.89 GFLOPS

Master:

  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   38 runs - 27388.82 us/run -  60.13 GFLOP/run -   2.20 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                32 runs - 33222.94 us/run -  60.13 GFLOP/run -   1.81 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 32 runs - 31699.12 us/run -  60.13 GFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  28 runs - 37612.89 us/run -  60.13 GFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                32 runs - 32592.78 us/run -  60.13 GFLOP/run -   1.84 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  32 runs - 32816.25 us/run -  60.13 GFLOP/run -   1.83 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   350.98 us/run - 117.44 MFLOP/run - 334.60 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  4260 runs -   237.31 us/run - 117.44 MFLOP/run - 494.89 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  5112 runs -   210.24 us/run - 117.44 MFLOP/run - 558.59 GFLOPS

@stduhpf @Danik-droid @masamaru-san can you test and make sure that the bug is gone now?

And yeah AMD still needs to fix their driver. Something as simple as the little shader below will trigger it 🤦‍♀️.

#version 450

#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
#extension GL_EXT_control_flow_attributes : enable
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_8bit_storage : require

layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

struct block_q8_0_packed32
{
    float16_t d;
    int16_t qs[16];
};

layout (binding = 0) readonly buffer A {block_q8_0_packed32 data_a[];};
layout (binding = 1) writeonly buffer B {float data_b[][2];};

void main()
{
    const uint tid = gl_LocalInvocationID.x;

    // this generates wrong assembly instructions
    //const vec2 v = vec2(unpack8(data_a[tid].qs[0]));

    // this works!
    const vec2 v = vec2(unpack8(int32_t(data_a[tid].qs[0])).xy);

    data_b[tid][0] = v.x;
    data_b[tid][1] = v.y;
}

@github-actions github-actions bot added the Vulkan and ggml labels Mar 19, 2025
@masamaru-san

@netrunnereve As a quick check, it looks good. sd.cpp performance may even have been a little (< 1%) faster than on b4785.
(Windows 11 24H2 / AMD Ryzen 7 5700U with Radeon Graphics / Adrenalin 24.9.1 / Vulkan driver ver. 2.0.279)

@0cc4m
Collaborator

0cc4m commented Mar 20, 2025

Thank you for looking into this.

@stduhpf
Contributor

stduhpf commented Mar 20, 2025

Works like a charm, thanks @netrunnereve. It makes it even stranger that the bug still isn't fixed at the driver level; the logic for unpacking from 16 bits or 32 bits should be pretty much the same.

@netrunnereve
Collaborator Author

It makes it even stranger that the bug still isn't fixed at the driver level; the logic for unpacking from 16 bits or 32 bits should be pretty much the same.

The SPIR-V is different, as glslc doesn't optimize out the additional unpacks and the 32-bit one is treated as an unpack of four 8-bit values into four floats. In the assembly I'm seeing that with a 32-bit input it correctly loads the bottom 8 bits into the .x slot and the next set of 8 bits into the .y slot, but if I use a 16-bit input it somehow loads the identical bottom bits into both the .x and .y slots. I mean, they should have some sort of automated regression test for this... right?
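
To make that concrete, here's a worked example with an illustrative value (the 16-bit result reflects the duplicated-low-byte behaviour described above; the exact number is just for illustration):

// Illustrative input: data_a[tid].qs[0] == int16_t(0x0201), i.e. low byte 0x01, high byte 0x02.

// 32-bit path (workaround): correct on the affected drivers.
const vec2 v32 = vec2(unpack8(int32_t(data_a[tid].qs[0])).xy);   // (1.0, 2.0)

// 16-bit path: the affected AMD drivers duplicate the low byte.
const vec2 v16 = vec2(unpack8(data_a[tid].qs[0]));               // expected (1.0, 2.0), driver gives (1.0, 1.0)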

@netrunnereve netrunnereve marked this pull request as ready for review March 20, 2025 16:52
Collaborator

@0cc4m 0cc4m left a comment


I'm not convinced we'll ever be able to remove this again, but let's keep hoping.

Also, more of these will sneak in over time, probably.

@0cc4m 0cc4m merged commit 30c42ef into ggml-org:master Mar 21, 2025
44 checks passed
@netrunnereve netrunnereve deleted the vec2_unpack8 branch March 22, 2025 21:25
@netrunnereve
Collaborator Author

netrunnereve commented Mar 22, 2025

I'm not convinced we'll ever be able to remove this again, but let's keep hoping.

Even if they fix it for RDNA, GCN still has this bug and the Windows GCN drivers are finished. So yeah, I guess we'll have to avoid 16-bit unpack8s from now on.
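
One way to keep 16-bit unpack8s out of future shaders would be a small wrapper that always widens first. A rough sketch (this helper isn't in the llama.cpp tree, it's just an illustration):

#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require

// Hypothetical helper: always widen to 32 bits before unpack8 so the
// buggy 16-bit unpack8 path is never generated.
i8vec2 unpack8_i16_safe(int16_t x) {
    return unpack8(int32_t(x)).xy;
}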

Ivy233 pushed a commit to Ivy233/llama.cpp that referenced this pull request Mar 23, 2025