
vulkan: workaround for #10710 and #12147 16 bit unpack8 bug #12472

Merged

merged 1 commit into ggml-org:master from vec2_unpack8 on Mar 21, 2025

Conversation

netrunnereve
Collaborator

This is a temporary workaround for the AMD unpack8 driver bug reported in #10710 and #12147. I managed to trigger this bug with the offline compiler in older versions of RGA, and I could see that it only happens when unpack8 is given a 16-bit input. If I make the unpack8 use a 32-bit input instead it works fine.

Most compilers should be smart enough to optimize out the two extra unpacks, and performance is pretty much the same on my 470.

PR:

  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   38 runs - 27400.34 us/run -  60.13 GFLOP/run -   2.19 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                30 runs - 33384.53 us/run -  60.13 GFLOP/run -   1.80 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 32 runs - 31894.81 us/run -  60.13 GFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  28 runs - 37669.79 us/run -  60.13 GFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                32 runs - 32807.97 us/run -  60.13 GFLOP/run -   1.83 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  32 runs - 32698.22 us/run -  60.13 GFLOP/run -   1.84 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   349.42 us/run - 117.44 MFLOP/run - 336.10 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  4260 runs -   238.07 us/run - 117.44 MFLOP/run - 493.31 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  5112 runs -   210.89 us/run - 117.44 MFLOP/run - 556.89 GFLOPS

Master:

  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   38 runs - 27388.82 us/run -  60.13 GFLOP/run -   2.20 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                32 runs - 33222.94 us/run -  60.13 GFLOP/run -   1.81 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 32 runs - 31699.12 us/run -  60.13 GFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  28 runs - 37612.89 us/run -  60.13 GFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                32 runs - 32592.78 us/run -  60.13 GFLOP/run -   1.84 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  32 runs - 32816.25 us/run -  60.13 GFLOP/run -   1.83 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   350.98 us/run - 117.44 MFLOP/run - 334.60 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  4260 runs -   237.31 us/run - 117.44 MFLOP/run - 494.89 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  5112 runs -   210.24 us/run - 117.44 MFLOP/run - 558.59 GFLOPS

@stduhpf @Danik-droid @masamaru-san can you test and make sure that the bug is gone now?

And yeah AMD still needs to fix their driver. Something as simple as the little shader below will trigger it 🤦‍♀️.

#version 450

#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
#extension GL_EXT_control_flow_attributes : enable
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_8bit_storage : require

layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

struct block_q8_0_packed32
{
    float16_t d;
    int16_t qs[16];
};

layout (binding = 0) readonly buffer A {block_q8_0_packed32 data_a[];};
layout (binding = 1) writeonly buffer B {float data_b[][2];};

void main()
{
    const uint tid = gl_LocalInvocationID.x;

    // this generates wrong assembly instructions
    //const vec2 v = vec2(unpack8(data_a[tid].qs[0]));

    // this works!
    const vec2 v = vec2(unpack8(int32_t(data_a[tid].qs[0])).xy);

    data_b[tid][0] = v.x;
    data_b[tid][1] = v.y;
}

@github-actions github-actions bot added the Vulkan and ggml labels Mar 19, 2025
@masamaru-san

@netrunnereve As a quick check, it looks good. sd.cpp performance may even have been a little (< 1%) faster than on b4785.
(Windows 11 24H2 / AMD Ryzen 7 5700U with Radeon Graphics / Adrenalin 24.9.1 / Vulkan driver ver. 2.0.279)

@0cc4m
Collaborator

0cc4m commented Mar 20, 2025

Thank you for looking into this.

@stduhpf
Contributor

stduhpf commented Mar 20, 2025

Works like a charm, thanks @netrunnereve. It makes it even stranger that the bug still isn't fixed at the driver level; the logic for unpacking from 16 bits or 32 bits should be pretty much the same.

@netrunnereve
Collaborator Author

It makes it even stranger that the bug still isn't fixed at the driver level; the logic for unpacking from 16 bits or 32 bits should be pretty much the same.

The SPIR-V is different, as glslc doesn't optimize out the additional unpacks and the 32-bit one is treated as an unpack of four 8-bit values into four floats. In the assembly I'm seeing that with a 32-bit input it correctly loads the bottom 8 bits into the .x slot and the next set of 8 bits into the .y slot, but if I use a 16-bit input it somehow loads the identical bottom bits into both the .x and .y slots. I mean, they should have some sort of automated regression test for this... right?
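
To make that concrete, here's a worked example with an illustrative value (the 16-bit result reflects the duplicated-low-byte behaviour described above; the exact number is just for illustration):

// Illustrative input: data_a[tid].qs[0] == int16_t(0x0201), i.e. low byte 0x01, high byte 0x02.

// 32-bit path (workaround): correct on the affected drivers.
const vec2 v32 = vec2(unpack8(int32_t(data_a[tid].qs[0])).xy);   // (1.0, 2.0)

// 16-bit path: the affected AMD drivers duplicate the low byte.
const vec2 v16 = vec2(unpack8(data_a[tid].qs[0]));               // expected (1.0, 2.0), driver gives (1.0, 1.0)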

@netrunnereve netrunnereve marked this pull request as ready for review March 20, 2025 16:52
Collaborator

@0cc4m 0cc4m left a comment


I'm not convinced we'll ever be able to remove this again, but let's keep hoping.

Also, more of these will sneak in over time, probably.

@0cc4m 0cc4m merged commit 30c42ef into ggml-org:master Mar 21, 2025
44 checks passed
@netrunnereve netrunnereve deleted the vec2_unpack8 branch March 22, 2025 21:25
@netrunnereve
Collaborator Author

netrunnereve commented Mar 22, 2025

I'm not convinced we'll ever be able to remove this again, but let's keep hoping.

Even if they fix it for RDNA, GCN still has this bug and the Windows GCN drivers are finished. So yeah, I guess we'll have to avoid 16-bit unpack8s from now on.
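
One way to keep 16-bit unpack8s out of future shaders would be a small wrapper that always widens first. A rough sketch (this helper isn't in the llama.cpp tree, it's just an illustration):

#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require

// Hypothetical helper: always widen to 32 bits before unpack8 so the
// buggy 16-bit unpack8 path is never generated.
i8vec2 unpack8_i16_safe(int16_t x) {
    return unpack8(int32_t(x)).xy;
}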

Ivy233 pushed a commit to Ivy233/llama.cpp that referenced this pull request Mar 23, 2025