-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster q3_0 implementation, using two planes #1
Conversation
This lets the preprocessor generate the look-up table: #define B0(n) 0x ## n
#define B1(n) B0(n ## FC), B0(n ## 00)
#define B2(n) B1(n ## FC), B1(n ## 00)
#define B3(n) B2(n ## FC), B2(n ## 00)
#define B4(n) B3(n ## FC), B3(n ## 00)
#define B5(n) B4(n ## FC), B4(n ## 00)
#define B6(n) B5(n ## FC), B5(n ## 00)
#define B7(n) B6(n ## FC), B6(n ## 00)
#define B8( ) B7( FC), B7( 00)
static const uint64_t ggml_q3_table[256] = { B8() };
|
Made the changes you talked about. Now getting 38.77 seconds per pass.
I don't have It's possible to implement the
I'm guessing other SIMD architectures can have code like above if they can't use tables. I'll look into it more - I'm mostly familiar with x64. |
ggml.c
Outdated
uint64_t qs = y[i].qs; | ||
for (int l = 0; l < QK3_0; l++) { | ||
const int8_t vi = qs & 7; | ||
for (int i = 0; i < 16; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pubby : I'm confused by this... why should we have two nested loops that go up to 16?
It's getting late for me, possibly you're right about this. But then q2 would also be wrong? I'll look into it tomorrow.
Also, you changed QK3_0
to 16 in various places, was there a specific reason? With the block_q3_0
definition as it is, no other value will work obviously, but I think it's nice to have a consistent name, just for keeping the code understandable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that line's a mistake. I believe it should be i < nb
(and then add back in nb = k / QK3_0
).
The QK3_0
-> 16
change was made when I was testing a customizable version block_q3_0
that looked like this:
typedef struct {
ggml_fp16_t d;
uint16_t qhi[QK3_0 / 16];
uint32_t qlo[QK3_0 / 16];
} block_q3_0;
The 16 number referred to the bit count per qhi
element.
I scrapped that version though, so I'll revert the changes and use QK3_0
again.
This reimplements block_q3_0 as:
For each 3-bit number, the lowest 2 bits are packed into
qlo
, while the highest bit is packed intoqhi
.To convert this representation to SIMD vectors,
qlo
is unpacked exactly like q2_0, whileqhi
is converted to a lookup table whose results are then OR'd.This representation is both faster, and simpler regarding SIMD as it's implementation shares most code with q2_0. I'm seeing an improvement from 66.07 seconds per pass to 42.12 seconds per pass on AVX2.
Pros:
Cons: