Faster q3_0 implementation, using two planes #1

pubby · 2023-04-16T16:21:16Z

This reimplements block_q3_0 as:

typedef struct {
    ggml_fp16_t d;
    uint16_t qhi;
    uint32_t qlo;
} block_q3_0;

For each 3-bit number, the lowest 2 bits are packed into qlo, while the highest bit is packed into qhi.

To convert this representation to SIMD vectors, qlo is unpacked exactly like q2_0, while qhi is converted to a lookup table whose results are then OR'd.

This representation is both faster, and simpler regarding SIMD as it's implementation shares most code with q2_0. I'm seeing an improvement from 66.07 seconds per pass to 42.12 seconds per pass on AVX2.

Pros:

Simpler SIMD implementation
Faster
Shares code with q2_0

Cons:

The data representation is unusual
Requires a lookup table

ggml.c

sw · 2023-04-16T18:58:03Z

This lets the preprocessor generate the look-up table:

#define B0(n) 0x ## n
#define B1(n) B0(n ## FC), B0(n ## 00)
#define B2(n) B1(n ## FC), B1(n ## 00)
#define B3(n) B2(n ## FC), B2(n ## 00)
#define B4(n) B3(n ## FC), B3(n ## 00)
#define B5(n) B4(n ## FC), B4(n ## 00)
#define B6(n) B5(n ## FC), B5(n ## 00)
#define B7(n) B6(n ## FC), B6(n ## 00)
#define B8( ) B7(     FC), B7(     00)
static const uint64_t ggml_q3_table[256] = { B8() };

~~It's using string concatenation which is a bit silly. I'll see if I can find a more arithmetic way.~~ Actually let's not bother, this is better debuggable by running it through cpp.

pubby · 2023-04-16T19:41:02Z

Made the changes you talked about. Now getting 38.77 seconds per pass.

Unfortunately my processor doesn't support vpdpbssd. Still overall your changes are faster for me.

I don't have vpdpbssd support either - just wanted to future proof things.

It's possible to implement the qhi conversion without the lookup table, but the table has better performance on my system. Here's the code I used to test that (shift result, OR to Crumbs, subtract 4 element-wise):

// Unpack 32 1-bit fields into 32 bytes
// The output vector contains 32 bytes, each one in [ 0 .. 1 ] interval
static inline __m256i bytesFromBits(uint16_t packed_hi, uint16_t packed_lo) {
    __m128i bx_hi = _mm_set1_epi32(packed_hi | ((packed_hi >> 1) << 16));
    __m128i bx_lo = _mm_set1_epi32(packed_lo | ((packed_lo >> 1) << 16));
    __m256i bx = _mm256_set_m128i(bx_hi, bx_lo);

    // shift counts to get all bit pairs in lowest position of each byte
    const __m256i shift256 = _mm256_set_epi32(6, 4, 2, 0,
                                              6, 4, 2, 0);
    bx = _mm256_srlv_epi32(bx, shift256); // TODO: It would be better to use srlv_epi16, if available

    const __m256i shufmask = _mm256_set_epi8(15, 13, 11,  9,
                                              7,  5,  3,  1,
                                             14, 12, 10,  8,
                                              6,  4,  2,  0,
                                             15, 13, 11,  9,
                                              7,  5,  3,  1,
                                             14, 12, 10,  8,
                                              6,  4,  2,  0);
    bx = _mm256_shuffle_epi8(bx, shufmask);

    const __m256i mask = _mm256_set1_epi8(1);
    bx = _mm256_and_si256(mask, bx);

    return bx;
}

I'm guessing other SIMD architectures can have code like above if they can't use tables. I'll look into it more - I'm mostly familiar with x64.

sw · 2023-04-16T20:19:13Z

ggml.c

-            uint64_t qs = y[i].qs;
-            for (int l = 0; l < QK3_0; l++) {
-                const int8_t vi = qs & 7;
+        for (int i = 0; i < 16; i++) {


@pubby : I'm confused by this... why should we have two nested loops that go up to 16?

It's getting late for me, possibly you're right about this. But then q2 would also be wrong? I'll look into it tomorrow.

Also, you changed QK3_0 to 16 in various places, was there a specific reason? With the block_q3_0 definition as it is, no other value will work obviously, but I think it's nice to have a consistent name, just for keeping the code understandable.

I think that line's a mistake. I believe it should be i < nb (and then add back in nb = k / QK3_0).

The QK3_0 -> 16 change was made when I was testing a customizable version block_q3_0 that looked like this:

typedef struct { ggml_fp16_t d; uint16_t qhi[QK3_0 / 16]; uint32_t qlo[QK3_0 / 16]; } block_q3_0;

The 16 number referred to the bit count per qhi element.

I scrapped that version though, so I'll revert the changes and use QK3_0 again.

pubby added 3 commits April 16, 2023 10:55

Faster q3_0 implementation, using two planes.

f54da1f

Merge remote-tracking branch 'sw/q2q3' into q3

21acb58

Add a comment describing q3_0 better.

394a3c1

pubby mentioned this pull request Apr 16, 2023

Q2 and Q3 quantization ggerganov/llama.cpp#1004

Closed

Pre-shift hi in q3_0 implementation.

f7874df

sw reviewed Apr 16, 2023

View reviewed changes

ggml.c Show resolved Hide resolved

sw reviewed Apr 16, 2023

View reviewed changes

ggml.c Outdated Show resolved Hide resolved

Improvements suggested by sw

b304c4d

sw reviewed Apr 16, 2023

View reviewed changes

Change 16 to QK3_0 and fix quantize_q3_0 loop (sw).

e893d33

sw merged commit de542b3 into sw:q2q3 Apr 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster q3_0 implementation, using two planes #1

Faster q3_0 implementation, using two planes #1

pubby commented Apr 16, 2023

sw commented Apr 16, 2023 •

edited

Loading

pubby commented Apr 16, 2023 •

edited

Loading

sw Apr 16, 2023

pubby Apr 16, 2023

Faster q3_0 implementation, using two planes #1

Faster q3_0 implementation, using two planes #1

Conversation

pubby commented Apr 16, 2023

sw commented Apr 16, 2023 • edited Loading

pubby commented Apr 16, 2023 • edited Loading

sw Apr 16, 2023

Choose a reason for hiding this comment

pubby Apr 16, 2023

Choose a reason for hiding this comment

sw commented Apr 16, 2023 •

edited

Loading

pubby commented Apr 16, 2023 •

edited

Loading