
Faster int8 quantized #125704

Closed
wants to merge 4 commits into from

Conversation

@malfet (Contributor) commented May 7, 2024

Or my journey to learn how to write fast Metal kernels (more details would be posted [here](https://github.com/malfet/llm_experiments/tree/main/metal-perf)).

Using gpt-fast as a benchmark (by running `python generate.py --checkpoint_path checkpoints/stories110M/model_int8.pth --device mps`)

Before the change, on an M2 Pro I get 50 tokens per sec.
After adding a very naive kernel:

```metal
template<typename T>
kernel void int8pack_mm(
    constant T                 * A              [[buffer(0)]],
    constant char              * B              [[buffer(1)]],
    constant T                 * scales         [[buffer(2)]],
    device   T                 * outputData     [[buffer(3)]],
    constant uint3             & sizes          [[buffer(4)]],
    uint                         thread_index   [[thread_position_in_grid]]) {
    const uint lda = sizes.y;
    const uint ldc = sizes.z;
    const uint m = thread_index / sizes.z; // 0..sizes.x-1
    const uint n = thread_index % sizes.z; // 0..sizes.z-1
    constant T *A_ptr = A + m * lda;
    constant char *B_ptr = B + n * lda;

    float rc = 0.0;
    for(uint k = 0; k < sizes.y;  k++) {
      const auto a_val = float(A_ptr[k]);
      const auto b_val = float(B_ptr[k]);
      rc += a_val * b_val;
    }
    outputData[thread_index] = T(rc * float(scales[n]));
}
```

Perf dropped down to a sad 15 tokens per second.
Replacing the inner loop with vectorized operations:

```metal
    float rc = 0.0;
    for(uint k = 0; k < sizes.y/4;  k++) {
      const auto a_val = float4(A_ptr[k]);
      const auto b_val = float4(B_ptr[k]);
      rc += dot(a_val, b_val);
    }
```
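
For the vectorized loads to work, the loop above presumes `A_ptr` and `B_ptr` are 4-wide vector pointers rather than the scalar pointers from the naive kernel. A minimal sketch of that assumed setup (the exact casts in the PR may differ):

```metal
// Sketch of the assumed pointer setup for the vectorized loop above
// (illustrative, not necessarily the exact casts used in the PR):
// re-read A and B through 4-wide vector pointers so that A_ptr[k] and
// B_ptr[k] each load four consecutive elements of the K dimension.
constant vec<T, 4> *A_ptr = (constant vec<T, 4> *)(A + m * lda);
constant char4     *B_ptr = (constant char4 *)(B + n * lda);
```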

Perf jumps back up to 53 tokens per second, but it's a bit of a lie when it comes to llama2-7B perf.

The next step in unlocking the performance was to replace the 1D grid with a 2D one, but limit the thread group size to a single row. This results in much better data locality, which unfortunately is no longer observable with `stories110M`, as its small model size and the Python runtime overhead hide the perf gain.
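
On the kernel side, the 2D indexing looks roughly like the sketch below (hedged, with illustrative names; the actual kernel in this PR differs in its details). The grid is dispatched as (sizes.z, sizes.x), with each threadgroup limited to a single output row so that neighbouring threads reuse the same row of A.

```metal
// Sketch only (illustrative, not the exact PR kernel): 2D-grid variant of the
// indexing. Dispatched as a (sizes.z, sizes.x) grid with each threadgroup
// limited to one output row, so threads in a group share the same row of A.
template<typename T>
kernel void int8pack_mm_2d(
    constant T                 * A              [[buffer(0)]],
    constant char              * B              [[buffer(1)]],
    constant T                 * scales         [[buffer(2)]],
    device   T                 * outputData     [[buffer(3)]],
    constant uint3             & sizes          [[buffer(4)]],
    uint2                        index          [[thread_position_in_grid]]) {
    const uint m = index.y;                      // output row,    0..sizes.x-1
    const uint n = index.x;                      // output column, 0..sizes.z-1
    constant vec<T, 4> *A_ptr = (constant vec<T, 4> *)(A + m * sizes.y);
    constant char4     *B_ptr = (constant char4 *)(B + n * sizes.y);

    float rc = 0.0;
    for (uint k = 0; k < sizes.y / 4; k++) {
      rc += dot(float4(A_ptr[k]), float4(B_ptr[k]));
    }
    outputData[m * sizes.z + n] = T(rc * float(scales[n]));
}
```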

There were several unsuccessful attempts at caching inputs in thread-local memory or using `float4x4` to speed up the computation. But the key to unlocking the perf was a comment in https://github.com/ml-explore/mlx/blob/631dfbe67309fb630795cd612739cbe54c75e222/mlx/backend/metal/kernels/gemv.metal#L184,
which hinted at exploiting both SIMD groups and thread-local caches. This resulted in a 5x jump in performance compared to the initial vectorization approach and a 3x perf jump in the end-to-end llama7b test.
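
Roughly, the SIMD-group idea looks like the sketch below (hedged, with illustrative names, assuming a 32-wide SIMD group per output element; the actual kernel in this PR adds more blocking and caching on top of it). Each lane accumulates a strided slice of the dot product in registers, and `simd_sum` combines the partial sums without going through threadgroup memory.

```metal
// Sketch of the SIMD-group reduction idea (illustrative, not the PR's exact kernel):
// one SIMD group per output element; each lane accumulates every 32nd 4-wide chunk
// of the K dimension in registers, then simd_sum() combines the 32 partial sums.
template<typename T>
kernel void int8pack_mm_simd(
    constant T                 * A              [[buffer(0)]],
    constant char              * B              [[buffer(1)]],
    constant T                 * scales         [[buffer(2)]],
    device   T                 * outputData     [[buffer(3)]],
    constant uint3             & sizes          [[buffer(4)]],
    uint2                        group_id       [[threadgroup_position_in_grid]],
    uint                         simd_lane      [[thread_index_in_simdgroup]]) {
    const uint m = group_id.y;                   // output row
    const uint n = group_id.x;                   // output column
    constant vec<T, 4> *A_ptr = (constant vec<T, 4> *)(A + m * sizes.y);
    constant char4     *B_ptr = (constant char4 *)(B + n * sizes.y);

    // Strided accumulation: lane i handles chunks i, i+32, i+64, ...
    float rc = 0.0;
    for (uint k = simd_lane; k < sizes.y / 4; k += 32) {
      rc += dot(float4(A_ptr[k]), float4(B_ptr[k]));
    }
    // Cross-lane reduction within the SIMD group, kept entirely in registers.
    rc = simd_sum(rc);
    if (simd_lane == 0) {
      outputData[m * sizes.z + n] = T(rc * float(scales[n]));
    }
}
```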

pytorch-bot bot commented May 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125704

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ffcd8c5 with merge base e9c5f1c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/mps (Run MPS tests, subset of trunk) and release notes: mps (Release notes category) labels May 7, 2024
@malfet (Contributor, Author) commented May 7, 2024

@cccclai FYI

@malfet malfet force-pushed the malfet/faster-int8-quantized branch from 7983a22 to 0ec5940 on May 9, 2024 19:59
@malfet malfet force-pushed the malfet/faster-int8-quantized branch from 0ec5940 to 6a0aabb on May 11, 2024 17:51
@malfet malfet marked this pull request as ready for review May 13, 2024 14:23
@malfet malfet requested a review from kulinseth as a code owner May 13, 2024 14:23
@malfet malfet changed the title from "Towards faster int8 quantized" to "Faster int8 quantized" May 15, 2024
@malfet (Contributor, Author) commented May 15, 2024

@pytorchbot merge -f "Lint and MPS tests are green"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
Pull Request resolved: pytorch#125704
Approved by: https://github.com/mikekgfb
@github-actions github-actions bot deleted the malfet/faster-int8-quantized branch June 15, 2024 02:13
Labels
ciflow/mps (Run MPS tests, subset of trunk) · Merged · release notes: mps (Release notes category)