add int4 packed gemm support on CPU device #117475
Conversation
add int4 packed gemm on avx2 refine blocking on K [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/117475
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 5391b4b with merge base 8a42cff. This comment was automatically generated by Dr. CI and updates every 15 minutes.
add int4 packed gemm on avx2 refine blocking on K ghstack-source-id: 40a43dc98cbce3dd7559b69697cc69d799bca3b7 Pull Request resolved: #117475
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `28.36 sec total, 7.05 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `11.91 sec total, 16.80 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
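For context, the two ops involved here can be exercised roughly as follows. This is a minimal sketch assuming the gpt-fast style WOQ int4 recipe (bf16 activations, per-group scales and zero points, group size 32, `inner_k_tiles` of 2); the concrete dtype and shape constraints of the CPU kernels are assumptions, not taken from this thread:

```python
# Rough sketch of the int4 packed-gemm call sequence (assumed constraints).
import torch

M, N, K = 4, 32, 128
group_size, inner_k_tiles = 32, 2

# int4 values held one per int32 element before packing, as in the REPL
# examples later in this thread
w_int32 = torch.randint(0, 16, (N, K), dtype=torch.int32)
w_packed = torch._convert_weight_to_int4pack(w_int32, inner_k_tiles)

# per-group (scale, zero_point) pairs, assumed layout [K // group_size, N, 2]
scales_and_zeros = torch.ones(K // group_size, N, 2, dtype=torch.bfloat16)

x = torch.randn(M, K, dtype=torch.bfloat16)
y = torch._weight_int4pack_mm(x, w_packed, group_size, scales_and_zeros)
print(y.shape)  # torch.Size([4, 32])
```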
Pull Request resolved: #118056 Approved by: https://github.com/mikekgfb ghstack dependencies: #117475
@pytorchbot revert -m "fails meta-internal tests" -c ghfirst

@mingfeima, it looks like this PR breaks meta-internal int4 quant tests. I can't share the whole context, but here are some excerpts:
@pytorchbot successfully started a revert job. Check the current status here.
@mingfeima your PR has been successfully reverted.
This reverts commit 30befa5. Reverted #117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](#117475 (comment)))
@izaitsevfb it feels like dynamo compiles with the wrong shape. Given a weight of {32, 64} where n=32 and k=64, the packed weight will be {n, k/2} on the CPU device and {n // 8, k // (inner_k_tiles * 16), 32, inner_k_tiles // 2} on CUDA. I think we can start from …
@mingfeima, it seems that the behavior of cpu and gpu is different for the op `_convert_weight_to_int4pack`.

on cpu

```python
>>> import torch
>>> torch.__version__
'2.3.0a0+gitf84375c'
>>> torch._convert_weight_to_int4pack(torch.arange(16*16, dtype=torch.int32).reshape(16, 16), 8).shape
torch.Size([16, 8])
```

on gpu

```python
>>> import torch
>>> torch.__version__
'2.2.1+cu121'
>>> torch._convert_weight_to_int4pack(torch.arange(128*256, dtype=torch.int32).reshape(128, 256).cuda(), 8).shape
torch.Size([16, 2, 32, 4])
```

As you can see, the packed shapes returned on cpu and gpu are different.
@dervon Yes, cpu and gpu have different packed formats for int4: cpu uses a 2d tensor and gpu uses a 4d tensor. Will this be a problem for dynamo? Would the issue be fixed if I use a fake 4d tensor, e.g. {n, k/2, 1, 1}, for the cpu packed weight? Additionally, the packed weight dtype is also different: cpu uses `int8` while gpu uses `int32`.
@mingfeima sorry, I've missed this part earlier, but the point of meta registration is that for the same input shape it should return the same output shape, regardless of whether the tensor is on CPU, GPU, XPU or whatever. How to interpret this output is up to the operator implementation, but the shapes and dtypes must be the same. So, if you don't mind, can you change the code to return a 4D int32 tensor (each int32 is a pack of 4 int8 values) rather than a 2D int8 tensor?
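To restate what is being asked here in code form: for a given weight shape, the shape function should report one canonical output, independent of device. A minimal sketch, assuming the 4D layout quoted earlier in this thread (the function name is illustrative; in PyTorch this logic would live in the op's meta registration rather than in user code):

```python
# Sketch of a device-agnostic shape rule for _convert_weight_to_int4pack.
import torch

def int4pack_meta_shape(w: torch.Tensor, inner_k_tiles: int) -> torch.Tensor:
    n, k = w.shape
    # the 4D int32 layout reported by the CUDA path above, reused for every device
    return torch.empty(
        n // 8, k // (inner_k_tiles * 16), 32, inner_k_tiles // 2,
        dtype=torch.int32, device="meta",
    )

# e.g. for the {128, 256} weight from the gpu example above with inner_k_tiles=8:
# int4pack_meta_shape(torch.empty(128, 256, device="meta"), 8).shape
# -> torch.Size([16, 2, 32, 4])
```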
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
@malfet Just updated this patch to use the same shape for the packed weight on both CPU and CUDA devices! Hopefully the dynamo tests won't fail this time.
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: pytorch#117475
Approved by: https://github.com/jgong5, https://github.com/malfet
@izaitsevfb just FYI, torch.ao is not internal only, it's just that we run those tests right away internally, whereas it would've taken us a day to discover the failure in OSS, once torch.ao picks up the latest nightly. The test that was failing is likely this one: https://github.com/pytorch-labs/ao/blob/c9b397de3895610cfbbca2ccef96fc12c9208885/test/test.py#L1013

Manually run those tests before merging.
@pytorchbot merge

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
```cpp
}
#else
for (int n = 0; n < nb_size; n += 2) {
  int32_t val0 = src[n * K + k];
```
I'm curious, what is the reason for transposing the tensor while packing it?
The major reason is that x86 doesn't have a horizontal reduce from a SIMD vector to a scalar. Note that on avx512, `_mm512_reduce_add_ps` is a SEQUENCE instruction, which means it will be translated into multiple uops (6 or 7, I don't remember precisely).
If we do gemm in the NT form (which is the default in the pytorch Linear definition): {M, K} * {N, K}, we end up doing a dot-product of each m row with each n col, so eventually we need a horizontal reduce and the write to C is a scalar store.
A better approach is NN: {M, K} * {K, N}. We do FMAs: broadcast each A[row_index] to a vector and load 4 or 6 vectors from the B cols, then use C as an accumulator. We don't need a horizontal reduce and the write to C is a vectorized store.
Above is the explanation of avx512f with fp32 FMA; the other ISAs are similar:
- For avx512f, we transpose B from [N, K] to [K, N]
- For avx-vnni, we can assume that 4 * int8 is a float32, so we pack B from [N, K] to [K/4, N, 4], e.g. [K/4, N4]
- For avx512-bf16 or avx512-f16, we can assume that 2 * bf16 or 2 * f16 is a float32, so we pack B from [N, K] to [K/2, N, 2], e.g. [K/2, N2]
- For amx-vnni, it just does 16 cols together; you may assume that [K/4, N4] = [K/4, 64]
- For amx-bf16 or amx-f16, it also does 16 cols together; you may assume that [K/2, N2] = [K/2, 32]

So you see, all the gemms (fp32, int8, bf16) follow the same pattern (see the layout sketch below).
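To make the bullets above concrete, here is a layout-only sketch using plain tensor reshapes (not the actual packing kernels; the shapes N=8, K=16 are arbitrary and the dtype is irrelevant to the reshuffle):

```python
# How the B repacking described in the bullets above rearranges memory.
import torch

N, K = 8, 16
B = torch.arange(N * K).reshape(N, K)            # logical weight, [N, K]

# fp32 FMA path: plain transpose, [N, K] -> [K, N]
B_f32 = B.t().contiguous()

# bf16 / f16 dot path: pair adjacent K elements, [N, K] -> [K/2, N, 2]
B_bf16 = B.reshape(N, K // 2, 2).permute(1, 0, 2).contiguous()

# vnni int8 path: group 4 K elements, [N, K] -> [K/4, N, 4]
B_vnni = B.reshape(N, K // 4, 4).permute(1, 0, 2).contiguous()

print(B_f32.shape, B_bf16.shape, B_vnni.shape)
# torch.Size([16, 8]) torch.Size([8, 8, 2]) torch.Size([4, 8, 4])
```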
That covers the inner block. Usually we need to break the weight B into multiples of inner blocks, and this part is ISA dependent:
For this piece of code, on avx512 we have 32 regs and we want to use them efficiently. A takes 1 reg, B takes 4, C takes 4 x 4, and the zeros and scales take 2 x 4, so we end up with a block size of 4 x 64 (4 x 4 x 16 floats). avx512-vnni and avx512-bf16 follow the same rule; avx2 is different as it has only 16 regs.
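For the curious, the register accounting behind that 4 x 64 block restated as a tiny calculation (it assumes the standard 16 fp32 lanes of a zmm register):

```python
# Register budget for the avx512 4 x 64 block described above.
lanes = 16                      # fp32 lanes per zmm register
a_regs = 1                      # broadcast A element
b_regs = 4                      # 4 B vectors -> 64 columns
c_regs = 4 * 4                  # 4 row accumulators x 4 column vectors
scale_zero_regs = 2 * 4         # scales and zeros for the 4 column vectors
total = a_regs + b_regs + c_regs + scale_zero_regs
print(total, 4, 4 * lanes)      # 29 (of 32 zmm regs), block = 4 x 64
```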
For amx, we have 8 tiles. Usually the pattern is 2-2-4 or 1-3-3, which means A takes 2 tiles, B takes 2 tiles, and C takes 4 tiles. Each tile computes 16 rows and 16 cols, so the inner block size is usually 16x16 and we handle 2 A blocks and 2 B blocks at a time (the 2-2-4 pattern). When M < 16, we use the 1-3-3 pattern and waste 1 tile.
It is often very difficult to understand the packed layout by reading oneDNN primitive verbose output. We can break it down into 2 parts, and this way it is much easier to sort out the layout:
- the inner block (described first) is the same no matter whether the dtype is int8, bf16, f16, or f32;
- the outer block (described later) depends on register allocation and caching behavior.
Stack from ghstack (oldest at bottom):
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10