add int4 packed gemm support on CPU device #117475

Closed
wants to merge 21 commits

Conversation

mingfeima
Collaborator

@mingfeima commented Jan 15, 2024

Stack from ghstack (oldest at bottom):

This patch adds int4 packed gemm support on CPU; both avx512 and avx2 are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The baseline perf measured on Intel(R) Xeon(R) CPU Max 9480, single socket (56 cores), is 16.13 sec total, 12.40 tokens/sec

  • WOQ int4 on avx512: 5.92 sec total, 33.79 tokens/sec
  • WOQ int4 on avx2: 6.90 sec total, 29.00 tokens/sec

WOQ int4 is measured with the method described at https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10
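
For reference, a minimal sketch of how the two ops exercised by this patch fit together, following the gpt-fast int4 WOQ recipe linked above. This is an illustration only: the shapes, dtypes, and scale/zero layout here are assumptions, and the exact signatures of these private ops may differ across PyTorch versions.

import torch

# Hypothetical sizes; the weight is assumed to already hold quantized int4 values in int32 storage.
N, K, group_size, inner_k_tiles = 4096, 4096, 128, 8
w_int4 = torch.randint(0, 16, (N, K), dtype=torch.int32)

# Pack the weight into the layout the int4 gemm kernel expects.
w_packed = torch._convert_weight_to_int4pack(w_int4, inner_k_tiles)

# Per-group scales and zero points, laid out as in gpt-fast: [K // group_size, N, 2] (assumption).
scales_and_zeros = torch.ones(K // group_size, N, 2, dtype=torch.bfloat16)

x = torch.randn(8, K, dtype=torch.bfloat16)
y = torch._weight_int4pack_mm(x, w_packed, group_size, scales_and_zeros)  # -> [8, N]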


pytorch-bot bot commented Jan 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/117475

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5391b4b with merge base 8a42cff (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mingfeima added a commit that referenced this pull request Jan 15, 2024
add int4 packed gemm on avx2

refine blocking on K

ghstack-source-id: 40a43dc98cbce3dd7559b69697cc69d799bca3b7
Pull Request resolved: #117475
@github-actions bot added the module: cpu (CPU specific problem, e.g., perf, algorithm) label Jan 15, 2024
@mingfeima marked this pull request as draft January 15, 2024 02:32
mingfeima added further commits that referenced this pull request Jan 15, 2024
pytorchmergebot pushed a commit that referenced this pull request Mar 2, 2024
@izaitsevfb
Contributor

@pytorchbot revert -m "fails meta-internal tests" -c ghfirst

@mingfeima, it looks like this PR breaks meta-internal int4 quant tests. I can't share the whole context, but here are some excerpts:
https://gist.github.com/izaitsevfb/0d55264b03d1472968f3114e1db3f100

======================================================================
ERROR: test_save_load_int4woqtensors (pytorch.ao.test.test.TestSaveLoadMeta)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/re_cwd/buck-out/v2/gen/fbcode/299b386e280e5808/pytorch/ao/test/__torchao_tests__/torchao_tests#link-tree/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)

...

  File "/re_cwd/buck-out/v2/gen/fbcode/299b386e280e5808/pytorch/ao/test/__torchao_tests__/torchao_tests#link-tree/torch/__init__.py", line 1121, in _check_with
    raise error_type(message_evaluated)
torch._dynamo.exc.TorchRuntimeError: Failed running call_module L__self___lin1(*(FakeTensor(..., device='cuda:0', size=(32, 64), dtype=torch.bfloat16),), **{}):
shape '[32, 32]' is invalid for input of size 128

from user code:
   File "/re_cwd/buck-out/v2/gen/fbcode/299b386e280e5808/pytorch/ao/test/__torchao_tests__/torchao_tests#link-tree/pytorch/ao/test/test.py", line 1026, in forward
    x = self.lin1(x)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

cc @malfet
internal: D54443571

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@mingfeima your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Mar 4, 2024
This reverts commit 30befa5.

Reverted #117475 on behalf of https://github.com/izaitsevfb due to failing meta-internal tests
@mingfeima
Collaborator Author

> @mingfeima, it looks like this PR breaks meta-internal int4 quant tests. [...]

@izaitsevfb it feels like dynamo compiles with the wrong shape. Given a weight of {32, 64} where n=32 and k=64, the packed weight will be {n, k/2} on the CPU device and {n // 8, k // (inner_k_tiles * 16), 32, inner_k_tiles // 2} on CUDA. I think we can start from change_linear_weights_to_int4_woqtensors to debug this issue.
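
For concreteness, a small sketch of the two packed-shape formulas quoted above; the helper name is made up for illustration and is not a PyTorch API.

# Illustrative only: the packed-weight shapes described above for a weight of {n, k}.
def packed_int4_shapes(n, k, inner_k_tiles):
    cpu_shape = (n, k // 2)                                                   # 2D packing on CPU
    cuda_shape = (n // 8, k // (inner_k_tiles * 16), 32, inner_k_tiles // 2)  # 4D packing on CUDA
    return cpu_shape, cuda_shape

# e.g. the failing case above: n=32, k=64 gives (32, 32) on CPU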

@dervon

dervon commented Mar 5, 2024

@mingfeima, it seems that the behavior of CPU and GPU differs for the op _convert_weight_to_int4pack.

on cpu

>>> import torch
>>> torch.__version__
'2.3.0a0+gitf84375c'
>>> torch._convert_weight_to_int4pack(torch.arange(16*16, dtype=torch.int32).reshape(16, 16), 8).shape
torch.Size([16, 8])

on gpu

>>> import torch
>>> torch.__version__
'2.2.1+cu121'
>>> torch._convert_weight_to_int4pack(torch.arange(128*256, dtype=torch.int32).reshape(128, 256).cuda(), 8).shape
torch.Size([16, 2, 32, 4])

As you can see, the output dims are different.

@mingfeima
Collaborator Author

mingfeima commented Mar 5, 2024

> @mingfeima, it seems that the behavior of CPU and GPU differs for the op _convert_weight_to_int4pack.

@dervon Yes, CPU and GPU have different packed formats for int4: CPU uses a 2D tensor and GPU uses a 4D tensor. Will this be a problem for dynamo? Would the issue be fixed if I used a fake 4D tensor, e.g. {n, k/2, 1, 1}, for the CPU packed weight?

Additionally, the packed weight dtype is also different: CPU uses torch.uint8 and GPU uses torch.int32. Will this be a problem as well?

@malfet self-requested a review March 6, 2024 00:48
@malfet
Contributor

malfet commented Mar 6, 2024

@mingfeima sorry, I've missed this part earlier, but the point of meta registration is that for the same input shape it should return the same output shape, regardless of whether the tensor is on CPU, GPU, XPU or whatever. How to interpret this output is up to the operator implementation, but shapes and dtypes should be the same. So, if you don't mind, can you change the code to return a 4D int32 tensor (each int32 is a pack of 4 int8) rather than a 2D int8 tensor?

@mingfeima
Collaborator Author

@malfet Just updated this patch to use the same shape for packed weight on both CPU and CUDA device! Hopefully dynamo tests won't fail this time.

Lourencom pushed commits to Lourencom/pytorch that referenced this pull request Mar 6, 2024
Pull Request resolved: pytorch#117475
Approved by: https://github.com/jgong5, https://github.com/malfet
@malfet
Contributor

malfet commented Mar 6, 2024

> @mingfeima, it looks like this PR breaks meta-internal int4 quant tests. [...]

@izaitsevfb just FYI, torch.ao is not internal-only; it's just that we run those tests right away internally, whereas it would have taken us a day to discover this in OSS, once torch.ao picks up the latest nightly.

Test that was failing is likely this one https://github.com/pytorch-labs/ao/blob/c9b397de3895610cfbbca2ccef96fc12c9208885/test/test.py#L1013

Manually run those tests before merging

@malfet
Contributor

malfet commented Mar 6, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

}
#else
for (int n = 0; n < nb_size; n += 2) {
  int32_t val0 = src[n * K + k];
Contributor

I'm curious, what is the reason for transposing the tensor while packing it?

Collaborator Author

The major reason is that x86 doesn't have a horizontal reduce from a SIMD vector to a scalar. Note that on avx512, _mm512_reduce_add_ps is a SEQUENCE instruction, which means it is translated into multiple uops (6 or 7, I don't remember precisely).

If we do the gemm in NT form (which is the default in the PyTorch Linear definition), {M, K} * {N, K}, we end up doing a dot product of each m row against each n column; eventually we need a horizontal reduce, and the write to C is a scalar store.

A better approach is NN: {M, K} * {K, N}. Here we do FMAs: broadcast each A element to a vector, load 4 or 6 vectors from B columns, and use C as the accumulator. We don't need a horizontal reduce, and the write to C is a vectorized store.
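
A toy illustration of that NN pattern follows; this is a numpy stand-in purely to show why no horizontal reduce is needed, not the actual avx512 kernel.

import numpy as np

# A is [M, K]; B has already been repacked to [K, N].
def nn_inner_kernel(A, B_kn):
    M, K = A.shape
    _, N = B_kn.shape
    C = np.zeros((M, N), dtype=A.dtype)   # C acts as the accumulator
    for m in range(M):
        for k in range(K):
            # "broadcast" A[m, k], "FMA" against a whole row of packed B, vectorized store into C
            C[m, :] += A[m, k] * B_kn[k, :]
    return C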

The above explains avx512f with fp32 FMA; the other ISAs are similar:

  • For avx512f, we transpose B from [N, K] to [K, N]
  • For avx-vnni, we can treat 4 * int8 as a float32, so we pack B from [N, K] to [K/4, N, 4], e.g. [K/4, N4]
  • For avx512-bf16 or avx512-f16, we can treat 2 * bf16 or 2 * f16 as a float32, so we pack B from [N, K] to [K/2, N, 2], e.g. [K/2, N2]
  • For amx-vnni, it just does 16 cols together, so you may think of [K/4, N4] as [K/4, 64]
  • For amx-bf16 or amx-f16, it also does 16 cols together, so you may think of [K/2, N2] as [K/2, 32]

So you see, all the gemms (fp32, int8, bf16) follow the same pattern.
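
For illustration, the corresponding B repacks in numpy: a sketch of the layouts listed above, not the actual packing kernels.

import numpy as np

def pack_nt_to_nn(B):        # avx512f fp32: [N, K] -> [K, N]
    return np.ascontiguousarray(B.T)

def pack_vnni4(B):           # avx-vnni int8: [N, K] -> [K/4, N, 4]
    N, K = B.shape
    return np.ascontiguousarray(B.reshape(N, K // 4, 4).transpose(1, 0, 2))

def pack_vnni2(B):           # avx512-bf16 / avx512-f16: [N, K] -> [K/2, N, 2]
    N, K = B.shape
    return np.ascontiguousarray(B.reshape(N, K // 2, 2).transpose(1, 0, 2))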

The above describes the inner block. Usually we also need to break the weight B into multiples of inner blocks, and this outer blocking is ISA dependent:

For this piece of code, on avx512 we have 32 regs and we want to use them efficiently: A takes 1 reg, B takes 4, C takes 4 x 4, and the zeros and scales take 2 x 4. So we end up with a block size of 4 x 64 (4 x 4 x 16 floats). avx512-vnni and avx512-bf16 follow the same rule; avx2 is different as it has only 16 regs.
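
Spelling out that register budget (the numbers are taken directly from the paragraph above):

# avx512: 32 zmm registers, 16 fp32 lanes each.
LANES = 16
A_REGS = 1             # one broadcast value from A
B_REGS = 4             # four column vectors loaded from packed B
C_REGS = 4 * 4         # 4 rows x 4 vectors of accumulators
ZP_SCALE_REGS = 2 * 4  # zero point + scale for each of the 4 B vectors
assert A_REGS + B_REGS + C_REGS + ZP_SCALE_REGS <= 32   # 29 of 32 regs used
BLOCK_M, BLOCK_N = 4, B_REGS * LANES                    # -> the 4 x 64 inner block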

For amx, we have 8 tiles. Usually the pattern is 2-2-4 or 1-3-3, which means A takes 2 tiles, B takes 2 tiles, and C takes 4 tiles. Each tile computes 16 rows and 16 cols, so the inner block size is usually 16x16 and we handle 2 A blocks and 2 B blocks at a time (the 2-2-4 pattern). When M < 16, we use the 1-3-3 pattern and waste 1 tile.

It is often very difficult to understand the packed layout by reading oneDNN primitive verbose output. If we break it down into 2 parts, it becomes much easier to sort out the layout:

  • the inner block (described first) is the same regardless of int8, bf16, f16, or f32;
  • the outer block (described second) depends on register allocation and caching behavior.

Labels
ciflow/trunk (Trigger trunk jobs on your pull request) · Merged · module: cpu (CPU specific problem, e.g., perf, algorithm) · open source · release notes: linalg_frontend (release notes category) · Reverted
Projects
Status: Done