add int4 packed gemm support on CPU device #117475
Conversation
add int4 packed gemm on avx2 refine blocking on K [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/117475
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 5391b4b with merge base 8a42cff. This comment was automatically generated by Dr. CI and updates every 15 minutes.
add int4 packed gemm on avx2 refine blocking on K ghstack-source-id: 40a43dc98cbce3dd7559b69697cc69d799bca3b7 Pull Request resolved: #117475
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `28.36 sec total, 7.05 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `11.91 sec total, 16.80 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
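For context, the two ops involved here can be exercised roughly as follows. This is a minimal sketch assuming the gpt-fast style WOQ int4 recipe (bf16 activations, per-group scales and zero points, group size 32, `inner_k_tiles` of 2); the concrete dtype and shape constraints of the CPU kernels are assumptions, not taken from this thread:

```python
# Rough sketch of the int4 packed-gemm call sequence (assumed constraints).
import torch

M, N, K = 4, 32, 128
group_size, inner_k_tiles = 32, 2

# int4 values held one per int32 element before packing, as in the REPL
# examples later in this thread
w_int32 = torch.randint(0, 16, (N, K), dtype=torch.int32)
w_packed = torch._convert_weight_to_int4pack(w_int32, inner_k_tiles)

# per-group (scale, zero_point) pairs, assumed layout [K // group_size, N, 2]
scales_and_zeros = torch.ones(K // group_size, N, 2, dtype=torch.bfloat16)

x = torch.randn(M, K, dtype=torch.bfloat16)
y = torch._weight_int4pack_mm(x, w_packed, group_size, scales_and_zeros)
print(y.shape)  # torch.Size([4, 32])
```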
Pull Request resolved: #118056 Approved by: https://github.com/mikekgfb ghstack dependencies: #117475
@pytorchbot revert -m "fails meta-internal tests" -c ghfirst

@mingfeima, it looks like this PR breaks meta-internal int4 quant tests. I can't share the whole context, but here are some excerpts:
@pytorchbot successfully started a revert job. Check the current status here.
@mingfeima your PR has been successfully reverted.
This reverts commit 30befa5. Reverted #117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](#117475 (comment)))
@izaitsevfb it feels like dynamo compiles with the wrong shape. Given a weight of {32, 64} where n=32 and k=64, the packed weight will be {n, k/2} on the CPU device and {n // 8, k // (inner_k_tiles * 16), 32, inner_k_tiles // 2} on CUDA. I think we can start from …
@mingfeima, it seems that the behavior of cpu and gpu is different for the op `_convert_weight_to_int4pack`.

on cpu

```python
>>> import torch
>>> torch.__version__
'2.3.0a0+gitf84375c'
>>> torch._convert_weight_to_int4pack(torch.arange(16*16, dtype=torch.int32).reshape(16, 16), 8).shape
torch.Size([16, 8])
```

on gpu

```python
>>> import torch
>>> torch.__version__
'2.2.1+cu121'
>>> torch._convert_weight_to_int4pack(torch.arange(128*256, dtype=torch.int32).reshape(128, 256).cuda(), 8).shape
torch.Size([16, 2, 32, 4])
```

As you can see, the packed shapes returned on cpu and gpu are different.
@dervon Yes, cpu and gpu have different packed formats for int4: cpu uses a 2d tensor and gpu uses a 4d tensor. Will this be a problem for dynamo? Would the issue be fixed if I use a fake 4d tensor, e.g. {n, k/2, 1, 1}, for the cpu packed weight? Additionally, the packed weight dtype is also different: cpu uses `int8` while gpu uses `int32`.
@mingfeima sorry, I've missed this part earlier, but the point of meta registration is that for the same input shape it should return the same output shape, regardless of whether the tensor is on CPU, GPU, XPU or whatever. How to interpret this output is up to the operator implementation, but the shapes and dtypes must be the same. So, if you don't mind, can you change the code to return a 4D int32 tensor (each int32 is a pack of 4 int8 values) rather than a 2D int8 tensor?
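To restate what is being asked here in code form: for a given weight shape, the shape function should report one canonical output, independent of device. A minimal sketch, assuming the 4D layout quoted earlier in this thread (the function name is illustrative; in PyTorch this logic would live in the op's meta registration rather than in user code):

```python
# Sketch of a device-agnostic shape rule for _convert_weight_to_int4pack.
import torch

def int4pack_meta_shape(w: torch.Tensor, inner_k_tiles: int) -> torch.Tensor:
    n, k = w.shape
    # the 4D int32 layout reported by the CUDA path above, reused for every device
    return torch.empty(
        n // 8, k // (inner_k_tiles * 16), 32, inner_k_tiles // 2,
        dtype=torch.int32, device="meta",
    )

# e.g. for the {128, 256} weight from the gpu example above with inner_k_tiles=8:
# int4pack_meta_shape(torch.empty(128, 256, device="meta"), 8).shape
# -> torch.Size([16, 2, 32, 4])
```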
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
@malfet Just updated this patch to use the same shape for the packed weight on both CPU and CUDA devices! Hopefully the dynamo tests won't fail this time.
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: pytorch#117475
Approved by: https://github.com/jgong5, https://github.com/malfet
@izaitsevfb just FYI, torch.ao is not internal only, it's just that we run those tests right away internally, whereas it would've taken us a day to discover the failure in OSS, once torch.ao picks up the latest nightly. The test that was failing is likely this one: https://github.com/pytorch-labs/ao/blob/c9b397de3895610cfbbca2ccef96fc12c9208885/test/test.py#L1013

Manually run those tests before merging.
@pytorchbot merge

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
```cpp
}
#else
for (int n = 0; n < nb_size; n += 2) {
  int32_t val0 = src[n * K + k];
```
I'm curious, what is the reason for transposing the tensor while packing it?
The major reason is that x86 doesn't have a horizontal reduce from a SIMD vector to a scalar. Note that on avx512, `_mm512_reduce_add_ps` is a SEQUENCE instruction, which means it will be translated into multiple uops (6 or 7, I don't remember precisely).
If we do gemm in the NT form (which is the default in the pytorch Linear definition): {M, K} * {N, K}, we end up doing a dot-product of each m row with each n col, so eventually we need a horizontal reduce and the write to C is a scalar store.
A better approach is NN: {M, K} * {K, N}. We do FMAs: broadcast each A[row_index] to a vector and load 4 or 6 vectors from the B cols, then use C as an accumulator. We don't need a horizontal reduce and the write to C is a vectorized store.
Above is the explanation of avx512f with fp32 FMA; the other ISAs are similar:
- For avx512f, we transpose B from [N, K] to [K, N]
- For avx-vnni, we can assume that 4 * int8 is a float32, so we pack B from [N, K] to [K/4, N, 4], e.g. [K/4, N4]
- For avx512-bf16 or avx512-f16, we can assume that 2 * bf16 or 2 * f16 is a float32, so we pack B from [N, K] to [K/2, N, 2], e.g. [K/2, N2]
- For amx-vnni, it just does 16 cols together; you may assume that [K/4, N4] = [K/4, 64]
- For amx-bf16 or amx-f16, it also does 16 cols together; you may assume that [K/2, N2] = [K/2, 32]

So you see, all the gemms (fp32, int8, bf16) follow the same pattern (see the layout sketch below).
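To make the bullets above concrete, here is a layout-only sketch using plain tensor reshapes (not the actual packing kernels; the shapes N=8, K=16 are arbitrary and the dtype is irrelevant to the reshuffle):

```python
# How the B repacking described in the bullets above rearranges memory.
import torch

N, K = 8, 16
B = torch.arange(N * K).reshape(N, K)            # logical weight, [N, K]

# fp32 FMA path: plain transpose, [N, K] -> [K, N]
B_f32 = B.t().contiguous()

# bf16 / f16 dot path: pair adjacent K elements, [N, K] -> [K/2, N, 2]
B_bf16 = B.reshape(N, K // 2, 2).permute(1, 0, 2).contiguous()

# vnni int8 path: group 4 K elements, [N, K] -> [K/4, N, 4]
B_vnni = B.reshape(N, K // 4, 4).permute(1, 0, 2).contiguous()

print(B_f32.shape, B_bf16.shape, B_vnni.shape)
# torch.Size([16, 8]) torch.Size([8, 8, 2]) torch.Size([4, 8, 4])
```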
That covers the inner block. Usually we need to break the weight B into multiples of inner blocks, and this part is ISA dependent:
For this piece of code, on avx512 we have 32 regs and we want to use them efficiently. A takes 1 reg, B takes 4, C takes 4 x 4, and the zeros and scales take 2 x 4, so we end up with a block size of 4 x 64 (4 x 4 x 16 floats). avx512-vnni and avx512-bf16 follow the same rule; avx2 is different as it has only 16 regs.
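For the curious, the register accounting behind that 4 x 64 block restated as a tiny calculation (it assumes the standard 16 fp32 lanes of a zmm register):

```python
# Register budget for the avx512 4 x 64 block described above.
lanes = 16                      # fp32 lanes per zmm register
a_regs = 1                      # broadcast A element
b_regs = 4                      # 4 B vectors -> 64 columns
c_regs = 4 * 4                  # 4 row accumulators x 4 column vectors
scale_zero_regs = 2 * 4         # scales and zeros for the 4 column vectors
total = a_regs + b_regs + c_regs + scale_zero_regs
print(total, 4, 4 * lanes)      # 29 (of 32 zmm regs), block = 4 x 64
```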
For amx, we have 8 tiles. Usually the pattern is 2-2-4 or 1-3-3, which means A takes 2 tiles, B takes 2 tiles, and C takes 4 tiles. Each tile computes 16 rows and 16 cols, so the inner block size is usually 16x16 and we handle 2 A blocks and 2 B blocks at a time (the 2-2-4 pattern). When M < 16, we use the 1-3-3 pattern and waste 1 tile.
It is often very difficult to understand the packed layout by reading oneDNN primitive verbose output. We can break it down into 2 parts, and this way it is much easier to sort out the layout:
- the inner block (described first) is the same no matter whether the dtype is int8, bf16, f16, or f32;
- the outer block (described later) depends on register allocation and caching behavior.
Stack from ghstack (oldest at bottom):
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10