
[Kernel] W8A16 Int8 inside FusedMoE #7415

Merged (33 commits) on Aug 16, 2024

Conversation

mzusman
Contributor

@mzusman mzusman commented Aug 12, 2024

🚀 The feature, motivation and pitch

This PR adds a W8A16 Int8 path to the fused_moe Triton kernel, called ExpertsInt8, supported on Ampere/Ada Lovelace/Hopper.
It is based on symmetric per-column, per-expert Int8 quantization; the Int8 weights are cast to FP16/BF16 before the matmul inside the fused_moe kernel (compute_type is FP16/BF16).
Quantization and scale extraction happen at startup (about 1 minute on Jamba).
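
For illustration, here is a minimal PyTorch-level sketch of the scheme described above for a single expert. Names, shapes, and the per-output-channel reading of "per-column" are assumptions, not the actual kernel code; in vLLM the dequantization happens inside the fused Triton kernel.

```python
import torch

def quantize_expert_w8a16(weight: torch.Tensor):
    """Symmetric Int8 quantization of one expert's weight with one
    scale per output channel (per "column" of the matmul output).

    weight: [out_features, in_features] in FP16/BF16.
    """
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scales = max_abs / 127.0
    q_weight = torch.round(weight / scales).clamp(-127, 127).to(torch.int8)
    return q_weight, scales.to(weight.dtype)

def w8a16_matmul(x: torch.Tensor, q_weight: torch.Tensor,
                 scales: torch.Tensor) -> torch.Tensor:
    """Reference W8A16 matmul: the Int8 weights are cast back to the
    activation dtype and rescaled, so the compute stays in FP16/BF16."""
    w = q_weight.to(x.dtype) * scales   # [out_features, in_features]
    return x @ w.t()                    # [tokens, out_features]
```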
We've run quality benchmarks on Jamba and see no quality degradation:

| Benchmark | BF16 | ExpertsInt8 |
|---|---|---|
| gsm8k cot | 59.5 | 59.8 |
| MMLU | 67.3 | 67.3 |
| gsm8k | 50.6 | 50.1 |
| narrative_qa | 67.8 | 68.7 |
| ppl/c4 | -0.4301 | -0.4301 |

Performance:
End-to-end latency in seconds for requests with prompt length = 1024 and decode length = 128:

| Model | Hardware | Method | BS=1 | BS=4 | BS=8 |
|---|---|---|---|---|---|
| Mixtral8x22B | H100*8 | FP8 (MoE quant-only) | 1.3 | 1.79 | 2.26 |
| Mixtral8x22B | H100*8 | ExpertsInt8 | 1.3 | 1.84 | 2.205 |
| Mixtral8x7B | H100*4 | FP8 (MoE quant-only) | 0.835 | 1.22 | 1.44 |
| Mixtral8x7B | H100*4 | ExpertsInt8 | 0.83 | 1.22 | 1.42 |
| Jamba | H100*2 | FP8 (MoE quant-only) | 1.3 | 2 | 3.2 |
| Jamba | H100*2 | ExpertsInt8 | 1.16 | 2 | 3.14 |
| Jamba | A100*2 | FP16 | 1.75 | 3.2 | 4.6 |
| Jamba | A100*2 | ExpertsInt8 | 1.65 | 2.7 | 4 |
| Jamba | A100*2 | GPTQ 8bit (w/o FusedMoE) | 3.8 | 5.9 | 9.4 |

Advantages:

  • Doesn't require a calibration preprocessing step.
  • No quality degradation.
  • A quantized FusedMoE method that runs on A100s.
  • Safer with large activations, since they stay in BF16, reducing the risk of overflow.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mzusman
Contributor Author

mzusman commented Aug 12, 2024

/ready

@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Aug 12, 2024
@mzusman changed the title from [Kernel] W8A16 Int8 MoE to [Kernel] W8A16 Int8 inside FusedMoE on Aug 12, 2024
@halexan

halexan commented Aug 13, 2024

@mzusman

How to convert models to ExpertsInt8?

@robertgshaw2-neuralmagic
Sponsor Collaborator

Hey - just FYI, there are some ongoing efforts to extend Marlin to support W4A16 and W8A16.

Right now the kernels load GPTQ models, but we could really connect them to any model type.

We should run benchmarks against these as well when deciding which kernel to use.

#7079

@mzusman
Contributor Author

mzusman commented Aug 13, 2024

> @mzusman
>
> How to convert models to ExpertsInt8?

You would just need to run vLLM with --quantization experts_int8; it supports quantization on the fly.
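
For example, a minimal sketch using the Python API (the Jamba checkpoint name here is just an illustrative choice):

```python
from vllm import LLM

# Weights are quantized to Int8 and per-expert scales are extracted
# on the fly at startup; no pre-converted checkpoint is needed.
llm = LLM(model="ai21labs/Jamba-v0.1", quantization="experts_int8")
print(llm.generate("The capital of France is")[0].outputs[0].text)
```

The same flag applies when launching the server, via --quantization experts_int8.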

@mzusman
Contributor Author

mzusman commented Aug 13, 2024

> Hey - just FYI, there are some ongoing efforts to extend Marlin to support W4A16 and W8A16.
>
> Right now the kernels load GPTQ models, but we could really connect them to any model type.
>
> We should run benchmarks against these as well when deciding which kernel to use.
>
> #7079

I understand. I tried to benchmark this method against PR #7079 on Nous-Hermes-2-Mixtral-8x7B-DPO-GPTQ-8bit-128g / TheBloke/Mixtral-8x7B-v0.1-GPTQ, but it was unsuccessful (it hangs on startup).

@robertgshaw2-neuralmagic Do you think this is a blocker for merging this PR? We could keep both options available.

@jeejeelee
Contributor

It seems that #6502 is a similar PR as well.

@halexan

halexan commented Aug 15, 2024

I tested this pull request on deepseek-v2-chat-236B. It can indeed handle more concurrency.

Sponsor Collaborator

@mgoin mgoin left a comment


Thanks for the quick changes! I think these are my last round of comments

.buildkite/run-cpu-test.sh Outdated
benchmarks/kernels/benchmark_moe.py Outdated
benchmarks/kernels/benchmark_moe.py Outdated
benchmarks/kernels/benchmark_moe.py Outdated
vllm/model_executor/layers/fused_moe/fused_moe.py Outdated
@mzusman
Contributor Author

mzusman commented Aug 15, 2024

> This generally looks good to me and experiments with some interesting ideas. I think the biggest issue is the clashing assumption between use_fp8 and use_int8, where one is W8A8 and the other is W8A16 - we should definitely be explicit with the level of quantization if we're going this route.
>
> It would be nice if we could select between INT8 W8A16 or W8A8 (since we already have efficient activation quant methods), like @qingquansong has proposed in #6978
>
> In the short-term it seems like we might end up with use_fp8_w8a8, use_int8_w8a8, and use_int8_w8a16 - so it would be nice if we could work towards a more sustainable future for the code.

Thank you for the thorough review! I've renamed use_int8/use_fp8 to use_int8_w8a16/use_fp8_w8a8 and carried the same naming over to the dtype field of the MoE config files as well.
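
As a purely illustrative, hypothetical helper (not the actual fused_moe signature; only the flag names come from the discussion above), the distinction the new names encode is:

```python
def describe_moe_quant(use_fp8_w8a8: bool = False,
                       use_int8_w8a16: bool = False) -> str:
    """Spell out which tensors each flag quantizes."""
    if use_fp8_w8a8:
        # W8A8: both weights and activations are 8-bit (FP8).
        return "FP8 weights, FP8 activations"
    if use_int8_w8a16:
        # W8A16: only the weights are Int8; activations stay 16-bit.
        return "Int8 weights, FP16/BF16 activations"
    return "unquantized FP16/BF16"
```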

@mzusman
Contributor Author

mzusman commented Aug 15, 2024

The CI failures seem unrelated to this PR.

```python
else:
    raise ValueError(
        f"Shard id must be in [0,1,2] but got {shard_id}")
weight_loader(param, loaded_weight, weight_name, shard_id,
```

Just an fyi - we're updating/expanding the weight loading for Fused MoE layers: #7527

Sponsor Collaborator

@mgoin mgoin left a comment


LGTM! Ditto with Dipika that it'd be good to make this work with the MoE Parameter refactor eventually

@mzusman
Contributor Author

mzusman commented Aug 16, 2024

Thanks! I'll rebase; maybe that will resolve the CI issues.

@simon-mo simon-mo merged commit 7fc23be into vllm-project:main Aug 16, 2024
52 of 56 checks passed
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
zifeitong pushed a commit to zifeitong/vllm that referenced this pull request Aug 20, 2024
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Aug 22, 2024
omrishiv pushed a commit to omrishiv/vllm that referenced this pull request Aug 26, 2024
Labels
ready (ONLY add when PR is ready to merge/full CI is needed)
8 participants