Optimized fused MoE Kernel, take 2 #2979

Merged: 30 commits merged into vllm-project:main on Feb 26, 2024

Conversation

@pcmoritz (Collaborator) commented Feb 22, 2024

This replaces #2913

With a more aggressive parameter search, we were actually able to beat the TensorRT kernels even at small batch sizes, which is very fortunate since it avoids the complexity of maintaining a separate TensorRT kernel path.

Here are the results of this PR on H100 with TP2, compared against the current main branch (untuned fused MoE kernel) and against only using the TensorRT MoE kernels:

| qps | this PR (ITL) | main, untuned fused MoE kernel (ITL) | TensorRT MoE kernels only (ITL) |
|-----|---------------|--------------------------------------|---------------------------------|
| 1   | 11.6 ms       | 23.3 ms                              | 18.1 ms                         |
| 2   | 12.6 ms       | 25.4 ms                              | 23.8 ms                         |
| 4   | 23.6 ms       | 43.0 ms                              | 48.1 ms                         |
| 6   | 32.9 ms       | 60.8 ms                              | 90.8 ms                         |

The code is structured so that supporting a different configuration is as simple as dropping a new .json file into the fused_moe/configs directory.
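For illustration, here is a minimal sketch of what such a per-batch-size config might contain and how a tuned entry could be looked up at runtime. The key names mirror typical Triton matmul tuning knobs; the exact schema, values, and lookup rule here are assumptions for illustration, not a copy of the files in this PR:

```python
# Illustrative sketch (assumed schema, not the PR's exact files): keys are
# batch sizes, values are Triton launch parameters tuned for that size.
example_config = {
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 4},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4},
}


def pick_config(configs: dict, num_tokens: int) -> dict:
    """Pick the tuned entry whose batch-size key is closest to num_tokens."""
    best_key = min(configs, key=lambda k: abs(int(k) - num_tokens))
    return configs[best_key]


print(pick_config(example_config, 48))  # closest key is "64"
```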

Co-authored-by: Cade Daniel edacih@gmail.com

@pcmoritz (Collaborator Author):

@WoosukKwon This is now ready to review :)

@WoosukKwon (Collaborator):

Hi @pcmoritz, could you provide the exact benchmark setup? I benchmarked this PR for Mixtral with different batch sizes (from 1 to 256) on 4 H100 GPUs, but didn't see any noticeable speedup.

@pcmoritz (Collaborator Author) commented Feb 24, 2024

@WoosukKwon Thanks for trying it out! The PR only includes optimized settings for TP2 on H100 (and I also added the optimizations for TP4 on A100 80 GB that I have gathered since then), so there is no difference for TP4 on H100 vs. main. I added a README to explain this. For Mixtral it only really makes sense to run TP2 on H100. Instead of TP4 on H100, it is better to run two replicas of TP2 on H100 and split the traffic between them (i.e. for a given latency budget, that gives more throughput). On A100, TP4 is the optimal setting in my experience.
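As a concrete illustration of the two-replicas-of-TP2 setup (not part of this PR), the sketch below round-robins requests across two vLLM OpenAI-compatible servers, each assumed to be started with --tensor-parallel-size 2 on its own pair of GPUs. The replica URLs are placeholders:

```python
# Minimal sketch of splitting traffic across two TP2 replicas instead of
# running a single TP4 deployment. Replica URLs are hypothetical; each is
# assumed to be a vLLM OpenAI-compatible server on its own two GPUs.
import itertools

import requests

REPLICAS = itertools.cycle([
    "http://replica-0:8000",  # e.g. GPUs 0,1
    "http://replica-1:8000",  # e.g. GPUs 2,3
])


def generate(prompt: str) -> str:
    """Send the request to the next replica in round-robin order."""
    base_url = next(REPLICAS)
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={
            "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            "prompt": prompt,
            "max_tokens": 128,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```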

@WoosukKwon (Collaborator) commented Feb 24, 2024

@pcmoritz Got it. Thanks for the explanation! BTW, how did you tune the knobs? I believe the performance ranking between different configurations also depends on the number of input tokens. What batch size did you tune for? Oh, I just realized that the knobs are tuned per batch size. Never mind.

@WoosukKwon (Collaborator) left a review comment

@pcmoritz Thanks for the PR! The speedup is really nice.

BTW, could you provide a script to tune the parameters and also to select the default config? Otherwise, I'd be happy to implement it; I have some experience in tuning Triton kernels.

Review thread on vllm/model_executor/layers/fused_moe/__init__.py (outdated, resolved)
@@ -0,0 +1,20 @@
{
Collaborator:

Could you provide a script to tune these parameters? No worries otherwise. I can implement it.

Collaborator Author:

Yes, happy to add the script. The process of tuning is not fully automatic at the moment and requires some manual modifications, but I will contribute what I have 😊

Collaborator Author:

I added the script benchmark_mixtral_moe.py to do the search -- in practice I wasn't using exactly this script but modified it as I searched through the different batch sizes. Still, it should be a good way to get started :)

A more exhaustive search might improve on these parameters further, but I expect the remaining gap to be pretty small even if we find something better :)
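For context, here is a rough sketch of the shape such a per-batch-size grid search takes. The search space, batch sizes, and benchmarking helper below are illustrative assumptions rather than the actual contents of benchmark_mixtral_moe.py:

```python
# Rough sketch of a per-batch-size grid search over Triton launch parameters.
# The search space and the placeholder kernel call are assumptions.
import itertools
import json

import triton.testing


def run_fused_moe(num_tokens: int, config: dict) -> None:
    """Placeholder: invoke the fused MoE kernel on dummy inputs with `config`."""
    ...


search_space = {
    "BLOCK_SIZE_M": [16, 32, 64, 128],
    "BLOCK_SIZE_N": [32, 64, 128, 256],
    "BLOCK_SIZE_K": [64, 128, 256],
    "GROUP_SIZE_M": [1, 8, 16, 32],
    "num_warps": [4, 8],
    "num_stages": [2, 3, 4],
}

best_configs = {}
for num_tokens in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    best_ms, best_cfg = float("inf"), None
    for values in itertools.product(*search_space.values()):
        cfg = dict(zip(search_space.keys(), values))
        try:
            # Kernel time in milliseconds for this candidate config.
            ms = triton.testing.do_bench(lambda: run_fused_moe(num_tokens, cfg))
        except Exception:
            continue  # e.g. tile sizes that exceed shared memory / registers
        if ms < best_ms:
            best_ms, best_cfg = ms, cfg
    best_configs[str(num_tokens)] = best_cfg

# Dump the winners keyed by batch size, similar in spirit to the files
# under fused_moe/configs/ (assumed shape).
print(json.dumps(best_configs, indent=4))
```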

Collaborator:

Thanks!

Two review threads on vllm/model_executor/layers/fused_moe/fused_moe.py (outdated, resolved)
@pcmoritz (Collaborator Author):

I made all the updates now, PTAL :)

@WoosukKwon self-requested a review on February 26, 2024 at 21:06
@WoosukKwon merged commit cfc15a1 into vllm-project:main on Feb 26, 2024 (13 of 21 checks passed)
@njhill (Collaborator) commented Feb 26, 2024

This is awesome, thanks @pcmoritz! TP2 on A100 would also be very useful; we can have a go at adding a config for that.

@pcmoritz (Collaborator Author):

@njhill Thanks, that would be very much appreciated; feel free to tag me in the PR :)

@paolovic commented Apr 7, 2024

Do I understand correctly that I can use this PR to improve inference time on my cluster, since it allows searching for optimal kernel parameters? If so, how?
Or am I misunderstanding this feature? Sorry for bothering you @pcmoritz, but could you elaborate real quick?
I have Mixtral-8x7B-Instruct-v0.1-GPTQ (8bit-128g-actorder_True) deployed with vLLM on 2x NVIDIA L40S with tensor parallelism = 2.
