Optimized fused MoE Kernel, take 2 #2979

Merged: 30 commits merged into vllm-project:main on Feb 26, 2024

Conversation

@pcmoritz (Collaborator) commented Feb 22, 2024

This replaces #2913

With a more aggressive parameter search, we were actually able to beat the TensorRT kernels even at small batch sizes, which is very fortunate since it avoids the complexity of maintaining a separate TensorRT kernel path.

Here are the results of this PR on H100 with TP2, compared against the current main branch (untuned fused MoE kernel) and against only using the TensorRT MoE kernels:

| qps | this PR (ITL) | main, untuned fused MoE kernel (ITL) | TensorRT MoE kernels only (ITL) |
|-----|---------------|--------------------------------------|---------------------------------|
| 1   | 11.6 ms       | 23.3 ms                              | 18.1 ms                         |
| 2   | 12.6 ms       | 25.4 ms                              | 23.8 ms                         |
| 4   | 23.6 ms       | 43.0 ms                              | 48.1 ms                         |
| 6   | 32.9 ms       | 60.8 ms                              | 90.8 ms                         |

The code is structured so that supporting a different configuration is as simple as dropping a new .json file into the fused_moe/configs directory.
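For illustration, here is a minimal sketch of what such a per-batch-size config might contain and how a tuned entry could be looked up at runtime. The key names mirror typical Triton matmul tuning knobs; the exact schema, values, and lookup rule here are assumptions for illustration, not a copy of the files in this PR:

```python
# Illustrative sketch (assumed schema, not the PR's exact files): keys are
# batch sizes, values are Triton launch parameters tuned for that size.
example_config = {
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 4},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4},
}


def pick_config(configs: dict, num_tokens: int) -> dict:
    """Pick the tuned entry whose batch-size key is closest to num_tokens."""
    best_key = min(configs, key=lambda k: abs(int(k) - num_tokens))
    return configs[best_key]


print(pick_config(example_config, 48))  # closest key is "64"
```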

Co-authored-by: Cade Daniel edacih@gmail.com

@pcmoritz (Collaborator Author):

@WoosukKwon This is now ready to review :)

@WoosukKwon (Collaborator):

Hi @pcmoritz, could you provide the exact benchmark setup? I benchmarked this PR for Mixtral with different batch sizes (from 1 to 256) on 4 H100 GPUs, but didn't see any noticeable speedup.

@pcmoritz (Collaborator Author) commented Feb 24, 2024

@WoosukKwon Thanks for trying it out! The PR only includes optimized settings for TP2 on H100 (and I also added the optimizations for TP4 on A100 80 GB that I have gathered since then), so there is no difference for TP4 on H100 vs. main. I added a README to explain this. For Mixtral it only really makes sense to run TP2 on H100. Instead of TP4 on H100, it is better to run two replicas of TP2 on H100 and split the traffic between them (i.e. for a given latency budget, that gives more throughput). On A100, TP4 is the optimal setting in my experience.
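As a concrete illustration of the two-replicas-of-TP2 setup (not part of this PR), the sketch below round-robins requests across two vLLM OpenAI-compatible servers, each assumed to be started with --tensor-parallel-size 2 on its own pair of GPUs. The replica URLs are placeholders:

```python
# Minimal sketch of splitting traffic across two TP2 replicas instead of
# running a single TP4 deployment. Replica URLs are hypothetical; each is
# assumed to be a vLLM OpenAI-compatible server on its own two GPUs.
import itertools

import requests

REPLICAS = itertools.cycle([
    "http://replica-0:8000",  # e.g. GPUs 0,1
    "http://replica-1:8000",  # e.g. GPUs 2,3
])


def generate(prompt: str) -> str:
    """Send the request to the next replica in round-robin order."""
    base_url = next(REPLICAS)
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={
            "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            "prompt": prompt,
            "max_tokens": 128,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```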

@WoosukKwon (Collaborator) commented Feb 24, 2024

@pcmoritz Got it. Thanks for the explanation! BTW, how did you tune the knobs? I believe the performance ranking between different configurations also depends on the number of input tokens. What batch size did you tune for? Oh, I just realized that the knobs are tuned per batch size. Never mind.

@WoosukKwon (Collaborator) left a review comment

@pcmoritz Thanks for the PR! The speedup is really nice.

BTW, could you provide a script to tune the parameters and also to select the default config? Otherwise, I'd be happy to implement it; I have some experience in tuning Triton kernels.

Review thread on vllm/model_executor/layers/fused_moe/__init__.py (outdated, resolved)
@@ -0,0 +1,20 @@
{
Collaborator:

Could you provide a script to tune these parameters? No worries otherwise. I can implement it.

Collaborator Author:

Yes, happy to add the script. The process of tuning is not fully automatic at the moment and requires some manual modifications, but I will contribute what I have 😊

Collaborator Author:

I added the script benchmark_mixtral_moe.py to do the search -- in practice I wasn't using exactly this script but modified it as I searched through the different batch sizes. Still, it should be a good way to get started :)

A more exhaustive search might improve on these parameters further, but I expect the remaining gap to be pretty small even if we find something better :)
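For context, here is a rough sketch of the shape such a per-batch-size grid search takes. The search space, batch sizes, and benchmarking helper below are illustrative assumptions rather than the actual contents of benchmark_mixtral_moe.py:

```python
# Rough sketch of a per-batch-size grid search over Triton launch parameters.
# The search space and the placeholder kernel call are assumptions.
import itertools
import json

import triton.testing


def run_fused_moe(num_tokens: int, config: dict) -> None:
    """Placeholder: invoke the fused MoE kernel on dummy inputs with `config`."""
    ...


search_space = {
    "BLOCK_SIZE_M": [16, 32, 64, 128],
    "BLOCK_SIZE_N": [32, 64, 128, 256],
    "BLOCK_SIZE_K": [64, 128, 256],
    "GROUP_SIZE_M": [1, 8, 16, 32],
    "num_warps": [4, 8],
    "num_stages": [2, 3, 4],
}

best_configs = {}
for num_tokens in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    best_ms, best_cfg = float("inf"), None
    for values in itertools.product(*search_space.values()):
        cfg = dict(zip(search_space.keys(), values))
        try:
            # Kernel time in milliseconds for this candidate config.
            ms = triton.testing.do_bench(lambda: run_fused_moe(num_tokens, cfg))
        except Exception:
            continue  # e.g. tile sizes that exceed shared memory / registers
        if ms < best_ms:
            best_ms, best_cfg = ms, cfg
    best_configs[str(num_tokens)] = best_cfg

# Dump the winners keyed by batch size, similar in spirit to the files
# under fused_moe/configs/ (assumed shape).
print(json.dumps(best_configs, indent=4))
```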

Collaborator:

Thanks!

Two review threads on vllm/model_executor/layers/fused_moe/fused_moe.py (outdated, resolved)
@pcmoritz (Collaborator Author):

I made all the updates now, PTAL :)

@WoosukKwon self-requested a review on February 26, 2024 at 21:06
@WoosukKwon merged commit cfc15a1 into vllm-project:main on Feb 26, 2024 (13 of 21 checks passed)
@njhill (Collaborator) commented Feb 26, 2024

This is awesome, thanks @pcmoritz! TP2 on A100 would also be very useful; we can have a go at adding a config for that.

@pcmoritz (Collaborator Author):

@njhill Thanks, that would be very much appreciated; feel free to tag me in the PR :)

@paolovic commented Apr 7, 2024

Do I understand correctly that I can use this PR to improve inference time on my cluster, since it allows searching for optimal kernel parameters? If so, how?
Or am I misunderstanding this feature? Sorry for bothering you @pcmoritz, but could you elaborate real quick?
I have Mixtral-8x7B-Instruct-v0.1-GPTQ (8bit-128g-actorder_True) deployed with vLLM on 2x NVIDIA L40S with tensor parallelism = 2.
