Add fused top-K softmax kernel for MoE #2769

Merged: 42 commits merged into main from topk-softmax on Feb 6, 2024

Conversation

WoosukKwon (Collaborator) commented Feb 5, 2024

This PR ports a fused top-k softmax kernel from TensorRT-LLM v0.7.1.

TODO:

  1. Port more MoE-related kernels
  2. Use CUTLASS-based grouped GEMM kernels with appropriate tuning (if they perform better than the current Triton kernel).
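
For readers unfamiliar with the routing step, here is a minimal unfused PyTorch sketch of what a top-k softmax over router logits computes: a softmax across the experts, selection of the k highest-probability experts, and a Mixtral-style renormalization of the kept weights. The function name and shapes are illustrative, not the actual kernel interface; fusing these steps into a single launch avoids materializing the full probability matrix and the extra kernel launches.

```python
import torch

def topk_softmax_reference(router_logits: torch.Tensor, top_k: int):
    """router_logits: [num_tokens, num_experts] gating scores from the router."""
    # Softmax over all experts for every token (float32 for numerical stability).
    probs = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    # Keep the k most probable experts per token.
    topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)
    # Renormalize the kept weights so they sum to 1 per token (Mixtral-style convention).
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids  # each of shape [num_tokens, top_k]

# Example: route 4 tokens over 8 experts with top-2 selection.
weights, ids = topk_softmax_reference(torch.randn(4, 8), top_k=2)
```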

WoosukKwon (Collaborator, Author)

@Yard1 @cadedaniel @pcmoritz Can any one of you review the PR?

pcmoritz (Collaborator) commented Feb 5, 2024

Yes, happy to review! Thanks a lot for writing this :)

WoosukKwon (Collaborator, Author)

@pcmoritz Thanks!

pcmoritz (Collaborator) commented Feb 5, 2024

Btw, I did a bit of benchmarking on this PR, and without touching any of the system parameters I'm already seeing a 1.5%-3.5% end-to-end latency improvement; it is higher in the low-latency regime. Concretely, I tested Mixtral at TP=2 on H100 with 1000 input and 50 output tokens. So it seems worth merging even though the low-level kernel code is not easy to follow -- most people can probably just treat it as a black box, so it shouldn't have a big impact on maintainability.
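
For context on reproducing a measurement like this, below is a rough sketch using vLLM's offline LLM API. The model string, prompt construction, and single-shot timing are illustrative assumptions, not the exact setup used above; the repo's benchmarks/benchmark_latency.py does a similar measurement more carefully, with warmup and repeated iterations.

```python
# Rough end-to-end latency sketch (illustrative only; needs 2 GPUs for TP=2).
import time

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)
sampling = SamplingParams(temperature=0.0, max_tokens=50)  # 50 output tokens

prompt = "Hello " + "world " * 1000  # stand-in for a ~1000-token input

start = time.perf_counter()
llm.generate([prompt], sampling)
print(f"end-to-end latency: {time.perf_counter() - start:.3f} s")
```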

pcmoritz (Collaborator) left a comment

I will spend some more time trying to understand the implementation in topk_softmax_kernels.cu, but no need to block on that, since it's mostly the upstream code from https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.1/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu and in any case we should probably keep it close to that and not change it :)

Review thread on tests/kernels/test_moe.py (outdated, resolved)
WoosukKwon (Collaborator, Author)

@pcmoritz Thanks again for your review! Yes, I think we don't have to worry too much about the implementation details, at least at the moment, as I only made a minor change to the kernel.

pcmoritz (Collaborator) commented Feb 6, 2024

Sounds good, the PR looks great :)

WoosukKwon merged commit f0d4e14 into main on Feb 6, 2024 (17 checks passed).
WoosukKwon deleted the topk-softmax branch on February 6, 2024 at 01:38.
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
alexm-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Feb 13, 2024
jvmncs pushed a commit to jvmncs/vllm that referenced this pull request Feb 14, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 20, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 22, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024