Fused MOE for Mixtral #2542

Merged (23 commits) on Jan 30, 2024

Conversation

@pcmoritz (Collaborator) commented Jan 22, 2024

This builds on #2453 and #2293 to fuse the MOE kernel for the Mixtral model.

It seems to give a significant performance improvement (in my setup, from 28600 to 33600 tok/s with 1000 input tokens and 50 output tokens on H100).

Latency with python benchmarks/benchmark_latency.py --model=mistralai/Mixtral-8x7B-Instruct-v0.1 --input-len 1000 --output-len 50 -tp 8 --num-iters 100 --batch-size <bs>:

This PR:

bs=1: 0.459s
bs=2: 0.515s
bs=4: 0.610s
bs=8: 0.813s
bs=16: 1.044s
bs=32: 1.489s
bs=64: 2.419s

Compare to master:

bs=1: 0.590s
bs=2: 0.631s
bs=4: 0.709s
bs=8: 0.838s
bs=16: 1.086s
bs=32: 1.615s
bs=64: 2.727s
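
(For reference, a minimal sketch of how the batch-size sweep above can be scripted, simply shelling out to the existing benchmark; the batch-size list mirrors the tables and the rest is illustrative, not code from this PR:)

import subprocess

# Sweep the batch sizes reported above with vLLM's latency benchmark,
# printing each run's raw output.
for bs in [1, 2, 4, 8, 16, 32, 64]:
    cmd = [
        "python", "benchmarks/benchmark_latency.py",
        "--model=mistralai/Mixtral-8x7B-Instruct-v0.1",
        "--input-len", "1000",
        "--output-len", "50",
        "-tp", "8",
        "--num-iters", "100",
        "--batch-size", str(bs),
    ]
    print(f"bs={bs}")
    subprocess.run(cmd, check=True)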

@pcmoritz changed the title from "[WIP] Fused MOE for Mixtral" to "Fused MOE for Mixtral" on Jan 22, 2024.

@pcmoritz (Collaborator, Author):

MMLU evaluation on this PR looks good as well:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7038|±  |0.1407|
| - humanities     |N/A    |none  |     5|acc   |0.6459|±  |0.1578|
| - other          |N/A    |none  |     5|acc   |0.7763|±  |0.1104|
| - social_sciences|N/A    |none  |     5|acc   |0.8109|±  |0.0704|
| - stem           |N/A    |none  |     5|acc   |0.6143|±  |0.1402|
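
(The table above is in lm-evaluation-harness output format. As a hedged sketch of one way such an MMLU run can be reproduced against a vLLM model, assuming a recent lm-eval with its vLLM backend is installed; the model args and task name here are assumptions, not taken from this PR:)

import lm_eval

# Hedged sketch: evaluate MMLU via lm-evaluation-harness' vLLM backend.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=8",
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"])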

@pcmoritz (Collaborator, Author):

Latency numbers:

python benchmarks/benchmark_latency.py --model=mistralai/Mixtral-8x7B-Instruct-v0.1 --input-len 1000 --output-len 50 -tp 8 --num-iters 100 --batch-size <bs>

This PR:

bs=1: 0.561s
bs=2: 0.600s
bs=4: 0.725s
bs=8: 0.912s
bs=16: 1.131s
bs=32: 1.547s
bs=64: 2.43s

Master:

bs=1: 0.590s
bs=2: 0.631s
bs=4: 0.709s
bs=8: 0.838s
bs=16: 1.086s
bs=32: 1.615s
bs=64: 2.727s

So very nice improvements on both throughput and latency (except for some medium batch sizes, where master is still slightly ahead; that can perhaps be improved further by better tuning of the kernel block sizes).
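
(As a rough illustration of what such block-size tuning could look like, a sketch using triton.testing.do_bench; run_with is only a stand-in here, and a real sweep would launch the fused MoE Triton kernel with each configuration. None of this is code from the PR.)

import itertools
import torch
import triton

a = torch.randn(4096, 14336, device="cuda", dtype=torch.bfloat16)
b = torch.randn(14336, 4096, device="cuda", dtype=torch.bfloat16)

def run_with(config):
    # Stand-in workload so the sketch is self-contained; a real sweep would
    # launch the fused MoE kernel with the given block sizes instead.
    return a @ b

timings = []
for m, n, k in itertools.product([16, 64, 128], [64, 128, 256], [64, 128, 256]):
    cfg = {"BLOCK_SIZE_M": m, "BLOCK_SIZE_N": n, "BLOCK_SIZE_K": k}
    ms = triton.testing.do_bench(lambda: run_with(cfg))
    timings.append((ms, cfg))

# Report the fastest configuration found.
print(min(timings, key=lambda t: t[0]))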

@pcmoritz (Collaborator, Author):

With the latest version of the fused MOE kernel, this PR now strictly dominates current master (same settings as above):

bs=1: 0.459s
bs=2: 0.515s
bs=4: 0.610s
bs=8: 0.813s
bs=16: 1.044s
bs=32: 1.489s
bs=64: 2.419s

@WoosukKwon self-requested a review on January 26, 2024.

@pcmoritz (Collaborator, Author):

@WoosukKwon It probably makes sense to review/merge #2453 first since the fused_moe kernel is from there :)

@casper-hansen (Contributor) commented Jan 28, 2024

@pcmoritz I tried importing your code from here and found that there is a maximum absolute difference of 0.3545 in the logits between the normal Mixtral MoE and the fused one.

That seems like a large difference. Could you add a test comparing the normal MixtralSparseMoeBlock from transformers with the fused module?
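
(For illustration, a minimal sketch of such a check, assuming both modules live on the same device and dtype and have been loaded with identical weights; hf_moe and fused_moe are placeholders for the two implementations:)

import torch

def compare_moe_outputs(hf_moe, fused_moe, hidden_size=4096, seq_len=32):
    # hf_moe: transformers MixtralSparseMoeBlock; fused_moe: the fused module.
    x = torch.randn(1, seq_len, hidden_size, device="cuda", dtype=torch.bfloat16)
    with torch.no_grad():
        ref = hf_moe(x)
        out = fused_moe(x)
    # MixtralSparseMoeBlock returns (hidden_states, router_logits).
    if isinstance(ref, tuple):
        ref = ref[0]
    if isinstance(out, tuple):
        out = out[0]
    print("max abs diff:", (ref - out).abs().max().item())
    torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)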

@pcmoritz (Collaborator, Author) commented Jan 28, 2024

Thanks @casper-hansen, let me look into this some more and compare the numerics with HuggingFace. Here is what I've figured out so far. First, the HuggingFace reference:

import torch
from transformers import AutoModelForCausalLM

from vllm.model_executor.layers.moe import MoE

from vllm.model_executor.models.mixtral import MixtralModel
from vllm.model_executor.models.mixtral import MixtralForCausalLM


# HuggingFace reference: take the sparse MoE block of the first layer
# and push a random hidden state through it.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
config = model.config

mixtral_moe = model.model.layers[0].block_sparse_moe

hidden_states = torch.randn((1, 1, 4096))
output = mixtral_moe.forward(hidden_states)

Now the vLLM side.

First, initialize model parallelism. This is needed because the model queries the tensor-parallel state, which has to be initialized first; going forward we could make the models runnable independently of that, which would be useful e.g. for unit tests:

from vllm.model_executor.parallel_utils.parallel_state import initialize_model_parallel

# Single-process "distributed" setup, just so the tensor-parallel state
# that the vLLM layers query is initialized.
torch.distributed.init_process_group(
    backend="nccl",
    world_size=1,
    rank=0,
    init_method="file:///tmp/test",
)

initialize_model_parallel()

# Build the fused MoE layer with the same shapes as the HF model.
vllm_moe = MoE(
    config.num_local_experts,
    config.num_experts_per_tok,
    config.hidden_size,
    config.intermediate_size,
    params_dtype=torch.bfloat16,
)

# Load weights:

from vllm.model_executor.weight_utils import hf_model_weights_iterator

# Map the HF expert weights onto the fused parameters: w1 and w3 are packed
# into "ws", w2 goes into "w2s".
expert_params_mapping = [
    ("ws" if weight_name in ["w1", "w3"] else "w2s",
     f"experts.{expert_id}.{weight_name}.weight", expert_id)
    for expert_id in range(config.num_local_experts)
    for weight_name in ["w1", "w2", "w3"]
]

params_dict = dict(vllm_moe.named_parameters())
for name, loaded_weight in hf_model_weights_iterator("mistralai/Mixtral-8x7B-v0.1"):
    if name == "model.layers.0.block_sparse_moe.gate.weight":
        params_dict["gate.weight"][:, :] = loaded_weight
    if name.startswith("model.layers.0.block_sparse_moe.experts"):
        for param_name, weight_name, expert_id in expert_params_mapping:
            if weight_name in name:
                param = params_dict[param_name]
                weight_loader = param.weight_loader
                weight_loader(param, loaded_weight, weight_name, expert_id=expert_id)

vllm_moe.forward(hidden_states.bfloat16().to("cuda"))
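
(To then compare the two outputs side by side, something like the following can be used; the tuple handling assumes the transformers block returns (hidden_states, router_logits):)

# Compare the vLLM output against the HuggingFace reference computed above.
vllm_out = vllm_moe.forward(hidden_states.bfloat16().to("cuda"))
hf_out = output[0] if isinstance(output, tuple) else output
print("max abs diff:", (vllm_out.float().cpu() - hf_out.float()).abs().max().item())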

@casper-hansen (Contributor):

See the test I created below for reference. I'm not sure what causes the difference, but it seems like a large one.

https://github.com/casper-hansen/AutoAWQ/blob/mixtral_fused/tests/test_fused_moe.py

Review comment (Collaborator):

@pcmoritz Should we move the MoE class back to the Mixtral model file? It seems like this MoE layer is not shared between Mixtral and DeepSeek.

@pcmoritz (Collaborator, Author):

Sounds good to me! Feel free to make any edits to the PR you'd like, or let me know if I should make them :)

Review comment (Collaborator):

I'd appreciate it if you can do it!

@pcmoritz (Collaborator, Author):

@casper-hansen In case you didn't follow the discussion: we looked into the numerical differences (#2453 (comment)), and they are due to the TF32 (TensorFloat-32) tensor cores being used, so the difference is expected :)
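
(For anyone who wants to check this locally: PyTorch's TF32 behavior can be toggled globally, and if the gap really does come from TF32 it should shrink with it disabled; a minimal sketch:)

import torch

# Disable TF32 for cuBLAS matmuls and cuDNN so fp32 computations run at
# full fp32 precision before re-running the comparison.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False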

@WoosukKwon (Collaborator) left a review comment:

@pcmoritz LGTM! Thanks for the great work!

@WoosukKwon merged commit ab40644 into vllm-project:main on Jan 30, 2024 (15 of 17 checks passed).

@WoosukKwon (Collaborator):

BTW, I added @scv119 (who wrote #2293) as a co-author of the PR. Thanks @scv119 for the original PR!

@pcmoritz (Collaborator, Author):

Just to be sure, I re-ran MMLU on the latest version of this PR and the result looks good:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7052|±  |0.1374|
| - humanities     |N/A    |none  |     5|acc   |0.6491|±  |0.1522|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1099|
| - social_sciences|N/A    |none  |     5|acc   |0.8109|±  |0.0690|
| - stem           |N/A    |none  |     5|acc   |0.6178|±  |0.1384|

@WoosukKwon (Collaborator):

Oh BTW, this PR will break the quantization support for Mixtral. 🤦

@skt7 (Contributor) commented Feb 14, 2024

@pcmoritz are you using any specific implementation to run the MMLU benchmark (and others) on LLMs served through vLLM? It would be great if you could share the details.
