
Conversation

Contributor

@rouchenzi rouchenzi commented Aug 13, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Resolves #20468: Support EPLB for the Mixtral model.

Test Plan

Run the following test script in each mode:
python test_eplb.py --mode eplb
python test_eplb.py --mode normal

import json
import os
import argparse
from vllm import LLM, SamplingParams

prompt = "Explain the theory of relativity in simple terms."

RESULT_FILE = "eplb_test_output.json"

sampling_params = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    top_k=1,
    max_tokens=100
)

def run_inference(model_path: str, enable_eplb: bool, num_redundant_experts: int = 0):
    print(f"Running inference with EPLB={enable_eplb}, redundant experts={num_redundant_experts}")

    llm = LLM(
        model=model_path,
        tensor_parallel_size=8,
        enable_expert_parallel=True,
        enable_eplb=enable_eplb,
        num_redundant_experts=num_redundant_experts if enable_eplb else 0,
        eplb_window_size=1000,
        eplb_step_interval=100,
        eplb_log_balancedness=True,
        enforce_eager=True,
        trust_remote_code=True
    )

    result = llm.generate([prompt], sampling_params)
    output_text = result[0].outputs[0].text.strip()

    print("Output:")
    print(output_text)
    print("-" * 50)

    return output_text

def save_result(key: str, value: str):
    if os.path.exists(RESULT_FILE):
        with open(RESULT_FILE, "r") as f:
            results = json.load(f)
    else:
        results = {}

    results[key] = value

    with open(RESULT_FILE, "w") as f:
        json.dump(results, f, indent=2)

    print(f"Output saved to {RESULT_FILE}")

def load_results():
    if os.path.exists(RESULT_FILE):
        with open(RESULT_FILE, "r") as f:
            return json.load(f)
    return {}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", type=str, choices=["eplb", "normal", "compare"], required=True)
    args = parser.parse_args()

    MODEL_PATH = "/root/vllm/models/Mixtral-8x7B-v0.1"

    if args.mode == "eplb":
        outputs = run_inference(MODEL_PATH, enable_eplb=True, num_redundant_experts=32)
        save_result("eplb", outputs)
    elif args.mode == "normal":
        outputs = run_inference(MODEL_PATH, enable_eplb=False)
        save_result("normal", outputs)
    elif args.mode == "compare":
        # Compare previously saved outputs from both modes.
        results = load_results()
        if "eplb" in results and "normal" in results:
            print("Outputs match:", results["eplb"] == results["normal"])
        else:
            print("Run both 'eplb' and 'normal' modes first.")

Test Result

Output w/ versus w/o EPLB is consistent for the example prompt tested:

{
  "normal": "The theory of relativity is a theory of physics that explains the relationship between space and time. It is based on the idea that the laws of physics are the same for all observers, regardless of their motion relative to each other.\n\nThe theory of relativity has two main parts: the special theory of relativity and the general theory of relativity. The special theory of relativity deals with the relationship between space and time for observers who are moving at a constant speed relative",
  "eplb": "The theory of relativity is a theory of physics that explains the relationship between space and time. It is based on the idea that the laws of physics are the same for all observers, regardless of their motion relative to each other.\n\nThe theory of relativity has two main parts: the special theory of relativity and the general theory of relativity. The special theory of relativity deals with the relationship between space and time for observers who are moving at a constant speed relative"
}

Example eplb_log_balancedness result

INFO 08-13 17:13:21 [eplb_state.py:389] EPLB step: avg_tokens=278720.00, max_tokens=285088, balancedness=0.9777
INFO 08-13 17:13:21 [eplb_state.py:389] EPLB step: avg_tokens=16.00, max_tokens=88, balancedness=0.1818
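The balancedness metric logged above is consistent with the ratio of average to maximum per-expert token counts. A minimal sketch reproducing the logged numbers (assuming balancedness = avg_tokens / max_tokens, which is inferred from the log fields, not taken from the diff):

```python
def balancedness(tokens_per_expert: list[int]) -> float:
    """Average-to-max load ratio: 1.0 is perfectly balanced, near 0 is skewed."""
    avg = sum(tokens_per_expert) / len(tokens_per_expert)
    return avg / max(tokens_per_expert)

# Reproducing the two logged steps from their aggregate stats:
step1 = round(278720.00 / 285088, 4)  # step with many tokens, well balanced
step2 = round(16.00 / 88, 4)          # step with few tokens, heavily skewed
print(step1)  # 0.9777
print(step2)  # 0.1818
```

The low second value illustrates why balancedness over a window of very few tokens is noisy.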

(Optional) Documentation Update

Signed-off-by: rouchenzi <ruochenwen@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@rouchenzi rouchenzi marked this pull request as draft August 13, 2025 17:17
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Expert Parallelism Load Balancing (EPLB) for the Mixtral model. The changes involve updating the Mixtral model implementation to conform to the MixtureOfExperts interface, which is necessary for EPLB integration. This includes modifications to weight loading logic to handle redundant experts and methods to manage EPLB state. The implementation is largely correct, but I've identified a couple of areas for improvement to enhance robustness and future extensibility.

Comment on lines +541 to +555
    def set_eplb_state(
        self,
        expert_load_view: torch.Tensor,
        logical_to_physical_map: torch.Tensor,
        logical_replica_count: torch.Tensor,
    ) -> None:
        for layer_idx, layer in enumerate(self.moe_layers):
            # Register the expert weights.
            self.expert_weights.append(layer.get_expert_weights())
            layer.set_eplb_state(
                moe_layer_idx=layer_idx,
                expert_load_view=expert_load_view,
                logical_to_physical_map=logical_to_physical_map,
                logical_replica_count=logical_replica_count,
            )
Contributor


Severity: high

The self.expert_weights list is appended to in a loop. Since it's only initialized once in __init__, multiple calls to set_eplb_state would cause the list to grow indefinitely, leading to incorrect behavior and a memory leak. While the current implementation of the EPLB scheduler calls this method only once, it's safer to make this method idempotent by clearing the list before populating it.

Suggested change (clear the list before repopulating so repeated calls stay idempotent):

    def set_eplb_state(
        self,
        expert_load_view: torch.Tensor,
        logical_to_physical_map: torch.Tensor,
        logical_replica_count: torch.Tensor,
    ) -> None:
        self.expert_weights.clear()
        for layer_idx, layer in enumerate(self.moe_layers):
            # Register the expert weights.
            self.expert_weights.append(layer.get_expert_weights())
            layer.set_eplb_state(
                moe_layer_idx=layer_idx,
                expert_load_view=expert_load_view,
                logical_to_physical_map=logical_to_physical_map,
                logical_replica_count=logical_replica_count,
            )
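The failure mode described above is easy to demonstrate in isolation. A standalone sketch (toy classes, not vLLM code) contrasting append-only registration with the idempotent version:

```python
class WeightRegistry:
    def __init__(self):
        self.expert_weights = []

    def register_buggy(self, layers):
        # Appends without clearing: each repeated call grows the list.
        for layer in layers:
            self.expert_weights.append(layer)

    def register_idempotent(self, layers):
        # Clearing first makes repeated calls safe.
        self.expert_weights.clear()
        for layer in layers:
            self.expert_weights.append(layer)

layers = ["layer0", "layer1", "layer2"]

buggy = WeightRegistry()
buggy.register_buggy(layers)
buggy.register_buggy(layers)       # second call duplicates entries
print(len(buggy.expert_weights))   # 6

fixed = WeightRegistry()
fixed.register_idempotent(layers)
fixed.register_idempotent(layers)  # second call has no net effect
print(len(fixed.expert_weights))   # 3
```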

        num_physical_experts: int,
        num_local_physical_experts: int,
    ) -> None:
        assert self.num_local_physical_experts == num_local_physical_experts
Contributor


Severity: high

This assertion enforces that num_local_physical_experts never changes. However, the method signature and the broader EPLB context (which can involve dynamic scaling) suggest this value might need to be updated: if the number of GPUs in the expert-parallel group changes, the assertion will fail and block dynamic scaling. Conversely, if the value is never expected to change, the parameter is redundant. Either way, the assertion seems overly restrictive.
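For context on how these counts relate, a back-of-the-envelope sketch of EPLB's expert bookkeeping. The per-layer expert count of 8 is an assumption about Mixtral-8x7B (not taken from the diff); num_redundant_experts=32 and the EP group size of 8 come from the test script above:

```python
num_logical_experts = 8      # assumed routed experts per Mixtral-8x7B MoE layer
num_redundant_experts = 32   # extra replicas added by EPLB (from the test script)
ep_size = 8                  # expert-parallel group size (tensor_parallel_size=8)

# Physical experts = logical experts plus EPLB's redundant replicas.
num_physical_experts = num_logical_experts + num_redundant_experts
# Each EP rank holds an equal slice of the physical experts.
num_local_physical_experts = num_physical_experts // ep_size

print(num_physical_experts)        # 40
print(num_local_physical_experts)  # 5
```

If ep_size changed at runtime (dynamic scaling), num_local_physical_experts would change with it, which is why the assertion is flagged as restrictive.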

@rouchenzi rouchenzi marked this pull request as ready for review August 13, 2025 17:27
@rouchenzi
Contributor Author

Hey @abmfy, this PR supports EPLB for Mixtral. I tested the E2E workflow with Mixtral-8x7B; please let me know if more tests are needed, thanks!

Member

@abmfy abmfy left a comment


LGTM, just very tiny changes to reflect the new config interface. I'll do an accuracy test and if that passes we're good to go.

Thanks for the contribution!

@abmfy
Member

abmfy commented Aug 22, 2025

Accuracy tests passed:

w/o EPLB:

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.5792 | ± | 0.0136 |
|       |         | strict-match     |      5 | exact_match | 0.5777 | ± | 0.0136 |

w/ EPLB:

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.5921 | ± | 0.0135 |
|       |         | strict-match     |      5 | exact_match | 0.5914 | ± | 0.0135 |
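As a sanity check on the two runs, the score difference is well within the reported uncertainty. A quick sketch using the flexible-extract numbers above (treating the two runs as independent is an approximation):

```python
import math

# flexible-extract exact_match from the accuracy tables above
acc_no_eplb, se_no_eplb = 0.5792, 0.0136
acc_eplb, se_eplb = 0.5921, 0.0135

diff = abs(acc_eplb - acc_no_eplb)
combined_se = math.sqrt(se_no_eplb**2 + se_eplb**2)

print(round(diff, 4))          # 0.0129
print(round(combined_se, 4))   # 0.0192
print(diff < 2 * combined_se)  # True: within ~2 sigma, so no detectable regression
```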

@robertgshaw2-redhat robertgshaw2-redhat changed the title Support EPLB for Mixtral Model [EPLB] Support EPLB for Mixtral Model Sep 16, 2025

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rouchenzi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 16, 2025
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Sep 17, 2025
@rouchenzi rouchenzi requested a review from abmfy September 17, 2025 00:49
rouchenzi and others added 2 commits September 16, 2025 17:52
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
@rouchenzi
Contributor Author

Thanks @abmfy for reviewing and helping with accuracy test!

Resolved the conflicts and updated the change per your suggestion; please take a look when you have time.

rouchenzi and others added 2 commits September 16, 2025 19:19
Signed-off-by: rouchenzi <ruochenwen@gmail.com>
@abmfy
Member

abmfy commented Sep 17, 2025

LGTM!
cc @simon-mo could you please take a look and approve if everything looks good when you have time? Thank you!

@simon-mo simon-mo enabled auto-merge (squash) September 17, 2025 05:35
@github-actions github-actions bot added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) Sep 17, 2025
@simon-mo simon-mo merged commit b77bf34 into vllm-project:main Sep 17, 2025
59 checks passed
@rouchenzi rouchenzi deleted the mixtral-eplb branch September 19, 2025 23:32
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: rouchenzi <ruochenwen@gmail.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: rouchenzi <ruochenwen@gmail.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: charlifu <charlifu@amd.com>
Labels: eplb, ready
4 participants