
Conversation

Contributor

@rouchenzi rouchenzi commented Aug 13, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Resolves #20468: Support EPLB for the Mixtral model.

Test Plan

Run the following test script in each mode:
python test_eplb.py --mode eplb
python test_eplb.py --mode normal

import json
import os
import argparse
from vllm import LLM, SamplingParams

prompt = "Explain the theory of relativity in simple terms."

RESULT_FILE = "eplb_test_output.json"

sampling_params = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    top_k=1,
    max_tokens=100
)

def run_inference(model_path: str, enable_eplb: bool, num_redundant_experts: int = 0):
    print(f"Running inference with EPLB={enable_eplb}, redundant experts={num_redundant_experts}")

    llm = LLM(
        model=model_path,
        tensor_parallel_size=8,
        enable_expert_parallel=True,
        enable_eplb=enable_eplb,
        num_redundant_experts=num_redundant_experts if enable_eplb else 0,
        eplb_window_size=1000,
        eplb_step_interval=100,
        eplb_log_balancedness=True,
        enforce_eager=True,
        trust_remote_code=True
    )

    result = llm.generate([prompt], sampling_params)
    output_text = result[0].outputs[0].text.strip()

    print("Output:")
    print(output_text)
    print("-" * 50)

    return output_text

def save_result(key: str, value: str):
    if os.path.exists(RESULT_FILE):
        with open(RESULT_FILE, "r") as f:
            results = json.load(f)
    else:
        results = {}

    results[key] = value

    with open(RESULT_FILE, "w") as f:
        json.dump(results, f, indent=2)

    print(f"Output saved to {RESULT_FILE}")

def load_results():
    if os.path.exists(RESULT_FILE):
        with open(RESULT_FILE, "r") as f:
            return json.load(f)
    return {}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", type=str, choices=["eplb", "normal", "compare"], required=True)
    args = parser.parse_args()

    MODEL_PATH = "/root/vllm/models/Mixtral-8x7B-v0.1"

    if args.mode == "eplb":
        outputs = run_inference(MODEL_PATH, enable_eplb=True, num_redundant_experts=32)
        save_result("eplb", outputs)
    elif args.mode == "normal":
        outputs = run_inference(MODEL_PATH, enable_eplb=False)
        save_result("normal", outputs)
    elif args.mode == "compare":
        # Compare previously saved outputs from both modes.
        results = load_results()
        if "eplb" in results and "normal" in results:
            print("Outputs match:", results["eplb"] == results["normal"])
        else:
            print("Run both 'eplb' and 'normal' modes first.")

Test Result

Output w/ versus w/o EPLB is consistent for the example prompt tested:

{
  "normal": "The theory of relativity is a theory of physics that explains the relationship between space and time. It is based on the idea that the laws of physics are the same for all observers, regardless of their motion relative to each other.\n\nThe theory of relativity has two main parts: the special theory of relativity and the general theory of relativity. The special theory of relativity deals with the relationship between space and time for observers who are moving at a constant speed relative",
  "eplb": "The theory of relativity is a theory of physics that explains the relationship between space and time. It is based on the idea that the laws of physics are the same for all observers, regardless of their motion relative to each other.\n\nThe theory of relativity has two main parts: the special theory of relativity and the general theory of relativity. The special theory of relativity deals with the relationship between space and time for observers who are moving at a constant speed relative"
}

Example eplb_log_balancedness result

INFO 08-13 17:13:21 [eplb_state.py:389] EPLB step: avg_tokens=278720.00, max_tokens=285088, balancedness=0.9777
INFO 08-13 17:13:21 [eplb_state.py:389] EPLB step: avg_tokens=16.00, max_tokens=88, balancedness=0.1818
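The balancedness metric logged above is consistent with the ratio of average to maximum per-expert token counts. A minimal sketch reproducing the logged numbers (assuming balancedness = avg_tokens / max_tokens, which is inferred from the log fields, not taken from the diff):

```python
def balancedness(tokens_per_expert: list[int]) -> float:
    """Average-to-max load ratio: 1.0 is perfectly balanced, near 0 is skewed."""
    avg = sum(tokens_per_expert) / len(tokens_per_expert)
    return avg / max(tokens_per_expert)

# Reproducing the two logged steps from their aggregate stats:
step1 = round(278720.00 / 285088, 4)  # step with many tokens, well balanced
step2 = round(16.00 / 88, 4)          # step with few tokens, heavily skewed
print(step1)  # 0.9777
print(step2)  # 0.1818
```

The low second value illustrates why balancedness over a window of very few tokens is noisy.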

(Optional) Documentation Update

Signed-off-by: rouchenzi <ruochenwen@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@rouchenzi rouchenzi marked this pull request as draft August 13, 2025 17:17
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Expert Parallelism Load Balancing (EPLB) for the Mixtral model. The changes involve updating the Mixtral model implementation to conform to the MixtureOfExperts interface, which is necessary for EPLB integration. This includes modifications to weight loading logic to handle redundant experts and methods to manage EPLB state. The implementation is largely correct, but I've identified a couple of areas for improvement to enhance robustness and future extensibility.

Comment on lines +541 to +555
    def set_eplb_state(
        self,
        expert_load_view: torch.Tensor,
        logical_to_physical_map: torch.Tensor,
        logical_replica_count: torch.Tensor,
    ) -> None:
        for layer_idx, layer in enumerate(self.moe_layers):
            # Register the expert weights.
            self.expert_weights.append(layer.get_expert_weights())
            layer.set_eplb_state(
                moe_layer_idx=layer_idx,
                expert_load_view=expert_load_view,
                logical_to_physical_map=logical_to_physical_map,
                logical_replica_count=logical_replica_count,
            )
Contributor


Severity: high

The self.expert_weights list is appended to in a loop. Since it's only initialized once in __init__, multiple calls to set_eplb_state would cause the list to grow indefinitely, leading to incorrect behavior and a memory leak. While the current implementation of the EPLB scheduler calls this method only once, it's safer to make this method idempotent by clearing the list before populating it.

Suggested change (clear the list before repopulating so repeated calls stay idempotent):

    def set_eplb_state(
        self,
        expert_load_view: torch.Tensor,
        logical_to_physical_map: torch.Tensor,
        logical_replica_count: torch.Tensor,
    ) -> None:
        self.expert_weights.clear()
        for layer_idx, layer in enumerate(self.moe_layers):
            # Register the expert weights.
            self.expert_weights.append(layer.get_expert_weights())
            layer.set_eplb_state(
                moe_layer_idx=layer_idx,
                expert_load_view=expert_load_view,
                logical_to_physical_map=logical_to_physical_map,
                logical_replica_count=logical_replica_count,
            )
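The failure mode described above is easy to demonstrate in isolation. A standalone sketch (toy classes, not vLLM code) contrasting append-only registration with the idempotent version:

```python
class WeightRegistry:
    def __init__(self):
        self.expert_weights = []

    def register_buggy(self, layers):
        # Appends without clearing: each repeated call grows the list.
        for layer in layers:
            self.expert_weights.append(layer)

    def register_idempotent(self, layers):
        # Clearing first makes repeated calls safe.
        self.expert_weights.clear()
        for layer in layers:
            self.expert_weights.append(layer)

layers = ["layer0", "layer1", "layer2"]

buggy = WeightRegistry()
buggy.register_buggy(layers)
buggy.register_buggy(layers)       # second call duplicates entries
print(len(buggy.expert_weights))   # 6

fixed = WeightRegistry()
fixed.register_idempotent(layers)
fixed.register_idempotent(layers)  # second call has no net effect
print(len(fixed.expert_weights))   # 3
```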

        num_physical_experts: int,
        num_local_physical_experts: int,
    ) -> None:
        assert self.num_local_physical_experts == num_local_physical_experts
Contributor


Severity: high

This assertion enforces that num_local_physical_experts never changes. However, the method signature and the broader EPLB context (which can involve dynamic scaling) suggest this value might need to be updated: if the number of GPUs in the expert-parallel group changes, the assertion will fail and block dynamic scaling. Conversely, if the value is never expected to change, the parameter is redundant. Either way, the assertion seems overly restrictive.
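For context on how these counts relate, a back-of-the-envelope sketch of EPLB's expert bookkeeping. The per-layer expert count of 8 is an assumption about Mixtral-8x7B (not taken from the diff); num_redundant_experts=32 and the EP group size of 8 come from the test script above:

```python
num_logical_experts = 8      # assumed routed experts per Mixtral-8x7B MoE layer
num_redundant_experts = 32   # extra replicas added by EPLB (from the test script)
ep_size = 8                  # expert-parallel group size (tensor_parallel_size=8)

# Physical experts = logical experts plus EPLB's redundant replicas.
num_physical_experts = num_logical_experts + num_redundant_experts
# Each EP rank holds an equal slice of the physical experts.
num_local_physical_experts = num_physical_experts // ep_size

print(num_physical_experts)        # 40
print(num_local_physical_experts)  # 5
```

If ep_size changed at runtime (dynamic scaling), num_local_physical_experts would change with it, which is why the assertion is flagged as restrictive.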

@rouchenzi rouchenzi marked this pull request as ready for review August 13, 2025 17:27
@rouchenzi
Contributor Author

Hey @abmfy, this PR supports EPLB for Mixtral. I tested the E2E workflow with Mixtral-8x7B; please let me know if more tests are needed, thanks!

Member

@abmfy abmfy left a comment


LGTM, just very tiny changes to reflect the new config interface. I'll do an accuracy test and if that passes we're good to go.

Thanks for the contribution!

@abmfy
Member

abmfy commented Aug 22, 2025

Accuracy tests passed:

w/o EPLB:

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.5792 | ± | 0.0136 |
|       |         | strict-match     |      5 | exact_match | 0.5777 | ± | 0.0136 |

w/ EPLB:

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.5921 | ± | 0.0135 |
|       |         | strict-match     |      5 | exact_match | 0.5914 | ± | 0.0135 |
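As a sanity check on the two runs, the score difference is well within the reported uncertainty. A quick sketch using the flexible-extract numbers above (treating the two runs as independent is an approximation):

```python
import math

# flexible-extract exact_match from the accuracy tables above
acc_no_eplb, se_no_eplb = 0.5792, 0.0136
acc_eplb, se_eplb = 0.5921, 0.0135

diff = abs(acc_eplb - acc_no_eplb)
combined_se = math.sqrt(se_no_eplb**2 + se_eplb**2)

print(round(diff, 4))          # 0.0129
print(round(combined_se, 4))   # 0.0192
print(diff < 2 * combined_se)  # True: within ~2 sigma, so no detectable regression
```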

@robertgshaw2-redhat robertgshaw2-redhat changed the title Support EPLB for Mixtral Model [EPLB] Support EPLB for Mixtral Model Sep 16, 2025

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rouchenzi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 16, 2025
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Sep 17, 2025
@rouchenzi rouchenzi requested a review from abmfy September 17, 2025 00:49
rouchenzi and others added 2 commits September 16, 2025 17:52
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
@rouchenzi
Contributor Author

Thanks @abmfy for reviewing and helping with accuracy test!

Resolved the conflicts and updated the change per your suggestion; please take a look when you have time.

rouchenzi and others added 2 commits September 16, 2025 19:19
Signed-off-by: rouchenzi <ruochenwen@gmail.com>
@abmfy
Member

abmfy commented Sep 17, 2025

LGTM!
cc @simon-mo could you please take a look and approve if everything looks good when you have time? Thank you!

@simon-mo simon-mo enabled auto-merge (squash) September 17, 2025 05:35
@github-actions github-actions bot added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) Sep 17, 2025
@simon-mo simon-mo merged commit b77bf34 into vllm-project:main Sep 17, 2025
59 checks passed
@rouchenzi rouchenzi deleted the mixtral-eplb branch September 19, 2025 23:32
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: rouchenzi <ruochenwen@gmail.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: rouchenzi <ruochenwen@gmail.com>
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: charlifu <charlifu@amd.com>
Labels: eplb, ready
4 participants