
Conversation

@yubofredwang
Collaborator

Motivation

We currently capture the full hidden states with HF for the target model, which leads to more hidden states being kept in memory. This can cause OOM with larger batch sizes.

Modifications

This pull request introduces support for selectively outputting hidden states from specific layers in several custom backend model implementations. It does so by adding a new argument, layers_to_output_hidden_states, to the forward methods of multiple model classes. Additionally, it removes the _can_record_outputs attribute from these models, which previously specified which intermediate outputs could be recorded. These changes provide more granular control over which hidden states are returned during model execution and simplify the output recording logic.

Key changes by theme:

Selective Hidden State Output:

  • Added a layers_to_output_hidden_states argument to the forward methods of the following models: Llama, Llama4, Phi3, GptOss, Qwen2, Qwen3, and Qwen3_moe. This allows users to specify a list of layer indices for which hidden states should be returned, rather than returning all or none. The logic for collecting hidden states was updated accordingly in each model.

  • Updated the hidden state accumulation logic in each model's forward loop to check for the presence of layers_to_output_hidden_states and only collect hidden states for the specified layers.

  • Ensured that the output objects (BaseModelOutputWithPast, MoeModelOutputWithPast) include the filtered hidden_states tuple in their return values.
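The per-layer collection described above can be sketched as follows. This is a minimal illustration with hypothetical names (`run_layers`, plain callables standing in for decoder layers); the actual SpecForge forward methods differ in detail:

```python
from typing import List, Optional, Tuple

def run_layers(
    hidden,
    layers,
    layers_to_output_hidden_states: Optional[List[int]] = None,
) -> Tuple[object, tuple]:
    """Run a stack of layers, keeping hidden states only for requested indices.

    With no filter, every layer's hidden state is kept (the old behavior);
    with a filter, only the listed layer indices are accumulated, so memory
    no longer grows with the full depth of the model.
    """
    all_hidden_states = ()
    for idx, layer in enumerate(layers):
        if layers_to_output_hidden_states is None or idx in layers_to_output_hidden_states:
            all_hidden_states += (hidden,)
        hidden = layer(hidden)
    return hidden, all_hidden_states
```

With four layers and `layers_to_output_hidden_states=[1, 3]`, only two tensors are retained instead of four, which is the memory saving the PR targets.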

Simplification and Cleanup:

  • Removed the _can_record_outputs attribute from all affected model classes, which previously defined which intermediate outputs could be recorded. This streamlines the code and delegates output selection to the new argument.

Type and Import Adjustments:

  • Updated import statements to include List from the typing module in all affected files, supporting the new argument's type annotation.

These changes collectively provide more flexible and efficient access to intermediate model representations, which is useful for debugging, analysis, and advanced use cases.

Related Issues

Accuracy Test

Tested locally; online training runs fine for both Qwen and Llama.

Benchmark & Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @yubofredwang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances memory efficiency in custom backend model implementations by enabling selective capturing of hidden states. Previously, models would capture all hidden states, which could lead to Out-of-Memory errors, especially with larger batch sizes. The changes introduce a new mechanism to specify exactly which layers' hidden states are needed, thereby reducing memory footprint and improving overall performance for debugging, analysis, and advanced use cases.

Highlights

  • Granular Hidden State Control: Introduced a "layers_to_output_hidden_states" argument to the "forward" methods of various custom backend models (Llama, Llama4, Phi3, GptOss, Qwen2, Qwen3, Qwen3_moe). This allows users to specify a list of layer indices for which hidden states should be returned, rather than capturing all or none, addressing potential Out-of-Memory (OOM) issues.
  • Streamlined Output Recording: Removed the "_can_record_outputs" attribute from affected model classes, simplifying the internal logic for specifying recordable intermediate outputs.
  • Updated generate_eagle3_data Method: The generate_eagle3_data method in eagle3_target_model.py was refactored. It is now an abstract method in the base class, with the Hugging Face (HF) backend implementation moved to HFEagle3TargetModel. The SGLangEagle3TargetModel now leverages the new layers_to_output_hidden_states argument for efficient hidden state extraction.
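The `generate_eagle3_data` refactor described above can be sketched as an abstract-base-class split. The method bodies below are illustrative guesses at the shape of the change, not the actual SpecForge implementation:

```python
from abc import ABC, abstractmethod
from typing import List

class Eagle3TargetModel(ABC):
    """Base class: each backend supplies its own hidden-state extraction."""

    @abstractmethod
    def generate_eagle3_data(self, input_ids, layers_to_output_hidden_states: List[int]):
        ...

class HFEagle3TargetModel(Eagle3TargetModel):
    def generate_eagle3_data(self, input_ids, layers_to_output_hidden_states):
        # HF path: request all hidden states, then slice out the needed layers.
        outputs = self.model(input_ids, output_hidden_states=True)
        return [outputs.hidden_states[i] for i in layers_to_output_hidden_states]

class SGLangEagle3TargetModel(Eagle3TargetModel):
    def generate_eagle3_data(self, input_ids, layers_to_output_hidden_states):
        # Custom backend: the forward itself filters, so unneeded states are
        # never materialized in the first place.
        outputs = self.model(
            input_ids, layers_to_output_hidden_states=layers_to_output_hidden_states
        )
        return list(outputs.hidden_states)
```

The design point is that the HF path must still allocate every layer's hidden state before slicing, while the custom backend avoids that allocation entirely via the new argument.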

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature for selectively capturing hidden states to reduce memory usage. The overall approach is sound, but there are critical inconsistencies in the implementation across different models that need to be addressed. Specifically, the custom backend models collect layer inputs instead of outputs, which is inconsistent with the Hugging Face backend implementation and will lead to incorrect behavior. Additionally, there are some minor performance and style issues. My review provides detailed feedback on how to fix these issues to ensure correctness and consistency.

@yubofredwang yubofredwang marked this pull request as draft November 28, 2025 09:57
@yubofredwang yubofredwang marked this pull request as ready for review November 28, 2025 20:57
@gemini-code-assist
Contributor

It looks like you might have intended to request a code review. The correct command for that is /gemini review. Would you like me to perform a code review for this pull request?

@yubofredwang
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable optimization by allowing selective capturing of hidden states, which should help mitigate OOM errors with large batch sizes. The changes are implemented across several custom backend models by adding a layers_to_output_hidden_states parameter.

The overall approach is sound and the refactoring in eagle3_target_model.py to abstract backend-specific logic is a good improvement. However, I've found a critical bug in qwen3_moe.py that will cause a NameError, and a high-severity issue in qwen2.py where a performance optimization is not correctly applied. There are also several opportunities for code simplification and ensuring consistency across the different model implementations, which I've detailed in the comments.

yubofredwang and others added 4 commits November 28, 2025 14:52
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@FrankLeeeee FrankLeeeee merged commit e7b3716 into sgl-project:main Nov 30, 2025
2 checks passed
xiaomin-D pushed a commit to eigen-ai-labs/SpecForge_public that referenced this pull request Jan 10, 2026
* support checkpoint

* lint

* capture only required hidden states

* revert regen

* fix llama

* backward compatible

* Update specforge/modeling/target/custom_backend/qwen3_moe.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* gemini suggests

* fix

* fix phi

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
