
Conversation

@yubofredwang
Collaborator

Motivation

We currently capture the full hidden states with HF for the target model, which leads to more hidden states being kept in memory. This can cause OOM with larger batch sizes.

Modifications

This pull request introduces support for selectively outputting hidden states from specific layers in several custom backend model implementations. It does so by adding a new argument, layers_to_output_hidden_states, to the forward methods of multiple model classes. Additionally, it removes the _can_record_outputs attribute from these models, which previously specified which intermediate outputs could be recorded. These changes provide more granular control over which hidden states are returned during model execution and simplify the output recording logic.

Key changes by theme:

Selective Hidden State Output:

  • Added a layers_to_output_hidden_states argument to the forward methods of the following models: Llama, Llama4, Phi3, GptOss, Qwen2, Qwen3, and Qwen3_moe. This allows users to specify a list of layer indices for which hidden states should be returned, rather than returning all or none. The logic for collecting hidden states was updated accordingly in each model.

  • Updated the hidden state accumulation logic in each model's forward loop to check for the presence of layers_to_output_hidden_states and only collect hidden states for the specified layers.

  • Ensured that the output objects (BaseModelOutputWithPast, MoeModelOutputWithPast) include the filtered hidden_states tuple in their return values.
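The per-layer collection described above can be sketched as follows. This is a minimal illustration with hypothetical names (`run_layers`, plain callables standing in for decoder layers); the actual SpecForge forward methods differ in detail:

```python
from typing import List, Optional, Tuple

def run_layers(
    hidden,
    layers,
    layers_to_output_hidden_states: Optional[List[int]] = None,
) -> Tuple[object, tuple]:
    """Run a stack of layers, keeping hidden states only for requested indices.

    With no filter, every layer's hidden state is kept (the old behavior);
    with a filter, only the listed layer indices are accumulated, so memory
    no longer grows with the full depth of the model.
    """
    all_hidden_states = ()
    for idx, layer in enumerate(layers):
        if layers_to_output_hidden_states is None or idx in layers_to_output_hidden_states:
            all_hidden_states += (hidden,)
        hidden = layer(hidden)
    return hidden, all_hidden_states
```

With four layers and `layers_to_output_hidden_states=[1, 3]`, only two tensors are retained instead of four, which is the memory saving the PR targets.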

Simplification and Cleanup:

  • Removed the _can_record_outputs attribute from all affected model classes, which previously defined which intermediate outputs could be recorded. This streamlines the code and delegates output selection to the new argument.

Type and Import Adjustments:

  • Updated import statements to include List from the typing module in all affected files, supporting the new argument's type annotation.

These changes collectively provide more flexible and efficient access to intermediate model representations, which is useful for debugging, analysis, and advanced use cases.

Related Issues

Accuracy Test

Tested locally; online training runs fine for both Qwen and Llama.

Benchmark & Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @yubofredwang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances memory efficiency in custom backend model implementations by enabling selective capturing of hidden states. Previously, models would capture all hidden states, which could lead to Out-of-Memory errors, especially with larger batch sizes. The changes introduce a new mechanism to specify exactly which layers' hidden states are needed, thereby reducing memory footprint and improving overall performance for debugging, analysis, and advanced use cases.

Highlights

  • Granular Hidden State Control: Introduced a "layers_to_output_hidden_states" argument to the "forward" methods of various custom backend models (Llama, Llama4, Phi3, GptOss, Qwen2, Qwen3, Qwen3_moe). This allows users to specify a list of layer indices for which hidden states should be returned, rather than capturing all or none, addressing potential Out-of-Memory (OOM) issues.
  • Streamlined Output Recording: Removed the "_can_record_outputs" attribute from affected model classes, simplifying the internal logic for specifying recordable intermediate outputs.
  • Updated generate_eagle3_data Method: The generate_eagle3_data method in eagle3_target_model.py was refactored. It is now an abstract method in the base class, with the Hugging Face (HF) backend implementation moved to HFEagle3TargetModel. The SGLangEagle3TargetModel now leverages the new layers_to_output_hidden_states argument for efficient hidden state extraction.
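The `generate_eagle3_data` refactor described above can be sketched as an abstract-base-class split. The method bodies below are illustrative guesses at the shape of the change, not the actual SpecForge implementation:

```python
from abc import ABC, abstractmethod
from typing import List

class Eagle3TargetModel(ABC):
    """Base class: each backend supplies its own hidden-state extraction."""

    @abstractmethod
    def generate_eagle3_data(self, input_ids, layers_to_output_hidden_states: List[int]):
        ...

class HFEagle3TargetModel(Eagle3TargetModel):
    def generate_eagle3_data(self, input_ids, layers_to_output_hidden_states):
        # HF path: request all hidden states, then slice out the needed layers.
        outputs = self.model(input_ids, output_hidden_states=True)
        return [outputs.hidden_states[i] for i in layers_to_output_hidden_states]

class SGLangEagle3TargetModel(Eagle3TargetModel):
    def generate_eagle3_data(self, input_ids, layers_to_output_hidden_states):
        # Custom backend: the forward itself filters, so unneeded states are
        # never materialized in the first place.
        outputs = self.model(
            input_ids, layers_to_output_hidden_states=layers_to_output_hidden_states
        )
        return list(outputs.hidden_states)
```

The design point is that the HF path must still allocate every layer's hidden state before slicing, while the custom backend avoids that allocation entirely via the new argument.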

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature for selectively capturing hidden states to reduce memory usage. The overall approach is sound, but there are critical inconsistencies in the implementation across different models that need to be addressed. Specifically, the custom backend models collect layer inputs instead of outputs, which is inconsistent with the Hugging Face backend implementation and will lead to incorrect behavior. Additionally, there are some minor performance and style issues. My review provides detailed feedback on how to fix these issues to ensure correctness and consistency.

@yubofredwang yubofredwang marked this pull request as draft November 28, 2025 09:57
@yubofredwang yubofredwang marked this pull request as ready for review November 28, 2025 20:57
@gemini-code-assist
Contributor

It looks like you might have intended to request a code review. The correct command for that is /gemini review. Would you like me to perform a code review for this pull request?

@yubofredwang
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable optimization by allowing selective capturing of hidden states, which should help mitigate OOM errors with large batch sizes. The changes are implemented across several custom backend models by adding a layers_to_output_hidden_states parameter.

The overall approach is sound and the refactoring in eagle3_target_model.py to abstract backend-specific logic is a good improvement. However, I've found a critical bug in qwen3_moe.py that will cause a NameError, and a high-severity issue in qwen2.py where a performance optimization is not correctly applied. There are also several opportunities for code simplification and ensuring consistency across the different model implementations, which I've detailed in the comments.

yubofredwang and others added 4 commits November 28, 2025 14:52
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@FrankLeeeee FrankLeeeee merged commit e7b3716 into sgl-project:main Nov 30, 2025
2 checks passed
xiaomin-D pushed a commit to eigen-ai-labs/SpecForge_public that referenced this pull request Jan 10, 2026
* support checkpoint

* lint

* capture only required hidden states

* revert regen

* fix llama

* backward compatible

* Update specforge/modeling/target/custom_backend/qwen3_moe.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* gemini suggests

* fix

* fix phi

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
