
Conversation

@uygnef (Contributor) commented Dec 15, 2025

Motivation

This PR introduces Sequence Parallelism (SP) support for SpecForge training. The primary goal is to enable the training of models with significantly longer sequence lengths, which was previously limited by memory constraints on single GPUs.

Modifications

  • Implemented USP Support: Added support for Unified Sequence Parallelism (USP), integrating both Ring Attention and Ulysses attention mechanisms to handle long contexts efficiently.

  • Draft Model SP Group: Established a dedicated sequence-parallel process group for the draft model to ensure correct communication and synchronization during distributed training (see the sketch after this list).
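
For readers new to the 2D layout, here is a minimal sketch of how Ulysses and Ring subgroups could be carved out of the world size with torch.distributed. The function and argument names are illustrative assumptions, not necessarily the code in this PR.

```python
import torch.distributed as dist

def init_sp_groups(sp_ulysses_size: int, sp_ring_size: int):
    """Return (ulysses_group, ring_group) for the calling rank (illustrative only)."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    sp_size = sp_ulysses_size * sp_ring_size
    assert world_size % sp_size == 0, "world size must be divisible by the SP degree"

    ulysses_group, ring_group = None, None
    for start in range(0, world_size, sp_size):
        sp_ranks = list(range(start, start + sp_size))
        # Ulysses groups: contiguous blocks that exchange attention heads via all-to-all.
        for i in range(0, sp_size, sp_ulysses_size):
            block = sp_ranks[i : i + sp_ulysses_size]
            group = dist.new_group(block)  # every rank must call new_group in the same order
            if rank in block:
                ulysses_group = group
        # Ring groups: strided ranks that pass key/value blocks around a ring.
        for j in range(sp_ulysses_size):
            ring = sp_ranks[j::sp_ulysses_size]
            group = dist.new_group(ring)
            if rank in ring:
                ring_group = group
    return ulysses_group, ring_group
```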

Related Issues

#299

Accuracy Test

python tests/test_layers/test_decoder.py

For fp32, the output diff is less than 1e-5; for bf16, it is less than 2e-2. The gradient diff is less than 1e-6.
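
A minimal sketch of the kind of tolerance check such a test performs, with thresholds mirroring the numbers above (tensor and function names are illustrative, not the actual test code):

```python
import torch

def assert_usp_matches_reference(usp_out, ref_out, usp_grad, ref_grad, dtype):
    # Output tolerance depends on dtype; the gradient tolerance is tighter.
    out_atol = 1e-5 if dtype == torch.float32 else 2e-2
    torch.testing.assert_close(usp_out, ref_out, atol=out_atol, rtol=0)
    torch.testing.assert_close(usp_grad, ref_grad, atol=1e-6, rtol=0)
```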

[figure: loss]

[figure: train acc]

Benchmark & Profiling

Limitations

  1. Suboptimal Loss Aggregation: Currently uses all_gather within the SP group. This will be optimized to a local calculation plus reduce_sum to save VRAM (see the sketch after this list).
  2. No Online Mode Support: Limited to offline mode. Online training requires resolving TP/SP size mismatches, managing large hidden states for data distribution, and modifying the gradient accumulation loop.
  3. Attention Implementation: Currently uses sdpa, which is memory-intensive. Future updates will replace it with flex_attention for better memory efficiency.
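
For limitation 1, a rough sketch of the two aggregation strategies (function names are illustrative; a real implementation would also need autograd-aware collectives for the backward pass):

```python
import torch
import torch.distributed as dist

def sp_loss_all_gather(local_loss: torch.Tensor, sp_group) -> torch.Tensor:
    # Current approach: gather every rank's per-token loss shard, then reduce.
    gathered = [torch.empty_like(local_loss) for _ in range(dist.get_world_size(sp_group))]
    dist.all_gather(gathered, local_loss, group=sp_group)
    return torch.cat(gathered).sum()

def sp_loss_local_reduce(local_loss: torch.Tensor, sp_group) -> torch.Tensor:
    # Planned approach: sum locally, then all-reduce a single scalar,
    # avoiding a full-sequence loss buffer on every rank.
    total = local_loss.sum()
    dist.all_reduce(total, op=dist.ReduceOp.SUM, group=sp_group)
    return total
```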

Checklist

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @uygnef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces Sequence Parallelism (SP) to SpecForge, a critical enhancement for training large language models with extended sequence lengths. By leveraging Unified Sequence Parallelism (USP), which combines Ring Attention and Ulysses attention, the system can now efficiently distribute and process longer sequences across multiple GPUs, effectively mitigating memory constraints. This foundational change paves the way for developing more powerful and context-aware models.

Highlights

  • Sequence Parallelism (SP) Support: Introduced comprehensive Sequence Parallelism (SP) support for SpecForge training, enabling the handling of significantly longer sequence lengths than previously possible.
  • Unified Sequence Parallelism (USP) Implementation: Implemented Unified Sequence Parallelism (USP), which integrates both Ring Attention and Ulysses attention mechanisms for efficient distributed processing of long contexts.
  • Dedicated SP Process Groups: Established dedicated sequence parallel process groups for the draft model to ensure correct communication and synchronization during distributed training.
  • New Configuration Arguments: Added new command-line arguments (--sp-ulysses-size, --sp-ring-size) to train_eagle3.py for configuring the dimensions of sequence parallelism.
  • Data Collator Adaptation: Modified the DataCollatorWithPadding to adjust padding dynamically, ensuring sequence lengths are multiples of the sequence-parallel degree for proper data distribution (see the sketch after this list).
  • LlamaUSPAttention Module: Introduced a new LlamaUSPAttention class, extending LlamaAttention, to encapsulate the specific logic for Ulysses and Ring Attention within the draft model.
  • Unit Testing for USP: Added a new unit test (tests/test_layers/test_decoder.py) to validate the correctness and numerical stability of the USP implementation against the standard attention mechanism.
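
A minimal sketch of the padding adjustment described in the Data Collator Adaptation highlight, assuming right-padding with a pad id and an ignore index; this is illustrative and not the PR's exact DataCollatorWithPadding code:

```python
import torch
import torch.nn.functional as F

def pad_to_sp_multiple(input_ids: torch.Tensor, labels: torch.Tensor,
                       sp_degree: int, pad_id: int = 0, ignore_index: int = -100):
    """Right-pad [batch, seq] tensors so the sequence length is divisible by sp_degree."""
    seq_len = input_ids.size(-1)
    extra = (-seq_len) % sp_degree  # tokens needed to reach the next multiple
    if extra:
        input_ids = F.pad(input_ids, (0, extra), value=pad_id)
        labels = F.pad(labels, (0, extra), value=ignore_index)
    return input_ids, labels
```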


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces significant new functionality by adding Sequence Parallelism (SP) support, specifically Unified Sequence Parallelism (USP) with Ring and Ulysses attention. This is a valuable addition for training models with long sequences. The implementation is comprehensive, touching distributed setup, data collation, and the core attention mechanism, and includes a new distributed test case.

My review focuses on a few key areas:

  • Security and Portability: The example script contains hardcoded, environment-specific, and insecure configurations that should be removed.
  • Resource Management: There's a potential resource leak in the distributed setup where a process group is not destroyed.
  • Code Clarity and Correctness: I've pointed out areas for simplification, documentation updates, and removal of debugging artifacts.

Overall, this is a strong contribution. Addressing these points will improve the robustness and usability of the new feature.

@uygnef force-pushed the pr/cp branch 2 times, most recently from ed6d3d2 to f676da4 on December 16, 2025 06:59
@uygnef changed the title from "[feature] add Sequence Parallelism support for training" to "[feature] add Sequence Parallelism support for offline training" on Dec 16, 2025
@xhdidi commented Dec 20, 2025

With 8×H800, it seems that TP=8 can support a longer maximum training length than any other USP configuration. Is there any plan to enable flex-attention in the SP?

@uygnef (Contributor, Author) commented Dec 22, 2025

> With 8×H800, it seems that TP=8 can support a longer maximum training length than any other USP configuration. Is there any plan to enable flex-attention in the SP?

Flex attention does not work well with B/A-series GPUs. I aim to support this PR based on Flash Attention (this PR already exists). If there are precision issues, we will then enable Flex Attention support, which is estimated to take about 2 weeks.

Have you tried sp_ring_size=8? TP=8 maxes out at 16k sequence length, and sp_ring_size=8 should support longer sequences.

@xhdidi commented Dec 22, 2025

> > With 8×H800, it seems that TP=8 can support a longer maximum training length than any other USP configuration. Is there any plan to enable flex-attention in the SP?
>
> Flex attention does not work well with B/A-series GPUs. I aim to support this PR based on Flash Attention (this PR already exists). If there are precision issues, we will then enable Flex Attention support, which is estimated to take about 2 weeks.
>
> Have you tried sp_ring_size=8? TP=8 maxes out at 16k sequence length, and sp_ring_size=8 should support longer sequences.

I tried to train a draft model for Qwen3-480B with 8×H800 (80G). With tp=8 and the flex-attention backend, the maximum training length could reach 24k, but with ring-size=8 and the USP backend, the maximum training length is only 12k.
With the same attention backend, ring-size=8 might support a larger context length than tp=8. However, the memory saved by flex-attention seems to play a greater role when training on long texts.

@FrankLeeeee FrankLeeeee merged commit ee22b87 into sgl-project:main Dec 25, 2025
2 checks passed
@ggg-s (Contributor) commented Jan 4, 2026

Hi @uygnef, I want to know why SP can't use Flex?

@uygnef (Contributor, Author) commented Jan 5, 2026

> Hi @uygnef, I want to know why SP can't use Flex?

It's easy to implement Ulysses with Flex Attention, but supporting Ring Attention involves complex mask handling. I'm adding USP support on top of Flash Attention; it's nearly done, and the PR is coming in a few days.
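
For context, a minimal sketch of the Ulysses-style all-to-all that trades sequence shards for head shards before attention (shapes and names are illustrative, not the code in this PR; the backward pass is omitted):

```python
import torch
import torch.distributed as dist

def ulysses_all_to_all(x: torch.Tensor, group) -> torch.Tensor:
    """[batch, seq/P, heads, dim] (sequence-sharded) -> [batch, seq, heads/P, dim] (head-sharded)."""
    p = dist.get_world_size(group)
    b, s_shard, h, d = x.shape
    assert h % p == 0, "head count must be divisible by the Ulysses degree"
    # Split heads into p chunks; chunk j is sent to rank j.
    x = x.reshape(b, s_shard, p, h // p, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)  # exchange chunks across ranks
    # Received chunk i holds sequence shard i for our local heads; stack along the sequence dim.
    return out.permute(1, 0, 2, 3, 4).reshape(b, p * s_shard, h // p, d)
```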

@ggg-s (Contributor) commented Jan 5, 2026

@uygnef fine!

@uygnef (Contributor, Author) commented Jan 7, 2026

> @uygnef fine!

Hi, here is the PR: #411

xiaomin-D pushed a commit to eigen-ai-labs/SpecForge_public that referenced this pull request Jan 10, 2026
…project#366)

* add ds v3

* init

* modify sglang fit deepseek

* fix deepseek rparser

* ulysses finish

* ring offline finish

* tmp

* test pass

* test fail

* test

* clean up

* remove deepseek

* clean up

* clean up

* -

* -

* format

* fix unit test

---------

Co-authored-by: Yu Feng <fengyufengyu@didiglobal.com>
Co-authored-by: daiyajun <daiyajun@didiglobal.com>
