
Conversation

@uygnef (Contributor) commented Dec 15, 2025

Motivation

This PR introduces Sequence Parallelism (SP) support for SpecForge training. The primary goal is to enable the training of models with significantly longer sequence lengths, which was previously limited by memory constraints on single GPUs.

Modifications

  • Implemented USP Support: Added support for Unified Sequence Parallelism (USP), integrating both Ring Attention and Ulysses attention mechanisms to handle long contexts efficiently.

  • Draft Model SP Group: Established a dedicated sequence-parallel process group for the draft model to ensure correct communication and synchronization during distributed training (see the sketch after this list).
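
For readers new to the 2D layout, here is a minimal sketch of how Ulysses and Ring subgroups could be carved out of the world size with torch.distributed. The function and argument names are illustrative assumptions, not necessarily the code in this PR.

```python
import torch.distributed as dist

def init_sp_groups(sp_ulysses_size: int, sp_ring_size: int):
    """Return (ulysses_group, ring_group) for the calling rank (illustrative only)."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    sp_size = sp_ulysses_size * sp_ring_size
    assert world_size % sp_size == 0, "world size must be divisible by the SP degree"

    ulysses_group, ring_group = None, None
    for start in range(0, world_size, sp_size):
        sp_ranks = list(range(start, start + sp_size))
        # Ulysses groups: contiguous blocks that exchange attention heads via all-to-all.
        for i in range(0, sp_size, sp_ulysses_size):
            block = sp_ranks[i : i + sp_ulysses_size]
            group = dist.new_group(block)  # every rank must call new_group in the same order
            if rank in block:
                ulysses_group = group
        # Ring groups: strided ranks that pass key/value blocks around a ring.
        for j in range(sp_ulysses_size):
            ring = sp_ranks[j::sp_ulysses_size]
            group = dist.new_group(ring)
            if rank in ring:
                ring_group = group
    return ulysses_group, ring_group
```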

Related Issues

#299

Accuracy Test

python tests/test_layers/test_decoder.py

For fp32, the output diff is less than 1e-5; for bf16, it is less than 2e-2. The gradient diff is less than 1e-6.
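
A minimal sketch of the kind of tolerance check such a test performs, with thresholds mirroring the numbers above (tensor and function names are illustrative, not the actual test code):

```python
import torch

def assert_usp_matches_reference(usp_out, ref_out, usp_grad, ref_grad, dtype):
    # Output tolerance depends on dtype; the gradient tolerance is tighter.
    out_atol = 1e-5 if dtype == torch.float32 else 2e-2
    torch.testing.assert_close(usp_out, ref_out, atol=out_atol, rtol=0)
    torch.testing.assert_close(usp_grad, ref_grad, atol=1e-6, rtol=0)
```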

[figure: loss]

[figure: train acc]

Benchmark & Profiling

Limitations

  1. Suboptimal Loss Aggregation: Currently uses all_gather within the SP group. This will be optimized to a local calculation plus reduce_sum to save VRAM (see the sketch after this list).
  2. No Online Mode Support: Limited to offline mode. Online training requires resolving TP/SP size mismatches, managing large hidden states for data distribution, and modifying the gradient accumulation loop.
  3. Attention Implementation: Currently uses sdpa, which is memory-intensive. Future updates will replace it with flex_attention for better memory efficiency.
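
For limitation 1, a rough sketch of the two aggregation strategies (function names are illustrative; a real implementation would also need autograd-aware collectives for the backward pass):

```python
import torch
import torch.distributed as dist

def sp_loss_all_gather(local_loss: torch.Tensor, sp_group) -> torch.Tensor:
    # Current approach: gather every rank's per-token loss shard, then reduce.
    gathered = [torch.empty_like(local_loss) for _ in range(dist.get_world_size(sp_group))]
    dist.all_gather(gathered, local_loss, group=sp_group)
    return torch.cat(gathered).sum()

def sp_loss_local_reduce(local_loss: torch.Tensor, sp_group) -> torch.Tensor:
    # Planned approach: sum locally, then all-reduce a single scalar,
    # avoiding a full-sequence loss buffer on every rank.
    total = local_loss.sum()
    dist.all_reduce(total, op=dist.ReduceOp.SUM, group=sp_group)
    return total
```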

Checklist

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @uygnef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces Sequence Parallelism (SP) to SpecForge, a critical enhancement for training large language models with extended sequence lengths. By leveraging Unified Sequence Parallelism (USP), which combines Ring Attention and Ulysses attention, the system can now efficiently distribute and process longer sequences across multiple GPUs, effectively mitigating memory constraints. This foundational change paves the way for developing more powerful and context-aware models.

Highlights

  • Sequence Parallelism (SP) Support: Introduced comprehensive Sequence Parallelism (SP) support for SpecForge training, enabling the handling of significantly longer sequence lengths than previously possible.
  • Unified Sequence Parallelism (USP) Implementation: Implemented Unified Sequence Parallelism (USP), which integrates both Ring Attention and Ulysses attention mechanisms for efficient distributed processing of long contexts.
  • Dedicated SP Process Groups: Established dedicated sequence parallel process groups for the draft model to ensure correct communication and synchronization during distributed training.
  • New Configuration Arguments: Added new command-line arguments (--sp-ulysses-size, --sp-ring-size) to train_eagle3.py for configuring the dimensions of sequence parallelism.
  • Data Collator Adaptation: Modified the DataCollatorWithPadding to adjust padding dynamically, ensuring sequence lengths are multiples of the sequence-parallel degree for proper data distribution (see the sketch after this list).
  • LlamaUSPAttention Module: Introduced a new LlamaUSPAttention class, extending LlamaAttention, to encapsulate the specific logic for Ulysses and Ring Attention within the draft model.
  • Unit Testing for USP: Added a new unit test (tests/test_layers/test_decoder.py) to validate the correctness and numerical stability of the USP implementation against the standard attention mechanism.
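
A minimal sketch of the padding adjustment described in the Data Collator Adaptation highlight, assuming right-padding with a pad id and an ignore index; this is illustrative and not the PR's exact DataCollatorWithPadding code:

```python
import torch
import torch.nn.functional as F

def pad_to_sp_multiple(input_ids: torch.Tensor, labels: torch.Tensor,
                       sp_degree: int, pad_id: int = 0, ignore_index: int = -100):
    """Right-pad [batch, seq] tensors so the sequence length is divisible by sp_degree."""
    seq_len = input_ids.size(-1)
    extra = (-seq_len) % sp_degree  # tokens needed to reach the next multiple
    if extra:
        input_ids = F.pad(input_ids, (0, extra), value=pad_id)
        labels = F.pad(labels, (0, extra), value=ignore_index)
    return input_ids, labels
```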


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces significant new functionality by adding Sequence Parallelism (SP) support, specifically Unified Sequence Parallelism (USP) with Ring and Ulysses attention. This is a valuable addition for training models with long sequences. The implementation is comprehensive, touching distributed setup, data collation, and the core attention mechanism, and includes a new distributed test case.

My review focuses on a few key areas:

  • Security and Portability: The example script contains hardcoded, environment-specific, and insecure configurations that should be removed.
  • Resource Management: There's a potential resource leak in the distributed setup where a process group is not destroyed.
  • Code Clarity and Correctness: I've pointed out areas for simplification, documentation updates, and removal of debugging artifacts.

Overall, this is a strong contribution. Addressing these points will improve the robustness and usability of the new feature.

@uygnef force-pushed the pr/cp branch 2 times, most recently from ed6d3d2 to f676da4 on December 16, 2025 06:59
@uygnef changed the title from "[feature] add Sequence Parallelism support for training" to "[feature] add Sequence Parallelism support for offline training" on Dec 16, 2025
@xhdidi commented Dec 20, 2025

With 8×H800, it seems that TP=8 can support a longer maximum training length than any other USP configuration. Is there any plan to enable flex-attention in the SP?

@uygnef (Contributor, Author) commented Dec 22, 2025

> With 8×H800, it seems that TP=8 can support a longer maximum training length than any other USP configuration. Is there any plan to enable flex-attention in the SP?

Flex attention does not work well with B/A-series GPUs. I aim to support this PR based on Flash Attention (this PR already exists). If there are precision issues, we will then enable Flex Attention support, which is estimated to take about 2 weeks.

Have you tried sp_ring_size=8? TP=8 maxes out at 16k sequence length, and sp_ring_size=8 should support longer sequences.

@xhdidi commented Dec 22, 2025

> > With 8×H800, it seems that TP=8 can support a longer maximum training length than any other USP configuration. Is there any plan to enable flex-attention in the SP?
>
> Flex attention does not work well with B/A-series GPUs. I aim to support this PR based on Flash Attention (this PR already exists). If there are precision issues, we will then enable Flex Attention support, which is estimated to take about 2 weeks.
>
> Have you tried sp_ring_size=8? TP=8 maxes out at 16k sequence length, and sp_ring_size=8 should support longer sequences.

I tried to train a draft model for Qwen3-480B with 8×H800 (80G). With tp=8 and the flex-attention backend, the maximum training length could reach 24k, but with ring-size=8 and the USP backend, the maximum training length is only 12k.
With the same attention backend, ring-size=8 might support a larger context length than tp=8. However, the memory saved by flex-attention seems to play a greater role when training on long texts.

@FrankLeeeee FrankLeeeee merged commit ee22b87 into sgl-project:main Dec 25, 2025
2 checks passed
@ggg-s (Contributor) commented Jan 4, 2026

Hi @uygnef, I want to know why SP can't use Flex?

@uygnef (Contributor, Author) commented Jan 5, 2026

> Hi @uygnef, I want to know why SP can't use Flex?

It's easy to implement Ulysses with Flex Attention, but supporting Ring Attention involves complex mask handling. I'm adding USP support on top of Flash Attention; it's nearly done, and the PR is coming in a few days.
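
For context, a minimal sketch of the Ulysses-style all-to-all that trades sequence shards for head shards before attention (shapes and names are illustrative, not the code in this PR; the backward pass is omitted):

```python
import torch
import torch.distributed as dist

def ulysses_all_to_all(x: torch.Tensor, group) -> torch.Tensor:
    """[batch, seq/P, heads, dim] (sequence-sharded) -> [batch, seq, heads/P, dim] (head-sharded)."""
    p = dist.get_world_size(group)
    b, s_shard, h, d = x.shape
    assert h % p == 0, "head count must be divisible by the Ulysses degree"
    # Split heads into p chunks; chunk j is sent to rank j.
    x = x.reshape(b, s_shard, p, h // p, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)  # exchange chunks across ranks
    # Received chunk i holds sequence shard i for our local heads; stack along the sequence dim.
    return out.permute(1, 0, 2, 3, 4).reshape(b, p * s_shard, h // p, d)
```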

@ggg-s (Contributor) commented Jan 5, 2026

@uygnef fine!

@uygnef (Contributor, Author) commented Jan 7, 2026

> @uygnef fine!

Hi, here is the PR: #411

xiaomin-D pushed a commit to eigen-ai-labs/SpecForge_public that referenced this pull request Jan 10, 2026
…project#366)

* add ds v3

* init

* modify sglang fit deepseek

* fix deepseek rparser

* ulysses finish

* ring offline finish

* tmp

* test pass

* test fail

* test

* clean up

* remove deepseek

* clean up

* clean up

* -

* -

* format

* fix unit test

---------

Co-authored-by: Yu Feng <fengyufengyu@didiglobal.com>
Co-authored-by: daiyajun <daiyajun@didiglobal.com>
