[feature] add Sequence Parallelism support for offline training #366
Conversation
Summary of Changes
Hello @uygnef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces Sequence Parallelism (SP) to SpecForge, a critical enhancement for training large language models with extended sequence lengths. By leveraging Unified Sequence Parallelism (USP), which combines Ring Attention and Ulysses attention, the system can now efficiently distribute and process longer sequences across multiple GPUs, effectively mitigating memory constraints. This foundational change paves the way for developing more powerful and context-aware models.
Code Review
This pull request introduces significant new functionality by adding Sequence Parallelism (SP) support, specifically Unified Sequence Parallelism (USP) with Ring and Ulysses attention. This is a valuable addition for training models with long sequences. The implementation is comprehensive, touching distributed setup, data collation, and the core attention mechanism, and includes a new distributed test case.
My review focuses on a few key areas:
- Security and Portability: The example script contains hardcoded, environment-specific, and insecure configurations that should be removed.
- Resource Management: There's a potential resource leak in the distributed setup where a process group is not destroyed.
- Code Clarity and Correctness: I've pointed out areas for simplification, documentation updates, and removal of debugging artifacts.
Overall, this is a strong contribution. Addressing these points will improve the robustness and usability of the new feature.
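On the resource-management point above, a common pattern is to tie the extra sequence-parallel group's lifetime to the run with try/finally so it is always released. The following is a minimal sketch under assumed names (NCCL backend, a hypothetical `sp_size`), not the PR's actual setup code:

```python
import torch.distributed as dist

def setup_and_run(rank: int, world_size: int, sp_size: int = 2):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Every rank must create every sub-group, in the same order; each keeps only its own.
    sp_group = None
    for start in range(0, world_size, sp_size):
        group = dist.new_group(ranks=list(range(start, start + sp_size)))
        if start <= rank < start + sp_size:
            sp_group = group
    try:
        ...  # training loop (omitted)
    finally:
        dist.destroy_process_group(sp_group)  # release the SP group explicitly
        dist.destroy_process_group()          # then tear down the default group
```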
With 8×H800, it seems that TP=8 can support a longer maximum training length than any other USP configuration. Is there any plan to enable flex-attention in the SP?
Flex Attention does not work well with B/A-series GPUs. I plan to base this feature on Flash Attention first (which this PR already does). If precision issues come up, we will then enable Flex Attention support, which is estimated to take about two weeks. Have you tried sp_ring_size=8? TP=8 maxes out at a 16k sequence length, while sp_ring_size=8 should support longer sequences.
I tried to train a draft model for Qwen3-480B with 8×H800 (80 GB). With tp=8 and the flex-attention backend, the maximum training length reached 24k, but with ring-size=8 and the USP backend, the maximum training length was only 12k.
Hi @uygnef, I'd like to know: why can't SP use Flex Attention?
It's easy to implement Ulysses with Flex Attention, but supporting Ring Attention involves complex mask handling. I'm adding USP support on top of Flash Attention and it's nearly done; a PR is coming in a few days.
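For context on why the two halves of USP differ here: Ulysses redistributes the activations with an all-to-all so each rank attends over the full sequence for a subset of heads, which is why any dense kernel (Flash or Flex) can be dropped in, whereas Ring Attention streams K/V blocks between ranks and therefore has to split the causal mask per block. A minimal sketch of the Ulysses redistribution, with assumed shapes and names (not this PR's implementation):

```python
import torch
import torch.distributed as dist

def ulysses_scatter_heads(x: torch.Tensor, sp_group) -> torch.Tensor:
    """x: [seq_len / sp, num_heads, head_dim] -- local sequence shard of all heads."""
    sp = dist.get_world_size(group=sp_group)
    # Send one head-chunk to every rank, receive that rank's sequence shard back.
    send = [c.contiguous() for c in x.chunk(sp, dim=1)]
    recv = [torch.empty_like(send[0]) for _ in range(sp)]
    dist.all_to_all(recv, send, group=sp_group)
    # Result: the full sequence, but only num_heads / sp heads on this rank.
    return torch.cat(recv, dim=0)
```

After the local attention call (e.g. Flash Attention over the full sequence for the local heads), the inverse all-to-all restores the original head-complete, sequence-sharded layout.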
@uygnef fine! |
…project#366)
* add ds v3
* init
* modify sglang fit deepseek
* fix deepseek rparser
* ulysses finish
* ring offline finish
* tmp
* test pass
* test fail
* test
* clean up
* remove deepseek
* clean up
* clean up
* -
* -
* format
* fix unit test
---------
Co-authored-by: Yu Feng <fengyufengyu@didiglobal.com>
Co-authored-by: daiyajun <daiyajun@didiglobal.com>
Motivation
This PR introduces Sequence Parallelism (SP) support for SpecForge training. The primary goal is to enable the training of models with significantly longer sequence lengths, which was previously limited by memory constraints on single GPUs.
Modifications
Implemented USP Support: Added support for Unified Sequence Parallelism (USP), integrating both Ring Attention and Ulysses attention mechanisms to handle long contexts efficiently.
Draft Model SP Group: Established a dedicated sequence parallel process group for the draft model to ensure correct communication and synchronization during distributed training (a minimal sketch of the per-rank sequence sharding this relies on is shown below).
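The sketch below illustrates the data-collation side of distributing long sequences across the SP group: pad each batch to a multiple of the sequence-parallel size, then keep only this rank's contiguous slice. Tensor names and padding behaviour are assumptions for illustration, not the PR's collator code:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def shard_sequence(input_ids: torch.Tensor, sp_group) -> torch.Tensor:
    """input_ids: [batch, seq_len]; returns this rank's contiguous sequence slice."""
    sp = dist.get_world_size(group=sp_group)
    rank = dist.get_rank(group=sp_group)
    pad = (-input_ids.size(1)) % sp           # pad so the length divides evenly
    if pad:
        input_ids = F.pad(input_ids, (0, pad), value=0)
    shard_len = input_ids.size(1) // sp
    return input_ids[:, rank * shard_len:(rank + 1) * shard_len]
```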
Related Issues
#299
Accuracy Test
For fp32, the output diff is less than 1e-5.
For bf16, the output diff is less than 2e-2.
The gradient diff is less than 1e-6.
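A hypothetical parity check mirroring the tolerances above (names and structure are illustrative, not the PR's test code), comparing SP outputs and gradients against a single-GPU baseline with dtype-dependent bounds:

```python
import torch

def check_sp_parity(sp_out, ref_out, sp_grad, ref_grad, dtype=torch.bfloat16):
    # Output tolerance depends on precision; gradient tolerance is fixed.
    out_tol = 1e-5 if dtype == torch.float32 else 2e-2
    assert (sp_out - ref_out).abs().max() < out_tol
    assert (sp_grad - ref_grad).abs().max() < 1e-6
```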
Loss and train acc curves (plots omitted).
Benchmark & Profiling
Checklist