Conversation
@awgu awgu commented Jun 6, 2024

Stack from ghstack (oldest at bottom):

The test seems to be flaky due to the multi-threaded process group (MTPG). This PR converts the test to use a normal multi-process `ProcessGroupNCCL` to fix the flakiness.

This PR closes #126851.

Interestingly, the original MTPG version passes for me on devgpu. Either way, the new version also passes on devgpu, so we can see in CI.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k
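
For context, the structural change is from one thread per rank (MTPG) to one process per rank. A minimal CPU-only sketch of the rank-per-process layout (names illustrative; the real test would call `torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)` in each worker and drive one GPU per rank):

```python
import multiprocessing as mp

WORLD_SIZE = 2

def _worker(rank, world_size, conn):
    # In the real test each rank would run:
    #   dist.init_process_group("nccl", rank=rank, world_size=world_size)
    #   torch.cuda.set_device(rank)
    # Here we just compute a per-rank value and report it back, to show
    # the process-per-rank structure (unlike MTPG, where ranks are
    # threads sharing one process).
    local_value = (rank + 1) * 10  # stand-in for a per-rank tensor
    conn.send((rank, local_value))
    conn.close()

def run_test(world_size=WORLD_SIZE):
    parents, procs = [], []
    for rank in range(world_size):
        parent, child = mp.Pipe()
        proc = mp.Process(target=_worker, args=(rank, world_size, child))
        proc.start()
        parents.append(parent)
        procs.append(proc)
    results = dict(parent.recv() for parent in parents)
    for proc in procs:
        proc.join()
    return results

if __name__ == "__main__":
    print(run_test())  # {0: 10, 1: 20}
```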

@pytorch-bot

pytorch-bot bot commented Jun 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128100

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 16249ba with merge base 2fc9079:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category labels Jun 6, 2024
awgu pushed a commit that referenced this pull request Jun 6, 2024
@awgu awgu added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 6, 2024
@awgu awgu marked this pull request as ready for review June 6, 2024 02:43

torch.manual_seed(42)
model = Model()
for param in model.parameters():
Collaborator Author (@awgu) commented on the diff:
Only needed for MTPG

@awgu awgu requested review from wanchaol and weifengpy June 6, 2024 02:43
@awgu
Collaborator Author

awgu commented Jun 6, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Jun 6, 2024
There is still ongoing discussion on how this API should work.

Current approach:
- The pre-all-gather ops run in the default stream and the all-gather is called from the default stream with `async_op=True`.
- Pros:
    - The all-gather input and output tensors are allocated in the default stream, so there is no increased memory fragmentation across stream pools.
    - There is no need for additional CUDA synchronization. The API is self-contained.
- Cons:
    - The pre-all-gather ops (e.g. the fp32 -> bf16 cast and the all-gather copy-in device copies) cannot overlap with other default-stream compute. The biggest concern is CPU offloading, where the H2D copies cannot overlap.
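
As a CPU-only analogy of this pattern (not the actual FSDP implementation: the thread pool stands in for the NCCL communicator, and the doubling stands in for the bf16 cast; all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=1)

def pre_all_gather(shard):
    # Stand-in for the fp32 -> bf16 cast and copy-in; runs inline on the
    # caller (the "default stream"), so it cannot overlap other caller work.
    return [x * 2 for x in shard]

def all_gather_async(shards):
    # Launch the collective and return a waitable handle (a Future here),
    # like async_op=True returning a Work handle.
    return _pool.submit(lambda: [x for s in shards for x in s])

def unshard(shards):
    inputs = [pre_all_gather(s) for s in shards]  # inline, "default stream"
    handle = all_gather_async(inputs)             # async_op=True analogue
    # ... other default-stream work could run here ...
    return handle.result()                        # handle.wait()

print(unshard([[1, 2], [3, 4]]))  # [2, 4, 6, 8]
```

The self-contained shape is the pro listed above: everything is ordered by the caller, at the cost of the pre-all-gather work running inline.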

Alternative approach:
- Follow the default implicit prefetching approach, where the pre-all-gather ops and all-gather run in separate streams.
- Pros:
    - The pre-all-gather ops can overlap with default stream compute.
- Cons:
    - We require an API that is called after the last optimizer step (namely, the last op that modifies the sharded parameters) and before the first `unshard` call, to have the all-gather streams wait for the default stream. The API is no longer self-contained and now has a complementary API.
    - The all-gather input and output tensors are allocated in separate streams (not the default stream), so there can be increased memory fragmentation across pools.

Pull Request resolved: #128138
Approved by: https://github.com/wanchaol
ghstack dependencies: #128100
TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this pull request Jun 14, 2024
TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this pull request Jun 14, 2024
@github-actions github-actions bot deleted the gh/awgu/600/head branch July 7, 2024 01:59