@fegin commented Sep 25, 2025

Async TP related CI has been failing since Sep 22, 2025. However, even if
we roll back the nightly PyTorch to 0919, the tests still fail.
```
python -m pip install --force-reinstall torch==2.10.0.dev20250917+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126
```

This is not an async TP issue but a symmetric memory issue. This single line
is enough to cause failures on the CI machine/docker.

```
symm_mem = get_symm_mem_workspace(torch.distributed.group.WORLD.group_name, min_size=1024*1024*64)
```

@tianyu-l left a comment

Since you introduced the `disabled` field, this may be a good opportunity to audit the other integration test files for an overall improvement. IIRC, that includes the (a100) feature test and the simple_fsdp integration test (under its own folder).
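
For readers unfamiliar with the pattern, here is a minimal sketch of how a `disabled` flag on a test definition could be used to skip a case without deleting it. Apart from the field name `disabled`, everything here (`IntegrationCase`, `select_enabled`, the case names and the placeholder arguments) is hypothetical and only illustrates the idea; it is not the repository's actual test definition or runner.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntegrationCase:
    """Hypothetical stand-in for one integration test definition."""

    name: str
    # Placeholder CLI overrides; these are not real flags.
    override_args: List[str] = field(default_factory=list)
    # When True, the runner skips this case but keeps its definition around.
    disabled: bool = False


def select_enabled(cases: List[IntegrationCase]) -> List[IntegrationCase]:
    """Return only the cases that should actually run."""
    return [case for case in cases if not case.disabled]


CASES = [
    IntegrationCase("baseline", ["--example.option=1"]),
    # Temporarily disabled, e.g. while a CI environment issue is investigated.
    IntegrationCase("async_tp", ["--example.option=2"], disabled=True),
]

if __name__ == "__main__":
    for case in select_enabled(CASES):
        print(f"would run: {case.name} {' '.join(case.override_args)}")
```

Keeping the flag on the definition rather than commenting the case out means an audit of other test files only has to look for `disabled=True` to find everything that is currently switched off.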

@fegin merged commit 82f0287 into main Sep 25, 2025
9 checks passed
@fegin deleted the chienchin/disable_async_tp branch September 25, 2025 18:11