@syedshahbaaz syedshahbaaz commented Jun 30, 2025

Fixes #PYTORCHDGQ-6374. The existing FSDP test accesses local_shards[0] on every rank by default. Depending on the number of ranks and the data, some ranks may hold no local shard, and indexing the empty list then raises a "list index out of range" error. Checking that the list is non-empty before accessing its elements resolves this in all cases.
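The guard described above can be sketched in plain Python. This is a minimal illustration, not the code changed in this PR: the helper name first_shard_or_none and the per-rank sample data are hypothetical, and strings stand in for the shard objects a real test would hold.

```python
from typing import Dict, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_shard_or_none(local_shards: Sequence[T]) -> Optional[T]:
    """Return the first local shard, or None when this rank holds none.

    Hypothetical helper: it only illustrates checking the list before
    indexing, which avoids "list index out of range" on empty ranks.
    """
    if len(local_shards) == 0:
        return None
    return local_shards[0]


# Simulated shard lists per rank (assumed data); rank 2 holds no shard.
shards_by_rank: Dict[int, List[str]] = {0: ["shard-a"], 1: ["shard-b"], 2: []}

first_by_rank = {
    rank: first_shard_or_none(shards)
    for rank, shards in shards_by_rank.items()
}
```

A test using such a guard would run its per-shard assertions only when the returned value is not None, so ranks without shards simply pass through.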

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

daisyden and others added 30 commits May 10, 2024 19:43
This reverts commit f5cbd50.
This reverts commit d0d8271.
Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>
…equires_nccl_or and requires_nccl_version_or to replace requires_nccl and requires_nccl_version when xccl test is enabled on a test
Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
PenghuiCheng and others added 21 commits May 7, 2025 01:50
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
…-power-of-two for XPU and XCCL backend. The skip test relies on WORLD_SIZE being set; if WORLD_SIZE is not set, the test case is skipped. This skip test is applied to test/distributed/fsdp/test_fsdp_tp_integration.py, which is not designed to run with a non-power-of-two WORLD_SIZE such as 12 ranks. The test case is not skipped when 4 or 8 ranks are used on XPU devices, and it is not skipped on non-XPU devices.
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
[XPU] Added a new skip test to detect if the world size is set to non-power of 2
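The skip rule described in the commit message above could look roughly like the following. This is a sketch under stated assumptions, not the decorator added in these commits: the names is_power_of_two, should_skip, and skip_if_non_power_of_two_on_xpu are hypothetical, the device check is passed in as a plain boolean, and the non-XPU check is assumed to take precedence over the unset WORLD_SIZE check.

```python
import os
import unittest
from typing import Optional


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for 0, negatives, and e.g. 12."""
    return n > 0 and (n & (n - 1)) == 0


def should_skip(world_size_env: Optional[str], is_xpu: bool) -> bool:
    """Decide whether to skip, following the rules in the commit message.

    Assumption: non-XPU devices are checked first and never skipped.
    """
    if not is_xpu:
        return False
    if world_size_env is None:
        # WORLD_SIZE is not set, so the test case is skipped.
        return True
    return not is_power_of_two(int(world_size_env))


def skip_if_non_power_of_two_on_xpu(is_xpu: bool):
    """Hypothetical decorator factory built on unittest.skipIf."""
    return unittest.skipIf(
        should_skip(os.environ.get("WORLD_SIZE"), is_xpu),
        "WORLD_SIZE is unset or not a power of two on XPU",
    )
```

Keeping the decision in a pure function such as should_skip makes the rule easy to check for 4, 8, and 12 ranks without launching a distributed job.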
@syedshahbaaz syedshahbaaz requested review from mruberry and a team as code owners June 30, 2025 15:32
pytorch-bot bot commented Jun 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157275

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit c3f9fc0 with merge base 53d06e1 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: dynamo, module: inductor, oncall: distributed, and release notes: distributed (fsdp) labels Jun 30, 2025
CLA Missing ID CLA Not Signed
5 participants