[FSDP][Replicate] tests replicate parity for single and multigroup #162650
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162650
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 8ff8a63 with merge base 0819de4.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…ltigroup" **Summary:** The parity tests train two identical models with the same inputs - one using a reference approach and one using the test approach (replicate) - then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group 2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group 3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…ltigroup" **Summary:** The parity tests train two identical models with the same inputs - one using a reference approach and one using the test approach (replicate) - then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group 2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group 3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
…ltigroup" **Summary:** The parity tests train two identical models with the same inputs - one using a reference approach and one using the test approach (replicate) - then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training. **Test Cases** 1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group 2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group 3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…non-root module (#162654)

**Summary:** Verifies that Replicate correctly handles the scenario where forward and backward passes are run through both the root module and a non-root module.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_non_root_forward_backward

Pull Request resolved: #162654
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650
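For illustration only, the scenario exercised by this test has roughly the following shape; the model and submodule names are hypothetical, not taken from the test.

```python
# Run forward/backward through the root module...
root_out = model(inp)
root_out.sum().backward()

# ...and also directly through a non-root submodule, which replicate
# must handle without corrupting its communication state.
sub_out = model.layers[0](inp)
sub_out.sum().backward()
```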
…iple times in a forward pass (#162656)

**Summary:** Verifies that Replicate works correctly when a module is used multiple times in a single forward pass.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_multi_forward_module

Pull Request resolved: #162656
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654
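A hypothetical module of the kind this test targets - one submodule invoked twice within a single forward pass (the name and sizes are illustrative):

```python
import torch.nn as nn


class MultiForwardModule(nn.Module):
    """Illustrative only: reuses one submodule twice per forward."""

    def __init__(self) -> None:
        super().__init__()
        self.inner = nn.Linear(8, 8)

    def forward(self, x):
        # The same submodule runs twice, so its parameters are used
        # (and any hooks fire) more than once per forward pass.
        return self.inner(self.inner(x))
```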
**Summary:** Prefetching tests validate that distributed training systems can correctly overlap communication and computation by pre-loading parameters or data before they are needed. This test ensures the prefetching mechanism does not break training correctness while potentially improving performance by reducing the idle time in which computation waits for communication to complete.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_explicit_prefetching

Pull Request resolved: #162658
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656
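For context, explicit forward prefetching in FSDP2 is driven with set_modules_to_forward_prefetch; since replicate is being run through fully_shard's test suite, the test presumably exercises an analogous hook. A hedged sketch using the FSDP2 API, with a hypothetical stack of uniform layers:

```python
import torch.nn as nn
from torch.distributed.fsdp import fully_shard

# Hypothetical model. Assumes a process group is already initialized.
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])
layers = list(model)
for layer in layers:
    fully_shard(layer)
fully_shard(model)

# Explicit prefetching: while layer i runs forward, begin gathering the
# parameters of layers i+1 and i+2 so communication overlaps computation.
for i, layer in enumerate(layers):
    to_prefetch = layers[i + 1 : i + 3]
    if to_prefetch:
        layer.set_modules_to_forward_prefetch(to_prefetch)
```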
…tes (#162785)

**Summary:** To ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure it can pass the same training tests that fully_shard can. Verifies that replicate correctly handles post-optimizer events.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_post_optim_event

Pull Request resolved: #162785
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656, #162658
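The FSDP2 pattern being mirrored is roughly the following; set_post_optim_event is fully_shard's documented hook, and whether replicate exposes the identical method is an assumption here:

```python
# After the optimizer step, record a CUDA event and hand it to the root
# module so the next iteration's collectives wait on that event instead
# of a blocking stream synchronization.
optim.step()
post_optim_event = torch.cuda.current_stream().record_event()
model.set_post_optim_event(post_optim_event)  # assumed to mirror FSDP2
```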
…ytorch#162650)

**Summary:** The parity tests train two identical models on the same inputs - one using a reference approach and one using the approach under test (replicate) - then check that both models produce identical losses. This ensures the distributed training method does not change the mathematical results compared to standard training.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group
2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group
3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager

Pull Request resolved: pytorch#162650
Approved by: https://github.com/mori360
ghstack dependencies: pytorch#162631, pytorch#162636
**Summary:** The parity tests train two identical models on the same inputs - one using a reference approach and one using the approach under test (replicate) - then check that both models produce identical losses. This ensures the distributed training method does not change the mathematical results compared to standard training.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group
2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group
3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager
Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci