[FSDP][Replicate] tests replicate parameter registration #162631
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162631
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0f7c67c with merge base 0819de4. This comment was automatically generated by Dr. CI and updates every 15 minutes.
**Summary**
Tests parameter state management after forward and backward passes for single and multiple replicate groups.

**Test Cases**
1. `pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_forward`
2. `pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_backward`

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci

[ghstack-poisoned]
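For orientation, here is a minimal sketch of the kind of assertion these tests make, not the actual test code. It assumes the composable `replicate` API from `torch.distributed._composable` and a single-rank gloo process group, whereas the real tests run multi-rank through `test_replicate_training.py`.

```python
# Minimal sketch, not the actual test: check that parameters stay registered
# on the module after replicate() and after a forward/backward step.
# Assumes torch.distributed._composable.replicate and a 1-rank gloo group.
import os
import torch
import torch.distributed as dist
from torch.distributed._composable import replicate

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
names_before = {name for name, _ in model.named_parameters()}
replicate(model)

# Wrapping must not change which parameters the module exposes.
assert {name for name, _ in model.named_parameters()} == names_before

# After forward + backward, the same parameters are registered and have grads.
model(torch.randn(4, 8)).sum().backward()
assert {name for name, _ in model.named_parameters()} == names_before
assert all(p.grad is not None for p in model.parameters())

dist.destroy_process_group()
```

The point of the actual tests is the same: wrapping with replicate must not break how parameters are registered and updated across forward and backward, for both single and multiple replicate groups.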
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
**Summary:** In order to ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure that it can pass the same training tests that fully_shard can. This test is important because it verifies that we can cast a replicated module to a different dtype after initialization, an important feature for enabling mixed precision.

**Test Cases**
1. `pytest test/distributed/_composable/test_replicate_training.py -k test_to_float64_after_init`

Pull Request resolved: #162636
Approved by: https://github.com/mori360
ghstack dependencies: #162631
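A hedged sketch of the cast-after-init pattern this exercises, again assuming the composable `replicate` API and a single-rank gloo group (the real test runs multi-rank): apply the wrapper first, then call `.to(torch.float64)` and confirm parameters and activations follow the new dtype.

```python
# Sketch only: cast a replicated module to float64 after initialization and
# confirm that parameters and activations follow the new dtype.
import os
import torch
import torch.distributed as dist
from torch.distributed._composable import replicate

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)
replicate(model)
model.to(torch.float64)  # cast after the parallelism has been applied

assert all(p.dtype == torch.float64 for p in model.parameters())
out = model(torch.randn(4, 8, dtype=torch.float64))
assert out.dtype == torch.float64

dist.destroy_process_group()
```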
…162650)

**Summary:** The parity tests train two identical models with the same inputs, one using a reference approach and one using the test approach (replicate), then check that both models produce identical losses. This ensures the distributed training methods don't change the mathematical results compared to standard training.

**Test Cases**
1. `pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group`
2. `pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group`
3. `pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager`

Pull Request resolved: #162650
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636
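The loss-parity pattern itself can be sketched without any distributed setup; in the real test one of the two models is wrapped with replicate and run across ranks, but the comparison logic is the same:

```python
# Sketch of the loss-parity pattern: two identically initialized models, same
# inputs and optimizer settings, losses compared step by step.
import copy
import torch

torch.manual_seed(0)
ref_model = torch.nn.Linear(8, 8)
test_model = copy.deepcopy(ref_model)  # in the real test, this one is wrapped with replicate

ref_optim = torch.optim.SGD(ref_model.parameters(), lr=0.1)
test_optim = torch.optim.SGD(test_model.parameters(), lr=0.1)

for step in range(3):
    x = torch.randn(4, 8)
    losses = []
    for model, optim in ((ref_model, ref_optim), (test_model, test_optim)):
        optim.zero_grad()
        loss = model(x).sum()
        loss.backward()
        optim.step()
        losses.append(loss.detach())
    # Both training paths must produce identical losses at every step.
    torch.testing.assert_close(losses[0], losses[1])
```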
…non-root module (#162654)

**Summary:** Verifies that Replicate correctly handles the scenario where forward and backward passes are run through both the root module and a non-root module.

**Test Cases**
1. `pytest test/distributed/_composable/test_replicate_training.py -k test_non_root_forward_backward`

Pull Request resolved: #162654
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650
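A small sketch of the usage pattern this covers, with the parallelism wrapper omitted for brevity; in the real test the model is wrapped with replicate, and the point is that its hooks behave correctly whether a pass enters through the root or directly through a child:

```python
# Sketch of the usage pattern only (wrapper omitted): run a forward/backward
# through the root module, then another directly through a non-root child.
import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))

# 1) Root-module pass.
model(torch.randn(4, 8)).sum().backward()
model.zero_grad()

# 2) Non-root pass: call a child module directly.
child = model[1]
child(torch.randn(4, 8)).sum().backward()
assert all(p.grad is not None for p in child.parameters())
```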
…iple times in a forward pass (#162656)

**Summary:** Verifies that Replicate works correctly when a module is used multiple times in a single forward pass.

**Test Cases**
1. `pytest test/distributed/_composable/test_replicate_training.py -k test_multi_forward_module`

Pull Request resolved: #162656
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654
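The model shape this test exercises can be sketched as follows (wrapper omitted); in the real test the reused submodule is managed by replicate, and the check is that its bookkeeping stays correct when the same module runs twice in one forward:

```python
# Sketch of a module reused twice within a single forward pass.
import torch

class MultiForward(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.shared = torch.nn.Linear(8, 8)  # reused twice per forward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shared(torch.relu(self.shared(x)))

model = MultiForward()
model(torch.randn(4, 8)).sum().backward()
# The shared submodule accumulates gradients from both uses.
assert all(p.grad is not None for p in model.shared.parameters())
```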
**Summary:** Prefetching tests validate that distributed training systems can correctly overlap communication and computation by pre-loading parameters or data before they're needed. This test ensures the prefetching mechanism doesn't break training correctness while potentially improving performance by reducing idle time where computation waits for communication to complete.

**Test Cases**
1. `pytest test/distributed/_composable/test_replicate_training.py -k test_explicit_prefetching`

Pull Request resolved: #162658
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656
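For context, explicit prefetching in FSDP2's `fully_shard` is driven through `set_modules_to_forward_prefetch`; the replicate test presumably exercises an equivalent hook. The sketch below is a hedged illustration of that pattern, not the replicate test itself, and assumes a multi-GPU launch via torchrun and a recent PyTorch that exports `fully_shard` from `torch.distributed.fsdp`.

```python
# Hedged sketch of explicit forward prefetching with FSDP2's fully_shard API.
# Launch with: torchrun --nproc_per_node=2 script.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

layers = torch.nn.ModuleList([torch.nn.Linear(8, 8, device="cuda") for _ in range(3)])
model = torch.nn.Sequential(*layers)
for layer in layers:
    fully_shard(layer)  # each layer becomes its own communication group
fully_shard(model)

# While layer i computes, start gathering the parameters of layer i + 1.
for i in range(len(layers) - 1):
    layers[i].set_modules_to_forward_prefetch([layers[i + 1]])

model(torch.randn(4, 8, device="cuda")).sum().backward()
dist.destroy_process_group()
```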
…tes (#162785)

**Summary:** In order to ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure that it can pass the same training tests that fully_shard can. This test verifies that replicate correctly handles post-optimizer events.

**Test Cases**
1. `pytest test/distributed/_composable/test_replicate_training.py -k test_post_optim_event`

Pull Request resolved: #162785
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656, #162658
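Roughly, the post-optimizer-event pattern looks like the sketch below, written against FSDP2's `fully_shard` and its `set_post_optim_event` hook (an assumption that replicate mirrors this API): the next iteration's collectives wait on a recorded event rather than on everything queued in the current stream. Requires GPUs and a torchrun launch.

```python
# Hedged sketch of the post-optimizer-event pattern with FSDP2's fully_shard
# (set_post_optim_event is assumed; the replicate test checks the analogue).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Linear(8, 8, device="cuda")
fully_shard(model)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(2):
    model(torch.randn(4, 8, device="cuda")).sum().backward()
    optim.step()
    optim.zero_grad()
    # Record an event after the optimizer step and hand it to the root module,
    # so the next iteration waits only on this event, not the whole stream.
    post_optim_event = torch.cuda.current_stream().record_event()
    model.set_post_optim_event(post_optim_event)

dist.destroy_process_group()
```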
Summary
Tests parameter state management after forward and backward passes for single and multiple replicate groups.
Test Cases
1. `pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_forward`
2. `pytest test/distributed/_composable/test_replicate_training.py -k test_param_registration_after_backward`
Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci