
Conversation

@anshul-si (Contributor) commented Sep 10, 2025

Summary: To ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure it can pass the same training tests that fully_shard does. This test is important because it verifies that we can cast a replicated module to a different type after initialization, an important feature for enabling mixed precision; a minimal sketch follows the test case below.

Test Cases

  1. pytest test/distributed/_composable/test_replicate_training.py -k test_to_float64_after_init
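
A minimal sketch of what this test exercises, assuming the composable `replicate` API from `torch.distributed._composable` (the import path and single-process setup are illustrative, not the actual test code):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable import replicate  # import path assumed

# Single-process gloo setup so the sketch runs standalone (illustrative).
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
replicate(model)         # replication is applied after standard float32 init
model.to(torch.float64)  # cast the already-replicated module

# Every parameter should now carry the new dtype.
assert all(p.dtype == torch.float64 for p in model.parameters())

dist.destroy_process_group()
```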

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci


pytorch-bot bot commented Sep 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162636

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 54938f1 with merge base 0819de4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mori360 (Contributor) left a comment:

Thanks for the test

@anshul-si (Contributor, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Sep 18, 2025
…162650)

**Summary:** The parity tests train two identical models on the same inputs, one with a reference approach and one with the test approach (replicate), then check that both models produce identical losses. This ensures the distributed training method does not change the mathematical results compared to standard training; a sketch of the pattern follows this entry.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_single_group
2. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group
3. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_multi_group_cpu_offload_eager

Pull Request resolved: #162650
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636
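
A hedged sketch of the parity pattern described in the entry above; the single-process setup, model, and hyperparameters are illustrative assumptions, not the actual test code:

```python
import copy
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable import replicate  # import path assumed

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

torch.manual_seed(42)
ref_model = nn.Linear(16, 16)
test_model = copy.deepcopy(ref_model)  # identical starting weights
replicate(test_model)

ref_optim = torch.optim.Adam(ref_model.parameters(), lr=1e-2)
test_optim = torch.optim.Adam(test_model.parameters(), lr=1e-2)

for _ in range(5):
    inp = torch.randn(4, 16)
    losses = []
    for model, optim in ((ref_model, ref_optim), (test_model, test_optim)):
        optim.zero_grad()
        loss = model(inp).sum()
        loss.backward()
        optim.step()
        losses.append(loss.detach())
    # Replication should not change the numerics versus plain training.
    assert torch.equal(losses[0], losses[1])

dist.destroy_process_group()
```

At world size 1 the gradient all-reduce is effectively a no-op, so any divergence here would point at the wrapper itself rather than at collective numerics.
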
pytorchmergebot pushed a commit that referenced this pull request Sep 18, 2025
…non-root module (#162654)

**Summary:** Verifies that Replicate correctly handles the scenario where forward and backward passes are run through both the root module and a non-root module.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_non_root_forward_backward

Pull Request resolved: #162654
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650
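
An illustrative sketch of the non-root scenario above, under the same assumptions as the earlier sketches (composable `replicate` import path, single-process setup):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable import replicate  # import path assumed

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = nn.Linear(8, 8)
        self.outer = nn.Linear(8, 8)

    def forward(self, x):
        return self.outer(self.inner(x))

model = Model()
replicate(model.inner)  # non-root replicated module
replicate(model)        # root replicated module

x = torch.randn(2, 8)
model(x).sum().backward()        # forward/backward through the root
model.inner(x).sum().backward()  # forward/backward through the non-root child

dist.destroy_process_group()
```
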
pytorchmergebot pushed a commit that referenced this pull request Sep 18, 2025
…iple times in a forward pass (#162656)

**Summary:** Verifies that Replicate works correctly when a module is used multiple times in a single forward pass.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_multi_forward_module

Pull Request resolved: #162656
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654
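
A small sketch of the reuse pattern above, again with an assumed import path and single-process setup:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable import replicate  # import path assumed

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class MultiForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(8, 8)

    def forward(self, x):
        # The same submodule participates twice in one forward pass, so its
        # communication hooks must tolerate being triggered more than once.
        return self.lin(self.lin(x))

model = MultiForward()
replicate(model)
model(torch.randn(2, 8)).sum().backward()

dist.destroy_process_group()
```
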
pytorchmergebot pushed a commit that referenced this pull request Sep 18, 2025
**Summary:** Prefetching tests validate that distributed training systems can correctly overlap communication and computation by pre-loading parameters or data before they're needed. This test ensures the prefetching mechanism doesn't break training correctness while potentially improving performance by reducing idle time where computation waits for communication to complete.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_explicit_prefetching

Pull Request resolved: #162658
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656
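
A hedged sketch of explicit prefetching. fully_shard (FSDP2) exposes `set_modules_to_forward_prefetch` for this; whether replicate mirrors that exact method is an assumption here, so treat the call below as hypothetical:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable import replicate  # import path assumed

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])
for layer in model:
    replicate(layer)  # per-layer replication, mirroring per-layer fully_shard
replicate(model)

# Hypothetical: ask each layer to prefetch the next layer's parameters
# during forward, mirroring fully_shard's set_modules_to_forward_prefetch.
layers = list(model)
for cur, nxt in zip(layers[:-1], layers[1:]):
    cur.set_modules_to_forward_prefetch([nxt])

model(torch.randn(2, 8)).sum().backward()

dist.destroy_process_group()
```
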
pytorchmergebot pushed a commit that referenced this pull request Sep 18, 2025
…tes (#162785)

**Summary:** To ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure it can pass the same training tests that fully_shard does. This test verifies that replicate correctly handles post-optimizer events; a sketch follows this entry.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_post_optim_event

Pull Request resolved: #162785
Approved by: https://github.com/mori360
ghstack dependencies: #162631, #162636, #162650, #162654, #162656, #162658
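
A hedged sketch of the post-optimizer-event pattern. fully_shard (FSDP2) exposes `set_post_optim_event` on the root module; its presence on replicate is assumed, not confirmed by this thread:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable import replicate  # import path assumed

# Post-optim events are a CUDA-stream concept, so this assumes one GPU.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = nn.Linear(8, 8).cuda()
replicate(model)
optim = torch.optim.AdamW(model.parameters(), lr=1e-2)

model(torch.randn(2, 8, device="cuda")).sum().backward()
optim.step()

# Record an event after the optimizer step and hand it to the replicated
# root so the next iteration's collectives wait on it. The method name
# mirrors fully_shard's set_post_optim_event; treating it as available on
# replicate is an assumption of this sketch.
event = torch.cuda.current_stream().record_event()
model.set_post_optim_event(event)

dist.destroy_process_group()
```
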
Labels: ciflow/trunk, Merged, oncall: distributed, topic: not user facing