E2E test for FSDP, HSDP, FSDP+TP in Distributed Checkpointing #112541

LucasLLC · 2023-10-31T23:35:50Z

Adds E2E tests for saving/loading distributed checkpoints. Supported so far are:

FSDP
HSDP
FSDP + TP

Each method is also tested using torch.compile

To run all tests:
python test/distributed/checkpoint/e2e/test_e2e_save_and_load.py

pytorch-bot · 2023-10-31T23:35:54Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112541

✅ You can merge normally! (1 Unrelated Failure)

As of commit 044fb94 with merge base 6a3922d:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

periodic / macos-12-py3-x86-64 / test (default, 4, 4, macos-12) (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

LucasLLC · 2023-10-31T23:39:22Z

test/distributed/checkpoint/e2e/test_fsdp.py

+
+        return model, optim
+
+    def _equal_state_dict(self, model_0, model_1):


@fegin Is your recommendation (https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/parallel/test_fsdp_2d_parallel.py#L638-L641) necessary here, or do you think this is sufficient?

Assuming we'll need something similar to compare optimizers?

Chien-Chin has some utils to compare state_dicts here. See if anything is useful for you; maybe we can create a shared test util so we can re-use it across tests. https://github.com/pytorch/pytorch/blob/main/test/distributed/checkpoint/test_state_dict.py#L56

I believe that since I can compare DTensors directly with torch.equal, we don't really need this logic, so I left the model state dict comparison the same for now.

I also opted for a direct comparison of the optim state dict, since it was a bit simpler and it seems to work here as well, as in pytorch/test/distributed/checkpoint/test_state_dict.py (line 146 in 5a6f801):

self.assertEqual(optim.state_dict(), new_optim.state_dict())

If you are comparing two identical state_dicts, this should be enough. However, if you need to compare against a non-converted optimizer state_dict, a customized function is required, because a non-converted optimizer state_dict is keyed by parameter ID.
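
The parameter-ID point can be illustrated with a small single-process sketch. A plain torch.optim optimizer keys state_dict()["state"] by integer parameter index, so a comparison against a name-keyed state dict needs a re-keying step first. The helper name optim_state_by_fqn is hypothetical, and the mapping assumes the optimizer was built from model.parameters() in order; it is not the utility from test_state_dict.py.

```python
import torch
import torch.nn as nn


def optim_state_by_fqn(model: nn.Module, optim: torch.optim.Optimizer) -> dict:
    """Re-key optimizer state from integer parameter ids to parameter names.

    Assumption: the optimizer was constructed from model.parameters(), so the
    ids in param_groups follow the order of model.named_parameters().
    """
    names = [name for name, _ in model.named_parameters()]
    osd = optim.state_dict()
    ids = [pid for group in osd["param_groups"] for pid in group["params"]]
    id_to_fqn = dict(zip(ids, names))
    return {id_to_fqn[pid]: state for pid, state in osd["state"].items()}


torch.manual_seed(0)
model = nn.Linear(4, 2)
optim = torch.optim.Adam(model.parameters(), lr=0.1)

# Take one step so the optimizer actually holds per-parameter state.
model(torch.rand(3, 4)).sum().backward()
optim.step()

# Keys of the raw state are integer ids; keys after re-keying are names.
raw_keys = sorted(optim.state_dict()["state"].keys())
fqn_keys = sorted(optim_state_by_fqn(model, optim).keys())
```

Once both sides are keyed by name, the per-parameter tensors can be compared entry by entry.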

test/distributed/checkpoint/e2e/test_e2e_save_and_load.py

wz337 · 2023-11-02T04:49:05Z

Adding a few tags. @LucasLLC Just FYI, the multi-GPU tests do not run by default. To enable them, we need to add the ciflow/periodic label to the PR.

wz337 · 2023-11-02T05:14:27Z

test/distributed/checkpoint/e2e/test_e2e_save_and_load.py

+        return torch.rand(8, 8, device="cuda")
+
+
+class ModelType(Enum):


Noice!! Very clean!!

H-Huang

Looks great! I would also include an example python / pytest command in the PR summary as a reference.

Will defer the approval stamp to the DCP experts :)

H-Huang · 2023-11-02T12:48:13Z

test/distributed/checkpoint/e2e/test_e2e_save_and_load.py

+
+class TestE2ELoadAndSave(DTensorTestBase):
+    def _create_model(self, compile, model_type):
+        dummy_model = TestDummyModel().cuda()


nit: moving to cuda necessary? I thought FSDP does it

fegin · 2023-11-02T14:04:50Z

test/distributed/checkpoint/e2e/test_e2e_save_and_load.py

+    def test_e2e(self, compile, model_type):
+        # first create and save a checkpoint
+        model, optim = self._create_model(compile, model_type)
+        model_state_dict_0, optim_state_dict_0 = get_state_dict(model, optimizers=optim)


I suggest that we use a non-parallelized model and directly call state_dict as the source of truth to compare.

1. Create a non-parallelized model.
2. Create a parallelized model.
3. Train both models for 2 steps.
4. Save the parallelized model.
5. Create a new parallelized model.
6. Load from the trained parallelized model into the new parallelized model.
7. Train both the new parallelized model and the non-parallelized model for another step and compare the accuracy.
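
The steps above can be sketched in a single process. This is only an analogue under stated assumptions: a plain nn.Module stands in for the parallelized model, torch.save/torch.load on an in-memory buffer stand in for the distributed checkpoint save and load, and parameters are compared instead of accuracy. The model shape, optimizer, and helper names are hypothetical.

```python
import copy
import io

import torch
import torch.nn as nn


def train_step(model, optim, batch):
    """Run one optimizer step on a toy loss."""
    optim.zero_grad()
    model(batch).sum().backward()
    optim.step()


def make_optim(model):
    return torch.optim.Adam(model.parameters(), lr=0.1)


torch.manual_seed(0)
batches = [torch.rand(8, 8) for _ in range(3)]

# Steps 1-2: a reference model and a second model with identical weights.
reference = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
candidate = copy.deepcopy(reference)
ref_optim, cand_optim = make_optim(reference), make_optim(candidate)

# Step 3: train both for two steps on the same data.
for batch in batches[:2]:
    train_step(reference, ref_optim, batch)
    train_step(candidate, cand_optim, batch)

# Step 4: save the candidate (stand-in for the distributed checkpoint save).
buffer = io.BytesIO()
torch.save({"model": candidate.state_dict(), "optim": cand_optim.state_dict()}, buffer)
buffer.seek(0)

# Steps 5-6: build a fresh model and optimizer, then load the checkpoint.
restored = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
restored_optim = make_optim(restored)
checkpoint = torch.load(buffer)
restored.load_state_dict(checkpoint["model"])
restored_optim.load_state_dict(checkpoint["optim"])

# Step 7: one more step on both, then compare the resulting parameters.
train_step(reference, ref_optim, batches[2])
train_step(restored, restored_optim, batches[2])
params_match = all(
    torch.allclose(p_ref, p_new)
    for p_ref, p_new in zip(reference.parameters(), restored.parameters())
)
```

Training one more step after loading is what makes the check meaningful: it fails if the optimizer state was not restored, even when the model weights were.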

fegin

LGTM, we can land this first and improve it to compare the result with a non-parallelized model.

Please fix the lint issue and skip_if_lt_x_gpu issue @wz337 mentioned before landing.

wz337

Thanks for having this ready! Super-duper fast!
With this PR, now we can mark our FSDP, HSDP, 2D checkpointing B/E ready on PT-D feature support matrix! Thanks @LucasLLC!

LucasLLC · 2023-11-02T21:05:26Z

@pytorchbot merge

pytorchmergebot · 2023-11-02T21:07:32Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2023-11-02T21:39:30Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Lint / lintrunner / linux-job

Failing merge rule: Core Maintainers

fegin · 2023-11-02T22:50:06Z

Try lintrunner locally to get the signal of lint and typing errors.

LucasLLC · 2023-11-02T23:30:02Z

@pytorchbot merge

pytorchmergebot · 2023-11-02T23:33:31Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

wanchaol · 2023-11-03T00:08:13Z

test/distributed/checkpoint/e2e/test_e2e_save_and_load.py

+            )
+            tp_mesh = mesh_2d["tp"]
+            dp_mesh = mesh_2d["dp"]
+            model = parallelize_module(dummy_model, tp_mesh, PairwiseParallel())


As a follow up, we should probably switch PairwiseParallel to Colwise + Rowwise as we plan to deprecate the former soon

@LucasLLC

For usage of Colwise + Rowwise, you can refer to this slide. https://docs.google.com/presentation/d/1e9TNYu_u_Hz9IpfhS4_R_77de59g5AnSWEf7U-FDphg/edit#slide=id.g28cb00a468a_0_16

So, for this model, we can specify a parallelize_plan for some of the layers, maybe layer 1 and layer 2.

Actually, since we use Sequential in the module definition, I am not sure PairwiseParallel works here, so using Colwise + Rowwise might be necessary. Basically, you need to pass a dictionary as the plan to parallelize_module, where each key is a module's FQN and each value is ColwiseParallel or RowwiseParallel.
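
A minimal sketch of such a plan, assuming a hypothetical model whose Linear layers are wrapped in nn.Sequential (this is not the TestDummyModel from the PR). It only builds the dictionary and shows how the FQNs come out; applying it requires an initialized process group and a TP device mesh, which are not created here.

```python
import torch.nn as nn
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel


class ToyModel(nn.Module):
    """Hypothetical model used only to illustrate FQNs under nn.Sequential."""

    def __init__(self):
        super().__init__()
        self.net1 = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
        self.net2 = nn.Sequential(nn.Linear(16, 8), nn.ReLU())

    def forward(self, x):
        return self.net2(self.net1(x))


model = ToyModel()

# FQNs of the Linear layers. Because of nn.Sequential they are "net1.0" and
# "net2.0", not "net1" and "net2".
linear_fqns = [
    name for name, module in model.named_modules() if isinstance(module, nn.Linear)
]

# Shard the first Linear column-wise and the second row-wise.
parallelize_plan = {
    "net1.0": ColwiseParallel(),
    "net2.0": RowwiseParallel(),
}

# With an initialized process group and a TP device mesh (not created here),
# the plan would be applied as:
#   from torch.distributed.tensor.parallel import parallelize_module
#   model = parallelize_module(model, tp_mesh, parallelize_plan)
```

Deriving the keys from named_modules() avoids guessing the FQNs by hand when layers are nested.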

…h#112541)

Adds E2E tests for saving/loading distributed checkpoints. Supported so far are:
- FSDP
- HSDP
- FSDP + TP

Each method is also tested using `torch.compile`

To run all tests: `python test/distributed/checkpoint/e2e/test_e2e_save_and_load.py`

Pull Request resolved: pytorch#112541
Approved by: https://github.com/fegin, https://github.com/wz337

Addresses the following comment - #112541 (comment) Changes the comparison of models in the checkpointing E2E test to compare a non-parallelized model against a distributed model after training, saving, & loading. Pull Request resolved: #113181 Approved by: https://github.com/fegin

Addresses the following comment - #112541 (comment) Changes the comparison of models in the checkpointing E2E test to compare a non-parallelized model against a distributed model after training, saving, & loading. Pull Request resolved: #113181 Approved by: https://github.com/fegin, https://github.com/huydhn, https://github.com/wz337

adds rough draft for fsdp e2e test in distributed checkpointing

4ad8964

pytorch-bot bot added the topic: not user facing label Oct 31, 2023

LucasLLC requested review from fegin and wz337 October 31, 2023 23:37

LucasLLC self-assigned this Oct 31, 2023

LucasLLC commented Oct 31, 2023

LucasLLC added 2 commits November 1, 2023 11:26

Adds parameterized tests for compile, fsdp,hsdp,fsdp+tp

8ca9f4f

Renames fsdp test to better reflect e2e load and save

d979d34

LucasLLC changed the title ~~e2e test for FSDP in Distributed Checkpointing~~ E2E test for FSDP, HSDP, FSDP+TP in Distributed Checkpointing Nov 1, 2023

LucasLLC marked this pull request as ready for review November 1, 2023 20:22

LucasLLC requested review from H-Huang, awgu, fduwjj, kwen2501, mrshenli, rohan-varma, wanchaol and zhaojuanmao as code owners November 1, 2023 20:22

removes comments

60b8181

wz337 added the ciflow/trunk, release notes: distributed (checkpoint), and ciflow/periodic labels Nov 2, 2023

wz337 reviewed Nov 2, 2023

wz337 reviewed Nov 2, 2023

H-Huang reviewed Nov 2, 2023

fegin reviewed Nov 2, 2023

fegin approved these changes Nov 2, 2023

Merge branch 'main' into distributed_checkpointing_e2e_tests

20693b0

Removes unused import

9b1e435

wz337 approved these changes Nov 2, 2023

pytorchmergebot added the merging label Nov 2, 2023

pytorchmergebot removed the merging label Nov 2, 2023

fixes linting errors

044fb94

pytorchmergebot added the merging label Nov 2, 2023

wanchaol reviewed Nov 3, 2023

pytorchmergebot added Merged and removed merging labels Nov 3, 2023

pytorchmergebot closed this in 62c88ba Nov 3, 2023

LucasLLC mentioned this pull request Nov 7, 2023

Improves comparison of state dicts for Checkpoint E2E Tests #113181

Closed

github-actions bot deleted the distributed_checkpointing_e2e_tests branch May 12, 2025 02:17


7 participants
