[FSDP] Ensure that customized non tensor optimizer state can be saved #99214

fegin · 2023-04-15T00:48:31Z

Stack from ghstack (oldest at bottom):

-> [FSDP] Ensure that customized non tensor optimizer state can be saved #99214

The current logic does not correctly handle every kind of non-tensor optimizer state. This PR fixes the issue and adds a test.

This PR fixes #99079.

Differential Revision: D45021331
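For readers without the diff at hand, here is a minimal sketch of the kind of state this PR is about. It is not the FSDP implementation: `FakeTensor` and `split_optimizer_state` are hypothetical names used only for illustration, and the `value1` entry is an assumed example value. The idea it shows is that checking the type first lets arbitrary non-tensor values, including `None` (the case reported in #99079), pass through unchanged rather than being treated as tensors.

```python
from copy import deepcopy
from typing import Any, Dict, Tuple


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, values):
        self.values = list(values)


def split_optimizer_state(
    state: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate tensor entries from arbitrary non-tensor entries.

    Hypothetical helper, not part of FSDP. The type check comes first, so
    values such as None, ints, or floats are copied as-is and never have
    tensor methods called on them.
    """
    tensor_state: Dict[str, Any] = {}
    non_tensor_state: Dict[str, Any] = {}
    for key, value in state.items():
        if isinstance(value, FakeTensor):
            tensor_state[key] = value
        else:
            non_tensor_state[key] = deepcopy(value)
    return tensor_state, non_tensor_state


# One ordinary tensor entry plus customized non-tensor values, including None.
# "value1" and its value are assumptions made for this example.
state = {
    "exp_avg": FakeTensor([0.0, 0.0]),
    "step": 3,
    "value1": 2.74,
    "value2": None,
}
tensors, others = split_optimizer_state(state)
```

In this sketch only `exp_avg` ends up in `tensors`; everything else, `None` included, is carried in `others` and could be written to a checkpoint unchanged.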

pytorch-bot · 2023-04-15T00:48:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99214

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fd264bd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

awgu

Thanks for the super fast fix! This looks good to me. I just left a few nits.

awgu · 2023-04-15T01:24:57Z

test/distributed/fsdp/test_fsdp_optim_state.py

+            osd = FSDP.optim_state_dict(model, optim, optim_state_dict=original_osd)
+            osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd )
+            for param_id, state in osd_to_load["state"].items():
+                # Addd customized value


nit: typo 😄

Suggested change

-                # Addd customized value
+                # Add customized value

awgu · 2023-04-15T01:25:12Z

test/distributed/fsdp/test_fsdp_optim_state.py

+            step()
+            original_osd = deepcopy(optim.state_dict())
+            for param_id, state in original_osd["state"].items():
+                # Addd customized value


Suggested change

-                # Addd customized value
+                # Add customized value

awgu · 2023-04-15T01:26:02Z

test/distributed/fsdp/test_fsdp_optim_state.py

+                state["value2"] = None
+
+            osd = FSDP.optim_state_dict(model, optim, optim_state_dict=original_osd)
+            osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd )


nit: I wonder if lint will complain about this one

Suggested change

-            osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd )
+            osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd)

fegin · 2023-04-17T15:34:39Z

@pytorchbot merge

pytorchmergebot · 2023-04-17T15:37:53Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2023-04-17T15:43:05Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-12-py3-arm64 / build

Details for Dev Infra team

Raised by workflow job

awaelchli

Thanks 🎉 !

fegin · 2023-04-17T21:52:14Z

@pytorchbot merge

pytorchmergebot · 2023-04-17T21:54:06Z

Merge started

fegin requested review from mrshenli, zhaojuanmao and rohan-varma as code owners April 15, 2023 00:48

pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Apr 15, 2023

fegin requested review from H-Huang, awgu, kwen2501, wanchaol, kiukchung and d4l3k as code owners April 15, 2023 00:48

awgu approved these changes Apr 15, 2023

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 17, 2023

fegin mentioned this pull request Apr 17, 2023

AttributeError in FSDP.optim_state_dict() for None values in optimizer state #99079

Closed

pytorchmergebot added the merging label Apr 17, 2023

awaelchli approved these changes Apr 17, 2023

pytorchmergebot added Merged and removed merging labels Apr 17, 2023

pytorchmergebot closed this in bdaf322 Apr 17, 2023

eracah mentioned this pull request May 18, 2023

Error while saving checkpoint mosaicml/composer#2231

Closed

facebook-github-bot deleted the gh/fegin/111/head branch June 8, 2023 17:14

sachalevy mentioned this pull request Oct 13, 2023

Error on save_steps using FSDP pacman100/LLM-Workshop#6

Open

jyothisambolu mentioned this pull request Apr 22, 2024

Adding experimental FSDP support on HPU Lightning-AI/lightning-Habana#174

Merged
