[PT-D][Checkpoint] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint #88698

Closed
wants to merge 9 commits into from

Contributor

@wz337 wz337 commented Nov 8, 2022

Context in RFC: #86620

.rst file will be finalized in subsequent PRs.

@pytorch-bot

pytorch-bot bot commented Nov 8, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88698

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9ef96b7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wz337 wz337 marked this pull request as draft November 8, 2022 21:42
@wz337 wz337 changed the title from "[WIP] Rfc 86620" to "[PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint #88641" Nov 8, 2022
@wz337 wz337 marked this pull request as ready for review November 11, 2022 19:00
@wz337 wz337 requested a review from wanchaol November 11, 2022 19:01
@wz337 wz337 changed the title from "[PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint #88641" to "[PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint" Nov 12, 2022
Contributor

@wanchaol wanchaol left a comment

lgtm, thanks for working on this!

)
sys.modules['torch.distributed._shard.checkpoint'] = torch.distributed.checkpoint

Contributor

this is nice!
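
For readers unfamiliar with the pattern being praised: assigning into `sys.modules` makes the old import path resolve to the same module object as the new one, so existing imports keep working after the move. The sketch below illustrates the idea with stand-in module names (`demo_checkpoint`, `demo_shard_checkpoint`) and a helper `alias_module`, all of which are hypothetical. The deprecation warning is also an assumption about how such a shim is commonly written, not a quote of this PR's code.

```python
import importlib
import sys
import types
import warnings


def alias_module(old_name: str, new_name: str) -> None:
    """Make `import old_name` resolve to the module already loaded as `new_name`."""
    # Assumption: warn callers that the old path is going away.
    warnings.warn(
        f"{old_name} is deprecated, use {new_name} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    # Same idea as the diff line above: point the old name at the new module.
    sys.modules[old_name] = sys.modules[new_name]


# Hypothetical stand-in for the relocated package.
new_mod = types.ModuleType("demo_checkpoint")
new_mod.VERSION = 1
sys.modules["demo_checkpoint"] = new_mod

# Record the warning so the example runs quietly under any warning filter.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    alias_module("demo_shard_checkpoint", "demo_checkpoint")

# Importing the old name now returns the very same module object.
old_mod = importlib.import_module("demo_shard_checkpoint")
```

Because both names map to one object, state set through either path is visible through the other, which is what keeps old call sites working without a code change.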

from .api import CheckpointException


from .planner import (

Contributor

Are these all of the public APIs we want to expose to users? Should we also add an __all__ here to hide the ones we don't want to expose?
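
As background for this question, here is a minimal sketch of what `__all__` does and does not do. The module `demo_api` and its functions are hypothetical and are built in memory only so the example is self-contained. It shows that `__all__` limits what `from module import *` brings in, while names left out remain reachable through the module itself.

```python
import sys
import types

# Source of a hypothetical module that declares its public API via __all__.
source = '''
__all__ = ["save", "load"]

def save():
    return "saved"

def load():
    return "loaded"

def helper():
    return "not exported"
'''

# Build the module in memory and register it so it can be imported by name.
mod = types.ModuleType("demo_api")
exec(source, mod.__dict__)
sys.modules["demo_api"] = mod

# Run a star-import into a fresh namespace and collect what arrived.
namespace = {}
exec("from demo_api import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))

# Names omitted from __all__ are still reachable through the module object.
still_reachable = mod.helper()
```

So `__all__` documents and narrows the star-import surface, but it is not an access control mechanism.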

@@ -4,6 +4,9 @@
ShardMetadata,
)

__all__: List[str] = []

Contributor

Do we need __all__ in every Python module?

@wz337
Contributor Author

wz337 commented Nov 16, 2022

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased rfc_86620 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout rfc_86620 && git pull --rebase)

@wz337
Contributor Author

wz337 commented Nov 16, 2022

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 16, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 additional jobs have failed, first few of them are: trunk, trunk / cuda11.6-py3.10-gcc7-sm86 / test (default, 2, 4, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@wz337
Contributor Author

wz337 commented Nov 16, 2022

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased rfc_86620 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout rfc_86620 && git pull --rebase)

@wz337
Contributor Author

wz337 commented Nov 16, 2022

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@wz337 wz337 changed the title from "[PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint" to "[PT-D][Checkpoint] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint" Nov 16, 2022
huydhn added a commit to huydhn/pytorch that referenced this pull request Nov 17, 2022
Some distributed tests are moved to a new location after pytorch#88698
pytorchmergebot pushed a commit that referenced this pull request Nov 17, 2022
wz337 added a commit to pytorch/PiPPy that referenced this pull request Nov 18, 2022
Update import for spmd checkpoint files, as we have moved distributed
checkpointing from torch.distributed._shard.checkpoint to
torch.distributed.checkpoint in PyTorch
(pytorch/pytorch#88698).

Test:
CI
pytorchmergebot pushed a commit that referenced this pull request Nov 18, 2022
… checkpoint (#89256)

Update test import and docstring as we have moved distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698).

Test: CI
Pull Request resolved: #89256
Approved by: https://github.com/fduwjj
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
…ibuted._shard.checkpoint to torch.distributed.checkpoint (pytorch#88698)

Context in RFC: pytorch#86620

.rst file will be finalized in subsequent PRs.
Pull Request resolved: pytorch#88698
Approved by: https://github.com/wanchaol
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
… checkpoint (pytorch#89256)

Update test import and docstring as we have moved distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (pytorch#88698).

Test: CI
Pull Request resolved: pytorch#89256
Approved by: https://github.com/fduwjj
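
Downstream code affected by the move, such as the import updates referenced above, may need to run against PyTorch builds from both before and after this PR. One way to handle that is to try the new path first and fall back to the old one. The helper `import_first_available` below is hypothetical and uses only the standard library; it is a sketch of the fallback idea, not something taken from PyTorch or PiPPy.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence


def import_first_available(candidates: Sequence[str]) -> ModuleType:
    """Return the first module in `candidates` that imports successfully."""
    last_error: Optional[ImportError] = None
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            last_error = exc
    raise ImportError(f"none of {list(candidates)} could be imported") from last_error


# Intended use (left commented out because it requires torch to be installed):
# dist_cp = import_first_available(
#     ["torch.distributed.checkpoint", "torch.distributed._shard.checkpoint"]
# )
```

Listing the new path first means the old one is only touched on builds that predate the move.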
Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged
3 participants