[dist.checkpoint] Change metadata format and improve error reporting #82078
✅ No Failures (3 Pending) as of commit 570c270. This comment was automatically generated by Dr. CI.
looks good to me, left some inlined comments about storage metadata and docstr update
lgtm, one minor question
Is there still a need for BytesStorageMetadata, or are we just leaving it here for completeness and future use?
We need it as a marker type and for probable future use.
The issue is that we need an entry in the metadata state_dict for a given BytesIO entry to signal that it was saved in the first place.
Part of the representational issue here, which might be addressed later, is that we're saving opaque data. Down the line, once we support non-pickle payloads that have a schema, I expect this type to be where the schema lives.
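A rough sketch of the marker idea, for illustration only and not the actual dist.checkpoint code. Only the `BytesStorageMetadata` name comes from this thread; its empty body, the `build_metadata` helper, and the `"tensor"` placeholder are assumptions.

```python
import io
from dataclasses import dataclass


@dataclass
class BytesStorageMetadata:
    """Marker type (hypothetical body): carries no fields today.

    Its presence in the metadata records that an opaque bytes payload
    was saved under a given key. A schema could be added here later.
    """


def build_metadata(state_dict):
    """Hypothetical helper: map each state_dict key to a metadata entry."""
    metadata = {}
    for key, value in state_dict.items():
        if isinstance(value, io.BytesIO):
            # Opaque payload: record only that it exists.
            metadata[key] = BytesStorageMetadata()
        else:
            # Placeholder standing in for real tensor metadata.
            metadata[key] = "tensor"
    return metadata


md = build_metadata({"extra": io.BytesIO(b"pickled blob"), "weight": [1.0, 2.0]})
print("extra" in md, isinstance(md["extra"], BytesStorageMetadata))
```

With this shape, a loader can tell a key that was never saved (absent from the metadata) from a key that was saved as an opaque blob (present, with the marker).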
lgtm
Nit: maybe call it "storage_md" or "storage_metadata"?
The whole bit around st_md is thankfully going away as it's moved entirely into the storage layer.
@pytorchmergebot merge
@pytorchbot successfully started a merge job.
Merge failed: refusing to merge because mandatory check(s) pull failed for rule superuser.
@pytorchmergebot rebase
@pytorchbot successfully started a rebase job.
Move dist.checkpoint to the new, simplified metadata schema. This simplifies operations by unifying the Tensor and ShardedTensor representations: a Tensor shows up as a single-shard ShardedTensor. This PR also addresses one major issue with CheckpointException: it doesn't carry backtraces.
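The unification can be pictured with a minimal sketch. The `ShardEntry` and `TensorEntry` classes and both helper functions are hypothetical names chosen for illustration; they are not the schema this PR introduces, only one way a plain tensor could be described as a single shard.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ShardEntry:
    """Hypothetical per-shard record: where a shard sits in the full tensor."""
    offsets: Tuple[int, ...]
    sizes: Tuple[int, ...]


@dataclass
class TensorEntry:
    """Hypothetical logical entry shared by plain and sharded tensors."""
    global_size: Tuple[int, ...]
    shards: List[ShardEntry]


def entry_for_plain_tensor(shape: Tuple[int, ...]) -> TensorEntry:
    # A plain tensor is one shard at offset zero covering the whole shape.
    return TensorEntry(
        global_size=shape,
        shards=[ShardEntry(offsets=(0,) * len(shape), sizes=shape)],
    )


def entry_for_sharded_tensor(shape, shards) -> TensorEntry:
    # A sharded tensor lists every (offsets, sizes) pair it is made of.
    return TensorEntry(
        global_size=shape,
        shards=[ShardEntry(offsets=o, sizes=s) for o, s in shards],
    )


plain = entry_for_plain_tensor((4, 8))
sharded = entry_for_sharded_tensor((4, 8), [((0, 0), (2, 8)), ((2, 0), (2, 8))])
print(len(plain.shards), len(sharded.shards))
```

Because both cases produce the same entry type, code that walks the metadata needs only one path: iterate over `shards`, whether there is one or many.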
Successfully rebased
3596139 to 570c270
@pytorchmergebot merge
@pytorchbot successfully started a merge job.
…82078) (#82078)
Pull Request resolved: #82078
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/f4ee37453cc8ad9e0b7eafeaabf11d22ba0c50fd
Reviewed By: kit1980
Differential Revision: D38395325
Pulled By: kumpera
fbshipit-source-id: c3f85b8b52470a07d22529317898463a9c07176d
This PR implements the following changes.
Move to new checkpoint metadata format with split between logical and storage data.
This is a step in the direction of supporting extensible checkpointing as it moves us away from the hardcoded storage model enforced by the FileSystem storage layer.
Change CheckpointException to include the exception traceback. Exception tracebacks are not serializable, so we need to handle that explicitly; otherwise we provide horribly bad errors to users.
Finally, remove `validate_state_dict` as it lost its usefulness. Loading is becoming more and more flexible to the point that the only reasonable way to verify if it's possible to load a given configuration is to actually try it.
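The last two points can be sketched together: attempt the load, and on failure keep the traceback as formatted text so the failure record can be serialized. This is a minimal sketch under stated assumptions, not the CheckpointException implementation; `try_load`, `bad_load`, and the dictionary layout are hypothetical.

```python
import pickle
import traceback


def try_load(rank: int, load_fn):
    """Attempt the load; return None on success or a picklable failure record."""
    try:
        load_fn()
        return None
    except Exception as exc:
        # Traceback objects cannot be pickled, so keep the formatted text.
        return {
            "rank": rank,
            "error": repr(exc),
            "traceback": traceback.format_exc(),
        }


def bad_load():
    # Hypothetical loader that fails the way a mismatched checkpoint might.
    raise KeyError("missing entry 'weight'")


failure = try_load(0, bad_load)
# The record holds only ints and strings, so it survives a pickle round trip.
restored = pickle.loads(pickle.dumps(failure))
print(restored["error"])
print(restored["traceback"])
```

Storing strings rather than the live traceback loses the ability to inspect frames, but the text is enough to show the user where the load failed.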