Skip to content

Conversation

@kumpera
Copy link
Contributor

@kumpera kumpera commented Jul 24, 2022

This PR implements the following changes.

Move to new checkpoint metadata format with split between logical and storage data.
This is a step in the direction of supporting extensible checkpointing as it moves us away from the hardcoded storage model enforced by the FileSystem storage layer.

Change CheckpointException to include exception traceback. Exception tracebacks are not serializable so we need to take care of that otherwise we provide horribly bad errors to users.

Finally, remove validate_state_dict as it lost its usefulness. Loading is becoming more and more flexible to the point that the only reasonable way to verify if it's possible to load a given configuration is to actually try it.

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Jul 24, 2022

🔗 Helpful links

✅ No Failures (3 Pending)

As of commit 570c270 (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@kumpera kumpera requested a review from fduwjj July 26, 2022 19:12
@kumpera kumpera changed the title [Draft] Change metadata format and couple small cleanups [dist.checkpoint] Change metadata format and improve error reporting Jul 28, 2022
@kumpera kumpera requested a review from wz337 July 28, 2022 16:23
Copy link
Collaborator

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, left some inlined comments about storage metadata and docstr update

Copy link
Collaborator

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, one minor question

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there still a need for BytesStorageMetadata or we just leave it here for completeness and future usage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need it as a marker type and probable future usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that we need to have an entry on the metadata state_dict for a given ByteIO entry to signal that it was saved to begin with.

Part of the representational issue here, which might be addressed later is that we're saving opaque stuff. One thing I expect this to be used with is down the line once we support non-pickle payloads that have a schema - and the schema would be here.

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe call it "storage_md" or "storage_metadata"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole bit around st_md is thankfully going away as it's moved entirely into the storage layer.

@kumpera
Copy link
Contributor Author

kumpera commented Aug 3, 2022

@pytorchmergebot merge

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Merge failed due to Refusing to merge as mandatory check(s) pull failed for rule superuser
Raised by https://github.com/pytorch/pytorch/actions/runs/2789936517

@kumpera
Copy link
Contributor Author

kumpera commented Aug 3, 2022

@pytorchmergebot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

Rodrigo Kumpera added 4 commits August 3, 2022 14:00
Move dist.checkpoint to the new, simplified metadata schema.

This simplifies operations by unifying Tensor and ShardedTensor representation by
having the former show up as a single shard ST.

This PR addresses one major issue of CheckpointException that it doesn't carry around
backtraces.
@pytorchmergebot
Copy link
Collaborator

Successfully rebased change_md onto refs/remotes/origin/master, please pull locally before adding more changes (for example, via git checkout change_md && git pull --rebase)

@kumpera
Copy link
Contributor Author

kumpera commented Aug 3, 2022

@pytorchmergebot merge

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@kumpera kumpera added release notes: distributed (sharded) release notes category topic: improvements topic category labels Aug 3, 2022
facebook-github-bot pushed a commit that referenced this pull request Aug 4, 2022
…82078) (#82078)

Summary:
This PR implements the following changes.

Move to new checkpoint metadata format with split between logical and storage data.
This is a step in the direction of supporting extensible checkpointing as it moves us away from the hardcoded storage model enforced by the FileSystem storage layer.

Change CheckpointException to include exception traceback. Exception tracebacks are not serializable so we need to take care of that otherwise we provide horribly bad errors to users.

Finally, remove `validate_state_dict` as it lost its usefulness. Loading is becoming more and more flexible to the point that the only reasonable way to verify if it's possible to load a given configuration is to actually try it.

Pull Request resolved: #82078
Approved by: https://github.com/wanchaol, https://github.com/fduwjj

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/f4ee37453cc8ad9e0b7eafeaabf11d22ba0c50fd

Reviewed By: kit1980

Differential Revision: D38395325

Pulled By: kumpera

fbshipit-source-id: c3f85b8b52470a07d22529317898463a9c07176d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (sharded) release notes category topic: improvements topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants