add base checkpointer class (#626)
Summary:
Pull Request resolved: #626

# Context
We want to add another checkpointer using [DCP](https://pytorch.org/docs/stable/distributed.checkpoint.html). However, we don't want to duplicate the logic that already exists in `TorchSnapshotSaver` for checkpoint frequency, keeping the k latest checkpoints, etc.

# This Diff
* Adds an abstract `BaseCheckpointer` class that implements the common logic, such as syncing the dirpath across all ranks and implementing all hooks where a checkpoint may occur.

* Any subclass must implement the `_checkpoint_impl` and `restore` methods. The `restore_from_latest` method calls the user-defined `restore`.

* Copies all applicable tests from `TorchSnapshotSaver` into `BaseCheckpointer`'s tests (the relevant `TorchSnapshotSaver` tests will be removed in the next diff).
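
The split between shared logic and subclass hooks described above can be sketched roughly as follows. This is a minimal, simplified illustration and not the code added in this diff: `SketchCheckpointer`, `JsonCheckpointer`, `on_step_end`, `demo`, and all constructor parameters are hypothetical names, checkpoints are plain JSON files, and rank syncing is omitted. Only the names `_checkpoint_impl`, `restore`, and `restore_from_latest` come from the summary, and their signatures here are assumptions.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchCheckpointer(ABC):
    """Hypothetical stand-in for the base class (not the real API)."""

    def __init__(self, dirpath: str, save_every_n_steps: int, keep_last_n: int) -> None:
        self.dirpath = dirpath
        self.save_every_n_steps = save_every_n_steps
        self.keep_last_n = keep_last_n
        self._saved: List[str] = []  # paths of retained checkpoints, oldest first

    def on_step_end(self, step: int, state: Dict[str, Any]) -> None:
        # Shared logic: decide when to save, then prune old checkpoints.
        if step % self.save_every_n_steps != 0:
            return
        path = os.path.join(self.dirpath, f"step_{step}")
        if self._checkpoint_impl(state, path):
            self._saved.append(path)
            while len(self._saved) > self.keep_last_n:
                os.remove(self._saved.pop(0))

    @abstractmethod
    def _checkpoint_impl(self, state: Dict[str, Any], path: str) -> bool:
        """Write one checkpoint; return True on success."""

    @staticmethod
    @abstractmethod
    def restore(path: str, state: Dict[str, Any]) -> None:
        """Load the checkpoint at `path` into `state`."""

    @classmethod
    def restore_from_latest(cls, dirpath: str, state: Dict[str, Any]) -> bool:
        # Shared logic: find the newest checkpoint, then defer to `restore`.
        names = [n for n in os.listdir(dirpath) if n.startswith("step_")]
        if not names:
            return False
        latest = max(names, key=lambda n: int(n.split("_")[1]))
        cls.restore(os.path.join(dirpath, latest), state)
        return True


class JsonCheckpointer(SketchCheckpointer):
    """Hypothetical subclass: only the storage format lives here."""

    def _checkpoint_impl(self, state: Dict[str, Any], path: str) -> bool:
        with open(path, "w") as f:
            json.dump(state, f)
        return True

    @staticmethod
    def restore(path: str, state: Dict[str, Any]) -> None:
        with open(path) as f:
            state.update(json.load(f))


def demo():
    # Save every 2 steps over 8 steps, keeping the 2 latest checkpoints.
    with tempfile.TemporaryDirectory() as dirpath:
        ckpt = JsonCheckpointer(dirpath, save_every_n_steps=2, keep_last_n=2)
        for step in range(1, 9):
            ckpt.on_step_end(step, {"step": step})
        files = sorted(os.listdir(dirpath))
        restored: Dict[str, Any] = {}
        JsonCheckpointer.restore_from_latest(dirpath, restored)
        return files, restored
```

In this sketch a second backend (for example one built on DCP) would only need to supply its own `_checkpoint_impl` and `restore`; the frequency and retention logic would be inherited unchanged.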

Reviewed By: galrotem

Differential Revision: D51328340

fbshipit-source-id: b7bc65c294fabf5d3671735dc1afcf54c1c59a1b
JKSenthil authored and facebook-github-bot committed Dec 5, 2023
1 parent de7bbdb commit e11cfb2
Showing 2 changed files with 763 additions and 0 deletions.
