HfHubWriter: Save checkpoints on Hugging Face Hub (#881)
This PR adds the possibility to create model checkpoints on Hugging Face Hub. For this to work, we introduce a new HfHubWriter class, which can be passed instead of a file name to Checkpoint or TrainEndCheckpoint (and hopefully other callbacks out there in the wild).

The design goal of this PR was to re-use existing checkpoint callbacks instead of writing a new one. This is much more scalable, since we could use a similar design to enable storing on S3, GCS, etc.

One of the difficulties here was to make this work with both pickle.dump and torch.save. The latter takes a few different turns under the hood. Therefore, I had to adjust open_file_like to be more similar to what torch does. It should, however, still work with existing code (at least the tests still pass).

There are some limitations to the current design. For instance, it only supports writing for now. Therefore, a checkpoint using the new feature does not work with LoadInitState. Furthermore, the uploads are performed synchronously. This is mostly to save us some headaches. However, the obvious disadvantage is that if the upload is slow (compared to training time), we can see a considerable slowdown. Therefore, I recommend using this feature with TrainEndCheckpoint rather than Checkpoint.

Notebook

There is a notebook included to test this feature on the real HF Hub.

This PR also fixes a wrong URL used in Hugging_Face_Finetuning.ipynb.
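To illustrate the general idea of passing a writer object where a file name would normally go, here is a minimal sketch. It is not the HfHubWriter implementation from this PR: the class BufferedUploadWriter and its upload_fn parameter are hypothetical names made up for this example, and the "upload" is just a list append so that the snippet runs without network access. It only covers pickle.dump, which needs nothing more than a write method. torch.save may expect more of the object, which is why open_file_like had to be adjusted.

```python
import io
import pickle


class BufferedUploadWriter:
    """Hypothetical file-like writer (not part of skorch).

    Collects everything written to it in memory and hands the bytes
    to a blocking upload callable when flush() is called.
    """

    def __init__(self, upload_fn):
        # upload_fn is an assumption of this sketch: any callable taking bytes,
        # standing in for a synchronous "upload to remote storage" call.
        self.upload_fn = upload_fn
        self._buffer = io.BytesIO()

    def write(self, data):
        # pickle.dump only requires a write() method that accepts bytes-like data.
        return self._buffer.write(data)

    def flush(self):
        # The upload happens synchronously: the caller waits until it returns.
        self.upload_fn(self._buffer.getvalue())
        self._buffer = io.BytesIO()


# Stand-in for a remote store so the example runs offline.
uploaded = []
writer = BufferedUploadWriter(uploaded.append)

# The writer is passed where an open file would normally be passed.
pickle.dump({"weights": [1.0, 2.0]}, writer)
writer.flush()

# Reading back goes through the stand-in store, not through the writer,
# mirroring the "write only" limitation described above.
restored = pickle.loads(uploaded[0])
```

Because flush blocks until upload_fn returns, a slow upload would stall whatever called it. With a callback that saves during training, that cost could be paid at every save, whereas saving once at the end of training pays it only once.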