@tushar00jain tushar00jain commented Oct 11, 2025

@meta-cla meta-cla bot added the CLA Signed label Oct 11, 2025
tianyu-l pushed a commit that referenced this pull request Oct 12, 2025
Summary:
Allows disabling the storage of checkpoints related to torchft.

Users no longer have to rely on any external storage, which reduces the
setup time needed to get things up and running. Model checkpoints are
also not really needed when torchft is in use. And if checkpoint storage
has issues, this can work as a killswitch that completely disables the
storage so it doesn't impact training.
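
A minimal sketch of how such a killswitch could be wired, for illustration only. The names `CheckpointSettings`, `enable_ft_storage`, and `maybe_save_ft_checkpoint` are hypothetical and are not taken from this PR or from torchtitan; the actual option name and code path may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class CheckpointSettings:
    # Hypothetical flag: when False, torchft-related checkpoint state is
    # never written to storage. The real option name may differ.
    enable_ft_storage: bool = True


def maybe_save_ft_checkpoint(
    settings: CheckpointSettings,
    step: int,
    state: Dict[str, Any],
    write: Callable[[int, Dict[str, Any]], None],
) -> bool:
    """Write fault-tolerance state unless the killswitch is off.

    Returns True if `write` was called, False if storage was skipped.
    """
    if not settings.enable_ft_storage:
        # Killswitch: skip storage entirely so a broken or missing
        # storage backend cannot affect the training loop.
        return False
    write(step, state)
    return True
```

With the flag off, the `write` callable is never invoked, so no storage backend needs to be configured at all.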

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1810).
* #1856
* #1811
* __->__ #1810

Co-authored-by: Tushar Jain <tushar00jain@users.noreply.github.com>
Summary:
Record the profile trace if the training process receives SIGABRT, e.g. when the Process Group watchdog aborts the process.
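
A rough sketch of the general idea, not the code from this commit. It uses only Python's standard `signal` module; `make_abort_handler`, `install_abort_handler`, and the `export_trace` callable are hypothetical names, and `export_trace` stands in for whatever actually writes the profiler trace. One caveat of this approach: Python-level signal handlers only run once the interpreter regains control on the main thread, so whether it fires for an abort raised from native code depends on how the signal is delivered.

```python
import os
import signal
from typing import Callable


def make_abort_handler(export_trace: Callable[[], None], reraise: bool = True):
    """Build a signal handler that exports a trace before the process dies.

    `export_trace` is a hypothetical callable supplied by the caller.
    """

    def handler(signum, frame):
        try:
            export_trace()
        finally:
            if reraise:
                # Restore the default action and re-deliver the signal so
                # the process still terminates as it would have otherwise.
                signal.signal(signum, signal.SIG_DFL)
                os.kill(os.getpid(), signum)

    return handler


def install_abort_handler(export_trace: Callable[[], None]) -> None:
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGABRT, make_abort_handler(export_trace))
```

Calling `install_abort_handler(save_trace)` once at startup, where `save_trace` is the caller's own function, would register the handler; the `reraise=False` option exists only so the handler can be exercised without terminating the process.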
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025