@dsesh dsesh commented Jul 18, 2025

Differential Revision: D78493333

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

pytorch-bot bot commented Jul 18, 2025

This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @dsesh, please complete step 2 of the internal wiki to get write access so you do not need CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.

@pytorch-bot pytorch-bot bot added the `oncall: distributed` and `release notes: distributed (checkpoint)` labels Jul 18, 2025

pytorch-bot bot commented Jul 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158612

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Pending, 1 Unrelated Failure

As of commit 0c3f615 with merge base 7b72e5b:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D78493333


@d4l3k d4l3k left a comment

LGTM

Do you also want to set the pthreads name for quickstack/gdb/below?

Ex:

torch.multiprocessing._set_thread_name("pt_data_pin")
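
To illustrate the distinction raised above, here is a minimal sketch that assumes the executor is a standard `concurrent.futures.ThreadPoolExecutor`. `thread_name_prefix` sets the Python-level thread name, while an OS-level (pthread) name for native tools would be set from inside the worker, for example through an `initializer`. The helper `set_os_thread_name` and the name `pt_async_ckpt` are hypothetical placeholders, not part of this PR; in PyTorch the call suggested above would take the helper's place.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an OS-level naming hook. The reviewer's suggestion
# would be torch.multiprocessing._set_thread_name(name), a private PyTorch API;
# it is not imported here so the sketch stays stdlib-only.
os_level_names = []


def set_os_thread_name(name: str) -> None:
    # Placeholder: record the requested name instead of calling into pthreads.
    os_level_names.append(name)


def worker_init(name: str) -> None:
    # Runs once inside each worker thread, before it picks up any work.
    # Note: Linux limits pthread names to 15 characters plus the terminator,
    # so a short name is used here (assumed, not taken from the PR).
    set_os_thread_name(name)


def current_thread_name() -> str:
    # Report the Python-level name of whichever thread runs this task.
    return threading.current_thread().name


executor = ThreadPoolExecutor(
    max_workers=1,
    thread_name_prefix="AsyncCheckpointExecutor",  # Python-level thread name
    initializer=worker_init,
    initargs=("pt_async_ckpt",),
)
python_level_name = executor.submit(current_thread_name).result()
executor.shutdown(wait=True)

print(python_level_name)
```

Whether the Python-level name also reaches the OS thread depends on the Python version, so setting both explicitly is the conservative choice if the threads need to be identifiable in gdb-style tools as well as in Python-aware profilers.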

@pytorch-bot pytorch-bot bot added the `ciflow/trunk` label Jul 18, 2025
dsesh added a commit to dsesh/pytorch that referenced this pull request Jul 18, 2025
…ecutor (pytorch#158612)

Summary: Pull Request resolved: pytorch#158612

Test Plan:
Built `training_platform.worker:c44710f0f91f8b4ed58f517864a7ba8b`, and ran [`f765449601-TrainingApplication_S3BTR`](https://www.internalfb.com/mlhub/pipelines/runs/mast/f765449601-TrainingApplication_S3BTR?version=0&tab=summary&env=PRODUCTION).

[SBDive profile](https://www.internalfb.com/intern/sbdive/?id=f765449601-TrainingApplication_S3BTR-84eceeca-8b03-410b-9e39-3840ec9cf185) of the job shows threads named `AsyncCheckpointExecutor`

All of rank0's Python threads
{F1980370763}

All `AsyncCheckpointExecutor` across ranks
{F1980371593}

Rollback Plan:

Reviewed By: d4l3k

Differential Revision: D78493333
@dsesh dsesh force-pushed the export-D78493333 branch from 3089a5f to 8809651 Compare July 18, 2025 00:45
@dsesh dsesh force-pushed the export-D78493333 branch from 8809651 to 0c3f615 Compare July 18, 2025 00:50
@facebook-github-bot

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2022-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

Details for Dev Infra team: raised by workflow job

@wdvr

wdvr commented Jul 18, 2025

@pytorchmergebot merge -i

@pytorchmergebot

Merge started

Your change will be merged while ignoring the following 2 checks: pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable), trunk / win-vs2022-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

Labels: ciflow/trunk, fb-exported, Merged, oncall: distributed, release notes: distributed (checkpoint)
