[pipelining] [BE] Move pipeline_order validation to schedules.py #129369
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129369
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures: as of commit 57f195c with merge base 37e3c60, no failures were detected.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
# Changes

* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which, on error, logs the timestep and schedule plan before re-raising the exception.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

[ghstack-poisoned]
# Changes

* Small fix in stage error message.
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which, on error, logs the timestep and schedule plan before re-raising the exception.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

[ghstack-poisoned]
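A minimal sketch of the try-except wrapper described in the changes above, assuming a `pipeline_order` dict of per-rank action lists; `run_schedule`, `execute_action`, and this standalone `format_pipeline_order` stand-in are illustrative names, not the actual diff:

```python
import logging

logger = logging.getLogger(__name__)

def format_pipeline_order(pipeline_order):
    # Stand-in for the helper this PR moves into schedules.py: render the
    # per-rank action lists as one readable block.
    return "\n".join(
        f"rank {rank}: {actions}" for rank, actions in sorted(pipeline_order.items())
    )

def run_schedule(execute_action, pipeline_order, rank):
    # Hypothetical runtime loop; in the PR, the wrapping happens inside the
    # schedule's own step execution.
    for time_step, action in enumerate(pipeline_order[rank]):
        try:
            execute_action(action)
        except Exception:
            # Log the failing timestep and the whole schedule plan, then
            # re-raise so the original traceback is preserved.
            logger.error("Exception in rank %s at time step %s", rank, time_step)
            logger.error("Schedule plan:\n%s", format_pipeline_order(pipeline_order))
            raise
```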
    ]
    # Join the rows into a single string
    formatted_table = (
        "=========== ALL_RANK_ACTIONS ===========\n"
nit: I think this function could stay as a helper function, maybe called by `__repr__`.
It's also a little better IMO to remove the `====ALL_RANK...` part so that inside `__repr__` we can customize a bit more.
`__repr__` should probably print out the class name and relevant field values, and maybe a shortened summary of the actions, but not the whole page-long action list by default. I'd suggest looking at how `torch.Tensor`'s repr works.
    PipelineSchedule
    [... possibly some args, such as number of stages, etc.]
    Actions
    [ ... ]
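A minimal Python sketch of a `__repr__` along those lines, assuming hypothetical `n_stages` and `pipeline_order` fields; the four-action truncation is illustrative, loosely following how `torch.Tensor` summarizes large tensors:

```python
class PipelineSchedule:
    def __init__(self, n_stages, pipeline_order):
        self.n_stages = n_stages              # e.g. 2
        self.pipeline_order = pipeline_order  # {rank: [action, ...]}

    def __repr__(self):
        # Class name and key fields first, then a shortened action summary
        # rather than the whole page-long plan.
        lines = [f"{type(self).__name__}(n_stages={self.n_stages})", "Actions:"]
        for rank, actions in sorted(self.pipeline_order.items()):
            shown = ", ".join(map(str, actions[:4]))
            suffix = ", ..." if len(actions) > 4 else ""
            lines.append(f"  rank {rank}: [{shown}{suffix}]")
        return "\n".join(lines)

print(PipelineSchedule(2, {0: ["F0", "F1", "B0", "B1", "F2"], 1: ["F0", "B0"]}))
```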
        _batch_p2p(ops).wait()
    except Exception as e:
        logger.error(
            "Exception in rank %s at time step %s", self.rank, time_step
This string should include more info. Not sure how much is right, but some things that might be good:
- which stage?
- which action?
Maybe also phrase it more like "PipelineSchedule {schedule name?} caught the following exception when running the {}th microbatch {Action} for stage {}" or something?
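A sketch of the richer message along those lines; `schedule_name`, `mb_index`, `action`, and `stage_index` are assumed names for values the runtime would have in scope:

```python
import logging

logger = logging.getLogger(__name__)

def log_schedule_exception(schedule_name, mb_index, action, stage_index):
    # Hypothetical helper: identify the schedule, microbatch, action, and
    # stage so a failure can be located in the plan without a re-run.
    logger.error(
        "PipelineSchedule %s caught the following exception when running "
        "microbatch %s, action %s, for stage %s",
        schedule_name, mb_index, action, stage_index,
    )
```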
…les.py" # Changes * small fix in stage error message * Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`. * Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
|
@pytorchbot merge |
Merge failedReason: This PR needs a If not, please add the To add a label, you can comment to pytorchbot, for example For more information, see Details for Dev Infra teamRaised by workflow job |
…les.py" # Changes * small fix in stage error message * Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`. * Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
|
@pytorchbot merge -i |
…les.py" # Changes * small fix in stage error message * Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`. * Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
|
@pytorchbot merge -i

Merge started
Your change will be merged while ignoring the following 2 checks: pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu), trunk / linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build-test / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu)
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "broke test/distributed/pipelining/test_schedule.py::ScheduleTest::test_non_symmetric_stage_ids_ScheduleClass0 on distributed cuda https://github.com/pytorch/pytorch/actions/runs/9766039400/job/26959115773 https://hud.pytorch.org/pytorch/pytorch/commit/ec789a3c9ddd4e550b3dea6934ce2d41deb98784. You can see the error on the PR, but Dr. CI classified it wrong" -c weird

@pytorchbot successfully started a revert job. Check the current status here.

@H-Huang your PR has been successfully reverted.
….py (#129369)" This reverts commit ec789a3. Reverted #129369 on behalf of https://github.com/clee2000 due to broke test/distributed/pipelining/test_schedule.py::ScheduleTest::test_non_symmetric_stage_ids_ScheduleClass0 on distributed cuda https://github.com/pytorch/pytorch/actions/runs/9766039400/job/26959115773 https://hud.pytorch.org/pytorch/pytorch/commit/ec789a3c9ddd4e550b3dea6934ce2d41deb98784. You can see the error on the PR, but Dr. CI classified it wrong ([comment](#129369 (comment)))
…les.py" # Changes * small fix in stage error message * Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`. * Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
|
Thanks for catching that @clee2000, yeah the error was legit. Fixed now |
|
@pytorchbot merge

Merge failed
Reason: Not merging any PRs at the moment because there is a merge blocking https://github.com/pytorch/pytorch/labels/ci:%20sev issue open.
Details for Dev Infra team: raised by workflow job.
@pytorchbot merge

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Changes
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k