Conversation

jeanschmidt
Contributor

@jeanschmidt jeanschmidt commented May 1, 2023

This is a reopening of PR 100091

About this PR

Due to increased pressure on our Windows runners, and the elevated cost of spinning those instances up and tearing them down, we want to migrate them from ephemeral to non-ephemeral.

Possible impacts include breakages or misbehavior in CI jobs that leave the runners in a bad state. Other possible impacts relate to resource exhaustion, especially disk space, though memory is also a contender, as CI leftovers pile up on those instances.

As a middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

The instance definitions can be found here: pytorch/test-infra#4072
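
For orientation, here is a minimal sketch of what a non-ephemeral runner entry in the test-infra scale configuration might look like. The schema, field names, and values below are assumptions for illustration only, not the actual definition from pytorch/test-infra#4072:

```yaml
# Hypothetical scale-config entry; every field and value here is illustrative.
runner_types:
  windows.4xlarge.nonephemeral:
    instance_type: c5d.4xlarge   # assumed EC2 instance type
    os: windows
    is_ephemeral: false          # the key change: instances are reused across jobs
    max_available: 420           # assumed capacity cap
    disk_size: 256               # assumed size in GiB; disk fill-up is the main reuse risk
```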

This is the first step in a multi-step approach in which we will migrate away from all ephemeral Windows instances, following the lead of windows.g5.4xlarge.nvidia.gpu, in order to help reduce queue times for those instances. The phases are:

  • migrate windows.4xlarge to windows.4xlarge.nonephemeral instances under pytorch/pytorch
  • migrate windows.8xlarge.nvidia.gpu to windows.8xlarge.nvidia.gpu.nonephemeral instances under pytorch/pytorch
  • submit PRs to all repositories under pytorch/ organization to migrate windows.4xlarge to windows.4xlarge.nonephemeral
  • submit PRs to all repositories under pytorch/ organization to migrate windows.8xlarge.nvidia.gpu to windows.8xlarge.nvidia.gpu.nonephemeral
  • retire windows.4xlarge and windows.8xlarge.nvidia.gpu
  • evaluate and start the work related to the adoption of windows.g5.4xlarge.nvidia.gpu to replace windows.8xlarge.nvidia.gpu.nonephemeral in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to narrow the set of possible culprits to investigate if particular CI jobs misbehave.
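
Concretely, each migration step amounts to swapping a runner label in the affected workflow files. A minimal sketch of such a swap, assuming a hypothetical job (only the runner labels come from this PR):

```yaml
# Hypothetical workflow job illustrating the label change.
jobs:
  win-build:
    runs-on: windows.4xlarge.nonephemeral   # was: windows.4xlarge
    steps:
      - uses: actions/checkout@v3
      - name: Build
        run: python setup.py develop        # placeholder build step
```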

Copilot Summary

🤖 Generated by Copilot at 579d87a

This pull request migrates some windows workflows to use nonephemeral runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: .github/templates/windows_binary_build_workflow.yml.j2, .github/workflows/generated-windows-binary-*.yml, .github/workflows/pull.yml, .github/actionlint.yaml, .github/workflows/_win-build.yml, .github/workflows/periodic.yml, and .github/workflows/trunk.yml.
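
Since `.github/actionlint.yaml` is among the touched files, presumably to register the new labels with the workflow linter, a sketch of that registration is shown below. It follows actionlint's documented `self-hosted-runner` config schema; the exact entries in this PR are an assumption:

```yaml
# Register the new runner labels so actionlint accepts them in runs-on fields.
self-hosted-runner:
  labels:
    - windows.4xlarge.nonephemeral
    - windows.8xlarge.nvidia.gpu.nonephemeral
```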

Copilot Poem

🤖 Generated by Copilot at 579d87a

We're breaking free from the ephemeral chains
We're running on the nonephemeral lanes
We're building faster, testing stronger, supporting newer
We're the non-ephemeral runners of fire

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

issue: #101209

@jeanschmidt jeanschmidt added the ciflow/binaries, ciflow/trunk, topic: not user facing, and ciflow/periodic labels May 1, 2023
@jeanschmidt jeanschmidt requested a review from a team as a code owner May 1, 2023 14:18
@pytorch-bot

pytorch-bot bot commented May 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100377

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 14 New Failures, 3 Unrelated Failures

As of commit c944d59:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base 5fbb406:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeanschmidt jeanschmidt requested review from atalman, huydhn and malfet May 1, 2023 14:21
@jeanschmidt jeanschmidt added the ciflow/nightly label May 1, 2023
@jeanschmidt
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/4851954751

jeanschmidt and others added 15 commits May 1, 2023 16:51
Differential Revision: D45416983

Pull Request resolved: #100388
…100275)

Summary: to reduce the peak memory consumption

Pull Request resolved: #100275
Approved by: https://github.com/jansel
This PR splits OutputGraph into two classes:
- SubgraphTracer (handles FX-tracing)
- OutputGraph (handles Dynamo-specific output graph logic, like
tracking graph inputs, compiling the graph, and executing it).

The motivation behind this is in the next PR up in the stack.
TL;DR is: in order to do higher-order operators, we need nested
SubgraphTracer, one for each level of nesting of the higher-order
operators.

I'm happy to flatten the stack into a single PR, but this separation made
it easier for me to test. Lmk if you want the stack flattened.

Test Plan:
- existing tests

Pull Request resolved: #99987
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
This PR introduces a `wrap(body_fn, *args)` higher-order operator.
The semantics of `wrap(body_fn, *args)` are to just run `body_fn(*args)`.

Underneath Dynamo, this PR makes it so that we rewrite calls to
`wrap(body_fn, *args)` with `wrap(new_fn, *new_args)`, where `new_fn` has
no free variables. This PR does not update cond/map to use the new
mechanism yet (we do not support nn.Modules yet; that will come in the future).

The design we take is:
- OutputGraph represents the graph being built by Dynamo that may be
compiled and executed.
- OutputGraph owns a root SubgraphTracer, where it builds the FX graph.
- OutputGraph may own multiple nested SubgraphTracers.
- When we need to trace the body function of a HigherOrderOperator, we
construct a new SubgraphTracer to build the graph of the body function.

Mechanically, when Dynamo sees a new `wrap` HigherOrderOperator with a
body function, it:
- Creates a new SubgraphTracer via OutputGraph.new_subtracer
- Executes the body function
This captures the body function into the graph on the new
SubgraphTracer while modifying the state of the OutputGraph. For
example, the OutputGraph may receive new GraphArgs, new guards, and new
side effects.

If capture of the body function fails, then Dynamo graph breaks on the
HigherOrderOperator.

Test Plan:
- added test/dynamo/test_higher_order_ops.py

Future:
- We're not actually able to tell Dynamo to completely graph break on the
HigherOrderOperator. Instead, when we do graph break, Dynamo begins
introspecting `HigherOrderOperator.__call__`. It should probably not do
this.
- Ideally we would error out on new SideEffects. I don't know how to do
this yet.
- We don't support dealing with nn.Modules yet (e.g. calling nn.Modules
or accessing attributes of tracked nn.Modules from a body_fn). There's
an open question about what should actually happen here.
- Ideally we would rewrite map/cond to use the new mechanism but we need
to fix the previous bullet point before we can get there.

Pull Request resolved: #99988
Approved by: https://github.com/voznesenskym, https://github.com/anijain2305
This reverts commit ea5f6d7.

Reverted #100151 on behalf of https://github.com/atalman due to breaking internal builds ([comment](#100151 (comment)))
Follow-up after #100436 to disable download.pytorch.org access over IPv6 due to access problems.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 55c9443</samp>

This pull request improves the network configuration of the test-pytorch-binary GitHub action and workflow by mounting the host's `/etc/hosts` file into the container. This enables the container to resolve hostname aliases consistently with the host machine.

Pull Request resolved: #100475
Approved by: https://github.com/huydhn
Now that expandable_segments has been merged from OSS, we can enable it in the internal build. It still defaults to off, so this should not change any behavior in the allocator unless the flag is explicitly set.

Differential Revision: D45249535

Pull Request resolved: #100184
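
As context for the commit above: expandable segments are opted into via the documented `PYTORCH_CUDA_ALLOC_CONF` environment variable. A hedged sketch of enabling the flag explicitly in a CI job (the job name and runner label are illustrative, not from this PR):

```yaml
# Hypothetical CI job that turns the off-by-default allocator flag on.
jobs:
  cuda-test:
    runs-on: windows.8xlarge.nvidia.gpu.nonephemeral
    env:
      PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
    steps:
      - run: python -c "import torch; torch.cuda.init()"
```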
@github-actions github-actions bot added the module: cpu, module: dynamo, and module: inductor labels May 2, 2023
@jeanschmidt
Contributor Author

#100478

exactly what I was trying to avoid by merging as quickly as possible

I am opting to merge without waiting for signals; they take a full day, and I've spent a week trying to get all signals green before merging. Because the Windows runners and these files have a very fast iteration rate, it is very difficult to get all green signals without hitting a merge conflict.

@jeanschmidt
Contributor Author

@pytorchbot merge -f "Getting green signals is very difficult due to flakiness in periodic tests and the time it takes to run all CIs (usually a full day of work). Since I have shown this change works, I am accepting the risk of an imperfect merge-conflict resolution, as it is very difficult to get green signals and land amid constant Windows changes."

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@jeanschmidt jeanschmidt deleted the jeanschmidt/windows4x_nonephemeral branch May 2, 2023 20:44
@malfet
Contributor

malfet commented May 2, 2023

@pytorchbot revert -m "This is not the PR I've reviewed" -c weird

@janeyx99
Contributor

janeyx99 commented May 2, 2023

@pytorchbot revert -c ignoredsignal -m "breaks many things"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@jeanschmidt your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request May 2, 2023
…l instances (#100377)"

This reverts commit 7caac54.

Reverted #100377 on behalf of https://github.com/malfet due to This is not the PR I've reviewed ([comment](#100377 (comment)))
@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@jeanschmidt your PR has been successfully reverted.

@jeanschmidt jeanschmidt restored the jeanschmidt/windows4x_nonephemeral branch May 3, 2023 09:26
malfet pushed a commit that referenced this pull request May 3, 2023
…ces (#100377)

This is a reopening of PR [100091](#100091)

# About this PR

Due to increased pressure on our Windows runners, and the elevated cost of spinning those instances up and tearing them down, we want to migrate them from ephemeral to non-ephemeral.

Possible impacts include breakages or misbehavior in CI jobs that leave the runners in a bad state. Other possible impacts relate to resource exhaustion, especially disk space, though memory is also a contender, as CI leftovers pile up on those instances.

As a middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

The instance definitions can be found here: pytorch/test-infra#4072

This is the first step in a multi-step approach in which we will migrate away from all ephemeral Windows instances, following the lead of `windows.g5.4xlarge.nvidia.gpu`, in order to help reduce queue times for those instances. The phases are:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* retire `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to narrow the set of possible culprits to investigate if particular CI jobs misbehave.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: #100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 7caac54)
pytorchmergebot pushed a commit that referenced this pull request May 3, 2023
)

This is a reopening of PR #100377

# About this PR

Due to increased pressure on our Windows runners, and the elevated cost of spinning those instances up and tearing them down, we want to migrate them from ephemeral to non-ephemeral.

Possible impacts include breakages or misbehavior in CI jobs that leave the runners in a bad state. Other possible impacts relate to resource exhaustion, especially disk space, though memory is also a contender, as CI leftovers pile up on those instances.

As a middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

The instance definitions can be found here: pytorch/test-infra#4072

This is the first step in a multi-step approach in which we will migrate away from all ephemeral Windows instances, following the lead of `windows.g5.4xlarge.nvidia.gpu`, in order to help reduce queue times for those instances. The phases are:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* retire `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to narrow the set of possible culprits to investigate if particular CI jobs misbehave.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: #100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 7caac54)

Fixes #ISSUE_NUMBER

Pull Request resolved: #100548
Approved by: https://github.com/jeanschmidt, https://github.com/janeyx99
pytorchmergebot pushed a commit that referenced this pull request May 3, 2023
Caused by #100377: something removes the VS2019 installation on the non-ephemeral runner. I think moving this to unstable is nicer, to gather signals in trunk without completely disabling the job or reverting #100377 (for the Nth time)

Pull Request resolved: #100581
Approved by: https://github.com/clee2000, https://github.com/malfet
@github-actions github-actions bot deleted the jeanschmidt/windows4x_nonephemeral branch November 3, 2024 02:09