Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances #100377
Conversation
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Rebase failed due to
Raised by https://github.com/pytorch/pytorch/actions/runs/4851954751
…om/pytorch/pytorch into jeanschmidt/windows4x_nonephemeral
…om/pytorch/pytorch into jeanschmidt/windows4x_nonephemeral
…hmidt/windows4x_nonephemeral
…sorBase to ShardedTensor (#100197) Differential Revision: [D45369211](https://our.internmc.facebook.com/intern/diff/D45369211) Pull Request resolved: #100197 Approved by: https://github.com/fduwjj
Differential Revision: D45416983 Pull Request resolved: #100388
…100275) Summary: to reduce the peak memory consumption Pull Request resolved: #100275 Approved by: https://github.com/jansel
…#99441)" This reverts commit 97afbcb. Reverted #99441 on behalf of https://github.com/ngimel due to breaks ROCM ([comment](#99441 (comment)))
This PR splits OutputGraph into two classes:
- SubgraphTracer (handles FX-tracing)
- OutputGraph (handles Dynamo-specific output-graph logic, like tracking graph inputs, compiling the graph, and executing it)

The motivation behind this is in the next PR up in the stack. TL;DR: in order to do higher-order operators, we need nested SubgraphTracers, one for each level of nesting of the higher-order operators. I'm happy to flatten the stack into a single PR, but this separation made it easier for me to test. Lmk if you want the stack flattened.

Test Plan:
- existing tests

Pull Request resolved: #99987 Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
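For readers skimming this change, here is a rough structural sketch of the split described in the commit message above. Apart from `OutputGraph`, `SubgraphTracer`, and `new_subtracer`, the names are hypothetical; this is an illustration, not the actual torch._dynamo code.

```python
class SubgraphTracer:
    """Handles FX-tracing for one subgraph: records operations into a graph."""
    def __init__(self, parent=None):
        self.parent = parent
        self.nodes = []  # stands in for an fx.Graph in this sketch

    def record(self, op, *args):
        # The real tracer would call fx.Graph.create_node(...) here.
        self.nodes.append((op, args))


class OutputGraph:
    """Dynamo-specific output-graph logic: graph inputs, guards, compilation."""
    def __init__(self):
        self.root_tracer = SubgraphTracer()
        self.tracers = [self.root_tracer]  # stack; the top is the active tracer
        self.graphargs = []
        self.guards = set()

    @property
    def current_tracer(self):
        return self.tracers[-1]

    def new_subtracer(self):
        # Used when tracing the body of a higher-order operator (next PR in
        # the stack): the body is captured into a nested tracer, while graph
        # args, guards, and side effects still accumulate on this OutputGraph.
        tracer = SubgraphTracer(parent=self.current_tracer)
        self.tracers.append(tracer)
        return tracer
```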
This PR introduces a `wrap(body_fn, *args)` higher-order operator. The semantics of `wrap(body_fn, *args)` is to just run `body_fn(*args)`. Underneath Dynamo, this PR makes it so that we rewrite calls to `wrap(body_fn, *args)` into `wrap(new_fn, *new_args)` where `new_fn` has no free variables. This PR does not update cond/map to use the new mechanism yet (we do not support nn.Modules yet; that will come in the future).

The design we take is:
- OutputGraph represents the graph being built by Dynamo that may be compiled and executed.
- OutputGraph owns a root SubgraphTracer, where it builds the FX graph.
- OutputGraph may own multiple nested SubgraphTracers.
- When we need to trace the body function of a HigherOrderOperator, we construct a new SubgraphTracer to build the graph of the body function.

Mechanically, when Dynamo sees a new `wrap` HigherOrderOperator with a body function, it:
- Creates a new SubgraphTracer via OutputGraph.new_subtracer
- Executes the body function

This captures the body function into the graph on the new SubgraphTracer while modifying the state of the OutputGraph. For example, the OutputGraph may receive new GraphArgs, new guards, and new side effects. If capture of the body function fails, then Dynamo graph breaks on the HigherOrderOperator.

Test Plan:
- added test/dynamo/test_higher_order_ops.py

Future:
- We're not actually able to tell Dynamo to completely graph break on the HigherOrderOperator. Instead, when we do graph break, Dynamo begins introspecting `HigherOrderOperator.__call__`. It should probably not do this.
- Ideally we would error out on new SideEffects. I don't know how to do this yet.
- We don't support dealing with nn.Modules yet (e.g. calling nn.Modules or accessing attributes of tracked nn.Modules from a body_fn). There's an open question on what should actually happen here.
- Ideally we would rewrite map/cond to use the new mechanism, but we need to fix the previous bullet point before we can get there.

Pull Request resolved: #99988 Approved by: https://github.com/voznesenskym, https://github.com/anijain2305
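To make the "no free variables" rewrite concrete, here is a minimal, purely illustrative sketch. `wrap` below is a stand-in for the higher-order operator rather than the real Dynamo machinery, and the variable names are invented:

```python
def wrap(body_fn, *args):
    # Semantics of wrap(body_fn, *args): just run body_fn(*args).
    return body_fn(*args)

scale = 10  # a value the body function closes over (a free variable)

# Before the rewrite: body_fn has a free variable (`scale`).
def body_fn(x):
    return x * scale

# After the rewrite (conceptually): the free variable is lifted into an
# explicit argument, so new_fn has no free variables, and the call site
# becomes wrap(new_fn, *new_args) with the lifted value appended to the args.
def new_fn(x, lifted_scale):
    return x * lifted_scale

new_args = (3, scale)

assert wrap(body_fn, 3) == wrap(new_fn, *new_args) == 30
```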
This reverts commit ea5f6d7. Reverted #100151 on behalf of https://github.com/atalman due to breaking internal builds ([comment](#100151 (comment)))
Follow-up after #100436 to disable download.pytorch.org access over IPv6 due to access problems. <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 55c9443</samp> This pull request improves the network configuration of the test-pytorch-binary GitHub action and workflow by mounting the host's `/etc/hosts` file into the container. This enables the container to resolve hostname aliases consistently with the host machine. Pull Request resolved: #100475 Approved by: https://github.com/huydhn
Now that expandable_segments has been merged from OSS, we can enable it in the internal build. It still defaults to off, so this should not change any behavior in the allocator unless the flag is explicitly set. Differential Revision: D45249535 Pull Request resolved: #100184
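For context, the flag referred to above is the allocator option exposed through the `PYTORCH_CUDA_ALLOC_CONF` environment variable. A minimal sketch of opting in explicitly, assuming a CUDA-enabled PyTorch build, could look like this:

```python
import os

# Must be set before the CUDA caching allocator initializes, i.e. before the
# first CUDA allocation in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# Allocations made from here on use expandable segments instead of the
# default fixed-size segment strategy.
x = torch.ones(1024, device="cuda")
```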
Exactly what I was trying to avoid by merging as quickly as possible. I am opting to merge without waiting for signals: they take a full day, and I've spent a week trying to get all signals green before merging. Since the windows runners and those files have a very fast iteration rate, it is very difficult to get all green signals and not end up with a merge conflict.
@pytorchbot merge -f "Getting green signals is very difficult due to flakiness on periodic tests and the time it takes to run all CIs (usually a day of work). As I have proven this change works for the time being, I am accepting the risk of not having resolved the merge conflicts perfectly, since it would be very difficult to get green signals and land this with constant windows changes."
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "This is not the PR I've reviewed" -c weird
@pytorchbot revert -c ignoredsignal -m "breaks many things"
@pytorchbot successfully started a revert job. Check the current status here.
@jeanschmidt your PR has been successfully reverted.
…l instances (#100377)" This reverts commit 7caac54. Reverted #100377 on behalf of https://github.com/malfet due to This is not the PR I've reviewed ([comment](#100377 (comment)))
@pytorchbot successfully started a revert job. Check the current status here.
@jeanschmidt your PR has been successfully reverted.
…ces (#100377) This is reopening of the PR [100091](#100091) # About this PR Due to increased pressure over our windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to not ephemeral. Possible impacts are related to breakages in or misbehaves on CI jobs that puts the runners in a bad state. Other possible impacts are related to exhaustion of resources, especially disk space, but memory might be a contender, as CI trash piles up on those instances. As a somewhat middle of the road approach to this, currently nonephemeral instances are stochastically rotated as older instances get higher priority to be terminated when demand is lower. Instances definition can be found here: pytorch/test-infra#4072 This is a first in a multi-step approach where we will migrate away from all ephemeral windows instances and follow the lead of the `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phased approach follows: * migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch` * migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch` * submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` * submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` * terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu` * evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn) The reasoning for this phased approach is to reduce the scope of possible contenders to investigate in case of misbehave of particular CI jobs. # Copilot Summary <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 579d87a</samp> This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`. # Copilot Poem <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at 579d87a</samp> > _We're breaking free from the ephemeral chains_ > _We're running on the nonephemeral lanes_ > _We're building faster, testing stronger, supporting newer_ > _We're the non-ephemeral runners of fire_ Pull Request resolved: #100377 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman (cherry picked from commit 7caac54)
) This is reopening of the PR #100377 # About this PR Due to increased pressure over our windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to not ephemeral. Possible impacts are related to breakages in or misbehaves on CI jobs that puts the runners in a bad state. Other possible impacts are related to exhaustion of resources, especially disk space, but memory might be a contender, as CI trash piles up on those instances. As a somewhat middle of the road approach to this, currently nonephemeral instances are stochastically rotated as older instances get higher priority to be terminated when demand is lower. Instances definition can be found here: pytorch/test-infra#4072 This is a first in a multi-step approach where we will migrate away from all ephemeral windows instances and follow the lead of the `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phased approach follows: * migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch` * migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch` * submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` * submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` * terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu` * evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn) The reasoning for this phased approach is to reduce the scope of possible contenders to investigate in case of misbehave of particular CI jobs. # Copilot Summary <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 579d87a</samp> This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`. # Copilot Poem <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at 579d87a</samp> > _We're breaking free from the ephemeral chains_ > _We're running on the nonephemeral lanes_ > _We're building faster, testing stronger, supporting newer_ > _We're the non-ephemeral runners of fire_ Pull Request resolved: #100377 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman (cherry picked from commit 7caac54) Fixes #ISSUE_NUMBER Pull Request resolved: #100548 Approved by: https://github.com/jeanschmidt, https://github.com/janeyx99
Caused by #100377: something removes the VS2019 installation on the non-ephemeral runner. I think moving this to unstable is nicer, so we can gather signals in trunk without completely disabling the job or reverting #100377 (for the Nth time). Pull Request resolved: #100581 Approved by: https://github.com/clee2000, https://github.com/malfet
This is a reopening of PR #100091.
About this PR
Due to increased pressure on our windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate from ephemeral to non-ephemeral instances.
Possible impacts are breakages or misbehavior in CI jobs that put the runners in a bad state. Other possible impacts relate to resource exhaustion, especially disk space, though memory might also be a contender as CI leftovers pile up on those instances.
As a middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.
The instance definitions can be found here: pytorch/test-infra#4072
This is the first step in a multi-step approach where we will migrate away from all ephemeral windows instances and follow the lead of `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phased approach is:
* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under the `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under the `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work of adopting `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to reduce the scope of possible contenders to investigate if particular CI jobs misbehave.
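To make the mechanics of the first two phases concrete, the sketch below is a hypothetical helper, not part of this PR (the PR edits the workflow files directly); it applies the runner-label mapping listed above across the workflow files.

```python
import re
from pathlib import Path

# Runner-label mapping taken from the phased plan above.
LABEL_MAP = {
    "windows.4xlarge": "windows.4xlarge.nonephemeral",
    "windows.8xlarge.nvidia.gpu": "windows.8xlarge.nvidia.gpu.nonephemeral",
}

def migrate_runner_labels(workflows_dir: str = ".github/workflows") -> None:
    for path in Path(workflows_dir).glob("*.yml"):
        text = path.read_text()
        for old, new in LABEL_MAP.items():
            # Boundary checks so "windows.4xlarge" does not also rewrite an
            # already-migrated "windows.4xlarge.nonephemeral" label.
            text = re.sub(rf"(?<![\w.]){re.escape(old)}(?![\w.])", new, text)
        path.write_text(text)

if __name__ == "__main__":
    migrate_runner_labels()
```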
Copilot Summary
🤖 Generated by Copilot at 579d87a
This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.
Copilot Poem
🤖 Generated by Copilot at 579d87a
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire
issue: #101209