Conversation

jeanschmidt
Contributor

@jeanschmidt jeanschmidt commented May 1, 2023

This is a reopening of PR 100091

About this PR

Due to increased pressure on our Windows runners, and the elevated cost of spinning those instances up and tearing them down, we want to migrate them from ephemeral to non-ephemeral.

Possible impacts include breakages or misbehavior in CI jobs that leave the runners in a bad state. Other possible impacts relate to resource exhaustion, especially disk space, though memory is also a contender, as CI leftovers pile up on those instances.

As a middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

The instance definitions can be found here: pytorch/test-infra#4072
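
For orientation, here is a minimal sketch of what a non-ephemeral runner entry in the test-infra scale configuration might look like. The schema, field names, and values below are assumptions for illustration only, not the actual definition from pytorch/test-infra#4072:

```yaml
# Hypothetical scale-config entry; every field and value here is illustrative.
runner_types:
  windows.4xlarge.nonephemeral:
    instance_type: c5d.4xlarge   # assumed EC2 instance type
    os: windows
    is_ephemeral: false          # the key change: instances are reused across jobs
    max_available: 420           # assumed capacity cap
    disk_size: 256               # assumed size in GiB; disk fill-up is the main reuse risk
```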

This is the first step in a multi-step approach in which we will migrate away from all ephemeral Windows instances, following the lead of windows.g5.4xlarge.nvidia.gpu, in order to help reduce queue times for those instances. The phases are:

  • migrate windows.4xlarge to windows.4xlarge.nonephemeral instances under pytorch/pytorch
  • migrate windows.8xlarge.nvidia.gpu to windows.8xlarge.nvidia.gpu.nonephemeral instances under pytorch/pytorch
  • submit PRs to all repositories under pytorch/ organization to migrate windows.4xlarge to windows.4xlarge.nonephemeral
  • submit PRs to all repositories under pytorch/ organization to migrate windows.8xlarge.nvidia.gpu to windows.8xlarge.nvidia.gpu.nonephemeral
  • retire windows.4xlarge and windows.8xlarge.nvidia.gpu
  • evaluate and start the work related to the adoption of windows.g5.4xlarge.nvidia.gpu to replace windows.8xlarge.nvidia.gpu.nonephemeral in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to narrow the set of possible culprits to investigate if particular CI jobs misbehave.
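
Concretely, each migration step amounts to swapping a runner label in the affected workflow files. A minimal sketch of such a swap, assuming a hypothetical job (only the runner labels come from this PR):

```yaml
# Hypothetical workflow job illustrating the label change.
jobs:
  win-build:
    runs-on: windows.4xlarge.nonephemeral   # was: windows.4xlarge
    steps:
      - uses: actions/checkout@v3
      - name: Build
        run: python setup.py develop        # placeholder build step
```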

Copilot Summary

🤖 Generated by Copilot at 579d87a

This pull request migrates some windows workflows to use nonephemeral runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: .github/templates/windows_binary_build_workflow.yml.j2, .github/workflows/generated-windows-binary-*.yml, .github/workflows/pull.yml, .github/actionlint.yaml, .github/workflows/_win-build.yml, .github/workflows/periodic.yml, and .github/workflows/trunk.yml.
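
Since `.github/actionlint.yaml` is among the touched files, presumably to register the new labels with the workflow linter, a sketch of that registration is shown below. It follows actionlint's documented `self-hosted-runner` config schema; the exact entries in this PR are an assumption:

```yaml
# Register the new runner labels so actionlint accepts them in runs-on fields.
self-hosted-runner:
  labels:
    - windows.4xlarge.nonephemeral
    - windows.8xlarge.nvidia.gpu.nonephemeral
```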

Copilot Poem

🤖 Generated by Copilot at 579d87a

We're breaking free from the ephemeral chains
We're running on the nonephemeral lanes
We're building faster, testing stronger, supporting newer
We're the non-ephemeral runners of fire

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

issue: #101209

@jeanschmidt jeanschmidt added the ciflow/binaries, ciflow/trunk, topic: not user facing, and ciflow/periodic labels May 1, 2023
@jeanschmidt jeanschmidt requested a review from a team as a code owner May 1, 2023 14:18
@pytorch-bot

pytorch-bot bot commented May 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100377

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 14 New Failures, 3 Unrelated Failures

As of commit c944d59:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base 5fbb406:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeanschmidt jeanschmidt requested review from atalman, huydhn and malfet May 1, 2023 14:21
@jeanschmidt jeanschmidt added the ciflow/nightly label May 1, 2023
@jeanschmidt
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to

Aborting rebase because rebasing the branch resulted in the same sha as the target branch.
This usually happens because the PR has already been merged.  Please rebase locally and push.

Raised by https://github.com/pytorch/pytorch/actions/runs/4851954751

jeanschmidt and others added 15 commits May 1, 2023 16:51
Differential Revision: D45416983

Pull Request resolved: #100388
…100275)

Summary: to reduce the peak memory consumption

Pull Request resolved: #100275
Approved by: https://github.com/jansel
This PR splits OutputGraph into two classes:
- SubgraphTracer (handles FX-tracing)
- OutputGraph (handles Dynamo-specific output graph logic, like
tracking graph inputs, compiling the graph, and executing it).

The motivation behind this is in the next PR up in the stack.
TL;DR is: in order to do higher-order operators, we need nested
SubgraphTracer, one for each level of nesting of the higher-order
operators.

I'm happy to flatten the stack into a single PR, but this separation made
it easier for me to test. Lmk if you want the stack flattened.

Test Plan:
- existing tests

Pull Request resolved: #99987
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
This PR introduces a `wrap(body_fn, *args)` higher-order operator.
The semantics of `wrap(body_fn, *args)` are to just run `body_fn(*args)`.

Underneath Dynamo, this PR makes it so that we rewrite calls to
`wrap(body_fn, *args)` with `wrap(new_fn, *new_args)`, where `new_fn` has
no free variables. This PR does not update cond/map to use the new
mechanism yet (we do not support nn.Modules yet; that will come in the future).

The design we take is:
- OutputGraph represents the graph being built by Dynamo that may be
compiled and executed.
- OutputGraph owns a root SubgraphTracer, where it builds the FX graph.
- OutputGraph may own multiple nested SubgraphTracers.
- When we need to trace the body function of a HigherOrderOperator, we
construct a new SubgraphTracer to build the graph of the body function.

Mechanically, when Dynamo sees a new `wrap` HigherOrderOperator with a
body function, it:
- Creates a new SubgraphTracer via OutputGraph.new_subtracer
- Executes the body function
This captures the body function into the graph on the new
SubgraphTracer while modifying the state of the OutputGraph. For
example, the OutputGraph may receive new GraphArgs, new guards, and new
side effects.

If capture of the body function fails, then Dynamo graph breaks on the
HigherOrderOperator.

Test Plan:
- added test/dynamo/test_higher_order_ops.py

Future:
- We're not actually able to tell Dynamo to completely graph break on the
HigherOrderOperator. Instead, when we do graph break, Dynamo begins
introspecting `HigherOrderOperator.__call__`. It should probably not do
this.
- Ideally we would error out on new SideEffects. I don't know how to do
this yet.
- We don't support dealing with nn.Modules yet (e.g. calling nn.Modules
or accessing attributes of tracked nn.Modules from a body_fn). There's
an open question about what should actually happen here.
- Ideally we would rewrite map/cond to use the new mechanism but we need
to fix the previous bullet point before we can get there.

Pull Request resolved: #99988
Approved by: https://github.com/voznesenskym, https://github.com/anijain2305
This reverts commit ea5f6d7.

Reverted #100151 on behalf of https://github.com/atalman due to breaking internal builds ([comment](#100151 (comment)))
Follow-up after #100436 to disable download.pytorch.org access over IPv6 due to access problems.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 55c9443</samp>

This pull request improves the network configuration of the test-pytorch-binary GitHub action and workflow by mounting the host's `/etc/hosts` file into the container. This enables the container to resolve hostname aliases consistently with the host machine.

Pull Request resolved: #100475
Approved by: https://github.com/huydhn
Now that expandable_segments has been merged from OSS, we can enable it in the internal build. It still defaults to off, so this should not change any behavior in the allocator unless the flag is explicitly set.

Differential Revision: D45249535

Pull Request resolved: #100184
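
As context for the commit above: expandable segments are opted into via the documented `PYTORCH_CUDA_ALLOC_CONF` environment variable. A hedged sketch of enabling the flag explicitly in a CI job (the job name and runner label are illustrative, not from this PR):

```yaml
# Hypothetical CI job that turns the off-by-default allocator flag on.
jobs:
  cuda-test:
    runs-on: windows.8xlarge.nvidia.gpu.nonephemeral
    env:
      PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
    steps:
      - run: python -c "import torch; torch.cuda.init()"
```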
@github-actions github-actions bot added the module: cpu, module: dynamo, and module: inductor labels May 2, 2023
@jeanschmidt
Contributor Author

#100478

exactly what I was trying to avoid by merging as quickly as possible

I am opting to merge without waiting for signals; they take a full day, and I've spent a week trying to get all signals green before merging. Because the Windows runners and these files have a very fast iteration rate, it is very difficult to get all green signals without hitting a merge conflict.

@jeanschmidt
Contributor Author

@pytorchbot merge -f "Getting green signals is very difficult due to flakiness in periodic tests and the time it takes to run all CIs (usually a full day of work). Since I have shown this change works, I am accepting the risk of an imperfect merge-conflict resolution, as it is very difficult to get green signals and land amid constant Windows changes."

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@jeanschmidt jeanschmidt deleted the jeanschmidt/windows4x_nonephemeral branch May 2, 2023 20:44
@malfet
Contributor

malfet commented May 2, 2023

@pytorchbot revert -m "This is not the PR I've reviewed" -c weird

@janeyx99
Contributor

janeyx99 commented May 2, 2023

@pytorchbot revert -c ignoredsignal -m "breaks many things"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@jeanschmidt your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request May 2, 2023
…l instances (#100377)"

This reverts commit 7caac54.

Reverted #100377 on behalf of https://github.com/malfet due to This is not the PR I've reviewed ([comment](#100377 (comment)))
@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@jeanschmidt your PR has been successfully reverted.

@jeanschmidt jeanschmidt restored the jeanschmidt/windows4x_nonephemeral branch May 3, 2023 09:26
malfet pushed a commit that referenced this pull request May 3, 2023
…ces (#100377)

This is a reopening of PR [100091](#100091)

# About this PR

Due to increased pressure on our Windows runners, and the elevated cost of spinning those instances up and tearing them down, we want to migrate them from ephemeral to non-ephemeral.

Possible impacts include breakages or misbehavior in CI jobs that leave the runners in a bad state. Other possible impacts relate to resource exhaustion, especially disk space, though memory is also a contender, as CI leftovers pile up on those instances.

As a middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

The instance definitions can be found here: pytorch/test-infra#4072

This is the first step in a multi-step approach in which we will migrate away from all ephemeral Windows instances, following the lead of `windows.g5.4xlarge.nvidia.gpu`, in order to help reduce queue times for those instances. The phases are:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* retire `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to narrow the set of possible culprits to investigate if particular CI jobs misbehave.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: #100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 7caac54)
pytorchmergebot pushed a commit that referenced this pull request May 3, 2023
)

This is a reopening of PR #100377

# About this PR

Due to increased pressure on our Windows runners, and the elevated cost of spinning those instances up and tearing them down, we want to migrate them from ephemeral to non-ephemeral.

Possible impacts include breakages or misbehavior in CI jobs that leave the runners in a bad state. Other possible impacts relate to resource exhaustion, especially disk space, though memory is also a contender, as CI leftovers pile up on those instances.

As a middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

The instance definitions can be found here: pytorch/test-infra#4072

This is the first step in a multi-step approach in which we will migrate away from all ephemeral Windows instances, following the lead of `windows.g5.4xlarge.nvidia.gpu`, in order to help reduce queue times for those instances. The phases are:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* retire `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to narrow the set of possible culprits to investigate if particular CI jobs misbehave.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: #100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 7caac54)

Fixes #ISSUE_NUMBER

Pull Request resolved: #100548
Approved by: https://github.com/jeanschmidt, https://github.com/janeyx99
pytorchmergebot pushed a commit that referenced this pull request May 3, 2023
Caused by #100377: something removes the VS2019 installation on the non-ephemeral runner. I think moving this to unstable is nicer, to gather signals in trunk without completely disabling the job or reverting #100377 (for the Nth time)

Pull Request resolved: #100581
Approved by: https://github.com/clee2000, https://github.com/malfet
@github-actions github-actions bot deleted the jeanschmidt/windows4x_nonephemeral branch November 3, 2024 02:09