@ZainRizvi commented Aug 15, 2024
Contributor

Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan.

This PR will be merged on the morning of Monday, Aug 19. Over the following 2-3 days, as new Linux runners are spun up (and old ones spun down), they will start using the new AMI.

This PR will be paired with pytorch/test-infra#5558, which will be merged after this one.

@ZainRizvi requested a review from a team as a code owner August 15, 2024 22:54

pytorch-bot bot commented Aug 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133641

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 0b8dce9 with merge base d3b458e:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeanschmidt
Contributor

I left a comment on test-infra PR 5558; let's not go forward with this change.

There is a cleaner and better way to accomplish this change.

@ZainRizvi
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

ZainRizvi added a commit to pytorch/test-infra that referenced this pull request Aug 19, 2024
Upgrades the LF scale configs to change the default AMI in accordance
with the Amazon 2023 rollout plan.

This PR will be merged on the morning of Monday, Aug 19. Over the
following 2-3 days, as new Linux runners are spun up (and old ones spun
down), they will start using the new AMI.

This PR will be paired with
pytorch/pytorch#133641, which will be merged
first.

Note: I had to remove a check that validated that the variant's AMI is
different from the base runner type's AMI, because we want them to be the
same during this rollout.
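
For context, a check of the kind described in the note above might look roughly like the sketch below. This is not the code that was removed. The config layout (a mapping of runner types, each with an `ami` and optional `variants`), the function name, and every runner, variant, and AMI name here are assumptions made up for illustration.

```python
from typing import Any


def find_variant_ami_conflicts(
    runner_types: dict[str, dict[str, Any]],
) -> list[str]:
    """Return one message per variant whose AMI equals its base runner type's AMI."""
    conflicts = []
    for name, config in runner_types.items():
        base_ami = config.get("ami")
        for variant_name, variant in config.get("variants", {}).items():
            variant_ami = variant.get("ami")
            # Only flag a conflict when the base sets an AMI and the variant repeats it.
            if base_ami is not None and variant_ami == base_ami:
                conflicts.append(
                    f"{name}: variant '{variant_name}' uses the same AMI "
                    f"as the base runner type ({base_ami})"
                )
    return conflicts


# Hypothetical config: runner-a has had its base AMI bumped to match its variant,
# which is the situation the rollout intentionally creates.
example = {
    "runner-a": {
        "ami": "example-new-ami",
        "variants": {"variant-x": {"ami": "example-new-ami"}},
    },
    "runner-b": {
        "ami": "example-old-ami",
        "variants": {"variant-x": {"ami": "example-new-ami"}},
    },
}

for message in find_variant_ami_conflicts(example):
    print(message)
```

With a check like this in place, bumping the base AMI to the variant's AMI would fail validation, which is why the rollout needs the check removed (or relaxed) first.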

**Rollout steps:**
1. Merge PR1 & PR2 early Monday morning, before folks start using the CI
heavily.
2. (To roll out faster) Manually trigger a rollout in pytorch-gha-infra and
ci-infra. This accelerates the rollout by cycling the runner
instances faster than they otherwise would be. Most of the fleet will
be on the new runners within the day, but we can still expect it
to take up to 3 days for some stragglers to be cycled.
   - In pytorch-gha-infra, run the action [Runners Do Terraform Release
   (apply)](https://github.com/pytorch-labs/pytorch-gha-infra/actions/workflows/runners-on-dispatch-release.yml).
   - In ci-infra, run the action [Terraform Apply / Runners / Production -
   ALI](https://github.com/pytorch/ci-infra/actions/workflows/ali-deploy-prod.yml).
   
**Revert steps (if needed):**
1. Revert PR1 & PR2.
2. (To revert faster) Manually trigger a rollout in pytorch-gha-infra and
ci-infra. This accelerates the rollback by marking all current
runners as stale, which causes the instances to be cycled faster than
otherwise. Most of the fleet will have been cycled within
the day, but we can still expect it to take up to 3 days for some
stragglers.
   - In pytorch-gha-infra, run the action [Runners Do Terraform Release
   (apply)](https://github.com/pytorch-labs/pytorch-gha-infra/actions/workflows/runners-on-dispatch-release.yml).
   - In ci-infra, run the action [Terraform Apply / Runners / Production -
   ALI](https://github.com/pytorch/ci-infra/actions/workflows/ali-deploy-prod.yml).
3. (If we need to revert even faster, in the case of a major outage) Go
to the EC2 instances in AWS and manually terminate the relevant
instances. Note that this will cancel any jobs running on them. The
scale-up script will then provision instances with the new configs as
fresh jobs are requested.
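
The manual trigger used in both the rollout and the revert steps could also be scripted through GitHub's REST endpoint for workflow dispatch events. The sketch below is only an illustration. The workflow file names come from the URLs above, but the `main` ref, the `GITHUB_TOKEN` environment variable, and the assumption that both workflows accept a dispatch with no inputs are guesses, not something the steps confirm. It defaults to a dry run that only builds the requests.

```python
import json
import os
import urllib.request

# (owner, repo, workflow file) taken from the action URLs in the steps above.
WORKFLOWS = [
    ("pytorch-labs", "pytorch-gha-infra", "runners-on-dispatch-release.yml"),
    ("pytorch", "ci-infra", "ali-deploy-prod.yml"),
]


def build_dispatch_request(owner, repo, workflow_file, ref, token):
    """Build (but do not send) a workflow dispatch request for one workflow."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    return urllib.request.Request(
        url,
        # Assumption: the workflow needs no inputs, only the ref to run on.
        data=json.dumps({"ref": ref}).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def dispatch_all(ref="main", token=None, dry_run=True):
    """Dispatch every workflow in WORKFLOWS and return the URLs that were targeted."""
    token = token or os.environ.get("GITHUB_TOKEN", "")
    urls = []
    for owner, repo, workflow_file in WORKFLOWS:
        request = build_dispatch_request(owner, repo, workflow_file, ref, token)
        urls.append(request.full_url)
        if not dry_run:
            # urlopen raises HTTPError on a non-2xx response.
            with urllib.request.urlopen(request) as response:
                response.read()
    return urls


if __name__ == "__main__":
    for url in dispatch_all(dry_run=True):
        print(url)
```

Passing `dry_run=False` would actually send the requests, so it should only be done with a token that is allowed to run these workflows and after checking what inputs they expect.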

GHA Infra Runbook, in case something goes wrong when attempting the
acceleration rollbacks: https://fburl.com/gdoc/jsvpqrav
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024
Pull Request resolved: #133641
Approved by: https://github.com/jeanschmidt
@github-actions bot deleted the zainr/ami-default-upgrade branch September 28, 2024 02:08

Labels

ciflow/trunk, Merged, topic: not user facing


3 participants