Change default runner's AMI to Amazon 2023 AMI - Part 1 #133641
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133641
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 0b8dce9 with merge base d3b458e:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Left a comment on test-infra PR 5558. Let's not go forward with this change; there is a cleaner and better way to accomplish it.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan.

This PR will be merged on Monday, Aug 19 in the morning, and over the next 2-3 days, as new Linux runners are spun up (and old ones spun down), they'll start using this new AMI.

This PR will be paired with pytorch/pytorch#133641, which will be merged first.

Note: I had to remove a check that validated that the variant's AMI is different from the base runner type's AMI, because we want them to be the same during this rollout.

**Rollout steps:**
1. Merge PR1 & PR2 early Monday morning, before folks start using the CI heavily.
2. (To roll out faster) Manually trigger a rollout in gha-labs-infra and ci-infra. This will accelerate the rollout by causing the runner instances to be cycled faster than otherwise. Most of the fleet will have the new runners within the day, but we can still expect it to take up to 3 days for some stragglers to be cycled.
   - In pytorch-gha-infra, run the action [Runners Do Terraform Release (apply)](https://github.com/pytorch-labs/pytorch-gha-infra/actions/workflows/runners-on-dispatch-release.yml).
   - In ci-infra, run the action [Terraform Apply / Runners / Production - ALI](https://github.com/pytorch/ci-infra/actions/workflows/ali-deploy-prod.yml).

**Revert steps (if needed):**
1. Revert PR1 & PR2.
2. (To revert faster) Manually trigger a rollout in gha-labs-infra and ci-infra. This will accelerate the rollback by marking all current runners as stale and will cause the instances to be cycled faster than otherwise. Most of the fleet will have the new runners within the day, but we can still expect it to take up to 3 days for some stragglers to be cycled.
   - In pytorch-gha-infra, run the action [Runners Do Terraform Release (apply)](https://github.com/pytorch-labs/pytorch-gha-infra/actions/workflows/runners-on-dispatch-release.yml).
   - In ci-infra, run the action [Terraform Apply / Runners / Production - ALI](https://github.com/pytorch/ci-infra/actions/workflows/ali-deploy-prod.yml).
3. (If we need to revert even faster in the case of a major outage) Go to the EC2 instances in AWS and manually terminate the relevant instances. Note that this will cancel any jobs running on them. The scale-up script will then provision instances with the new configs as fresh jobs are requested.

GHA Infra Runbook, in case something goes wrong when attempting the accelerated rollbacks: https://fburl.com/gdoc/jsvpqrav
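
The manual "run the action" steps above could also be scripted through GitHub's REST endpoint for creating a workflow dispatch event. The following is a minimal sketch, not the team's actual tooling: it assumes both workflows accept a `workflow_dispatch` trigger with no required inputs, that they live on a branch named `main`, and that the token has permission to run Actions in both repositories. The helper names `build_dispatch_request` and `trigger_rollout` are hypothetical.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"

# (owner/repo, workflow file) pairs taken from the action links in the steps above.
ROLLOUT_WORKFLOWS = [
    ("pytorch-labs/pytorch-gha-infra", "runners-on-dispatch-release.yml"),
    ("pytorch/ci-infra", "ali-deploy-prod.yml"),
]


def build_dispatch_request(repo, workflow_file, token, ref="main"):
    """Build (but do not send) a workflow_dispatch request for one workflow.

    ref="main" is an assumption; use whichever branch holds the workflow.
    """
    url = f"{GITHUB_API}/repos/{repo}/actions/workflows/{workflow_file}/dispatches"
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def trigger_rollout(token, ref="main", send=False):
    """Prepare one dispatch per workflow; only touch the network if send=True."""
    requests = [
        build_dispatch_request(repo, workflow_file, token, ref)
        for repo, workflow_file in ROLLOUT_WORKFLOWS
    ]
    if send:
        for req in requests:
            # GitHub replies with 204 No Content when the dispatch is accepted.
            with urllib.request.urlopen(req) as resp:
                print(req.full_url, resp.status)
    return requests
```

Calling `trigger_rollout(token)` with the default `send=False` only builds the requests, which makes it possible to inspect the URLs before anything is triggered. The same two dispatches would apply to both the rollout and the accelerated revert, since the steps above name the same actions in each case.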
Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan.

This PR will be merged on Monday Aug 19 in the morning, and over the next 2-3 days as new linux runners are spun up (and old ones spun down) they'll start using this new AMI

This PR will be paired with pytorch/test-infra#5558, which will be merged after this one

Pull Request resolved: #133641
Approved by: https://github.com/jeanschmidt
Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan.
This PR will be merged on Monday, Aug 19 in the morning, and over the next 2-3 days, as new Linux runners are spun up (and old ones spun down), they'll start using this new AMI.
This PR will be paired with pytorch/test-infra#5558, which will be merged after this one.