Description
🚀 The feature, motivation and pitch
Due to increased pressure on our Windows runners and the elevated cost of starting up and tearing down those instances, we want to migrate them from ephemeral to non-ephemeral.
Possible impacts include breakages or misbehaving CI jobs that leave the runners in a bad state. Another risk is resource exhaustion as CI leftovers pile up on those instances, especially of disk space, though memory might also be a concern.
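To make the disk-space concern concrete, here is a minimal sketch of a check that a job or a maintenance step could run on a long-lived runner. The function name `disk_is_low` and the 10% threshold are assumptions for illustration only; nothing like this is described in the issue or known to exist in the runner setup.

```python
import shutil


def disk_is_low(path=".", min_free_fraction=0.10):
    """Return True when the volume holding `path` has less free space
    than `min_free_fraction` of its total size.

    Hypothetical helper: the name and the default threshold are
    illustrative assumptions, not part of any existing tooling.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total < min_free_fraction


if __name__ == "__main__":
    if disk_is_low("."):
        print("Free disk space is below the threshold; cleanup may be needed.")
    else:
        print("Free disk space looks fine.")
```

A check like this only detects the problem; what to clean up (build artifacts, caches, temporary directories) would still have to be decided per job.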
As a middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.
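The idea of age-weighted stochastic rotation can be sketched as follows. This is not the actual scale-down logic (which lives in the test-infra autoscaler and is not shown in this issue); the function name, the input shape, and the `age + 1` weighting are all assumptions chosen to illustrate "older instances are more likely to be terminated".

```python
import random


def pick_instances_to_terminate(ages_by_id, count, rng=None):
    """Pick up to `count` instance ids at random, without replacement,
    with selection probability proportional to (age + 1).

    Hypothetical sketch: `ages_by_id` maps an instance id to its age in
    any consistent unit (for example hours). The +1 keeps brand-new
    instances selectable and avoids an all-zero weight list.
    """
    rng = rng or random.Random()
    remaining = dict(ages_by_id)
    chosen = []
    while remaining and len(chosen) < count:
        ids = list(remaining)
        weights = [remaining[i] + 1.0 for i in ids]
        picked = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(picked)
        del remaining[picked]
    return chosen


if __name__ == "__main__":
    fleet = {"i-aaa": 2.0, "i-bbb": 40.0, "i-ccc": 12.0}  # made-up ids and ages
    print(pick_instances_to_terminate(fleet, count=1))
```

Because the choice is random rather than strictly oldest-first, a young instance can occasionally be picked, but over time older instances are recycled more often, which limits how much state accumulates on any single runner.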
The instance definitions can be found in pytorch/test-infra#4072.
- ✅ migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch` (Migrate jobs from windows.4xlarge windows.4xlarge.nonephemeral instances #100377)
- 📣 migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch` (Migrate jobs from windows.8xlarge.nvidia.gpu to nonephemeral #104404)
- ⏳ submit PRs to all repositories under the `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
- ⏳ submit PRs to all repositories under the `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
- ⏳ retire `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
- ⏳ evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)
The reason for this phased approach is to narrow the set of possible causes to investigate if particular CI jobs misbehave.
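For the per-repository PRs, the label change itself is mechanical. The sketch below shows one way to rewrite runner labels in workflow text; it is an illustration, not the script used for the linked PRs. The function name is hypothetical, and it assumes the labels appear as plain tokens (for example after `runs-on:`) in the workflow files.

```python
import re

# Labels named in the migration list; ".nonephemeral" is the target suffix.
LABELS = ("windows.4xlarge", "windows.8xlarge.nvidia.gpu")


def migrate_runner_labels(text, labels=LABELS):
    """Append ".nonephemeral" to each old runner label found in `text`.

    Hypothetical helper. The lookarounds require that the label is not
    part of a longer dotted name, so labels that already end in
    ".nonephemeral" are left untouched and the rewrite is idempotent.
    """
    for label in labels:
        pattern = re.compile(r"(?<![\w.-])" + re.escape(label) + r"(?![\w.-])")
        text = pattern.sub(label + ".nonephemeral", text)
    return text


if __name__ == "__main__":
    before = "runs-on: windows.4xlarge\n"
    print(migrate_runner_labels(before), end="")
```

Running the rewrite one label at a time, in separate PRs, matches the phased approach: if a job starts misbehaving, only one label change needs to be examined.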
Alternatives
No response
Additional context
No response
cc @seemethere @malfet @pytorch/pytorch-dev-infra