Migrate windows runners to non-ephemeral instances #101209

@jeanschmidt

Description

🚀 The feature, motivation and pitch

Due to increased pressure on our Windows runners and the high cost of launching and tearing down those instances, we want to migrate them from ephemeral to non-ephemeral instances.

Possible impacts include breakage or misbehavior in CI jobs that leave a runner in a bad state. Another risk is resource exhaustion as CI leftovers pile up on long-lived instances: disk space is the main concern, but memory could be affected as well.

As a middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.
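The oldest-first rotation described above could be sketched roughly as follows. This is only an illustration under assumptions: the `Runner` record, its fields, and `pick_runners_to_terminate` are hypothetical names, and the actual scale-down logic in pytorch/test-infra may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Runner:
    """Hypothetical record for one non-ephemeral runner instance."""
    instance_id: str
    launched_at: datetime
    busy: bool


def pick_runners_to_terminate(runners, excess):
    """Return up to `excess` idle runners, oldest first.

    Busy runners are never selected, so a running CI job is not interrupted.
    """
    if excess <= 0:
        return []
    idle = [r for r in runners if not r.busy]
    idle.sort(key=lambda r: r.launched_at)  # oldest launch time first
    return idle[:excess]


# Example fleet with made-up instance ids and ages.
now = datetime(2023, 5, 1, tzinfo=timezone.utc)
fleet = [
    Runner("i-new", now - timedelta(hours=2), busy=False),
    Runner("i-old", now - timedelta(days=3), busy=False),
    Runner("i-busy", now - timedelta(days=9), busy=True),
]

# When demand drops and one runner is surplus, the oldest idle one goes first.
to_terminate = pick_runners_to_terminate(fleet, excess=1)
```

Because demand fluctuates, which instances end up terminated varies over time, but older idle instances tend to be recycled first, which limits how much leftover state any one runner accumulates.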

Instance definitions can be found here: pytorch/test-infra#4072

  • ✅ migrate windows.4xlarge to windows.4xlarge.nonephemeral instances under pytorch/pytorch (Migrate jobs from windows.4xlarge windows.4xlarge.nonephemeral instances #100377)
  • 📣 migrate windows.8xlarge.nvidia.gpu to windows.8xlarge.nvidia.gpu.nonephemeral instances under pytorch/pytorch (Migrate jobs from windows.8xlarge.nvidia.gpu to nonephemeral #104404)
  • ⏳ submit PRs to all repositories under the pytorch/ organization to migrate windows.4xlarge to windows.4xlarge.nonephemeral
  • ⏳ submit PRs to all repositories under the pytorch/ organization to migrate windows.8xlarge.nvidia.gpu to windows.8xlarge.nvidia.gpu.nonephemeral
  • ⏳ decommission windows.4xlarge and windows.8xlarge.nvidia.gpu
  • ⏳ evaluate and start the work on adopting windows.g5.4xlarge.nvidia.gpu to replace windows.8xlarge.nvidia.gpu.nonephemeral in other repositories and use cases (proposed by @huydhn)
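For the per-repository migration PRs in the checklist above, the label change could be applied with a small script along these lines. This is a minimal sketch, not the tooling actually used: the `migrate_runner_labels` helper is hypothetical, and only the label names come from this issue.

```python
import re

# Old label -> new label, taken from the checklist in this issue.
LABEL_MIGRATIONS = {
    "windows.4xlarge": "windows.4xlarge.nonephemeral",
    "windows.8xlarge.nvidia.gpu": "windows.8xlarge.nvidia.gpu.nonephemeral",
}


def migrate_runner_labels(workflow_text: str) -> str:
    """Rewrite ephemeral Windows runner labels to their non-ephemeral names.

    The lookarounds require the label to stand alone, so labels that already
    end in `.nonephemeral` (or other longer labels) are left untouched and
    running the function twice gives the same result.
    """
    for old, new in LABEL_MIGRATIONS.items():
        pattern = re.compile(r"(?<![\w.-])" + re.escape(old) + r"(?![\w.-])")
        workflow_text = pattern.sub(new, workflow_text)
    return workflow_text


# Made-up workflow fragment for illustration.
sample = (
    "runs-on: windows.4xlarge\n"
    "runner: windows.8xlarge.nvidia.gpu\n"
    "other: windows.4xlarge.nonephemeral\n"
)
migrated = migrate_runner_labels(sample)
```

Matching whole labels rather than plain substrings matters here because each new label contains the old one as a prefix; a naive string replace would produce `windows.4xlarge.nonephemeral.nonephemeral` on files that were already migrated.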

The reason for this phased approach is to narrow the set of possible causes to investigate if particular CI jobs misbehave.

Alternatives

No response

Additional context

No response

cc @seemethere @malfet @pytorch/pytorch-dev-infra

Metadata

Labels: module: ci, triaged

Status: Done