
CI flakiness when using shared Linux runners across PyTorch org #85778

@huydhn

Description

🐛 Describe the bug

With the recent change to share runners across the PyTorch org as part of project Nova, we are seeing a new group of infra flakiness issues in Linux CI in which jobs fail when they run on runners carrying unexpected changes left behind by domain libraries. The underlying problem is that, unlike PyTorch, some domain libraries don't use Docker for containerization, which invalidates our assumption of runner immutability.

Here is a concrete example:

  1. torchrec uses the linux.4xlarge.nvidia.gpu runner and has its own way of installing the NVIDIA driver and CUDA (11.3) from the RHEL yum repo
  2. By chance, if the same runner picks up a PyTorch CUDA test job, e.g. https://github.com/pytorch/pytorch/actions/runs/3099236524/jobs/5018444060, the job fails due to a conflicting NVIDIA driver installation (our NVIDIA installation script on S3 vs. the RHEL repo); a detection sketch follows below
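
For a job like the one above, a pre-flight check could at least surface the contamination explicitly instead of failing partway through the driver setup. Below is a minimal sketch, assuming the leftover install shows up as yum/rpm packages; the package-name prefixes and the fail-fast behavior are illustrative guesses, not part of any existing CI script:

    # Hypothetical pre-flight check for the contaminated-runner case above.
    # Package-name prefixes and exit behavior are assumptions for illustration.
    import subprocess
    import sys

    def yum_installed_nvidia_packages():
        # "rpm -qa" lists everything the distro package manager installed; any
        # nvidia/cuda driver entry means a previous job set up the driver
        # outside of the S3 installation script PyTorch CI expects.
        out = subprocess.run(["rpm", "-qa"], capture_output=True, text=True, check=True)
        return [p for p in out.stdout.splitlines()
                if p.startswith(("nvidia-driver", "cuda-drivers", "kmod-nvidia"))]

    if __name__ == "__main__":
        leftovers = yum_installed_nvidia_packages()
        if leftovers:
            print("Runner has a package-manager NVIDIA install left over from another repo:")
            print("\n".join("  " + p for p in leftovers))
            sys.exit(1)  # fail early with a clear message instead of a confusing driver conflict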

We can fix such issues whenever they appear, but it's hard to know where they come from. Some domain libraries, e.g. torchrec, haven't been integrated with PyTorch bot yet, so there is no information about them on Rockset. It would arguably be better to find a way to address the root cause. Here are some thoughts:

  • Make the Linux runners immutable, or mark any runner used by a non-PyTorch repo as tainted so that it can be recreated (a rough sketch of this follows after the list)
  • Nudge everyone to start using Docker in their CI
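
To make the first option a bit more concrete, here is a rough sketch of a post-job "taint and recreate" hook, assuming org-level self-hosted runners, an admin token in GH_ADMIN_TOKEN, and a /var/run/runner-tainted marker written by the job; all of those names are placeholders rather than existing infrastructure:

    # Hypothetical "taint and recreate" hook: if the finished job mutated the host
    # (e.g. installed a driver outside Docker), deregister the runner so the
    # autoscaler provisions a fresh instance instead of reusing this one.
    # ORG, the marker file, and GH_ADMIN_TOKEN are assumptions for illustration.
    import os
    import requests

    GITHUB_API = "https://api.github.com"
    ORG = "pytorch"

    def deregister_runner(runner_name, token):
        headers = {"Authorization": f"Bearer {token}",
                   "Accept": "application/vnd.github+json"}
        # Look up the runner's numeric id from its name (pagination ignored for brevity).
        resp = requests.get(f"{GITHUB_API}/orgs/{ORG}/actions/runners",
                            headers=headers, timeout=30)
        resp.raise_for_status()
        runner_id = next(r["id"] for r in resp.json()["runners"] if r["name"] == runner_name)
        # Removing the registration lets the scale-up logic replace the instance
        # rather than hand the mutated host to the next PyTorch job.
        requests.delete(f"{GITHUB_API}/orgs/{ORG}/actions/runners/{runner_id}",
                        headers=headers, timeout=30).raise_for_status()

    if __name__ == "__main__":
        if os.path.exists("/var/run/runner-tainted"):  # marker written by the job when it mutates the host
            deregister_runner(os.environ["RUNNER_NAME"], os.environ["GH_ADMIN_TOKEN"])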

Versions

Linux CI

cc @seemethere @malfet @ZainRizvi @pytorch/pytorch-dev-infra

Metadata

    Labels

    module: ci (Related to continuous integration), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

    Projects

    Status: Done
