
CI flakiness when using shared Linux runners across PyTorch org #85778

@huydhn

Description

🐛 Describe the bug

With the recent change to share runners across the PyTorch org as part of project Nova, we are seeing a new group of infra flakiness issues in Linux CI in which jobs fail when they run on runners carrying unexpected changes left behind by domain libraries. The underlying problem is that, unlike PyTorch, some domain libraries don't use Docker for containerization, which invalidates our assumption of runner immutability.

Here is a concrete example:

  1. torchrec uses the linux.4xlarge.nvidia.gpu runner and has its own way of installing the NVIDIA driver and CUDA (11.3) from the RHEL yum repo
  2. By chance, if the same runner picks up a PyTorch CUDA test job, e.g. https://github.com/pytorch/pytorch/actions/runs/3099236524/jobs/5018444060, the job fails due to a conflicting NVIDIA driver installation (our NVIDIA installation script on S3 vs. the RHEL repo); a detection sketch follows below
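
For a job like the one above, a pre-flight check could at least surface the contamination explicitly instead of failing partway through the driver setup. Below is a minimal sketch, assuming the leftover install shows up as yum/rpm packages; the package-name prefixes and the fail-fast behavior are illustrative guesses, not part of any existing CI script:

    # Hypothetical pre-flight check for the contaminated-runner case above.
    # Package-name prefixes and exit behavior are assumptions for illustration.
    import subprocess
    import sys

    def yum_installed_nvidia_packages():
        # "rpm -qa" lists everything the distro package manager installed; any
        # nvidia/cuda driver entry means a previous job set up the driver
        # outside of the S3 installation script PyTorch CI expects.
        out = subprocess.run(["rpm", "-qa"], capture_output=True, text=True, check=True)
        return [p for p in out.stdout.splitlines()
                if p.startswith(("nvidia-driver", "cuda-drivers", "kmod-nvidia"))]

    if __name__ == "__main__":
        leftovers = yum_installed_nvidia_packages()
        if leftovers:
            print("Runner has a package-manager NVIDIA install left over from another repo:")
            print("\n".join("  " + p for p in leftovers))
            sys.exit(1)  # fail early with a clear message instead of a confusing driver conflict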

We can fix such issues whenever they appear, but it's hard to know where they come from. Some domain libraries, e.g. torchrec, haven't been integrated with PyTorch bot yet, so there is no information about them on Rockset. It would arguably be better to find a way to address the root cause. Here are some thoughts:

  • Make the Linux runners immutable, or mark any runner used by a non-PyTorch repo as tainted so that it can be recreated (a rough sketch of this follows after the list)
  • Nudge everyone to start using Docker in their CI
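
To make the first option a bit more concrete, here is a rough sketch of a post-job "taint and recreate" hook, assuming org-level self-hosted runners, an admin token in GH_ADMIN_TOKEN, and a /var/run/runner-tainted marker written by the job; all of those names are placeholders rather than existing infrastructure:

    # Hypothetical "taint and recreate" hook: if the finished job mutated the host
    # (e.g. installed a driver outside Docker), deregister the runner so the
    # autoscaler provisions a fresh instance instead of reusing this one.
    # ORG, the marker file, and GH_ADMIN_TOKEN are assumptions for illustration.
    import os
    import requests

    GITHUB_API = "https://api.github.com"
    ORG = "pytorch"

    def deregister_runner(runner_name, token):
        headers = {"Authorization": f"Bearer {token}",
                   "Accept": "application/vnd.github+json"}
        # Look up the runner's numeric id from its name (pagination ignored for brevity).
        resp = requests.get(f"{GITHUB_API}/orgs/{ORG}/actions/runners",
                            headers=headers, timeout=30)
        resp.raise_for_status()
        runner_id = next(r["id"] for r in resp.json()["runners"] if r["name"] == runner_name)
        # Removing the registration lets the scale-up logic replace the instance
        # rather than hand the mutated host to the next PyTorch job.
        requests.delete(f"{GITHUB_API}/orgs/{ORG}/actions/runners/{runner_id}",
                        headers=headers, timeout=30).raise_for_status()

    if __name__ == "__main__":
        if os.path.exists("/var/run/runner-tainted"):  # marker written by the job when it mutates the host
            deregister_runner(os.environ["RUNNER_NAME"], os.environ["GH_ADMIN_TOKEN"])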

Versions

Linux CI

cc @seemethere @malfet @ZainRizvi @pytorch/pytorch-dev-infra

Metadata

    Labels

    module: ci (Related to continuous integration), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

    Projects

    Status: Done
