🐛 Describe the bug
With the recent change to share runners across the PyTorch org in the context of project Nova, we are seeing a new group of infra flakiness issues in Linux CI in which jobs fail when they run on runners that have been mutated by domain-library jobs. The underlying problem is that, unlike PyTorch, some domain libraries don't use Docker for containerization. This invalidates our assumption about runner immutability.
Here is a concrete example:
- torchrec uses the linux.4xlarge.nvidia.gpu runner and has its own way of installing the NVIDIA driver and CUDA (11.3) from the RHEL yum repo
- By chance, if the same runner picks up a PyTorch CUDA test job, i.e. https://github.com/pytorch/pytorch/actions/runs/3099236524/jobs/5018444060, the job fails due to a conflicting NVIDIA driver installation (our NVIDIA installation script on S3 vs. the RHEL repo)
We can fix such issues whenever they appear, but it's hard to know where they come from. Some domain libraries, i.e. torchrec, haven't been integrated with PyTorch bot yet, so there is no information about them on Rockset. It's arguably better if we could figure out a way to address the root cause here. Here are some thoughts:
- Make the Linux runners immutable, or somehow mark a runner used by a non-PyTorch repo as tainted so that it can be recreated (see the sketch after this list)
- Nudge everyone to start using Docker in their CI
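As a rough illustration of the first idea, something like the following pre-job health check could run on a shared GPU runner and flag it for recreation if the host driver no longer looks pristine. This is only a sketch: the expected driver series, the taint-marker path, and how the autoscaler actually picks up the taint are all hypothetical and would need to be wired into our runner infrastructure.

```python
#!/usr/bin/env python3
"""Pre-job health check for a shared GPU runner (illustrative sketch only).

Assumptions (hypothetical, not existing CI code): EXPECTED_DRIVER_PREFIX and
TAINT_MARKER are placeholders; a real implementation would hook into the
runner autoscaler's recycle mechanism instead of just touching a file.
"""
import pathlib
import subprocess
import sys
from typing import Optional

EXPECTED_DRIVER_PREFIX = "515."                          # hypothetical expected driver series
TAINT_MARKER = pathlib.Path("/var/tmp/runner-tainted")   # hypothetical taint marker


def host_driver_version() -> Optional[str]:
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip().splitlines()[0]
    except (OSError, subprocess.CalledProcessError, IndexError):
        return None


def rpm_driver_installed() -> bool:
    """Detect an NVIDIA driver installed from an RPM repo (e.g. by a domain-library job)."""
    out = subprocess.run(["rpm", "-qa", "nvidia-driver*"], capture_output=True, text=True)
    return bool(out.stdout.strip())


def main() -> int:
    version = host_driver_version()
    if version is None or not version.startswith(EXPECTED_DRIVER_PREFIX) or rpm_driver_installed():
        # Mark the runner as tainted so it gets recreated instead of letting the
        # next job fail on a conflicting driver installation.
        TAINT_MARKER.touch()
        print(f"runner tainted: driver={version!r}, rpm driver present={rpm_driver_installed()}")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A check like this could run as the first step of every job on shared runners, so a mutated host fails fast and gets recycled rather than producing confusing driver-conflict errors mid-test.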
Versions
Linux CI
cc @seemethere @malfet @ZainRizvi @pytorch/pytorch-dev-infra