[CI] Switch release_model H100 job to linux_job_v3 (OSDC/ARC) #4457
huydhn wants to merge 1 commit
Migrate the release_model H100 job from EC2 linux_job_v2 to the OSDC linux_job_v3 reusable workflow on the ARC runner (linux.aws.h100 -> mt-l-x86iamx-22-225-h100), and pre-install torch's pure-python deps from the in-cluster pypi-cache so the nightly cu126 install doesn't reach the cache-enforcer-blocked files.pythonhosted.org.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4457

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)
As of commit 89e7458 with merge base 9c010ae:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.
    pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy pillow
    # Clear PIP_EXTRA_INDEX_URL (the runner's default cpu /whl/cpu/) so it can't supply a
    # +cpu torch; the torch-spec's --index-url makes the literal nightly cu126 index the only source.
    PIP_EXTRA_INDEX_URL= pip install ${{ matrix.torch-spec }}
Err, is it documented anywhere?
First I want to set some context here:
- OSDC runners are not allowed to access PyPI or other registries directly until we can increase the number of public-facing IP addresses, so access to files.pythonhosted.org is blocked by the firewall. Only download.pytorch.org is reachable, following "Allow direct access to download.pytorch.org from runners" (ci-infra#545), as LF is the owner there.
- The runner automatically sets some local environment variables, such as `PIP_EXTRA_INDEX_URL`, to point to the local cache at http://pypi-cache-cpu.pypi-cache.svc.cluster.local:8080.

These are 2 new pypi-cache compatibility issues that I found today with this setup, one issue per line. I have created 2 tracking issues for them to follow up on next week:

1. The first issue is that the download.pytorch.org index points to files.pythonhosted.org for some common Python packages, for example https://download.pytorch.org/whl/nightly/filelock. I observed that a command like `pip install torch --index-url https://download.pytorch.org/whl/nightly/cu130`, which pins the index to download.pytorch.org, can fetch every package hosted on download.pytorch.org but not those hosted on files.pythonhosted.org. Running `pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy pillow` is a quick way to fetch these packages separately, without setting the download.pytorch.org index. It works because it correctly uses the local cache without referring to download.pytorch.org. Tracking issue: "[pypi-cache] download.pytorch.org index links pure-python deps to files.pythonhosted.org (unreachable on OSDC)" (ci-infra#660).

2. The second issue is that `PIP_EXTRA_INDEX_URL` is wrongly set to `http://pypi-cache-cpu.pypi-cache.svc.cluster.local:8080/whl/cpu/`. With this variable set, torch CPU is somehow preferred over the correct CUDA version. Unsetting it is a quick way to tell pip to look at https://download.pytorch.org/whl/nightly/cu130 instead. Tracking issue: "[pypi-cache] Runners default PIP_EXTRA_INDEX_URL to the cpu slug on all archs → pip installs +cpu torch on GPU jobs" (ci-infra#661).

I don't have the full context on these behaviors yet and need more time to look into them. (1) is a gotcha while (2) is a bug. So these changes are here to unblock torchao CI if needed. I agree that we shouldn't need them once (1) and (2) are fixed.
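
On the "is it documented anywhere?" question for the `PIP_EXTRA_INDEX_URL= pip install …` form: it relies on standard POSIX shell behavior, where a variable assignment placed before a command applies to that command's environment only. The sketch below illustrates just that mechanism. It does not call pip; `sh -c "$PROBE"` is a hypothetical stand-in that prints the value a child process would see, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: `sh -c "$PROBE"` is a hypothetical stand-in for `pip install`;
# it just prints the extra index URL that the child process sees.
PROBE='printf "extra-index=[%s]\n" "$PIP_EXTRA_INDEX_URL"'

# Simulate the runner default described in the thread.
export PIP_EXTRA_INDEX_URL="http://pypi-cache-cpu.pypi-cache.svc.cluster.local:8080/whl/cpu/"

# The child process inherits the exported runner default.
run_with_default() {
    sh -c "$PROBE"
}

# Prefix assignment: the child sees an empty value, the parent shell keeps its own.
run_with_cleared() {
    PIP_EXTRA_INDEX_URL= sh -c "$PROBE"
}

run_with_default
run_with_cleared
run_with_default   # the default is back: the override did not leak
```

Note that the prefix form sets the variable to an empty string for that one command rather than removing it from the environment; the later steps of the job still see the runner default.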
    # Pre-install torch/vision's pure-python deps from the in-cluster pypi-cache for speed.
    pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy pillow
    # Clear PIP_EXTRA_INDEX_URL (the runner's default cpu /whl/cpu/) so it can't supply a
    # +cpu torch; the torch-spec's --index-url makes the literal nightly cu126 index the only source.
    PIP_EXTRA_INDEX_URL= pip install ${{ matrix.torch-spec }}
Why can't this section be left as is?
What
Migrates the `release_model.yml` H100 job from EC2 `linux_job_v2` to the OSDC (ARC) `linux_job_v3` reusable workflow.
- `linux.aws.h100` → `mt-l-x86iamx-22-225-h100` (from `.github/arc.yaml`).
- `uses: …/linux_job_v3.yml@main`.
- `cache-enforcer` iptables-blocks `files.pythonhosted.org`, so the deps (which the nightly index references there) must come from the reachable cache; `mslk` is a `+cuXXX` wheel on the pytorch index and stays in `torch-spec`.

Part of the torchao H100/A100 → OSDC migration. Companion PRs: #4456 (1x/4xH100). Depends on `linux_job_v3` (now on `pytorch/test-infra@main`).
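
Because the failure mode here is pip silently picking a `+cpu` torch on a GPU job, a small guard after the install step could make it fail loudly. The following is a minimal sketch, not part of this PR: `is_cuda_build` is a hypothetical helper, and it assumes the installed wheel carries a `+cuXXX` local version tag as the description above implies.

```shell
#!/bin/sh
# Hypothetical helper (not in this PR): succeed only when the given version
# string carries a CUDA local tag such as +cu126, and fail for +cpu or no tag.
is_cuda_build() {
    case "$1" in
        *+cu[0-9]*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Possible use in the job script after the install step (left commented out
# because it needs torch to be installed):
#   version="$(python -c 'import torch; print(torch.__version__)')"
#   is_cuda_build "$version" || { echo "unexpected torch build: $version" >&2; exit 1; }
```

The check inspects only the version string, so it is cheap, but it would not catch a CUDA wheel built for a different CUDA version than the one the job expects.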