Skip to content

Conversation

clee2000
Copy link
Contributor

@clee2000 clee2000 commented Apr 2, 2025

It is hard to test the docker images that are built for binaries because the the binary workflows are hard coded to run on an image from docker io, with the main tag. To test, you have to make a change to generate_binary_build_matrix to fetch the correct tag from aws ecr and open a separate PR to test

insert example pr here from when i tested nccl

The main idea is to make the binary docker build more similar to the CI docker builds, where if .ci/docker folder is changed, a new docker image gets built that is identified by the hash of .ci/docker folder. Then CI jobs pull this docker image identified by the folder. For the binary docker images, this includes pushing docker images to ecr in addition to docker.io.

The main change is using calculate docker image everywhere and renaming things to use the new convention with docker tag prefix separate from docker image name

Overview:

  • building: if on ciflow/binaries, upload to aws ecr. If on main/version tag, push to aws ecr and docker io
  • aws ecr images get the -foldersha tag like CI. docker io gets the original -main tag, the -headsha tag, and a -foldersha tag. foldersha is the hash of .ci/docker
  • fetching: Always fetch the foldersha tag. if is the workflow is on ciflow/binaries, fetch from aws ecr. If on main or version tag, pull from docker io

Cons

  • This doesn't work for s390x because they would need read/write access to ecr which they don't have. Could be fixed with an OIDC role? s390x docker builds are also just weird in general
  • Binary builds will spin wait for a docker image to finish building (most builds are really fast though <15min)
  • Idk how to test this on main

Notes:

  • Rebuild whenever .ci/docker changes because we need a way to differentiate between docker images built from different changes to the docker script to ensure that we can pull the specific tag for that change. CI does this by tagging every docker image with the hash of .ci/docker, which contains all the docker scripts. The binary docker images usually have a build script in .ci/docker/almalinux or manywheel, but it uses scripts from .ci/docker/common, so we need a hash that includes the common folder. We also need to make sure a docker image exists for every hash, or else the hash calculation is going to point someone to a nonexistent docker image.
    • Pro: Get to reuse calculate-docker-image
    • Con: This results in more image rebuilds than actually necessary
    • Con: The workflow file is not included in the hash
    • Cons could be resolved by using custom hash, but that seem like extra work
  • Reusable action for binary docker builds - they're all pretty much the same so I wanted to consolidate instead of putting a step for calculate docker image everywhere
    • Also converted some things to a matrix - not necessary but maybe we could put every binary docker build into 1 workflow with the matrix?
  • Changes to get rid of env vars like GPU_ARCH_TYPE - all the information necessary is already in the image name and tag
  • Script simplifications - I moved the docker pushes out of the script and into the workflow, so a lot of the variables are unneeded now. Also made some things more similar to the .ci/docker/build.sh script used for CI images
  • changes to calculate docker image action in test infra - accept a custom tag prefix, so the container name will be imagename:tag-prefix-folderhash if tag-prefix is given, and imagename:folderhash if not

Copy link

pytorch-bot bot commented Apr 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150558

Note: Links to docs will display an error until the docs builds have been completed.

❌ 71 New Failures

As of commit 71992a5 with merge base 91923f0 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Apr 2, 2025
@clee2000 clee2000 force-pushed the csl/unify_manywheel_docker_pull branch from d00587a to 9852c18 Compare April 2, 2025 20:18
@clee2000 clee2000 changed the title [CI][docker] Make many wheel + other nightly binary docker images have same name as PR [test] binary docker builds Apr 2, 2025
@clee2000 clee2000 added ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR ciflow/binaries Trigger all binary build and upload jobs on the PR and removed ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR ciflow/binaries Trigger all binary build and upload jobs on the PR labels Apr 2, 2025
@clee2000 clee2000 force-pushed the csl/unify_manywheel_docker_pull branch 6 times, most recently from 40b35c9 to b666376 Compare April 3, 2025 22:57
@seemethere
Copy link
Member

This doesn't work for s390x and maybe xpu because they would need write access to ecr which they don't have. Could be fixed with an OIDC role?

How do we achieve this for regular pull request builds nowadays?

I feel like they'd need to have this in order to actually be able to test pull requests

@clee2000 clee2000 force-pushed the csl/unify_manywheel_docker_pull branch 2 times, most recently from 4d9a7be to 45dc37e Compare April 3, 2025 23:23
@clee2000 clee2000 removed the ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR label Apr 3, 2025
@clee2000
Copy link
Contributor Author

clee2000 commented Apr 3, 2025

This doesn't work for s390x and maybe xpu because they would need write access to ecr which they don't have. Could be fixed with an OIDC role?

How do we achieve this for regular pull request builds nowadays?

I feel like they'd need to have this in order to actually be able to test pull requests

Currently s390x doesn't push the image for PRs, just builds it

@seemethere
Copy link
Member

This doesn't work for s390x and maybe xpu because they would need write access to ecr which they don't have. Could be fixed with an OIDC role?

How do we achieve this for regular pull request builds nowadays?
I feel like they'd need to have this in order to actually be able to test pull requests

Currently s390x doesn't push the image for PRs, just builds it

Building the image every time wouldn't be terrible as long as we had upstream cache

@clee2000 clee2000 force-pushed the csl/unify_manywheel_docker_pull branch 2 times, most recently from 9592076 to dbf0750 Compare April 4, 2025 16:24
@clee2000 clee2000 added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Apr 4, 2025
@clee2000 clee2000 force-pushed the csl/unify_manywheel_docker_pull branch 3 times, most recently from 119c2e4 to b8122a0 Compare April 7, 2025 16:15
@clee2000 clee2000 force-pushed the csl/unify_manywheel_docker_pull branch from b8122a0 to 94e34f8 Compare April 14, 2025 20:29
@clee2000 clee2000 force-pushed the csl/unify_manywheel_docker_pull branch 2 times, most recently from 9079033 to 1c68f2d Compare April 16, 2025 17:02
@clee2000 clee2000 force-pushed the csl/unify_manywheel_docker_pull branch 2 times, most recently from 65f226b to 675f14b Compare April 16, 2025 19:20
@clee2000 clee2000 force-pushed the csl/unify_manywheel_docker_pull branch from 675f14b to 71992a5 Compare April 16, 2025 20:22
@clee2000
Copy link
Contributor Author

I'm going to try splitting this up into smaller PRs

pytorchmergebot pushed a commit that referenced this pull request Apr 16, 2025
This is part of splitting up #150558 into smaller chunks, please see that for more context

Uses calculate docker image with the new custom tag prefix, so the naming convention of the docker images is slightly different for images built on PR

based off of https://github.com/pytorch/pytorch/blob/a582f046084d1ea49b2a253ece15a4d6157f2579/.github/workflows/build-manywheel-images.yml#L101

Also moves the push of the docker images from inside the build scripts to inside the workflow

Currently not used anywhere, but the binary docker builds are very similar so I'm going to change them to use this instead

Pull Request resolved: #151471
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/ZainRizvi
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
This is part of splitting up pytorch#150558 into smaller chunks, please see that for more context

Uses calculate docker image with the new custom tag prefix, so the naming convention of the docker images is slightly different for images built on PR

based off of https://github.com/pytorch/pytorch/blob/a582f046084d1ea49b2a253ece15a4d6157f2579/.github/workflows/build-manywheel-images.yml#L101

Also moves the push of the docker images from inside the build scripts to inside the workflow

Currently not used anywhere, but the binary docker builds are very similar so I'm going to change them to use this instead

Pull Request resolved: pytorch#151471
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/ZainRizvi
pytorchmergebot pushed a commit that referenced this pull request Apr 17, 2025
#151483)

This is part of splitting up #150558 into smaller chunks, please see that for more context

Use the binary docker build action from #151471

Change the workflow trigger to be all of .ci/docker so it will make a new image + tag whenever it changes.

build script:
* change to be independent of the CUDA_VERSION env var, since all the info should be in the imagename:tag
* remove docker push parts since that will happen during the workflow
* clean up a bit
* make the build script more like the CI build script (use a temp image name)

I don't think this image is actually used anywhere

Also push docker image to imagename:tag, I got rid of it in the PR making the reusable workflow since I thought it was not in the original scripts but it actually is there
Pull Request resolved: #151483
Approved by: https://github.com/ZainRizvi
pytorchmergebot pushed a commit that referenced this pull request Apr 18, 2025
This is part of splitting up #150558 into smaller chunks, please see that for more context

Similar to #151483 but for libtorch

Changed the job name

Testing:
Can't really test since PRs don't have the credentials to push to docker io, which is the image used for everything, including PRs right now
Pull Request resolved: #151488
Approved by: https://github.com/atalman
pytorchmergebot pushed a commit that referenced this pull request Apr 18, 2025
This is part of splitting up #150558 into smaller chunks, please see that for more context

Similar to #151483 but for manywheel

Changed the job name

s390x doesn't have access to aws ecr so it doesn't use the action.  manylinuxs390x-builder ecr repo doesn't exist in docker hub so idk why the image name is that

Testing:
Can't really test since PRs don't have the credentials to push to docker io, which is the image used for everything, including PRs right now
Pull Request resolved: #151489
Approved by: https://github.com/seemethere
@clee2000 clee2000 closed this Apr 22, 2025
@clee2000 clee2000 changed the title [test] binary docker builds Binary docker builds - use image tagged with folder sha Apr 22, 2025
pytorchmergebot pushed a commit that referenced this pull request Apr 22, 2025
Should be the last part of #150558, except for maybe s390x stuff, which I'm still not sure what's going on there

For binary builds, do the thing like we do in CI where we tag each image with a hash of the .ci/docker folder to ensure a docker image built from that commit gets used.  Previously it would use imagename:arch-main, which could be a version of the image based on an older commit

After this, changing a docker image and then tagging with ciflow/binaries on the same PR should use the new docker images

Release and main builds should still pull from docker io

Cons:
* if someone rebuilds the image from main or a PR where the hash is the same (ex folder is unchanged, but retrigger docker build for some reason), the release would use that image instead of one built on the release branch
* spin wait for docker build to finish
Pull Request resolved: #151706
Approved by: https://github.com/atalman
@github-actions github-actions bot deleted the csl/unify_manywheel_docker_pull branch May 28, 2025 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/binaries Trigger all binary build and upload jobs on the PR topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants