fix(docker) rocm 6.3 based image #8152
Conversation
docker/Dockerfile
Outdated
uv sync --frozen
uv venv --python 3.12 && \
# Use the public version to install existing known dependencies but using the UV_INDEX, not the hardcoded URLs within the uv.lock
uv pip install invokeai
I could conditionalize this logic to use the uv.lock for CUDA and the UV_INDEX for CPU and ROCm, which would reduce the risk of this change, but I went with this approach for consistency.
It would be preferable to continue using uv.lock for the CUDA images, if possible, to keep it consistent with the installations produced by the official installer.
Ideally - if you're willing to work on this - we should find a way to support both cuda and rocm dependencies in a single uv.lock/pyproject.toml, perhaps by leveraging the uv dependency groups: https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies
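For reference, a minimal sketch of what that could look like in pyproject.toml - the extra names, index names, URLs, and version pins below are illustrative assumptions, not what this PR actually uses:

```toml
# pyproject.toml (sketch): mutually exclusive cuda/rocm extras resolved in one uv.lock
[project.optional-dependencies]
cuda = ["torch~=2.4", "torchvision"]
rocm = ["torch~=2.4", "torchvision"]

[tool.uv]
# Tell uv these extras conflict, so both can be locked but never installed together.
conflicts = [[{ extra = "cuda" }, { extra = "rocm" }]]

[tool.uv.sources]
torch = [
  { index = "pytorch-cuda", extra = "cuda" },
  { index = "pytorch-rocm", extra = "rocm" },
]
torchvision = [
  { index = "pytorch-cuda", extra = "cuda" },
  { index = "pytorch-rocm", extra = "rocm" },
]

[[tool.uv.index]]
name = "pytorch-cuda"
url = "https://download.pytorch.org/whl/cu124"
explicit = true

[[tool.uv.index]]
name = "pytorch-rocm"
url = "https://download.pytorch.org/whl/rocm6.2"
explicit = true
```

The Dockerfile could then pick the flavor with something like `uv sync --frozen --extra cuda` or `uv sync --frozen --extra rocm`, keeping both variants pinned by the same uv.lock.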
Updated the uv.lock; there are some notes in the pyproject.toml that I would like your input on.
Thanks for the contribution - left some comments to address
docker/Dockerfile
Outdated
wget -O /tmp/amdgpu-install.deb \
  https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb && \
apt install -y /tmp/amdgpu-install.deb && \
apt update && \
amdgpu-install --usecase=rocm -y && \
apt-get autoclean && \
apt clean && \
rm -rf /tmp/* /var/tmp/* && \
This is likely unnecessary. The GPU driver should be provided by the kernel, and ROCm itself is usually not needed in the image because it's already bundled with PyTorch - unless something changed in the most recent torch/ROCm that makes this a requirement.
(To be clear, the video/render group additions for the ubuntu user are needed and should be kept.)
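For context, the usual way to expose AMD GPUs to a container that only ships the ROCm-enabled PyTorch wheels is to pass the kernel device nodes and groups through at run time - roughly like the sketch below (the image tag is a placeholder):

```bash
# Sketch: run the image with the host's KFD/DRI devices passed through,
# relying on the ROCm runtime libraries bundled in the PyTorch wheels.
docker run --rm -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add render \
  <invokeai-rocm-image>
# Note: --group-add also accepts a numeric GID, which helps when the host's
# render group GID differs from the one inside the container.
```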
Skipped the ROCm install but kept the groups, and got:
invokeai-rocm-1 | RuntimeError: No HIP GPUs are available
But there are 4 AMD GPUs on my system, so it's failing. I went and looked at the rocm-pytorch Docker image, and they install the full rocm-dev; I limited it to just the ROCm binaries (I also tried HIP alone, but that still errored).
Suggestions?
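One way to narrow this down is to check whether the installed torch wheel is actually a ROCm build and whether it can enumerate devices - a diagnostic sketch, assuming the venv's python is on PATH inside the container:

```bash
# Confirm the wheel is a ROCm build and that HIP devices are visible.
python -c "import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.device_count())"
# torch.version.hip == None -> a CPU/CUDA wheel was installed (wrong index)
# is_available() == False   -> the wheel is fine, but /dev/kfd and /dev/dri
#                              aren't accessible to the current user
```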
To be sure - are you using the amd-container-toolkit and the amd runtime for Docker?
No - that's my goal: I don't want to have to modify the host, and I want to ensure the container has everything it needs. I'm running a Proxmox host with a Docker LXC.
If this isn't ideal, I can split that logic out into my own build and have this one build the minimal way, or make it another config, e.g. rocm-standalone?
elif [ "$GPU_DRIVER" = "rocm" ]; then UV_INDEX="https://download.pytorch.org/whl/rocm6.2"; \ | ||
# Cannot use the uv.lock as that is locked to CUDA version packages, which breaks rocm... | ||
# --mount=type=bind,source=uv.lock,target=uv.lock \ | ||
ulimit -n 30000 && \ |
This ulimit doesn't affect much - wondering what the reason is for it here, and why the value of 30000?
CUDA and CPU don't hit the limit, but with ROCm the build fails because too many files are being opened. I can try to lower the limit if it concerns you - I just set it to something high, was able to continue, and never went back.
It doesn't matter much - this only applies during the build; it's just really weird that it's needed at all.
The image builds from this PR, but fails to start (large traceback omitted).
This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way. I also rebased on main.
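A quick way to confirm which index torch/torchvision actually resolved from is to look at their local version tags inside the built image - a diagnostic sketch, not part of this PR:

```bash
# Wheels from https://download.pytorch.org/whl/rocm6.2 carry a "+rocm6.2" suffix;
# a "+cuXXX" suffix or no suffix means the package came from a different index.
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
```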
Yup - updated the pins, uv.lock, and Dockerfile to ensure it's all in sync. Please give it another try.
OK, thank you - the image builds now, but it only works on CPU. I haven't been able to get it to use the HIP device.

Interestingly, this PR also balloons the image size to 56GB uncompressed - we won't be able to build it in CI. I am still fairly confident we don't need the full ROCm stack in the image, but we can circle back to that.

As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate in the future once we have a good working image.
So I started looking at using the amd-container-toolkit. It was a pain to get it installed in the LXC, but once I did, the container still failed. I started debugging, using these commands in the entrypoint script:
I get:
So something about gosu is messing it up, or a permission is missing somewhere, because only the ubuntu user can't see the GPUs. Thoughts? Proof: I removed the gosu and just ran invokeai-web as root, and:
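One way to pin down whether it's gosu or plain device permissions is to compare what root and the unprivileged user can see inside the running container - a diagnostic sketch, assuming the runtime user is ubuntu as above:

```bash
ls -l /dev/kfd /dev/dri/   # note the owning group and numeric GID of the device nodes
getent group render video  # compare the container's render/video GIDs with the host's
id ubuntu                  # check which groups the ubuntu user actually belongs to
gosu ubuntu python -c "import torch; print(torch.cuda.is_available())"
```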
…ch the host's render group ID
@ebr I figured it out: the render group GID inside the container does not match the render group on the host. This doesn't appear to be an issue with the full-ROCm install - I bet they force it to a specific group number to keep things consistent. So I made it an env input and groupmod it in the entrypoint script. Give it a read and tell me if you can think of a better way to map this.
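For reference, a minimal sketch of that entrypoint approach - the env variable name, group names, and final exec line are assumptions, not necessarily what the PR uses:

```bash
#!/usr/bin/env bash
set -euo pipefail

# If the caller passes the host's render group GID, remap the container's render
# group to it so the unprivileged user can open /dev/kfd and /dev/dri/renderD*.
# RENDER_GROUP_ID is a hypothetical variable name for this sketch.
if [[ -n "${RENDER_GROUP_ID:-}" ]]; then
    groupmod -o -g "${RENDER_GROUP_ID}" render
    usermod -aG render ubuntu
fi

exec gosu ubuntu invokeai-web "$@"
```

An alternative that avoids entrypoint changes is passing the host's numeric render GID at run time via `docker run --group-add <gid>`, which adds it as a supplementary group without needing a matching group inside the container.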
Summary
QA Instructions
Merge Plan
Checklist
What's New copy (if doing a release after this PR)