
fix(docker) rocm 6.3 based image #8152


Open
wants to merge 13 commits into main

Conversation

heathen711 (Contributor)

Summary

  1. Fix the run script to properly read GPU_DRIVER (see the sketch after this list)
  2. Clone and adjust the Dockerfile for a ROCm-based build
  3. Adjust docker-compose.yml to use the cloned ROCm Dockerfile
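
For context, a minimal sketch of the kind of GPU_DRIVER handling item 1 refers to (assumed file layout, variable names, and compose profiles; not the exact contents of this PR):

  # run.sh sketch: read GPU_DRIVER from the .env file and start the matching compose profile
  GPU_DRIVER=$(grep -E '^GPU_DRIVER=' .env | cut -d= -f2)
  GPU_DRIVER=${GPU_DRIVER:-cuda}          # default to cuda when unset
  case "$GPU_DRIVER" in
    cuda|rocm|cpu) ;;
    *) echo "Unknown GPU_DRIVER: $GPU_DRIVER" >&2; exit 1 ;;
  esac
  docker compose --profile "$GPU_DRIVER" up --build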

QA Instructions

Merge Plan

  1. Talk with devs for speed improvements to the docker build
  2. Investigate if this can be conditionalized into the original dockerbuild (this has issues, as the uv.lock only supports cuda/cpu environments)
  3. Test the build in production pipeline

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@heathen711 heathen711 changed the title from fix(docker) rocm 2.4.6 based image to fix(docker) rocm 6.2.4 based image Jul 3, 2025
@heathen711 heathen711 marked this pull request as ready for review July 3, 2025 06:03
Comment on lines 81 to 90
uv sync --frozen
uv venv --python 3.12 && \
# Use the public version to install existing known dependencies but using the UV_INDEX, not the hardcoded URLs within the uv.lock
uv pip install invokeai
Contributor Author

I could conditionalize this logic to use the uv.lock for CUDA and the UV_INDEX for CPU and ROCm, which would reduce the risk of this change, but I went with this approach for consistency.
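
For illustration, a rough sketch of what that conditional could look like inside the Dockerfile RUN step (a sketch under assumed build args; not the actual diff):

  # Assumes GPU_DRIVER and UV_INDEX are provided as build args
  if [ "$GPU_DRIVER" = "cuda" ]; then
    uv sync --frozen                                      # CUDA: keep the locked, reproducible install
  else
    uv venv --python 3.12 && \
    uv pip install --index-url "$UV_INDEX" invokeai       # CPU/ROCm: resolve against the matching PyTorch index
  fi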

Member

It would be preferable to continue using uv.lock for the CUDA images, if possible, to keep it consistent with the installations produced by the official installer.

Ideally - if you're willing to work on this - we should find a way to support both cuda and rocm dependencies in a single uv.lock/pyproject.toml, perhaps by leveraging the uv dependency groups: https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies
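
For example (hedged: the group names and any [tool.uv] conflicts configuration are hypothetical, not something already in pyproject.toml), each image could then select its group at install time:

  # Assuming pyproject.toml declares conflicting dependency groups "cuda" and "rocm",
  # each pulling torch/torchvision from its own index:
  uv sync --frozen --group cuda   # CUDA image
  uv sync --frozen --group rocm   # ROCm image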

Contributor Author

Updated the uv.lock; there are some notes about things in the pyproject.toml that I would like your input on.

@ebr ebr (Member) left a comment

Thanks for the contribution - left some comments to address

Comment on lines 95 to 102
wget -O /tmp/amdgpu-install.deb \
https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb && \
apt install -y /tmp/amdgpu-install.deb && \
apt update && \
amdgpu-install --usecase=rocm -y && \
apt-get autoclean && \
apt clean && \
rm -rf /tmp/* /var/tmp/* && \
Member

This is likely unnecessary. The GPU driver should be provided by the kernel, and ROCm itself is usually not needed in the image because it's already bundled with pytorch. That is, unless something changed in the most recent torch/rocm that makes this a requirement.

(To be clear, the video/render group additions for the ubuntu user are needed and should be kept.)
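
For reference, those group additions amount to something like this in the Dockerfile RUN step (sketch, assuming the runtime user is ubuntu):

  getent group render > /dev/null || groupadd render   # render group may not exist in the base image
  usermod -aG video,render ubuntu                       # let the container user access the GPU device nodes owned by video/render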

Contributor Author

Skipped the ROCm install but kept the groups, and got:
invokeai-rocm-1 | RuntimeError: No HIP GPUs are available
But there are 4 AMD GPUs on my system, so it's failing.

I went and looked at the rocm-pytorch Docker image; they install the full rocm-dev, so I limited it to just the ROCm binaries (I also tried HIP alone, but that still errored).

Suggestions?

Member

To be sure - are you using the amd-container-toolkit and the amd runtime for docker?

Contributor Author

No, and that's my goal: I don't want to have to modify the host, and I want the container to have everything it needs. I'm running a Proxmox host with a Docker LXC.
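
For context, plain device forwarding (no AMD container runtime on the host beyond the kernel driver) typically looks like this (illustrative sketch; the image tag is a placeholder):

  docker run --rm \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --group-add "$(getent group render | cut -d: -f3)" \
    <invokeai-rocm-image>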

Contributor Author

If this isn't ideal, I can split that logic out into my own build and have this one build the minimal way, or make it another config - rocm-standalone?

elif [ "$GPU_DRIVER" = "rocm" ]; then UV_INDEX="https://download.pytorch.org/whl/rocm6.2"; \
# Cannot use the uv.lock as that is locked to CUDA version packages, which breaks rocm...
# --mount=type=bind,source=uv.lock,target=uv.lock \
ulimit -n 30000 && \
Member

This ulimit doesn't affect much; I'm wondering what the reason for it is here, and why the value of 30000?

Contributor Author

CUDA and CPU don't hit the limit, but with ROCm the build fails because too many files are being opened. I can try to lower the limit if it concerns you; I just set it to something high and was able to continue, so I never went back.
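
For context, a minimal way to scope the raised limit to just that step (30000 is the arbitrary headroom from the diff; assumes the builder's hard limit allows it):

  ulimit -n                                                              # show the current soft limit on open files
  ulimit -Sn 30000 && uv pip install --index-url "$UV_INDEX" invokeai    # raise it only for this RUN step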

Member

It doesn't matter much since it only applies during build; it's just really weird that this is needed at all.

@github-actions bot added the Root and python-deps (PRs that change python dependencies) labels Jul 3, 2025
@heathen711 heathen711 requested a review from ebr July 3, 2025 20:09

heathen711 commented Jul 3, 2025

  Downloaded pytorch-triton-rocm
  × Failed to download `torch==2.7.1+rocm6.3`
  ├─▶ Failed to extract archive
  ╰─▶ failed to write to file
      `/home/runner/work/_temp/setup-uv-cache/.tmpOmavep/torch/lib/hipblaslt/library/TensileLibrary_HH_SH_A_Bias_SAV_Type_HS_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx90a.co`:
      No space left on device (os error 28)
  help: `torch` (v2.7.1+rocm6.3) was included because `invokeai` depends on
        `torch`

Downloading torch (4.2 GiB) is probably the culprit... I just don't understand why it's downloading the ROCm stuff; the default is not ROCm...

@heathen711 heathen711 requested a review from jazzhaiku as a code owner July 3, 2025 21:22
@github-actions bot added the CI-CD (Continuous integration / Continuous delivery) label Jul 3, 2025

ebr commented Jul 4, 2025

The image builds from this PR, but fails to start:

Full traceback:
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2154, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2184, in _get_module
    raise e
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2182, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 27, in <module>
    from ...image_processing_utils import ImageProcessingMixin
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_processing_utils.py", line 22, in <module>
    from .image_transforms import center_crop, normalize, rescale
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_transforms.py", line 22, in <module>
    from .image_utils import (
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_utils.py", line 59, in <module>
    from torchvision.transforms import InterpolationMode
  File "/opt/venv/lib/python3.12/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torchvision/_meta_registrations.py", line 163, in <module>
    @torch.library.register_fake("torchvision::nms")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/library.py", line 1023, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/opt/venv/lib/python3.12/site-packages/torch/library.py", line 214, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator torchvision::nms does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/loaders/single_file_model.py", line 26, in <module>
    from .single_file_utils import (
  File "/opt/venv/lib/python3.12/site-packages/diffusers/loaders/single_file_utils.py", line 52, in <module>
    from transformers import AutoImageProcessor
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2157, in __getattr__
    raise ModuleNotFoundError(
ModuleNotFoundError: Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/__init__.py", line 1, in <module>
    from .autoencoder_asym_kl import AsymmetricAutoencoderKL
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_asym_kl.py", line 23, in <module>
    from .vae import DecoderOutput, DiagonalGaussianDistribution, Encoder, MaskConditionDecoder
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/vae.py", line 25, in <module>
    from ..unets.unet_2d_blocks import (
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/__init__.py", line 6, in <module>
    from .unet_2d import UNet2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d.py", line 24, in <module>
    from .unet_2d_blocks import UNetMidBlock2D, get_down_block, get_up_block
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 36, in <module>
    from ..transformers.dual_transformer_2d import DualTransformer2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/transformers/__init__.py", line 5, in <module>
    from .auraflow_transformer_2d import AuraFlowTransformer2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/transformers/auraflow_transformer_2d.py", line 23, in <module>
    from ...loaders import FromOriginalModelMixin
  File "<frozen importlib._bootstrap>", line 1412, in _handle_fromlist
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/pipelines/pipeline_utils.py", line 47, in <module>
    from ..models import AutoencoderKL
  File "<frozen importlib._bootstrap>", line 1412, in _handle_fromlist
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.models.autoencoders.autoencoder_kl because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/bin/invokeai-web", line 10, in <module>
    sys.exit(run_app())
             ^^^^^^^^^
  File "/opt/invokeai/invokeai/app/run_app.py", line 35, in run_app
    from invokeai.app.invocations.baseinvocation import InvocationRegistry
  File "/opt/invokeai/invokeai/app/invocations/baseinvocation.py", line 41, in <module>
    from invokeai.app.services.shared.invocation_context import InvocationContext
  File "/opt/invokeai/invokeai/app/services/shared/invocation_context.py", line 18, in <module>
    from invokeai.app.services.model_records.model_records_base import UnknownModelException
  File "/opt/invokeai/invokeai/app/services/model_records/__init__.py", line 3, in <module>
    from .model_records_base import (  # noqa F401
  File "/opt/invokeai/invokeai/app/services/model_records/model_records_base.py", line 15, in <module>
    from invokeai.backend.model_manager.config import (
  File "/opt/invokeai/invokeai/backend/model_manager/__init__.py", line 3, in <module>
    from invokeai.backend.model_manager.config import (
  File "/opt/invokeai/invokeai/backend/model_manager/config.py", line 39, in <module>
    from invokeai.backend.model_manager.model_on_disk import ModelOnDisk
  File "/opt/invokeai/invokeai/backend/model_manager/model_on_disk.py", line 10, in <module>
    from invokeai.backend.model_manager.taxonomy import ModelRepoVariant
  File "/opt/invokeai/invokeai/backend/model_manager/taxonomy.py", line 14, in <module>
    ModelMixin, RawModel, torch.nn.Module, Dict[str, torch.Tensor], diffusers.DiffusionPipeline, ort.InferenceSession
                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 811, in __getattr__
    value = getattr(module, name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.pipelines.pipeline_utils because of the following error (look up to see its traceback):
Failed to import diffusers.models.autoencoders.autoencoder_kl because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way, though. I also rebased on main as a test to be sure, with the same result.

@heathen711 (Contributor Author)

> The image builds from this PR, but fails to start: (large traceback omitted)
> This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way, though. I also rebased on main as a test to be sure, with the same result.

Yup, updated the pins, uv.lock, and Dockerfile to ensure it's all in sync. Please give it another try.
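
A quick way to verify is to check inside the built image that torch and torchvision resolved from the same index (sketch; run inside the container):

  # Both versions should carry matching local suffixes, e.g. 2.7.1+rocm6.3 / 0.22.1+rocm6.3
  python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"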

@heathen711 heathen711 changed the title from fix(docker) rocm 6.2.4 based image to fix(docker) rocm 6.3 based image Jul 5, 2025

ebr commented Jul 7, 2025

OK, thank you - the image builds now, but it only works on CPU. I haven't been able to get it to use the HIP device, either with the amd runtime or without it, with the kfd/dri devices forwarded to the pod, and using either docker-compose or plain docker run. Confirmed that the CUDA image continues working as expected, though.

Interestingly, rocm-smi, amd-smi, and rocminfo all detect the GPU from inside the container, so the hardware is accessible. Pretty sure this has something to do with pytorch. I'm testing this on the Radeon W7900 Pro GPU, so it could also be a "me" problem because it's not common hardware (though I don't have issues with it outside of docker, or using other rocm containers). I'll play with it a bit more.

This PR also balloons the image size to 56GB uncompressed - we won't be able to build it in CI. I am still fairly confident we don't need the full ROCm in the image, but we can circle back to that.

As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate it in the future once we have a good working image.


heathen711 commented Jul 9, 2025

> OK, thank you - the image builds now, but it only works on CPU. I haven't been able to get it to use the HIP device, either with the amd runtime or without it, with the kfd/dri devices forwarded to the pod, and using either docker-compose or plain docker run. Confirmed that the CUDA image continues working as expected, though.
>
> Interestingly, rocm-smi, amd-smi, and rocminfo all detect the GPU from inside the container, so the hardware is accessible. Pretty sure this has something to do with pytorch. I'm testing this on the Radeon W7900 Pro GPU, so it could also be a "me" problem because it's not common hardware (though I don't have issues with it outside of docker, or using other rocm containers). I'll play with it a bit more.
>
> This PR also balloons the image size to 56GB uncompressed - we won't be able to build it in CI. I am still fairly confident we don't need the full ROCm in the image, but we can circle back to that.
>
> As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate it in the future once we have a good working image.

So I started looking at using the amd-container-toolkit. It was a pain to get installed into the LXC, but once I did, the container still failed. I started debugging and found:

Using these in the entrypoint script:

echo "Checking ROCM device availability as root..."
python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('Number of GPUs:', torch.cuda.device_count())"

echo "Checking ROCM device availability as ${USER}..."
exec gosu ${USER} python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('Number of GPUs:', torch.cuda.device_count())"

I get:

Attaching to invokeai-rocm-1
invokeai-rocm-1  | Checking ROCM device availability as root...
invokeai-rocm-1  | GPU available: True
invokeai-rocm-1  | Number of GPUs: 4
invokeai-rocm-1  | Checking ROCM device availability as ubuntu...
invokeai-rocm-1  | GPU available: False
invokeai-rocm-1  | Number of GPUs: 0

So something about gosu is messing it up, or a permission is missing somewhere, because only the ubuntu user can't see the GPUs. Thoughts?
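
One way to narrow this down is to compare the device ownership with the groups the switched user actually gets (diagnostic sketch, run inside the container):

  ls -l /dev/kfd /dev/dri/renderD*   # which group (and GID) owns the GPU device nodes
  getent group render video          # GIDs those groups have inside the container
  id ubuntu                          # groups the ubuntu user belongs to
  gosu ubuntu id                     # what the switched user actually sees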

Proof: I removed the gosu and just ran invokeai-web as root, and got:

Attaching to invokeai-rocm-1
invokeai-rocm-1  | bitsandbytes library load error: Configured CUDA binary not found at /opt/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm63.so
invokeai-rocm-1  | Traceback (most recent call last):
invokeai-rocm-1  |   File "/opt/venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 290, in <module>
invokeai-rocm-1  |     lib = get_native_library()
invokeai-rocm-1  |           ^^^^^^^^^^^^^^^^^^^^
invokeai-rocm-1  |   File "/opt/venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 270, in get_native_library
invokeai-rocm-1  |     raise RuntimeError(f"Configured CUDA binary not found at {cuda_binary_path}")
invokeai-rocm-1  | RuntimeError: Configured CUDA binary not found at /opt/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm63.so
invokeai-rocm-1  | [2025-07-09 06:25:57,821]::[InvokeAI]::INFO --> Using torch device: AMD Radeon Pro V620
invokeai-rocm-1  | [2025-07-09 06:25:57,822]::[InvokeAI]::INFO --> cuDNN version: 3003000
invokeai-rocm-1  | [2025-07-09 06:25:58,221]::[InvokeAI]::INFO --> Patchmatch initialized
invokeai-rocm-1  | [2025-07-09 06:25:59,919]::[InvokeAI]::INFO --> Loading node pack invoke_bria_rmbg
invokeai-rocm-1  | [2025-07-09 06:25:59,924]::[InvokeAI]::INFO --> Loaded 1 node pack from /invokeai/nodes: invoke_bria_rmbg
invokeai-rocm-1  | [2025-07-09 06:26:00,165]::[InvokeAI]::INFO --> InvokeAI version 6.0.0rc5
invokeai-rocm-1  | [2025-07-09 06:26:00,166]::[InvokeAI]::INFO --> Root directory = /invokeai
invokeai-rocm-1  | [2025-07-09 06:26:00,166]::[InvokeAI]::INFO --> Initializing database at /invokeai/databases/invokeai.db
invokeai-rocm-1  | [2025-07-09 06:26:00,204]::[ModelManagerService]::INFO --> [MODEL CACHE] Calculated model RAM cache size: 22512.00 MB. Heuristics applied: [1, 2].
invokeai-rocm-1  | [2025-07-09 06:26:00,599]::[InvokeAI]::INFO --> Invoke running on http://0.0.0.0:9090 (Press CTRL+C to quit)

@heathen711 (Contributor Author)

@ebr I figured it out: the render group GID inside the container does not match the render group GID on the host. This doesn't appear to be an issue with the full ROCm install; I bet they force it to a specific group number to keep things consistent. So I made it an env input and groupmod it in the entrypoint script. Give it a read and tell me if you can think of a better way to map this.
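
A minimal sketch of that entrypoint approach (the RENDER_GROUP_ID name is illustrative, not necessarily the variable used in this PR):

  # docker-entrypoint.sh sketch: align the container's render GID with the host's before dropping privileges
  if [ -n "${RENDER_GROUP_ID:-}" ]; then
    groupmod -o -g "$RENDER_GROUP_ID" render   # -o permits a non-unique GID
    usermod -aG render "${USER}"
  fi
  exec gosu "${USER}" invokeai-web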

@heathen711 (Contributor Author)

#7944 - @dsisco11 and I both made changes to the pyproject.toml and the uv index configuration... hopefully we don't collide...

Labels
CI-CD (Continuous integration / Continuous delivery), docker, python-deps (PRs that change python dependencies), Root

2 participants