
fix(docker) rocm 6.3 based image #8152


Open
wants to merge 13 commits into main

Conversation

heathen711 (Contributor)

Summary

  1. Fix the run script to properly read GPU_DRIVER (see the sketch after this list)
  2. Clone and adjust the Dockerfile for a ROCm-based build
  3. Adjust docker-compose.yml to use the cloned ROCm Dockerfile
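
For context, a minimal sketch of the kind of GPU_DRIVER handling item 1 refers to (assumed file layout, variable names, and compose profiles; not the exact contents of this PR):

  # run.sh sketch: read GPU_DRIVER from the .env file and start the matching compose profile
  GPU_DRIVER=$(grep -E '^GPU_DRIVER=' .env | cut -d= -f2)
  GPU_DRIVER=${GPU_DRIVER:-cuda}          # default to cuda when unset
  case "$GPU_DRIVER" in
    cuda|rocm|cpu) ;;
    *) echo "Unknown GPU_DRIVER: $GPU_DRIVER" >&2; exit 1 ;;
  esac
  docker compose --profile "$GPU_DRIVER" up --build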

QA Instructions

Merge Plan

  1. Talk with devs for speed improvements to the docker build
  2. Investigate if this can be conditionalized into the original dockerbuild (this has issues, as the uv.lock only supports cuda/cpu environments)
  3. Test the build in production pipeline

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@heathen711 heathen711 changed the title from fix(docker) rocm 2.4.6 based image to fix(docker) rocm 6.2.4 based image Jul 3, 2025
@heathen711 heathen711 marked this pull request as ready for review July 3, 2025 06:03
Comment on lines 81 to 90
uv sync --frozen
uv venv --python 3.12 && \
# Use the public version to install existing known dependencies but using the UV_INDEX, not the hardcoded URLs within the uv.lock
uv pip install invokeai
Contributor Author

I could conditionalize this logic to use the uv.lock for CUDA and the UV_INDEX for CPU and ROCm, which would reduce the risk of this change, but I went with this approach for consistency.
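
For illustration, a rough sketch of what that conditional could look like inside the Dockerfile RUN step (a sketch under assumed build args; not the actual diff):

  # Assumes GPU_DRIVER and UV_INDEX are provided as build args
  if [ "$GPU_DRIVER" = "cuda" ]; then
    uv sync --frozen                                      # CUDA: keep the locked, reproducible install
  else
    uv venv --python 3.12 && \
    uv pip install --index-url "$UV_INDEX" invokeai       # CPU/ROCm: resolve against the matching PyTorch index
  fi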

Member

It would be preferable to continue using uv.lock for the CUDA images, if possible, to keep it consistent with the installations produced by the official installer.

Ideally - if you're willing to work on this - we should find a way to support both cuda and rocm dependencies in a single uv.lock/pyproject.toml, perhaps by leveraging the uv dependency groups: https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies
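
For example (hedged: the group names and any [tool.uv] conflicts configuration are hypothetical, not something already in pyproject.toml), each image could then select its group at install time:

  # Assuming pyproject.toml declares conflicting dependency groups "cuda" and "rocm",
  # each pulling torch/torchvision from its own index:
  uv sync --frozen --group cuda   # CUDA image
  uv sync --frozen --group rocm   # ROCm image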

Contributor Author

Updated the uv.lock; there are some notes about things in the pyproject.toml that I would like your input on.

@ebr ebr (Member) left a comment

Thanks for the contribution - left some comments to address

Comment on lines 95 to 102
wget -O /tmp/amdgpu-install.deb \
https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb && \
apt install -y /tmp/amdgpu-install.deb && \
apt update && \
amdgpu-install --usecase=rocm -y && \
apt-get autoclean && \
apt clean && \
rm -rf /tmp/* /var/tmp/* && \
Member

This is likely unnecessary. The GPU driver should be provided by the kernel, and ROCm itself is usually not needed in the image because it's already bundled with pytorch. That is, unless something changed in the most recent torch/rocm that makes this a requirement.

(To be clear, the video/render group additions for the ubuntu user are needed and should be kept.)
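
For reference, those group additions amount to something like this in the Dockerfile RUN step (sketch, assuming the runtime user is ubuntu):

  getent group render > /dev/null || groupadd render   # render group may not exist in the base image
  usermod -aG video,render ubuntu                       # let the container user access the GPU device nodes owned by video/render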

Contributor Author

Skipped the ROCm install but kept the groups, and got:
invokeai-rocm-1 | RuntimeError: No HIP GPUs are available
But there are 4 AMD GPUs on my system, so it's failing.

I went and looked at the rocm-pytorch Docker image; they install the full rocm-dev, so I limited it to just the ROCm binaries (I also tried HIP alone, but that still errored).

Suggestions?

Member

To be sure - are you using the amd-container-toolkit and the amd runtime for docker?

Contributor Author

No, and that's my goal: I don't want to have to modify the host, and I want the container to have everything it needs. I'm running a Proxmox host with a Docker LXC.
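
For context, plain device forwarding (no AMD container runtime on the host beyond the kernel driver) typically looks like this (illustrative sketch; the image tag is a placeholder):

  docker run --rm \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --group-add "$(getent group render | cut -d: -f3)" \
    <invokeai-rocm-image>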

Contributor Author

If this isn't ideal, I can split that logic out into my own build and have this one build the minimal way, or make it another config - rocm-standalone?

elif [ "$GPU_DRIVER" = "rocm" ]; then UV_INDEX="https://download.pytorch.org/whl/rocm6.2"; \
# Cannot use the uv.lock as that is locked to CUDA version packages, which breaks rocm...
# --mount=type=bind,source=uv.lock,target=uv.lock \
ulimit -n 30000 && \
Member

This ulimit doesn't affect much; I'm wondering what the reason for it is here, and why the value of 30000?

Contributor Author

CUDA and CPU don't hit the limit, but with ROCm the build fails because too many files are being opened. I can try to lower the limit if it concerns you; I just set it to something high and was able to continue, so I never went back.
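
For context, a minimal way to scope the raised limit to just that step (30000 is the arbitrary headroom from the diff; assumes the builder's hard limit allows it):

  ulimit -n                                                              # show the current soft limit on open files
  ulimit -Sn 30000 && uv pip install --index-url "$UV_INDEX" invokeai    # raise it only for this RUN step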

Member

It doesn't matter much since it only applies during build; it's just really weird that this is needed at all.

@github-actions bot added the Root and python-deps (PRs that change python dependencies) labels Jul 3, 2025
@heathen711 heathen711 requested a review from ebr July 3, 2025 20:09

heathen711 commented Jul 3, 2025

  Downloaded pytorch-triton-rocm
  × Failed to download `torch==2.7.1+rocm6.3`
  ├─▶ Failed to extract archive
  ╰─▶ failed to write to file
      `/home/runner/work/_temp/setup-uv-cache/.tmpOmavep/torch/lib/hipblaslt/library/TensileLibrary_HH_SH_A_Bias_SAV_Type_HS_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx90a.co`:
      No space left on device (os error 28)
  help: `torch` (v2.7.1+rocm6.3) was included because `invokeai` depends on
        `torch`

Downloading torch (4.2 GiB) is probably the culprit... I just don't understand why it's downloading the ROCm stuff; the default is not ROCm...

@heathen711 heathen711 requested a review from jazzhaiku as a code owner July 3, 2025 21:22
@github-actions bot added the CI-CD (Continuous integration / Continuous delivery) label Jul 3, 2025

ebr commented Jul 4, 2025

The image builds from this PR, but fails to start:

Full traceback:
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2154, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2184, in _get_module
    raise e
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2182, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 27, in <module>
    from ...image_processing_utils import ImageProcessingMixin
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_processing_utils.py", line 22, in <module>
    from .image_transforms import center_crop, normalize, rescale
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_transforms.py", line 22, in <module>
    from .image_utils import (
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_utils.py", line 59, in <module>
    from torchvision.transforms import InterpolationMode
  File "/opt/venv/lib/python3.12/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torchvision/_meta_registrations.py", line 163, in <module>
    @torch.library.register_fake("torchvision::nms")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/library.py", line 1023, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/opt/venv/lib/python3.12/site-packages/torch/library.py", line 214, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator torchvision::nms does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/loaders/single_file_model.py", line 26, in <module>
    from .single_file_utils import (
  File "/opt/venv/lib/python3.12/site-packages/diffusers/loaders/single_file_utils.py", line 52, in <module>
    from transformers import AutoImageProcessor
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2157, in __getattr__
    raise ModuleNotFoundError(
ModuleNotFoundError: Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/__init__.py", line 1, in <module>
    from .autoencoder_asym_kl import AsymmetricAutoencoderKL
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_asym_kl.py", line 23, in <module>
    from .vae import DecoderOutput, DiagonalGaussianDistribution, Encoder, MaskConditionDecoder
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/vae.py", line 25, in <module>
    from ..unets.unet_2d_blocks import (
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/__init__.py", line 6, in <module>
    from .unet_2d import UNet2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d.py", line 24, in <module>
    from .unet_2d_blocks import UNetMidBlock2D, get_down_block, get_up_block
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 36, in <module>
    from ..transformers.dual_transformer_2d import DualTransformer2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/transformers/__init__.py", line 5, in <module>
    from .auraflow_transformer_2d import AuraFlowTransformer2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/transformers/auraflow_transformer_2d.py", line 23, in <module>
    from ...loaders import FromOriginalModelMixin
  File "<frozen importlib._bootstrap>", line 1412, in _handle_fromlist
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/pipelines/pipeline_utils.py", line 47, in <module>
    from ..models import AutoencoderKL
  File "<frozen importlib._bootstrap>", line 1412, in _handle_fromlist
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.models.autoencoders.autoencoder_kl because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/bin/invokeai-web", line 10, in <module>
    sys.exit(run_app())
             ^^^^^^^^^
  File "/opt/invokeai/invokeai/app/run_app.py", line 35, in run_app
    from invokeai.app.invocations.baseinvocation import InvocationRegistry
  File "/opt/invokeai/invokeai/app/invocations/baseinvocation.py", line 41, in <module>
    from invokeai.app.services.shared.invocation_context import InvocationContext
  File "/opt/invokeai/invokeai/app/services/shared/invocation_context.py", line 18, in <module>
    from invokeai.app.services.model_records.model_records_base import UnknownModelException
  File "/opt/invokeai/invokeai/app/services/model_records/__init__.py", line 3, in <module>
    from .model_records_base import (  # noqa F401
  File "/opt/invokeai/invokeai/app/services/model_records/model_records_base.py", line 15, in <module>
    from invokeai.backend.model_manager.config import (
  File "/opt/invokeai/invokeai/backend/model_manager/__init__.py", line 3, in <module>
    from invokeai.backend.model_manager.config import (
  File "/opt/invokeai/invokeai/backend/model_manager/config.py", line 39, in <module>
    from invokeai.backend.model_manager.model_on_disk import ModelOnDisk
  File "/opt/invokeai/invokeai/backend/model_manager/model_on_disk.py", line 10, in <module>
    from invokeai.backend.model_manager.taxonomy import ModelRepoVariant
  File "/opt/invokeai/invokeai/backend/model_manager/taxonomy.py", line 14, in <module>
    ModelMixin, RawModel, torch.nn.Module, Dict[str, torch.Tensor], diffusers.DiffusionPipeline, ort.InferenceSession
                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 811, in __getattr__
    value = getattr(module, name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.pipelines.pipeline_utils because of the following error (look up to see its traceback):
Failed to import diffusers.models.autoencoders.autoencoder_kl because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way, though. I also rebased on main as a test to be sure, with the same result.

@heathen711 (Contributor Author)

> The image builds from this PR, but fails to start: (large traceback omitted)
> This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way, though. I also rebased on main as a test to be sure, with the same result.

Yup, updated the pins, uv.lock, and Dockerfile to ensure it's all in sync. Please give it another try.
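
A quick way to verify is to check inside the built image that torch and torchvision resolved from the same index (sketch; run inside the container):

  # Both versions should carry matching local suffixes, e.g. 2.7.1+rocm6.3 / 0.22.1+rocm6.3
  python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"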

@heathen711 heathen711 changed the title from fix(docker) rocm 6.2.4 based image to fix(docker) rocm 6.3 based image Jul 5, 2025

ebr commented Jul 7, 2025

OK, thank you - the image builds now, but it only works on CPU. I haven't been able to get it to use the HIP device, either with the amd runtime or without it, with the kfd/dri devices forwarded to the pod, and using either docker-compose or plain docker run. Confirmed that the CUDA image continues working as expected, though.

Interestingly, rocm-smi, amd-smi, and rocminfo all detect the GPU from inside the container, so the hardware is accessible. Pretty sure this has something to do with pytorch. I'm testing this on the Radeon W7900 Pro GPU, so it could also be a "me" problem because it's not common hardware (though I don't have issues with it outside of docker, or using other rocm containers). I'll play with it a bit more.

This PR also balloons the image size to 56GB uncompressed - we won't be able to build it in CI. I am still fairly confident we don't need the full ROCm in the image, but we can circle back to that.

As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate it in the future once we have a good working image.


heathen711 commented Jul 9, 2025

> OK, thank you - the image builds now, but it only works on CPU. I haven't been able to get it to use the HIP device, either with the amd runtime or without it, with the kfd/dri devices forwarded to the pod, and using either docker-compose or plain docker run. Confirmed that the CUDA image continues working as expected, though.
>
> Interestingly, rocm-smi, amd-smi, and rocminfo all detect the GPU from inside the container, so the hardware is accessible. Pretty sure this has something to do with pytorch. I'm testing this on the Radeon W7900 Pro GPU, so it could also be a "me" problem because it's not common hardware (though I don't have issues with it outside of docker, or using other rocm containers). I'll play with it a bit more.
>
> This PR also balloons the image size to 56GB uncompressed - we won't be able to build it in CI. I am still fairly confident we don't need the full ROCm in the image, but we can circle back to that.
>
> As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate it in the future once we have a good working image.

So I started looking at using the amd-container-toolkit. It was a pain to get installed into the LXC, but once I did, the container still failed. I started debugging and found:

Using these in the entrypoint script:

echo "Checking ROCM device availability as root..."
python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('Number of GPUs:', torch.cuda.device_count())"

echo "Checking ROCM device availability as ${USER}..."
exec gosu ${USER} python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('Number of GPUs:', torch.cuda.device_count())"

I get:

Attaching to invokeai-rocm-1
invokeai-rocm-1  | Checking ROCM device availability as root...
invokeai-rocm-1  | GPU available: True
invokeai-rocm-1  | Number of GPUs: 4
invokeai-rocm-1  | Checking ROCM device availability as ubuntu...
invokeai-rocm-1  | GPU available: False
invokeai-rocm-1  | Number of GPUs: 0

So something about gosu is messing it up, or a permission is missing somewhere, because only the ubuntu user can't see the GPUs. Thoughts?
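
One way to narrow this down is to compare the device ownership with the groups the switched user actually gets (diagnostic sketch, run inside the container):

  ls -l /dev/kfd /dev/dri/renderD*   # which group (and GID) owns the GPU device nodes
  getent group render video          # GIDs those groups have inside the container
  id ubuntu                          # groups the ubuntu user belongs to
  gosu ubuntu id                     # what the switched user actually sees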

Proof: I removed the gosu and just ran invokeai-web as root, and got:

Attaching to invokeai-rocm-1
invokeai-rocm-1  | bitsandbytes library load error: Configured CUDA binary not found at /opt/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm63.so
invokeai-rocm-1  | Traceback (most recent call last):
invokeai-rocm-1  |   File "/opt/venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 290, in <module>
invokeai-rocm-1  |     lib = get_native_library()
invokeai-rocm-1  |           ^^^^^^^^^^^^^^^^^^^^
invokeai-rocm-1  |   File "/opt/venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 270, in get_native_library
invokeai-rocm-1  |     raise RuntimeError(f"Configured CUDA binary not found at {cuda_binary_path}")
invokeai-rocm-1  | RuntimeError: Configured CUDA binary not found at /opt/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm63.so
invokeai-rocm-1  | [2025-07-09 06:25:57,821]::[InvokeAI]::INFO --> Using torch device: AMD Radeon Pro V620
invokeai-rocm-1  | [2025-07-09 06:25:57,822]::[InvokeAI]::INFO --> cuDNN version: 3003000
invokeai-rocm-1  | [2025-07-09 06:25:58,221]::[InvokeAI]::INFO --> Patchmatch initialized
invokeai-rocm-1  | [2025-07-09 06:25:59,919]::[InvokeAI]::INFO --> Loading node pack invoke_bria_rmbg
invokeai-rocm-1  | [2025-07-09 06:25:59,924]::[InvokeAI]::INFO --> Loaded 1 node pack from /invokeai/nodes: invoke_bria_rmbg
invokeai-rocm-1  | [2025-07-09 06:26:00,165]::[InvokeAI]::INFO --> InvokeAI version 6.0.0rc5
invokeai-rocm-1  | [2025-07-09 06:26:00,166]::[InvokeAI]::INFO --> Root directory = /invokeai
invokeai-rocm-1  | [2025-07-09 06:26:00,166]::[InvokeAI]::INFO --> Initializing database at /invokeai/databases/invokeai.db
invokeai-rocm-1  | [2025-07-09 06:26:00,204]::[ModelManagerService]::INFO --> [MODEL CACHE] Calculated model RAM cache size: 22512.00 MB. Heuristics applied: [1, 2].
invokeai-rocm-1  | [2025-07-09 06:26:00,599]::[InvokeAI]::INFO --> Invoke running on http://0.0.0.0:9090 (Press CTRL+C to quit)

@heathen711 (Contributor Author)

@ebr I figured it out: the render group GID inside the container does not match the render group GID on the host. This doesn't appear to be an issue with the full ROCm install; I bet they force it to a specific group number to keep things consistent. So I made it an env input and groupmod it in the entrypoint script. Give it a read and tell me if you can think of a better way to map this.
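
A minimal sketch of that entrypoint approach (the RENDER_GROUP_ID name is illustrative, not necessarily the variable used in this PR):

  # docker-entrypoint.sh sketch: align the container's render GID with the host's before dropping privileges
  if [ -n "${RENDER_GROUP_ID:-}" ]; then
    groupmod -o -g "$RENDER_GROUP_ID" render   # -o permits a non-unique GID
    usermod -aG render "${USER}"
  fi
  exec gosu "${USER}" invokeai-web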

@heathen711 (Contributor Author)

#7944 - @dsisco11 and I both made changes to the pyproject.toml and the uv index configuration... hopefully we don't collide...

Labels
CI-CD (Continuous integration / Continuous delivery), docker, python-deps (PRs that change python dependencies), Root

2 participants