[Bug] NVIDIA eGPU RPC Timeout / KeyError 63 on macOS (mini M4 Pro) via TBT5/USB4 #15843

@OpenWrt01

Describe the bug

I am experiencing hardware initialization failures when using NVIDIA GPUs (RTX 3060, 4080, 4090, 5090) on a Mac mini (M4 Pro, macOS 26.x) through a UGreen Thunderbolt 5 enclosure; an ASM2464-based USB4 enclosure fares better. An AMD GPU (RX 7900 XTX) works flawlessly via macOS native drivers, but with the NVIDIA cards tinygrad fails during the NV driver handshake.

Ref: #15652

Hardware Environment:

Host: Mac mini (Apple M4 Pro), macOS 26.x

GPUs tested: RTX 3060, 4080, 4090, 5090 and AMD RX 7900 XTX (the eGPU appears in the System Information device list)

Enclosures:
- UGreen LinkStation (Thunderbolt 5) + RTX 3060, 4080, 4090, 5090 -> Fails
- UGreen LinkStation (Thunderbolt 5) + AMD RX 7900 XTX -> Pass
- UGreen LinkStation ASM2464 (USB4) + RTX 3060, 4090 and AMD RX 7900 XTX -> Pass (works intermittently, but better than over TBT)

Error 1: Architecture Recognition (KeyError)

In nvdev.py, the architecture is read as 0x3F instead of 0x19 (Ada) or 0x17 (Ampere).

  File "nvdev.py", line 113, in _early_ip_init
    self.chip_name = {0x17: "GA1", 0x19: "AD1", 0x1b: "GB2"}[self.chip_details['architecture']]
KeyError: 63  # 0x3F
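
A guarded lookup might make this failure easier to diagnose. The sketch below is not tinygrad's code: `decode_chip_name`, `ARCH_NAMES`, and the default `field_bits=6` are hypothetical names and assumptions used only for illustration. It relies on the general PCIe behavior that a read from a device that does not respond typically returns all ones, so an architecture value with every bit set (0x3F for a 6-bit field) may mean the register read itself failed rather than that the GPU generation is unknown.

```python
# Hypothetical helper, not part of tinygrad: guard the architecture lookup.
# The table is copied from the traceback above.
ARCH_NAMES = {0x17: "GA1", 0x19: "AD1", 0x1b: "GB2"}

def decode_chip_name(architecture: int, field_bits: int = 6) -> str:
    """Map an architecture ID to a chip-name prefix with clearer errors.

    field_bits is an ASSUMPTION about the width of the architecture field.
    """
    all_ones = (1 << field_bits) - 1
    if architecture == all_ones:
        # Every bit set: the register read may have failed (link down,
        # device still held in reset), so this is probably not a real ID.
        raise RuntimeError(
            f"architecture field reads all ones ({architecture:#x}); "
            "the MMIO read may have failed"
        )
    if architecture not in ARCH_NAMES:
        raise RuntimeError(f"unsupported architecture {architecture:#x}")
    return ARCH_NAMES[architecture]
```

Under this reading, mapping 0x3F to "GA1" would only hide the symptom: if the device is not answering register reads, later steps such as the reset wait would be expected to fail as well.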

Error 2: Reset Timeout

After manually patching around the KeyError (adding 0x3F: "GA1" to the lookup dict used for self.chip_name), the driver hangs in wait_for_reset:

TimeoutError: waiting for reset. Timed out after 10000 ms, condition not met: False != True
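
To tell a reset that is merely slow over the tunnel from one that never completes, the timeout could be raised experimentally. The sketch below is a standalone polling helper written for illustration; `poll_until` and its parameters are hypothetical and only mirror the shape of the error message above, not tinygrad's actual implementation.

```python
import time
from typing import Callable

def poll_until(cond: Callable[[], bool], timeout_ms: float = 10000,
               interval_ms: float = 1.0, msg: str = "") -> float:
    """Poll cond() until it returns True.

    Returns the elapsed time in milliseconds, or raises TimeoutError once
    timeout_ms has passed without cond() becoming true.
    """
    start = time.perf_counter()
    while True:
        if cond():
            return (time.perf_counter() - start) * 1000
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms >= timeout_ms:
            raise TimeoutError(f"{msg}. Timed out after {timeout_ms} ms")
        time.sleep(interval_ms / 1000)
```

Wrapping the reset check in a call such as `poll_until(reset_done, timeout_ms=60000, msg="waiting for reset")`, where `reset_done` stands in for whatever condition the driver checks, would show how long the reset actually takes. If the condition never becomes true, a longer timeout only delays the same failure.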

Analysis:

It seems the Thunderbolt 5 controllers are interfering with either the low-level PCIe PERST# reset signal or the PCIe atomic operations that GSP-RM initialization requires when driven from user space. The ASM2464's transparent PCIe tunneling seems more compatible with tinygrad's bare-metal approach than Intel's Thunderbolt implementation.

To Reproduce:

DEV=NV python3 -m tinygrad.llm --benchmark

Additional Context:

I can provide more logs or test on various hardware if needed. Is there a plan to optimize the RPC timeout or the GSP initialization sequence for high-bandwidth/high-latency Thunderbolt 5 tunnels?
