[Train] Trainer fails on multi-GPU cluster #30063
Labels
@author-action-required: The PR author is responsible for the next step. Remove tag to send back to the reviewer.
needs-repro-script: Issue needs a runnable script to be reproduced.
train: Ray Train related issue.
triage: Needs triage (e.g. priority, bug/not-bug, and owning component).
What happened + What you expected to happen
Trainer(backend='torch') fails with Ray 2.0 when running on multiple GPU nodes. We are running the Trainer on 3 A100 worker nodes, each with 4 GPUs. The run fails with:
(BackendExecutor pid=5631) RuntimeError: Expected tensor for 'out' to have the same device as tensor for argument #2 'mat1'; but device 1 does not equal 3 (while checking arguments for addmm)
Versions / Dependencies
Ray 2.0
Reproduction script
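No runnable script was attached. The sketch below is a minimal assumed setup based on the description above: it uses the Trainer(backend='torch') API named in the report, but the linear model, synthetic batch, and cluster address are placeholders, not the actual workload.

```python
import torch
import torch.nn as nn

import ray
from ray.train import Trainer
from ray.train.torch import get_device, prepare_model


def train_func():
    # Placeholder model; the real workload is not part of the report.
    # prepare_model() wraps the model in DDP and moves it to this worker's GPU.
    model = prepare_model(nn.Linear(32, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    device = get_device()

    for _ in range(10):
        # Synthetic batch placed on this worker's assigned device. The addmm
        # error above indicates the inputs and the model end up on different
        # CUDA devices on the multi-GPU nodes.
        x = torch.randn(8, 32, device=device)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


ray.init(address="auto")

# 3 A100 nodes x 4 GPUs = 12 workers, one GPU per worker.
trainer = Trainer(backend="torch", num_workers=12, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
```

Running with one worker per GPU across the 3 nodes (num_workers=12) is assumed here to exercise the multi-GPU-per-node path where the error appears.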
Issue Severity
High: It blocks me from completing my task.