ray.train.torch.prepare_data_loader #38115
I verified that the bug exists on 2.6.1, and probably on master as well. We didn't do a good job of testing the multiple-GPUs-per-worker case. The following needs to happen:
cc @matthewdeng for triage visibility. Marking this as P1 for now. @MMorente, would you like to contribute a PR to fix this? I think the fix can be as simple as
@xwjiang2010 can you update the docs to reflect the behavior?
@xwjiang2010 I have just created a pull request #38127 with the change.
@matthewdeng can you re-review and advise on next steps?
This will be fixed by #42314, cc @woshiyyya to double-check.
Yes, this will be fixed by this PR.
What happened + What you expected to happen
I have a local Ray cluster and I am using PyTorch DataLoaders. As described in the docs, I use ray.train.torch.prepare_data_loader to prepare the DataLoader for Ray. This works well when each worker uses only one GPU; however, if I specify more than one GPU per worker in the ScalingConfig, prepare_data_loader fails with the message:
My cluster has two nodes, each with 4 GPUs.
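For reference, here is a minimal sketch of the setup described above (not the reporter's actual script): a small synthetic DataLoader is wrapped with ray.train.torch.prepare_data_loader inside a TorchTrainer training loop, and two GPUs per worker are requested via ScalingConfig.resources_per_worker. The dataset, worker count, and import paths are assumptions based on a recent Ray 2.x API.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Small synthetic dataset so the sketch is self-contained.
    dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))
    data_loader = DataLoader(dataset, batch_size=16)

    # This is the call that reportedly fails when a worker is assigned
    # more than one GPU: it adds a DistributedSampler and moves batches
    # to the worker's device.
    data_loader = ray.train.torch.prepare_data_loader(data_loader)

    for _batch in data_loader:
        pass


# Cluster shape from the report: two nodes with 4 GPUs each, and more than
# one GPU requested per worker (here 2 GPUs x 4 workers = 8 GPUs total).
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=4,
        use_gpu=True,
        resources_per_worker={"GPU": 2},
    ),
)
trainer.fit()
```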
Versions / Dependencies
Reproduction script
Issue Severity
High: It blocks me from completing my task.