I searched the issues and found no similar issues.
Ray Component
Ray Train
Issue Severity
High: It blocks me from completing my task.
What happened + What you expected to happen
My machines:
node1: 4 GPUs (GTX 1080 Ti)
node2: 4 GPUs (GTX 1080 Ti)
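For this two-node cluster (4 GPUs per node), it helps to state the expected outcome explicitly. The sketch below is a hypothetical model, not Ray Train's scheduler: it assumes each worker gets exactly one GPU and that node1 is filled completely before any worker is placed on node2. The helper name expected_placement is made up for illustration.

```python
def expected_placement(num_workers, gpus_per_node=4, num_nodes=2):
    """Map each worker rank to a (node name, local GPU index) pair.

    Hypothetical model: one GPU per worker, and node1 is filled completely
    before any worker lands on node2. Ray's real scheduler may differ.
    """
    capacity = gpus_per_node * num_nodes
    if num_workers > capacity:
        raise ValueError(
            f"{num_workers} workers need {num_workers} GPUs, "
            f"but the cluster only has {capacity}"
        )
    placement = []
    for rank in range(num_workers):
        node = f"node{rank // gpus_per_node + 1}"  # node1, node2, ...
        local_gpu = rank % gpus_per_node  # GPU index as listed by nvidia-smi on that node
        placement.append((node, local_gpu))
    return placement


if __name__ == "__main__":
    for n in (4, 5):
        print(n, expected_placement(n))
```

Under this model, num_workers=4 would keep all four GPUs on node1 busy, and num_workers=5 would additionally use one GPU on node2. A single busy GPU matches neither case.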
Details:
When I run the Ray + Transformers example with num_workers=4, it runs, but nvidia-smi shows that only one GPU on node1 is actually doing work: the other three GPUs on node1 have memory allocated, yet their utilization stays at 0%.
When I set num_workers > 4 (for example, 5), the same thing happens. This is really frustrating; how am I supposed to use this example?
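The symptom described above (GPU memory allocated but 0% utilization) can be checked per node with nvidia-smi's CSV query mode. The following is a minimal sketch; the helper names and the 100 MiB threshold are assumptions chosen for illustration, and the sample string is synthetic input, not measured output from this cluster.

```python
import subprocess

# nvidia-smi query: GPU index, used memory (MiB), GPU utilization (%), no header/units
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def query_gpus():
    """Run nvidia-smi on the current node and return its CSV output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout


def find_idle_gpus(csv_text, min_memory_mib=100):
    """Return indices of GPUs that hold memory but report 0% utilization."""
    idle = []
    for line in csv_text.strip().splitlines():
        index, mem_used, util = (field.strip() for field in line.split(","))
        if int(mem_used) >= min_memory_mib and int(util) == 0:
            idle.append(int(index))
    return idle


if __name__ == "__main__":
    # Synthetic input shaped like the reported symptom (not real measurements)
    sample = "0, 9000, 97\n1, 1200, 0\n2, 1200, 0\n3, 1200, 0"
    print(find_idle_gpus(sample))  # [1, 2, 3]
    # On a real node, run: print(find_idle_gpus(query_gpus()))
```

Sampling once can catch a GPU between kernels, so running the check a few times during training gives a more reliable picture.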
hezhaozhao-git added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage) labels on Mar 16, 2022.
stephanie-wang changed the title from "[Bug] ray + transformers example only use a gpu" to "[tune][Bug] ray + transformers example is not using GPUs correctly" on Mar 18, 2022.
amogkam added the train (Ray Train Related Issue) and P1 (Issue that should be fixed within a few weeks) labels and removed the tune and triage labels on Apr 7, 2022.
Versions / Dependencies
ray: 1.11.0
python: 3.8.12
pytorch: 1.11.0+cu113
transformers: 4.17.0
accelerate: 0.5.1
Reproduction script
this example
Anything else
The problem keeps appearing.
Are you willing to submit a PR?