[tune][Bug] ray + transformers example is not using GPUs correctly #23230

Closed
1 of 2 tasks
hezhaozhao-git opened this issue Mar 16, 2022 · 1 comment · Fixed by #24832
Assignees
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks train Ray Train Related Issue

Comments

@hezhaozhao-git

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Train

Issue Severity

High: It blocks me from completing my task.

What happened + What you expected to happen

My machines:

node1: 4 GPUs (1080 Ti)
node2: 4 GPUs (1080 Ti)

Details:

When I run the ray + transformers example with num_workers=4, it runs, but nvidia-smi shows that only one GPU on node1 is actually being used. The other three GPUs on node1 have memory allocated, but their utilization is 0%.
When I set num_workers to more than 4 (for example, 5), the same thing happens. This is really frustrating. How am I supposed to use this example?
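To make the symptom easier to report, here is a minimal sketch that flags GPUs with memory allocated but 0% utilization. It is not part of the original example. The helper names (`parse_gpu_stats`, `idle_but_allocated`, `query_gpu_stats`) and the 100 MiB threshold are hypothetical. It assumes `nvidia-smi` is on the PATH and reports plain integers for every field (some setups print `[N/A]`, which this sketch does not handle).

```python
import subprocess
from typing import List, NamedTuple


class GpuStat(NamedTuple):
    index: int
    utilization_pct: int
    memory_used_mib: int


# nvidia-smi query that prints one CSV row per GPU: index, utilization, memory used.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> List[GpuStat]:
    """Parse rows such as '0, 97, 10240' into GpuStat records."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(GpuStat(int(index), int(util), int(mem)))
    return stats


def idle_but_allocated(stats: List[GpuStat], min_memory_mib: int = 100) -> List[GpuStat]:
    """Return GPUs that hold memory but show 0% utilization (threshold is an assumption)."""
    return [
        s for s in stats
        if s.utilization_pct == 0 and s.memory_used_mib >= min_memory_mib
    ]


def query_gpu_stats() -> List[GpuStat]:
    """Run nvidia-smi on the local node and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Running `idle_but_allocated(query_gpu_stats())` on each node while training is in progress lists the GPUs that match the behavior described above. A single sample can be misleading, so it is worth repeating the check a few times.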

Versions / Dependencies

ray: 1.11.0
python: 3.8.12
pytorch: 1.11.0+cu113
transformers: 4.17.0
accelerate: 0.5.1

Reproduction script

this example

Anything else

The problem occurs every time.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@hezhaozhao-git hezhaozhao-git added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 16, 2022
@amogkam
Contributor

amogkam commented Mar 16, 2022

cc @matthewdeng

@amogkam amogkam assigned matthewdeng and unassigned maxpumperla Mar 16, 2022
@stephanie-wang stephanie-wang changed the title [Bug] ray + transformers example only use a gpu [tune][Bug] ray + transformers example is not using GPUs correctly Mar 18, 2022
@stephanie-wang stephanie-wang added the tune Tune-related issues label Mar 18, 2022
@amogkam amogkam added train Ray Train Related Issue P1 Issue that should be fixed within a few weeks and removed tune Tune-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 7, 2022
5 participants