[train] fix transformers example for multi-gpu #24832

Merged
5 commits merged into ray-project:master on Jun 9, 2022

Conversation

@matthewdeng (Contributor) commented May 16, 2022

Why are these changes needed?

Accelerate depends on the LOCAL_RANK environment variable being set in order to place each worker on the correct GPU device.

Source

self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
self.device = torch.device("cuda", self.local_process_index)
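
In a Ray Train worker this means LOCAL_RANK has to be exported before the Transformers Trainer (and therefore Accelerate) reads it. Below is a minimal sketch of that pattern, assuming the Ray Train function API (train.local_rank()); it illustrates the idea rather than reproducing the exact diff in this PR.

import os
from ray import train

def train_func(config):
    # Expose this worker's local rank so that Accelerate's device selection
    # (quoted above) maps the Trainer onto this worker's GPU instead of GPU 0.
    os.environ["LOCAL_RANK"] = str(train.local_rank())

    # The HuggingFace fine-tuning code from the example
    # (TrainingArguments, Trainer, trainer.train()) runs here unchanged.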

Verified that this allows multiple GPUs to be used:

Mon May 16 03:26:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   55C    P0   135W / 150W |   7006MiB /  7618MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   67C    P0   140W / 150W |   6434MiB /  7618MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10446      C   ...WorkerMixin._BaseWorkerMixin__execute()  6994MiB |
|    1     10447      C   ...WorkerMixin._BaseWorkerMixin__execute()  6422MiB |
+-----------------------------------------------------------------------------+
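
As a complementary, hypothetical sanity check (not part of this PR), each worker can also log the rank it sees together with the CUDA device index torch currently reports; a minimal sketch:

import os
import torch

def log_assigned_gpu():
    # Hypothetical helper: report the LOCAL_RANK visible to this worker and
    # the CUDA device index torch currently reports for the process. Once the
    # Trainer has set up its device, the two are expected to match.
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    print(f"LOCAL_RANK={local_rank}, torch.cuda.current_device()={torch.cuda.current_device()}")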

closes #23230

Related issue number

https://discuss.ray.io/t/ray-train-example-with-transformers/6128

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam (Contributor) left a comment:

Nice, thanks! Though it looks like there is a failing test.

@amogkam merged commit eff72f9 into ray-project:master on Jun 9, 2022
bushshrub pushed a commit to bushshrub/ray that referenced this pull request Jun 10, 2022
Accelerate depends on the LOCAL_RANK environment variable being set in order to place each worker on the correct GPU device.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Development

Successfully merging this pull request may close these issues.

[tune][Bug] ray + transformers example is not using GPUs correctly
3 participants