[train] fix transformers example for multi-gpu #24832

Merged
5 commits merged into ray-project:master on Jun 9, 2022

Conversation

@matthewdeng (Contributor) commented May 16, 2022

Why are these changes needed?

Accelerate depends on the LOCAL_RANK environment variable being set in order to place each worker on the correct GPU device.

Source

self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
self.device = torch.device("cuda", self.local_process_index)
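
In a Ray Train worker this means LOCAL_RANK has to be exported before the Transformers Trainer (and therefore Accelerate) reads it. Below is a minimal sketch of that pattern, assuming the Ray Train function API (train.local_rank()); it illustrates the idea rather than reproducing the exact diff in this PR.

import os
from ray import train

def train_func(config):
    # Expose this worker's local rank so that Accelerate's device selection
    # (quoted above) maps the Trainer onto this worker's GPU instead of GPU 0.
    os.environ["LOCAL_RANK"] = str(train.local_rank())

    # The HuggingFace fine-tuning code from the example
    # (TrainingArguments, Trainer, trainer.train()) runs here unchanged.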

Verified that this allows multiple GPUs to be used:

Mon May 16 03:26:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   55C    P0   135W / 150W |   7006MiB /  7618MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   67C    P0   140W / 150W |   6434MiB /  7618MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10446      C   ...WorkerMixin._BaseWorkerMixin__execute()  6994MiB |
|    1     10447      C   ...WorkerMixin._BaseWorkerMixin__execute()  6422MiB |
+-----------------------------------------------------------------------------+
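
As a complementary, hypothetical sanity check (not part of this PR), each worker can also log the rank it sees together with the CUDA device index torch currently reports; a minimal sketch:

import os
import torch

def log_assigned_gpu():
    # Hypothetical helper: report the LOCAL_RANK visible to this worker and
    # the CUDA device index torch currently reports for the process. Once the
    # Trainer has set up its device, the two are expected to match.
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    print(f"LOCAL_RANK={local_rank}, torch.cuda.current_device()={torch.cuda.current_device()}")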

closes #23230

Related issue number

https://discuss.ray.io/t/ray-train-example-with-transformers/6128

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam (Contributor) left a comment:

Nice, thanks! Though it looks like there is a failing test.

@amogkam merged commit eff72f9 into ray-project:master on Jun 9, 2022
bushshrub pushed a commit to bushshrub/ray that referenced this pull request Jun 10, 2022
Accelerate depends on the LOCAL_RANK environment variable being set in order to place each worker on the correct GPU device.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Development

Successfully merging this pull request may close these issues.

[tune][Bug] ray + transformers example is not using GPUs correctly
3 participants