
[air/train] Multi GPU training occupies more memory on first GPU #26707

Closed
krfricke opened this issue Jul 19, 2022 · 9 comments · Fixed by #26819
Labels
bug Something that is supposed to be working; but isn't P0 Issues that should be fixed in short order

Comments

@krfricke
Contributor

krfricke commented Jul 19, 2022

What happened + What you expected to happen

This is a 4-node setup with 4 GPUs per node. I'm running a GPU benchmark script that compares training with Ray AIR against vanilla PyTorch.

When training with Ray AIR, GPU 0 on the head node occupies more memory than expected: every worker on the node appears to instantiate its model on GPU 0 in addition to its assigned GPU. This seems to happen during setup.

This is the output of nvidia-smi for Ray AIR, which shows that GPU 0 is used by multiple processes:

(base) ray@ip-172-31-73-133:~/oss-release-tests$ nvidia-smi
Tue Jul 19 02:13:09 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   42C    P0    26W /  70W |   5148MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   41C    P0    27W /  70W |   1047MiB / 15360MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   42C    P0    26W /  70W |   1031MiB / 15360MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    27W /  70W |   1071MiB / 15360MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9077      C                                    1303MiB |
|    0   N/A  N/A      9078      C                                    1281MiB |
|    0   N/A  N/A      9079      C                                    1281MiB |
|    0   N/A  N/A      9080      C                                    1281MiB |
|    1   N/A  N/A      9079      C                                    1077MiB |
|    2   N/A  N/A      9078      C                                    1043MiB |
|    3   N/A  N/A      9080      C                                    1071MiB |
+-----------------------------------------------------------------------------+

For comparison, this is the GPU usage when running the vanilla training script (only one process per GPU):

(base) ray@ip-172-31-73-133:~/oss-release-tests$ nvidia-smi
Tue Jul 19 02:15:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   42C    P0    35W /  70W |   1325MiB / 15360MiB |     40%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   42C    P0    36W /  70W |   1325MiB / 15360MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   43C    P0    36W /  70W |   1325MiB / 15360MiB |     45%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    36W /  70W |   1325MiB / 15360MiB |     46%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     18473      C                                    1323MiB |
|    1   N/A  N/A     18474      C                                    1323MiB |
|    2   N/A  N/A     18475      C                                    1323MiB |
|    3   N/A  N/A     18476      C                                    1323MiB |
+-----------------------------------------------------------------------------+

When restarting with Ray AIR and monitoring the GPU usage, this comes up:

Tue Jul 19 02:18:43 2022       
...
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32150      C                                     311MiB |
|    0   N/A  N/A     32151      C                                     309MiB |
|    0   N/A  N/A     32152      C                                     311MiB |
|    0   N/A  N/A     32153      C                                     331MiB |
+-----------------------------------------------------------------------------+

A few seconds later:

Tue Jul 19 02:18:49 2022       
...
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32150      C                                    1303MiB |
|    0   N/A  N/A     32151      C                                    1281MiB |
|    0   N/A  N/A     32152      C                                    1281MiB |
|    0   N/A  N/A     32153      C                                    1281MiB |
|    1   N/A  N/A     32151      C                                     769MiB |
|    2   N/A  N/A     32153      C                                     793MiB |
|    3   N/A  N/A     32152      C                                     751MiB |
+-----------------------------------------------------------------------------+

It thus seems to be a setup issue.
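
A minimal sketch of the likely mechanism (illustrative only, not taken from the benchmark script): PyTorch's default CUDA device is cuda:0, so a worker that never pins its device puts any bare .cuda() allocation, and the CUDA context that comes with it, on GPU 0.

```python
# Illustrative only (assumed setup, not the benchmark script): a worker assigned
# GPU 2 that never calls torch.cuda.set_device still allocates on cuda:0 whenever
# it calls .cuda() without an explicit device index.
import torch
import torch.nn as nn

assigned_gpu = 2  # e.g. the GPU this worker was given

model_a = nn.Linear(10, 10).cuda()               # lands on cuda:0 (default device)
model_b = nn.Linear(10, 10).cuda(assigned_gpu)   # lands on cuda:2 as intended

torch.cuda.set_device(assigned_gpu)              # pin the default device first...
model_c = nn.Linear(10, 10).cuda()               # ...and bare .cuda() now uses cuda:2
```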

Versions / Dependencies

Latest master

Reproduction script

Run the air_benchmark_torch_mnist_gpu_4x4 release test (preferably manually on Anyscale).

Issue Severity

High: It blocks me from completing my task.

@krfricke krfricke added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 19, 2022
@krfricke krfricke added the air label Jul 19, 2022
@amogkam
Contributor

amogkam commented Jul 19, 2022

@krfricke can you try adding a torch.cuda.set_device(ray.train.torch.get_device()) call at the beginning of the training function?
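
A sketch of where that call would go, assuming the Ray AIR TorchTrainer API of the time (TorchTrainer, ScalingConfig, prepare_model); the model and worker count below are placeholders, not the benchmark's.

```python
import torch
import torch.nn as nn
import ray.train.torch
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Pin this worker's default CUDA device before creating any model or tensor,
    # so bare .cuda() / .to("cuda") calls don't fall back to GPU 0.
    device = ray.train.torch.get_device()
    torch.cuda.set_device(device)

    model = ray.train.torch.prepare_model(nn.Linear(28 * 28, 10))
    # ... build data loaders and run the usual training loop here ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
trainer.fit()
```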

@matthewdeng
Contributor

@JiahaoYao FYI, this seems related to the issue you saw with Horovod(?)

@krfricke
Contributor Author

@amogkam confirmed, this removes the double memory allocation. On my first try the run hung indefinitely, though; on the second run it worked. Now waiting for the full benchmark to pass to confirm.

Can we do this automatically in the torch trainer?

@amogkam
Contributor

amogkam commented Jul 19, 2022

Nice! Yes, we can do this automatically; I'll make a quick PR for this.
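
For context, a hypothetical sketch of what doing it automatically could look like; the wrapper name is invented here, and this is not necessarily how #26819 implements it.

```python
# Hypothetical sketch (invented wrapper, not the actual change in #26819):
# pin the device once per worker before invoking the user's training function.
import torch
import ray.train.torch

def run_train_func_on_worker(train_func, config):
    device = ray.train.torch.get_device()
    if device.type == "cuda":
        torch.cuda.set_device(device)  # user code then inherits the right default GPU
    return train_func(config)
```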

@krfricke
Contributor Author

With this change, training now takes much longer:

Before:

torch_mnist_ray_time_s_mean = 163.79045822200007
torch_mnist_vanilla_time_s_mean = 152.50713274733334

After:


[Run 1/3] Finished Ray training (120 epochs) in 1514.23 seconds. Observed loss = 1.7087
[Run 1/3] Finished vanilla training (120 epochs) in 389.51 seconds. Observed loss = 1.7622
[Run 1/3] Observed results:  {'torch_mnist_ray_time_s': 1514.2345556380003, 'torch_mnist_ray_loss': 1.7086539268493652, 'torch_mnist_vanilla_time_s': 389.5140381809997, 'torch_mnist_vanilla_loss': 1.7622368931770325}

GPU utilization is also at 100% during the whole training run, while it was at ~50% before (and still is for vanilla training). I also believe the vanilla training run is mostly slowed down by leftover utilization of the GPUs from the Ray run (but I'm not sure about this).

Anyway, removing the line speeds everything up again.

Does setting the device somehow interfere with the batch size setting?

@krfricke
Contributor Author

krfricke commented Jul 19, 2022

After removing the line (same cluster):

[Run 1/1] Finished Ray training (120 epochs) in 162.58 seconds. Observed loss = 1.0821
[Run 1/1] Finished vanilla training (120 epochs) in 151.25 seconds. Observed loss = 1.1949
[Run 1/1] Observed results:  {'torch_mnist_ray_time_s': 162.5793750959997, 'torch_mnist_ray_loss': 1.0820692956447602, 'torch_mnist_vanilla_time_s': 151.25329932600016, 'torch_mnist_vanilla_loss': 1.1948663949966432}

@krfricke
Contributor Author

Ah, I believe it's because the data loaders are presumably packed onto the GPU.
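
One rough way to probe that hypothesis, assuming prepare_data_loader's move_to_device argument behaves as documented at the time; the dataset below is a stand-in, not the benchmark's MNIST loader.

```python
import torch
import ray.train.torch
from torch.utils.data import DataLoader, TensorDataset

def train_loop_per_worker(config):
    device = ray.train.torch.get_device()
    torch.cuda.set_device(device)

    # Stand-in dataset; the real benchmark uses MNIST.
    dataset = TensorDataset(torch.randn(1024, 28 * 28), torch.randint(0, 10, (1024,)))
    loader = DataLoader(dataset, batch_size=64)

    # move_to_device=False keeps batches on the CPU; moving them explicitly makes it
    # obvious when (and to which GPU) the data transfer happens.
    loader = ray.train.torch.prepare_data_loader(loader, move_to_device=False)

    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # ... forward/backward ...
```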

@JiahaoYao
Contributor

@richardliaw
Contributor

Ah, I believe it's because the data loaders are presumably packed onto the GPU.

What does this mean?

@richardliaw richardliaw added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 19, 2022