
[air/train] Multi GPU training occupies more memory on first GPU #26707

Closed
krfricke opened this issue Jul 19, 2022 · 9 comments · Fixed by #26819
Labels
bug Something that is supposed to be working; but isn't P0 Issues that should be fixed in short order

Comments

@krfricke
Contributor

krfricke commented Jul 19, 2022

What happened + What you expected to happen

This is a 4-node setup with 4 GPUs per node. I'm running a GPU benchmark script that compares training with Ray AIR against vanilla PyTorch.

When training with Ray AIR, GPU 0 on the head node occupies more memory than expected: every worker on the node appears to instantiate its model on GPU 0 in addition to its assigned GPU. This seems to happen during setup.

This is the output of nvidia-smi for Ray AIR, which shows that GPU 0 is used by multiple processes:

(base) ray@ip-172-31-73-133:~/oss-release-tests$ nvidia-smi
Tue Jul 19 02:13:09 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   42C    P0    26W /  70W |   5148MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   41C    P0    27W /  70W |   1047MiB / 15360MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   42C    P0    26W /  70W |   1031MiB / 15360MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    27W /  70W |   1071MiB / 15360MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9077      C                                    1303MiB |
|    0   N/A  N/A      9078      C                                    1281MiB |
|    0   N/A  N/A      9079      C                                    1281MiB |
|    0   N/A  N/A      9080      C                                    1281MiB |
|    1   N/A  N/A      9079      C                                    1077MiB |
|    2   N/A  N/A      9078      C                                    1043MiB |
|    3   N/A  N/A      9080      C                                    1071MiB |
+-----------------------------------------------------------------------------+

For comparison, this is the GPU usage when running the vanilla training script (only one process per GPU):

(base) ray@ip-172-31-73-133:~/oss-release-tests$ nvidia-smi
Tue Jul 19 02:15:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   42C    P0    35W /  70W |   1325MiB / 15360MiB |     40%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   42C    P0    36W /  70W |   1325MiB / 15360MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   43C    P0    36W /  70W |   1325MiB / 15360MiB |     45%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    36W /  70W |   1325MiB / 15360MiB |     46%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     18473      C                                    1323MiB |
|    1   N/A  N/A     18474      C                                    1323MiB |
|    2   N/A  N/A     18475      C                                    1323MiB |
|    3   N/A  N/A     18476      C                                    1323MiB |
+-----------------------------------------------------------------------------+

When restarting with Ray AIR and monitoring the GPU usage, this comes up:

Tue Jul 19 02:18:43 2022       
...
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32150      C                                     311MiB |
|    0   N/A  N/A     32151      C                                     309MiB |
|    0   N/A  N/A     32152      C                                     311MiB |
|    0   N/A  N/A     32153      C                                     331MiB |
+-----------------------------------------------------------------------------+

A few seconds later:

Tue Jul 19 02:18:49 2022       
...
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32150      C                                    1303MiB |
|    0   N/A  N/A     32151      C                                    1281MiB |
|    0   N/A  N/A     32152      C                                    1281MiB |
|    0   N/A  N/A     32153      C                                    1281MiB |
|    1   N/A  N/A     32151      C                                     769MiB |
|    2   N/A  N/A     32153      C                                     793MiB |
|    3   N/A  N/A     32152      C                                     751MiB |
+-----------------------------------------------------------------------------+

It thus seems to be a setup issue.
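
A minimal sketch of the likely mechanism (illustrative only, not taken from the benchmark script): PyTorch's default CUDA device is cuda:0, so a worker that never pins its device puts any bare .cuda() allocation, and the CUDA context that comes with it, on GPU 0.

```python
# Illustrative only (assumed setup, not the benchmark script): a worker assigned
# GPU 2 that never calls torch.cuda.set_device still allocates on cuda:0 whenever
# it calls .cuda() without an explicit device index.
import torch
import torch.nn as nn

assigned_gpu = 2  # e.g. the GPU this worker was given

model_a = nn.Linear(10, 10).cuda()               # lands on cuda:0 (default device)
model_b = nn.Linear(10, 10).cuda(assigned_gpu)   # lands on cuda:2 as intended

torch.cuda.set_device(assigned_gpu)              # pin the default device first...
model_c = nn.Linear(10, 10).cuda()               # ...and bare .cuda() now uses cuda:2
```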

Versions / Dependencies

Latest master

Reproduction script

Run the air_benchmark_torch_mnist_gpu_4x4 release test (preferably manually on Anyscale).

Issue Severity

High: It blocks me from completing my task.

@krfricke krfricke added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 19, 2022
@krfricke krfricke added the air label Jul 19, 2022
@amogkam
Contributor

amogkam commented Jul 19, 2022

@krfricke can you try adding a torch.cuda.set_device(ray.train.torch.get_device()) call at the beginning of the training function?
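
A sketch of where that call would go, assuming the Ray AIR TorchTrainer API of the time (TorchTrainer, ScalingConfig, prepare_model); the model and worker count below are placeholders, not the benchmark's.

```python
import torch
import torch.nn as nn
import ray.train.torch
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Pin this worker's default CUDA device before creating any model or tensor,
    # so bare .cuda() / .to("cuda") calls don't fall back to GPU 0.
    device = ray.train.torch.get_device()
    torch.cuda.set_device(device)

    model = ray.train.torch.prepare_model(nn.Linear(28 * 28, 10))
    # ... build data loaders and run the usual training loop here ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
trainer.fit()
```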

@matthewdeng
Contributor

@JiahaoYao FYI, this seems related to the issue you saw with Horovod(?)

@krfricke
Contributor Author

@amogkam confirmed, this removes the double memory allocation. On my first try the run hung indefinitely, though; on the second run it worked. Now waiting for the full benchmark to pass to confirm.

Can we do this automatically in the torch trainer?

@amogkam
Contributor

amogkam commented Jul 19, 2022

Nice! Yes, we can do this automatically; I'll make a quick PR for this.
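
For context, a hypothetical sketch of what doing it automatically could look like; the wrapper name is invented here, and this is not necessarily how #26819 implements it.

```python
# Hypothetical sketch (invented wrapper, not the actual change in #26819):
# pin the device once per worker before invoking the user's training function.
import torch
import ray.train.torch

def run_train_func_on_worker(train_func, config):
    device = ray.train.torch.get_device()
    if device.type == "cuda":
        torch.cuda.set_device(device)  # user code then inherits the right default GPU
    return train_func(config)
```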

@krfricke
Contributor Author

With this change, training now takes much longer:

Before:

torch_mnist_ray_time_s_mean = 163.79045822200007
torch_mnist_vanilla_time_s_mean = 152.50713274733334

After:


[Run 1/3] Finished Ray training (120 epochs) in 1514.23 seconds. Observed loss = 1.7087
[Run 1/3] Finished vanilla training (120 epochs) in 389.51 seconds. Observed loss = 1.7622
[Run 1/3] Observed results:  {'torch_mnist_ray_time_s': 1514.2345556380003, 'torch_mnist_ray_loss': 1.7086539268493652, 'torch_mnist_vanilla_time_s': 389.5140381809997, 'torch_mnist_vanilla_loss': 1.7622368931770325}

GPU utilization is also at 100% during the whole training run, while it was at ~50% before (and still is for vanilla training). I also believe the vanilla training run is mostly slowed down by leftover utilization of the GPUs from the Ray run (but I'm not sure about this).

Anyway, removing the line speeds everything up again.

Does setting the device somehow interfere with the batch size setting?

@krfricke
Contributor Author

krfricke commented Jul 19, 2022

After removing the line (same cluster):

[Run 1/1] Finished Ray training (120 epochs) in 162.58 seconds. Observed loss = 1.0821
[Run 1/1] Finished vanilla training (120 epochs) in 151.25 seconds. Observed loss = 1.1949
[Run 1/1] Observed results:  {'torch_mnist_ray_time_s': 162.5793750959997, 'torch_mnist_ray_loss': 1.0820692956447602, 'torch_mnist_vanilla_time_s': 151.25329932600016, 'torch_mnist_vanilla_loss': 1.1948663949966432}

@krfricke
Contributor Author

Ah, I believe it's because the data loaders are presumably packed onto the GPU.
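
One rough way to probe that hypothesis, assuming prepare_data_loader's move_to_device argument behaves as documented at the time; the dataset below is a stand-in, not the benchmark's MNIST loader.

```python
import torch
import ray.train.torch
from torch.utils.data import DataLoader, TensorDataset

def train_loop_per_worker(config):
    device = ray.train.torch.get_device()
    torch.cuda.set_device(device)

    # Stand-in dataset; the real benchmark uses MNIST.
    dataset = TensorDataset(torch.randn(1024, 28 * 28), torch.randint(0, 10, (1024,)))
    loader = DataLoader(dataset, batch_size=64)

    # move_to_device=False keeps batches on the CPU; moving them explicitly makes it
    # obvious when (and to which GPU) the data transfer happens.
    loader = ray.train.torch.prepare_data_loader(loader, move_to_device=False)

    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # ... forward/backward ...
```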

@JiahaoYao
Contributor

@richardliaw
Contributor

Ah, I believe it's because the data loaders are presumably packed onto the GPU.

What does this mean?

@richardliaw richardliaw added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 19, 2022