This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

ray_horovod leaks gpu memory on the cuda:0 #181

Open
JiahaoYao opened this issue Jul 14, 2022 · 2 comments
Comments

@JiahaoYao
Contributor

Package versions in the environment:

(base) ray@ip-172-31-36-78:~/horovod-gpu/ray_lightning/ray_lightning/examples$ pip list | grep lightning
lightning-bolts                        0.4.0
pytorch-lightning                      1.5.4
ray-lightning                          0.2.0

The GPU state before running the script (`nvidia-smi`):

Thu Jul 14 13:22:18 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   33C    P8    16W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   32C    P8    15W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   32C    P8    16W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P8    16W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The script that was run:

(base) ray@ip-172-31-36-78:~/horovod-gpu/ray_lightning/ray_lightning/examples$ python ray_horovod_example.py 

The output:

(base) ray@ip-172-31-36-78:~/horovod-gpu/ray_lightning/ray_lightning/examples$ python ray_horovod_example.py 
/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:324: LightningDeprecationWarning: Passing <ray_lightning.ray_horovod.HorovodRayPlugin object at 0x7ff6df406670> `strategy` to the `plugins` flag in Trainer has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy=<ray_lightning.ray_horovod.HorovodRayPlugin object at 0x7ff6df406670>)` instead.
  rank_zero_deprecation(
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1579: UserWarning: GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`.
  rank_zero_warn(
/home/ray/anaconda3/lib/python3.8/site-packages/torchvision/datasets/mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Validation sanity check: 0it [00:00, ?it/s]
(BaseHorovodWorker pid=44597) 
(BaseHorovodWorker pid=44597)   | Name     | Type     | Params
(BaseHorovodWorker pid=44597) --------------------------------------
(BaseHorovodWorker pid=44597) 0 | layer_1  | Linear   | 25.1 K
(BaseHorovodWorker pid=44597) 1 | layer_2  | Linear   | 2.1 K 
(BaseHorovodWorker pid=44597) 2 | layer_3  | Linear   | 650   
(BaseHorovodWorker pid=44597) 3 | accuracy | Accuracy | 0     
(BaseHorovodWorker pid=44597) --------------------------------------
(BaseHorovodWorker pid=44597) 27.9 K    Trainable params
(BaseHorovodWorker pid=44597) 0         Non-trainable params
(BaseHorovodWorker pid=44597) 27.9 K    Total params
(BaseHorovodWorker pid=44597) 0.112     Total estimated model params size (MB)
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]
Validation sanity check:  50%|█████     | 1/2 [00:00<00:00,  7.22it/s]
Epoch 0:   0%|          | 0/468 [00:00<?, ?it/s]                      
(BaseHorovodWorker pid=44598) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 0:   1%|| 7/468 [00:00<00:09, 50.70it/s, loss=-0.0547, v_num=2]
(BaseHorovodWorker pid=44597) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
(BaseHorovodWorker pid=44599) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
(BaseHorovodWorker pid=44600) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 0:   2%|| 8/468 [00:00<00:08, 54.65it/s, loss=-0.0674, v_num=2]
Epoch 0:   4%|| 21/468 [00:00<00:05, 83.87it/s, loss=-0.0906, v_num=2]
Epoch 0:   5%|| 22/468 [00:00<00:05, 84.74it/s, loss=-0.0906, v_num=2]
Epoch 0:   5%|| 23/468 [00:00<00:05, 85.87it/s, loss=-0.0969, v_num=2]
Epoch 0:   5%|| 24/468 [00:00<00:05, 87.22it/s, loss=-0.0938, v_num=2]
Epoch 0:   5%|| 25/468 [00:00<00:04, 88.76it/s, loss=-0.0969, v_num=2]
Epoch 0:   6%|| 26/468 [00:00<00:04, 90.25it/s, loss=-0.102, v_num=2] 
Epoch 0:   6%|| 27/468 [00:00<00:04, 91.67it/s, loss=-0.108, v_num=2]
Epoch 0:   6%|| 28/468 [00:00<00:04, 93.08it/s, loss=-0.1, v_num=2]  
Epoch 0:   9%|| 43/468 [00:00<00:04, 105.73it/s, loss=-0.1, v_num=2]   
Epoch 0:   9%|| 44/468 [00:00<00:04, 105.95it/s, loss=-0.105, v_num=2]
Epoch 0:  12%|█▏        | 57/468 [00:00<00:03, 108.21it/s, loss=-0.0891, v_num=2]
Epoch 0:  15%|█▌        | 72/468 [00:00<00:03, 112.66it/s, loss=-0.0875, v_num=2]
Epoch 0:  18%|█▊        | 85/468 [00:00<00:03, 113.64it/s, loss=-0.0891, v_num=2]
Epoch 0:  21%|██▏       | 100/468 [00:00<00:03, 116.04it/s, loss=-0.0797, v_num=2]
Epoch 0:  21%|██▏       | 100/468 [00:00<00:03, 115.99it/s, loss=-0.0797, v_num=2]
Epoch 0:  25%|██▍       | 115/468 [00:00<00:02, 118.38it/s, loss=-0.0828, v_num=2]
Epoch 0:  25%|██▍       | 116/468 [00:00<00:02, 118.56it/s, loss=-0.0781, v_num=2]
Epoch 0:  25%|██▌       | 117/468 [00:00<00:02, 118.71it/s, loss=-0.0812, v_num=2]
Epoch 0:  28%|██▊       | 131/468 [00:01<00:02, 119.51it/s, loss=-0.0922, v_num=2]
Epoch 0:  31%|███       | 146/468 [00:01<00:02, 121.10it/s, loss=-0.0875, v_num=2]
Epoch 0:  31%|███▏      | 147/468 [00:01<00:02, 121.30it/s, loss=-0.0859, v_num=2]
Epoch 0:  34%|███▍      | 160/468 [00:01<00:02, 121.05it/s, loss=-0.0953, v_num=2]
Epoch 0:  34%|███▍      | 161/468 [00:01<00:02, 121.04it/s, loss=-0.0953, v_num=2]
Epoch 0:  37%|███▋      | 175/468 [00:01<00:02, 121.85it/s, loss=-0.0906, v_num=2]
Epoch 0:  38%|███▊      | 176/468 [00:01<00:02, 121.98it/s, loss=-0.0922, v_num=2]
Epoch 0:  38%|███▊      | 177/468 [00:01<00:02, 122.05it/s, loss=-0.0922, v_num=2]
Epoch 0:  38%|███▊      | 178/468 [00:01<00:02, 122.18it/s, loss=-0.1, v_num=2]   
Epoch 0:  41%|████      | 193/468 [00:01<00:02, 123.45it/s, loss=-0.0969, v_num=2]
Epoch 0:  41%|████▏     | 194/468 [00:01<00:02, 123.59it/s, loss=-0.0938, v_num=2]
Epoch 0:  42%|████▏     | 195/468 [00:01<00:02, 123.71it/s, loss=-0.0938, v_num=2]
Epoch 0:  42%|████▏     | 196/468 [00:01<00:02, 123.84it/s, loss=-0.0922, v_num=2]
Epoch 0:  42%|████▏     | 197/468 [00:01<00:02, 123.97it/s, loss=-0.0906, v_num=2]
Epoch 0:  42%|████▏     | 198/468 [00:01<00:02, 124.09it/s, loss=-0.0797, v_num=2]
Epoch 0:  46%|████▌     | 213/468 [00:01<00:02, 125.02it/s, loss=-0.0719, v_num=2]
Epoch 0:  46%|████▌     | 214/468 [00:01<00:02, 125.05it/s, loss=-0.0797, v_num=2]
Epoch 0:  49%|████▊     | 227/468 [00:01<00:01, 124.98it/s, loss=-0.1, v_num=2]   
Epoch 0:  49%|████▊     | 228/468 [00:01<00:01, 124.93it/s, loss=-0.103, v_num=2]
Epoch 0:  49%|████▉     | 229/468 [00:01<00:01, 124.89it/s, loss=-0.108, v_num=2]
Epoch 0:  52%|█████▏    | 243/468 [00:01<00:01, 125.41it/s, loss=-0.103, v_num=2]
Epoch 0:  52%|█████▏    | 244/468 [00:01<00:01, 125.44it/s, loss=-0.102, v_num=2]
Epoch 0:  52%|█████▏    | 245/468 [00:01<00:01, 125.48it/s, loss=-0.108, v_num=2]
Epoch 0:  53%|█████▎    | 246/468 [00:01<00:01, 125.57it/s, loss=-0.0984, v_num=2]
Epoch 0:  53%|█████▎    | 247/468 [00:01<00:01, 125.61it/s, loss=-0.0984, v_num=2]
Epoch 0:  56%|█████▌    | 263/468 [00:02<00:01, 126.88it/s, loss=-0.0906, v_num=2]
Epoch 0:  56%|█████▋    | 264/468 [00:02<00:01, 126.96it/s, loss=-0.0938, v_num=2]
Epoch 0:  57%|█████▋    | 265/468 [00:02<00:01, 127.06it/s, loss=-0.0922, v_num=2]
Epoch 0:  57%|█████▋    | 266/468 [00:02<00:01, 127.15it/s, loss=-0.0969, v_num=2]
Epoch 0:  57%|█████▋    | 267/468 [00:02<00:01, 127.24it/s, loss=-0.102, v_num=2] 
Epoch 0:  57%|█████▋    | 268/468 [00:02<00:01, 127.32it/s, loss=-0.0984, v_num=2]
Epoch 0:  57%|█████▋    | 269/468 [00:02<00:01, 127.41it/s, loss=-0.0984, v_num=2]
Epoch 0:  58%|█████▊    | 270/468 [00:02<00:01, 127.50it/s, loss=-0.102, v_num=2] 
Epoch 0:  58%|█████▊    | 271/468 [00:02<00:01, 127.53it/s, loss=-0.1, v_num=2]  
Epoch 0:  61%|██████    | 286/468 [00:02<00:01, 127.98it/s, loss=-0.0953, v_num=2]
Epoch 0:  61%|██████▏   | 287/468 [00:02<00:01, 128.00it/s, loss=-0.0938, v_num=2]
Epoch 0:  65%|██████▍   | 302/468 [00:02<00:01, 128.34it/s, loss=-0.0891, v_num=2]
Epoch 0:  68%|██████▊   | 317/468 [00:02<00:01, 128.62it/s, loss=-0.0859, v_num=2]
Epoch 0:  68%|██████▊   | 318/468 [00:02<00:01, 128.63it/s, loss=-0.0812, v_num=2]
Epoch 0:  68%|██████▊   | 319/468 [00:02<00:01, 128.64it/s, loss=-0.0828, v_num=2]
Epoch 0:  68%|██████▊   | 320/468 [00:02<00:01, 128.65it/s, loss=-0.0828, v_num=2]
Epoch 0:  72%|███████▏  | 335/468 [00:02<00:01, 128.96it/s, loss=-0.0844, v_num=2]
Epoch 0:  72%|███████▏  | 336/468 [00:02<00:01, 128.98it/s, loss=-0.0828, v_num=2]
Epoch 0:  75%|███████▌  | 351/468 [00:02<00:00, 129.49it/s, loss=-0.0875, v_num=2]
Epoch 0:  75%|███████▌  | 352/468 [00:02<00:00, 129.50it/s, loss=-0.0891, v_num=2]
Epoch 0:  75%|███████▌  | 353/468 [00:02<00:00, 129.52it/s, loss=-0.0906, v_num=2]
Epoch 0:  76%|███████▌  | 354/468 [00:02<00:00, 129.53it/s, loss=-0.0906, v_num=2]
Epoch 0:  76%|███████▌  | 355/468 [00:02<00:00, 129.55it/s, loss=-0.0906, v_num=2]
Epoch 0:  79%|███████▉  | 370/468 [00:02<00:00, 129.71it/s, loss=-0.0828, v_num=2]
Epoch 0:  79%|███████▉  | 371/468 [00:02<00:00, 129.72it/s, loss=-0.0859, v_num=2]
Epoch 0:  79%|███████▉  | 372/468 [00:02<00:00, 129.69it/s, loss=-0.0875, v_num=2]
Epoch 0:  80%|███████▉  | 373/468 [00:02<00:00, 129.65it/s, loss=-0.0828, v_num=2]
Epoch 0:  80%|███████▉  | 374/468 [00:02<00:00, 129.63it/s, loss=-0.0891, v_num=2]
Epoch 0:  80%|████████  | 375/468 [00:02<00:00, 129.63it/s, loss=-0.0922, v_num=2]
Epoch 0:  80%|████████  | 376/468 [00:02<00:00, 129.64it/s, loss=-0.0906, v_num=2]
Epoch 0:  81%|████████  | 377/468 [00:02<00:00, 129.62it/s, loss=-0.0922, v_num=2]
Epoch 0:  81%|████████  | 378/468 [00:02<00:00, 129.58it/s, loss=-0.0969, v_num=2]
Epoch 0:  81%|████████  | 379/468 [00:02<00:00, 129.59it/s, loss=-0.1, v_num=2]   
Epoch 0:  81%|████████  | 380/468 [00:02<00:00, 129.60it/s, loss=-0.106, v_num=2]
Epoch 0:  81%|████████▏ | 381/468 [00:02<00:00, 129.62it/s, loss=-0.109, v_num=2]
Epoch 0:  82%|████████▏ | 382/468 [00:02<00:00, 129.59it/s, loss=-0.105, v_num=2]
Epoch 0:  82%|████████▏ | 383/468 [00:02<00:00, 129.60it/s, loss=-0.102, v_num=2]
Epoch 0:  82%|████████▏ | 384/468 [00:02<00:00, 129.62it/s, loss=-0.106, v_num=2]
Epoch 0:  82%|████████▏ | 385/468 [00:02<00:00, 129.63it/s, loss=-0.102, v_num=2]
Epoch 0:  82%|████████▏ | 386/468 [00:02<00:00, 129.64it/s, loss=-0.0922, v_num=2]
Epoch 0:  83%|████████▎ | 387/468 [00:02<00:00, 129.62it/s, loss=-0.0891, v_num=2]
Epoch 0:  83%|████████▎ | 388/468 [00:02<00:00, 129.62it/s, loss=-0.0875, v_num=2]
Epoch 0:  83%|████████▎ | 389/468 [00:03<00:00, 129.63it/s, loss=-0.0875, v_num=2]
Epoch 0:  86%|████████▋ | 404/468 [00:03<00:00, 130.00it/s, loss=-0.0828, v_num=2]
Epoch 0:  87%|████████▋ | 405/468 [00:03<00:00, 129.98it/s, loss=-0.0891, v_num=2]
Epoch 0:  90%|████████▉ | 420/468 [00:03<00:00, 130.24it/s, loss=-0.0953, v_num=2]
Epoch 0:  90%|████████▉ | 421/468 [00:03<00:00, 130.29it/s, loss=-0.0969, v_num=2]
Epoch 0:  90%|█████████ | 422/468 [00:03<00:00, 130.35it/s, loss=-0.102, v_num=2] 
Epoch 0:  90%|█████████ | 423/468 [00:03<00:00, 130.36it/s, loss=-0.105, v_num=2]
Epoch 0:  92%|█████████▏| 429/468 [00:03<00:00, 128.30it/s, loss=-0.1, v_num=2]   
(BaseHorovodWorker pid=44598) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 0:  94%|█████████▍| 440/468 [00:03<00:00, 128.18it/s, loss=-0.1, v_num=2]
(BaseHorovodWorker pid=44597) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
(BaseHorovodWorker pid=44599) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
(BaseHorovodWorker pid=44600) [W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 0: 100%|█████████▉| 466/468 [00:03<00:00, 131.91it/s, loss=-0.1, v_num=2]
Epoch 0: 100%|██████████| 468/468 [00:03<00:00, 128.10it/s, loss=-0.1, v_num=2]

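The GPU state after starting the script (`nvidia-smi`). PIDs 978, 981, and 982 each hold 1423MiB on GPU 0 in addition to the memory on their own GPU: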

Thu Jul 14 13:23:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   36C    P0    28W /  70W |   5714MiB / 15360MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   35C    P0    27W /  70W |   1445MiB / 15360MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   35C    P0    28W /  70W |   1445MiB / 15360MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    27W /  70W |   1445MiB / 15360MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       976      C                                    1443MiB |
|    0   N/A  N/A       978      C                                    1423MiB |
|    0   N/A  N/A       981      C                                    1423MiB |
|    0   N/A  N/A       982      C                                    1423MiB |
|    1   N/A  N/A       978      C                                    1443MiB |
|    2   N/A  N/A       981      C                                    1443MiB |
|    3   N/A  N/A       982      C                                    1443MiB |
+-----------------------------------------------------------------------------+
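
To make the pattern in the process table easier to see, here is a minimal sketch that parses the process rows above and lists the PIDs that hold memory on more than one GPU. The helper names (`gpus_by_pid`, `multi_gpu_pids`) are made up for illustration, and the regex assumes rows with an empty process name, as in this output; it is not a general `nvidia-smi` parser.

```python
import re
from collections import defaultdict

# Process rows copied from the nvidia-smi output above.
PROCESS_ROWS = """\
|    0   N/A  N/A       976      C                                    1443MiB |
|    0   N/A  N/A       978      C                                    1423MiB |
|    0   N/A  N/A       981      C                                    1423MiB |
|    0   N/A  N/A       982      C                                    1423MiB |
|    1   N/A  N/A       978      C                                    1443MiB |
|    2   N/A  N/A       981      C                                    1443MiB |
|    3   N/A  N/A       982      C                                    1443MiB |
"""

# Matches: GPU index, GI, CI, PID, type, memory in MiB.
# Assumption: the process-name column is empty, as in the rows above.
ROW = re.compile(r"^\|\s+(\d+)\s+N/A\s+N/A\s+(\d+)\s+\S+\s+(\d+)MiB\s+\|$")


def gpus_by_pid(text):
    """Return {pid: {gpu_index: memory_mib}} for every parsed process row."""
    usage = defaultdict(dict)
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            gpu, pid, mib = map(int, match.groups())
            usage[pid][gpu] = mib
    return dict(usage)


def multi_gpu_pids(usage):
    """Keep only the PIDs that appear on more than one GPU."""
    return {pid: gpus for pid, gpus in usage.items() if len(gpus) > 1}


usage = gpus_by_pid(PROCESS_ROWS)
extra = multi_gpu_pids(usage)
for pid, gpus in sorted(extra.items()):
    print(pid, gpus)
```

With the rows above, only PID 976 stays on a single GPU; the other three workers each show up on GPU 0 and on one more device, which is the extra memory on `cuda:0` that this issue is about.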
@JiahaoYao
Contributor Author

This was for the old version.
