Optimize GPU memory consumption: Decrease heap usage at the beginning of training and allow the GPU to use 100% of its memory. #44118

Open
Harrypotterrrr opened this issue Oct 17, 2020 · 2 comments
Labels: comp:gpu (GPU related issues), type:feature (Feature requests)


Harrypotterrrr commented Oct 17, 2020

System information

  • Are you willing to contribute it (Yes/No): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux lz 5.4.0-48-generic #52~18.04.1-Ubuntu
  • TensorFlow installed from (source or binary): (in conda env) pip install tensorflow-gpu -i https://pypi.tuna.tsinghua.edu.cn/simple
  • TensorFlow version (use command below): v2.3.0-rc2-23-gb36436b087 2.3.0

Describe the feature and the current behavior/state.

I am using TensorFlow 2.3 to reimplement a paper that has an official PyTorch version. Everything works except for the batch size of the training data: the official PyTorch implementation can train with a batch size of 32 on 8 GPUs, while I can only fit 16 with the same GPUs and the same network settings. I used the TensorBoard Profiler to inspect and optimize my training loop.

Here is the Profiler output of 10 steps of my model training.
[TensorBoard Profiler memory screenshots]

As the screenshots show, the peak heap usage occurs when the GradientTape requests memory for the last layer of the network. After this allocation, however, memory usage drops from 7.41 GiB to only around 6 GiB. So I am wondering why TF2 allocates so much heap at the beginning of each training step and then does not use that part for the rest of the loop, and what the difference between heap usage and memory usage is. Is there any way to optimize this heap allocation so that I can fit a batch size of 32 into GPU memory?
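For completeness, this kind of trace can be captured around a few training steps roughly like this (a simplified sketch; dataset and train_step stand in for my real input pipeline and training function):

import tensorflow as tf

# Simplified sketch: profile a handful of steps so the memory timeline
# can be inspected in TensorBoard. "dataset" and "train_step" are
# placeholders for the real training code.
tf.profiler.experimental.start("logs/profile")
for step, (batch_x, batch_y) in enumerate(dataset.take(10)):
    with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
        train_step(batch_x, batch_y)
tf.profiler.experimental.stop()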

I also noticed that the memory capacity shown by the TF Profiler is 10.96 GiB, which is only about 90% of the physical memory. My GPU has 12196 MiB of memory, so there is still some unused space available for training. Workarounds from some blogs, such as tf.config.experimental.set_memory_growth(gpu, True), do not help. I am therefore looking for a way to let TF use this remaining part of the GPU memory, i.e. 100% of it.

I tried the following code, as the documentation suggests, but it still failed:

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
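For reference, the same memory-growth behaviour can also be requested through an environment variable, as long as it is set before TensorFlow initializes the GPUs; a minimal sketch:

import os

# Must be set before TensorFlow creates the GPU devices,
# i.e. before TensorFlow is imported / the first GPU op runs.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf  # imported only after the variable is set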

I am confident that if these two problems are solved, I could increase my batch size from 16 to 32, since a lot of GPU memory is currently being wasted. I sincerely appreciate your help.

I am training on multiple GPUs; the error traceback is provided below:

Traceback (most recent call last):
  File "main.py", line 29, in <module>
    main()
  File "main.py", line 25, in main
    trainer.train()
  File "/home/lz/potter/EDVR/trainers/train.py", line 238, in train
    loss, acc = self.train_epoch(epoch)
  File "/home/lz/potter/EDVR/trainers/train.py", line 191, in train_epoch
    loss, psnr = self.multi_train_step(batch_x, batch_y)
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 846, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1843, in _filtered_call
    return self._call_flat(
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1923, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 545, in call
    outputs = execute.execute(
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 3 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[4,256,360,640] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node StatefulPartitionedCall/conv2d_61/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[Identity_2/_190]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[4,256,360,640] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node StatefulPartitionedCall/conv2d_61/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (2) Resource exhausted:  OOM when allocating tensor with shape[4,256,360,640] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node StatefulPartitionedCall/conv2d_61/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[GroupCrossDeviceControlEdges_0/StatefulPartitionedCall/Adam/Adam/update_1_1/Const/_155]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_multi_train_step_36442]

Function call stack:
multi_train_step -> multi_train_step -> multi_train_step

Will this change the current API? How?
No.

Who will benefit from this feature?
Everyone who wants to optimize GPU memory consumption.

@Harrypotterrrr Harrypotterrrr added the type:feature Feature requests label Oct 17, 2020
@ravikyram ravikyram added the comp:gpu GPU related issues label Oct 19, 2020
@ravikyram ravikyram assigned gowthamkpr and unassigned ravikyram Oct 19, 2020
@gowthamkpr gowthamkpr assigned sanjoy and unassigned gowthamkpr Oct 20, 2020
@gowthamkpr gowthamkpr added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Oct 20, 2020
sanjoy (Contributor) commented Oct 22, 2020

I also noticed that the memory capacity shown by the TF Profiler is 10.96 GiB, which is only about 90% of the physical memory.

We leave 6% of the GPU memory free for libraries like cuDNN and cuBLAS; that could explain part of it, but not all of it. @imintz, any idea?
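As a quick cross-check of the raw device numbers outside of TensorFlow (a minimal sketch, assuming the pynvml package is installed):

# Sketch: query the physical GPU memory directly via NVML to compare
# against the capacity the TF Profiler reports.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("total: %.0f MiB" % (info.total / 1024**2))
print("free:  %.0f MiB" % (info.free / 1024**2))
print("used:  %.0f MiB" % (info.used / 1024**2))
pynvml.nvmlShutdown()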

Harrypotterrrr (Author) commented

@sanjoy Thanks for your reply! Here is a solution I found online to use the remaining part of the GPU memory:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Cap TensorFlow's allocation on each GPU at 12195 MB (essentially the whole card).
    try:
        for gpu in gpus:
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=12195)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be configured before the GPUs have been initialized.
        print(e)

I simply set memory_limit to 12195 MB and TF now uses the full capacity of my GPU.
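For what it's worth, the same limit can be written with the non-experimental aliases available in newer TF releases; a minimal sketch, assuming TF 2.4 or later (the 12195 MB value matches the card above):

import tensorflow as tf

# Sketch: same per-GPU memory cap using the non-experimental API.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.set_logical_device_configuration(
        gpu,
        [tf.config.LogicalDeviceConfiguration(memory_limit=12195)])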
