Optimize GPU memory consumption: Decrease heap usage at the beginning of training and allow the GPU to use 100% of its memory. #44118

Open
Harrypotterrrr opened this issue Oct 17, 2020 · 2 comments
Labels: comp:gpu (GPU related issues), type:feature (Feature requests)


Harrypotterrrr commented Oct 17, 2020

System information

  • Are you willing to contribute it (Yes/No): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux lz 5.4.0-48-generic #52~18.04.1-Ubuntu
  • TensorFlow installed from (source or binary): (in conda env) pip install tensorflow-gpu -i https://pypi.tuna.tsinghua.edu.cn/simple
  • TensorFlow version (use command below): v2.3.0-rc2-23-gb36436b087 2.3.0

Describe the feature and the current behavior/state.

I am using TensorFlow 2.3 to reimplement a paper that has an official PyTorch version. Everything works except for the batch size of the training data: the official PyTorch implementation can train with a batch size of 32 on 8 GPUs, while I can only fit 16 with the same GPUs and the same network settings. I used the TensorBoard Profiler to inspect and optimize my training loop.

Here is the Profiler output of 10 steps of my model training.
[TensorBoard Profiler memory screenshots]

As the screenshots show, the peak heap usage occurs when the GradientTape requests memory for the last layer of the network. After this allocation, however, memory usage drops from 7.41 GiB to only around 6 GiB. So I am wondering why TF2 allocates so much heap at the beginning of each training step and then does not use that part for the rest of the loop, and what the difference between heap usage and memory usage is. Is there any way to optimize this heap allocation so that I can fit a batch size of 32 into GPU memory?
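For completeness, this kind of trace can be captured around a few training steps roughly like this (a simplified sketch; dataset and train_step stand in for my real input pipeline and training function):

import tensorflow as tf

# Simplified sketch: profile a handful of steps so the memory timeline
# can be inspected in TensorBoard. "dataset" and "train_step" are
# placeholders for the real training code.
tf.profiler.experimental.start("logs/profile")
for step, (batch_x, batch_y) in enumerate(dataset.take(10)):
    with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
        train_step(batch_x, batch_y)
tf.profiler.experimental.stop()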

I also noticed that the memory capacity shown by the TF Profiler is 10.96 GiB, which is only about 90% of the physical memory. My GPU has 12196 MiB of memory, so there is still some unused space available for training. Workarounds from some blogs, such as tf.config.experimental.set_memory_growth(gpu, True), do not help. I am therefore looking for a way to let TF use this remaining part of the GPU memory, i.e. 100% of it.

I tried the following code, as the documentation suggests, but it still failed:

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
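For reference, the same memory-growth behaviour can also be requested through an environment variable, as long as it is set before TensorFlow initializes the GPUs; a minimal sketch:

import os

# Must be set before TensorFlow creates the GPU devices,
# i.e. before TensorFlow is imported / the first GPU op runs.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf  # imported only after the variable is set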

I am confident that if these two problems are solved, I could increase my batch size from 16 to 32, since a lot of GPU memory is currently being wasted. I sincerely appreciate your help.

I am training on multiple GPUs; the error traceback is provided below:

Traceback (most recent call last):
  File "main.py", line 29, in <module>
    main()
  File "main.py", line 25, in main
    trainer.train()
  File "/home/lz/potter/EDVR/trainers/train.py", line 238, in train
    loss, acc = self.train_epoch(epoch)
  File "/home/lz/potter/EDVR/trainers/train.py", line 191, in train_epoch
    loss, psnr = self.multi_train_step(batch_x, batch_y)
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 846, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1843, in _filtered_call
    return self._call_flat(
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1923, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 545, in call
    outputs = execute.execute(
  File "/home/lz/anaconda3/envs/tf2/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 3 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[4,256,360,640] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node StatefulPartitionedCall/conv2d_61/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[Identity_2/_190]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[4,256,360,640] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node StatefulPartitionedCall/conv2d_61/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (2) Resource exhausted:  OOM when allocating tensor with shape[4,256,360,640] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node StatefulPartitionedCall/conv2d_61/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[GroupCrossDeviceControlEdges_0/StatefulPartitionedCall/Adam/Adam/update_1_1/Const/_155]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_multi_train_step_36442]

Function call stack:
multi_train_step -> multi_train_step -> multi_train_step

Will this change the current API? How?
No.

Who will benefit from this feature?
Everyone who wants to optimize GPU memory consumption.

@Harrypotterrrr Harrypotterrrr added the type:feature Feature requests label Oct 17, 2020
@ravikyram ravikyram added the comp:gpu GPU related issues label Oct 19, 2020
@ravikyram ravikyram assigned gowthamkpr and unassigned ravikyram Oct 19, 2020
@gowthamkpr gowthamkpr assigned sanjoy and unassigned gowthamkpr Oct 20, 2020
@gowthamkpr gowthamkpr added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Oct 20, 2020
sanjoy (Contributor) commented Oct 22, 2020

I also noticed that the memory capacity shown by the TF Profiler is 10.96 GiB, which is only about 90% of the physical memory.

We leave 6% of the GPU memory free for libraries like cuDNN and cuBLAS; that could explain part of it, but not all of it. @imintz, any idea?
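As a quick cross-check of the raw device numbers outside of TensorFlow (a minimal sketch, assuming the pynvml package is installed):

# Sketch: query the physical GPU memory directly via NVML to compare
# against the capacity the TF Profiler reports.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("total: %.0f MiB" % (info.total / 1024**2))
print("free:  %.0f MiB" % (info.free / 1024**2))
print("used:  %.0f MiB" % (info.used / 1024**2))
pynvml.nvmlShutdown()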

Harrypotterrrr (Author) commented

@sanjoy Thanks for your reply! Here is a solution I found online to use the remaining part of the GPU memory:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Cap TensorFlow's allocation on each GPU at 12195 MB (essentially the whole card).
    try:
        for gpu in gpus:
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=12195)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be configured before the GPUs have been initialized.
        print(e)

I simply set memory_limit to 12195 MB and TF now uses the full capacity of my GPU.
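For what it's worth, the same limit can be written with the non-experimental aliases available in newer TF releases; a minimal sketch, assuming TF 2.4 or later (the 12195 MB value matches the card above):

import tensorflow as tf

# Sketch: same per-GPU memory cap using the non-experimental API.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.set_logical_device_configuration(
        gpu,
        [tf.config.LogicalDeviceConfiguration(memory_limit=12195)])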
