
TensorRT: Large memory consumption on SSD-like graphs [Feature Request/Discussion] #19619

Closed
yegord opened this issue May 29, 2018 · 17 comments

@yegord
Contributor

yegord commented May 29, 2018

Hi,

tf.contrib.tensorrt currently segments a given graph automatically into subgraphs that can be fused into TensorRT nodes. This approach works well for networks with a linear topology, e.g. CNN classifiers like VGG or ResNet-N: a single node is created, taking an image batch as input and producing a single tensor with predictions.
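
(For reference, the conversion I mean is the usual create_inference_graph call. A minimal sketch, where the file name frozen_resnet.pb and the output node name logits are just placeholders for an actual classifier:)

import tensorflow as tf
from tensorflow.contrib import tensorrt as trt

# Load a frozen classifier graph (placeholder path).
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_resnet.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# On a graph with a linear topology this typically yields a single
# TRTEngineOp covering (almost) the whole network.
trt_graph_def = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=['logits'],  # placeholder output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,
    precision_mode='FP32',
    minimum_segment_size=3)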

The situation with SSD-like networks is less favorable. Such a network has a topology of the form a->b->c->d->e->f, a->a_1, a->a_2, b->b_1, b->b_2, ..., f->f_1, f->f_2, where a->b->c->d->e->f is the feature extractor and a_1, a_2, b_1, b_2, ..., f_1, f_2 are branches stemming from the feature extractor and predicting, e.g., the classes and exact locations of the objects in the predefined anchor boxes. On such a graph, tf.contrib.tensorrt's segmentation algorithm selects a subgraph consisting of all the feature extractor's nodes, plus possibly parts of the branches (e.g. the convolutions computing the logits, but not the argmaxes computing the class ids; TensorRT as of version 3 does not support argmax). As a result, we get a huge operation with lots of outputs (e.g. all logits and all raw location adjustments, before any argmaxes, reshapes, or NCHW->NHWC transpositions), all of which must fit in GPU memory simultaneously at the moment the TensorRT op completes.
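
(A toy sketch of such a topology, just to make the shape concrete; the layer names and sizes below are made up and are not the actual model:)

import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 768, 1024, 3], name='images')
net = images
branch_outputs = []
for level in range(6):  # the backbone a -> b -> c -> d -> e -> f
    net = tf.layers.conv2d(net, 64, 3, strides=2, padding='same',
                           name='feature_%d' % level)
    # Two small heads per level, e.g. class logits and box regressions.
    branch_outputs.append(tf.layers.conv2d(net, 8, 3, padding='same',
                                           name='logits_%d' % level))
    branch_outputs.append(tf.layers.conv2d(net, 4, 3, padding='same',
                                           name='boxes_%d' % level))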

When the same graph is executed by plain TensorFlow, this peak in memory usage can be (and apparently is) avoided: TensorFlow can compute the feature extractor up to the next level, compute the branches, copy the branch results into host RAM, move on to the next level, and so on.

This means that a tf.contrib.tensorrt-optimized graph can use (and in my experiments actually does use) significantly more GPU memory than the original graph, which can lead (and in my experiments did lead) to out-of-memory errors.

One workaround that I tried was to add to tf.contrib.tensorrt.create_inference_graph a parameter for specifying a list of subgraphs, each of which is then independently segmented into sub-subgraphs for fusion into TensorRT nodes; a sketch of the idea follows below. I passed the individual levels of the feature extractor, each together with its branches, as such subgraphs. This reduced TensorRT's memory use by a factor of around two and fixed the out-of-memory errors in my case.
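
(Roughly, the call looked like the following. Note that independent_subgraphs is my own experimental parameter, not an existing tf.contrib.tensorrt option, and the level_*_nodes lists are simply the node names of one feature-extractor level plus its branches:)

# Hypothetical extension: each group of node names is segmented and converted
# into TensorRT engines independently of the other groups.
trt_graph_def = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=output_node_names,
    max_batch_size=3,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP32',
    minimum_segment_size=10,
    independent_subgraphs=[level_0_nodes, level_1_nodes, level_2_nodes])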

Should such a parameter perhaps be added to the mainline tf.contrib.tensorrt?
Or maybe you have a better idea for avoiding the high GPU memory consumption in SSD-like graphs?
Comments and ideas are welcome.

I guess I should invite @drpngx, @samikama, @jjsjann123 to the discussion.

In case it matters, my experience comes from experiments with TensorFlow 1.8 and TensorRT 3.0.4 running on Ubuntu 16.04 (AMD64) with a GTX 1080 Ti.

Thanks!

@drpngx
Contributor

drpngx commented May 29, 2018

/CC @tfboyd @zheng-xq

@samikama
Contributor

@yegord, thanks for the investigation. Would it be possible to share the graph so that we can investigate and improve our segmenter? The most common OOM issue is due to the memory reservation for TensorRT, which will be reduced with the TensorRT 4.0 GA, since we then start sharing TensorFlow's allocator and only the workspace parameter needs to be tuned. We are also restructuring the TF-TRT workflow, which should improve things. In upcoming updates we will add an option to specify the placement of ops into TRT segments, so users will be able to selectively keep certain nodes outside the segments.

Thanks,
Sami

@yegord
Contributor Author

yegord commented May 30, 2018

Please find a minimal example here: https://yadi.sk/d/797LMYGT3Wi2yy
./minimal_example.py --mode tf gives me 5487MiB / 11172MiB GPU memory usage on the 1080 Ti, according to nvidia-smi.
./minimal_example.py --mode trt gives me 9987MiB / 11172MiB, which is 80% more than with plain TensorFlow, exceeds max_workspace_size_bytes (around 6G; I am not sure all of that is actually used by TensorRT, though) and is close to the physical memory limit.

Specifying the placement of ops into TRT segments sounds like what I proposed here.
Sharing the allocator also sounds great.
I guess I should wait for the TensorRT 4 based release and repeat the experiment.
Currently, the almost twofold memory usage is a blocker for using tf.contrib.tensorrt for me.

@poxvoculi assigned drpngx and unassigned poxvoculi May 30, 2018
@samikama self-assigned this May 30, 2018
@tensorflowbutler
Member

Nagging Assignees @samikama, @drpngx: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@drpngx removed their assignment Jun 15, 2018
@tensorflowbutler
Member

Nagging Assignee @samikama: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@samikama
Contributor

@yegord Could you please try with TRT 4.0? You should not need to set the GPU allocation fraction when using TRT 4.0, and the workspace will be allocated directly from TF memory. It should improve things for you.
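
(For reference, the "gpu allocation fraction" is the per-process GPU memory fraction that previously had to be lowered so that memory was left outside TF for the TensorRT engines; a sketch of the old-style setting that should no longer be necessary with TRT 4.0:)

import tensorflow as tf

config = tf.ConfigProto()
# Old workaround: cap TF at a fraction of the GPU so that TensorRT could
# allocate its engines and workspace outside of TF's memory pool.
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # example value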

@yegord
Contributor Author

yegord commented Jul 13, 2018

@samikama Thanks for the update! Which TensorFlow version should I test TRT 4 with? Master (98b9a4e) crashes with

$ ./minimal_example.py --mode trt
2018-07-14 01:17:30.655903: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
Traceback (most recent call last):
  File "./minimal_example.py", line 81, in <module>
    main()
  File "./minimal_example.py", line 44, in main
    minimum_segment_size=3
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tensorrt/python/trt_convert.py", line 152, in create_inference_graph
    int(msg[0]))
tensorflow.python.framework.errors_impl.NotFoundError: No attr named 'shape' in NodeDef:
         [[Node: placeholders/image_0 = Placeholder[dtype=DT_UINT8]()]] for 'placeholders/image_0' (op: 'Placeholder') with input shapes: .

(This is the minimal example mentioned in #19619 (comment).)

@samikama
Contributor

@yegord, it looks like an issue with grappler's layout optimizer:

2018-07-13 20:50:26.902178: I tensorflow/core/grappler/optimizers/layout_optimizer.cc:2187] Infer shape return status: Not found: No attr named 'shape' in NodeDef:
	 [[Node: placeholders/image_0 = Placeholder[dtype=DT_UINT8]()]] for 'placeholders/image_0' (op: 'Placeholder') with input shapes: .
2018-07-13 20:50:26.938333: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:318] layout: Not found: No attr named 'shape' in NodeDef:
	 [[Node: placeholders/image_0 = Placeholder[dtype=DT_UINT8]()]] for 'placeholders/image_0' (op: 'Placeholder') with input shapes: .
2018-07-13 20:50:27.010038: I tensorflow/core/grappler/costs/graph_properties.cc:1250] Propagating 2002 new shapes through 0 loops and 0 resources

We are just returning that error.

@samikama
Contributor

@yegord, please wait for #20794 and then try:

#!/usr/bin/env python

import argparse
import subprocess
import tensorflow as tf
import numpy as np
import six

from tensorflow.core.protobuf import config_pb2 as cpb2
from tensorflow.core.protobuf import rewriter_config_pb2 as rwpb2

from tensorflow.contrib import tensorrt as trt

NUM_IMAGES = 3

INPUT_TENSORS = [u'placeholders/image_0', u'placeholders/image_1', u'placeholders/image_2', u'placeholders/num_images', u'placeholders/blend_coeff']

OUTPUT_TENSORS = [u'detections/center_x', u'detections/center_y', u'detections/width', u'detections/height', u'detections/width_3d', u'detections/height_3d', u'detections/depth_3d', u'detections/class_id', u'detections/probability', u'detections/yaw', u'detections/properties/tl_rotation/class_id', u'detections/properties/tl_type/class_id', u'detections/properties/tl_road_tl_state/class_id', u'detections/properties/tl_bicycle_tl_state/class_id', u'detections/properties/tl_pedestrian_tl_state/class_id', u'detections/properties/tl_other_tl_state/class_id', u'detections/properties/tl_left_section/class_id', u'detections/properties/tl_left_section_state/class_id', u'detections/properties/tl_right_section/class_id', u'detections/properties/tl_right_section_state/class_id', u'segmentations/sdc1/0', u'segmentations/sdc1/1', u'segmentations/sdc1/2', u'visualizations/sdc1/0', u'visualizations/sdc1/1', u'visualizations/sdc1/2']


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        '--mode',
        choices=['tf', 'trt'],
        required=True,
        help='Evaluator to use.')

    args = parser.parse_args()

    graph_def = tf.GraphDef()
    with tf.gfile.GFile('ssd-tensorflow.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())

    with tf.Graph().as_default():
        tf.import_graph_def(graph_def, name='')
        run_graph(mode=args.mode)

def run_graph(ntimes=100, mode='trt'):
    image = np.zeros((1024, 768, 3), dtype=np.uint8)
    feed_dict = {
        tensor_name: image
        for tensor_name in INPUT_TENSORS[:NUM_IMAGES]
    }
    feed_dict[u'placeholders/num_images'] = NUM_IMAGES
    feed_dict[u'placeholders/blend_coeff'] = 0.5

    feed_dict = {
        tf.get_default_graph().get_operation_by_name(key).outputs[0]: value
        for key, value in six.iteritems(feed_dict)
    }

    output_tensors = [
        tf.get_default_graph().get_operation_by_name(tensor_name).outputs[0]
        for tensor_name in OUTPUT_TENSORS
    ]
    # Configure grappler: run constant folding and layout optimization, then
    # the TensorRT converter as a custom grappler pass.
    opt_config = rwpb2.RewriterConfig()
    opt_config.meta_optimizer_iterations = opt_config.ONE
    opt_config.optimizers.extend(["constfold", "layout"])
    custom_op = opt_config.custom_optimizers.add()
    custom_op.name = "TensorRTOptimizer"
    custom_op.parameter_map["minimum_segment_size"].i = 10
    custom_op.parameter_map["precision_mode"].s = "FP32"
    custom_op.parameter_map["max_batch_size"].i = NUM_IMAGES
    custom_op.parameter_map["is_dynamic_op"].b = True
    custom_op.parameter_map["max_workspace_size_bytes"].i = 1 << 25
    
    graph_options = cpb2.GraphOptions(rewrite_options=opt_config)
    if mode == 'trt':
        config = tf.ConfigProto(graph_options=graph_options)
    else:
        config = tf.ConfigProto()

    with tf.Session(config=config) as session:
        for i in six.moves.xrange(ntimes):
            session.run(output_tensors, feed_dict)
        subprocess.check_call(['nvidia-smi'], close_fds=True)


if __name__ == '__main__':
    main()

@aaroey self-assigned this Jul 14, 2018
@lqw187927

lqw187927 commented Jul 14, 2018 via email

@yegord
Contributor Author

yegord commented Jul 17, 2018

As of master (d51aff2):
The vanilla graph uses 5499MiB.
The TRT-optimized graph uses 1753MiB.

All numbers are according to nvidia-smi; the test script is the one from #19619 (comment), with config.gpu_options.allow_growth = True added (see the snippet below).
The result is definitely very positive, assuming there is no error in the measurements and that the optimized graph computes the same thing as the original one (I have not checked this yet).
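
(Concretely, the only change to that script was enabling on-demand GPU allocation:)

config = tf.ConfigProto(graph_options=graph_options)
# Let TF grow GPU memory on demand instead of reserving the whole card, so the
# nvidia-smi numbers reflect what the graph actually uses.
config.gpu_options.allow_growth = True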

Thanks a lot for your help!
Feel free to close this task if you do not need it anymore.
I will come back if I have any further problems on my way.

@yegord
Contributor Author

yegord commented Jul 25, 2018

Is the old way of creating an optimized graph with

    graph_def = trt.create_inference_graph(
        input_graph_def=graph_def,
        outputs=output_node_names,
        max_batch_size=num_cameras,
        max_workspace_size_bytes=1 << 25, 
        precision_mode=precision_mode,
        minimum_segment_size=10,  # minimum number of nodes in an engine,
    )   

still supposed to work?

I still get

Traceback (most recent call last):
  File "ros/src/perception/detection/tools/deploy.py", line 294, in <module>
    main()
  File "ros/src/perception/detection/tools/deploy.py", line 40, in main
    graph_def = _make_inference_graph(model, args.num_cameras)
  File "ros/src/perception/detection/tools/deploy.py", line 270, in _make_inference_graph
    minimum_segment_size=10,  # minimum number of nodes in an engine,
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tensorrt/python/trt_convert.py", line 153, in create_inference_graph
    int(msg[0]))
tensorflow.python.framework.errors_impl.NotFoundError: No attr named 'shape' in NodeDef:
         [[Node: placeholders/image_0 = Placeholder[dtype=DT_UINT8]()]] for 'placeholders/image_0' (op: 'Placeholder') with input shapes: .

when I try to call create_inference_graph, even after the fix #20794, on master (2791d58c2bbbce833a3de73503816d881a06cfe3).

Do I need to run a couple of optimization passes before I call create_inference_graph (like here: #19619 (comment))? If yes, how do I do that outside of session.run()? Maybe there is a public manual on grappler and graph rewriting that you could recommend?
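
(For illustration, what I am imagining is roughly the following. This is only a sketch pieced together from what trt_convert.py appears to do internally; tf_optimizer is not a public API and its signature seems to differ between TF versions, so treat it as an assumption:)

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2 as rwpb2
from tensorflow.python.grappler import tf_optimizer

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    # Register the outputs as fetches so grappler knows which nodes to keep.
    for name in output_node_names:
        graph.add_to_collection('train_op',
                                graph.get_operation_by_name(name))
    meta_graph = tf.train.export_meta_graph(graph=graph)

# Ask for constant folding and the layout optimizer only.
rewriter_config = rwpb2.RewriterConfig()
rewriter_config.optimizers.extend(['constfold', 'layout'])

# In the TF 1.x contrib era the first argument was a RewriterConfig;
# the call returns the rewritten GraphDef.
pre_optimized_graph_def = tf_optimizer.OptimizeGraph(rewriter_config, meta_graph)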

Thank you.

@yegord
Contributor Author

yegord commented Aug 9, 2018

@samikama Ping.

In general, can anybody here share the roadmap for TensorRT support in TensorFlow: on what time horizon is tf.contrib.tensorrt going to get a documented, relatively stable, and well-tested interface (like most other TensorFlow contrib modules have)?

@aaroey
Member

aaroey commented Aug 9, 2018

@pooyadavoodi, would you please help to take a look at this issue?

aaroey added a commit to aaroey/tensorflow that referenced this issue Aug 11, 2018
@aaroey
Member

aaroey commented Aug 11, 2018

Hi @yegord,

It looks like the No attr named 'shape' in NodeDef problem you encountered is fixed. I patched the repro in aaroey@f2bdf76 and the script works well when building from master. Would you please help to double-check?

Regarding the roadmap, we're working on making the integration more stable and adding more test coverage. We plan to ship TF r1.11 with TRT 4.0, which we believe will be a relatively stable version to try. Please keep letting us know if you encounter any problems; that helps us improve the integration.

Thanks.

@yegord
Contributor Author

yegord commented Aug 11, 2018

I patched the repro in aaroey/tensorflow@f2bdf76 and the script works well when building from master.

The script at the commit that you refer to did work on my side. What did not work was calling create_inference_graph() (as in the original minimal example script).

Would you please help to double check?

The original minimal example script seems to work again with 3b061fc. Apparently, something was fixed between 2791d58c2bbbce833a3de73503816d881a06cfe3 and 3b061fc that made it work. I guess I will make another attempt to integrate TensorRT in the near future, then.

Thank you for the information!

@aaroey
Member

aaroey commented Aug 21, 2018

I'm closing this issue since the mentioned problem (large memory consumption) is fixed. Please feel free to let me know if there are any other questions.
