
TensorRT: Large memory consumption on SSD-like graphs [Feature Request/Discussion] #19619

Closed
yegord opened this issue May 29, 2018 · 17 comments

@yegord
Contributor

yegord commented May 29, 2018

Hi,

tf.contrib.tensorrt currently segments a given graph automatically into subgraphs that can be fused into TensorRT nodes. This approach works well for networks with a linear topology, e.g. CNN classifiers like VGG or ResNet-N: a single node is created, taking an image batch as input and producing a single tensor with predictions.
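
(For reference, the conversion I mean is the usual create_inference_graph call. A minimal sketch, where the file name frozen_resnet.pb and the output node name logits are just placeholders for an actual classifier:)

import tensorflow as tf
from tensorflow.contrib import tensorrt as trt

# Load a frozen classifier graph (placeholder path).
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_resnet.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# On a graph with a linear topology this typically yields a single
# TRTEngineOp covering (almost) the whole network.
trt_graph_def = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=['logits'],  # placeholder output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,
    precision_mode='FP32',
    minimum_segment_size=3)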

The situation with SSD-like networks is less favorable. Such a network has a topology of the form a->b->c->d->e->f, a->a_1, a->a_2, b->b_1, b->b_2, ..., f->f_1, f->f_2, where a->b->c->d->e->f is the feature extractor and a_1, a_2, b_1, b_2, ..., f_1, f_2 are branches stemming from the feature extractor and predicting, e.g., the classes and exact locations of the objects in the predefined anchor boxes. On such a graph, tf.contrib.tensorrt's segmentation algorithm selects a subgraph consisting of all the feature extractor's nodes, plus possibly parts of the branches (e.g. the convolutions computing the logits, but not the argmaxes computing the class ids; TensorRT as of version 3 does not support argmax). As a result, we get a huge operation with lots of outputs (e.g. all logits and all raw location adjustments, before any argmaxes, reshapes, or NCHW->NHWC transpositions), all of which must fit in GPU memory simultaneously at the moment the TensorRT op completes.
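
(A toy sketch of such a topology, just to make the shape concrete; the layer names and sizes below are made up and are not the actual model:)

import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 768, 1024, 3], name='images')
net = images
branch_outputs = []
for level in range(6):  # the backbone a -> b -> c -> d -> e -> f
    net = tf.layers.conv2d(net, 64, 3, strides=2, padding='same',
                           name='feature_%d' % level)
    # Two small heads per level, e.g. class logits and box regressions.
    branch_outputs.append(tf.layers.conv2d(net, 8, 3, padding='same',
                                           name='logits_%d' % level))
    branch_outputs.append(tf.layers.conv2d(net, 4, 3, padding='same',
                                           name='boxes_%d' % level))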

When the same graph is executed by plain TensorFlow, this peak in memory usage can be (and apparently is) avoided: TensorFlow can compute the feature extractor up to the next level, compute the branches, copy the branch results into host RAM, move on to the next level, and so on.

This means that a tf.contrib.tensorrt-optimized graph can use (and in my experiments actually does use) significantly more GPU memory than the original graph, which can lead (and in my experiments did lead) to out-of-memory errors.

One workaround that I tried was to add to tf.contrib.tensorrt.create_inference_graph a parameter for specifying a list of subgraphs, each of which is then independently segmented into sub-subgraphs for fusion into TensorRT nodes; a sketch of the idea follows below. I passed the individual levels of the feature extractor, each together with its branches, as such subgraphs. This reduced TensorRT's memory use by a factor of around two and fixed the out-of-memory errors in my case.
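
(Roughly, the call looked like the following. Note that independent_subgraphs is my own experimental parameter, not an existing tf.contrib.tensorrt option, and the level_*_nodes lists are simply the node names of one feature-extractor level plus its branches:)

# Hypothetical extension: each group of node names is segmented and converted
# into TensorRT engines independently of the other groups.
trt_graph_def = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=output_node_names,
    max_batch_size=3,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP32',
    minimum_segment_size=10,
    independent_subgraphs=[level_0_nodes, level_1_nodes, level_2_nodes])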

Should such a parameter perhaps be added to the mainline tf.contrib.tensorrt?
Or maybe you have a better idea for avoiding the high GPU memory consumption in SSD-like graphs?
Comments and ideas are welcome.

I guess I should invite @drpngx, @samikama, @jjsjann123 to the discussion.

In case it matters, my experience comes from experiments with TensorFlow 1.8 and TensorRT 3.0.4 running on Ubuntu 16.04 (AMD64) with a GTX 1080 Ti.

Thanks!

@drpngx
Contributor

drpngx commented May 29, 2018

/CC @tfboyd @zheng-xq

@samikama
Contributor

@yegord, thanks for the investigation. Would it be possible to share the graph so that we can investigate and improve our segmenter? The most common OOM issue is due to the memory reservation for TensorRT, which will be reduced with the TensorRT 4.0 GA, since we then start sharing TensorFlow's allocator and only the workspace parameter needs to be tuned. We are also restructuring the TF-TRT workflow, which should improve things. In upcoming updates we will add an option to specify the placement of ops into TRT segments, so users will be able to selectively keep certain nodes outside the segments.

Thanks,
Sami

@yegord
Contributor Author

yegord commented May 30, 2018

Please find a minimal example here: https://yadi.sk/d/797LMYGT3Wi2yy
./minimal_example.py --mode tf gives me 5487MiB / 11172MiB GPU memory usage on the 1080 Ti, according to nvidia-smi.
./minimal_example.py --mode trt gives me 9987MiB / 11172MiB, which is 80% more than with plain TensorFlow, exceeds max_workspace_size_bytes (around 6G; I am not sure all of that is actually used by TensorRT, though) and is close to the physical memory limit.

Specifying the placement of ops into TRT segments sounds like what I proposed here.
Sharing the allocator also sounds great.
I guess I should wait for the TensorRT 4 based release and repeat the experiment.
Currently, the almost twofold memory usage is a blocker for using tf.contrib.tensorrt for me.

@poxvoculi assigned drpngx and unassigned poxvoculi May 30, 2018
@samikama self-assigned this May 30, 2018
@tensorflowbutler
Member

Nagging Assignees @samikama, @drpngx: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@drpngx removed their assignment Jun 15, 2018
@tensorflowbutler
Member

Nagging Assignee @samikama: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@samikama
Contributor

@yegord Could you please try with TRT 4.0? You should not need to set the GPU allocation fraction when using TRT 4.0, and the workspace will be allocated directly from TF memory. It should improve things for you.
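
(For reference, the "gpu allocation fraction" is the per-process GPU memory fraction that previously had to be lowered so that memory was left outside TF for the TensorRT engines; a sketch of the old-style setting that should no longer be necessary with TRT 4.0:)

import tensorflow as tf

config = tf.ConfigProto()
# Old workaround: cap TF at a fraction of the GPU so that TensorRT could
# allocate its engines and workspace outside of TF's memory pool.
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # example value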

@yegord
Contributor Author

yegord commented Jul 13, 2018

@samikama Thanks for the update! Which TensorFlow version should I test TRT 4 with? Master (98b9a4e) crashes with

$ ./minimal_example.py --mode trt
2018-07-14 01:17:30.655903: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
Traceback (most recent call last):
  File "./minimal_example.py", line 81, in <module>
    main()
  File "./minimal_example.py", line 44, in main
    minimum_segment_size=3
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tensorrt/python/trt_convert.py", line 152, in create_inference_graph
    int(msg[0]))
tensorflow.python.framework.errors_impl.NotFoundError: No attr named 'shape' in NodeDef:
         [[Node: placeholders/image_0 = Placeholder[dtype=DT_UINT8]()]] for 'placeholders/image_0' (op: 'Placeholder') with input shapes: .

(This is the minimal example mentioned in #19619 (comment).)

@samikama
Contributor

@yegord, it looks like an issue with grappler's layout optimizer:

2018-07-13 20:50:26.902178: I tensorflow/core/grappler/optimizers/layout_optimizer.cc:2187] Infer shape return status: Not found: No attr named 'shape' in NodeDef:
	 [[Node: placeholders/image_0 = Placeholder[dtype=DT_UINT8]()]] for 'placeholders/image_0' (op: 'Placeholder') with input shapes: .
2018-07-13 20:50:26.938333: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:318] layout: Not found: No attr named 'shape' in NodeDef:
	 [[Node: placeholders/image_0 = Placeholder[dtype=DT_UINT8]()]] for 'placeholders/image_0' (op: 'Placeholder') with input shapes: .
2018-07-13 20:50:27.010038: I tensorflow/core/grappler/costs/graph_properties.cc:1250] Propagating 2002 new shapes through 0 loops and 0 resources

We are just returning that error.

@samikama
Contributor

@yegord, please wait for #20794 and then try:

#!/usr/bin/env python

import argparse
import subprocess
import tensorflow as tf
import numpy as np
import six

from tensorflow.core.protobuf import config_pb2 as cpb2
from tensorflow.core.protobuf import rewriter_config_pb2 as rwpb2

from tensorflow.contrib import tensorrt as trt

NUM_IMAGES = 3

INPUT_TENSORS = [u'placeholders/image_0', u'placeholders/image_1', u'placeholders/image_2', u'placeholders/num_images', u'placeholders/blend_coeff']

OUTPUT_TENSORS = [u'detections/center_x', u'detections/center_y', u'detections/width', u'detections/height', u'detections/width_3d', u'detections/height_3d', u'detections/depth_3d', u'detections/class_id', u'detections/probability', u'detections/yaw', u'detections/properties/tl_rotation/class_id', u'detections/properties/tl_type/class_id', u'detections/properties/tl_road_tl_state/class_id', u'detections/properties/tl_bicycle_tl_state/class_id', u'detections/properties/tl_pedestrian_tl_state/class_id', u'detections/properties/tl_other_tl_state/class_id', u'detections/properties/tl_left_section/class_id', u'detections/properties/tl_left_section_state/class_id', u'detections/properties/tl_right_section/class_id', u'detections/properties/tl_right_section_state/class_id', u'segmentations/sdc1/0', u'segmentations/sdc1/1', u'segmentations/sdc1/2', u'visualizations/sdc1/0', u'visualizations/sdc1/1', u'visualizations/sdc1/2']


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        '--mode',
        choices=['tf', 'trt'],
        required=True,
        help='Evaluator to use.')

    args = parser.parse_args()

    graph_def = tf.GraphDef()
    with tf.gfile.GFile('ssd-tensorflow.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())

    with tf.Graph().as_default():
        tf.import_graph_def(graph_def, name='')
        run_graph(mode=args.mode)

def run_graph(ntimes=100, mode='trt'):
    image = np.zeros((1024, 768, 3), dtype=np.uint8)
    feed_dict = {
        tensor_name: image
        for tensor_name in INPUT_TENSORS[:NUM_IMAGES]
    }
    feed_dict[u'placeholders/num_images'] = NUM_IMAGES
    feed_dict[u'placeholders/blend_coeff'] = 0.5

    feed_dict = {
        tf.get_default_graph().get_operation_by_name(key).outputs[0]: value
        for key, value in six.iteritems(feed_dict)
    }

    output_tensors = [
        tf.get_default_graph().get_operation_by_name(tensor_name).outputs[0]
        for tensor_name in OUTPUT_TENSORS
    ]
    # Configure grappler: run constant folding and layout optimization, then
    # the TensorRT converter as a custom grappler pass.
    opt_config = rwpb2.RewriterConfig()
    opt_config.meta_optimizer_iterations = opt_config.ONE
    opt_config.optimizers.extend(["constfold", "layout"])
    custom_op = opt_config.custom_optimizers.add()
    custom_op.name = "TensorRTOptimizer"
    custom_op.parameter_map["minimum_segment_size"].i = 10
    custom_op.parameter_map["precision_mode"].s = "FP32"
    custom_op.parameter_map["max_batch_size"].i = NUM_IMAGES
    custom_op.parameter_map["is_dynamic_op"].b = True
    custom_op.parameter_map["max_workspace_size_bytes"].i = 1 << 25
    
    graph_options = cpb2.GraphOptions(rewrite_options=opt_config)
    if mode == 'trt':
        config = tf.ConfigProto(graph_options=graph_options)
    else:
        config = tf.ConfigProto()

    with tf.Session(config=config) as session:
        for i in six.moves.xrange(ntimes):
            session.run(output_tensors, feed_dict)
        subprocess.check_call(['nvidia-smi'], close_fds=True)


if __name__ == '__main__':
    main()

@aaroey self-assigned this Jul 14, 2018
@lqw187927

lqw187927 commented Jul 14, 2018 via email

@yegord
Contributor Author

yegord commented Jul 17, 2018

As of master (d51aff2):
The vanilla graph uses 5499MiB.
The TRT-optimized graph uses 1753MiB.

All numbers are according to nvidia-smi; the test script is the one from #19619 (comment), with config.gpu_options.allow_growth = True added (see the snippet below).
The result is definitely very positive, assuming there is no error in the measurements and that the optimized graph computes the same thing as the original one (I have not checked this yet).
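
(Concretely, the only change to that script was enabling on-demand GPU allocation:)

config = tf.ConfigProto(graph_options=graph_options)
# Let TF grow GPU memory on demand instead of reserving the whole card, so the
# nvidia-smi numbers reflect what the graph actually uses.
config.gpu_options.allow_growth = True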

Thanks a lot for your help!
Feel free to close this task if you do not need it anymore.
I will come back if I have any further problems on my way.

@yegord
Contributor Author

yegord commented Jul 25, 2018

Is the old way of creating an optimized graph with

    graph_def = trt.create_inference_graph(
        input_graph_def=graph_def,
        outputs=output_node_names,
        max_batch_size=num_cameras,
        max_workspace_size_bytes=1 << 25, 
        precision_mode=precision_mode,
        minimum_segment_size=10,  # minimum number of nodes in an engine,
    )   

still supposed to work?

I still get

Traceback (most recent call last):
  File "ros/src/perception/detection/tools/deploy.py", line 294, in <module>
    main()
  File "ros/src/perception/detection/tools/deploy.py", line 40, in main
    graph_def = _make_inference_graph(model, args.num_cameras)
  File "ros/src/perception/detection/tools/deploy.py", line 270, in _make_inference_graph
    minimum_segment_size=10,  # minimum number of nodes in an engine,
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tensorrt/python/trt_convert.py", line 153, in create_inference_graph
    int(msg[0]))
tensorflow.python.framework.errors_impl.NotFoundError: No attr named 'shape' in NodeDef:
         [[Node: placeholders/image_0 = Placeholder[dtype=DT_UINT8]()]] for 'placeholders/image_0' (op: 'Placeholder') with input shapes: .

when I try to call create_inference_graph, even after the fix #20794, on master (2791d58c2bbbce833a3de73503816d881a06cfe3).

Do I need to run a couple of optimization passes before I call create_inference_graph (like here: #19619 (comment))? If yes, how do I do that outside of session.run()? Maybe there is a public manual on grappler and graph rewriting that you could recommend?
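
(For illustration, what I am imagining is roughly the following. This is only a sketch pieced together from what trt_convert.py appears to do internally; tf_optimizer is not a public API and its signature seems to differ between TF versions, so treat it as an assumption:)

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2 as rwpb2
from tensorflow.python.grappler import tf_optimizer

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    # Register the outputs as fetches so grappler knows which nodes to keep.
    for name in output_node_names:
        graph.add_to_collection('train_op',
                                graph.get_operation_by_name(name))
    meta_graph = tf.train.export_meta_graph(graph=graph)

# Ask for constant folding and the layout optimizer only.
rewriter_config = rwpb2.RewriterConfig()
rewriter_config.optimizers.extend(['constfold', 'layout'])

# In the TF 1.x contrib era the first argument was a RewriterConfig;
# the call returns the rewritten GraphDef.
pre_optimized_graph_def = tf_optimizer.OptimizeGraph(rewriter_config, meta_graph)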

Thank you.

@yegord
Contributor Author

yegord commented Aug 9, 2018

@samikama Ping.

In general, can anybody here share the roadmap for TensorRT support in TensorFlow: on what time horizon is tf.contrib.tensorrt going to get a documented, relatively stable, and well-tested interface (like most other TensorFlow contrib modules have)?

@aaroey
Member

aaroey commented Aug 9, 2018

@pooyadavoodi, would you please help to take a look at this issue?

aaroey added a commit to aaroey/tensorflow that referenced this issue Aug 11, 2018
@aaroey
Member

aaroey commented Aug 11, 2018

Hi @yegord,

It looks like the No attr named 'shape' in NodeDef problem you encountered is fixed. I patched the repro in aaroey@f2bdf76 and the script works well when building from master. Would you please help to double-check?

Regarding the roadmap, we're working on making the integration more stable and adding more test coverage. We plan to ship TF r1.11 with TRT 4.0, which we believe will be a relatively stable version to try. Please keep letting us know if you encounter any problems; that helps us improve the integration.

Thanks.

@yegord
Contributor Author

yegord commented Aug 11, 2018

I patched the repro in aaroey/tensorflow@f2bdf76 and the script works well when building from master.

The script at the commit that you refer to did work on my side. What did not work was calling create_inference_graph() (as in the original minimal example script).

Would you please help to double check?

The original minimal example script seems to work again with 3b061fc. Apparently, something was fixed between 2791d58c2bbbce833a3de73503816d881a06cfe3 and 3b061fc that made it work. I guess I will make another attempt to integrate TensorRT in the near future, then.

Thank you for the information!

@aaroey
Member

aaroey commented Aug 21, 2018

I'm closing this issue since the mentioned problem (large memory consumption) is fixed. Please feel free to let me know if there are any other questions.
