TensorRT: Large memory consumption on SSD-like graphs [Feature Request/Discussion] #19619
Comments
@yegord, thanks for the investigation. Would it be possible to share the graph so that we can investigate and improve our segmenter? The most common OOM issue is due to the memory reservation for TensorRT, which will be reduced with TensorRT 4.0 GA since we start to share TensorFlow's allocator; only the workspace parameter needs to be tuned now. We are also restructuring the TF-TRT workflow, which should improve things. In upcoming updates we will add an option to specify the placement of ops into TRT segments, and users will be able to selectively keep certain nodes outside segments. Thanks.
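For illustration, the announced option to keep selected nodes outside TRT segments might behave roughly like the following toy filter. This is pure Python; the function name, the interface, and the node list are all hypothetical, not the actual `tf.contrib.tensorrt` API:

```python
def trt_candidates(graph_nodes, supported_ops, keep_in_tf=()):
    """Return node names eligible for TensorRT fusion.

    keep_in_tf mimics the announced option of selectively keeping
    certain nodes outside TRT segments.
    """
    keep = set(keep_in_tf)
    return [name for name, op in graph_nodes
            if op in supported_ops and name not in keep]

graph_nodes = [("conv1", "Conv2D"), ("relu1", "Relu"),
               ("argmax", "ArgMax"), ("conv2", "Conv2D")]
supported = {"Conv2D", "Relu"}  # e.g., ArgMax is unsupported in TRT 3

# Default: every supported node is a fusion candidate.
print(trt_candidates(graph_nodes, supported))  # ['conv1', 'relu1', 'conv2']
# With the proposed option, the user can pin nodes to TensorFlow.
print(trt_candidates(graph_nodes, supported, keep_in_tf=["conv2"]))  # ['conv1', 'relu1']
```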
Please find a minimal example here: https://yadi.sk/d/797LMYGT3Wi2yy. Specifying the placement of ops into TRT segments sounds like what I proposed here.
Nagging Assignee @samikama: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.
@yegord Could you please try with TRT 4.0? You should not need to set the GPU allocation fraction when using TRT 4.0, and the workspace will be allocated directly from TF memory. It should improve things for you.
@samikama Thanks for the update! Which TensorFlow version should I test TRT 4 with? Master (98b9a4e) crashes with
(This is the minimal example mentioned in #19619 (comment).)
@yegord, it looks like an issue with Grappler's layout optimizer.
We are returning that error |
@yegord, please wait for #20794 and try:

```python
#!/usr/bin/env python
import argparse
import subprocess

import numpy as np
import six
import tensorflow as tf
from tensorflow.core.protobuf import config_pb2 as cpb2
from tensorflow.core.protobuf import rewriter_config_pb2 as rwpb2
from tensorflow.contrib import tensorrt as trt

NUM_IMAGES = 3
INPUT_TENSORS = [u'placeholders/image_0', u'placeholders/image_1', u'placeholders/image_2', u'placeholders/num_images', u'placeholders/blend_coeff']
OUTPUT_TENSORS = [u'detections/center_x', u'detections/center_y', u'detections/width', u'detections/height', u'detections/width_3d', u'detections/height_3d', u'detections/depth_3d', u'detections/class_id', u'detections/probability', u'detections/yaw', u'detections/properties/tl_rotation/class_id', u'detections/properties/tl_type/class_id', u'detections/properties/tl_road_tl_state/class_id', u'detections/properties/tl_bicycle_tl_state/class_id', u'detections/properties/tl_pedestrian_tl_state/class_id', u'detections/properties/tl_other_tl_state/class_id', u'detections/properties/tl_left_section/class_id', u'detections/properties/tl_left_section_state/class_id', u'detections/properties/tl_right_section/class_id', u'detections/properties/tl_right_section_state/class_id', u'segmentations/sdc1/0', u'segmentations/sdc1/1', u'segmentations/sdc1/2', u'visualizations/sdc1/0', u'visualizations/sdc1/1', u'visualizations/sdc1/2']


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--mode',
        choices=['tf', 'trt'],
        required=True,
        help='Evaluator to use.')
    args = parser.parse_args()
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('ssd-tensorflow.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default():
        tf.import_graph_def(graph_def, name='')
        run_graph(mode=args.mode)


def run_graph(ntimes=100, mode='trt'):
    image = np.zeros((1024, 768, 3), dtype=np.uint8)
    feed_dict = {
        tensor_name: image
        for tensor_name in INPUT_TENSORS[:NUM_IMAGES]
    }
    feed_dict[u'placeholders/num_images'] = NUM_IMAGES
    feed_dict[u'placeholders/blend_coeff'] = 0.5
    feed_dict = {
        tf.get_default_graph().get_operation_by_name(key).outputs[0]: value
        for key, value in six.iteritems(feed_dict)
    }
    output_tensors = [
        tf.get_default_graph().get_operation_by_name(tensor_name).outputs[0]
        for tensor_name in OUTPUT_TENSORS
    ]
    opt_config = rwpb2.RewriterConfig()
    opt_config.meta_optimizer_iterations = opt_config.ONE
    opt_config.optimizers.extend(["constfold", "layout"])
    custom_op = opt_config.custom_optimizers.add()
    custom_op.name = "TensorRTOptimizer"
    custom_op.parameter_map["minimum_segment_size"].i = 10
    custom_op.parameter_map["precision_mode"].s = "FP32"
    custom_op.parameter_map["max_batch_size"].i = NUM_IMAGES
    custom_op.parameter_map["is_dynamic_op"].b = True
    custom_op.parameter_map["max_workspace_size_bytes"].i = 1 << 25
    graph_options = cpb2.GraphOptions(rewrite_options=opt_config)
    if mode == 'trt':
        config = tf.ConfigProto(graph_options=graph_options)
    else:
        config = tf.ConfigProto()
    with tf.Session(config=config) as session:
        for i in six.moves.xrange(ntimes):
            session.run(output_tensors, feed_dict)
        subprocess.check_call(['nvidia-smi'], close_fds=True)


if __name__ == '__main__':
    main()
```
As of master (d51aff2): all numbers are according to nvidia-smi; the test script is the one from #19619 (comment). Thanks a lot for your help!
Is the old way of creating an optimized graph with
still supposed to work? I still get
when I try to call `create_inference_graph`, even after the fix #20794, on master (2791d58c2bbbce833a3de73503816d881a06cfe3). Do I need to run a couple of optimization passes before I call `create_inference_graph` (like here: #19619 (comment))? If yes, how do I do that outside `session.run()`? Maybe there is a public manual on Grappler and graph rewriting that you could recommend? Thank you.
@samikama Ping. In general, can anybody here share the roadmap for TensorRT support in TensorFlow: on what time horizon is `tf.contrib.tensorrt` going to get a documented, relatively stable, and well-tested interface (like most other TensorFlow contrib modules have)?
@pooyadavoodi, would you please help to take a look at this issue?
Hi @yegord, it looks like the problem you encountered about
Regarding the roadmap, we're working on making the integration more stable and adding more test coverage. We plan to ship TF r1.11 with TRT 4.0, which we believe will be a relatively stable version to try. But please always let us know if you encounter any problems; that helps us improve the integration. Thanks.
The script at the commit that you refer to did work on my side. What did not work was calling
The original minimal example script seems to work again as of 3b061fc. Apparently something was fixed between 2791d58c2bbbce833a3de73503816d881a06cfe3 and 3b061fc that made it work. I guess I will make another attempt to integrate TensorRT in the near future, then. Thank you for the information!
I'm closing this issue since the mentioned problem (large memory consumption) is fixed. Please feel free to let me know if there are any other questions.
Hi,

`tf.contrib.tensorrt` currently automatically segments a given graph into subgraphs that can be fused into TensorRT nodes. This approach works well with networks that have a linear topology, e.g., CNN classifiers like VGG or ResNet-N: a single node is created, taking an image batch as input and producing a single tensor with predictions.

The situation with SSD-like networks is less favorable. Such a network has a topology of the form a->b->c->d->e->f, a->a_1, a->a_2, b->b_1, b->b_2, ..., f->f_1, f->f_2, where a->b->c->d->e->f is a feature extractor and a_1, a_2, b_1, b_2, ..., f_1, f_2 are branches stemming from the feature extractor and predicting, e.g., the classes and exact locations of the objects in the predefined anchor boxes.

On such a graph, `tf.contrib.tensorrt`'s segmentation algorithm selects a subgraph consisting of all the feature extractor's nodes, plus possibly parts of the branches (e.g., the convolutions computing the logits, but not the argmaxes computing the class ids; TensorRT as of version 3 does not support argmax). As a result, we get a huge operation with lots of outputs (e.g., all logits and all raw location adjustments, before argmaxes, reshapes, or NCHW->NHWC transpositions), all of which must simultaneously fit in GPU memory at the moment the TensorRT op completes.

When the same graph is computed by TensorFlow, this peak in memory usage can be (and seems to be) avoided: TensorFlow can compute the feature extractor up to the next level, compute the branches, copy the results computed in the branches into host RAM, go on to the next level, and so on.
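The effect can be illustrated with a toy memory model in plain Python. All node names and tensor sizes below are invented for illustration; this is not the real segmenter:

```python
# Toy model of an SSD-like graph: a linear feature extractor a->b->...->f,
# with two small prediction branches hanging off every level.
backbone = ["a", "b", "c", "d", "e", "f"]
edges = [(prev, nxt) for prev, nxt in zip(backbone, backbone[1:])]
for level in backbone:
    edges.append((level, level + "_1"))  # e.g. class-logits branch
    edges.append((level, level + "_2"))  # e.g. box-regression branch

# Hypothetical size (MB) of each branch output tensor.
branch_mb = {level + suffix: 40 for level in backbone for suffix in ("_1", "_2")}

# If TensorRT fuses the whole backbone plus all branch convolutions into one
# op, every branch output is an output of that single op, so all of them must
# be resident on the GPU at once when the op completes.
fused_peak = sum(branch_mb.values())

# If TensorFlow executes level by level and copies branch results to host RAM
# before moving on, the peak is only the two branch outputs of one level.
levelwise_peak = max(branch_mb[l + "_1"] + branch_mb[l + "_2"] for l in backbone)

print(fused_peak, levelwise_peak)  # 480 vs 80 in this toy model
```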
This means a `tf.contrib.tensorrt`-optimized graph can use (and in my experiments actually does use) significantly more GPU memory than the original graph, which can lead (and in my experiments does lead) to out-of-memory errors.

One workaround that I tried was to add to `tf.contrib.tensorrt.create_inference_graph` a parameter for specifying a list of subgraphs, each of which should be independently segmented into subsubgraphs for fusion into TensorRT nodes. I passed the individual levels of the feature extractor, plus the branches at each level, as such subgraphs. This reduced the memory used by TensorRT by a factor of around two and fixed the out-of-memory errors in my case.

Should such a parameter perhaps be added to the mainline `tf.contrib.tensorrt`? Or maybe you have a better idea for avoiding high GPU memory consumption in SSD-like graphs?
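The workaround of independently segmenting user-specified subgraphs can be sketched with a toy greedy segmenter (pure Python; this is a simplified stand-in for the real `tf.contrib.tensorrt` segmentation algorithm, with made-up node names):

```python
def segment(nodes, supported, allowed_groups=None):
    """Greedy toy segmenter: groups consecutive supported nodes.

    allowed_groups, if given, is a list of node subsets; a segment never
    spans two different subsets, mimicking a parameter for independently
    segmenting user-specified subgraphs.
    """
    def group_of(node):
        if allowed_groups is None:
            return 0  # no restriction: everything is one group
        for i, group in enumerate(allowed_groups):
            if node in group:
                return i
        return None  # outside every group: never fused

    segments, current, current_group = [], [], None
    for node in nodes:
        g = group_of(node)
        if node in supported and g is not None and (not current or g == current_group):
            current.append(node)
            current_group = g
        else:
            if current:
                segments.append(current)
            current = [node] if node in supported and g is not None else []
            current_group = g
    if current:
        segments.append(current)
    return segments

nodes = ["a", "a_1", "b", "b_1", "c", "c_1"]
supported = set(nodes)  # pretend TensorRT supports every op here

# Without grouping: one huge segment, all branch outputs live at once.
print(segment(nodes, supported))
# With per-level groups: several small segments, lower peak output memory.
print(segment(nodes, supported, [["a", "a_1"], ["b", "b_1"], ["c", "c_1"]]))
```

In this toy model, the ungrouped call returns a single six-node segment, while the grouped call returns three two-node segments, mirroring the roughly twofold memory reduction observed in the experiment above.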
Comments and ideas are welcome. I guess I should invite @drpngx, @samikama, and @jjsjann123 to the discussion.

In case it matters, my experience comes from experiments with TensorFlow 1.8 and TensorRT 3.0.4 running on Ubuntu 16.04 (AMD64) with a GTX 1080 Ti.
Thanks!