
Graph optimized using tf.contrib.tensorrt is not loadable with TF_GraphImportGraphDef #23853

Closed
yegord opened this issue Nov 19, 2018 · 19 comments

@yegord
Contributor

yegord commented Nov 19, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes.

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04

  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:

  • TensorFlow installed from (source or binary): Source.

  • TensorFlow version (use command below): v1.12

  • Python version: 2.7.12

  • Bazel version (if compiling from source): 0.19.0

  • GCC/Compiler version (if compiling from source): 5.4.0

  • CUDA/cuDNN version: 9.0/7.0.5

  • GPU model and memory: 1080 Ti

Describe the current behavior

I optimize a TensorFlow graph with:

    precision_mode = 'FP32'  # "FP32","FP16" or "INT8"
    graph_def = trt.create_inference_graph(
        input_graph_def=graph_def,
        outputs=output_node_names,
        max_batch_size=num_cameras,
        max_workspace_size_bytes=4*10**9,
        precision_mode=precision_mode,
        minimum_segment_size=10,  # minimum number of nodes in an engine,
    )

I then save the resulting graph and try to load it in a C++ program using the C API.

First, I call

TF_LoadLibrary("/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tensorrt/python/ops/_trt_engine_op.so", status)

and call TF_GraphImportGraphDef with the optimized graph.
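For reference, here is a minimal sketch of this load-then-import sequence (not my exact application code; the serialized GraphDef is assumed to be available as graph_def_data/graph_def_len):

    #include <tensorflow/c/c_api.h>
    #include <cstddef>
    #include <cstdio>

    // Load the TRT op library, then import the TRT-optimized GraphDef.
    void ImportOptimizedGraph(const void* graph_def_data, size_t graph_def_len) {
      TF_Status* status = TF_NewStatus();

      // Registers the TRTEngineOp op with the runtime.
      TF_Library* lib = TF_LoadLibrary(
          "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tensorrt/"
          "python/ops/_trt_engine_op.so",
          status);
      if (TF_GetCode(status) != TF_OK) {
        std::fprintf(stderr, "TF_LoadLibrary: %s\n", TF_Message(status));
      }

      // This is the call that fails with the shape-inference error below.
      TF_Graph* graph = TF_NewGraph();
      TF_Buffer* graph_def = TF_NewBufferFromString(graph_def_data, graph_def_len);
      TF_ImportGraphDefOptions* opts = TF_NewImportGraphDefOptions();
      TF_GraphImportGraphDef(graph, graph_def, opts, status);
      if (TF_GetCode(status) != TF_OK) {
        std::fprintf(stderr, "TF_GraphImportGraphDef: %s\n", TF_Message(status));
      }

      TF_DeleteImportGraphDefOptions(opts);
      TF_DeleteBuffer(graph_def);
      TF_DeleteGraph(graph);
      if (lib != nullptr) TF_DeleteLibraryHandle(lib);
      TF_DeleteStatus(status);
    }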

I get the following error:

TF_GraphImportGraphDef: No shape inference function exists for op 'TRTEngineOp', did you forget to define it?

Describe the expected behavior

The call to TF_GraphImportGraphDef should succeed.

Code to reproduce the issue

It seems that the issue, although not filed in this bug tracker, is already known to the authors: https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/contrib/tensorrt/ops/trt_engine_op.cc#L46
However, I can provide a minimal example to reproduce the problem on demand.

Other info / logs

It is a pain that a TRT-optimized graph currently cannot be used outside of Python.
I would be happy to know about a workaround, in case one exists.

@samikama
Contributor

Hello @yegord,

Could you please link your application with trt_conversion.so and trt_engine_op_op_lib?

@sujitbiswas

@samikama

This is with respect to #23243.

Can you please tell me where the library “trt_conversion.so” is located? There is a library named _wrap_conversion.so:
 

TensorFlow.loadLibrary("/home/sujitb/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/tensorrt/python/ops/_trt_engine_op.so")
TensorFlow.loadLibrary("/home/sujitb/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/tensorrt/_wrap_conversion.so")
 
Exception in thread "main" java.lang.UnsatisfiedLinkError: /home/sujitb/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/tensorrt/_wrap_conversion.so: undefined symbol: _Py_NoneStruct
                at org.tensorflow.TensorFlow.loadLibrary(TensorFlow.java:47)
                at com.nvidia.tf.InspectModel2$.main(InspectModel2.scala:22)
                at com.nvidia.tf.InspectModel2.main(InspectModel2.scala)

@asimshankar
Contributor

So there seem to be two issues here:

  1. There is no shape inference function registered. (The Python API uses a backdoor to tolerate that, but ideally we want all operations to have a shape inference function, even if that function just reports "unknown shape", and we want to get rid of that backdoor; a minimal sketch of such a registration follows after this list.) I'll try a fix for that.

  2. For reasons I'm not quite clear on (@aaroey @samikama may know), the TRTEngine operation's kernel is included in a Python-specific target (//tensorflow/contrib/tensorrt:wrap_conversion) instead of in the shared library for the op. I suspect/hope this can be changed to make the kernel independent of Python. Will look into it.
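For illustration, here is roughly what registering a shape inference function looks like (a generic REGISTER_OP sketch with a hypothetical op name, not the actual TRTEngineOp registration); even a shape function that only reports unknown shapes is enough for the importer to accept the op:

    #include "tensorflow/core/framework/common_shape_fns.h"
    #include "tensorflow/core/framework/op.h"

    // Hypothetical op used only to show the pattern; the actual fix enables the
    // commented-out shape function on TRTEngineOp in trt_engine_op.cc.
    REGISTER_OP("MyEngineOp")
        .Input("in_tensor: float")
        .Output("out_tensor: float")
        // Declares every output shape as unknown, which still satisfies
        // TF_GraphImportGraphDef's requirement that a shape function exists.
        .SetShapeFn(tensorflow::shape_inference::UnknownShape);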

@yegord
Contributor Author

yegord commented Nov 27, 2018

@asimshankar Excellent summary, thanks!

@samikama So, I cherry-picked the patch enabling the shape function (4fbbeea).
I also applied the following changes:

diff --git a/tensorflow/BUILD b/tensorflow/BUILD
index 9b62a50..254ad51 100644
--- a/tensorflow/BUILD
+++ b/tensorflow/BUILD
@@ -443,6 +443,9 @@ tf_cc_shared_object(
         "//tensorflow/c:version_script.lds",
         "//tensorflow/c/eager:c_api",
         "//tensorflow/core:tensorflow",
+        "//tensorflow/contrib/tensorrt:trt_conversion",
+        "//tensorflow/contrib/tensorrt:trt_engine_op_op_lib",
+        "//tensorflow/contrib/tensorrt:trt_engine_op_kernel",
     ],
 )

(Somehow without trt_engine_op_kernel the kernel was not successfully registered.)

As a result, I get the expected performance boost of around 10% over the vanilla TensorFlow graph.
However, my C++ application starts crashing after running for a few dozen seconds, with the following message:

2018-11-27 19:19:10.135505: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-27 19:19:11.164065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 19:19:11.164123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2018-11-27 19:19:11.164136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2018-11-27 19:19:11.164648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9426 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2018-11-27 19:19:52.094600: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2018-11-27 19:19:52.094660: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
ssd_detection2:terminate_handler.cpp:25: terminate_handler(): abort
0. /usr/lib/libassert.so(+0x32d0) [0x7fcafecdb2d0]
1. /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fcad54544b0]
2. /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38) [0x7fcad5454428]
3. /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a) [0x7fcad545602a]
4. /usr/lib/libtensorflow_framework.so(+0x6eeab7) [0x7fcaca33bab7]
5. /usr/lib/libtensorflow_framework.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0xf3) [0x7fcaca306633]
6. /usr/lib/libtensorflow_framework.so(_ZN10tensorflow8EventMgr8PollLoopEv+0xce) [0x7fcaca306dee]
7. /usr/lib/libtensorflow_framework.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x241) [0x7fcaca30c441]
8. /usr/lib/libtensorflow_framework.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x37) [0x7fcaca30a007]
9. /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fcad5dc0c80]
10. /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fcae428c6ba]
11. /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fcad552641d]

The application runs TF_SessionRun on the same session from two threads in parallel.
If I disable one of the threads, the crash goes away.
So it is either a plain OOM (caught too late) or some data race.
Does this ring a bell for anybody?
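For context, a minimal sketch of the calling pattern (the session/graph setup and the MakeInputTensor() helper are assumed to exist elsewhere; this is not the actual application code):

    #include <tensorflow/c/c_api.h>
    #include <thread>

    TF_Tensor* MakeInputTensor();  // hypothetical helper building one input tensor

    // One inference loop; both threads below share the same TF_Session.
    void RunLoop(TF_Session* session, TF_Output input_op, TF_Output output_op) {
      for (int i = 0; i < 1000; ++i) {
        TF_Status* status = TF_NewStatus();
        TF_Tensor* input = MakeInputTensor();
        TF_Tensor* output = nullptr;
        TF_SessionRun(session, /*run_options=*/nullptr,
                      &input_op, &input, 1,
                      &output_op, &output, 1,
                      /*target_opers=*/nullptr, 0,
                      /*run_metadata=*/nullptr, status);
        // Error handling omitted for brevity.
        TF_DeleteTensor(input);
        if (output != nullptr) TF_DeleteTensor(output);
        TF_DeleteStatus(status);
      }
    }

    // Two parallel callers on one session; disabling one of them avoids the crash.
    void RunInParallel(TF_Session* session, TF_Output in, TF_Output out) {
      std::thread t1(RunLoop, session, in, out);
      std::thread t2(RunLoop, session, in, out);
      t1.join();
      t2.join();
    }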

@aaroey aaroey self-assigned this Nov 27, 2018
@aaroey
Member

aaroey commented Nov 27, 2018

@pooyadavoodi may have an idea for the crash problem.
Also, @yegord do you have a repro for that? Thanks.

@yegord
Contributor Author

yegord commented Nov 27, 2018

If you have a hypothesis, shoot; I will check it.
If the cause is not that clear, I will make a minimal example, but it will take another day or so.

@samikama
Contributor

@yegord I thought TF_SessionRun() was not thread-safe.

@pooyadavoodi

Could you reduce max_workspace_size_bytes and also use allow_growth in the session config, and see if the problem persists?

@asimshankar
Contributor

@samikama : TF_SessionRun is thread-safe (and op kernels are supposed to be too).

@pooyadavoodi


@yegord Could you provide a repro? We need to look into the kernel registration issue that you mentioned above.

@yegord
Contributor Author

yegord commented Nov 28, 2018

So it is either a plain OOM (caught too late) or some data race.

It does not look like an OOM, because reducing the input image size substantially (6-fold or so) does not fix it.

use allow_growth in the session config

Already there.

reduce max_workspace_size_bytes

Reducing from 4*10**9 to 2*10**9 does not make the crash go away.

I'll be back with a repro then.

@yegord
Contributor Author

yegord commented Nov 30, 2018

Please find the repro here: https://github.com/yegord/tf-trt-linking-and-data-race-example
make && ./main should reproduce the crash.

The error message that I personally observe is here: https://github.com/yegord/tf-trt-linking-and-data-race-example/blob/master/crash.txt

The TensorFlow version being used (v1.12 with two patches: uncommenting the shape function and linking the TensorRT operation into libtensorflow.so): https://github.com/yegord/tensorflow/tree/issue-23853

TensorFlow is installed with:

    bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package //tensorflow:libtensorflow.so &&
    bazel-bin/tensorflow/tools/pip_package/build_pip_package .. &&
    sudo pip uninstall -y tensorflow; sudo pip install ../tensorflow*.whl &&
    sudo cp bazel-bin/tensorflow/*.so /usr/lib &&
    sudo mkdir -p /usr/lib/tensorflow/c &&
    sudo cp tensorflow/c/c_api.h /usr/include/tensorflow/c

The repro demonstrates two points. First, there should be a way for an external user to link against something from TensorFlow (without patching TensorFlow like I did) and be able to start using the TensorRT operation. Second, parallel calls to TensorRT operations should not crash the process.

@yegord
Contributor Author

yegord commented Nov 30, 2018

As for the crash, it might happen because you seem to call enqueue on the same nvinfer1::IExecutionContext instance in parallel:

auto ret = trt_execution_context_ptr->enqueue(num_batch, &buffers[0], *stream,

And, as my coworker noticed, this is not thread-safe: https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#thread-safety

@samikama
Contributor

samikama commented Nov 30, 2018

@yegord,
That is correct. Adding a mutex around the call should solve it, since enqueue is pretty lightweight. There were plans to move away from a class-member execution context, but things got reprioritized. I will take a look at your example and get back to you.
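For illustration, a minimal sketch of that suggestion (the class and member names here are illustrative, not the actual TF-TRT code): serialize enqueue calls on the shared execution context with a mutex.

    #include <mutex>
    #include <NvInfer.h>
    #include <cuda_runtime_api.h>

    // Wraps a shared IExecutionContext so concurrent op invocations serialize
    // their enqueue() calls; the GPU work itself stays asynchronous on the stream.
    class GuardedExecutionContext {
     public:
      explicit GuardedExecutionContext(nvinfer1::IExecutionContext* ctx) : ctx_(ctx) {}

      bool Enqueue(int num_batch, void** bindings, cudaStream_t stream) {
        // enqueue() on a single IExecutionContext is not thread-safe, so hold
        // the lock only for the (cheap) enqueue call.
        std::lock_guard<std::mutex> lock(mu_);
        return ctx_->enqueue(num_batch, bindings, stream, /*inputConsumed=*/nullptr);
      }

     private:
      std::mutex mu_;
      nvinfer1::IExecutionContext* ctx_;  // owned elsewhere
    };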

@aaroey
Member

aaroey commented Jan 11, 2019

I can reproduce the error, and adding a mutex did solve the problem. I'll make a fix soon.

tensorflow-copybara pushed a commit that referenced this issue Jan 11, 2019
… of their

operations are not thread safe.

This fixed one of the issues mentioned in
#23853

PiperOrigin-RevId: 228947504
aaroey added a commit to aaroey/tensorflow that referenced this issue Jan 11, 2019
… of their

operations are not thread safe.

This fixed one of the issues mentioned in
tensorflow#23853

PiperOrigin-RevId: 228947504
tensorflow-copybara pushed a commit that referenced this issue Jan 17, 2019
1. the new calibration design. The current int8 calibration workflow depends on
   a global resource manager singleton TRTResourceManager (in
   resources/trt_resource_manager.h). This has been:
   - violating the resource manager design: the resource manager should be
     per-device
   - polluting the BUILD dependencies, which makes the kernel implementation
     unusable from other language bindings (Issue #23853)
2. the custom backend offline mode, where we'll do the conversion during
   execution and provide an offline tool to get the serialized engine

PiperOrigin-RevId: 229654702
@yegord
Contributor Author

yegord commented Jan 22, 2019

Apparently, there is also a data race leading to a crash during the parallel creation of multiple dynamic int8 engines. Could you have a look?

The repro is here: https://github.com/yegord/tf-trt-linking-and-data-race-example/tree/crash-with-int8-engines

The error message that I see: https://github.com/yegord/tf-trt-linking-and-data-race-example/blob/crash-with-int8-engines/crash.txt

The TensorFlow version used is 1.12 with a few patches from you: https://github.com/yegord/tensorflow/tree/issue-23853-2

Build instructions are as above: #23853 (comment)

(I hope you do not mind that I am piling remotely related issues into a single ticket.)

Thanks!

@aaroey
Member

aaroey commented Jan 30, 2019

Thanks for the repro @yegord, I'll try it and get back to you.

tensorflow-copybara pushed a commit that referenced this issue Mar 13, 2019
…he trt

grappler optimizer, op kernels, and ops. This library will be included in pip
build, so users can use TF-TRT without building TF from source in C++.

This solves an issue mentioned in #23853 (TF-TRT not loadable with
TF_GraphImportGraphDef).

PiperOrigin-RevId: 238140294
@aaroey
Member

aaroey commented Mar 13, 2019

The original problem in this issue is fixed: if we install the latest nightly pip package, we should see the TF-TRT shared library in site-packages/tensorflow/compiler/tf2tensorrt/python/ops/libtftrt.so.

@yegord, to solve your linking problem, I have an example in https://github.com/aaroey/tensorflow/blob/issue_repros/test/fixed-issue23853/Makefile#L14.
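For illustration, loading that pip-installed library from the C API follows the same TF_LoadLibrary pattern as before (the path prefix below is only an example; adjust it to your site-packages location):

    #include <tensorflow/c/c_api.h>

    // Registers the TF-TRT ops and kernels from the pip-installed shared library,
    // so no custom TensorFlow build is needed. The prefix is an example path.
    TF_Library* LoadTfTrt(TF_Status* status) {
      return TF_LoadLibrary(
          "/usr/local/lib/python3.6/dist-packages/tensorflow/compiler/"
          "tf2tensorrt/python/ops/libtftrt.so",
          status);
    }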

@aaroey
Member

aaroey commented Mar 13, 2019

For the INT8 calibration problem, I believe it's fixed at HEAD. Here is a (fixed) repro for it: https://github.com/aaroey/tensorflow/blob/issue_repros/test/fixed-issue23853-int8/Makefile

I'm closing this; feel free to let me know if there are any questions. Thanks.

@aaroey aaroey closed this as completed Mar 13, 2019