Problem while serving a TensorRT plan #7

Closed
zoidburg opened this issue Nov 30, 2018 · 2 comments

@zoidburg

Hi,
I'm trying to serve a serialized TensorRT plan file on tensorrt-inference-server. While the server was initializing, a segmentation fault came up. The log looks like this:

I1130 15:23:00.955141 12581 server.cc:631] Initializing TensorRT Inference Server
I1130 15:23:00.955237 12581 server.cc:680] Reporting prometheus metrics on port 8002
I1130 15:23:01.874070 12581 metrics.cc:129] found 2 GPUs supported power usage metric
I1130 15:23:01.881493 12581 metrics.cc:139] GPU 0: Tesla P40
I1130 15:23:01.887018 12581 metrics.cc:139] GPU 1: Tesla P40
I1130 15:23:01.887897 12581 server.cc:884] Starting server 'inference:0' listening on
I1130 15:23:01.887919 12581 server.cc:888] localhost:8001 for gRPC requests
I1130 15:23:01.889030 12581 server.cc:898] localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 235] RAW: Entering the event loop ...
I1130 15:23:01.932019 12581 server_core.cc:465] Adding/updating models.
I1130 15:23:01.932078 12581 server_core.cc:520] (Re-)adding model: resnet50_plan
I1130 15:23:02.032174 12581 basic_manager.cc:739] Successfully reserved resources to load servable {name: resnet50_plan version: 1}
I1130 15:23:02.032211 12581 loader_harness.cc:66] Approving load for servable version {name: resnet50_plan version: 1}
I1130 15:23:02.032240 12581 loader_harness.cc:74] Loading servable version {name: resnet50_plan version: 1}
I1130 15:23:02.138975 12581 plan_bundle.cc:301] Creating instance resnet50_plan_0_0_gpu1 on GPU 1 (6.1) using model.plan
I1130 15:23:02.840773 12581 logging.cc:39] Glob Size is 56 bytes.
I1130 15:23:02.841601 12581 logging.cc:39] Added linear block of size 8589934597
I1130 15:23:02.841628 12581 logging.cc:39] Added linear block of size 18369233829710790662
I1130 15:23:02.841734 12581 logging.cc:39] Added linear block of size 47244640284
I1130 15:23:02.841755 12581 logging.cc:39] Added linear block of size 154618822688
I1130 15:23:02.841775 12581 logging.cc:39] Added linear block of size 18446744069414584508
I1130 15:23:02.841789 12581 logging.cc:39] Added linear block of size 17179869216
I1130 15:23:02.841804 12581 logging.cc:39] Added linear block of size 1651470960
I1130 15:23:02.841818 12581 logging.cc:39] Added linear block of size 1305670057985
I1130 15:23:02.841837 12581 logging.cc:39] Added linear block of size 773094113281
I1130 15:23:02.841853 12581 logging.cc:39] Added linear block of size 17179869185
I1130 15:23:02.841867 12581 logging.cc:39] Added linear block of size 38547291084
I1130 15:23:02.841881 12581 logging.cc:39] Added linear block of size 17179869200
Segmentation fault (core dumped)

and here is part of the backtrace from gdb:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `trtserver --model-store=/exchange/model_repository/'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 strlen () at ../sysdeps/x86_64/strlen.S:106
106 ../sysdeps/x86_64/strlen.S: No such file or directory.
[Current thread is 1 (Thread 0x7f6afff40700 (LWP 12791))]
(gdb) bt
#0 strlen () at ../sysdeps/x86_64/strlen.S:106
#1 0x00007f6e75ea3d01 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f6efdaf181a in nvinfer1::rt::Engine::deserialize(void const*, unsigned long, nvinfer1::IGpuAllocator&, nvinfer1::IPluginFactory*) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5
#3 0x00007f6efdaf6bd3 in nvinfer1::Runtime::deserializeCudaEngine(void const*, unsigned long, nvinfer1::IPluginFactory*) () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.5
#4 0x00007f6f144c2cc4 in nvidia::inferenceserver::PlanBundle::CreateExecutionContext(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, nvidia::inferenceserver::ModelConfig const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<char, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::vector<char, std::allocator<char> > > > > const&) ()
#5 0x00007f6f144c3a82 in nvidia::inferenceserver::PlanBundle::CreateExecutionContexts(nvidia::inferenceserver::ModelConfig const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<char, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::vector<char, std::allocator<char> > > > > const&) ()
#6 0x00007f6f144bc1f4 in nvidia::inferenceserver::(anonymous namespace)::CreatePlanBundle(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*) ()
#7 0x00007f6f144ba497 in std::_Function_handler<tensorflow::Status (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*), tensorflow::Status (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*)>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*&&) ()
#8 0x00007f6f144ba58c in std::_Function_handler<tensorflow::Status (std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*), tensorflow::serving::SimpleLoaderSourceAdapter<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, nvidia::inferenceserver::PlanBundle>::Convert(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<tensorflow::serving::Loader, std::default_delete<tensorflow::serving::Loader> >*)::{lambda(std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*)#1}>::_M_invoke(std::_Any_data const&, std::unique_ptr<nvidia::inferenceserver::PlanBundle, std::default_delete<nvidia::inferenceserver::PlanBundle> >*&&) ()
#9 0x00007f6f144bc86a in tensorflow::serving::SimpleLoader<nvidia::inferenceserver::PlanBundle>::Load() ()
#10 0x00007f6f14594099 in std::_Function_handler<tensorflow::Status (), tensorflow::serving::LoaderHarness::Load()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#11 0x00007f6f145964b7 in tensorflow::serving::Retry(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, long long, std::function<tensorflow::Status ()> const&, std::function<bool ()> const&) ()
#12 0x00007f6f145953d6 in tensorflow::serving::LoaderHarness::Load() ()
#13 0x00007f6f1459172d in tensorflow::serving::BasicManager::ExecuteLoad(tensorflow::serving::LoaderHarness*) ()
#14 0x00007f6f14591b4c in tensorflow::serving::BasicManager::ExecuteLoadOrUnload(tensorflow::serving::BasicManager::LoadOrUnloadRequest const&, tensorflow::serving::LoaderHarness*) ()
#15 0x00007f6f14593396 in tensorflow::serving::BasicManager::HandleLoadOrUnloadRequest(tensorflow::serving::BasicManager::LoadOrUnloadRequest const&, std::function<void (tensorflow::Status const&)>) ()
#16 0x00007f6f1459348f in std::_Function_handler<void (), tensorflow::serving::BasicManager::LoadOrUnloadServable(tensorflow::serving::BasicManager::LoadOrUnloadRequest const&, std::function<void (tensorflow::Status const&)>)::{lambda()#2}>::_M_invoke(std::_Any_data const&) ()
#17 0x00007f6f19e06579 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
#18 0x00007f6f19e04717 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#19 0x00007f6e75e8bc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007f6e76a936ba in start_thread (arg=0x7f6afff40700) at pthread_create.c:333
#21 0x00007f6e758fa41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I'm not sure whether something went wrong while serializing the plan file to build the inference engine. Do you have any ideas, or is there anything wrong with my plan file, config file, etc.?
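
To help narrow this down, here is a minimal sketch against the TensorRT 5 C++ API that exercises the same deserializeCudaEngine() call as frame #3 of the backtrace, outside the server. The plan path is a placeholder for the file in my model repository:

#include <NvInfer.h>

#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Minimal logger required to create a TensorRT runtime.
class Logger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) override {
    std::cerr << "[TRT] " << msg << std::endl;
  }
};

int main() {
  // Placeholder path: point this at the plan inside the model repository.
  std::ifstream file("/exchange/model_repository/resnet50_plan/1/model.plan",
                     std::ios::binary);
  if (!file) {
    std::cerr << "cannot open plan file" << std::endl;
    return 1;
  }
  std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                         std::istreambuf_iterator<char>());

  Logger logger;
  nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
  // The call that crashes in frame #3 of the backtrace above.
  nvinfer1::ICudaEngine* engine =
      runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);

  std::cout << (engine != nullptr ? "deserialized OK" : "deserialization failed")
            << std::endl;
  if (engine != nullptr) engine->destroy();
  runtime->destroy();
  return engine != nullptr ? 0 : 1;
}

Built against the same libnvinfer.so.5 that trtserver links, this either fails or crashes the same way when the plan doesn't match the runtime, which would point at the plan file rather than the server.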

I used these Docker images for testing:
tensorrtserver: nvcr.io/nvidia/tensorrtserver 18.09-py3
tensorrt: nvcr.io/nvidia/tensorrt 18.11-py3

@deadeyegoodwin (Contributor)

The error is likely caused by generating your TRT plan with a different version of TRT than the one used to execute it. You used the TRT version in the 18.11 TensorRT container to generate the plan and the version in the 18.09 TRTIS container to execute it... these are not the same version. Try the 18.11 version of the tensorrtserver container. Containers from the same release (18.11 in this case) use the same version of TRT.

TRT should give a better error message when this happens... in the future it will.
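
If you want to confirm which TRT version a given container ships, one quick check is to compare the header version against the linked libnvinfer (a minimal sketch; getInferLibVersion() and the NV_TENSORRT_* macros are declared in NvInfer.h):

#include <NvInfer.h>
#include <cstdio>

int main() {
  // Version of the TRT headers the program was compiled against.
  std::printf("headers:    %d.%d.%d\n", NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR,
              NV_TENSORRT_PATCH);
  // Version of the libnvinfer actually linked at runtime, encoded as
  // major * 1000 + minor * 100 + patch.
  std::printf("libnvinfer: %d\n", getInferLibVersion());
  return 0;
}

Running this inside the 18.09 and 18.11 containers should show two different versions, which is exactly the mismatch that breaks plan deserialization.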

@zoidburg (Author) commented Dec 3, 2018

Thanks a lot. I updated the trtserver image to 18.11-py3, tested with the very same plan file, and it worked.

zoidburg closed this as completed Dec 3, 2018