
How to deploy maskRCNN pytorch model? #418

Closed

achbogga opened this issue Jun 27, 2019 · 15 comments

@achbogga

@deadeyegoodwin, how should one think about deploying PyTorch models, some of which might not yet be supported for fully automatic TensorRT conversion? For example, can you point me to any example with an FPN-101-backbone Mask R-CNN PyTorch model?

@bezero

bezero commented Jun 27, 2019

@achbogga you can use an ONNX conversion of your PyTorch model and deploy that. However, it is not straightforward. You can check this model, which I believe was created using the maskrcnn-benchmark repository. From what I saw, TRTIS is working on integrating PyTorch models as well, but you have to write your model using TorchScript in order to be able to freeze it with torch.jit.trace. If there are other possible solutions, I would like to know as well.
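
A minimal sketch of the ONNX route, assuming a model whose ops are all exportable; as noted above, the full Mask R-CNN from maskrcnn-benchmark is not this straightforward. The backbone, names, and shapes here are illustrative:

import torch
import torchvision

# Illustrative: export a plain backbone to ONNX. Exporting the full
# Mask R-CNN needs custom handling because of unsupported ops.
model = torchvision.models.resnet50(pretrained=True).eval()
dummy = torch.rand(1, 3, 224, 224)

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=10,
)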

@ligonzheng

Using Python Flask is an easy solution.
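
For reference, a minimal sketch of the Flask approach, assuming a traced model.pt; the /predict endpoint and the JSON input format are hypothetical, and preprocessing is model-specific:

import torch
from flask import Flask, request, jsonify

app = Flask(__name__)
model = torch.jit.load("model.pt").eval()  # hypothetical traced TorchScript model

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"input": [...]} holding a CHW float image.
    data = torch.tensor(request.get_json()["input"], dtype=torch.float32)
    with torch.no_grad():
        out = model(data.unsqueeze(0))
    return jsonify(out.squeeze(0).tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)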

@achbogga
Author

@ligonzheng can you please elaborate?

@deadeyegoodwin
Contributor

@achbogga As @bezero mentioned, your best bet is to use the native PyTorch support if possible (in master, and will be in 19.07). In 19.06 we also added ONNX support, so it may be possible to export PyTorch -> ONNX and use the resulting model.

@achbogga
Author

achbogga commented Aug 7, 2019

Regarding the PyTorch support in r19.07-py3:

There is no official PyTorch example from NVIDIA showing how to write config.pbtxt for pytorch_libtorch models; any official example would help. Also, when I tried with --strict-model-config=false, I got the following error while serving our Mask R-CNN model, forked from the facebook maskrcnn-benchmark GitHub repo. Any help is appreciated. If anyone got the model to work via the ONNX route, I would appreciate an example code snippet for exporting the fb-maskrcnn model to ONNX along with the corresponding config.pbtxt.

nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v/home/caffe/.bashrc:/.bashrc -v/home/caffe/.bash_history:/.bash_history -v/home/caffe/achu/model_repo:/models -p8000:8000 -p8001:8001 -p8002:8002 nvcr.io/nvidia/tensorrtserver:19.07-py3 trtserver --model-store=/models  --strict-model-config=false

===============================
== TensorRT Inference Server ==
===============================

NVIDIA Release 19.07 (build 7353475)

Copyright (c) 2018-2019, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

I0807 22:52:48.954956 1 main.cc:346] Starting endpoints, 'inference:0' listening on
I0807 22:52:48.955207 1 grpc_server.cc:272] Starting a GRPCService at 0.0.0.0:8001
I0807 22:52:48.955246 1 grpc_server.cc:278] Register TensorRT GRPCService
I0807 22:52:48.955284 1 grpc_server.cc:281] Register Infer RPC
I0807 22:52:48.955298 1 grpc_server.cc:285] Register StreamInfer RPC
I0807 22:52:48.955311 1 grpc_server.cc:290] Register Status RPC
I0807 22:52:48.955318 1 grpc_server.cc:294] Register Profile RPC
I0807 22:52:48.955326 1 grpc_server.cc:298] Register Health RPC
I0807 22:52:48.955370 1 grpc_server.cc:310] Register Executor
I0807 22:52:48.962232 1 http_server.cc:632] Starting HTTPService at 0.0.0.0:8000
I0807 22:52:49.004117 1 http_server.cc:646] Starting Metrics Service at 0.0.0.0:8002
I0807 22:52:49.006389 1 metrics.cc:160] found 1 GPUs supporting NVML metrics
I0807 22:52:49.011886 1 metrics.cc:169]   GPU 0: Tesla V100-SXM2-16GB
I0807 22:52:49.017681 1 server.cc:111] Initializing TensorRT Inference Server
I0807 22:52:49.052634 1 server_status.cc:83] New status tracking for model 'fridge_libtorch'
I0807 22:52:49.052687 1 model_repository_manager.cc:633] loading: fridge_libtorch:1
I0807 22:52:49.357929 1 libtorch_backend.cc:220] Creating instance fridge_libtorch_0_gpu0 on GPU 0 (7.0) using model.pt
E0807 22:52:49.649415 1 model_repository_manager.cc:779] failed to load 'fridge_libtorch' version 1: Internal: load failed for libtorch model -> 'fridge_libtorch': [enforce fail at inline_container.cc:137] . PytorchStreamReader failed reading zip archive: failed finding central directory
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*) + 0x78 (0x7f313c709ce8 in /opt/tensorrtserver/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::valid(char const*) + 0x8d (0x7f3140fe8e2d in /opt/tensorrtserver/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::init() + 0xa6 (0x7f3140fed136 in /opt/tensorrtserver/lib/libtorch.so)
frame #3: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::unique_ptr<caffe2::serialize::ReadAdapterInterface, std::default_delete<caffe2::serialize::ReadAdapterInterface> >) + 0x53 (0x7f3140ff0fe3 in /opt/tensorrtserver/lib/libtorch.so)
frame #4: <unknown function> + 0x5566a1e (0x7f3141e88a1e in /opt/tensorrtserver/lib/libtorch.so)
frame #5: torch::jit::load(std::unique_ptr<caffe2::serialize::ReadAdapterInterface, std::default_delete<caffe2::serialize::ReadAdapterInterface> >, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xd7 (0x7f3141e92087 in /opt/tensorrtserver/lib/libtorch.so)
frame #6: torch::jit::load(std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x79 (0x7f3141e92229 in /opt/tensorrtserver/lib/libtorch.so)
frame #7: <unknown function> + 0x1f63c3 (0x7f31d00b23c3 in /opt/tensorrtserver/lib/libtrtserver.so)
frame #8: <unknown function> + 0x1f70d3 (0x7f31d00b30d3 in /opt/tensorrtserver/lib/libtrtserver.so)
frame #9: <unknown function> + 0x1f0224 (0x7f31d00ac224 in /opt/tensorrtserver/lib/libtrtserver.so)
frame #10: <unknown function> + 0xbe924 (0x7f31cff7a924 in /opt/tensorrtserver/lib/libtrtserver.so)
frame #11: <unknown function> + 0xbf315 (0x7f31cff7b315 in /opt/tensorrtserver/lib/libtrtserver.so)
frame #12: <unknown function> + 0xbd66f (0x7f31cfbf066f in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #13: <unknown function> + 0x76db (0x7f313a0cc6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #14: clone + 0x3f (0x7f31cf64b88f in /lib/x86_64-linux-gnu/libc.so.6)

@CoderHam
Contributor

CoderHam commented Aug 7, 2019

The PyTorch backend runs the C++ backend (LibTorch), which requires a TorchScript model. You can produce one by tracing your existing PyTorch model as shown here: https://pytorch.org/tutorials/advanced/cpp_export.html
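
In short, following that tutorial (a sketch; the model and example input shape are illustrative):

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)  # float32 example input

# Tracing records the ops executed on the example input and produces a
# TorchScript module that LibTorch can load without a Python runtime.
traced = torch.jit.trace(model, example)
traced.save("model.pt")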

Autofill for PyTorch is very limited due to the lack of information stored in the model file.

The docs describe the naming convention and other related information on how to create the config.pbtxt file for PyTorch models: https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#model-configuration
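
For reference, the model repository layout the server expects for a libtorch model looks like this (using the fridge_libtorch name from the log above). The "PytorchStreamReader failed reading zip archive" error is typically what you see when model.pt is not a TorchScript archive, for example a plain torch.save checkpoint:

model_repo/
└── fridge_libtorch/
    ├── config.pbtxt
    └── 1/
        └── model.pt   <- must be saved via traced.save(), not torch.save()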

@achbogga
Author

achbogga commented Aug 8, 2019

@CoderHam An official example of creating a config.pbtxt for a PyTorch model should be provided by NVIDIA, for the sake of all the noobs out there who cannot make sense of the documentation. Can you guys provide at least one example? Please reopen this issue.

@achbogga
Author

achbogga commented Aug 9, 2019

BTW, I have tried both the torch.jit.trace and torch.jit.script routes. Nothing seems to work for the Mask R-CNN models. I encountered the following errors, respectively:

1. RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'other'
2. UnsupportedNodeError: GeneratorExp aren't supported
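
On the second error: TorchScript's compiler does not support generator expressions, and the usual workaround is to rewrite them as list comprehensions or explicit loops. A minimal illustration (not taken from the Mask R-CNN code itself):

import torch
from typing import List

class ScaleAll(torch.nn.Module):
    def forward(self, xs: List[torch.Tensor]) -> List[torch.Tensor]:
        # list(x * 2 for x in xs) would fail to compile with
        # "UnsupportedNodeError: GeneratorExp aren't supported";
        # the equivalent list comprehension scripts cleanly.
        return [x * 2 for x in xs]

scripted = torch.jit.script(ScaleAll())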

@CoderHam
Contributor

CoderHam commented Aug 9, 2019

You need to trace the PyTorch model with a float tensor. Can you confirm you are doing that?
Here is a sample config.pbtxt:

platform: "pytorch_libtorch"
max_batch_size: 128
input {
    name: "INPUT__0"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
output {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "resnet50_labels.txt"
  }

You can follow this template to create a similar config file for your model.
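
As a quick sanity check on the float-tensor point (hypothetical values; the normalization is model-specific):

import torch

example = torch.randint(0, 255, (1, 3, 224, 224))  # dtype=torch.int64 (Long); tracing with this can raise the Float/Long error
example = example.float() / 255.0                   # dtype=torch.float32; safe to pass to torch.jit.trace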

@achbogga
Author

achbogga commented Aug 9, 2019

Yes, I have tried tracing using a float tensor. Same result; it did not work.

@c464851257

Hello, I have tried tracing using a float tensor. How do I get the input name and output name?

@okyspace
Contributor

okyspace commented May 5, 2021

Did anyone manage to solve this? I got the server to run with a PyTorch model, but I get HTTP 400 when doing inference.
The config.pbtxt follows the sample above. I have also tested my input with model inference in a Colab notebook and it works, but the same input against the model deployed in Triton gives HTTP 400.

Is there any way to print more logs to identify the error? I tried adding --log-verbose=5, but there were not enough logs to pin down the error.

Appreciate the help

@CoderHam
Contributor

CoderHam commented May 5, 2021

@okyspace Did you try tracing the model as advised here?
Also, try --log-verbose=1. You should see a failure logged in the server when the inference fails to execute.

@c464851257 PyTorch models do not have a concept of input names. The INPUT names should follow the Triton convention as described here.
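
To make that convention concrete, a minimal client sketch against the resnet50-style config above, using the tritonclient HTTP API (the model name "mymodel" is illustrative):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Names follow Triton's libtorch convention: INPUT__<i> / OUTPUT__<i>,
# where <i> is the positional index of forward()'s argument / return value.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", list(image.shape), "FP32")
inp.set_data_from_numpy(image)
out = httpclient.InferRequestedOutput("OUTPUT__0")

result = client.infer("mymodel", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT__0").shape)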

@okyspace
Contributor

okyspace commented May 5, 2021

Thanks. I reviewed it again and found that the error was due to my output shape being incorrect.
By the way, I was thinking of contributing my PyTorch config as an example so it can benefit more people, since the examples provided are mostly TensorFlow.

Also, if you could point me to examples that deploy a pre-trained model like YOLO, that would be great. Thanks!
