
Core dump when loading a model with a config containing repoagent in explicit mode #5189

Closed
zhaotyer opened this issue Dec 21, 2022 · 4 comments
Labels
question Further information is requested

Comments

@zhaotyer

Description
Tritonserver runs in explicit mode, and a repoagent is used in the model's config.pbtxt. When I load the model with config via the HTTP API twice in a row, the server crashes. Everything works normally when no repoagent is used.

Triton Information
tritonserver:22.03
docker version: nvcr.io/nvidia/tritonserver:22.03-py3

Are you using the Triton container or did you build it yourself?
Using the Triton container (not built from source).

To Reproduce
step1:
docker run --rm -it --shm-size=20G -v /home/zty/Downloads:/tmp/models -p8000:8000 -p8001:8001 -p8002:8002 nvcr.io/nvidia/tritonserver:22.03-py3 bash
step2:
tritonserver --model-repository=/tmp/models/value-added-tax/model/v1/ --model-control-mode=explicit --log-verbose=1
step3:
Load the model with config via the 'v2/repository/models/{}/load' API.
step4:
Repeat the same load request.
The server crashes on the second load.
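The load request used in steps 3 and 4 can be sketched as follows. This is a minimal illustration, not taken from the issue: the model name and config values are placeholders, and it assumes the repository load endpoint accepts the model config as a JSON string in the "config" parameter, as described in Triton's model-repository protocol extension.

```python
import json

def build_load_request(model_name, model_config):
    """Return (url_path, json_body) for Triton's explicit-mode load API.

    The load endpoint optionally takes the model configuration as a JSON
    string in the "config" parameter of the request body (Triton
    model-repository protocol extension).
    """
    path = f"v2/repository/models/{model_name}/load"
    body = {"parameters": {"config": json.dumps(model_config)}}
    return path, body

# Placeholder config for illustration only; the real config is in the
# attached config.txt.
path, body = build_load_request(
    "text_detection_ppocrv3",
    {"backend": "python", "max_batch_size": 8},
)
# Sending this POST twice in a row is what triggers the crash, e.g.:
#   curl -X POST localhost:8000/v2/repository/models/text_detection_ppocrv3/load \
#        -d '{"parameters": {"config": "..."}}'
```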

The log is:

I1221 12:25:07.773030 7504 model_repository_manager.cc:956] no next action, trigger OnComplete()
I1221 12:25:07.773069 7504 model_repository_manager.cc:942] TriggerNextAction() 'text_detection_ppocrv3' version 1: 2
I1221 12:25:07.773087 7504 model_repository_manager.cc:1022] Unload() 'text_detection_ppocrv3' version 1
I1221 12:25:07.773102 7504 model_repository_manager.cc:1029] unloading: text_detection_ppocrv3:1
I1221 12:25:07.773248 7504 model_repository_manager.cc:549] VersionStates() 'text_detection_ppocrv3'
I1221 12:25:07.773316 7504 model_repository_manager.cc:549] VersionStates() 'text_detection_ppocrv3'
I1221 12:25:07.773334 7504 triton_model_instance.cc:693] Stopping backend thread for text_detection_ppocrv3...
I1221 12:25:07.773454 7504 python.cc:1993] TRITONBACKEND_ModelInstanceFinalize: delete instance state
Cleaning up...
I1221 12:25:09.094515 7504 triton_model_instance.cc:693] Stopping backend thread for text_detection_ppocrv3...
I1221 12:25:09.094619 7504 python.cc:1993] TRITONBACKEND_ModelInstanceFinalize: delete instance state
Cleaning up...
I1221 12:25:10.147369 7504 python.cc:1882] TRITONBACKEND_ModelFinalize: delete model state
I1221 12:25:10.147578 7504 model_repository_manager.cc:1133] OnDestroy callback() 'text_detection_ppocrv3' version 1
I1221 12:25:10.147598 7504 model_repository_manager.cc:1135] successfully unloaded 'text_detection_ppocrv3' version 1
I1221 12:25:10.147648 7504 model_repository_manager.cc:942] TriggerNextAction() 'text_detection_ppocrv3' version 1: 0
I1221 12:25:10.147673 7504 model_repository_manager.cc:956] no next action, trigger OnComplete()
double free or corruption (fasttop)
Signal (6) received.
0# 0x0000558D48C9F549 in tritonserver
1# 0x00007FF9BD2E40C0 in /lib/x86_64-linux-gnu/libc.so.6
2# gsignal in /lib/x86_64-linux-gnu/libc.so.6
3# abort in /lib/x86_64-linux-gnu/libc.so.6
4# 0x00007FF9BD32E29E in /lib/x86_64-linux-gnu/libc.so.6
5# 0x00007FF9BD33632C in /lib/x86_64-linux-gnu/libc.so.6
6# 0x00007FF9BD337C95 in /lib/x86_64-linux-gnu/libc.so.6
7# 0x00007FF9BDD05093 in /opt/tritonserver/bin/../lib/libtritonserver.so
8# 0x00007FF9BDD05D87 in /opt/tritonserver/bin/../lib/libtritonserver.so
9# 0x00007FF9BDD060B0 in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007FF9BD6D5DE4 in /lib/x86_64-linux-gnu/libstdc++.so.6
11# 0x00007FF9BDB52609 in /lib/x86_64-linux-gnu/libpthread.so.0
12# clone in /lib/x86_64-linux-gnu/libc.so.6

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
config.txt

Expected behavior
The server continues running normally.

@krishung5
Contributor

Hi @zhaotyer, thanks for sharing the steps to reproduce the issue. Could you try a newer version of Triton and see if the core dump still happens? I couldn't reproduce the issue with our 22.12 release.

@zhaotyer
Author

Thank you for your reply. The 22.12 release works fine. I'd like to know what changes were made, because I can't upgrade to the latest version.
The gdb backtrace of the core dump is:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff6bc7859 in __GI_abort () at abort.c:79
#2 0x00007ffff6c3226e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff6d5c298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007ffff6c3a2fc in malloc_printerr (str=str@entry=0x7ffff6d5e628 "double free or corruption (fasttop)") at malloc.c:5347
#4 0x00007ffff6c3bc65 in _int_free (av=0x7ffe58000020, p=0x7ffe580706d0, have_lock=0) at malloc.c:4266
#5 0x00007ffff75c8203 in std::_Function_base::_Base_manager<nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, inference::ModelConfig const&, std::shared_ptr<nvidia::inferenceserver::(anonymous namespace)::TritonRepoAgentModelList> const&, std::function<void (nvidia::inferenceserver::Status)>)::{lambda()#1}::operator()() const::{lambda()#2}>::_M_manager(std::_Any_data&, nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, inference::ModelConfig const&, std::shared_ptr<nvidia::inferenceserver::(anonymous namespace)::TritonRepoAgentModelList> const&, std::function<void (nvidia::inferenceserver::Status)>)::{lambda()#1}::operator()() const::{lambda()#2} const&, std::_Manager_operation) () from /opt/tritonserver/bin/../lib/libtritonserver.so
#6 0x00007ffff75c8ef7 in nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::TriggerNextAction(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, long, nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::ModelInfo*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
#7 0x00007ffff75c9220 in nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::CreateModel(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, long, nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::ModelInfo*)::{lambda()#1}::operator()() () from /opt/tritonserver/bin/../lib/libtritonserver.so
#8 0x00007ffff6fd9de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007ffff7456609 in start_thread (arg=) at pthread_create.c:477
#10 0x00007ffff6cc4133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@krishung5
Contributor

We did some refactoring and made many changes to the related code, so unfortunately it's hard to pinpoint which change fixed it. All the changes can be found in the history of the core repo. If upgrading really is not an option and you would like to modify the code yourself, building Triton r22.03 from source with the debug flag enabled would help with gdb debugging.

@krishung5 krishung5 added the question Further information is requested label Dec 28, 2022
@dyastremsky
Contributor

Closing issue due to inactivity. If you need follow-up, please let us know and we will reopen this issue.
