
Triton terminated with Signal (6) #4566

Closed
erichtho opened this issue Jun 30, 2022 · 5 comments
Labels: bug (Something isn't working)

erichtho commented Jun 30, 2022

When using the Triton gRPC client for inference, Triton will sometimes exit unexpectedly. For example:

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype
from torch.cuda.amp import autocast

# data_loader, preprocessor, and device are defined elsewhere in the project;
# data_loader is a torch DataLoader with 4 workers.
with grpcclient.InferenceServerClient('localhost:8001', verbose=False) as client:
    outputs = [
        grpcclient.InferRequestedOutput('logits'),
        grpcclient.InferRequestedOutput('embs'),
    ]

    for sent_count, test_batch in enumerate(data_loader):
        with autocast():
            processed_signal, processed_signal_length = preprocessor(
                input_signal=test_batch[0].to(device),
                length=test_batch[1].to(device),
            )
        inputs = [
            grpcclient.InferInput("audio_signal", list(processed_signal.shape), "FP16"),
            grpcclient.InferInput("length", [1, 1], np_to_triton_dtype(np.int32)),
        ]
        inputs[0].set_data_from_numpy(processed_signal.cpu().numpy().astype(np.float16))
        inputs[1].set_data_from_numpy(processed_signal_length.cpu().numpy().astype(np.int32).reshape(1, 1))
        result = client.infer(model_name="tensorrt_emb",
                              inputs=inputs,
                              outputs=outputs)

and the tritonserver output is:

terminate called after throwing an instance of 'nvinfer1::InternalError'
what(): Assertion mUsedAllocators.find(alloc) != mUsedAllocators.end() && "Myelin free callback called with invalid MyelinAllocator" failed.
Signal (6) received.
0# 0x00005602FC4F21B9 in tritonserver
1# 0x00007FC98736C0C0 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
4# 0x00007FC987725911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007FC98773138C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007FC987730369 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# 0x00007FC98752BBEF in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
9# _Unwind_RaiseException in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
10# __cxa_throw in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
11# nvinfer1::Lobber<nvinfer1::InternalError>::operator()(char const*, char const*, int, int, nvinfer1::ErrorCode, char const*) in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
12# 0x00007FC9020EECBC in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
13# 0x00007FC902A7220F in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
14# 0x00007FC902A2862D in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
15# 0x00007FC902A7F653 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
16# 0x00007FC9020EE715 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
17# 0x00007FC901C8BAD0 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
18# 0x00007FC9020F41F4 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
19# 0x00007FC902913FD8 in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
20# 0x00007FC90291478C in /usr/lib/x86_64-linux-gnu/libnvinfer.so.8
21# 0x00007FC97A57C6D7 in /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
22# 0x00007FC97A5855FE in /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
23# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
24# 0x00007FC987C1D73A in /opt/tritonserver/bin/../lib/libtritonserver.so
25# 0x00007FC987C1E0F7 in /opt/tritonserver/bin/../lib/libtritonserver.so
26# 0x00007FC987CDB411 in /opt/tritonserver/bin/../lib/libtritonserver.so
27# 0x00007FC987C175C7 in /opt/tritonserver/bin/../lib/libtritonserver.so
28# 0x00007FC98775DDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
29# 0x00007FC98896D609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
30# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Triton server version: 22.05-py3 (Docker image)
Backend: TensorRT
OS: Ubuntu 20.04

How To Reproduce
We use trtexec to convert an ONNX model to a TensorRT engine (with maxShapes=1x80x12000) and put it into the Triton model repository.
When we send dozens of requests with shape around 1x80x11000 (or e.g. 1x80x8000) while other models are receiving requests at the same time (different gRPC clients in different processes; not multiprocessing, but multiple .py scripts running), Triton exits by chance.
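A minimal sketch of this traffic pattern is shown below. It is not the actual reproduction code: it uses multiprocessing in one script rather than separate .py processes, and the worker count, request count, and random input data are assumptions for illustration; the server address, model name, tensor names, and dtypes are taken from the snippet above.

import multiprocessing as mp

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def send_requests(proc_id, num_requests=50, time_len=11000):
    # Each worker opens its own gRPC connection and sends near-max-shape requests.
    with grpcclient.InferenceServerClient("localhost:8001") as client:
        outputs = [
            grpcclient.InferRequestedOutput("logits"),
            grpcclient.InferRequestedOutput("embs"),
        ]
        for _ in range(num_requests):
            # Random FP16 features close to the engine's max shape of 1x80x12000.
            signal = np.random.randn(1, 80, time_len).astype(np.float16)
            length = np.array([[time_len]], dtype=np.int32)
            inputs = [
                grpcclient.InferInput("audio_signal", list(signal.shape), "FP16"),
                grpcclient.InferInput("length", [1, 1], np_to_triton_dtype(np.int32)),
            ]
            inputs[0].set_data_from_numpy(signal)
            inputs[1].set_data_from_numpy(length)
            client.infer(model_name="tensorrt_emb", inputs=inputs, outputs=outputs)

if __name__ == "__main__":
    # Several concurrent senders, approximating the "multiple .py scripts" setup.
    procs = [mp.Process(target=send_requests, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()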

rmccorm4 (Collaborator) commented

Hi @erichtho,

Thanks for reporting this issue.

  1. Can you reproduce this with the HTTP client as well, or is it only reproducible with GRPC?
  2. Can you share a complete client script, ONNX model, and example trtexec conversion command that we can use to easily reproduce this error as-is?

CC @GuanLuo @tanmayv25 if you've seen any TRT or similar backend issues like this before
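Regarding question 1, a minimal HTTP-client counterpart of the gRPC snippet in the issue body might look like the sketch below. It assumes the default HTTP port 8000 and reuses the model and tensor names from above; it is for comparison only, not the reporter's actual code.

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

client = httpclient.InferenceServerClient("localhost:8000")

# One request with the same tensor names and dtypes as the gRPC example.
signal = np.random.randn(1, 80, 11000).astype(np.float16)
length = np.array([[11000]], dtype=np.int32)

inputs = [
    httpclient.InferInput("audio_signal", list(signal.shape), "FP16"),
    httpclient.InferInput("length", [1, 1], np_to_triton_dtype(np.int32)),
]
inputs[0].set_data_from_numpy(signal)
inputs[1].set_data_from_numpy(length)

outputs = [
    httpclient.InferRequestedOutput("logits"),
    httpclient.InferRequestedOutput("embs"),
]

result = client.infer(model_name="tensorrt_emb", inputs=inputs, outputs=outputs)
logits = result.as_numpy("logits")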


erichtho commented Jul 1, 2022

Sorry, I can't share my code; it's part of a big project. I'm trying to simplify it, but I can't reproduce the error with the simplified code yet (still trying).
The ONNX model is exported from the NeMo TitaNet model (titanet-l.nemo). I edited the graph and removed the if-branch around a Squeeze node so that trtexec runs normally. That edit may cause problems, but the bug does not seem related to the input. Here is the converted ONNX file (Google Drive).
trtexec command:
./trtexec --onnx=titanet-l_2.onnx --minShapes=audio_signal:1x80x1,length:1 --optShapes=audio_signal:1x80x12000,length:1 --maxShapes=audio_signal:1x80x12000,length:1 --fp16 --inputIOFormats=fp16:chw,int32:chw --saveEngine=model.plan --workspace=16400
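As an aside, the optimization profile that trtexec bakes into model.plan (the min/opt/max shapes above) can be double-checked with the TensorRT Python API. The sketch below is illustrative only, written against the TensorRT 8.x API shipped in the 22.05 container, and is not part of the original report.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine produced by the trtexec command above.
with open("model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    name = engine.get_binding_name(i)
    if engine.binding_is_input(i):
        # get_profile_shape returns (min, opt, max) shapes for optimization profile 0.
        print(name, engine.get_profile_shape(0, name))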

As for the HTTP client, it gets a broken pipe error, so I can't tell whether it would also trigger the bug.
BrokenPipeError: [Errno 32] Broken pipe

tanmayv25 (Contributor) commented

The backtrace suggests that the error originates within TensorRT. I don't think the issue is client-specific.

When we send dozens of requests with shape around 1x80x11000 (or e.g. 1x80x8000) while other models are receiving requests at the same time (different gRPC clients in different processes; not multiprocessing, but multiple .py scripts running), Triton exits by chance.

I assume the issue only occurs with sufficient request concurrency? What is your instance count? Can you share your model configuration file?
Is your system running out of memory?
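(For reference, the configuration the server actually loaded, including the instance_group counts asked about here, can be fetched over gRPC with the standard client call; the sketch below assumes the address and model name used earlier in the thread.)

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

# get_model_config returns a ModelConfigResponse protobuf; .config mirrors config.pbtxt.
config = client.get_model_config("tensorrt_emb").config
print("max_batch_size:", config.max_batch_size)
for group in config.instance_group:
    print("instance_group:", group.kind, group.count)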


erichtho commented Jul 6, 2022

Yes, it's related to request concurrency, and it seems to happen more often when there are many requests with close to the maximum shape.
I checked with top, dmesg, and nvidia-smi; there seems to be no memory issue, either in CUDA memory (RTX 3090) or in system RAM.
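(For completeness, a minimal sketch of that kind of GPU-memory polling, using the pynvml bindings that back nvidia-smi, is shown below; the polling interval and device index are illustrative and the script was not part of the original checks.)

import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single RTX 3090 in this setup

try:
    # Poll used/total GPU memory once per second while the reproduction runs.
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"used {mem.used / 2**20:.0f} MiB / total {mem.total / 2**20:.0f} MiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()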
Model configuration (the one that hits the bug):

name: "tensorrt_emb"
backend: "tensorrt"
max_batch_size: 1

input [
  {
    name: "audio_signal"
    data_type: TYPE_FP32
    dims: [80, -1]
  }
]
input [
  {
    name: "length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP16
    dims: [ -1 ]
  }
]
output [
  {
    name: "embs"
    data_type: TYPE_FP16
    dims: [ -1 ]
  }
]

instance_group [
  {
    count: 3
    kind: KIND_GPU
  }
]

dynamic_batching {
  preferred_batch_size: [1]
  max_queue_delay_microseconds: 1
  preserve_ordering: true
}

There are two other models in the model repository; the total instance count is 3.

By the way, we tried Triton with the ONNX model instead, and it works normally.

tanmayv25 added the bug label on Jul 6, 2022
tanmayv25 (Contributor) commented

The TensorRT team seems to have a fix that resolves this issue. We are working with them to make the fix available to Triton users.
