Triton Inference Server adding 3 seconds to get YOLOv4 inference #4079

Closed
dominusLabs opened this issue Mar 20, 2022 · 10 comments

@dominusLabs

Hello,

Just recently set up Triton to work with our YOLOv4/TensorRT model. However, I noticed that inference is taking a long time, ~3 seconds per request. This didn't seem right, so I fetched the verbose logs and got this back:

I0320 05:09:58.334630 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step START
I0320 05:09:58.334692 1 grpc_server.cc:227] Ready for RPC 'ServerLive', 4
I0320 05:09:58.334773 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step COMPLETE
I0320 05:09:58.334785 1 grpc_server.cc:413] Done for ServerLive, 3
I0320 05:09:58.405717 1 grpc_server.cc:272] Process for ServerReady, rpc_ok=1, 3 step START
I0320 05:09:58.405777 1 grpc_server.cc:227] Ready for RPC 'ServerReady', 4
I0320 05:09:58.405796 1 model_repository_manager.cc:522] ModelStates()
I0320 05:09:58.405874 1 grpc_server.cc:272] Process for ServerReady, rpc_ok=1, 3 step COMPLETE
I0320 05:09:58.405887 1 grpc_server.cc:413] Done for ServerReady, 3
I0320 05:09:58.477041 1 grpc_server.cc:272] Process for ModelReady, rpc_ok=1, 3 step START
I0320 05:09:58.477095 1 grpc_server.cc:227] Ready for RPC 'ModelReady', 4
I0320 05:09:58.477112 1 model_repository_manager.cc:590] GetModel() 'yolov4' version -1
I0320 05:09:58.477189 1 grpc_server.cc:272] Process for ModelReady, rpc_ok=1, 3 step COMPLETE
I0320 05:09:58.477202 1 grpc_server.cc:413] Done for ModelReady, 3
I0320 05:10:01.545416 1 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 6 step START
I0320 05:10:01.545473 1 grpc_server.cc:3413] New request handler for ModelInferHandler, 8
I0320 05:10:01.545488 1 model_repository_manager.cc:590] GetModel() 'yolov4' version -1
I0320 05:10:01.545503 1 model_repository_manager.cc:590] GetModel() 'yolov4' version -1
I0320 05:10:01.545589 1 infer_request.cc:675] prepared: [0x0x7fdef4006340] request id: , model: yolov4, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7fdee8007f38] input: input, type: FP32, original shape: [1,3,416,416], batch + shape: [1,3,416,416], shape: [1,3,416,416]
override inputs:
inputs:
[0x0x7fdee8007f38] input: input, type: FP32, original shape: [1,3,416,416], batch + shape: [1,3,416,416], shape: [1,3,416,416]
original requested outputs:
boxes
requested outputs:
boxes
I0320 05:10:01.545749 1 tensorrt.cc:5376] model yolov4, instance yolov4, executing 1 requests
I0320 05:10:01.545785 1 tensorrt.cc:1609] TRITONBACKEND_ModelExecute: Issuing yolov4 with 1 requests
I0320 05:10:01.545796 1 tensorrt.cc:1668] TRITONBACKEND_ModelExecute: Running yolov4 with 1 requests
I0320 05:10:01.545842 1 tensorrt.cc:2804] Optimization profile default [0] is selected for yolov4
I0320 05:10:01.545896 1 pinned_memory_manager.cc:161] pinned memory allocation: size 2076672, addr 0x7fdff8000090
I0320 05:10:01.546688 1 tensorrt.cc:2181] Context with profile default [0] is being executed for yolov4
I0320 05:10:01.549341 1 infer_response.cc:166] add response output: output: boxes, type: FP32, shape: [1,10647,1,4]
I0320 05:10:01.549391 1 grpc_server.cc:2498] GRPC: unable to provide 'boxes' in GPU, will use CPU
I0320 05:10:01.549439 1 grpc_server.cc:2509] GRPC: using buffer for 'boxes', size: 170352, addr: 0x7fdf5800e960
I0320 05:10:01.549454 1 pinned_memory_manager.cc:161] pinned memory allocation: size 170352, addr 0x7fdff81fb0a0
I0320 05:10:01.558050 1 grpc_server.cc:3572] ModelInferHandler::InferResponseComplete, 6 step ISSUED
I0320 05:10:01.558085 1 grpc_server.cc:2591] GRPC free: size 170352, addr 0x7fdf5800e960
I0320 05:10:01.558338 1 grpc_server.cc:3148] ModelInferHandler::InferRequestComplete
I0320 05:10:01.558373 1 tensorrt.cc:2661] TRITONBACKEND_ModelExecute: model yolov4 released 1 requests
I0320 05:10:01.558389 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7fdff81fb0a0
I0320 05:10:01.558407 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7fdff8000090
I0320 05:10:01.558675 1 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 6 step COMPLETE
I0320 05:10:01.558696 1 grpc_server.cc:2419] Done for ModelInferHandler, 6

So the actual inference only takes ~10-15 milliseconds, which seems normal. However, before that, the logs show:

I0320 05:09:58.477189 1 grpc_server.cc:272] Process for ModelReady, rpc_ok=1, 3 step COMPLETE
I0320 05:09:58.477202 1 grpc_server.cc:413] Done for ModelReady, 3
I0320 05:10:01.545416 1 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 6 step START
I0320 05:10:01.545473 1 grpc_server.cc:3413] New request handler for ModelInferHandler, 8

I searched the docs, Google, and other issues to figure out what exactly is going on and why this is taking 3 seconds, but unfortunately didn't find much. At first I thought it was because of model state, but according to the docs, CNNs like YOLO should be stateless. I'm unsure how to resolve this.

This is the config.pbtxt

name: "yolov4"
platform: "tensorrt_plan"
dynamic_batching { }
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 1,  3, 416, 416 ]
  }
]
output [
  {
    name: "boxes"
    data_type: TYPE_FP32
    dims: [ 1, 10647, 1, 4 ]
  },
  {
    name: "confs"
    data_type: TYPE_FP32
    dims: [ 1, 10647, 18 ]
  }
]

and this is how the Docker container running Triton is deployed:

docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/name/models:/models nvcr.io/nvidia/tritonserver:22.02-py3 tritonserver --model-repository=/models --strict-model-config=false --grpc-infer-allocation-pool-size=16 --log-verbose 1
@dominusLabs
Author

The error seemed to come from our client script. We rewrote it and inference times normalized.
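
For reference, a request built roughly like the sketch below (not our exact script; this assumes the tritonclient Python package, and the host and input array are placeholders) now returns in normal time:

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="<triton-host>:8001")  # placeholder host

# Single FP32 input matching the model config above.
image = np.zeros((1, 3, 416, 416), dtype=np.float32)  # placeholder preprocessed image
inp = grpcclient.InferInput("input", [1, 3, 416, 416], "FP32")
inp.set_data_from_numpy(image)

result = client.infer(
    model_name="yolov4",
    inputs=[inp],
    outputs=[grpcclient.InferRequestedOutput("boxes"),
             grpcclient.InferRequestedOutput("confs")],
)
print(result.as_numpy("boxes").shape)  # expected (1, 10647, 1, 4)
```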

@dominusLabs
Author

dominusLabs commented Mar 23, 2022

Reopening this since the problem is occurring again, this time with perf_client as well. I made no modifications to perf_client whatsoever, and am using the same config setup as detailed above (with a different model name in the config to differentiate the ensemble version from the standalone version with no preprocessing).

Screenshot_from_2022-03-22_21-51-02

@dominusLabs
Author

Here is a full verbose log from the run (with only 1 inference, as I believe adding any more would simply be unnecessary):

I0323 02:54:31.113212 1 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
I0323 02:54:32.036814 1 libtorch.cc:1306] TRITONBACKEND_Initialize: pytorch
I0323 02:54:32.036850 1 libtorch.cc:1316] Triton TRITONBACKEND API version: 1.8
I0323 02:54:32.036859 1 libtorch.cc:1322] 'pytorch' TRITONBACKEND API version: 1.8
I0323 02:54:32.036920 1 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so
2022-03-23 02:54:32.205265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-03-23 02:54:32.256681: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0323 02:54:32.256751 1 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0323 02:54:32.256777 1 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0323 02:54:32.256790 1 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0323 02:54:32.256799 1 tensorflow.cc:2216] backend configuration:
{}
I0323 02:54:32.256884 1 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
I0323 02:54:32.258563 1 onnxruntime.cc:2319] TRITONBACKEND_Initialize: onnxruntime
I0323 02:54:32.258592 1 onnxruntime.cc:2329] Triton TRITONBACKEND API version: 1.8
I0323 02:54:32.258601 1 onnxruntime.cc:2335] 'onnxruntime' TRITONBACKEND API version: 1.8
I0323 02:54:32.258610 1 onnxruntime.cc:2365] backend configuration:
{}
I0323 02:54:32.274022 1 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/openvino_2021_2/libtriton_openvino_2021_2.so
I0323 02:54:32.284373 1 openvino.cc:1234] TRITONBACKEND_Initialize: openvino
I0323 02:54:32.284399 1 openvino.cc:1244] Triton TRITONBACKEND API version: 1.8
I0323 02:54:32.284409 1 openvino.cc:1250] 'openvino' TRITONBACKEND API version: 1.8
I0323 02:54:32.472109 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f14d6000000' with size 268435456
I0323 02:54:32.472688 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0323 02:54:32.474573 1 model_config_utils.cc:667] Server side auto-completed config: name: "yolov4-standalone"
platform: "tensorrt_plan"
input {
  name: "input"
  data_type: TYPE_FP32
  dims: 1
  dims: 3
  dims: 416
  dims: 416
}
output {
  name: "boxes"
  data_type: TYPE_FP32
  dims: 1
  dims: 10647
  dims: 1
  dims: 4
}
output {
  name: "confs"
  data_type: TYPE_FP32
  dims: 1
  dims: 10647
  dims: 18
}
default_model_filename: "model.plan"
backend: "tensorrt"
I0323 02:54:32.475200 1 model_repository_manager.cc:700] AsyncLoad() 'yolov4-standalone'
I0323 02:54:32.475304 1 model_repository_manager.cc:939] TriggerNextAction() 'yolov4-standalone' version 1: 1
I0323 02:54:32.475322 1 model_repository_manager.cc:975] Load() 'yolov4-standalone' version 1
I0323 02:54:32.475330 1 model_repository_manager.cc:994] loading: yolov4-standalone:1
I0323 02:54:32.575968 1 model_repository_manager.cc:1052] CreateModel() 'yolov4-standalone' version 1
I0323 02:54:32.576166 1 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so
I0323 02:54:32.576880 1 tensorrt.cc:5145] TRITONBACKEND_Initialize: tensorrt
I0323 02:54:32.576904 1 tensorrt.cc:5155] Triton TRITONBACKEND API version: 1.8
I0323 02:54:32.576916 1 tensorrt.cc:5161] 'tensorrt' TRITONBACKEND API version: 1.8
I0323 02:54:32.576924 1 tensorrt.cc:5184] Registering TensorRT Plugins
I0323 02:54:32.576942 1 logging.cc:52] Plugin creator already registered - ::BatchTilePlugin_TRT version 1
I0323 02:54:32.576954 1 logging.cc:52] Plugin creator already registered - ::BatchedNMS_TRT version 1
I0323 02:54:32.576964 1 logging.cc:52] Plugin creator already registered - ::BatchedNMSDynamic_TRT version 1
I0323 02:54:32.576972 1 logging.cc:52] Plugin creator already registered - ::CoordConvAC version 1
I0323 02:54:32.576983 1 logging.cc:52] Plugin creator already registered - ::CropAndResize version 1
I0323 02:54:32.576993 1 logging.cc:52] Plugin creator already registered - ::CropAndResizeDynamic version 1
I0323 02:54:32.577003 1 logging.cc:52] Plugin creator already registered - ::DecodeBbox3DPlugin version 1
I0323 02:54:32.577013 1 logging.cc:52] Plugin creator already registered - ::DetectionLayer_TRT version 1
I0323 02:54:32.577023 1 logging.cc:52] Plugin creator already registered - ::EfficientNMS_TRT version 1
I0323 02:54:32.577033 1 logging.cc:52] Plugin creator already registered - ::EfficientNMS_ONNX_TRT version 1
I0323 02:54:32.577043 1 logging.cc:52] Plugin creator already registered - ::EfficientNMS_TFTRT_TRT version 1
I0323 02:54:32.577052 1 logging.cc:52] Plugin creator already registered - ::FlattenConcat_TRT version 1
I0323 02:54:32.577067 1 logging.cc:52] Plugin creator already registered - ::GenerateDetection_TRT version 1
I0323 02:54:32.577076 1 logging.cc:52] Plugin creator already registered - ::GridAnchor_TRT version 1
I0323 02:54:32.577086 1 logging.cc:52] Plugin creator already registered - ::GridAnchorRect_TRT version 1
I0323 02:54:32.577096 1 logging.cc:52] Plugin creator already registered - ::InstanceNormalization_TRT version 1
I0323 02:54:32.577106 1 logging.cc:52] Plugin creator already registered - ::LReLU_TRT version 1
I0323 02:54:32.577116 1 logging.cc:52] Plugin creator already registered - ::MultilevelCropAndResize_TRT version 1
I0323 02:54:32.577127 1 logging.cc:52] Plugin creator already registered - ::MultilevelProposeROI_TRT version 1
I0323 02:54:32.577140 1 logging.cc:52] Plugin creator already registered - ::NMS_TRT version 1
I0323 02:54:32.577157 1 logging.cc:52] Plugin creator already registered - ::NMSDynamic_TRT version 1
I0323 02:54:32.577167 1 logging.cc:52] Plugin creator already registered - ::Normalize_TRT version 1
I0323 02:54:32.577178 1 logging.cc:52] Plugin creator already registered - ::PillarScatterPlugin version 1
I0323 02:54:32.577188 1 logging.cc:52] Plugin creator already registered - ::PriorBox_TRT version 1
I0323 02:54:32.577199 1 logging.cc:52] Plugin creator already registered - ::ProposalLayer_TRT version 1
I0323 02:54:32.577208 1 logging.cc:52] Plugin creator already registered - ::Proposal version 1
I0323 02:54:32.577218 1 logging.cc:52] Plugin creator already registered - ::ProposalDynamic version 1
I0323 02:54:32.577228 1 logging.cc:52] Plugin creator already registered - ::PyramidROIAlign_TRT version 1
I0323 02:54:32.577238 1 logging.cc:52] Plugin creator already registered - ::Region_TRT version 1
I0323 02:54:32.577247 1 logging.cc:52] Plugin creator already registered - ::Reorg_TRT version 1
I0323 02:54:32.577257 1 logging.cc:52] Plugin creator already registered - ::ResizeNearest_TRT version 1
I0323 02:54:32.577268 1 logging.cc:52] Plugin creator already registered - ::RPROI_TRT version 1
I0323 02:54:32.577281 1 logging.cc:52] Plugin creator already registered - ::ScatterND version 1
I0323 02:54:32.577291 1 logging.cc:52] Plugin creator already registered - ::SpecialSlice_TRT version 1
I0323 02:54:32.577300 1 logging.cc:52] Plugin creator already registered - ::Split version 1
I0323 02:54:32.577310 1 logging.cc:52] Plugin creator already registered - ::VoxelGeneratorPlugin version 1
I0323 02:54:32.577323 1 tensorrt.cc:5204] backend configuration:
{}
I0323 02:54:32.577357 1 tensorrt.cc:5256] TRITONBACKEND_ModelInitialize: yolov4-standalone (version 1)
I0323 02:54:32.578368 1 model_config_utils.cc:1616] ModelConfig 64-bit fields:
I0323 02:54:32.578390 1 model_config_utils.cc:1618]     ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0323 02:54:32.578399 1 model_config_utils.cc:1618]     ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0323 02:54:32.578406 1 model_config_utils.cc:1618]     ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0323 02:54:32.578413 1 model_config_utils.cc:1618]     ModelConfig::ensemble_scheduling::step::model_version
I0323 02:54:32.578420 1 model_config_utils.cc:1618]     ModelConfig::input::dims
I0323 02:54:32.578427 1 model_config_utils.cc:1618]     ModelConfig::input::reshape::shape
I0323 02:54:32.578434 1 model_config_utils.cc:1618]     ModelConfig::instance_group::secondary_devices::device_id
I0323 02:54:32.578443 1 model_config_utils.cc:1618]     ModelConfig::model_warmup::inputs::value::dims
I0323 02:54:32.578451 1 model_config_utils.cc:1618]     ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0323 02:54:32.578459 1 model_config_utils.cc:1618]     ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0323 02:54:32.578467 1 model_config_utils.cc:1618]     ModelConfig::output::dims
I0323 02:54:32.578497 1 model_config_utils.cc:1618]     ModelConfig::output::reshape::shape
I0323 02:54:32.578506 1 model_config_utils.cc:1618]     ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0323 02:54:32.578514 1 model_config_utils.cc:1618]     ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0323 02:54:32.578520 1 model_config_utils.cc:1618]     ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0323 02:54:32.578527 1 model_config_utils.cc:1618]     ModelConfig::sequence_batching::state::dims
I0323 02:54:32.578535 1 model_config_utils.cc:1618]     ModelConfig::sequence_batching::state::initial_state::dims
I0323 02:54:32.578542 1 model_config_utils.cc:1618]     ModelConfig::version_policy::specific::versions
I0323 02:54:33.003483 1 logging.cc:49] [MemUsageChange] Init CUDA: CPU +313, GPU +0, now: CPU 377, GPU 310 (MiB)
I0323 02:54:33.110721 1 logging.cc:49] Loaded engine size: 126 MiB
I0323 02:54:33.902349 1 logging.cc:52] Using cublasLt as a tactic source
I0323 02:54:33.902523 1 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +518, GPU +224, now: CPU 1166, GPU 662 (MiB)
I0323 02:54:33.902716 1 logging.cc:52] Using cuDNN as a tactic source
I0323 02:54:34.066036 1 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +115, GPU +54, now: CPU 1281, GPU 716 (MiB)
I0323 02:54:34.067635 1 logging.cc:52] Deserialization required 956388 microseconds.
I0323 02:54:34.067683 1 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +123, now: CPU 0, GPU 123 (MiB)
W0323 02:54:34.068471 1 tensorrt.cc:599] The specified dimensions in model config for yolov4-standalone hints that batching is unavailable
I0323 02:54:34.093729 1 tensorrt.cc:475] post auto-complete:
{
    "name": "yolov4-standalone",
    "platform": "tensorrt_plan",
    "backend": "tensorrt",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "input",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1,
                3,
                416,
                416
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "boxes",
            "data_type": "TYPE_FP32",
            "dims": [
                1,
                10647,
                1,
                4
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "confs",
            "data_type": "TYPE_FP32",
            "dims": [
                1,
                10647,
                18
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "yolov4-standalone",
            "kind": "KIND_GPU",
            "count": 1,
            "gpus": [
                0
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.plan",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {},
    "model_warmup": []
}
I0323 02:54:34.094289 1 tensorrt.cc:439] model configuration:
{
    "name": "yolov4-standalone",
    "platform": "tensorrt_plan",
    "backend": "tensorrt",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "input",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1,
                3,
                416,
                416
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "boxes",
            "data_type": "TYPE_FP32",
            "dims": [
                1,
                10647,
                1,
                4
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "confs",
            "data_type": "TYPE_FP32",
            "dims": [
                1,
                10647,
                18
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "yolov4-standalone",
            "kind": "KIND_GPU",
            "count": 1,
            "gpus": [
                0
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.plan",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {},
    "model_warmup": []
}
I0323 02:54:34.094843 1 tensorrt.cc:5305] TRITONBACKEND_ModelInstanceInitialize: yolov4-standalone (GPU device 0)
I0323 02:54:34.095530 1 backend_model_instance.cc:105] Creating instance yolov4-standalone on GPU 0 (7.5) using artifact 'model.plan'
I0323 02:54:34.095920 1 tensorrt.cc:1483] Zero copy optimization is disabled
I0323 02:54:34.096242 1 logging.cc:49] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1010, GPU 574 (MiB)
I0323 02:54:34.203148 1 logging.cc:49] Loaded engine size: 126 MiB
I0323 02:54:34.277634 1 logging.cc:52] Using cublasLt as a tactic source
I0323 02:54:34.277780 1 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1281, GPU 708 (MiB)
I0323 02:54:34.277914 1 logging.cc:52] Using cuDNN as a tactic source
I0323 02:54:34.279519 1 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1281, GPU 716 (MiB)
I0323 02:54:34.280836 1 logging.cc:52] Deserialization required 77148 microseconds.
I0323 02:54:34.280867 1 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +123, now: CPU 0, GPU 123 (MiB)
I0323 02:54:34.281604 1 tensorrt.cc:387] Created new runtime on GPU device 0, NVDLA core -1 for yolov4-standalone
I0323 02:54:34.281632 1 tensorrt.cc:394] Created new engine on GPU device 0, NVDLA core -1 for yolov4-standalone
I0323 02:54:34.282831 1 logging.cc:52] Using cublasLt as a tactic source
I0323 02:54:34.282939 1 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1028, GPU 708 (MiB)
I0323 02:54:34.283061 1 logging.cc:52] Using cuDNN as a tactic source
I0323 02:54:34.284628 1 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1028, GPU 716 (MiB)
I0323 02:54:34.285252 1 logging.cc:52] Total per-runner device persistent memory is 129355264
I0323 02:54:34.285278 1 logging.cc:52] Total per-runner host persistent memory is 274960
I0323 02:54:34.285899 1 logging.cc:52] Allocated activation device memory of size 33970688
I0323 02:54:34.287820 1 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +156, now: CPU 0, GPU 279 (MiB)
I0323 02:54:34.287861 1 tensorrt.cc:3191] Detected input as execution binding for yolov4-standalone
I0323 02:54:34.287871 1 tensorrt.cc:3191] Detected boxes as execution binding for yolov4-standalone
I0323 02:54:34.287881 1 tensorrt.cc:3191] Detected confs as execution binding for yolov4-standalone
I0323 02:54:34.288684 1 tensorrt.cc:1409] Created instance yolov4-standalone on GPU 0 with stream priority 0 and optimization profile default[0];
I0323 02:54:34.288886 1 triton_model_instance.cc:668] Starting backend thread for yolov4-standalone at nice 0 on device 0...
I0323 02:54:34.289026 1 model_repository_manager.cc:1149] successfully loaded 'yolov4-standalone' version 1
I0323 02:54:34.289072 1 model_repository_manager.cc:939] TriggerNextAction() 'yolov4-standalone' version 1: 0
I0323 02:54:34.289083 1 model_repository_manager.cc:953] no next action, trigger OnComplete()
I0323 02:54:34.289195 1 model_repository_manager.cc:546] VersionStates() 'yolov4-standalone'
I0323 02:54:34.289393 1 model_repository_manager.cc:546] VersionStates() 'yolov4-standalone'
I0323 02:54:34.289443 1 server.cc:522] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0323 02:54:34.289513 1 server.cc:549] 
+-------------+-------------------------------------------------------------------------+--------+
| Backend     | Path                                                                    | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                 | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so         | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so         | {}     |
| openvino    | /opt/tritonserver/backends/openvino_2021_2/libtriton_openvino_2021_2.so | {}     |
| tensorrt    | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so               | {}     |
+-------------+-------------------------------------------------------------------------+--------+
I0323 02:54:34.289523 1 model_repository_manager.cc:522] ModelStates()
I0323 02:54:34.289546 1 server.cc:592] 
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| yolov4-standalone | 1       | READY  |
+-------------------+---------+--------+
I0323 02:54:34.301436 1 metrics.cc:623] Collecting metrics for GPU 0: Tesla T4
I0323 02:54:34.302282 1 tritonserver.cc:1932] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.19.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /models                                                                                                                                                                                      |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 0                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0323 02:54:34.302385 1 grpc_server.cc:4325] === GRPC KeepAlive Options ===
I0323 02:54:34.302398 1 grpc_server.cc:4326] keepalive_time_ms: 7200000
I0323 02:54:34.302406 1 grpc_server.cc:4328] keepalive_timeout_ms: 20000
I0323 02:54:34.302414 1 grpc_server.cc:4330] keepalive_permit_without_calls: 0
I0323 02:54:34.302422 1 grpc_server.cc:4332] http2_max_pings_without_data: 2
I0323 02:54:34.302431 1 grpc_server.cc:4334] http2_min_recv_ping_interval_without_data_ms: 300000
I0323 02:54:34.302439 1 grpc_server.cc:4337] http2_max_ping_strikes: 2
I0323 02:54:34.302447 1 grpc_server.cc:4339] ==============================
I0323 02:54:34.304085 1 grpc_server.cc:227] Ready for RPC 'ServerLive', 0
I0323 02:54:34.304114 1 grpc_server.cc:227] Ready for RPC 'ServerReady', 0
I0323 02:54:34.304153 1 grpc_server.cc:227] Ready for RPC 'ModelReady', 0
I0323 02:54:34.304362 1 grpc_server.cc:227] Ready for RPC 'ServerMetadata', 0
I0323 02:54:34.304387 1 grpc_server.cc:227] Ready for RPC 'ModelMetadata', 0
I0323 02:54:34.304401 1 grpc_server.cc:227] Ready for RPC 'ModelConfig', 0
I0323 02:54:34.304411 1 grpc_server.cc:227] Ready for RPC 'ModelStatistics', 0
I0323 02:54:34.304425 1 grpc_server.cc:227] Ready for RPC 'Trace', 0
I0323 02:54:34.304502 1 grpc_server.cc:227] Ready for RPC 'SystemSharedMemoryStatus', 0
I0323 02:54:34.304679 1 grpc_server.cc:227] Ready for RPC 'SystemSharedMemoryRegister', 0
I0323 02:54:34.304699 1 grpc_server.cc:227] Ready for RPC 'SystemSharedMemoryUnregister', 0
I0323 02:54:34.304711 1 grpc_server.cc:227] Ready for RPC 'CudaSharedMemoryStatus', 0
I0323 02:54:34.304724 1 grpc_server.cc:227] Ready for RPC 'CudaSharedMemoryRegister', 0
I0323 02:54:34.305115 1 grpc_server.cc:227] Ready for RPC 'CudaSharedMemoryUnregister', 0
I0323 02:54:34.305229 1 grpc_server.cc:227] Ready for RPC 'RepositoryIndex', 0
I0323 02:54:34.305420 1 grpc_server.cc:227] Ready for RPC 'RepositoryModelLoad', 0
I0323 02:54:34.305531 1 grpc_server.cc:227] Ready for RPC 'RepositoryModelUnload', 0
I0323 02:54:34.305742 1 grpc_server.cc:421] Thread started for CommonHandler
I0323 02:54:34.305997 1 grpc_server.cc:3413] New request handler for ModelInferHandler, 1
I0323 02:54:34.306157 1 grpc_server.cc:2426] Thread started for ModelInferHandler
I0323 02:54:34.306793 1 grpc_server.cc:3413] New request handler for ModelInferHandler, 2
I0323 02:54:34.306876 1 grpc_server.cc:2426] Thread started for ModelInferHandler
I0323 02:54:34.307168 1 grpc_server.cc:3772] New request handler for ModelStreamInferHandler, 4
I0323 02:54:34.307304 1 grpc_server.cc:2426] Thread started for ModelStreamInferHandler
I0323 02:54:34.307418 1 grpc_server.cc:4375] Started GRPCInferenceService at 0.0.0.0:8001
I0323 02:54:34.307936 1 http_server.cc:3075] Started HTTPService at 0.0.0.0:8000
I0323 02:54:34.350272 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
I0323 02:55:09.652598 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 0 step START
I0323 02:55:09.652645 1 grpc_server.cc:227] Ready for RPC 'ServerLive', 1
I0323 02:55:09.652729 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 0 step COMPLETE
I0323 02:55:09.652741 1 grpc_server.cc:413] Done for ServerLive, 0
I0323 02:55:09.721482 1 grpc_server.cc:272] Process for ServerReady, rpc_ok=1, 0 step START
I0323 02:55:09.721677 1 grpc_server.cc:227] Ready for RPC 'ServerReady', 1
I0323 02:55:09.721853 1 model_repository_manager.cc:522] ModelStates()
I0323 02:55:09.722234 1 grpc_server.cc:272] Process for ServerReady, rpc_ok=1, 0 step COMPLETE
I0323 02:55:09.722449 1 grpc_server.cc:413] Done for ServerReady, 0
I0323 02:55:09.784369 1 grpc_server.cc:272] Process for ModelReady, rpc_ok=1, 0 step START
I0323 02:55:09.784396 1 grpc_server.cc:227] Ready for RPC 'ModelReady', 1
I0323 02:55:09.784408 1 model_repository_manager.cc:590] GetModel() 'yolov4-standalone' version -1
I0323 02:55:09.784633 1 grpc_server.cc:272] Process for ModelReady, rpc_ok=1, 0 step COMPLETE
I0323 02:55:09.784652 1 grpc_server.cc:413] Done for ModelReady, 0
I0323 02:55:12.728061 1 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 1 step START
I0323 02:55:12.728112 1 grpc_server.cc:3413] New request handler for ModelInferHandler, 5
I0323 02:55:12.728131 1 model_repository_manager.cc:590] GetModel() 'yolov4-standalone' version -1
I0323 02:55:12.728151 1 model_repository_manager.cc:590] GetModel() 'yolov4-standalone' version -1
I0323 02:55:12.728184 1 infer_request.cc:675] prepared: [0x0x7f1420004320] request id: , model: yolov4-standalone, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f1420004618] input: input, type: FP32, original shape: [1,3,416,416], batch + shape: [1,3,416,416], shape: [1,3,416,416]
override inputs:
inputs:
[0x0x7f1420004618] input: input, type: FP32, original shape: [1,3,416,416], batch + shape: [1,3,416,416], shape: [1,3,416,416]
original requested outputs:
boxes
confs
requested outputs:
boxes
confs
I0323 02:55:12.728294 1 tensorrt.cc:5376] model yolov4-standalone, instance yolov4-standalone, executing 1 requests
I0323 02:55:12.728316 1 tensorrt.cc:1609] TRITONBACKEND_ModelExecute: Issuing yolov4-standalone with 1 requests
I0323 02:55:12.728338 1 tensorrt.cc:1668] TRITONBACKEND_ModelExecute: Running yolov4-standalone with 1 requests
I0323 02:55:12.728389 1 tensorrt.cc:2804] Optimization profile default [0] is selected for yolov4-standalone
I0323 02:55:12.728428 1 pinned_memory_manager.cc:161] pinned memory allocation: size 2076672, addr 0x7f14d6000090
I0323 02:55:12.729153 1 tensorrt.cc:2181] Context with profile default [0] is being executed for yolov4-standalone
I0323 02:55:12.732739 1 infer_response.cc:166] add response output: output: boxes, type: FP32, shape: [1,10647,1,4]
I0323 02:55:12.732850 1 grpc_server.cc:2498] GRPC: unable to provide 'boxes' in GPU, will use CPU
I0323 02:55:12.732995 1 grpc_server.cc:2509] GRPC: using buffer for 'boxes', size: 170352, addr: 0x7f143c00e8f0
I0323 02:55:12.733016 1 pinned_memory_manager.cc:161] pinned memory allocation: size 170352, addr 0x7f14d61fb0a0
I0323 02:55:12.733060 1 infer_response.cc:166] add response output: output: confs, type: FP32, shape: [1,10647,18]
I0323 02:55:12.733075 1 grpc_server.cc:2498] GRPC: unable to provide 'confs' in GPU, will use CPU
I0323 02:55:12.733586 1 grpc_server.cc:2509] GRPC: using buffer for 'confs', size: 766584, addr: 0x7f143c038350
I0323 02:55:12.733614 1 pinned_memory_manager.cc:161] pinned memory allocation: size 766584, addr 0x7f14d6224a20
I0323 02:55:12.742681 1 grpc_server.cc:3572] ModelInferHandler::InferResponseComplete, 1 step ISSUED
I0323 02:55:12.742722 1 model_repository_manager.cc:590] GetModel() 'yolov4-standalone' version 1
I0323 02:55:12.757716 1 grpc_server.cc:2591] GRPC free: size 170352, addr 0x7f143c00e8f0
I0323 02:55:12.757742 1 grpc_server.cc:2591] GRPC free: size 766584, addr 0x7f143c038350
I0323 02:55:12.758266 1 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 1 step COMPLETE
I0323 02:55:12.758293 1 grpc_server.cc:2419] Done for ModelInferHandler, 1
I0323 02:55:12.758355 1 grpc_server.cc:3148] ModelInferHandler::InferRequestComplete
I0323 02:55:12.758398 1 tensorrt.cc:2661] TRITONBACKEND_ModelExecute: model yolov4-standalone released 1 requests
I0323 02:55:12.758417 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f14d61fb0a0
I0323 02:55:12.758429 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f14d6224a20
I0323 02:55:12.758449 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f14d6000090

@dominusLabs
Author

We ran the client and the inference server on the same machine, and inference was returned in ~50ms, which was the expected behaviour. However, running the same client script from a different machine still results in a ~3 second delay. We are not using shared memory on the server or in the client, and everything is done over gRPC.
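
A quick way to separate per-request RPC/RTT overhead from payload transfer time is to time a tiny call against a full inference from the remote machine. A minimal sketch, assuming the tritonclient Python package (the host is a placeholder):

```python
import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="<triton-host>:8001")  # placeholder host

t0 = time.perf_counter()
client.is_server_live()              # near-zero payload: mostly RTT + RPC overhead
t1 = time.perf_counter()

inp = grpcclient.InferInput("input", [1, 3, 416, 416], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3, 416, 416), dtype=np.float32))
client.infer("yolov4-standalone", inputs=[inp])   # ~2 MB request + ~0.9 MB response
t2 = time.perf_counter()

print(f"liveness: {(t1 - t0) * 1e3:.1f} ms, infer: {(t2 - t1) * 1e3:.1f} ms")
```

If the liveness call is fast from the remote machine but the inference call is slow, the extra time is being spent moving the tensors, not inside Triton.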

@dominusLabs
Author

Following up on the network latency theory above, we tested from several different systems, all running Ubuntu 20.04.

The Triton Inference Server is located on a GCloud VM with a Tesla T4, in us-west1-b. All tests were run using the same server instance/container and the same client script/image.

We tested on the same VM that was running the Triton Inference Server, and it only took ~50ms to get back an inference.
We tested on another VM, in the same us-west1-b region. It also took ~50ms to get back an inference.
We tested on another VM, in the us-central1-a region. It took ~1100ms to get back an inference.
We tested on another computer, in Texas (so the south). It took ~3000ms to get back an inference.
We tested on another VM, in Germany. It took ~6000ms to get back an inference.

All of the networks involved were at least 1 GB/s.

So our theory is that Triton essentially has to wait for the entire image to arrive at the inference server before it can start inference. That image takes a while to reach the server over a long-haul link, so Triton sits waiting, which inflates the inference round-trip time.
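
As a rough sanity check on that theory: the payload sizes implied by the model config match the pinned-memory allocations in the logs, and they are large enough for long-haul transfer time to dominate. A back-of-the-envelope sketch (the effective throughput figures are assumptions, not measurements):

```python
# FP32 = 4 bytes per element.
input_bytes = 1 * 3 * 416 * 416 * 4   # 2,076,672 B -- matches the pinned allocation in the log
boxes_bytes = 1 * 10647 * 1 * 4 * 4   # 170,352 B
confs_bytes = 1 * 10647 * 18 * 4      # 766,584 B
total_bytes = input_bytes + boxes_bytes + confs_bytes  # ~3.0 MB per round trip

# Time spent just moving bytes at a few assumed effective throughputs
# (long-haul TCP often achieves far less than the link rate because of RTT).
for mbit in (1000, 100, 20):
    ms = total_bytes * 8 / (mbit * 1e6) * 1e3
    print(f"{mbit:>5} Mbit/s -> ~{ms:.0f} ms transfer")
```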

@dominusLabs
Author

We switched to the HTTP client, since we're more familiar with HTTP than gRPC, and started using deflate compression. With these changes we reduced the time by ~50%, so inference takes ~1500ms from a computer in Texas to the VM hosting Triton in us-west1-b. However, this is still a far cry from the ~50ms inference round trip we get from a VM in the same region.
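
For reference, compression can be enabled per request on the HTTP client; a sketch along these lines, assuming the tritonclient Python package (host and input are placeholders):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<triton-host>:8000")  # placeholder host

image = np.zeros((1, 3, 416, 416), dtype=np.float32)  # placeholder preprocessed image
inp = httpclient.InferInput("input", [1, 3, 416, 416], "FP32")
inp.set_data_from_numpy(image, binary_data=True)

result = client.infer(
    "yolov4-standalone",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("boxes", binary_data=True),
             httpclient.InferRequestedOutput("confs", binary_data=True)],
    request_compression_algorithm="deflate",   # compress the ~2 MB request body
    response_compression_algorithm="deflate",  # ask the server to deflate the response
)
```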

We are using NGINX with the VM. It is unlikely this has anything to do with the increased inference time, but we wanted to mention it so potential investigators have a better understanding of the setup.

@dominusLabs
Author

dominusLabs commented Mar 24, 2022

Here is a simplified view of a request's lifetime:

====================== SERVER =========================
I0324 02:47:29.911654 1 http_server.cc:2976] HTTP request: 0 /v2/health/live
I0324 02:47:29.971922 1 http_server.cc:2976] HTTP request: 0 /v2/health/ready
I0324 02:47:29.971964 1 model_repository_manager.cc:522] ModelStates()
I0324 02:47:30.031816 1 http_server.cc:2976] HTTP request: 0 /v2/models/yolov4-standalone/ready
I0324 02:47:30.031858 1 model_repository_manager.cc:590] GetModel() 'yolov4-standalone' version -1
====================== CLIENT =========================
00000 02:47:30.005100 1 Starting to build grpc request
00000 02:47:30.015531 1 Starting inference request thread
00000 02:47:30.015603 1 Sending inference request
00000 02:47:30.059975 1 Triton Client actually sending now
====================== SERVER =========================
I0324 02:47:30.555256 1 http_server.cc:2976] HTTP request: 2 /v2/models/yolov4-standalone/infer
I0324 02:47:30.555315 1 model_repository_manager.cc:590] GetModel() 'yolov4-standalone' version -1
I0324 02:47:30.555336 1 model_repository_manager.cc:590] GetModel() 'yolov4-standalone' version -1
I0324 02:47:30.561618 1 infer_request.cc:675]  request id: , model: yolov4-standalone
I0324 02:47:30.562883 1 tensorrt.cc:5376] model yolov4-standalone, instance yolov4-standalone_0, 
I0324 02:47:30.562920 1 tensorrt.cc:1609] TRITONBACKEND_ModelExecute: Issuing yolov4-standalone_0
I0324 02:47:30.562936 1 tensorrt.cc:1668] TRITONBACKEND_ModelExecute: Running yolov4-standalone_0 
I0324 02:47:30.562975 1 tensorrt.cc:2804] Optimization profile default [0] is selected for yolov4-standalone_0
I0324 02:47:30.563017 1 pinned_memory_manager.cc:161] pinned memory allocation: size 2076672
I0324 02:47:30.563959 1 tensorrt.cc:2181] Context with profile default [0] is being executed
I0324 02:47:30.566265 1 infer_response.cc:166] add response output: output: confs, type: FP32
I0324 02:47:30.566299 1 http_server.cc:1068] HTTP: unable to provide 'confs' in GPU, will use CPU
I0324 02:47:30.566314 1 http_server.cc:1088] HTTP using buffer for: 'confs', size: 766584, addr: 0x7f8c38013be0
I0324 02:47:30.566373 1 pinned_memory_manager.cc:161] pinned memory allocation: size 766584, addr 0x7f8cdc1fb0a0
I0324 02:47:30.922351 1 http_server.cc:1140] HTTP release: size 766584, addr 0x7f8c38013be0
I0324 02:47:30.922504 1 tensorrt.cc:2661] TRITONBACKEND_ModelExecute: model yolov4-standalone_0
I0324 02:47:30.922572 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f8cdc1fb0a0
I0324 02:47:30.922640 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f8cdc000090
====================== CLIENT =========================
00000 02:47:31.403997 1 Got back inference response

Part #1 

00000 02:47:30.059975 1 Triton Client actually sending now
====================== SERVER =========================
I0324 02:47:30.555256 1 http_server.cc:2976] HTTP request: 2 /v2/models/yolov4-standalone/infer
I0324 02:47:30.555315 1 model_repository_manager.cc:590] GetModel() 'yolov4-standalone' version -1

Part #2

I0324 02:47:30.566314 1 http_server.cc:1088] HTTP using buffer for: 'confs', size: 766584, addr: 0x7f8c38013be0
I0324 02:47:30.566373 1 pinned_memory_manager.cc:161] pinned memory allocation: size 766584, addr 0x7f8cdc1fb0a0
I0324 02:47:30.922351 1 http_server.cc:1140] HTTP release: size 766584, addr 0x7f8c38013be0
I0324 02:47:30.922504 1 tensorrt.cc:2661] TRITONBACKEND_ModelExecute: model yolov4-standalone_0

Part #3

I0324 02:47:30.922640 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7f8cdc000090
====================== CLIENT =========================
00000 02:47:31.403997 1 Got back inference response

This particular request took ~1600ms, so these three gaps account for ~95% of the inference round-trip time. I'll try to find out whether the client is just sitting on the request and response, or whether the server is taking longer than usual to process them.
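
One way to split that out is to compare Triton's own server-side timings (from the statistics extension listed in the server options above) against the client-observed round trip; whatever is left over is network plus client-side (de)serialization. A minimal sketch, assuming the tritonclient Python package (host is a placeholder):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<triton-host>:8000")  # placeholder host

# Cumulative per-model timings (queue, compute_input, compute_infer, compute_output),
# reported in nanoseconds by the statistics extension.
stats = client.get_inference_statistics(model_name="yolov4-standalone")
print(stats)
```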

@deadeyegoodwin
Contributor

Why do you think this is anything more than networking overhead?

@dominusLabs
Author

I no longer do. I thought it was something more, since one of the chunks of time causing problems was within the actual Triton inference process, but we have since resolved that. I'm wondering if there is a way to reduce the networking overhead, but so far I haven't found anything significantly fruitful.

@dominusLabs
Author

Closing, since this turned out to be unrelated to Triton.
