Boosting DL Service Throughput 1.5–4x via Ensemble Pipeline Serving with Concurrent CUDA Streams, Using a PyTorch/LibTorch Frontend and TensorRT/CVCUDA (etc.) Backends
Topics: deployment, inference, pytorch, ray, serve, tensorrt, serving, pipeline-parallelism, torch2trt, triton-inference-server, ray-serve, cvcuda
Updated: Jun 5, 2024 · Language: C++