---
id: torch
title: Torch-related Backends
type: reference
displayed_sidebar: api
---

## Overview

| Name | Main Initialization Parameters | Input[Type] | Output[Type] | Note |
|---|---|---|---|---|
| DecodeTensor | color(default=rgb); data_format(default=nchw) | data[str/bytes] | result[at::Tensor], color[str] | color in [rgb, bgr]; data_format in [nchw, hwc] (v0.3.2rc3) |
| cvtColorTensor | color | data[at::Tensor], color[str] | result[at::Tensor] | |
| ResizeTensor | resize_h, resize_w | data[at::Tensor] | result[at::Tensor] | |
| PillowResizeTensor | resize_h, resize_w | data[at::Tensor] | result[at::Tensor] | CV_8UC3 |
| ResizePadTensor | max_h, max_w, pad_value | data[at::Tensor] | result[at::Tensor], inverse_trans[`std::function<std::pair<float, float>(float x, float y)>`] | |
| TensorrtTensor | model, instance_num, max, precision, mean, std, model::cache | data[at::Tensor / `std::vector<at::Tensor>`] | result[at::Tensor / `std::vector<at::Tensor>`] | TensorRT inference engine |
| Tensor2Mat | | data[at::Tensor] | result[cv::Mat] | |
| Tensor2Vector | | data[at::Tensor] | result[std::vector] | |
| SyncTensor | SyncTensor::backend | data[at::Tensor] | result[at::Tensor] | CUDA stream synchronization facility |
| Torch | device_id, Torch::backend | data[at::Tensor] | result[at::Tensor] | |
| SaveTensor | save_dir | data[at::Tensor] | result[at::Tensor] | |
| LoadTensor | tensor_name | | result[at::Tensor] | |
| C10Exception | | | | Throws a c10::Error exception; used to simulate internal Torch exceptions |

## DecodeTensor

- Calls nvjpeg for GPU decoding, limited to `h*w < 5000*5000`. The output data shape is 1x3xHxW.
- If the decoded image is empty, no `result` key is written to the output.
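A minimal usage sketch from Python, assuming torchpipe's `torchpipe.pipe` entry point; `demo.jpg` is a placeholder file name:

```python
import torchpipe

# Decode on the GPU with the default target color space.
pipe = torchpipe.pipe({"backend": "DecodeTensor", "color": "rgb"})

data = {"data": open("demo.jpg", "rb").read()}
pipe(data)  # the dict is filled in place

if "result" in data:          # the key is absent when decoding fails
    tensor = data["result"]   # CUDA tensor of shape (1, 3, H, W)
    print(tensor.shape, data["color"])
```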

## cvtColorTensor

- The color parameter set at initialization is the target color space; the color read from the input is the color space of the data. If they differ, a color space conversion is performed; otherwise the input is returned unchanged.
- color currently supports "rgb" and "bgr".
- The input must have shape 1x3xHxW.
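A hedged sketch of the conversion rule: the backend compares the input's `color` key with its configured target and converts only when they differ.

```python
import torch
import torchpipe

# Target color space is bgr; conversion happens only because the input says rgb.
pipe = torchpipe.pipe({"backend": "cvtColorTensor", "color": "bgr"})

data = {"data": torch.rand(1, 3, 224, 224).cuda(), "color": "rgb"}
pipe(data)
bgr = data["result"]  # channels swapped; with an input color of "bgr" the data would pass through unchanged
```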

## ResizeTensor

- Calls at::upsample_bilinear2d for resizing.
- resize_h and resize_w must be integers in the range [1, 1024 * 1024].
- The input must have shape 1x3xHxW.
- The output has shape 1x3xHxW with float data type.
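A short usage sketch (assuming the `torchpipe.pipe` Python entry point; shapes are illustrative):

```python
import torch
import torchpipe

pipe = torchpipe.pipe({"backend": "ResizeTensor", "resize_h": "224", "resize_w": "224"})

data = {"data": torch.rand(1, 3, 480, 640).cuda()}
pipe(data)
print(data["result"].shape)  # (1, 3, 224, 224), float dtype
```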

## PillowResizeTensor

- The input at::Tensor must be of type at::kByte, with shape 1x3xHxW.
- resize_h and resize_w must be integers in the range [1, 1024 * 1024].
- Strictly matches Pillow's bilinear interpolation results (verified on a large amount of data).
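The main difference from ResizeTensor is the dtype constraint, as in this sketch:

```python
import torch
import torchpipe

pipe = torchpipe.pipe({"backend": "PillowResizeTensor", "resize_h": "224", "resize_w": "224"})

# The input must be uint8 (at::kByte) in 1x3xHxW layout; float input is not accepted.
img = (torch.rand(1, 3, 480, 640) * 255).to(torch.uint8).cuda()
data = {"data": img}
pipe(data)
```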

## ResizePadTensor

- Resizes while maintaining the aspect ratio, aligns to the top-left corner, and pads with the constant pad_value.
- The output at::Tensor is float, with shape 1x3xHxW.
- max_h and max_w must be integers in the range [1, 1024 * 1024].
- pad_value accepts an integer, a floating-point number, or multiple comma-separated values.
- inverse_trans: maps coordinates in the resized image back to coordinates in the original image.
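A hedged sketch; whether the `inverse_trans` `std::function` is directly callable from Python depends on the bindings, so the last two lines are an assumption:

```python
import torch
import torchpipe

pipe = torchpipe.pipe({"backend": "ResizePadTensor",
                       "max_h": "416", "max_w": "416",
                       "pad_value": "114,114,114"})

data = {"data": torch.rand(1, 3, 300, 500).cuda()}
pipe(data)
padded = data["result"]       # (1, 3, 416, 416), float
inv = data["inverse_trans"]   # assumption: exposed to Python as a callable
x0, y0 = inv(100.0, 50.0)     # padded-image coords -> original-image coords
```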

## Tensor2Mat

- Converts at::Tensor to cv::Mat, keeping the data type unchanged.
- The input shape must be HxWx3 or 1x3xHxW.
- Like SyncTensor, it synchronizes the current stream.

:::caution
Please insert stream management operations: Sequential[Tensor2Mat,SyncTensor] or SyncTensor[Tensor2Mat]; otherwise Tensor2Mat will use the default CUDA stream.
:::
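A sketch of the recommended composition (assuming the `torchpipe.pipe` entry point; how cv::Mat surfaces in Python is an assumption):

```python
import torch
import torchpipe

# Wrapping with SyncTensor switches the thread to an independent stream;
# "Sequential[Tensor2Mat,SyncTensor]" would work equally well here.
pipe = torchpipe.pipe({"backend": "SyncTensor[Tensor2Mat]"})

data = {"data": torch.rand(1, 3, 224, 224).cuda()}
pipe(data)
mat = data["result"]  # cv::Mat; assumption: surfaced as a numpy array in the Python bindings
```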

## Tensor2Vector

- Converts at::Tensor to std::vector, keeping the data type unchanged (>=v0.4.1; currently only float is supported).
- The input shape must be HxWx3 or 1x3xHxW.
- Like SyncTensor, it synchronizes the current stream.

:::caution
Please insert stream management operations: Sequential[Tensor2Vector,SyncTensor] or SyncTensor[Tensor2Vector]; otherwise Tensor2Vector will use the default CUDA stream.
:::

## SyncTensor

- SyncTensor::backend: default=Identity
- Usage (see the sketch after this list):
  - SyncTensor[BackendTensor]
  - Sequential[ATensor,BTensor,SyncTensor]
  - Nested usage executes stream synchronization only once.
- When used directly in the PyTorch environment (without going through the scheduling backend), this backend has no effect, in order to stay compatible with PyTorch's CUDA semantics. When going through the default scheduling backend, initialization and forward can be assumed to run in the same independent thread.
- Aliases: TensorSync, Torch (effective from version 0.3.1b2)
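A composition sketch under the same assumptions as the earlier examples (`demo.jpg` is a placeholder):

```python
import torchpipe

# SyncTensor placed at the end of a Sequential chain; synchronization runs once
# even when SyncTensor instances are nested.
pipe = torchpipe.pipe({"backend": "Sequential[DecodeTensor,ResizeTensor,SyncTensor]",
                       "resize_h": "224", "resize_w": "224"})

data = {"data": open("demo.jpg", "rb").read()}
pipe(data)
```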
### Implementation details

- The scheduling system guarantees that initialization and forward of a backend instance run in the same independent thread. SyncTensor activates its own functionality only after detecting this independent-thread mode.
- During initialization, SyncTensor checks whether the current thread is bound to the default stream. If so, it activates itself: it switches the thread to an independent stream during initialization and synchronizes that stream during forward.
- The Sequential container guarantees that its sub-backends initialize in the reverse of their forward order. For example, Sequential[SyncTensor[A],SyncTensor[B]] initializes in reverse order and forwards in order:
```mermaid
flowchart LR
  BI["SyncTensor[B].init"] --> AI["SyncTensor[A].init"]
  AI --> AF["SyncTensor[A].forward"]
  AF --> BF["SyncTensor[B].forward"]
```

During initialization, SyncTensor[A] is not on the default stream, so it neither sets a new stream nor needs to synchronize during forward. SyncTensor[B], which does set a new stream, is the one responsible for stream synchronization.

- The Mat2Tensor and Tensor2Mat backends synchronize the current stream themselves. However, they cannot change the stream bound to the thread, so they still need to be switched onto an independent stream via S[Tensor2Mat,...,SyncTensor]; otherwise performance will suffer.

## Torch

Similar to SyncTensor, but with additional cross-GPU functionality. Effective from version 0.3.2b1.

- Torch::backend: required, e.g. Torch[TensorrtTensor]
- device_id: defaults to -1. Sets the current device to this ID; during initialization this is equivalent to calling c10::cuda::set_device(device_id) or torch.cuda.set_device(device_id). During forward, the input data must be at::Tensor or vector<at::Tensor>; this backend moves the data to the specified device.
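A hedged sketch; `resnet18.onnx` and the device ID are placeholders:

```python
import torch
import torchpipe

# Run TensorRT inference on device 1; input tensors are moved to that card.
pipe = torchpipe.pipe({"backend": "Torch[TensorrtTensor]",
                       "device_id": "1",
                       "model": "resnet18.onnx"})

data = {"data": torch.rand(1, 3, 224, 224)}
pipe(data)
result = data["result"]
```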

## SaveTensor

- save_dir: directory for saving files; it must be created in advance.
- Files are saved with the .pt suffix and a unique name.
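A short sketch (the directory name is illustrative):

```python
import os
import torch
import torchpipe

os.makedirs("./debug_tensors", exist_ok=True)  # save_dir must exist in advance
pipe = torchpipe.pipe({"backend": "SaveTensor", "save_dir": "./debug_tensors"})

data = {"data": torch.rand(1, 3, 224, 224).cuda()}
pipe(data)
# A uniquely named .pt file now sits in ./debug_tensors; reload it with torch.load().
```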

## TensorrtTensor

TensorRT inference engine.

### Initialization

The initialization parameters are:

| Parameter | Description | Note |
|---|---|---|
| model | Model path | Supports: onnx files ending with `.onnx`; TensorRT engine files ending with `.trt`; encrypted files ending with `.onnx.encrypted` and `.trt.encrypted` |
| instance_num | Number of instances | If the number of profiles in the TensorRT engine is insufficient to create the requested instances, multiple engines are deserialized. |
| postprocessor | Custom post-processing | Custom C++ batch post-processing for the network output; the default operation splits the batch dimension. Must be implemented as a subclass of PostProcessor and registered. |

For onnx models, the following additional parameters are available:

| Parameter | Description | Note |
|---|---|---|
| min/max | Minimum/maximum input shape of the model | May be `1`, `1x3x224x224`, or `1,1` (for multi-input networks). When instance_num > 1, multiple configurations can be given, separated by `;`. |
| precision | Model precision | One of [fp32, fp16, int8, best]. The default for versions <= 0.3.1b1 is fp16; for versions > 0.3.1b1 it is fp16 when SM > 6.1 and fp32 when SM <= 6.1. If the requested precision is not supported, it automatically degrades to a supported one. |
| precision::fp32 | Set the precision of some layers to fp32 (overrides the precision setting) | Layer name(s) (partial names allowed), separated by commas. |
| precision::fp16 | Set the precision of some layers to fp16 (overrides the precision setting) | Layer name(s) (partial names allowed), separated by commas. |
| precision::output::fp32 | Set the output precision of some layers to fp32 (overrides the precision setting) | Layer name(s) (partial names allowed), separated by commas (>= 0.3.1b2). |
| precision::output::fp16 | Set the output precision of some layers to fp16 (overrides the precision setting) | Layer name(s) (partial names allowed), separated by commas (>= 0.3.1b2). |
| mean/std | Mean subtraction and variance division in image preprocessing | This operation is inserted into the TensorRT network. Needs to be greater than 1 + 1e-5 (>= 0.3.1b2). |
| model::cache | Path for automatically caching the model | Supports file names with `.trt` and `.trt.encrypted` suffixes. If the file does not exist, it is saved automatically; otherwise it is loaded directly. |

For quantizing onnx models with TensorRT, the following parameters are available:

| Parameter | Description | Note | Starting Version |
|---|---|---|---|
| calibrate_input | Calibration input directory | Tensors saved using torch.save or the SaveTensor backend, with shape 1xCxHxW. | |
| calibrate_cache | Optional calibration cache, e.g. "resnet18.cache". If it exists, calibrate_input is skipped. | Calibration can be expensive, so it can be cached to a file. If the network structure or input dataset changes, the network should be recalibrated. | >=0.3.0b4 |

See example.
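A hedged configuration sketch combining the parameters above; the model path, shapes, and normalization values are placeholders:

```python
import torchpipe

config = {
    "backend": "SyncTensor[TensorrtTensor]",  # TensorRT wrapped with stream management
    "model": "resnet18.onnx",                 # placeholder path
    "instance_num": "2",
    "min": "1x3x224x224",
    "max": "4x3x224x224",
    "precision": "fp16",
    "mean": "123.675,116.28,103.53",          # folded into the TensorRT network
    "std": "58.395,57.12,57.375",
    "model::cache": "resnet18.trt",           # saved on first run, loaded afterwards
}
pipe = torchpipe.pipe(config)
```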

### Forward Computation

| Key | Description | Note |
|---|---|---|
| TASK_DATA_KEY (input) | When the network has a single input and output, the type is at::Tensor / torch.Tensor. When the network has multiple inputs and outputs, the type is vector / List. | Sorted in lexicographic order for TensorRT <= 9 |
| TASK_RESULT_KEY (output) | The output type is the same as the input type; the postprocessor can customize the output. | Sorted in lexicographic order for TensorRT <= 9 |

### min()/max()

The input range is read from the TensorRT model.

### Postprocessing Extension

To facilitate batch postprocessing, TensorrtTensor provides a postprocessing extension point. The base class is:

```cpp
template <typename T = at::Tensor>
class PostProcessor {
 public:
  virtual bool init(const std::unordered_map<std::string, std::string>& /*config*/,
                    dict /*dict_config*/) {
    return true;
  }

  // Default behavior: split the batch dimension and write one result per input.
  virtual void forward(std::vector<T> net_outputs, std::vector<dict> inputs,
                       const std::vector<T>& /*net_inputs*/) {
    if (inputs.size() == 1) {
      if (net_outputs.size() == 1)
        (*inputs[0])[TASK_RESULT_KEY] = net_outputs[0];
      else
        (*inputs[0])[TASK_RESULT_KEY] = net_outputs;
      return;
    }
    for (std::size_t i = 0; i < inputs.size(); ++i) {
      std::vector<T> single_result;
      for (const auto& item : net_outputs) {
        single_result.push_back(item[i].unsqueeze(0));
      }
      if (single_result.size() == 1) {
        (*inputs[i])[TASK_RESULT_KEY] = single_result[0];  // single output: return a single value
      } else {
        (*inputs[i])[TASK_RESULT_KEY] = single_result;
      }
    }
  }

  virtual ~PostProcessor() = default;
};
```

After inheriting from PostProcessor<at::Tensor> and implementing the forward interface, you can compile it with AOT compilation and use it. Built-in postprocessors:

| Name | Functionality | Note |
|---|---|---|
| cpu | Copy data to CPU | |
| SoftmaxCpu | Apply softmax to a 2D tensor and copy the data to CPU | from v0.3.2b3 |
| SoftmaxMax | After applying softmax to a 2D tensor, take the maximum value and its corresponding index | from v0.3.2b3 |
### Reference Implementation

:::info SoftmaxCpu

```cpp
#include "prepost.hpp"

class BatchingPostProcSoftmaxCpu : public PostProcessor<at::Tensor> {
 public:
  void forward(std::vector<at::Tensor> net_outputs, std::vector<dict> input,
               const std::vector<at::Tensor>& net_inputs) override {
    for (auto& item : net_outputs) {
      if (item.dim() == 2) {
        item = item.softmax(1).cpu();  // implicit stream synchronization
      }
    }
    // Fall back to the default batch-splitting behavior.
    PostProcessor<at::Tensor>::forward(net_outputs, input, net_inputs);
  }
};

IPIPE_REGISTER(PostProcessor<at::Tensor>, BatchingPostProcSoftmaxCpu, "SoftmaxCpu");
```

:::

## LoadTensor

Loads tensors (.pt files) from disk; such files can be saved with torch.save(). To load an image instead, use S[LoadTensor, Tensor2Mat, SyncTensor].
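A hedged sketch; whether a placeholder `data` key is still required by the scheduler is an assumption:

```python
import torch
import torchpipe

torch.save(torch.rand(1, 3, 224, 224), "input.pt")  # any tensor saved via torch.save

pipe = torchpipe.pipe({"backend": "LoadTensor", "tensor_name": "input.pt"})
data = {"data": "placeholder"}  # assumption: the input dict still needs a data key
pipe(data)
tensor = data["result"]
```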