
How to modify Detect layer to allow for converting yolov5 to Qualcomm's SNPE format? #4790

Closed
evdoks opened this issue Sep 14, 2021 · 79 comments
Labels: question (Further information is requested), Stale

Comments

@evdoks

evdoks commented Sep 14, 2021

❔Question

I am trying to convert a trained yolov5s model to an SNPE format in order to be able to run it on a Snapdragon chip. Unfortunately, Qualcomm's ONNX to SNPE converter fails on the Detect level with the following error message

ValueError: Unable to permute shape [1, 3, 64, 64, 2] to NSC ordering
2021-09-14 15:15:37,327 - 183 - ERROR - Node Mul_268: Unable to permute shape [1, 3, 64, 64, 2] to NSC ordering

I imagine it may have something to do with the fact that SNPE currently supports 4D input data, where the first dimension is batch (see the SNPE docs), while the yolov5 Detect layer has a 5D reshape.

Would it be possible to modify Detect layer so that no 5D reshape is performed?

@evdoks evdoks added the question Further information is requested label Sep 14, 2021
@github-actions
Contributor

github-actions bot commented Sep 14, 2021

👋 Hello @evdoks, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

@evdoks I haven't used the SNPE converter myself, so I can't help directly, but I do see Qualcomm compatibility with YOLOv5 officially mentioned here in the Snapdragon Neural Engine SDK release notes from March 2021:
https://developer.qualcomm.com/sites/default/files/docs/snpe/revision_history.html

[Screenshot: SNPE release notes listing YOLOv5 support]

@evdoks
Author

evdoks commented Sep 15, 2021

@glenn-jocher Thanks for the link, I saw this and this is why I was hoping that the conversion should work.

However, I was not able to find anyone who could successfully do it on a trained yolo model and there are questions on Qualcomm's dev forum from people hitting the same wall:
[Screenshot: unanswered YOLOv5 questions on Qualcomm's dev forum]

The conversion works if one removes the Detect layer (using the --train flag in your export.py script), but then the model is not of much use.

@glenn-jocher
Member

@evdoks I think you are not understanding --train. All models export with all layers; there are no circumstances in which export omits the Detect layer.

@evdoks
Author

evdoks commented Sep 15, 2021

@glenn-jocher, you are right, I expressed it incorrectly. My understanding is that when using the --train flag, the exported ONNX model is in training mode, and I am not quite sure how I can use it for making inferences. At least in my case, the ONNX model stops making predictions if exported with --train, which is not the case if no training mode is set.

@glenn-jocher
Member

@evdoks yes in --train mode the grid for inference output is not constructed (as it is not needed for loss computation), so there's something isolated in that area that is causing the issue. The 5D reshape is still present in --train mode though on L55, so it's probably not the source of the problem. You might try turning self.inplace on or off to see if it has an effect.

yolov5/models/yolo.py

Lines 50 to 71 in b74dd4b

def forward(self, x):
z = [] # inference output
for i in range(self.nl):
x[i] = self.m[i](x[i]) # conv
bs, _, ny, nx = x[i].shape # x(bs,255,20,20) to x(bs,3,20,20,85)
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
if not self.training: # inference
if self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:
self.grid[i] = self._make_grid(nx, ny).to(x[i].device)
y = x[i].sigmoid()
if self.inplace:
y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i] # xy
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
else: # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i] # xy
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(1, self.na, 1, 1, 2) # wh
y = torch.cat((xy, wh, y[..., 4:]), -1)
z.append(y.view(bs, -1, self.no))
return x if self.training else (torch.cat(z, 1), x)

@evdoks
Author

evdoks commented Sep 15, 2021

@glenn-jocher thanks for looking into it, but it didn't help: neither exporting the model to ONNX with inplace enabled, nor training the .pt model with inplace toggled on and off in the YAML file and exporting it to ONNX afterward, made a difference.

Qualcomm's dev forum seems to be a dead place - some people have already posted questions there regarding yolov5 compatibility but got no response.

@github-actions
Contributor

github-actions bot commented Oct 16, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@jayer95

jayer95 commented Nov 18, 2021

@evdoks
Hi, I am also following this issue. Have you made any progress with YOLOv5 on SNPE?

@evdoks
Author

evdoks commented Nov 18, 2021

@jayer95 unfortunately not. Switched to ResNet (which totally sucks). Let us know here if you get any breakthroughs. Qualcomm keeps updating the converter, but I haven't noticed anything in the latest release notes that could be relevant to the YOLO issue.

@jayer95

jayer95 commented Nov 18, 2021

@evdoks
Thank you for your reply. I am experimenting with this closely. I have successfully converted YOLOv5 to .dlc, but I currently have no way to verify whether the model works.

I don't quite understand the "--train" parameter proposed by the YOLOv5 author to avoid the 5D layers in the exported model.

@glenn-jocher
Can I ask your opinion?

@glenn-jocher
Member

glenn-jocher commented Nov 18, 2021

@evdoks Detect() does not have a self.train parameter. It has a self.training attribute that returns the grids when training, or the sigmoid predictions during inference.

yolov5/models/yolo.py

Lines 50 to 71 in 562191f

def forward(self, x):
z = [] # inference output
for i in range(self.nl):
x[i] = self.m[i](x[i]) # conv
bs, _, ny, nx = x[i].shape # x(bs,255,20,20) to x(bs,3,20,20,85)
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
if not self.training: # inference
if self.onnx_dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)
y = x[i].sigmoid()
if self.inplace:
y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i] # xy
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
else: # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i] # xy
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
y = torch.cat((xy, wh, y[..., 4:]), -1)
z.append(y.view(bs, -1, self.no))
return x if self.training else (torch.cat(z, 1), x)

@jayer95

jayer95 commented Nov 18, 2021

@glenn-jocher
Can you help convert the DLC of SNPE? You must know better than us!!!

@glenn-jocher
Member

@jayer95 sorry, I don't actually know what DLC is. We have a mobile developer who's working on our upcoming HUB app, but for Android there we are using established TFLite export workflows and not yet targeting specific backends. What's the benefit of going this export route? Does it provide better access to NNAPI or Hexagon delegates?

If the main issue is simply the 5D nature of the tensors in Detect, there's certainly a workaround: handle the reshaping/permutation ops differently. You'd want to create a fused x-y grid (1D rather than 2D), likewise create the offsets/gains you need in 1D rather than 2D, and then your tensor would be 4D (batch, anchors, xy, outputs).
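To make that concrete, here is a hedged sketch (my illustration, not the repo's code) of a Detect-style head that stays 4D by fusing the ny and nx grid dimensions into one axis; all names and shapes here are assumptions:

import torch

# Hedged sketch of the fused-grid idea: keep every tensor 4D by collapsing
# the ny x nx grid into a single axis of length ny*nx.
def forward_head_4d(x, na, no, stride, anchors):
    # x: (bs, na*no, ny, nx) raw conv output of one detection head
    bs, _, ny, nx = x.shape
    x = x.view(bs, na, no, ny * nx).permute(0, 1, 3, 2)  # (bs, na, ny*nx, no), still 4D
    yv, xv = torch.meshgrid(torch.arange(ny), torch.arange(nx), indexing="ij")
    grid = torch.stack((xv, yv), 2).view(1, 1, ny * nx, 2).float()  # fused 1D grid
    anchor_grid = (anchors * stride).view(1, na, 1, 2)  # anchors: (na, 2) tensor
    y = x.sigmoid()
    xy = (y[..., 0:2] * 2 - 0.5 + grid) * stride
    wh = (y[..., 2:4] * 2) ** 2 * anchor_grid
    return torch.cat((xy, wh, y[..., 4:]), -1)  # (bs, na, ny*nx, no)

Flattening the result to (bs, na * ny * nx, no) afterwards would reproduce the usual (1, 25200, 85) prediction tensor.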

@hansoullee20

hansoullee20 commented Dec 20, 2021

@evdoks @jayer95 It is possible to convert yolov5 to .dlc format. You'd need to use the v3.1 yolov5s and specify the output nodes as the convolution layer outputs before the 5D reshape. Check out the SNPE release notes, page 20. The exact text is as below (a sketch for locating these nodes follows the list):

• Export the pre-trained YOLO v3/v5 to ONNX

  1. Follow the official export tutorial to obtain the ONNX model:
     https://docs.ultralytics.com/yolov5/tutorials/model_export
  2. Simplify the exported ONNX model with onnx-simplifier:
     https://github.com/daquexian/onnx-simplifier
  3. Conversion: specify output nodes before the 5D Reshape
     • Example for YOLOv5:
       snpe-onnx-to-dlc -i yolov5s.onnx --out_node 742 --out_node 762 --out_node 782
     • Example for YOLOv3:
       snpe-onnx-to-dlc -i yolov3.onnx --out_node 332 --out_node 352 --out_node 372
     • 5D ops and postprocessing need to be handled outside the model
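The node ids above (742/762/782) vary between exports; as a sketch of how one might locate them for a given model, assuming the onnx Python package (this is not part of the SNPE docs):

import onnx

# Hedged helper: list Conv outputs that feed a Reshape (the 5D reshape in
# Detect), i.e. candidate --out_node values for snpe-onnx-to-dlc.
model = onnx.load("yolov5s.onnx")  # hypothetical path to your export
producer = {out: n for n in model.graph.node for out in n.output}

for node in model.graph.node:
    if node.op_type == "Reshape":
        parent = producer.get(node.input[0])
        if parent is not None and parent.op_type == "Conv":
            print(parent.name, "->", parent.output[0])  # tensor id for --out_node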

I came as far as getting the 4D output in NativeCpp settings but made zero progress on extracting inferences. Has anyone made any progress?

@wwxzxd

wwxzxd commented Dec 22, 2021

@evdoks Thank you for your reply. I am experimenting with this closely. I have successfully converted YOLOv5 to .dlc, but I currently have no way to verify whether the model works.

I don't quite understand the "--train" parameter proposed by the YOLOv5 author to avoid the 5D layers.

@glenn-jocher May I ask your opinion?

Hello! May I ask how you converted the YOLOv5 .pt model file into a .dlc? Thank you very much.

@glenn-jocher
Member

@wwxzxd sorry what is dlc?

@jayer95

jayer95 commented Dec 24, 2021

@glenn-jocher
Member

glenn-jocher commented Dec 24, 2021

@jayer95 got it, thanks!

@wwxzxd @jayer95 @evdoks The main step we could take here would be to add official Snapdragon DLC export support to export.py. We currently support 10 different model formats, and there is a system in place for export and inference with each. From TFLite, ONNX, CoreML, TensorRT Export #251:

Formats

YOLOv5 export is supported for the following formats

Format                  Example                  --include argument
PyTorch                 yolov5s.pt               -
TorchScript             yolov5s.torchscript      torchscript
ONNX                    yolov5s.onnx             onnx
CoreML                  yolov5s.mlmodel          coreml
OpenVINO                yolov5s_openvino_model/  openvino
TensorFlow SavedModel   yolov5s_saved_model/     saved_model
TensorFlow GraphDef     yolov5s.pb               pb
TensorFlow Lite         yolov5s.tflite           tflite
TensorFlow.js           yolov5s_web_model/       tfjs
TensorRT                yolov5s.engine           engine

The fastest and easiest way to incorporate your ideas into the official codebase is to submit a Pull Request (PR) implementing your idea, and if applicable providing before and after profiling/inference/training results to help us understand the improvement your feature provides. This allows us to directly see the changes in the code and to understand how they affect workflows and performance.

Please see our ✅ Contributing Guide to get started. Thank you!

@hansoullee20

@glenn-jocher thank you for your reply. I'm somewhat relieved to know I'm not alone in this search. The models are converted to .dlc format via the SNPE tools (https://developer.qualcomm.com/sites/default/files/docs/snpe/tools.html).
So far SNPE supports conversion from 6 frameworks (TensorFlow, TFLite, ONNX, PyTorch, Caffe, and Caffe2).

I've tried to convert the yolov5.pb model by exporting to ONNX and TensorFlow formats. The issue arises when the converters reach the following line in yolo.py:
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

It seems SNPE has a problem converting the permute function. Could this line be rewritten without using permute? I think as long as we get past this part, we will have the dlc model.

@zhiqwang
Contributor

zhiqwang commented Dec 29, 2021

Could this line get rewritten without using the permute function? I think as long as we pass this part, we will have the dlc model.

FYI @hansoullee20, I guess one workaround is just to remove this line in the Detect module when exporting the ONNX model for the SNPE backend (also setting --train), and to perform this operation on the SNPE side.

yolov5/models/yolo.py

Lines 53 to 54 in db6ec66

bs, _, ny, nx = x[i].shape # x(bs,255,20,20) to x(bs,3,20,20,85)
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

@hansoullee20

@zhiqwang
but as far as I understand, the --train option in export removes the Detect layer, right? Wouldn't that be pointless, since the Detect layer will then need to be processed outside of the dlc?

@zhiqwang
Contributor

zhiqwang commented Dec 29, 2021

but as far as I understand, the --train option in export removes the detect layer right? Wouldn't that be pointless since the detect layer will then need to be processed outside of dlc?

Seems that it will remove the

yolov5/models/yolo.py

Lines 56 to 68 in db6ec66

if not self.training: # inference
if self.onnx_dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)
y = x[i].sigmoid()
if self.inplace:
y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i] # xy
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
else: # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i] # xy
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
y = torch.cat((xy, wh, y[..., 4:]), -1)
z.append(y.view(bs, -1, self.no))

only, and return a list containing the 3 intermediate heads if you set --train like the following

python export.py --weights path/to/your/model.pt --include onnx --simplify --train

And if SNPE doesn't support the permute op, it is unlikely to support the torch.meshgrid call used in _make_grid at line 58 above.
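For reference, a hedged onnxruntime snippet (the model path is an assumption) to sanity-check what a --train export returns, namely the three intermediate heads:

import numpy as np
import onnxruntime as ort

# Inspect the three intermediate heads of a --train export; file name is hypothetical.
sess = ort.InferenceSession("yolov5s-train.onnx")
dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)  # letterbox-sized dummy input
outs = sess.run(None, {sess.get_inputs()[0].name: dummy})
for o in outs:
    print(o.shape)  # expect (1,3,80,80,85), (1,3,40,40,85), (1,3,20,20,85)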

@hansoullee20

So, good news! Seems like yolov5 is now compatible with SNPE!
Pull from the master branch, export to onnx, and convert to dlc without specifying out_node.
Would appreciate any inputs on how to proceed from here in SNPE :)

@jayer95

jayer95 commented Jan 10, 2022

At present, YOLOv5 v6.0 can be converted for SNPE correctly.

onnx==1.6.0
onnx-simplifier==0.3.6
onnxoptimizer==0.2.6
onnxruntime==1.1.0
scikit-learn==0.19.2
numpy==1.19.5
protobuf==3.17.3
torch==1.10.0

git clone https://github.com/ultralytics/yolov5.git
cd yolov5
git checkout v6.0

python export.py --weights yolov5n.pt --optimize --opset 11 --simplify

Please use Netron to view the exported yolov5n.onnx; you will find that the layers just above the 5D output nodes are the 4D Conv nodes Conv_198, Conv_232, Conv_266, whose output tensors are 326, 379, 432, so we need to specify these 3 output nodes when converting yolov5n.dlc.

But at present, a program is still needed to demo the converted yolov5n.dlc.
The most important thing is that the inference program must contain yolov5's "letterbox" preprocessing algorithm, to ensure that the same letterboxing used in training is also used at inference.
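For reference, a minimal letterbox sketch (simplified from yolov5's utils/augmentations.py; the defaults here are assumptions):

import cv2
import numpy as np

def letterbox(im, new_shape=(640, 640), color=(114, 114, 114)):
    # Resize with preserved aspect ratio, then pad to new_shape with gray borders.
    h, w = im.shape[:2]
    r = min(new_shape[0] / h, new_shape[1] / w)  # scale ratio
    nh, nw = round(h * r), round(w * r)          # unpadded size
    im = cv2.resize(im, (nw, nh), interpolation=cv2.INTER_LINEAR)
    top = (new_shape[0] - nh) // 2
    left = (new_shape[1] - nw) // 2
    out = np.full((new_shape[0], new_shape[1], 3), color, dtype=im.dtype)
    out[top:top + nh, left:left + nw] = im
    return out, r, (left, top)  # ratio and padding are needed to map boxes back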

@zhiqwang
Contributor

Hi @jayer95 ,

Please use Netron to view the exported yolov5n.onnx, you will find that the previous layer of reshape into 5D output is: Conv_198, Conv_232, Conv_266, then the output nodes are: 326, 379, 432, so we need to specify these 3 output nodes when converting yolov5n.dlc.

I have a question here: it seems the anchor decoding part of Detect shown below will not be applied if we specify the output nodes as 326, 379, 432?

yolov5/models/yolo.py

Lines 57 to 68 in 6865d19

if self.onnx_dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)
y = x[i].sigmoid()
if self.inplace:
y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i] # xy
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
else: # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i] # xy
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
y = torch.cat((xy, wh, y[..., 4:]), -1)
z.append(y.view(bs, -1, self.no))

@jayer95

jayer95 commented Jan 10, 2022

Hi @zhiqwang ,

Is the reason why you converted to yolov5n.dlc because you want to load yolov5.dlc on the SNPE SDK and output the post-processing result?

The correct conversion steps should be as follows:
yolov5n.pt --> yolov5n.onnx --> yolov5n.dlc

When we converted yolov5n.onnx to yolov5n.dlc, we specified 3 output nodes: 326, 379, 432 (Conv_198, Conv_232, Conv_266), as shown below,

[Screenshot: Netron view of the specified 4D output nodes]

For the 3 output nodes Conv_198, Conv_232, Conv_266 and the 4D output format specified by SNPE, please refer to:
https://developer.qualcomm.com/sites/default/files/docs/snpe//image_input.html

SNPE 4D image output format is:
batch_size * grid_size * grid_size * (3 * (box_size + score_size + class_size))

batch_size=1, box_size=4, score_size=1, class_size=80

Conv_266 node: 1x20x20x255 (grid_size=20)
Conv_232 node: 1x40x40x255 (grid_size=40)
Conv_198 node: 1x80x80x255 (grid_size=80)
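To illustrate that layout, a hedged numpy sketch (the file name is an assumption) of unpacking one such head into per-anchor predictions:

import numpy as np

# Unpack the Conv_266 head, 1x20x20x255 (NHWC), into (anchors, grid, grid, values).
raw = np.fromfile("Conv_266.raw", dtype=np.float32)  # hypothetical SNPE output buffer
head = raw.reshape(1, 20, 20, 3, 85)       # 255 channels = 3 anchors x 85 values
head = head.transpose(0, 3, 1, 2, 4)[0]    # -> (3, 20, 20, 85)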

At this time, it has been converted to yolov5n.dlc. The post-processing program for parsing yolov5n.dlc should be developed in C++ on the SNPE SDK or QCS devices. It has nothing to do with the post-processing of "yolov5/models/yolo.py".

I'm using SNPE SDK 1.58 (the latest version at present). When converting yolov5n.dlc, I use "snpe-onnx-to-dlc" under the x86 architecture for the model conversion, and "snpe-dlc-info" to view the model architecture of yolov5n.dlc.

Hi @glenn-jocher ,
Let's discuss the yolov5 dlc model supported by Qualcomm SNPE.

@fwzdev1

fwzdev1 commented Mar 3, 2022

@fwzdev1

Congratulations on your successful conversion, the rest is how to parse yolov5.dlc on SNPE SDK.

Thank you for your sharing. @jayer95

I was stuck using the DSP/AIP to run the network because of the unsupported 5-dimensional reshape operation, and the speed on the CPU (100+ ms) is totally unacceptable. After asking Qualcomm staff, the bad news came that there is no option but to change the YOLOv5 network, especially the detect head part, which is a little bit tricky.

@JISHNUSHAJI

I have converted the yolov5 model to dlc; now I have to do the 5D reshape and other post-processing outside the model. Could someone share the code for post-processing from the 5D reshape onwards?

@eeyzl5

eeyzl5 commented Jun 7, 2022

@JISHNUSHAJI @glenn-jocher @fwzdev1

Hi all, just to share my recent exploration of running yolov5 with SNPE.

I am using SNPE v1.62 and yolov5 release v6.1. My task is to detect a custom class of very small objects, typically 10x10 pixels. The model I chose was yolov5s with the default 640x640 input size, but I think other models are also compatible.

Least Modification

Since the main issue of running yolov5 with SNPE is the unsupported 5D reshape operation, simply changing the 5D reshape to 4D can solve the problem. For example, the detection head using a 1x3x85x20x20 reshape is unacceptable to SNPE, but is acceptable after changing it to a 3x85x20x20 reshape. In a word, just eliminate the batch size.

The modification in the Detect() module in models/yolo.py:

In forward(), simply delete bs and change the permute index

# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()  # original
x[i] = x[i].view(self.na, self.no, ny, nx).permute(0, 2, 3, 1).contiguous()  # modified

In _make_grid, also delete the batchsize part of all 5d tensors

# grid = torch.stack((xv, yv), 2).expand((1, self.na, ny, nx, 2)).float()  #original
grid = torch.stack((xv, yv), 2).expand((self.na, ny, nx, 2)).float()  #modified
# anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((1, self.na, 1, 1, 2)).expand((1, self.na, ny, nx, 2)).float()    #original 
anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((self.na, 1, 1, 2)).expand((self.na, ny, nx, 2)).float()  #modified

No other modification is needed: directly convert the original pt model to onnx then dlc, with no need to specify the out_nodes. SNPE is able to compute the entire network including the operations inside the Detect() layer, so there is no need to reimplement the detection part outside the model. Just apply confidence selection and NMS and you get the bounding boxes.

Both CPU and DSP runtimes can execute this network without raising any error. However, only the CPU will give you correct output. Precision is affected significantly by 8-bit quantization when using the DSP, mainly caused by the operations in the Detect() layer as far as I can tell.

Running with DSP

If you just need to run with the default CPU then the above solution may be the simplest one. But I believe most of us choose SNPE for the acceleration by DSP/AIP, so reimplementing the detection part is unavoidable.

The modification in the Detect module is mainly to comment out this code.

# if self.inplace:
#     y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
#     xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
#     y = torch.cat((xy, wh, y[..., 4:5]), -1)

Also remember to include the change of 5d reshape to 4d discussed above.

The final output shape will still be consistent with the original model: a 3D tensor (effectively 2D), 1x25200x85 for the default yolov5s model. This output tensor can be obtained from the DSP/AIP runtime with acceleration and no large precision drop. Then we use the CPU to parse this output by performing exactly the same operations that we commented out.

Since the output from SNPE is always 1D, a single for loop is enough to do the parsing. Example code is shown below; it is written in Java but easy to convert to C++ etc.

float conf = 0.3;
float[] values = new float[tensorOut.getSize()];
tensorOut.read(values, 0, values.length);
for (int c=4;c<values.length;c+=85) {
    if (values[c]>=conf) {
        float cx = values[c-4];
        float cy = values[c-3];
        float w = values[c-2];
        float h = values[c-1];
        int gridX, gridY;
        int anchor_gridX, anchor_gridY;
        int[] anchorX = {10,16,33,30,62,59,116,156,373};
        int[] anchorY = {13,30,23,61,45,119,90,198,326};
        int[] num_filters = {19200,4800,1200};
        int[] filter_size = {80,40,20};
        int stride;
        int ci = (int)(c/85);
        if (ci<num_filters[0]) {
            gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0];
            gridY = (int)((ci%(filter_size[0]*filter_size[0]))/filter_size[0]);
            anchor_gridX = anchorX[((int)(ci/(filter_size[0]*filter_size[0])))];
            anchor_gridY = anchorY[((int)(ci/(filter_size[0]*filter_size[0])))];
            stride = 8;
        } else if (ci>=num_filters[0]&&ci<num_filters[0]+num_filters[1]) {
            gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1];
            gridY = (int)(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1]);
            anchor_gridX = anchorX[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
            anchor_gridY = anchorY[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
            stride = 16;
        } else {
            gridX = ((ci-num_filters[1])%(filter_size[2]*filter_size[2]))%filter_size[2];
            gridY = (int)(((ci-num_filters[1])%(filter_size[2]*filter_size[2]))/filter_size[2]);
            anchor_gridX = anchorX[(int)((ci-num_filters[1])/(filter_size[2]*filter_size[2]))+6];
            anchor_gridY = anchorY[(int)((ci-num_filters[1])/(filter_size[2]*filter_size[2]))+6];
            stride = 32;
        }
        cx = (float)(cx*2-0.5+gridX)*stride;
        cy = (float)(cy*2-0.5+gridY)*stride;
        w = w*2*w*2*anchor_gridX;
        h = h*2*h*2*anchor_gridY;
        float left = cx-w/2;
        float top = cy-h/2;
        float right = cx+w/2;
        float bottom = cy+h/2;
        float obj_conf = values[c];
    }
}

The locations of the bounding boxes are represented by left, top, right, bottom, and the confidence by obj_conf; these can be provided to NMS functions to get clean boxes. The parsing of class confidence and class index is not provided here because it is not relevant to my task, but it could easily be extracted with some sort of max and argmax functions.
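For completeness, a minimal pure-numpy NMS sketch (my addition, not the thread's Java pipeline) that consumes such left/top/right/bottom boxes and their confidences:

import numpy as np

def nms(boxes, scores, iou_thres=0.45):
    # boxes: (N, 4) as [left, top, right, bottom]; returns indices of kept boxes.
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thres]  # drop overlapping boxes
    return keep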

In my Android app, running the net with the DSP takes about 120 ms on a Snapdragon 870 platform, compared to 500+ ms using the CPU, and the accuracy is nearly the same. However, this speed is still a bit slow for real-time tasks, probably because I was using the SNPE Java SDK instead of C++. Further optimization can still be made to achieve faster speeds.

Optimization for SNPE

Looking into the recently released yolov5 models, the activation layer used after each convolution is nn.SiLU(). However, neither ONNX nor SNPE supports the SiLU activation directly; it is split into separate Sigmoid and Multiply operations. This means SNPE currently does not optimize the execution of SiLU layers, which apparently slows down the entire network; there are 50+ activation layers in the yolov5s model!
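For context, SiLU(x) = x * sigmoid(x), which is why exporters decompose it into a Sigmoid and a Mul node; a one-line torch check:

import torch
import torch.nn as nn

x = torch.randn(4)
assert torch.allclose(nn.SiLU()(x), x * torch.sigmoid(x))  # SiLU = x * sigmoid(x)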

Simply change the SiLU activations to the commonly used LeakyReLU, which is optimized by SNPE, by modifying the base Conv() module in models/common.py:

# self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  # original
self.act = nn.LeakyReLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  #modified

Re-training is required and the original checkpoint cannot be used. Each epoch trains faster, but it takes more epochs for the model to converge. Switching to LeakyReLU activations results in slightly lower mAP but faster execution, so it is a performance trade-off.

For my specific task of detecting a single class of small objects, I pruned the other two detection heads for medium and large objects. In addition, I select only the first 5 columns so as to output just the x,y,w,h,conf results, so the output shape becomes 1x19200x5 instead of 1x25200x85. This further speeds up the network execution as well as the detection post-processing.

After all these optimizations, the final execution time on the DSP drops to 25 ms (almost real time) on the same 870 device. Precision is also not much affected, although the model has less robustness and stability than the original yolov5s. If your main concern is speed then apply these optimizations to your model; otherwise just use the original one.

Even faster execution may be achieved by switching to C++ and the yolov5n model.

Good luck!

@glenn-jocher
Member

@eeyzl5 awesome, thanks for the detailed feedback!

@JISHNUSHAJI

@eeyzl5 thanks for the detailed explanation

@hansoullee20

@eeyzl5

Thank you for sharing the details with us.

I am also trying to use the DSP on an embedded system. I followed your advice and made the modifications in yolo.py, but I am unable to run the train.py script. When I run train.py, following your instructions up to the "Running with DSP" section, I get the following error.

 Epoch   gpu_mem       box       obj       cls    labels  img_size

0%| | 0/44 [00:02<?, ?it/s]
Traceback (most recent call last):
File "./yolov5/train.py", line 643, in
main(opt)
File "./yolov5/train.py", line 539, in main
train(opt.hyp, opt, device, callbacks)
File "./yolov5/train.py", line 330, in train
pred = model(imgs) # forward
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 434, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/tmp/yolov5/models/yolo.py", line 128, in forward
return self._forward_once(x, profile, visualize) # single-scale inference, train
File "/root/tmp/yolov5/models/yolo.py", line 151, in _forward_once
x = m(x) # run
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/tmp/yolov5/models/yolo.py", line 55, in forward
x[i] = x[i].view(self.na, self.no, ny, nx).permute(0, 2, 3, 1).contiguous() # modified
RuntimeError: shape '[3, 6, 80, 80]' is invalid for input of size 921600

Have you made any other modifications you haven't shared with us?

Thank you for reading.

@eeyzl5

eeyzl5 commented Jul 21, 2022

(quoting @hansoullee20's comment and traceback above)

@hansoullee20

Hi, just a reminder that the modifications are only for the deployment step, after you've already got a trained model and wish to export it to an SNPE-compatible format. So keep the original code while training, then apply the modifications when you are about to export to ONNX format (refer to this), and then export to dlc format.

@hansoullee20

@eeyzl5

Thank you very much for your comment. We were able to implement the model on the device and execute it on the DSP. However, we are encountering a serious issue where nothing gets detected afterwards. Have you had similar issues in the past?

When we run the model on the CPU, it seems to detect something, but accuracy and speed are still highly compromised.

Any recommendations would be much appreciated. Thank you in advance.

@ravineti

Hi - is there any reference implementation for integrating YOLOv5 into an Android app using SNPE?

We are able to successfully convert the DLC and run it on a Snapdragon device using the ARM CPU, GPU, and DSP runtimes. However, we are looking for any pre/post-processing reference code in JNI or Java.

@eeyzl5

eeyzl5 commented Aug 8, 2022

(quoting @hansoullee20's comment above)

@hansoullee20

Hi, I was able to get correct detections. If you run with the DSP, please refer to the "Running with DSP" section in my comment above; otherwise you may not get correct results, especially if you don't do the post-processing on the CPU. Post-processing includes all the operations after the 5D reshape. Again, you may refer to my sample code. My suggestion is to start with the default official model and compare the raw output values from the PC and from your SNPE device.

@rszeto-sy

I ran into a problem running the code from @eeyzl5's detailed answer (linked here for brevity) but found a likely solution. There's a bug in the "Running with DSP" section where, in the final else branch, the grid location and anchors are not set correctly. The posted version only subtracts num_filters[1] from ci, whereas it should subtract (num_filters[1] + num_filters[0]) so that all grid locations and anchors are sampled correctly. This is what the final branch should look like:

gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2];
gridY = (int)(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2]);
anchor_gridX = anchorX[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
anchor_gridY = anchorY[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
stride = 32;

This is what it looks like inside the entire code snippet:

float conf = 0.3;
float[] values = new float[tensorOut.getSize()];
tensorOut.read(values, 0, values.length);
for (int c=4;c<values.length;c+=85) {
    if (values[c]>=conf) {
        float cx = values[c-4];
        float cy = values[c-3];
        float w = values[c-2];
        float h = values[c-1];
        int gridX, gridY;
        int anchor_gridX, anchor_gridY;
        int[] anchorX = {10,16,33,30,62,59,116,156,373};
        int[] anchorY = {13,30,23,61,45,119,90,198,326};
        int[] num_filters = {19200,4800,1200};
        int[] filter_size = {80,40,20};
        int stride;
        int ci = (int)(c/85);
        if (ci<num_filters[0]) {
            gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0];
            gridY = (int)((ci%(filter_size[0]*filter_size[0]))/filter_size[0]);
            anchor_gridX = anchorX[((int)(ci/(filter_size[0]*filter_size[0])))];
            anchor_gridY = anchorY[((int)(ci/(filter_size[0]*filter_size[0])))];
            stride = 8;
        } else if (ci>=num_filters[0]&&ci<num_filters[0]+num_filters[1]) {
            gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1];
            gridY = (int)(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1]);
            anchor_gridX = anchorX[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
            anchor_gridY = anchorY[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
            stride = 16;
        } else {
            gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2];
            gridY = (int)(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2]);
            anchor_gridX = anchorX[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
            anchor_gridY = anchorY[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
            stride = 32;
        }
        cx = (float)(cx*2-0.5+gridX)*stride;
        cy = (float)(cy*2-0.5+gridY)*stride;
        w = w*2*w*2*anchor_gridX;
        h = h*2*h*2*anchor_gridY;
        float left = cx-w/2;
        float top = cy-h/2;
        float right = cx+w/2;
        float bottom = cy+h/2;
        float obj_conf = values[c];
    }
}

And since I happened to need this in Python, here's that too in case it's useful (it returns a copy instead of operating in-place):

def postprocess_raw_output(
        values,
        anchorX=[10,16,33,30,62,59,116,156,373],
        anchorY=[13,30,23,61,45,119,90,198,326],
        num_filters=[19200,4800,1200],
        filter_size=[80,40,20],
        last_dim_size=85
    ):

    ret = values.copy()

    for c in range(4, values.size, last_dim_size):
        cx = values[c-4]
        cy = values[c-3]
        w = values[c-2]
        h = values[c-1]

        ci = int(c / last_dim_size)
        if ci < num_filters[0]:
            gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0]
            gridY = int((ci%(filter_size[0]*filter_size[0]))/filter_size[0])
            anchor_gridX = anchorX[int(ci/(filter_size[0]*filter_size[0]))]
            anchor_gridY = anchorY[int(ci/(filter_size[0]*filter_size[0]))]
            stride = 8
        elif ci>=num_filters[0] and ci<(num_filters[0]+num_filters[1]):
            gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1]
            gridY = int(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1])
            anchor_gridX = anchorX[int((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3]
            anchor_gridY = anchorY[int((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3]
            stride = 16
        else:
            gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2]
            gridY = int(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2])
            anchor_gridX = anchorX[int((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6]
            anchor_gridY = anchorY[int((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6]
            stride = 32

        cx = float(cx*2-0.5+gridX)*stride
        cy = float(cy*2-0.5+gridY)*stride
        w = w*2*w*2*anchor_gridX
        h = h*2*h*2*anchor_gridY
        ret[c-4:c] = [cx, cy, w, h]

    return ret
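A hypothetical usage sketch (the buffer name, threshold, and 85-wide row layout are assumptions) showing how the returned copy might be reshaped and thresholded before NMS:

import numpy as np

values = np.fromfile("output.raw", dtype=np.float32)       # hypothetical raw SNPE output
decoded = postprocess_raw_output(values).reshape(-1, 85)   # rows of [cx, cy, w, h, conf, ...]
dets = decoded[decoded[:, 4] >= 0.3]                       # objectness threshold
# convert cx,cy,w,h to corner format, then feed boxes and scores to an NMS routine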

Hopefully this is of use particularly to @hansoullee20.

@fwzdev1

fwzdev1 commented Oct 13, 2022

(quoting @eeyzl5's detailed SNPE write-up above)

Thank you so much for such a detailed reply, or rather a report!

I was doing exactly the same thing as you, detecting a single class of objects smaller than 10x10. What I did to the network is almost the same as you explained, except that instead of changing SiLU to LeakyReLU I made it ReLU. I've tested yolov5n with backbone scale = 0.2 or 0.3, and the network took about 8-10 ms on the DSP (Snapdragon 855, 640x640). I finally chose NanoDet-Plus with the same custom changes for more convenient pre- and post-processing code (the nanodet repository contains official pre/post-processing code for SNPE).

But there is another problem. Network inference takes about 10 ms, which is acceptable. When it comes to pre- and post-processing, things change: it was about 3 ms for pre and 4 ms for post, so the total time was over 17 ms for a single image.
Compared with the CPU (1 ms pre, 1 ms post), the DSP took 2-3 ms more on data transfer and computation (in OpenCV).

I wonder whether you were bothered by this issue or not.

Finally, great appreciation for your sharing!

@saadwaraich1

Hi all,

I convert yolov5n.pt --imgsz 320 to yolov5n.onnx --imgsz 192 320 (the letterbox concept), and then use SNPE 1.58 to convert to yolov5n.dlc --imgsz 192 320 (the letterbox concept).

I use "gst-launch-1.0 / qtimlesnpe " to parse yolov5n.dlc to demo, the detection effect is very good, close to lossless conversion.

[Screenshot: detection demo]

video: https://drive.google.com/file/d/1-eEi8dkh_3mLxd3CEnRpqPFLJJ4G5FPH/view?usp=sharing

Hey, thanks for the help. I am able to convert the way you mentioned. I am trying to run a demo using gstreamer and qtimlesnpe. I can see the model running, as the pipeline takes some time, but there are no bounding boxes on the video. I have seen this behavior before, and reverting to a previous version of libqtioverlay.so solved the issue; not sure how, but it worked. @jayer95 @Mohit-Ak Any idea how I can deal with this, or could you check which version of the libqtioverlay.so library was used on your end?
Thanks
Thanks

@jayer95

jayer95 commented Dec 4, 2022

@saadwaraich1
Hi, thank you for your reply.

The code of libqtioverlay.so and the other qtimlesnpe plugins needs to be rewritten and rebuilt; the main thing is writing the code that parses YOLOv5's 4D output format.

Are you a Qualcomm chip buyer? Please contact Qualcomm's customer support directly and ask Qualcomm's technical staff by raising a case.

@wofvh

wofvh commented May 11, 2023

So, good news! Seems like yolov5 is now compatible with SNPE! Pull from the master branch, export to onnx, and convert to dlc without specifying out_node. Would appreciate any inputs on how to proceed from here in SNPE :)
@hansoullee20 Hi, I'm also trying to run YOLOv5 on the SNPE SDK. May I email you?

@hansoullee20

hansoullee20 commented May 11, 2023 via email

@glenn-jocher
Member

@wofvh Hello,

Thank you for the information. As I understand it, YOLOv5 is now compatible with SNPE and the issue appears to be resolved. I would appreciate it if you could share additional details on how to proceed.

Thank you.

@wofvh

wofvh commented May 11, 2023

(Quoting @hansoullee20's reply via email:) Hello, how far have you gotten? Is there any part of the thread above that you don't understand?

@hansoullee20 Thank you so much for the reply. Following the Qualcomm tutorial on Ubuntu 18.04 x86_64, I have converted the Inception v3 model to DLC and quantized it to reduce its size. I now have YOLOv5 .onnx weights, and I'd like to know how to convert them to DLC with the SNPE SDK and run them on a Qualcomm RB5 (rather than a Snapdragon phone). Any help would be greatly appreciated!

@glenn-jocher
Member

@hansoullee20 Hello,

Good news: it appears that YOLOv5 is now compatible with SNPE and the issue is resolved. However, since I have not run through it myself, it is difficult for me to give precise guidance on how to proceed.

If you need additional information, analyzing the code or consulting the related documentation would be a good approach. If you have more specific questions, feel free to ask anytime.

Thank you.

@hansoullee20

hansoullee20 commented May 11, 2023 via email

@wofvh

wofvh commented May 11, 2023

@hansoullee20 Using the basic method on the Qualcomm SDK page, I converted the inception_v3 model from 5D to 4D and quantized it, but I don't know how to convert the YOLOv5 weights to a DLC file and quantize them. I'm fairly new to this area, so I would really appreciate an easy explanation. The ONNX file has already been converted. Thank you.

Currently: Ubuntu 18.04, Python 3.6.9.

@glenn-jocher
Member

Hello @wofvh,

I have been following the Qualcomm SDK page and was able to convert the Inception_v3 model from 5D to 4D and perform quantization successfully. However, I am having difficulty with converting YOLOv5 weights to a dlc file and then performing quantization. As I am relatively new to this topic, I was hoping you could provide some guidance on how to proceed with these steps in a relatively easy-to-understand manner. Currently, the onnx file has been converted already. Thank you for your help.

Best,

@aleshem

aleshem commented Mar 6, 2024

Hi Glenn,
I managed to convert to dlc and run after quantization; I put my changes in this fork: yolov5_snpe_conversion.
However, I have been having some trouble quantizing the network using real images: the results look worse than with zeros as the input bin files.
Does anyone have experience with this and know what good practices for this conversion would be?
Thanks

@glenn-jocher
Member

@aleshem hi there,

It's great to hear that you've managed to convert to dlc and run after quantization. Regarding the issues you're facing with quantizing the network using real images, it's not uncommon to encounter challenges during this process. Quantization can sometimes lead to a degradation in model performance, especially if the quantization process or the selection of calibration images is not optimal.

Here are a few general tips that might help improve the quantization results:

  1. Calibration Dataset: Ensure that the dataset used for calibration is representative of the actual use case and diverse enough to cover various scenarios the model will encounter.
  2. Quantization Strategy: Experiment with different quantization strategies. For instance, symmetric vs. asymmetric quantization, per-channel vs. per-tensor quantization, and so on.
  3. Model Fine-tuning: After quantization, it might be beneficial to fine-tune the model with a small learning rate for a few epochs to regain some of the lost accuracy.
  4. Quantization-aware Training: If possible, consider quantization-aware training, where the model is trained with simulated quantization, making it more robust to the effects of quantization.

Unfortunately, without more specific details, it's challenging to provide more targeted advice. I recommend reviewing the documentation and resources available for the specific quantization tools you're using, as they might offer insights or best practices specific to their methodology.
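As one concrete example of point 1, a hedged sketch (paths, input size, and normalization are all assumptions) of preparing float32 raw calibration inputs of the kind snpe-dlc-quantize consumes via --input_list:

from pathlib import Path
import cv2
import numpy as np

# Hedged sketch: save calibration images as float32 .raw tensors plus an
# input-list file for snpe-dlc-quantize --input_list. All paths are hypothetical.
out_dir = Path("calib_raw"); out_dir.mkdir(exist_ok=True)
with open("input_list.txt", "w") as f:
    for img_path in sorted(Path("calib_images").glob("*.jpg")):
        im = cv2.imread(str(img_path))
        im = cv2.resize(im, (640, 640))  # ideally use the same letterbox as training
        im = im[:, :, ::-1].astype(np.float32) / 255.0  # BGR -> RGB, scale to 0-1 as in yolov5
        raw_path = out_dir / (img_path.stem + ".raw")
        im.tofile(raw_path)  # NHWC float32 raw tensor
        f.write(f"{raw_path}\n")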

Remember, the community and forums dedicated to the specific tools or frameworks you're using can also be valuable resources for advice and troubleshooting.

Best of luck with your quantization efforts, and feel free to reach out if you have more specific questions or issues.

Best regards.

@BaoHaoo
Copy link

BaoHaoo commented Apr 14, 2024

@JISHNUSHAJI @glenn-jocher @fwzdev1

Hi all, just to share my recent exploration of running yolov5 with SNPE.

I am using SNPE v1.62 and yolov5 release v6.1. My task is to detect a custom class of very small objects, typically 10x10 pixels. The model I chose was yolov5s with the default 640x640 input size, but I think other models are also compatible.

Least Modification

Since the main issue with running yolov5 on SNPE is the unsupported 5d reshape operation, simply changing the 5d reshape to 4d solves the problem. For example, one of the detection heads uses a 1x3x85x20x20 reshape, which SNPE rejects, but a 3x85x20x20 reshape is accepted. In a word, just eliminate the batch dimension.

The modification in the Detect() module in models/yolo.py:

In forward(), simply delete bs and change the permute indices:

# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()  # original
x[i] = x[i].view(self.na, self.no, ny, nx).permute(0, 2, 3, 1).contiguous()  # modified

In _make_grid(), also delete the batch dimension of all 5d tensors:

# grid = torch.stack((xv, yv), 2).expand((1, self.na, ny, nx, 2)).float()  #original
grid = torch.stack((xv, yv), 2).expand((self.na, ny, nx, 2)).float()  #modified
# anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((1, self.na, 1, 1, 2)).expand((1, self.na, ny, nx, 2)).float()    #original 
anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((self.na, 1, 1, 2)).expand((self.na, ny, nx, 2)).float()  #modified

No other modification is needed: directly convert the original pt model to onnx and then to dlc, without needing to specify the out_nodes. SNPE is able to compute the entire network, including the operations inside the Detect() layer, so there is no need to reimplement the detection part outside the model. Just apply confidence selection and NMS and you get the bounding boxes (see the sketch after this paragraph).
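
If it helps, here is a minimal post-processing sketch, assuming the default 1x25200x85 output layout; postprocess is an illustrative helper name, not something from the repo:

import torch
from torchvision.ops import nms

def postprocess(pred, conf_thres=0.25, iou_thres=0.45):
    # pred: (1, 25200, 85) in (cx, cy, w, h, obj_conf, 80 class scores) layout
    p = pred[0][pred[0, :, 4] >= conf_thres]          # confidence selection
    xy, wh = p[:, :2], p[:, 2:4]
    boxes = torch.cat((xy - wh / 2, xy + wh / 2), 1)  # xywh -> xyxy corners
    keep = nms(boxes, p[:, 4], iou_thres)             # class-agnostic NMS
    return boxes[keep], p[keep, 4]

boxes, scores = postprocess(torch.rand(1, 25200, 85))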

Both CPU and DSP runtimes can execute this network without raising any error. However, it will only give you correct output on the CPU. Precision is significantly affected by 8-bit quantization on the DSP, mainly, as far as I can tell, because of the operations inside the Detect() layer.

Running with DSP

If you just need to run on the default CPU, the above solution may be the simplest one. But I believe most of us choose SNPE for the DSP/AIP acceleration, so reimplementing the detection part is unavoidable.

The modification in the Detect module mainly consists of commenting out this code:

# if self.inplace:
#     y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
#     xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
#     y = torch.cat((xy, wh, y[..., 4:5]), -1)

Also remember to include the change of 5d reshape to 4d discussed above.

The final output shape is still consistent with the original model: a 3d tensor (effectively 2d) of 1x25200x85 for the default yolov5s model. This output tensor can be obtained from the DSP/AIP runtime with acceleration and without a large precision drop. We then use the CPU to parse the output, performing exactly the same operations that were commented out.

Since the output from SNPE is always 1d, a single for loop is enough to do the parsing. Example code is shown below; it is written in Java but is easy to port to C++ etc.

float conf = 0.3f;  // objectness threshold
float[] values = new float[tensorOut.getSize()];
tensorOut.read(values, 0, values.length);
for (int c=4;c<values.length;c+=85) {
    if (values[c]>=conf) {
        float cx = values[c-4];
        float cy = values[c-3];
        float w = values[c-2];
        float h = values[c-1];
        int gridX, gridY;
        int anchor_gridX, anchor_gridY;
        int[] anchorX = {10,16,33,30,62,59,116,156,373};
        int[] anchorY = {13,30,23,61,45,119,90,198,326};
        int[] num_filters = {19200,4800,1200};  // predictions per head: 3*80*80, 3*40*40, 3*20*20
        int[] filter_size = {80,40,20};
        int stride;
        int ci = c/85;  // row index of this prediction
        if (ci<num_filters[0]) {  // head 1, stride 8
            gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0];
            gridY = (ci%(filter_size[0]*filter_size[0]))/filter_size[0];
            anchor_gridX = anchorX[ci/(filter_size[0]*filter_size[0])];
            anchor_gridY = anchorY[ci/(filter_size[0]*filter_size[0])];
            stride = 8;
        } else if (ci<num_filters[0]+num_filters[1]) {  // head 2, stride 16
            int cj = ci-num_filters[0];
            gridX = (cj%(filter_size[1]*filter_size[1]))%filter_size[1];
            gridY = (cj%(filter_size[1]*filter_size[1]))/filter_size[1];
            anchor_gridX = anchorX[cj/(filter_size[1]*filter_size[1])+3];
            anchor_gridY = anchorY[cj/(filter_size[1]*filter_size[1])+3];
            stride = 16;
        } else {  // head 3, stride 32; offset past both previous heads
            int cj = ci-num_filters[0]-num_filters[1];
            gridX = (cj%(filter_size[2]*filter_size[2]))%filter_size[2];
            gridY = (cj%(filter_size[2]*filter_size[2]))/filter_size[2];
            anchor_gridX = anchorX[cj/(filter_size[2]*filter_size[2])+6];
            anchor_gridY = anchorY[cj/(filter_size[2]*filter_size[2])+6];
            stride = 32;
        }
        cx = (float)(cx*2-0.5+gridX)*stride;
        cy = (float)(cy*2-0.5+gridY)*stride;
        w = w*2*w*2*anchor_gridX;
        h = h*2*h*2*anchor_gridY;
        float left = cx-w/2;
        float top = cy-h/2;
        float right = cx+w/2;
        float bottom = cy+h/2;
        float obj_conf = values[c];
    }
}

The locations of the bounding boxes are represented by left, top, right, bottom, and the confidence by obj_conf; these can be passed to an NMS function to get clean boxes. The parsing of the class confidence and class index is not shown here because it is not relevant to my task, but it can easily be extracted with max and argmax functions (see the sketch after this paragraph).
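
For completeness, here is a small NumPy sketch of that class extraction, assuming the same 85-value row layout as above; best_class is an illustrative helper name:

import numpy as np

def best_class(row):
    # row: one 85-value prediction (cx, cy, w, h, obj_conf, 80 class scores)
    cls_scores = row[5:]
    cls_id = int(np.argmax(cls_scores))         # argmax -> class index
    score = float(row[4] * cls_scores[cls_id])  # obj_conf * class confidence
    return cls_id, score

cls_id, score = best_class(np.random.rand(85))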

In my Android app, running the net on the DSP takes about 120ms on a Snapdragon 870 platform compared to 500+ms on the CPU, and the accuracy is nearly the same. However, this is still a bit slow for real-time tasks, probably because I was using the SNPE Java SDK instead of C++. Further optimization can be made to achieve faster speeds.

Optimization for SNPE

Looking into the recently released yolov5 models, the activation layer used after each convolution is nn.SiLU(). However, neither onnx nor SNPE supports the SiLU activation directly; it is split into separate Sigmoid and Multiplication operations. This means SNPE currently does not optimize the execution of SiLU layers, which noticeably slows down the entire network; there are 50+ activation layers in the yolov5s model!

Simply change the SiLU activations to the commonly used LeakyReLU, which SNPE does optimize, by modifying the base Conv() module in models/common.py:

# self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  # original
self.act = nn.LeakyReLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  #modified

Re-training is required and the original checkpoint cannot be used. Each epoch trains faster, but more epochs are needed for the model to converge. Switching to LeakyReLU activations results in a slightly lower mAP but faster execution, so it is a performance trade-off.
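
To check that the swap actually took effect in the exported graph, one quick sanity check is an op census over the ONNX model; a sketch, where the file path is an example:

from collections import Counter
import onnx

model = onnx.load("yolov5s.onnx")  # example path
print(Counter(node.op_type for node in model.graph.node))
# SiLU shows up as paired Sigmoid/Mul nodes; after the swap and re-export,
# LeakyRelu nodes should appear instead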

For my specific task of detecting a single class of small objects, I pruned the other two detection heads for medium and large objects. In addition, I select only the first 5 columns so the model outputs just the x, y, w, h, conf results, and the output shape becomes 1x19200x5 instead of 1x25200x85. This further speeds up both the network execution and the detection post-processing (see the sketch after this paragraph).
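
The column selection itself is just a slice; a minimal sketch, shown here on the output tensor for illustration (the change described above is applied inside the network before export):

import torch

def slim_output(pred):
    # keep only the x, y, w, h, obj_conf columns: (1, 19200, 85) -> (1, 19200, 5)
    return pred[..., :5]

print(slim_output(torch.rand(1, 19200, 85)).shape)  # torch.Size([1, 19200, 5])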

After all these optimizations, the final execution time on the DSP drops to 25ms (almost real time) on the same 870 device. Precision is not much affected, although the model is less robust and stable than the original yolov5s. If your main concern is speed, apply these optimizations to your model; otherwise just use the original one.

Even faster execution times may be achieved by switching to C++ and the yolov5n model.

Good luck!

Hello, thank you very much for your detailed report. Based on your work, I further tested the quantized YOLOv5 model running on the Qualcomm SNPE DSP. I found that in the final output tensor the object detection boxes could be detected properly, but strangely their confidence scores (y[..., 4:6] in the code below) were very low. Combining this with SNPE's quantization tool snpe-dlc-quantize, I speculate that this is because SNPE adopts a rather basic method of post-training quantization: Q = round((FP32 - min) / scale_factor) + zero_point.

if self.inplace:
    y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
    xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
    y = torch.cat((xy, wh, y[..., 4:5]), -1)

Before this code block, xy, wh, and conf are tensors ranging from 0 to 1, so quantizing them does not cause much loss of accuracy. After this block, however, xy is scaled to a range approximately equal to the image size and wh to the target size in pixels, while only the confidence remains within the range of 0 to 1.

After this code block, all these variables, xy, wh, and conf, are concatenated into one tensor. In SNPE, one tensor shares one set of quantization parameters, so during quantization the scale is determined by the component with the largest value range, which here is xy. That scale destroys the resolution of the smaller-range variables, and the confidence tends to zero; this is why there is no output after quantization. For example, for a 640x640 image with Int8 quantization, the scale factor is 640/256 = 2.5, which is larger than the entire range of conf. This is the fundamental reason for the significant loss of quantization accuracy (see the numeric illustration after this paragraph).
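
A tiny numeric illustration of that effect, under the assumptions above (256 quantization levels spread over a 0-640 range):

import numpy as np

scale = 640 / 256                  # xy dominates the tensor range -> scale = 2.5
conf = np.array([0.2, 0.5, 0.9])   # raw confidences
q = np.round(conf / scale)         # quantized levels: all collapse to 0
print(q * scale)                   # [0. 0. 0.] -> every confidence is lost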

To address this issue, I employed a very naive strategy: multiply the confidence scores (conf) by a coefficient so that they scale to the same range as xy and wh. This prevents excessive loss of precision during quantization. After the final model output, I divide by the same coefficient to recover normal detection confidences.

if self.inplace:
    y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
    xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
    conf = y[..., 4:6] * 1280  # 1280 is the max length of the input size, e.g. the input image width
    y = torch.cat((xy, wh, conf), -1)

By using this method, adjustments only need to be made to the confidence scores at the end, without adding code elsewhere. Through testing, I found that the quantization accuracy loss from this approach is acceptable (see the rescaling sketch after this paragraph).
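
On the host side, undoing the scaling is a one-line division; a minimal NumPy sketch, where unscale_conf is an illustrative helper and the column range and coefficient must match whatever was baked into the exported model:

import numpy as np

def unscale_conf(out, coeff=1280.0):
    # divide the scaled confidence columns back down after inference
    out = out.copy()
    out[..., 4:] /= coeff
    return out

preds = unscale_conf(np.random.rand(1, 25200, 6))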

@glenn-jocher
Copy link
Member

Hello,

Thanks for sharing your exploration and solution to the quantization accuracy loss issue when running YOLOv5 with SNPE. It's insightful to see how adjusting the range of the confidence scores can help mitigate precision loss due to quantization. Your approach to scale the confidence scores to be in line with other tensor ranges and then scaling back for final output is a clever workaround. This strategy could be beneficial for others facing similar quantization challenges.

It's always exciting to see community members contributing novel solutions to complex problems. Keep up the good work, and thank you for contributing to the broader knowledge base around YOLOv5 and SNPE!

@aleshem
Copy link

aleshem commented Apr 15, 2024

BTW, I managed to convert yolov8 using SNPE 2.10 (opset=12 in export) without any major changes. In case your chip supports this version, it may save you a lot of time (see the export sketch after this paragraph).
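
For reference, a minimal export sketch assuming the ultralytics package; the model name is an example:

from ultralytics import YOLO

YOLO("yolov8n.pt").export(format="onnx", opset=12)  # opset 12, matching SNPE 2.10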

@glenn-jocher
Copy link
Member

Hey there!

That's fantastic news! 🎉 It's always great to hear about smooth conversions, especially with newer versions like YOLOv8 and SNPE 2.10. Your sharing could indeed save a lot of time for many in the community working on similar projects. If there are specific steps or minor tweaks that helped you along the way, feel free to drop those details. Every bit helps! Thanks for sharing, and happy coding! 👍

@aleshem
Copy link

aleshem commented Apr 16, 2024

The major problem at the moment is that quantization doesn't work well for yolov8-pose; for some reason it ruins the confidence.

@glenn-jocher
Copy link
Member

Hello!

Thanks for reaching out about the quantization issue with YOLOv8-pose. It's not uncommon for quantization to affect model confidence, as precision loss can significantly impact the network's output. 🤔

A potential approach is to experiment with different quantization techniques or tools that might offer better control over precision loss. Considering calibration datasets that closely represent your use case might also help mitigate this issue. It's all about finding the right balance for your specific scenario.

If this doesn't resolve the problem, could you share more details about the quantization method you're using? This info might provide further insights for troubleshooting.

Best regards!
