
ONNX model output boxes are all zeros. #1172

Closed
vandesa003 opened this issue May 13, 2020 · 39 comments
Labels: bug (Something isn't working)

Comments

@vandesa003

🐛 Bug

Hi @glenn-jocher , first of all, thanks again for your great work. I ran into this problem after training on my own dataset and converting the model to ONNX. When I run the ONNX model on a normal input image, I get output like this:

boxes:
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
...

The classes output looks normal, with values between 0 and 1.
This is the ONNX model (screenshot of the exported graph omitted), which I think should be correct.

To Reproduce

First, I set ONNX_EXPORT = True in models.py (the default is ONNX_EXPORT = False).
Then, due to a constraint in my machine environment, I had to use opset_version=9 in detect.py.
After this I converted the model to ONNX:

python detect.py --cfg yolov3-spp.cfg \
    --names data/mydataset.names \
    --weights weights/best.pt \
    --source data/samples \
    --conf-thres 0.3 \
    --iou-thres 0.6

I receive a warning during the conversion:

yolov3/utils/layers.py:60: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if nx == na:  # same shape

which I think is related to this line in layers.py:

if nx == na:  # same shape

but I am not sure whether this warning causes the issue.

I then run inference through onnxruntime and get a normal classes output but all-zero boxes.
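
For reference, this is roughly how I run the ONNX inference (the model path and the (320, 192) input size are placeholders for my actual setup):

# Minimal onnxruntime inference sketch; path and input size are placeholders.
import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession('weights/best.onnx')
img = np.random.rand(1, 3, 320, 192).astype(np.float32)  # NCHW dummy input
outputs = session.run(None, {session.get_inputs()[0].name: img})
for out in outputs:
    print(out.shape)  # classes come out normal, boxes come out all zeros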

Expected behavior

The boxes should definitely not be all zeros.

Environment

  • OS: Ubuntu 16.04
  • GPU: V100
vandesa003 added the bug label May 13, 2020
@vandesa003 (Author)

To make things clearer: I also tested with opset_version=11, but the output boxes are still all zeros. I am really confused about why this happens. I've been stuck here for nearly a week... any hints would be appreciated!

@glenn-jocher (Member)

@vandesa003 the warning is normal, but opset 9 export is unsupported, so you are on your own if you choose to pass that argument.

Recommend you retry with the latest versions of pytorch and onnx, and opset 10 or 11.
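
For reference, the opset is controlled by the opset_version argument of torch.onnx.export; a minimal sketch of the export call (the file name is a placeholder, and model/img are assumed to come from detect.py's ONNX_EXPORT branch):

# Hedged export sketch; 'model' and 'img' are the Darknet model and sample
# input from detect.py's ONNX_EXPORT branch, file name is a placeholder.
import torch

torch.onnx.export(model, img, 'weights/export.onnx',
                  verbose=False, opset_version=11)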

@glenn-jocher (Member)

@vandesa003 also make sure you are using the latest code when you convert: run git pull.

@vandesa003 (Author)

@vandesa003 the warning is normal, but opset 9 export is unsupported, so you are on your own if you choose to pass that argument.

Recommend you retry with the latest versions of pytorch and onnx, and opset 10 or 11.

Sure @glenn-jocher , I also tested with opset_version=11, but I still get all-zero boxes. Maybe I missed something related to the ONNX version or other environment dependencies. If this is a rare case, then it is probably an environment-related issue. Thanks for your reply; I am closing this issue.

@vandesa003 (Author)

Hi @glenn-jocher , I finally fixed the problem after pulling the new code. Thanks a lot! But I compared the box output from the PyTorch and ONNX models and found:

  1. PyTorch output:
tensor([[ 27.06218,  26.60020,  57.13964,  56.54650],
        [ 43.89460,  25.20766,  93.28602,  48.31281],
        [ 85.00977,  25.02055, 152.51794,  48.84522],
        ...,
        [395.99429, 409.55035,  48.99826,  17.14021],
        [402.78732, 409.71115,  35.31765,  18.58157],
        [410.01999, 410.44141,  21.80795,  18.26802]], device='cuda:0')
  2. ONNX output:
tensor([[0.0768, 0.0720, 0.0398, 0.0806],
        [0.0993, 0.0679, 0.1353, 0.0163],
        [0.1595, 0.0448, 0.5260, 0.0086],
        ...,
        [0.9423, 0.9808, 0.8031, 0.0051],
        [0.9615, 0.9808, 0.4474, 0.0102],
        [0.9808, 0.9808, 0.2708, 0.0243]])

Just wondering, are the ONNX box outputs normalized? Do I need to multiply by the image size?

vandesa003 reopened this May 16, 2020
@glenn-jocher (Member)

@vandesa003 yes, they are normalized. These are the requirements of the CoreML model in our iDetection app.

@vandesa003 (Author)

@vandesa003 yes, they are normalized. These are the requirements of the CoreML model in our iDetection app.

@glenn-jocher Oh, I see. I tried to restore the result by multiplying by the image size, but it is not exactly the same. How can I recover the exact result?

@glenn-jocher (Member)

@vandesa003 actually, looking at the code there are no normalization steps, so the outputs should be in pixel space. You can compare how the two outputs are handled here:

yolov3/models.py

Lines 196 to 217 in 3f27ef1

elif ONNX_EXPORT:
    # Avoid broadcasting for ANE operations
    m = self.na * self.nx * self.ny
    ng = 1. / self.ng.repeat(m, 1)
    grid = self.grid.repeat(1, self.na, 1, 1, 1).view(m, 2)
    anchor_wh = self.anchor_wh.repeat(1, 1, self.nx, self.ny, 1).view(m, 2) * ng

    p = p.view(m, self.no)
    xy = torch.sigmoid(p[:, 0:2]) + grid  # x, y
    wh = torch.exp(p[:, 2:4]) * anchor_wh  # width, height
    p_cls = torch.sigmoid(p[:, 4:5]) if self.nc == 1 else \
        torch.sigmoid(p[:, 5:self.no]) * torch.sigmoid(p[:, 4:5])  # conf
    return p_cls, xy * ng, wh

else:  # inference
    io = p.clone()  # inference output
    io[..., :2] = torch.sigmoid(io[..., :2]) + self.grid  # xy
    io[..., 2:4] = torch.exp(io[..., 2:4]) * self.anchor_wh  # wh yolo method
    io[..., :4] *= self.stride
    torch.sigmoid_(io[..., 4:])
    return io.view(bs, -1, self.no), p  # view [1, 3, 13, 13, 85] as [1, 507, 85]

@glenn-jocher (Member)

@vandesa003 ah yes, I was correct originally. 1/ng is normalizing it in grid space.
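
Concretely, a worked example (sizes here are illustrative, not taken from your model):

# For a 416x416 input and a stride-32 layer, the grid is 13x13, so
# ng = 1/13 and xy * ng is a fraction of the image; multiplying back by
# nx * stride recovers pixels.
nx, stride = 13, 32
x_grid = 4.7                  # sigmoid(tx) + grid_x, in grid units
x_norm = x_grid / nx          # what the ONNX branch returns (xy * ng)
x_pix = x_norm * nx * stride  # 4.7 * 32 = 150.4 pixels
print(x_norm, x_pix)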

@gasparramoa

gasparramoa commented May 20, 2020

First of all, thanks for your work.
I'm trying to turn your yolov3-tiny-1cls model into a TensorRT model for a Jetson Nano.

I successfully converted the model to an ONNX model (opset_version=10) and then to TensorRT.
The problem is the shape of the ONNX model's output.
With torch inference the prediction has shape (12096, 6),
while the TensorRT prediction has shapes (12096,) and (48384,), which together correspond to (12096, 5).

I just don't know how to use this prediction to draw the bounding boxes, etc.
In the torch approach you used the function:

def non_max_suppression(prediction, conf_thres=0.1, iou_thres=0.6, multi_label=True, classes=None, agnostic=False):
    """
    Performs  Non-Maximum Suppression on inference results
    Returns detections with shape:
        nx6 (x1, y1, x2, y2, conf, cls)
    """

The output (prediction) of the torch model, with shape:

(1, 12096, 6)
[[[     24.503       23.25      89.051      86.157  0.00035801     0.97608]
  [     49.759      28.121       102.1      55.264   0.0097554     0.98109]
  [      79.55      28.874      125.83      53.467    0.012427     0.98558]
  ...
  [     364.86      508.49      47.539      30.411  7.7272e-05     0.97495]
  [     372.76       508.1      46.075      27.851   7.985e-05     0.97406]
  [     380.88      508.44      41.789      28.087  7.4096e-05     0.97476]]]

The output (prediction) of the TensorRT/ONNX model, with shapes:

(12096,) #classes
(48384,) #boxes

[3.5801044e-04 9.7554326e-03 1.2427208e-02 ... 7.7271565e-05 7.9850302e-05 7.4095879e-05]
[0.06380871 0.04540921 0.23190494 ... 0.99304414 0.10882567 0.05485656]

I just want to know how to use these values to build the predictions of the model.
Thanks in advance.
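
Edit: for context, this is roughly how I am trying to decode the two buffers (my assumptions: row-major reshape, boxes normalized to the image, an arbitrary threshold; a real pipeline would still need NMS afterwards):

import numpy as np

def decode(classes_flat, boxes_flat, img_w, img_h, conf_thres=0.3):
    conf = classes_flat.reshape(-1)    # (12096,) per-box confidence
    boxes = boxes_flat.reshape(-1, 4)  # (12096, 4), assumed (xc, yc, w, h)
    keep = conf > conf_thres
    scale = np.array([img_w, img_h, img_w, img_h], dtype=np.float32)
    return boxes[keep] * scale, conf[keep]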

@glenn-jocher (Member)

@gasparramoa use Netron to view it.

@gasparramoa

gasparramoa commented May 20, 2020

@gasparramoa use Netron to view it.

I did; I just don't know how to use the result to build the predictions.
(Screenshot of the Netron graph omitted.)

Other details of the ONNX model:
(Screenshot omitted.)

@glenn-jocher (Member)

@gasparramoa the outputs are the boxes and the confidences (0-1) of each class (looks like you have a single-class model), you can see them right there in your screenshot. What else do you need?

@gasparramoa

So what I need to do is find the maximum confidence value and use the bounding box for that confidence. Am I right?

@glenn-jocher (Member)

@gasparramoa I can't advise you on this, if you want please open a new issue as this original issue is resolved.

@glenn-jocher (Member)

This issue has been resolved in a commit in early May 2020. If you are having this issue update your code with git pull or clone a new repo.

@vandesa003 (Author)

@gasparramoa as I said in a previous reply, the output of the ONNX model is normalized in grid space. You only need to undo the normalization, then everything is OK.

@vandesa003 (Author)

This issue has been resolved in a commit in early May 2020. If you are having this issue update your code with git pull or clone a new repo.

@glenn-jocher Sorry, I should have closed this issue. I have now restored the normalized values and I can use them! Thanks again for your great work; I learned a lot from your repo.

@gasparramoa

@gasparramoa as I said in a previous reply, the output of the ONNX model is normalized in grid space. You only need to undo the normalization, then everything is OK.

So I cannot restore the result by multiplying by the image size? What is grid-wise normalization? Can you give me a simple example of how to remove this normalization step, please?
Thanks in advance, I really mean it.

@vandesa003 (Author)

@gasparramoa as I said in a previous reply, the output of the ONNX model is normalized in grid space. You only need to undo the normalization, then everything is OK.

So I cannot restore the result by multiplying by the image size? What is grid-wise normalization? Can you give me a simple example of how to remove this normalization step, please?
Thanks in advance, I really mean it.

Just multiply by the self.stride.

        elif ONNX_EXPORT:
            # Avoid broadcasting for ANE operations
            m = self.na * self.nx * self.ny
            # ng = 1. / self.ng.repeat(m, 1)
            grid = self.grid.repeat(1, self.na, 1, 1, 1).view(m, 2)
            # anchor_wh = self.anchor_wh.repeat(1, 1, self.nx, self.ny, 1).view(m, 2) * ng
            anchor_wh = self.anchor_wh.repeat(1, 1, self.nx, self.ny, 1).view(m, 2)

            p = p.view(m, self.no)
            xy = torch.sigmoid(p[:, 0:2]) + grid  # x, y
            wh = torch.exp(p[:, 2:4]) * anchor_wh  # width, height
            p_cls = torch.sigmoid(p[:, 4:5]) if self.nc == 1 else \
                torch.sigmoid(p[:, 5:self.no]) * torch.sigmoid(p[:, 4:5])  # conf
            # return p_cls, xy * ng, wh
            return p_cls, xy * self.stride, wh * self.stride
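
To double-check the change, I compared the two outputs roughly like this (the path, input size, output ordering, and tolerance are placeholders from my setup; model is the Darknet model built with ONNX_EXPORT = True):

import numpy as np
import onnxruntime
import torch

img = torch.zeros((1, 3, 320, 192))
p_cls, xy, wh = model(img)  # PyTorch side, ONNX_EXPORT branch
sess = onnxruntime.InferenceSession('weights/export.onnx')
onnx_cls, onnx_xy, onnx_wh = sess.run(None, {sess.get_inputs()[0].name: img.numpy()})
print(np.allclose(xy.detach().numpy(), onnx_xy, atol=1e-4))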

@gasparramoa

Thank you @vandesa003 !!!
That was it!
Now I get exactly the same result from the torch model and the TensorRT model.

@marvision-ai

@gasparramoa Were you able to get the ONNX model into a TensorRT model? If so, did you use the onnx-tensorrt GitHub repo for that? If not, what tool did you use, and what is your performance?
I would like to look into this more, as I am very curious.

Thanks in advance!

@glenn-jocher (Member)

@gasparramoa @mbufi there's a request for TensorRT on our new repo as well. I personally don't have experience with it, but if you guys have time or suggestions that would be awesome.
ultralytics/yolov5#45

There is a TensorRT export that is already working for this repo as well:
https://github.com/wang-xinyu/tensorrtx/tree/master/yolov3-spp

@marvision-ai

@glenn-jocher Yes I saw! Thanks for the suggestion. I may look into it.

@prathik-naidu

prathik-naidu commented Jun 24, 2020

@vandesa003 @gasparramoa @glenn-jocher I'm running into some issues where the torch output and the ONNX model output do not match in the current version of the repo. Steps to reproduce:

  1. First, I set ONNX_EXPORT = True and ran detect.py to generate an ONNX file:

python detect.py --cfg ./cfg/yolov3-spp.cfg --weights weights/yolov3-spp.pt

  2. Then, I set ONNX_EXPORT = False and ran detect.py normally; as expected, the outputs look correct on the sample images.

  3. Then, I wanted to try running with onnxruntime using my new ONNX file. To do this, I replaced the pred = model(img, augment=opt.augment)[0] call in detect.py (so all the normal image preprocessing still runs) with the following:

import onnxruntime

session = onnxruntime.InferenceSession('weights/yolov3-spp.onnx')
in_img = {session.get_inputs()[0].name: img.numpy()}  # img from detect.py preprocessing
out = session.run(None, in_img)[0]

However, while debugging, I saw that the onnxruntime output and the PyTorch model output do not match:

PyTorch output (after running inference but before NMS):

tensor([[[1.89963e+01, 1.56430e+01, 2.04850e+02,  ..., 1.42084e-03, 1.65047e-03, 8.64788e-04],
         [4.88579e+01, 2.42638e+01, 1.55053e+02,  ..., 1.70676e-03, 1.44675e-03, 7.56415e-04],
         [8.29035e+01, 2.43567e+01, 1.74981e+02,  ..., 2.03217e-03, 1.57435e-03, 5.97334e-04],
         ...,
         [2.99602e+02, 1.88690e+02, 8.93882e+01,  ..., 1.16396e-03, 3.20018e-04, 2.71256e-04],
         [3.06881e+02, 1.88730e+02, 8.49935e+01,  ..., 2.39168e-03, 6.58945e-04, 7.91102e-04],
         [3.16741e+02, 1.88525e+02, 9.01153e+01,  ..., 1.65509e-03, 1.31225e-03, 1.66143e-03]]], grad_fn=<CatBackward>)

onnxruntime output (after running inference but before NMS):

array([[ 1.2182e-07,    2.07e-09,  6.1938e-09, ...,  2.7896e-09,  6.1711e-10,  2.6199e-10],
       [  6.162e-06,   5.007e-08,  5.9215e-08, ...,  4.0838e-08,  1.0652e-08,  2.6904e-09],
       [ 3.3448e-05,  2.9604e-07,  2.3664e-07, ...,  1.4082e-07,  5.1649e-08,  7.7577e-09],
       ...,
       [ 3.7442e-05,  4.1841e-07,  1.5349e-06, ...,  2.9282e-07,  1.8697e-08,  3.0575e-08],
       [  8.224e-06,  1.4772e-07,  6.8398e-07, ...,  1.1183e-07,  6.8377e-09,  1.1774e-08],
       [ 8.2141e-07,  2.1218e-08,  6.2566e-08, ...,  1.6764e-08,  2.4703e-09,   3.212e-09]], dtype=float32)

I added the fix from @vandesa003 (return p_cls, xy * self.stride, wh * self.stride) in models.py but I'm still getting this issue. Any ideas why this might be happening?

@glenn-jocher (Member)

glenn-jocher commented Jun 24, 2020

@prathik-naidu we offer model export to onnx and coreml as a service. If you are interested please send us a request via email.

@prathik-naidu

@glenn-jocher I'm just running the current open-source yolov3 code, given that it has ONNX export functionality. Does this not work? I'm using the default yolov3-spp.cfg and yolov3-spp.pt files from the repo but am still not able to match the outputs between PyTorch and onnxruntime.

@glenn-jocher (Member)

@prathik-naidu yes, there is limited export functionality available here! If you can get by with this, then great :)

@prathik-naidu

@glenn-jocher I see, so just to clarify: what is currently possible with the export functionality in this repo? It seems there is the capability to export to an ONNX file, but that ONNX file doesn't actually replicate the results of the PyTorch model. Is that expected?

Is there something that needs to be changed in this open-source code to get that working (not sure if I'm missing something), or does this functionality not exist?

@glenn-jocher (Member)

@prathik-naidu export works as intended here. If you need additional help we can provide it as a service.

@sky-fly97

@glenn-jocher I see, so just to clarify: what is currently possible with the export functionality in this repo? It seems there is the capability to export to an ONNX file, but that ONNX file doesn't actually replicate the results of the PyTorch model. Is that expected?

Is there something that needs to be changed in this open-source code to get that working (not sure if I'm missing something), or does this functionality not exist?

So can't we use the exported ONNX model normally? I wanted to call the exported ONNX model from C++ with OpenCV and then deploy the project with C++ inference. But if the prediction result of the ONNX model is not correct, does that mean the result of the subsequent deployment will also be incorrect?

@glenn-jocher (Member)

@sky-fly97 export operates correctly.

@sky-fly97

@sky-fly97 export operates correctly.

Oh, thank you! I saw that the person above said the output of the exported ONNX model is quite inconsistent with the original PyTorch model, which is why I asked. By the way, thank you very much for your work; it is really important.

@prathik-naidu

@sky-fly97 Let me know if you are able to get consistent results with your work. I'm still not able to figure out why the exported ONNX model generates different results from the PyTorch model (even on simple inputs like torch.ones). If export works correctly, I assume that means the model loaded from the ONNX file should also work?

@sky-fly97

@sky-fly97 Let me know if you are able to get consistent results with your work. I'm still not able to figure out why the exported ONNX model generates different results from the PyTorch model (even on simple inputs like torch.ones). If export works correctly, I assume that means the model loaded from the ONNX file should also work?

OK, I will try. I have another question: why does the ONNX model take (320, 192) as the input size? Does it matter?

@goldwater668

Thank you @vandesa003 !!!
That was it!
Now I get exactly the same result from the torch model and the TensorRT model.

Hello, I have also successfully converted the downloaded yolov3.weights to ONNX, but converting to TensorRT fails with the following error:

[TensorRT] ERROR: Network must have at least one output

Traceback (most recent call last):

context = engine.create_execution_context()

AttributeError: 'NoneType' object has no attribute 'create_execution_context'
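
For reference, the workaround I have seen suggested for this error (an assumption on my side, not verified on this exact setup) is to mark an output explicitly when the ONNX parser does not register one:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open('yolov3.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))  # surface parse errors instead of failing silently

if network.num_outputs == 0:  # the error means no outputs were registered
    last = network.get_layer(network.num_layers - 1)
    network.mark_output(last.get_output(0))

engine = builder.build_cuda_engine(network)  # returns None on failure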

@StanislasBertrand

@prathik-naidu , I have results similar to yours (ONNX prediction probabilities around 10e-7); have you figured it out?

@dengfenglai321

Hi @glenn-jocher , I finally fixed the problem after pulling the new code. Thanks a lot! But I compared the box output from the PyTorch and ONNX models and found:

  1. PyTorch output:

tensor([[ 27.06218,  26.60020,  57.13964,  56.54650],
        [ 43.89460,  25.20766,  93.28602,  48.31281],
        [ 85.00977,  25.02055, 152.51794,  48.84522],
        ...,
        [395.99429, 409.55035,  48.99826,  17.14021],
        [402.78732, 409.71115,  35.31765,  18.58157],
        [410.01999, 410.44141,  21.80795,  18.26802]], device='cuda:0')

  2. ONNX output:

tensor([[0.0768, 0.0720, 0.0398, 0.0806],
        [0.0993, 0.0679, 0.1353, 0.0163],
        [0.1595, 0.0448, 0.5260, 0.0086],
        ...,
        [0.9423, 0.9808, 0.8031, 0.0051],
        [0.9615, 0.9808, 0.4474, 0.0102],
        [0.9808, 0.9808, 0.2708, 0.0243]])

Just wondering, are the ONNX box outputs normalized? Do I need to multiply by the image size?

Hi, my ONNX model has two outputs:
box (10647, 4)
class (10647, 2)

How do I decode this output and get the final result? Could you give me some advice?
Did you manage to do it? Could you share your code?

Thanks!

@glenn-jocher (Member)

Hi @dengfenglai321! Great to hear that you've made progress after updating the code. Regarding the differences in box output between PyTorch and ONNX: the ONNX output appears to be normalized, so depending on your requirements you may need to multiply the ONNX box outputs by the image size to obtain the final result. For decoding the output, the documentation at https://docs.ultralytics.com may offer guidance. Please feel free to reach out if you have any more questions or need further assistance. Good luck with your project!
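
As a rough illustration only (not this repo's exact post-processing; the box format, normalization, and threshold below are assumptions, and NMS still needs to run afterwards), decoding outputs shaped like yours could look like this:

import numpy as np

def decode(boxes, scores, img_w, img_h, conf_thres=0.3):
    # boxes: (10647, 4), scores: (10647, 2)
    cls_id = scores.argmax(axis=1)  # best class per box
    conf = scores.max(axis=1)       # confidence of that class
    keep = conf > conf_thres
    scale = np.array([img_w, img_h, img_w, img_h], dtype=np.float32)
    return boxes[keep] * scale, conf[keep], cls_id[keep]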
