
How to fuse the results of different scales? #37

Closed
liqikai9 opened this issue Jan 26, 2022 · 10 comments

Comments

@liqikai9

Thanks for your great work. I wonder: during inference, how do you combine the results of the 4 different output grids? Is there some special fusion step? Looking forward to your reply.

@wmcnally
Owner

wmcnally commented Jan 26, 2022

Thanks! There’s no grid fusion. Just like in YOLO, all the grids are passed to the NMS function. Each grid point represents a unique prediction. Larger objects are usually detected in the smaller grids.
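To make this concrete, here is a minimal sketch of pooling detections from multiple scales into a single greedy IoU-based NMS pass. This is illustrative only, not KAPAO's actual code; the function names and the toy detections are assumptions.

```python
# Illustrative only: concatenate detections from all output grids,
# then run one greedy IoU-based NMS pass over the pooled list.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(dets, iou_thres=0.45):
    """dets: list of (box, score). NMS over all scales at once."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    keep = []
    for box, score in dets:
        if all(iou(box, k[0]) < iou_thres for k in keep):
            keep.append((box, score))
    return keep

# Predictions from different grids are simply pooled together:
dets_p3 = [((10, 10, 50, 50), 0.9)]        # from the stride-8 grid
dets_p5 = [((12, 12, 52, 52), 0.6)]        # overlapping box from stride 32
kept = nms(dets_p3 + dets_p5)              # the duplicate is suppressed
```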

@liqikai9
Author

I see. And did you do an ablation study on these scales? How does the number of scales affect the results?

@wmcnally
Owner

Personally, I did not. But I asked the same question in the ultralytics/yolov5 repo and was informed that an evolutionary algorithm was used to find the optimal configuration. However, the optimization was performed for object detection, not human pose estimation.

@liqikai9
Author

liqikai9 commented Jan 26, 2022

> Larger objects are usually detected in the smaller grids.

I also wonder how the conclusion stated in your paper is derived: "The receptive field of an output grid increases with s, so smaller output grids are better suited for detecting larger objects."

liqikai9 reopened this Jan 26, 2022
@wmcnally
Owner

wmcnally commented Jan 26, 2022

> I also wonder how the conclusion stated in your paper is derived: "The receptive field of an output grid increases with s, so smaller output grids are better suited for detecting larger objects."

If you’re asking why the receptive field increases: that’s how CNNs work - the deeper you go, the larger the effective receptive field. If you’re asking why larger receptive fields are better for large objects: it’s because larger receptive fields can “see” a larger portion of the input image. The anchor boxes were also defined such that larger anchors are used with smaller grids. Again, none of this has been optimized for human pose estimation specifically, just object detection.
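To make the grid/stride relationship concrete, here is the simple arithmetic for a 640x640 input (the input size is an assumption for illustration; the cell counts follow directly from the strides):

```python
# For a 640x640 input, the output grid at stride s has (640 // s)**2 cells.
# Larger stride -> fewer, coarser cells, each covering more of the image.
img = 640
grid_cells = {s: (img // s) ** 2 for s in (8, 16, 32, 64)}
for s, n in grid_cells.items():
    side = img // s
    print(f"stride {s:>2}: {side}x{side} grid = {n} cells")
```

So the stride-8 grid makes 6400 predictions per anchor while the stride-64 grid makes only 100, and each of those 100 cells corresponds to a much larger patch of the input image.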

@liqikai9
Author

I see. I didn't know much detail about YOLOv5 before, nor that the anchor boxes were defined such that larger anchors are used with smaller grids.

Could you explain the meaning of the following lines? What exactly does [19,27, 44,40, 38,94] represent?

anchors:
- [19,27, 44,40, 38,94] # P3/8
- [96,68, 86,152, 180,137] # P4/16
- [140,301, 303,264, 238,542] # P5/32
- [436,615, 739,380, 925,792] # P6/64

Again, thanks for your quick reply!

@wmcnally
Owner

Those are the widths and heights (in pixels) of the three anchor boxes used with the largest output grid (1/8th the size of the original image).
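As a concrete reading of the YAML above: each pair is a (width, height) in input-image pixels, with three anchors per output grid. YOLOv5-style code divides the anchors by the stride so they are expressed in grid-cell units; this sketch assumes that convention and is illustrative, not KAPAO's actual code:

```python
# Anchor (w, h) pairs in input-image pixels, keyed by the grid's stride.
anchors = {
    8:  [(19, 27), (44, 40), (38, 94)],        # P3/8
    16: [(96, 68), (86, 152), (180, 137)],     # P4/16
    32: [(140, 301), (303, 264), (238, 542)],  # P5/32
    64: [(436, 615), (739, 380), (925, 792)],  # P6/64
}

# YOLOv5-style: divide by the stride to get anchors in grid-cell units.
anchors_grid = {s: [(w / s, h / s) for w, h in a] for s, a in anchors.items()}
```

Note how the pixel sizes grow with the stride: the smallest anchors are assigned to the stride-8 grid and the largest to the stride-64 grid.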

@liqikai9
Author

> Those are the widths and heights (in pixels) of the three anchor boxes used with the largest output grid (1/8th the size of the original image).

I see. And I wonder where the center point of each anchor box is. Is it the center of the grid cell by default?

@wmcnally
Owner

No, each grid point predicts an offset for the center of a box! I suggest going through Section 3 of the paper in more detail to get a better understanding!

@liqikai9
Author

> No, each grid point predicts an offset for the center of a box! I suggest going through Section 3 of the paper in more detail to get a better understanding!

I will go through this part in detail. Thanks.
