How to fuse the results of different scales? #37
Thanks! There’s no grid fusion. Just like in YOLO, all the grids are passed to the NMS function. Each grid point represents a unique prediction. Larger objects are usually detected in the smaller grids.
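To make the "no fusion, just NMS" point concrete, here is a minimal sketch of combining detections from several output grids. The function names (`iou`, `nms_over_grids`) and the `(x1, y1, x2, y2, conf)` layout are my own assumptions for illustration, not the repository's actual code; the repo itself uses the YOLOv5 NMS implementation.

```python
import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms_over_grids(grid_outputs, iou_thres=0.45):
    # Each element of grid_outputs is an (n, 5) array of (x1, y1, x2, y2, conf)
    # predictions from one output grid. There is no per-grid fusion step:
    # everything is concatenated and standard NMS runs over the pool.
    dets = np.concatenate(grid_outputs, axis=0)
    dets = dets[dets[:, 4].argsort()[::-1]]  # sort by confidence, descending
    keep = []
    while len(dets) > 0:
        keep.append(dets[0])
        if len(dets) == 1:
            break
        dets = dets[1:][iou(dets[0], dets[1:]) < iou_thres]
    return np.stack(keep)
```

Because the pool is shared, a confident prediction from one grid suppresses overlapping, lower-confidence predictions from any other grid.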
I see. And did you do an ablation study on these scales? How does the number of scales affect the results?
Personally, I did not. But I asked the same question in ultralytics/yolov5, and I was informed that an evolutionary algorithm was used to find the optimal configuration. However, the optimization was performed for object detection, not human pose estimation.
I also wonder how you derive the conclusion stated in your paper: the receptive field of an output grid increases with s, so smaller output grids are better suited for detecting larger objects.
If you’re asking why the receptive field increases: that’s how CNNs work - the deeper you go, the larger the effective receptive field. If you’re asking why larger receptive fields are better for large objects: it’s because larger receptive fields can “see” a larger portion of the input image. The anchor boxes were also defined such that larger anchors are used with smaller grids. Again, none of this has been optimized for human pose estimation specifically, just object detection.
I see. I didn't know much detail about yolov5 before, nor that the anchor boxes were defined such that larger anchors are used with smaller grids. Could you explain the meaning of the following lines? What exactly does [19,27, 44,40, 38,94] represent? Lines 7 to 11 in f7bd62d
Again, thanks for your quick reply!
Those are the widths and heights (in pixels) of the three anchor boxes used with the largest output grid (1/8th the size of the original image).
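In other words, the flat list is just interleaved (width, height) pairs. A quick sketch of grouping them (the variable names `anchors_p3` and `wh_pairs` are mine, chosen for illustration):

```python
# Flat anchor list for the stride-8 (largest) output grid, as quoted above.
anchors_p3 = [19, 27, 44, 40, 38, 94]

# Group into (width, height) pairs, in pixels of the input image.
wh_pairs = list(zip(anchors_p3[0::2], anchors_p3[1::2]))
print(wh_pairs)  # [(19, 27), (44, 40), (38, 94)]
```

So this grid uses three anchors: 19x27, 44x40, and 38x94 pixels.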
I see. And I wonder where the center point of the anchor boxes is. Is it the center of the grid cell by default?
No, each grid point predicts an offset for the center of a box! I suggest going through Section 3 of the paper in more detail to get a better understanding!
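For reference, this is roughly how YOLOv5-style decoding turns a raw offset into an image-space box center. The function name `decode_center` and the argument names are my own; treat this as a sketch of the idea rather than the repository's exact implementation.

```python
import math

def decode_center(tx, ty, gx, gy, stride):
    # The raw network outputs (tx, ty) pass through a sigmoid, are rescaled so
    # the center can land up to half a cell outside the grid point, then are
    # shifted by the grid coordinates (gx, gy) and scaled by the grid stride.
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    cx = (sig(tx) * 2.0 - 0.5 + gx) * stride
    cy = (sig(ty) * 2.0 - 0.5 + gy) * stride
    return cx, cy

# A zero offset at grid point (5, 5) on the stride-8 grid lands at the
# center of that cell in image coordinates:
print(decode_center(0.0, 0.0, 5, 5, 8))  # (44.0, 44.0)
```

So the anchor is not pinned to the cell center; the predicted offset moves the box center relative to the grid point.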
I will go through this part in detail. Thanks. |
Thanks for your great work. I wonder how you combine the results of the 4 different output grids during inference. Is there some special fusion? Looking forward to your reply.