The grounding detection frame coordinates are inconsistent with the image resolution. #46

Closed
JoaquinChou opened this issue Oct 23, 2023 · 3 comments

Comments

@JoaquinChou

I tried grounding a picture on this site. The image resolution is 288*296. However, CogVLM reported the coordinates of the person as [[097, 514, 283, 996]]. Is the image resolution changed before the model runs inference?

@1049451037
Member

The output integers are not absolute pixel coordinates. They are proportions of each axis multiplied by 1000, so [[097, 514, 283, 996]] corresponds to [0.097, 0.514, 0.283, 0.996] of the image's extent along each axis.
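For anyone who needs pixel coordinates, here is a minimal sketch of the conversion described above. It assumes the four numbers are ordered `x0, y0, x1, y1`, and that 288*296 means width 288 and height 296; neither is confirmed in this thread. The helper names `parse_boxes` and `to_pixels` are made up for illustration and are not part of CogVLM.

```python
import re

def parse_boxes(text):
    """Extract every [[a, b, c, d]] group from the model output as four ints."""
    pattern = r"\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]"
    return [tuple(int(g) for g in m.groups()) for m in re.finditer(pattern, text)]

def to_pixels(box, width, height, scale=1000):
    """Map proportions (times `scale`) to pixels; assumes x0, y0, x1, y1 order."""
    x0, y0, x1, y1 = box
    return (
        x0 * width / scale,
        y0 * height / scale,
        x1 * width / scale,
        y1 * height / scale,
    )

# Values from this issue; width=288 and height=296 is an assumed reading of "288*296".
boxes = parse_boxes("[[097, 514, 283, 996]]")
pixel_box = to_pixels(boxes[0], width=288, height=296)
print(pixel_box)
```

Under those assumptions the box comes out at roughly (27.9, 152.1, 81.5, 294.8) in pixels, which fits inside the 288*296 image.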

@JoaquinChou
Author

By the way, last time we tried the prompt "Is there any pesons?" with this image. However, the model output only one bounding box. In general, when there are multiple objects in the image, the model tends to predict only a single box.

@Sleepychord
Contributor

The traditional REC (referring expression comprehension) task tends to find only one thing, but you are right that it would be better to support many, as in detection. I think training on detection data can solve this problem.


3 participants