The grounding detection frame coordinates are inconsistent with the image resolution. #46

JoaquinChou · 2023-10-23T06:48:17Z

I tried grounding the picture on this site. The image resolution is 288*296. However, CogVLM reports the coordinates of the person as [[097, 514, 283, 996]]. Is the image resolution changed before the model runs inference?

1049451037 · 2023-10-23T06:54:39Z

The output integers are not absolute pixel coordinates. They are proportions of each axis, scaled by 1000, i.e., [0.097, 0.514, 0.283, 0.996].
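To make the conversion concrete, here is a minimal sketch of mapping such a box back to pixels. It assumes the box order is [x0, y0, x1, y1], that the scale factor is 1000 (inferred from the example above), and that 288*296 means width by height. The helper names `parse_boxes` and `to_pixels` are hypothetical and not part of CogVLM.

```python
import re


def parse_boxes(text):
    """Extract every [[a, b, c, d]] group of integers from the model's text output."""
    pattern = r"\[\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\]"
    return [tuple(int(v) for v in match) for match in re.findall(pattern, text)]


def to_pixels(box, width, height, scale=1000):
    """Convert a box given as proportions * scale into rounded pixel coordinates.

    Assumes the order (x0, y0, x1, y1); swap the axes if your output differs.
    """
    x0, y0, x1, y1 = box
    return (
        round(x0 * width / scale),
        round(y0 * height / scale),
        round(x1 * width / scale),
        round(y1 * height / scale),
    )


if __name__ == "__main__":
    boxes = parse_boxes("[[097, 514, 283, 996]]")
    for box in boxes:
        # Image size from the question, assumed to be width=288, height=296.
        print(to_pixels(box, width=288, height=296))
```

Under those assumptions, the box from the question maps to roughly (28, 152, 82, 295) in the original 288*296 image.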

JoaquinChou · 2023-10-23T07:10:45Z

By the way, last time we tried the prompt "Are there any persons?" with this image. However, the model output only one bounding box. In general, when there are multiple objects in the image, the model tends to predict only a single box.

Sleepychord · 2023-10-23T13:25:06Z

The traditional REC (referring expression comprehension) task tends to find only one object, but you are right that it would be better to support many (as in detection). I think training on detection data can solve this problem.

Sleepychord closed this as completed Oct 27, 2023
