I tried grounding a picture on this site. The image resolution is 288*296, but CogVLM reported the person's coordinates as [[097, 514, 283, 996]]. Is the resolution of the picture changed before the model runs inference?
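One possible reading, which nothing in this thread confirms, is that the box values are not pixels but coordinates normalized to a 0-999 range in [[x0, y0, x1, y1]] order. The sketch below shows how such values would map back to pixels under that assumption. The function name `to_pixel_box`, the `scale` of 1000, and the reading of 288*296 as width*height are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def to_pixel_box(
    box: Sequence[int],
    width: int,
    height: int,
    scale: int = 1000,  # assumed normalization range, not confirmed in this thread
) -> Tuple[int, int, int, int]:
    """Map a normalized [x0, y0, x1, y1] box to pixel coordinates.

    Assumes x values are fractions of the image width and y values are
    fractions of the image height, both expressed in units of 1/scale.
    """
    x0, y0, x1, y1 = box
    return (
        round(x0 * width / scale),
        round(y0 * height / scale),
        round(x1 * width / scale),
        round(y1 * height / scale),
    )


# The box from the question, on a 288*296 image (assumed width*height).
print(to_pixel_box([97, 514, 283, 996], width=288, height=296))
# -> (28, 152, 82, 295)
```

If the converted box lines up with the person in the image, the normalization assumption is probably right; if it does not, the image may indeed be resized or padded before inference and the mapping would need to account for that.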
By the way, last time we tried the prompt "Is there any pesons?" with this image, and the model output only one bounding box. In general, when there are multiple objects in the image, the model tends to predict only one box.
The traditional REC task tends to find only one object, but you are right that it would be better to support multiple objects (as in detection). I think training on detection data could solve this problem.