You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I agree that some categories may not provide enough aligned vision-language information for multi-modal learning. However, in the paper, you mentioned "video game commentaries" as an example.
I wonder why it is not visually grounded. The people's comments are usually related to the games. In my opinion, we could filter this category only for its unreality, which means it may not benefit downstream tasks.
The text was updated successfully, but these errors were encountered:
I agree that some categories may not provide enough aligned vision-language information for multi-modal learning. However, in the paper, you mentioned "video game commentaries" as an example.
I wonder why it is not visually grounded. The people's comments are usually related to the games. In my opinion, we could filter this category only for its unreality, which means it may not benefit downstream tasks.
The text was updated successfully, but these errors were encountered: