I notice that MERLOT adopts segment-level positional embeddings. However, there are only 16 segments during pre-training.
For longer videos, e.g., movies, 16 segments are not enough to encode their information. Specifically, I have two questions:
1. How can features be extracted for extremely long videos such as movies, given that the segment embedding table only covers 16 positions?
2. How about using fixed positional embeddings instead of learned ones, since fixed ones can extrapolate to arbitrary lengths?
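For context on the second question: fixed sinusoidal embeddings (as in the original Transformer) can be computed for any number of segments, which is why they sidestep a 16-entry learned table. A minimal NumPy sketch, not MERLOT's actual code:

```python
import numpy as np

def sinusoidal_embeddings(num_segments: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal embeddings (Vaswani et al., 2017 style).

    Defined for any num_segments, unlike a learned table whose
    size is frozen at pre-training time (e.g., 16 in MERLOT).
    """
    positions = np.arange(num_segments)[:, None]                    # (S, 1)
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))   # (dim/2,)
    emb = np.zeros((num_segments, dim))
    emb[:, 0::2] = np.sin(positions * div)  # even dims: sine
    emb[:, 1::2] = np.cos(positions * div)  # odd dims: cosine
    return emb

# A movie split into 128 segments still gets distinct embeddings,
# even though pre-training only ever used 16 segment positions.
emb = sinusoidal_embeddings(128, 768)
print(emb.shape)  # (128, 768)
```

Whether a model pre-trained with learned segment embeddings would behave well if these were swapped in is exactly the open question here; interpolating the 16 learned embeddings is another commonly tried workaround.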