Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
dataset
vision-language
audio-language
multimodal-foundation-model
cross-modality-pretraining
vision-audio-subtitle-text
Updated Mar 14, 2024 - Jupyter Notebook