This repository contains data and code for our paper *PACS: A Dataset for Physical Audiovisual CommonSense Reasoning*.
It is recommended to create an Anaconda environment:

```bash
conda create --name PACS python=3.8.11
conda activate PACS
pip install -r requirements.txt
```
Then, install the correct version of PyTorch based on your CUDA version here. For example:

```bash
pip3 install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
```
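To verify that the install picked up a CUDA-enabled build, a minimal check can help (this helper is our own sketch, not part of the repository):

```python
def check_torch_cuda() -> str:
    """Report the installed torch version and whether CUDA is usable."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    cuda = "available" if torch.cuda.is_available() else "unavailable"
    return f"torch {torch.__version__}, CUDA {cuda}"

print(check_torch_cuda())
```

If CUDA shows as unavailable despite a `+cu113` wheel, the installed NVIDIA driver is usually older than the CUDA version the wheel was built against.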
The dataset is available for download here.
Alternatively, if you want to replicate the original download steps, you can run the following commands (this will take a while):

```bash
cd dataset/scripts
python3 download.py -data_dir PATH_TO_DATA_STORAGE_HERE
python3 preprocess.py -data_dir PATH_TO_DATA_STORAGE_HERE
```
To run baseline models, see the `experiments` folder. We have currently benchmarked the following models:
| Model | With Audio (%) | Without Audio (%) | Δ |
|---|---|---|---|
| Fusion (I+A+V) | 51.9 ± 1.1 | - | - |
| Fusion (Q+I) | - | 51.2 ± 0.8 | - |
| Fusion (Q+A) | 50.9 ± 0.6 | - | - |
| Fusion (Q+V) | - | 51.5 ± 0.9 | - |
| Late Fusion | 55.0 ± 1.1 | 52.5 ± 1.6 | 2.5 |
| CLIP/AudioCLIP | 60.0 ± 0.9 | 56.3 ± 0.7 | 3.7 |
| UNITER (L) | - | 60.6 ± 2.2 | - |
| Merlot Reserve (B) | 66.5 ± 1.4 | 64.0 ± 0.9 | 2.6 |
| Merlot Reserve (L) | 70.1 ± 1.0 | 68.4 ± 0.7 | 1.8 |
| Majority | 50.4 | 50.4 | - |
| Human | 96.3 ± 2.1 | 90.5 ± 3.1 | 5.9 |
If you use this repository or our dataset, please consider citing our paper:
```bibtex
@inproceedings{yu2022pacs,
  title={PACS: A Dataset for Physical Audiovisual CommonSense Reasoning},
  author={Yu, Samuel and Wu, Peter and Liang, Paul Pu and Salakhutdinov, Ruslan and Morency, Louis-Philippe},
  booktitle={European Conference on Computer Vision},
  year={2022}
}
```