
Audio-Visual Class-Incremental Learning

We introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition, and propose AV-CIL, a method for this setting. [paper]

(Figure: overview of the AV-CIL framework)

Environment

We conduct experiments with Python 3.8.13 and PyTorch 1.13.0.

To set up the environment, simply run

pip install -r requirements.txt
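A minimal setup sketch, assuming a fresh conda environment (the environment name av-cil is illustrative):

# create an environment matching the tested Python version (name is illustrative)
conda create -n av-cil python=3.8.13
conda activate av-cil

# install the tested PyTorch version, then the remaining dependencies
pip install torch==1.13.0
pip install -r requirements.txt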

Datasets

AVE

The original AVE dataset can be downloaded through this link.

Please put the downloaded AVE videos in ./raw_data/AVE/videos/.

Kinetics-Sounds

The original Kinetics dataset can be downloaded through this link. After downloading the Kinetics dataset, please apply our provided video id list (here) to extract the Kinetics-Sounds dataset used in our experiments.

Please put the downloaded videos in ./raw_data/kinetics-sounds/videos/.

VGGSound100

The original VGGSound dataset can be downloaded through this link. After downloading the VGGSound dataset, please apply our provided video id list (here) to extract the VGGSound100 dataset used in our experiments.

Please put the downloaded videos in ./raw_data/VGGSound/videos/.
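After these steps, the raw-data directory should look as follows:

raw_data/
├── AVE/videos/              # AVE videos
├── kinetics-sounds/videos/  # Kinetics-Sounds videos
└── VGGSound/videos/         # VGGSound100 videos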

Extract audio and frames

After downloading the datasets to the folders above, please run the following command to extract the audio and frames

sh extract_audios_frames.sh 'dataset'

where the 'dataset' should be in [AVE, ksounds, VGGSound_100].
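For example, to extract audio and frames for the AVE dataset:

sh extract_audios_frames.sh AVE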

Pre-trained models

For the audio encoder, please download the pre-trained AudioMAE and put it in ./model/pretrained/.
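For example, assuming the checkpoint was downloaded as audiomae_pretrained.pth (the file name is hypothetical; keep the name of your downloaded checkpoint):

# create the target folder and move the downloaded AudioMAE checkpoint into it
mkdir -p ./model/pretrained
mv audiomae_pretrained.pth ./model/pretrained/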

Feature extraction

For the pre-trained audio features extraction, please run

sh extract_pretrained_features.sh 'dataset'

where the 'dataset' should be in [AVE, ksounds, VGGSound_100].
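For example, to extract pre-trained features for Kinetics-Sounds:

sh extract_pretrained_features.sh ksounds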

For the AudioMAE runtime environment, we follow the official implementation and use timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+.

(Optional) Use our extracted features directly

We have also released the pre-trained features, so you can use them directly instead of pre-processing and extracting them from the raw data: AVE, Kinetics-Sounds [part-1, part-2, part-3], VGGSound100 [part-1, part-2, part-3, part-4, part-5, part-6].

For Kinetics-Sounds and VGGSound100, please download all the parts and concatenate them before unzipping.
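For example, for Kinetics-Sounds (the part file names are illustrative; substitute the names of the downloaded files):

# join the split archive parts in order, then unzip the result
cat ks_features.part-1 ks_features.part-2 ks_features.part-3 > ks_features.zip
unzip ks_features.zip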

After obtaining the pre-trained audio and visual features, please put them in ./data/'dataset'/audio_pretrained_feature/ and ./data/'dataset'/visual_pretrained_feature/, respectively.
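For example, for AVE (the source folder names audio_features and visual_features are hypothetical):

# create the expected feature directories and move the features into them
mkdir -p ./data/AVE/audio_pretrained_feature ./data/AVE/visual_pretrained_feature
mv audio_features/* ./data/AVE/audio_pretrained_feature/
mv visual_features/* ./data/AVE/visual_pretrained_feature/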

Training & Evaluation

For the vanilla fine-tuning strategy, please run

sh run_incremental_fine_tuning.sh 'dataset' 'modality'

where the 'dataset' should be in [AVE, ksounds, VGGSound_100], and the 'modality' should be in [audio, visual, audio-visual].
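For example, to fine-tune on AVE with both modalities:

sh run_incremental_fine_tuning.sh AVE audio-visual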

For the upper bound, please run

sh run_incremental_upper_bound.sh 'dataset' 'modality'

For LwF, please run

sh run_incremental_lwf.sh 'dataset' 'modality'

For iCaRL, please run

sh run_incremental_icarl.sh 'dataset' 'modality' 'classifier'

where the 'classifier' should be in [NME, FC].
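For example, to run iCaRL on Kinetics-Sounds with the NME classifier:

sh run_incremental_icarl.sh ksounds audio-visual NME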

For SS-IL, please run

sh run_incremental_ssil.sh 'dataset' 'modality'

For AFC, please run

sh run_incremental_afc.sh 'dataset' 'modality' 'classifier'

where the 'classifier' should be in [NME, LSC].

For our AV-CIL, please run

sh run_incremental_ours.sh 'dataset'
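For example, to run AV-CIL on VGGSound100:

sh run_incremental_ours.sh VGGSound_100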

Citation

If you find this work useful, please consider citing it.

@inproceedings{pian2023audiovisual,
  title={Audio-Visual Class-Incremental Learning},
  author={Pian, Weiguo and Mo, Shentong and Guo, Yunhui and Tian, Yapeng},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2023}
}
