This is the codebase for the paper "MAC: Mask-Augmentation for Motion-Aware Video Representation Learning".
If you find our work useful in your research, please consider citing our paper:
```bibtex
@inproceedings{macssl,
  title={MAC: Mask-Augmentation for Motion-Aware Video Representation Learning},
  author={Akar, Arif and Senturk, Ufuk Umut and Ikizler-Cinbis, Nazli},
  booktitle={BMVC},
  year={2022}
}
```
You can install the necessary libraries with conda and pip:
```shell
# Step 0: create a new python environment
conda create -n mac python=3.7
conda activate mac

# Step 1: install mmcv with torch 1.8, cuda 11.1
pip install mmcv-full==1.3.7 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html

# Step 2: install tensorboard
pip install tensorboard
```
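The mmcv wheel above is built against PyTorch 1.8 with CUDA 11.1, so a matching PyTorch build should be installed first. If it is not already present, one option (assuming a CUDA 11.1 machine) is:

```shell
# Assumes CUDA 11.1; pick the wheel matching your CUDA version otherwise.
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```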
- Download the compressed file of UCF-101 raw videos from the official website.

```shell
mkdir -p data/ucf101/
wget https://www.crcv.ucf.edu/data/UCF101/UCF101.rar -O data/ucf101/UCF101.rar --no-check-certificate
```
- Unzip the compressed file.

```shell
# install unrar if necessary
# sudo apt-get install unrar
unrar e data/ucf101/UCF101.rar data/ucf101/UCF101_raw/
```
- Download the train/test split file.

```shell
wget https://www.crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-RecognitionTask.zip -O data/ucf101/UCF101TrainTestSplits-RecognitionTask.zip --no-check-certificate
unzip data/ucf101/UCF101TrainTestSplits-RecognitionTask.zip -d data/ucf101/
```
- Run the preprocessing script (it takes about 1 hour to extract raw frames).

```shell
python scripts/process_ucf101.py --raw_dir data/ucf101/UCF101_raw/ --ann_dir data/ucf101/ucfTrainTestlist/ --out_dir data/ucf101/
```
- (Optional) The generated annotation file is in .txt format. You can convert it to JSON with `scripts/cvt_txt_to_json.py`.
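If you need to adapt the conversion to your own pipeline, the following is a minimal sketch of such a txt-to-json step. The per-line annotation format assumed here (`<video_path> <num_frames> <label>`) is a guess for illustration; check the actual output of the preprocessing script and prefer `scripts/cvt_txt_to_json.py` itself.

```python
# Rough sketch of a txt -> json annotation conversion.
# ASSUMPTION: each line reads '<video_path> <num_frames> <label>';
# the real format produced by process_ucf101.py may differ.
import json

def cvt_txt_to_json(txt_path: str, json_path: str) -> None:
    records = []
    with open(txt_path) as f:
        for line in f:
            path, num_frames, label = line.split()
            records.append(dict(path=path,
                                num_frames=int(num_frames),
                                label=int(label)))
    with open(json_path, 'w') as f:
        json.dump(records, f, indent=2)
```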
- (Optional) Delete raw videos to save disk space.

```shell
rm data/ucf101/UCF101.rar
rm -r data/ucf101/UCF101_raw/
```
- Download the compressed file of HMDB-51 raw videos from the official website.

```shell
mkdir -p data/hmdb51/
wget http://serre-lab.clps.brown.edu/wp-content/uploads/2013/10/hmdb51_org.rar -O data/hmdb51/HMDB51.rar --no-check-certificate
```
- Unzip the compressed file (the archive contains one nested .rar per action class).

```shell
unrar e data/hmdb51/HMDB51.rar data/hmdb51/HMDB51_raw/
for file in data/hmdb51/HMDB51_raw/*.rar; do unrar e ${file} ${file%".rar"}/; done
```
- Download the train/test split file.

```shell
wget http://serre-lab.clps.brown.edu/wp-content/uploads/2013/10/test_train_splits.rar -O data/hmdb51/test_train_splits.rar --no-check-certificate
unrar e data/hmdb51/test_train_splits.rar data/hmdb51/test_train_splits/
```
- Run the preprocessing script (it takes about 20 minutes to extract raw frames).

```shell
python scripts/process_ucf101.py --raw_dir data/hmdb51/HMDB51_raw/ --ann_dir data/hmdb51/test_train_splits/ --out_dir data/hmdb51/
```
- (Optional) Delete raw videos to save disk space.

```shell
rm data/hmdb51/HMDB51.rar
rm -r data/hmdb51/HMDB51_raw/
```
Since Kinetics-400/600 is relatively large, we do not provide download and preprocessing scripts for it here. You can follow the same steps as for UCF-101: the raw frames are extracted from each video file and then saved in a compressed .zip file.
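As a starting point, here is a minimal sketch of packing extracted frames into one .zip archive per video with Python's standard `zipfile` module. The frame layout (`<frames_dir>/<video_name>/*.jpg`) is an assumption for illustration; match whatever layout the zip backend expects.

```python
# Sketch: pack extracted frames into one .zip per video.
# ASSUMPTION: frames live in <frames_dir>/<video_name>/*.jpg.
import os
import zipfile

def pack_video_frames(frames_dir: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    for video_name in sorted(os.listdir(frames_dir)):
        video_dir = os.path.join(frames_dir, video_name)
        if not os.path.isdir(video_dir):
            continue
        zip_path = os.path.join(out_dir, video_name + '.zip')
        # ZIP_STORED skips re-compression; JPEGs are already compact.
        with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_STORED) as zf:
            for frame in sorted(os.listdir(video_dir)):
                zf.write(os.path.join(video_dir, frame), arcname=frame)
```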
You can alternatively implement your own storage backend, following the example of src/datasets/backends/zip_backend.py (a rough sketch is given below). If you do not want to use the zip backend, you can use src/datasets/backends/jpeg_backend.py instead. We also provide a backend for a mini Kinetics-400 subset in src/datasets/backends/mini_k400.py.
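For orientation, the sketch below shows the rough shape such a backend might take. The class and method names are placeholders, not the interface the dataset classes actually require; mirror `zip_backend.py` for the real contract.

```python
# Hypothetical storage-backend sketch; class and method names are
# placeholders -- mirror src/datasets/backends/zip_backend.py instead.
import io
import zipfile
from PIL import Image

class ZipFrameReader:
    """Reads JPEG frames from a single per-video .zip archive."""

    def __init__(self, zip_path: str):
        self._zf = zipfile.ZipFile(zip_path, 'r')

    def load_frame(self, frame_name: str) -> Image.Image:
        # Decode one frame, e.g. 'img_00001.jpg', from the archive.
        with self._zf.open(frame_name) as f:
            return Image.open(io.BytesIO(f.read())).convert('RGB')

    def close(self) -> None:
        self._zf.close()
```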
To run distributed pretraining of a CoT-2 model on the UCF-101 dataset:

```shell
python -m torch.distributed.launch --master_port=<PORT_NUMBER> --nproc_per_node=<GPU_NUMBER> \
    ./tools/train_net.py --validate --cfg ./configs/mac2_moco/r2plus1d_18_ucf101/pretraining.py \
    --work_dir <WORKING_DIRECTORY> --data_dir <DATASET_DIRECTORY> --launcher pytorch
```
Checkpoints and logs are saved in `work_dir`. If you do not specify `--work_dir` as above, the `work_dir` defined in the configuration file is used. `--validate` refers only to validation of the self-supervised objective. Our models are pretrained for 300 epochs on UCF-101.
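Since tensorboard was installed in the setup step, you can monitor the training curves by pointing it at the working directory (assuming the logs are written there):

```shell
tensorboard --logdir <WORKING_DIRECTORY>
```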
After pretraining, you can use the CoT-pretrained model to initialize the action recognizer. The checkpoint path is set via the `model.backbone.pretrained` key in the fine-tuning configuration. Our models are fine-tuned for 150 epochs on UCF-101.
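For illustration, here is a hypothetical excerpt of what that key looks like in an mmcv-style config; every value other than the `pretrained` key is a placeholder, so consult the actual finetune_ucf101.py.

```python
# Hypothetical config excerpt; only model.backbone.pretrained is the
# documented key -- the other values are placeholders.
model = dict(
    backbone=dict(
        type='R2Plus1D',  # placeholder backbone name
        pretrained='<WORKING_DIRECTORY>/epoch_300.pth',  # pretraining checkpoint
    ),
)
```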
Model training (with validation):

```shell
python -m torch.distributed.launch --master_port=<PORT_NUMBER> --nproc_per_node=<GPU_NUMBER> \
    ./tools/train_net.py --validate --cfg ./configs/mac2_moco/r2plus1d_18_ucf101/finetune_ucf101.py \
    --work_dir <WORKING_DIRECTORY> --data_dir <DATASET_DIRECTORY> --launcher pytorch
```
Model evaluation:

```shell
python ./tools/test_net.py --cfg ./configs/mac2_moco/r2plus1d_18_ucf101/eval_ucf101.py \
    --data_dir <DATASET_DIRECTORY> --progress
```
You can run video retrieval on UCF-101 with a CoT-4 model using the command below:

```shell
python ./tools/retrieve.py --checkpoint <MODEL_PATH> --work_dir <WORKING_DIRECTORY> --data_dir <DATASET_DIRECTORY> \
    --cfg ./configs/mac4_moco/r2plus1d_18_kinetics/retrieve_ucf101.py
```
We also provide implementations of CtP [1], BE [2], MemDPC [3], SpeedNet [4], and VCOP [5]. This repository is built on CtP; we thank the authors for their contributions.
[1] Wang, G., Zhou, Y., Luo, C., Xie, W., Zeng, W., Xiong, Z.: Unsupervised visual representation learning by tracking patches in video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2563–2572 (2021)

[2] Wang, J., Gao, Y., Li, K., Lin, Y., Ma, A.J., Cheng, H., Peng, P., Huang, F., Ji, R., Sun, X.: Removing the background by adding the background: Towards background robust self-supervised video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11804–11813 (2021)

[3] Han, T., Xie, W., Zisserman, A.: Memory-augmented dense predictive coding for video representation learning. In: European Conference on Computer Vision (ECCV) (2020)

[4] Benaim, S., Ephrat, A., Lang, O., Mosseri, I., Freeman, W.T., Rubinstein, M., Irani, M., Dekel, T.: SpeedNet: Learning the speediness in videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9922–9931 (2020)

[5] Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., Zhuang, Y.: Self-supervised spatiotemporal learning via video clip order prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10334–10343 (2019)