This is the official code for the paper "When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning", accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025). The paper is available at https://arxiv.org/abs/2503.15096.
When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning
Authors: Yang Liu, Qianqian Xu*, Peisong Wen, Siran Dai, Qingming Huang*
| Dataset | Backbone | Epoch | DAVIS (J&F-Mean) | VIP (mIoU) | JHMDB (PCK@0.1) | Download |
|---|---|---|---|---|---|---|
| ImageNet | ViT-S/16 | 100 | 64.1 | 39.7 | 46.2 | link |
| K400 | ViT-S/16 | 400 | 64.7 | 37.8 | 47.0 | link |
| K400 | ViT-B/16 | 200 | 66.4 | 38.9 | 47.1 | link |
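For reference, below is a minimal Python sketch of how PCK@0.1 (the JHMDB pose-propagation metric above) is typically computed: a keypoint counts as correct if it lands within 0.1 of the larger bounding-box side from its ground-truth location. This is illustrative only, not the repository's evaluation code, and the exact threshold convention is an assumption.

```python
import numpy as np

def pck(pred, gt, bbox_hw, alpha=0.1):
    """Percentage of Correct Keypoints (illustrative sketch).

    pred, gt: (num_keypoints, 2) arrays of (x, y) coordinates.
    bbox_hw:  (height, width) of the person bounding box.
    A keypoint is correct if its distance to the ground truth
    is below alpha * max(height, width).
    """
    threshold = alpha * max(bbox_hw)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float((dist < threshold).mean())

# Toy usage: 3 keypoints, a 100x60 person box, alpha = 0.1 (threshold = 10 px).
pred = np.array([[10.0, 12.0], [40.0, 38.0], [70.0, 90.0]])
gt   = np.array([[11.0, 10.0], [55.0, 38.0], [71.0, 89.0]])
print(pck(pred, gt, bbox_hw=(100, 60)))  # 2 of 3 keypoints within 10 px -> 0.666...
```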
- Ubuntu 20.04
- CUDA 12.4
- Python 3.9
- PyTorch 2.2.0
See requirements.txt for other dependencies.
- Clone this repository:

  ```bash
  git clone https://github.com/yafeng19/T-CORE.git
  ```

- Create and activate a virtual environment with Python 3.9:

  ```bash
  conda create --name T_CORE python=3.9
  conda activate T_CORE
  ```

- Install the required libraries:

  ```bash
  pip install -r requirements.txt
  ```
- Download the Kinetics-400 training set.
- Use third-party tools or scripts to extract frames from the original videos (see the sketch below the directory tree).
- Place the frames in `data/Kinetics-400/frames/train`.
- Generate the files for the training data with `python base_model/tools/dump_files.py` and place them in `data/Kinetics-400/frames`.
- Organize the frames and files into the following structure:
```
T-CoRe
├── data
│   └── Kinetics-400
│       └── frames
│           ├── train
│           │   ├── class_1
│           │   │   ├── video_1
│           │   │   │   ├── 00000.jpg
│           │   │   │   ├── 00001.jpg
│           │   │   │   ├── ...
│           │   │   │   └── 00019.jpg
│           │   │   ├── ...
│           │   │   └── video_m
│           │   ├── ...
│           │   └── class_n
│           ├── class-ids-TRAIN.npy
│           ├── class-names-TRAIN.npy
│           ├── entries-TRAIN.npy
│           └── labels.txt
├── base_model
└── scripts
```
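Since frame extraction is delegated to third-party tools, here is one possible sketch using OpenCV. The 20-frame count and the `00000.jpg` naming are inferred from the directory tree above; the sampling strategy used for the paper may differ.

```python
import os
import cv2  # pip install opencv-python

NUM_FRAMES = 20  # inferred from 00000.jpg ... 00019.jpg above; adjust if needed

def extract_frames(video_path, out_dir, num_frames=NUM_FRAMES):
    """Uniformly sample `num_frames` frames from a video and save them as JPEGs."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    for i in range(num_frames):
        # Index of the i-th uniformly spaced frame.
        idx = min(int(i * total / num_frames), max(total - 1, 0))
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"{i:05d}.jpg"), frame)
    cap.release()

# Hypothetical example: one raw video into frames/train/<class>/<video>/.
extract_frames("abseiling/video_1.mp4",
               "data/Kinetics-400/frames/train/abseiling/video_1")
```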
We provide a script with default parameters. Run the following command for training:

```bash
bash scripts/pretrain.sh
```

The pretrained models can be downloaded here.
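To sanity-check a downloaded checkpoint, something like the following sketch may help. It assumes a standard timm ViT-S/16 backbone and uses a hypothetical checkpoint filename; the released state-dict keys may carry wrapper prefixes that require remapping.

```python
import torch
import timm  # pip install timm

# Hypothetical filename; use the checkpoint downloaded from the link above.
ckpt = torch.load("t_core_vits16_k400.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

# Plain ViT-S/16 backbone without a classification head.
model = timm.create_model("vit_small_patch16_224", num_classes=0)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```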
In our paper, three dense-level benchmarks are adopted for evaluation.
| Dataset | Video Task | Download link |
|---|---|---|
| DAVIS | Video Object Segmentation | link |
| JHMDB | Human Pose Propagation | link |
| VIP | Semantic Part Propagation | link |
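For intuition on the VIP metric, here is a minimal sketch of mean IoU over semantic part classes. It is illustrative only, not the benchmark's official scoring code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes present in either map.

    pred, gt: integer label maps of identical shape.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy usage: 2x2 label maps with 2 classes.
pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 0], [1, 1]])
print(mean_iou(pred, gt, num_classes=2))  # class 0: 1/2, class 1: 2/3 -> 0.583...
```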
We provide a script with default parameters. Run the following command for evaluation:

```bash
bash scripts/eval.sh
```

If you find this repository useful in your research, please cite the following paper:

```bibtex
@misc{liu2025futurepasttamingtemporal,
title={When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning},
author={Yang Liu and Qianqian Xu and Peisong Wen and Siran Dai and Qingming Huang},
year={2025},
eprint={2503.15096},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.15096},
}
```
If you have any questions or suggestions, feel free to email us at liuyang232@mails.ucas.ac.cn. We will reply within 1-2 business days. Thanks for your interest in our work!
