
[ACM MM'23] UMMAFormer: A Universal Multimodal-adaptive Transformer Framework For Temporal Forgery Localization


Rui Zhang, Hongxia Wang, Mingshan Du, Hanqing Liu, Yang Zhou, Qiang Zeng
School of Cyber Science and Engineering, Sichuan University

Temporal Video Inpainting Localization (TVIL) dataset and PyTorch training/validation code for UMMAFormer. This is the official repository of our work accepted to ACM MM'23. If you have any questions, please contact zhangrui1997[at]stu.scu.edu.cn. The paper can be found on arXiv.

Figure: Overview of UMMAFormer

Abstract

The emergence of artificial intelligence-generated content (AIGC) has raised concerns about the authenticity of multimedia content in various fields. Existing research is largely limited to binary classification of complete videos, which restricts its use in industrial settings. We propose a novel universal transformer framework for temporal forgery localization (TFL) called UMMAFormer, which predicts forgery segments with multimodal adaptation. We also propose a Temporal Feature Abnormal Attention (TFAA) module based on temporal feature reconstruction to enhance the detection of temporal differences. In addition, we introduce a parallel cross-attention feature pyramid network (PCA-FPN) to optimize the Feature Pyramid Network (FPN) for subtle feature enhancement. To address the lack of available datasets, we introduce a novel temporal video inpainting localization (TVIL) dataset that is specifically tailored for video inpainting scenes. Our experiments demonstrate that the proposed method achieves state-of-the-art performance on the Lav-DF, TVIL, and Psynd benchmark datasets, significantly surpassing the previous best results.

Figure: Motivation of UMMAFormer

TVIL dataset

a. Data Download

If you need the TVIL dataset for academic purposes, please download the full data from BaiduYun Disk (code: 8tj1) or OneDrive.

b. Data Sources

The raw data comes from YouTube-VOS 2018.

c. Inpainting Methods

We use four different video inpainting methods to create new videos: E2FGVI, FGT, FuseFormer, and STTN. We use XMem to generate the inpainting masks.

Figure: Inpainting samples

d. Feature Extraction

We also provide the TSN features (code: 8tj1) used in the paper, extracted with mmaction2==0.24.1.

Code

Requirements

  • Linux
  • Python 3.5+
  • PyTorch 1.11
  • TensorBoard
  • CUDA 11.0+
  • GCC 4.9+
  • NumPy 1.11+
  • PyYaml
  • Pandas
  • h5py
  • joblib

Compilation

Part of NMS is implemented in C++. The code can be compiled by

cd ./libs/utils
python setup.py install --user
cd ../..

The code should be recompiled every time you update PyTorch.
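
To confirm the build succeeded, you can try importing the compiled extension from Python. This is only a minimal sanity check; the module name nms_1d_cpu is assumed from the ActionFormer codebase this repository builds on and may differ if your setup.py names it otherwise.

import torch  # import torch first so the extension's shared symbols can resolve
try:
    # assumed module name (as in ActionFormer); adjust if your build differs
    import nms_1d_cpu
    print("NMS extension found at:", nms_1d_cpu.__file__)
except ImportError:
    print("NMS extension missing -- rerun `python setup.py install --user` in ./libs/utils")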

To Reproduce Our Results

  1. Download Features and Annotations. We provide the following features and annotations for download:

    annotations and features of Lav-DF from BaiduYun (code: k6jq)

    annotations and features of Psynd from BaiduYun (code: m6iq)

    annotations and features of TVIL from BaiduYun Disk (code: 8tj1) or OneDrive

    These features are the same as those used in our paper and were extracted with the BYOL-A and TSN models; they can be directly used for training and testing. The annotations have been converted from their original formats into the format expected by our code, but the ground-truth values themselves are unchanged.

    Optional

    You can also extract features on your own using mmaction2==0.24.1 and BYOL-A. First, apply to the official sources for the original Lav-DF and Psynd datasets. Then install mmaction2==0.24.1 and BYOL-A and set up their environments following the official instructions. You also need to extract frames and optical flow from the videos, which can be done with mmaction2. For Lav-DF, you additionally need to separate the audio from each original video. The pre-trained TSN models also need to be downloaded from tsn_rgb and tsn_flow. You can use the following commands to generate a video-list txt file for Lav-DF and extract the visual features.

    python tools/gen_lavdf_filelist.py
    bash tools/gen_tsn_features_lavdf.sh

    For audio features, please put the script tools/byola_extract_lavdf.py in the BYOL-A directory and use the following command.

    python byol-a/byola_extract_lavdf.py
  2. Unpack Features and Annotations

  • Unpack the files under ./data (or unpack them elsewhere and symlink to ./data).
  • The folder structure should look like the tree below (a small script to sanity-check this layout is sketched after it):
This folder
│   README.md
│   ...
│
└───data/
│   └───lavdf/
│   │   └───annotations
│   │   └───feats
│   │       └───byola
│   │       │   └───train
│   │       │   └───dev
│   │       │   └───test
│   │       └───tsn
│   │           └───train
│   │           └───dev
│   │           └───test
│   └───...
│
└───libs
│
│   ...
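
As a quick sanity check after unpacking, the snippet below (not part of the repository) verifies that the Lav-DF folders match the layout above; adapt the root path if you unpacked the data elsewhere.

import os

# Hypothetical helper: check the Lav-DF data layout described in the tree above.
root = "./data/lavdf"
expected = ["annotations"] + [
    os.path.join("feats", feat, split)
    for feat in ("byola", "tsn")
    for split in ("train", "dev", "test")
]
for rel in expected:
    path = os.path.join(root, rel)
    print(("ok      " if os.path.isdir(path) else "MISSING ") + path)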
  1. Training and Evaluation. Train our UMMAFormer with TSN and BYOL-A features. This will create an experiment folder ./paper_results that stores the training config, logs, and checkpoints.
    python ./train.py ./configs/UMMAFormer/dataset.yaml
    Then you can run evaluation with the trained model on the evaluation set:
    python ./eval.py ./configs/UMMAFormer/dataset.yaml ./paper_results/dataset/model_best.pth.tar
    For Psynd, modify the configuration file by changing the value of "test_split" to the corresponding subset name, such as "test_cellular" or "test_landline". You can then calculate the IoU for each subset using the following command:
    python tools/test_miou.py
    You need to modify the 'split' variable in the script, as well as the paths to the labels and results. (A toy illustration of the temporal IoU metric is sketched after the table below.)
  2. Evaluating Our Pre-trained Models. We also provide pre-trained models. The links below point to Baidu cloud drive; since some users may not be able to access it, we additionally provide OneDrive links.
| Dataset | Modality | Config | Pretrained | AP@0.5 | AP@0.75 | AP@0.95 | AR@10 | AR@20 | AR@50 | AR@100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Lav-DF | V | Yaml | Ckpt | 97.30 | 92.96 | 25.68 | 90.19 | 90.85 | 91.14 | 91.18 |
| Lav-DF | V+A | Yaml | Ckpt | 98.83 | 95.54 | 37.61 | 92.10 | 92.42 | 92.47 | 92.48 |
| Lav-DF Subset | V | Yaml | Ckpt | 98.83 | 95.95 | 30.11 | 92.32 | 92.65 | 92.74 | 92.75 |
| Lav-DF Subset | V+A | Yaml | Ckpt | 98.54 | 94.30 | 37.52 | 91.61 | 91.97 | 92.06 | 92.06 |
| TVIL | V | Yaml | Ckpt | 88.68 | 84.70 | 62.43 | 87.09 | 88.21 | 90.43 | 91.16 |
| Psynd-Test | A | Yaml | Ckpt | 100.00 | 100.00 | 79.87 | 97.60 | 97.60 | 97.60 | 97.60 |
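
As a reference for the per-subset IoU evaluation above, here is a toy, self-contained illustration of the temporal IoU between one predicted segment and one ground-truth segment (the segment boundaries are made-up values; the actual evaluation is done by tools/test_miou.py).

def temporal_iou(pred, gt):
    """IoU of two 1-D temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Toy example: a predicted forged segment partially overlapping the ground truth.
print(temporal_iou((2.0, 5.0), (3.0, 6.0)))  # -> 0.5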

Results

TFL Result 1 (Lav-DF)

TFL Result 2 (Lav-DF)

TFL Result 1 (TVIL)

TFL Result 2 (TVIL)

TODO List

  • Release full code.
  • Release TVIL datasets and TSN features.
  • Release TSN features and BYOL-A features for Lav-DF and Psynd.
  • Release our pre-trained models.

Cite UMMAFormer

Acknowledgement

Thanks to the authors of ActionFormer; our code is based on their implementation.
