
[ACM MM'23] UMMAFormer: A Universal Multimodal-adaptive Transformer Framework For Temporal Forgery Localization


Rui Zhang, Hongxia Wang, Mingshan Du, Hanqing Liu, Yang Zhou, Qiang Zeng
School of Cyber Science and Engineering, Sichuan University

Temporal Video Inpainting Localization (TVIL) dataset and PyTorch training/validation code for UMMAFormer. This is the official repository of our work accepted to ACM MM'23. If you have any questions, please contact zhangrui1997[at]stu.scu.edu.cn. The paper can be found on arXiv.

Figure: Overview of UMMAFormer

Abstract

The emergence of artificial intelligence-generated content (AIGC) has raised concerns about the authenticity of multimedia content in various fields. Existing research is largely limited to binary classification of complete videos, which restricts its use in industrial settings. We propose a novel universal transformer framework for temporal forgery localization (TFL) called UMMAFormer, which predicts forgery segments with multimodal adaptation. We also propose a Temporal Feature Abnormal Attention (TFAA) module based on temporal feature reconstruction to enhance the detection of temporal differences. In addition, we introduce a parallel cross-attention feature pyramid network (PCA-FPN) to optimize the Feature Pyramid Network (FPN) for subtle feature enhancement. To address the lack of available datasets, we introduce a novel temporal video inpainting localization (TVIL) dataset that is specifically tailored for video inpainting scenes. Our experiments demonstrate that the proposed method achieves state-of-the-art performance on the Lav-DF, TVIL, and Psynd benchmark datasets, significantly surpassing the previous best results.

Figure: Motivation of UMMAFormer

TVIL dataset

a. Data Download

If you need the TVIL dataset for academic purposes, please download the full data from BaiduYun Disk (code: 8tj1) or OneDrive.

b. Data Sources

The raw data comes from YouTube-VOS 2018.

c. Inpainting Methods

We use four different video inpainting methods to create new videos: E2FGVI, FGT, FuseFormer, and STTN. We use XMem to generate the inpainting masks.

Figure: Inpainting samples

d. Feature Extraction

We also provide the TSN features (code: 8tj1) used in the paper, extracted with mmaction2==0.24.1.

Code

Requirements

  • Linux
  • Python 3.5+
  • PyTorch 1.11
  • TensorBoard
  • CUDA 11.0+
  • GCC 4.9+
  • NumPy 1.11+
  • PyYaml
  • Pandas
  • h5py
  • joblib

Compilation

Part of NMS is implemented in C++. The code can be compiled by

cd ./libs/utils
python setup.py install --user
cd ../..

The code should be recompiled every time you update PyTorch.
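
To confirm the build succeeded, you can try importing the compiled extension from Python. This is only a minimal sanity check; the module name nms_1d_cpu is assumed from the ActionFormer codebase this repository builds on and may differ if your setup.py names it otherwise.

import torch  # import torch first so the extension's shared symbols can resolve
try:
    # assumed module name (as in ActionFormer); adjust if your build differs
    import nms_1d_cpu
    print("NMS extension found at:", nms_1d_cpu.__file__)
except ImportError:
    print("NMS extension missing -- rerun `python setup.py install --user` in ./libs/utils")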

To Reproduce Our Results

  1. Download Features and Annotations. We provide the following features and annotations for download:

    annotations and features of Lav-DF from BaiduYun (code: k6jq)

    annotations and features of Psynd from BaiduYun (code: m6iq)

    annotations and features of TVIL from BaiduYun Disk (code: 8tj1) or OneDrive

    These features are the same as those used in our paper and were extracted with the BYOL-A and TSN models; they can be directly used for training and testing. The annotations have been converted from their original formats into the format expected by our code, but the ground-truth values themselves are unchanged.

    Optional

    You can also extract features on your own using mmaction2==0.24.1 and BYOL-A. First, apply to the official sources for the original Lav-DF and Psynd datasets. Then install mmaction2==0.24.1 and BYOL-A and set up their environments following the official instructions. You also need to extract frames and optical flow from the videos, which can be done with mmaction2. For Lav-DF, you additionally need to separate the audio from each original video. The pre-trained TSN models also need to be downloaded from tsn_rgb and tsn_flow. You can use the following commands to generate a video-list txt file for Lav-DF and extract the visual features.

    python tools/gen_lavdf_filelist.py
    bash tools/gen_tsn_features_lavdf.sh

    For audio features, please put the script tools/byola_extract_lavdf.py in the BYOL-A directory and use the following command.

    python byol-a/byola_extract_lavdf.py
  2. Unpack Features and Annotations

  • Unpack the files under ./data (or unpack them elsewhere and symlink to ./data).
  • The folder structure should look like the tree below (a small script to sanity-check this layout is sketched after it):
This folder
│   README.md
│   ...
│
└───data/
│   └───lavdf/
│   │   └───annotations
│   │   └───feats
│   │       └───byola
│   │       │   └───train
│   │       │   └───dev
│   │       │   └───test
│   │       └───tsn
│   │           └───train
│   │           └───dev
│   │           └───test
│   └───...
│
└───libs
│
│   ...
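
As a quick sanity check after unpacking, the snippet below (not part of the repository) verifies that the Lav-DF folders match the layout above; adapt the root path if you unpacked the data elsewhere.

import os

# Hypothetical helper: check the Lav-DF data layout described in the tree above.
root = "./data/lavdf"
expected = ["annotations"] + [
    os.path.join("feats", feat, split)
    for feat in ("byola", "tsn")
    for split in ("train", "dev", "test")
]
for rel in expected:
    path = os.path.join(root, rel)
    print(("ok      " if os.path.isdir(path) else "MISSING ") + path)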
  1. Training and Evaluation. Train our UMMAFormer with TSN and BYOL-A features. This will create an experiment folder ./paper_results that stores the training config, logs, and checkpoints.
    python ./train.py ./configs/UMMAFormer/dataset.yaml
    Then you can run evaluation with the trained model on the evaluation set:
    python ./eval.py ./configs/UMMAFormer/dataset.yaml ./paper_results/dataset/model_best.pth.tar
    For Psynd, modify the configuration file by changing the value of "test_split" to the corresponding subset name, such as "test_cellular" or "test_landline". You can then calculate the IoU for each subset using the following command:
    python tools/test_miou.py
    You need to modify the 'split' variable in the script, as well as the paths to the labels and results. (A toy illustration of the temporal IoU metric is sketched after the table below.)
  2. Evaluating Our Pre-trained Models. We also provide pre-trained models. The links below point to Baidu cloud drive; since some users may not be able to access it, we additionally provide OneDrive links.
| Dataset | Modality | Config | Pretrained | AP@0.5 | AP@0.75 | AP@0.95 | AR@10 | AR@20 | AR@50 | AR@100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Lav-DF | V | Yaml | Ckpt | 97.30 | 92.96 | 25.68 | 90.19 | 90.85 | 91.14 | 91.18 |
| Lav-DF | V+A | Yaml | Ckpt | 98.83 | 95.54 | 37.61 | 92.10 | 92.42 | 92.47 | 92.48 |
| Lav-DF Subset | V | Yaml | Ckpt | 98.83 | 95.95 | 30.11 | 92.32 | 92.65 | 92.74 | 92.75 |
| Lav-DF Subset | V+A | Yaml | Ckpt | 98.54 | 94.30 | 37.52 | 91.61 | 91.97 | 92.06 | 92.06 |
| TVIL | V | Yaml | Ckpt | 88.68 | 84.70 | 62.43 | 87.09 | 88.21 | 90.43 | 91.16 |
| Psynd-Test | A | Yaml | Ckpt | 100.00 | 100.00 | 79.87 | 97.60 | 97.60 | 97.60 | 97.60 |
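
As a reference for the per-subset IoU evaluation above, here is a toy, self-contained illustration of the temporal IoU between one predicted segment and one ground-truth segment (the segment boundaries are made-up values; the actual evaluation is done by tools/test_miou.py).

def temporal_iou(pred, gt):
    """IoU of two 1-D temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Toy example: a predicted forged segment partially overlapping the ground truth.
print(temporal_iou((2.0, 5.0), (3.0, 6.0)))  # -> 0.5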

Results

TFL Result 1 (Lav-DF)

TFL Result 2 (Lav-DF)

TFL Result 1 (TVIL)

TFL Result 2 (TVIL)

TODO List

  • Release full code.
  • Release TVIL datasets and TSN features.
  • Release TSN features and BYOL-A features for Lav-DF and Psynd.
  • Release our pre-trained models.

Cite UMMAFormer

Acknowledgement

Thanks to the authors of ActionFormer; our code is based on their implementation.
