
This is the official implementation of RGNet: A Unified Retrieval and Grounding Network for Long Videos


RGNet

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius

Accepted by ECCV 2024

[Website] [Paper]



📢 Latest Updates

  • Jul-13: The trained model weights are available here
  • Jul-13: Released the training and evaluation code.
  • Jul-1: RGNet is accepted to ECCV 2024! 🔥🔥

RGNet Overview 💡

RGNet is a novel architecture for processing long videos (20–120 minutes) for fine-grained video moment understanding and reasoning. It predicts the moment boundary specified by a textual query in an hour-long video. RGNet unifies retrieval and moment detection into a single network and processes long videos at multiple levels of granularity, e.g., clips and frames.
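
As a rough mental model of this two-granularity design, the sketch below scores clips and frames against the query within a single forward pass. All names, the mean pooling, and the dot-product scoring are simplifying assumptions for illustration; this is not the actual RGNet code.

```python
import torch

def unified_forward(frame_feats, query_feat, clip_len=128):
    """frame_feats: (T, D) frame features of a long video; query_feat: (D,)."""
    T, D = frame_feats.shape
    # Clip level: pool consecutive frames into non-overlapping clips.
    pad = (-T) % clip_len
    padded = torch.cat([frame_feats, frame_feats.new_zeros(pad, D)])
    clips = padded.view(-1, clip_len, D).mean(dim=1)   # (num_clips, D)
    # Retrieval: score every clip against the text query.
    clip_scores = clips @ query_feat                   # (num_clips,)
    best = int(clip_scores.argmax())
    # Grounding: score frames inside the retrieved clip; the real model
    # predicts the moment boundary from these with a detection head.
    frame_scores = padded[best * clip_len:(best + 1) * clip_len] @ query_feat
    return clip_scores, frame_scores
```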



Contributions 🏆

  • We systematically deconstruct existing LVTG methods into clip retrieval and grounding stages. Through empirical evaluations, we discern that disjoint retrieval is the primary factor contributing to poor performance.
  • Based on our observations, we introduce RGNet, which integrates clip retrieval with grounding through parallel clip- and frame-level modeling. This obviates the need for a separate video retrieval network, replacing it with an end-to-end clip retrieval module tailored specifically to long videos.
  • We introduce sparse attention to the retriever and a corresponding loss to model fine-grained event understanding in long-range video. We propose a contrastive negative clip-mining strategy to simulate clip retrieval from a long video during training (see the sketch after this list).
  • RGNet surpasses prior methods, showcasing state-of-the-art performance on long video temporal grounding (LVTG) datasets MAD and Ego4D.
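
As a rough illustration of the contrastive negative clip-mining idea above, the sketch below samples negatives from the same long video for an InfoNCE-style retrieval loss. The function name, sampling scheme, and temperature are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def clip_retrieval_loss(clip_embs, query_emb, pos_idx, num_negs=15, tau=0.07):
    """clip_embs: (N, D) clip embeddings from one long video;
    query_emb: (D,); pos_idx: index of the ground-truth clip."""
    # Mine negatives from the same video to simulate retrieval
    # over a long video during training.
    neg_pool = torch.tensor([i for i in range(clip_embs.size(0)) if i != pos_idx])
    negs = neg_pool[torch.randperm(len(neg_pool))[:num_negs]]
    cand = torch.cat([clip_embs[pos_idx].unsqueeze(0), clip_embs[negs]])
    logits = F.cosine_similarity(cand, query_emb.unsqueeze(0), dim=-1) / tau
    # InfoNCE: the ground-truth clip (index 0) should score highest.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```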

Installation 🔧

  • Follow INSTALL.md for installing necessary dependencies and compiling the code.

Prepare-offline-data

  • Download the full Ego4D-NLQ data: Ego4D-NLQ (8.29GB).
  • Download the partial MAD data: MAD (6.5GB). We cannot share the MAD visual features at this time; please request access to the MAD dataset from the official MAD github.
  • We provide detailed feature extraction and file pre-processing procedures for both benchmarks; please refer to Feature_Extraction_MD.
  • Follow DATASET.md for processing the dataset.

Ego4D-NLQ-training

Training runs in two stages, pretraining followed by finetuning, and can be launched with the commands below. The checkpoints and other experiment log files will be written into the results directory.

bash rgnet/scripts/pretrain_ego4d.sh 
bash rgnet/scripts/finetune_ego4d.sh

Ego4D-NLQ-inference

Once the model is trained, you can use the following command for inference, where CHECKPOINT_PATH is the path to the saved checkpoint.

bash rgnet/scripts/inference_ego4d.sh CHECKPOINT_PATH 
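
For example, if finetuning saved its best checkpoint to results/ego4d_finetune/model_best.ckpt (a hypothetical path; substitute the one your run produced):

bash rgnet/scripts/inference_ego4d.sh results/ego4d_finetune/model_best.ckpt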

MAD-training

Training can be launched by running the following command:

bash rgnet/scripts/train_mad.sh 

MAD-inference

Once the model is trained, you can use the following command for inference, where CHECKPOINT_PATH is the path to the saved checkpoint.

bash rgnet/scripts/inference_mad.sh CHECKPOINT_PATH 
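
For example, with a hypothetical checkpoint path from MAD training:

bash rgnet/scripts/inference_mad.sh results/mad_train/model_best.ckpt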

Qualitative Analysis 🔍

A comprehensive evaluation of RGNet's performance on the Ego4D-NLQ dataset.



Acknowledgements 🙏

We are grateful to the following awesome projects that RGNet builds upon:

  • Moment-DETR: Detecting Moments and Highlights in Videos via Natural Language Queries
  • QD-DETR: Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
  • CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
  • MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
  • Ego-4D: Ego4D Episodic Memory Benchmark

If you're using RGNet in your research or applications, please cite using this BibTeX:

@article{hannan2023rgnet,
  title={RGNet: A Unified Retrieval and Grounding Network for Long Videos},
  author={Hannan, Tanveer and Islam, Md Mohaiminul and Seidl, Thomas and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2312.06729},
  year={2023}
}

License 📜


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟
