[AAAI 2026 Oral] See, Rank and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection
Official Repository for "See, Rank and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection".
Accepted at AAAI 2026 Oral🔥
(* indicates corresponding author)
git clone https://github.com/VisualAIKHU/SRF.git
cd SRF
QVHighlights:
- Download the official feature files for the QVHighlights dataset from Moment-DETR.
- Download `moment_detr_features.tar.gz` (8GB) and extract it under the `../features` directory.
- Additionally, you can download `caption_features_internvl.tar.gz`.

TVSum:
- Download the feature files from UMT.
- Additionally, you can download `TVSum_caption_features.tar.gz`.

Charades-STA:
- Download the feature files from UMT.
- Additionally, you can download `Charades-STA_caption_features.tar.gz`.
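The extraction step above can be sketched in Python (the archive name and `../features` destination follow the instructions above; treat this helper as an illustrative sketch, not part of the repository):

```python
import os
import tarfile

# Hypothetical helper: extract a downloaded feature archive
# (e.g. moment_detr_features.tar.gz) into the ../features directory
# expected by the training scripts.
def extract_features(archive_path, dest="../features"):
    """Extract a .tar.gz archive into dest and return its top-level entries."""
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(os.listdir(dest))
```

For example, `extract_features("moment_detr_features.tar.gz")` unpacks the QVHighlights features so that the feature sub-directories appear under `../features`.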
conda create -n srf python=3.11.8
conda activate srf
pip install -r requirements.txt
You can train the model using only video features, or both video and audio features, by running one of the scripts below.
bash srf/scripts/train.sh
bash srf/scripts/train_audio.sh
You need to modify results_root, exp_id, and feat_root before running the scripts, and make sure each feature directory (v_feat_dirs, t_feat_dir, and c_feat_dir) is set correctly.
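A quick pre-flight check for these directories can be sketched as below. The variable names come from the training scripts, but the sub-directory names under `feat_root` are assumptions — match them to the contents of your extracted feature archives:

```python
import os

# Assumed layout under feat_root; adjust to your extracted archives.
feat_root = "../features"
v_feat_dirs = [os.path.join(feat_root, "slowfast_features"),
               os.path.join(feat_root, "clip_features")]
t_feat_dir = os.path.join(feat_root, "clip_text_features")
c_feat_dir = os.path.join(feat_root, "caption_features_internvl")

def missing_dirs(dirs):
    """Return the subset of directories that do not exist on disk."""
    return [d for d in dirs if not os.path.isdir(d)]

if __name__ == "__main__":
    for d in missing_dirs(v_feat_dirs + [t_feat_dir, c_feat_dir]):
        print(f"missing feature directory: {d}")
```

Running this before `bash srf/scripts/train.sh` surfaces path mistakes early instead of partway through training.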
You can generate hl_val_submission.jsonl and hl_test_submission.jsonl after training by running the commands below.
bash srf/scripts/inference.sh {results_path}/model_best.ckpt 'val'
bash srf/scripts/inference.sh {results_path}/model_best.ckpt 'test'
where {results_path} is the directory containing the saved checkpoint.
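To sanity-check a generated submission file, a minimal JSONL reader like the following can help (the file is one JSON object per line; the field names in the comment are illustrative, not confirmed by this repository):

```python
import json

# Minimal sketch: load a JSONL submission file such as
# hl_val_submission.jsonl into a list of dicts. Each line is one JSON
# object (typically carrying fields like a query id and predictions).
def load_jsonl(path):
    """Parse a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `len(load_jsonl("hl_val_submission.jsonl"))` should match the number of validation queries.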
For more details on submission, check standalone_eval/README.md.
@article{lee2025see,
title={See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection},
author={Lee, YuEun and Kim, Jung Uk},
journal={arXiv preprint arXiv:2511.22906},
year={2025}
}
Our code benefits from the excellent TR-DETR.