i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment
Daechul Ahn*1,3, Yura Choi*1,3, San Kim1,3, Youngjae Yu1, Dongyeop Kang2, Jonghyun Choi3,† (*Equal Contribution)
1Yonsei University, 2University of Minnesota, 3Seoul National University
†Corresponding Author
Abstract: Aligning Video Large Multimodal Models (VLMMs) faces challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) have recently shown significant improvements in language model alignment, particularly on reasoning tasks, self-aligned models applied to large video-language models often produce lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, which we call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating already generated content and preferences in a loop, i-SRT improves the alignment between textual and visual modalities, reduces verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.
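The loop described in the abstract can be summarized in a short sketch. The snippet below is a minimal illustration of one i-SRT round, not the repository's actual implementation: the helpers `generate`, `retrospect`, and `judge` are hypothetical placeholders standing in for the model's response sampling, self-retrospection, and self-judgment steps.

```python
# Minimal sketch of one i-SRT round (illustrative only; all helper
# names below are hypothetical placeholders, not the repo's API).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred after self-retrospective judgment
    rejected: str  # response judged worse (e.g., verbose or off-video)


def isrt_round(
    prompts: List[str],
    generate: Callable[[str], List[str]],     # sample >= 2 candidate answers
    retrospect: Callable[[str, str], str],    # revisit a draft against the video/question
    judge: Callable[[str, str, str], bool],   # True if the first response is preferred
) -> List[PreferencePair]:
    """One round: generate, self-retrospect, then self-judge into preference pairs."""
    pairs = []
    for prompt in prompts:
        a, b = generate(prompt)[:2]
        # Retrospection: revise already generated content before judging it.
        a, b = retrospect(prompt, a), retrospect(prompt, b)
        if judge(prompt, a, b):
            pairs.append(PreferencePair(prompt, chosen=a, rejected=b))
        else:
            pairs.append(PreferencePair(prompt, chosen=b, rejected=a))
    return pairs


# Iteration: pairs from round t are used to update the policy with DPO,
# and the updated model produces the candidates for round t + 1.
```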
- [07/02] Upload model checkpoint & evaluation code
- [06/17] Create repository, update README
- Using the script from LLaVA-Hound-DPO:

```bash
TEST_VIDEO_DIR=YOUR_PATH bash setup/setup_test_data.sh
```
- Or, download manually from this link.
```bash
# out-domain video question answering
bash Evaluation/pipeline/outdomain_test_pipeline.sh \
    results \
    SNUMPR/isrt_video_llava_7b_9th
```
- Coming soon
This project is released under the GNU General Public License.
- LLaVA-Hound-DPO: our code is built upon the LLaVA-Hound-DPO codebase.