Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning

Hai-Long Sun^1,2,3  Zhun Sun^3  Houwen Peng^3  Han-Jia Ye^1,2,✉

^1 School of Artificial Intelligence, Nanjing University  ^2 National Key Laboratory for Novel Software Technology, Nanjing University  ^3 Tencent

✉ Corresponding Author


πŸ“’ News

  • πŸŽ‰[18/3/2025] TVC is released! Check out our project page, model weights, and arXiv paper for our strong multi-modal reasoning model!

  • πŸ”₯[06/3/2025] We have released the training data and model. Welcome to give them a try!

  • πŸ”₯[22/2/2025] TVC-72B achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.

πŸš€Coming Soon

  • Evaluation code
  • Training code
  • Model weights
  • Training data

🌟 Introduction

TVC (Take-along Visual Conditioning) is a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This helps the model keep attending to the visual evidence throughout the reasoning process.

compare.png
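
The token-compression idea can be pictured with a short sketch. The snippet below is a minimal illustration, not the released implementation: it keeps only the top-k visual tokens ranked by an importance score, and the scoring rule (a random stand-in for an attention-based score) is an assumption made for this example.

import torch

# Minimal sketch of visual-token compression: keep the top-k tokens by an
# importance score. The scoring rule is a placeholder; the paper's dynamic
# pruning criterion may differ.
def compress_visual_tokens(visual_tokens, importance, keep_ratio=0.25):
    # visual_tokens: (num_tokens, hidden), importance: (num_tokens,)
    k = max(1, int(visual_tokens.size(0) * keep_ratio))
    kept = importance.topk(k).indices.sort().values  # preserve original token order
    return visual_tokens[kept]

# Toy usage: 576 visual tokens with hidden size 1024 and random importance scores.
tokens = torch.randn(576, 1024)
scores = torch.rand(576)
print(compress_visual_tokens(tokens, scores).shape)  # torch.Size([144, 1024])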

Architecture

teaser.png

The TVC method consists of two key stages: training and testing. In the training stage, we introduce Dynamic Visual Reaffirmation (DVR), which guides the model through iterative reinforcement of visual evidence during long reasoning chains. In the testing phase, we present Periodic Visual Calibration (PVC), where visual reactivation is periodically triggered at self-reflection intervals.
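
To make PVC concrete, here is a minimal sketch under assumed interfaces (generate_step and image_segment are hypothetical names, not the repository's API): whenever the generated step contains a self-reflection cue, the compressed visual segment is re-appended to the context so the visual evidence is reactivated.

# Minimal PVC sketch: re-inject the (compressed) visual segment whenever the
# model emits a self-reflection cue. generate_step is a hypothetical callable
# that returns the next reasoning step, or None when generation is finished.
REFLECTION_CUES = ("wait", "let me re-check", "on second thought")

def generate_with_pvc(generate_step, prompt, image_segment, max_steps=64):
    context = prompt
    for _ in range(max_steps):
        step = generate_step(context)
        if step is None:
            break
        context += step
        if any(cue in step.lower() for cue in REFLECTION_CUES):
            context += image_segment  # visual reactivation at a reflection point
    return context

# Toy usage with a stub "model" that produces three steps and then stops.
steps = iter(["Count the shapes. ", "Wait, let me re-check the image. ", "Answer: 5.", None])
print(generate_with_pvc(lambda ctx: next(steps), "Q: ...\n", "<image_tokens>"))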

Data Generation Pipeline

data-pipeline.png

We use iterative distillation to collect long-chain reasoning data, followed by a comprehensive response filtering process to ensure high-quality reasoning.
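
As a rough illustration of the filtering step (the actual pipeline applies its own criteria), the sketch below keeps a distilled response only if its last number matches the ground-truth answer and its length stays within a budget; both checks are assumptions for illustration.

import re

# Illustrative response filter for distilled long-chain data: keep a response
# only if it ends with the correct final answer and is not excessively long.
def keep_response(response, ground_truth, max_words=8192):
    if len(response.split()) > max_words:  # crude length check; a real pipeline would use the tokenizer
        return False
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return bool(numbers) and numbers[-1] == ground_truth

print(keep_response("... subtracting them leaves 4 objects, so the answer is 4", "4"))  # True
print(keep_response("... therefore the answer is 7", "4"))                              # False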

Performance

main-result.png

We conduct evaluation experiments across 6 benchmarks, covering both general reasoning and task-specific reasoning assessments. TVC exhibits notable effectiveness and generalizability when applied to Qwen2-VL, surpassing other state-of-the-art MLLMs by a large margin.

Installation

python -m venv llama-factory
source llama-factory/bin/activate
pip uninstall -y accelerate vllm matplotlib
cd LLaMA-Factory
pip install -r requirements.txt

You can also follow https://github.com/hiyouga/LLaMA-Factory to prepare the environment.

Quick Start

from vllm import LLM, SamplingParams
from PIL import Image

# Load TVC-72B with vLLM (8-way tensor parallelism for the 72B model).
model_name = "Allen8/TVC-72B"
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    tensor_parallel_size=8,
)

# Build a Qwen2-VL-style chat prompt with an image placeholder.
question = "Hint: Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end.\nQuestion: Subtract all red things. Subtract all tiny matte balls. How many objects are left?\nPlease answer the question using a long-chain reasoning style and think step by step."
placeholder = "<|image_pad|>"
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
          f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
          f"{question}<|im_end|>\n"
          "<|im_start|>assistant\n")

# Greedy decoding with a mild repetition penalty and a long token budget
# to accommodate long-chain reasoning.
sampling_params = SamplingParams(
    temperature=0.0,
    top_k=1,
    top_p=1.0,
    stop_token_ids=[],
    repetition_penalty=1.05,
    max_tokens=8192,
)

# Pair the prompt with the input image and generate.
image = Image.open("images/case1.png")
inputs = {
    "prompt": prompt,
    "multi_modal_data": {"image": image},
}

outputs = llm.generate([inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
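
Since the hint asks the model to place the final integer at the end of its long-chain response, a simple post-processing step can recover it. The helper below is an illustrative heuristic that continues from the snippet above; it is not part of the released code.

import re

def extract_final_integer(text):
    # Return the last integer mentioned in the response, or None if absent.
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None

print("Final answer:", extract_final_integer(outputs[0].outputs[0].text))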

Evaluation

Coming soon, stay tuned!

Training

We use LLaMA-Factory to fine-tune Qwen2-VL-72B-Instruct.

cd LLaMA-Factory
bash tvc-sft/scripts/train_qwen2vl_72b.sh

Case Study

case-study.png

Citation

If you find this work useful for your research and applications, please cite our paper using the following BibTeX:

@article{sun2024mitigating,
    title={Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning},
    author={Sun, Hai-Long and Sun, Zhun and Peng, Houwen and Ye, Han-Jia},
    journal={arXiv preprint arXiv:2503.13360},
    year={2025}
}

Acknowledgement
