Hai-Long Sun1,2,3, Zhun Sun3, Houwen Peng3, Han-Jia Ye1,2,✉
1School of Artificial Intelligence, Nanjing University  2National Key Laboratory for Novel Software Technology, Nanjing University  3Tencent
✉ Corresponding Author
- [18/3/2025] TVC is released! Check our project page, model weights, and arXiv paper for our strong multi-modal reasoning model!
- [06/3/2025] We release the training data and model; welcome to give them a try!
- [22/2/2025] TVC-72B achieves state-of-the-art average performance across five mathematical reasoning benchmarks.
- Evaluation code
- Training code
- Model weights
- Training Data
TVC (Take-along Visual Conditioning) is a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This helps the model retain attention to the visual input throughout the reasoning process.
The TVC method consists of two key stages: training and testing. In the training stage, we introduce Dynamic Visual Reaffirmation (DVR), which guides the model through iterative reinforcement of visual evidence during long reasoning chains. At test time, we present Periodic Visual Calibration (PVC), which periodically reactivates the visual input at the model's self-reflection intervals.
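For intuition, here is a minimal conceptual sketch of what periodic visual calibration could look like at inference time, written against the vLLM quick-start further below: generation proceeds in fixed-size segments, and after each segment the image is re-injected so the model re-attends to the visual evidence. The calibration interval, the re-look prompt text, and the limit_mm_per_prompt setting are illustrative assumptions, not the released TVC implementation.

# Conceptual sketch only: periodic re-injection of the image during long CoT decoding.
# The segment length, re-look text, and multi-image limit below are assumptions.
from vllm import LLM, SamplingParams
from PIL import Image

IMG = "<|vision_start|><|image_pad|><|vision_end|>"

llm = LLM(model="Allen8/TVC-72B", trust_remote_code=True,
          tensor_parallel_size=8, limit_mm_per_prompt={"image": 4})

image = Image.open("images/case1.png")
question = "Subtract all red things. Subtract all tiny matte balls. How many objects are left?"
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
          f"<|im_start|>user\n{IMG}{question}<|im_end|>\n"
          "<|im_start|>assistant\n")
images = [image]  # one copy of the image per placeholder in the prompt

segment = SamplingParams(temperature=0.0, max_tokens=1024)  # one reasoning segment per round

for _ in range(3):  # at most 3 calibration rounds
    out = llm.generate([{"prompt": prompt, "multi_modal_data": {"image": images}}],
                       sampling_params=segment)[0].outputs[0]
    prompt += out.text
    if out.finish_reason == "stop":  # the model finished its answer
        break
    # Periodic Visual Calibration: re-insert the image at the reflection point.
    prompt += f"\nLet me look at the image again. {IMG}\n"
    images.append(image)

print(prompt)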
We use iterative distillation to collect long-chain reasoning data, followed by a comprehensive response filtering process to ensure high-quality reasoning.
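As a rough illustration of that filtering step, the hypothetical helper below keeps a distilled response only if it is long enough, not degenerate, and ends with the correct final answer. The function name, thresholds, and heuristics are assumptions for illustration, not the actual TVC filtering pipeline.

# Hypothetical sketch of answer-consistency filtering for distilled long-CoT data.
# Thresholds and heuristics are illustrative assumptions, not the TVC pipeline.
import re

def keep_response(response: str, gold_answer: str, min_chars: int = 200) -> bool:
    """Keep a distilled response only if it is long enough, not degenerate,
    and its final numeric answer matches the reference answer."""
    if len(response) < min_chars:                        # too short for a long CoT
        return False
    if re.search(r"\b(\w+)(?:\s+\1\b){5,}", response):   # crude repetition check
        return False
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return bool(numbers) and numbers[-1] == gold_answer  # last number = final answer

# A correct, sufficiently long chain is kept; a wrong final answer is dropped.
chain = "Step by step: remove the red cube, then the tiny matte ball... " * 5 + "The final answer is 3."
print(keep_response(chain, "3"))   # True
print(keep_response(chain, "4"))   # False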
We conduct evaluation experiments across 6 benchmarks, covering both general reasoning and task-specific reasoning assessments. TVC exhibits notable effectiveness and generalizability when applied to Qwen2-VL, surpassing other state-of-the-art MLLMs by a large margin.
python -m venv llama-factory
source llama-factory/bin/activate
pip uninstall -y accelerate vllm matplotlib
cd LLaMA-Factory
pip install -r requirements.txt
You can also follow https://github.com/hiyouga/LLaMA-Factory to prepare the environment.
from vllm import LLM, SamplingParams
from PIL import Image

model_name = "Allen8/TVC-72B"

# tensor_parallel_size=8 assumes 8 GPUs; adjust to your hardware.
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    tensor_parallel_size=8,
)

question = "Hint: Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end.\nQuestion: Subtract all red things. Subtract all tiny matte balls. How many objects are left?\nPlease answer the question using a long-chain reasoning style and think step by step."

# Qwen2-VL chat template with a single image placeholder.
placeholder = "<|image_pad|>"
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
          f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
          f"{question}<|im_end|>\n"
          "<|im_start|>assistant\n")

# Greedy decoding with a long generation budget for long-chain reasoning.
sampling_params = SamplingParams(
    temperature=0.0,
    top_k=1,
    top_p=1.0,
    stop_token_ids=[],
    repetition_penalty=1.05,
    max_tokens=8192,
)

image = Image.open("images/case1.png")
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}

outputs = llm.generate([inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Coming soon, stay tuned!
We use LLaMA-Factory to fine-tune Qwen2-VL-72B-Instruct.
cd LLaMA-Factory
bash tvc-sft/scripts/train_qwen2vl_72b.sh
If you find this work useful for your research or applications, please cite our paper using the following BibTeX:
@article{sun2024mitigating,
  title={Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning},
  author={Sun, Hai-Long and Sun, Zhun and Peng, Houwen and Ye, Han-Jia},
  journal={arXiv preprint arXiv:2503.13360},
  year={2025}
}
- Our codebase is built upon LLaMA-Factory.
- Thanks to VLMEvalKit for the evaluation framework!