Hai-Long Sun1,2,3, Zhun Sun3, Houwen Peng3, Han-Jia Ye1,2,✉
1School of Artificial Intelligence, Nanjing University  2National Key Laboratory for Novel Software Technology, Nanjing University  3Tencent
✉ Corresponding Author
- [18/3/2025] TVC is released! Check our project page, model weights, and arXiv paper for our strong multi-modal reasoning model!
- [06/3/2025] We release the training data and model; welcome to give them a try!
- [22/2/2025] TVC-72B achieves state-of-the-art average performance across five mathematical reasoning benchmarks.
- Evaluation code
- Training code
- Model weights
- Training Data
TVC (Take-along Visual Conditioning) is a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This helps the model retain attention to the visual input throughout the reasoning process.
The TVC method consists of two key stages: training and testing. In the training stage, we introduce Dynamic Visual Reaffirmation (DVR), which guides the model through iterative reinforcement of visual evidence during long reasoning chains. At test time, we present Periodic Visual Calibration (PVC), which periodically reactivates the visual input at the model's self-reflection intervals.
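For intuition, here is a minimal conceptual sketch of what periodic visual calibration could look like at inference time, written against the vLLM quick-start further below: generation proceeds in fixed-size segments, and after each segment the image is re-injected so the model re-attends to the visual evidence. The calibration interval, the re-look prompt text, and the limit_mm_per_prompt setting are illustrative assumptions, not the released TVC implementation.

# Conceptual sketch only: periodic re-injection of the image during long CoT decoding.
# The segment length, re-look text, and multi-image limit below are assumptions.
from vllm import LLM, SamplingParams
from PIL import Image

IMG = "<|vision_start|><|image_pad|><|vision_end|>"

llm = LLM(model="Allen8/TVC-72B", trust_remote_code=True,
          tensor_parallel_size=8, limit_mm_per_prompt={"image": 4})

image = Image.open("images/case1.png")
question = "Subtract all red things. Subtract all tiny matte balls. How many objects are left?"
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
          f"<|im_start|>user\n{IMG}{question}<|im_end|>\n"
          "<|im_start|>assistant\n")
images = [image]  # one copy of the image per placeholder in the prompt

segment = SamplingParams(temperature=0.0, max_tokens=1024)  # one reasoning segment per round

for _ in range(3):  # at most 3 calibration rounds
    out = llm.generate([{"prompt": prompt, "multi_modal_data": {"image": images}}],
                       sampling_params=segment)[0].outputs[0]
    prompt += out.text
    if out.finish_reason == "stop":  # the model finished its answer
        break
    # Periodic Visual Calibration: re-insert the image at the reflection point.
    prompt += f"\nLet me look at the image again. {IMG}\n"
    images.append(image)

print(prompt)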
We use iterative distillation to collect long-chain reasoning data, followed by a comprehensive response filtering process to ensure high-quality reasoning.
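As a rough illustration of that filtering step, the hypothetical helper below keeps a distilled response only if it is long enough, not degenerate, and ends with the correct final answer. The function name, thresholds, and heuristics are assumptions for illustration, not the actual TVC filtering pipeline.

# Hypothetical sketch of answer-consistency filtering for distilled long-CoT data.
# Thresholds and heuristics are illustrative assumptions, not the TVC pipeline.
import re

def keep_response(response: str, gold_answer: str, min_chars: int = 200) -> bool:
    """Keep a distilled response only if it is long enough, not degenerate,
    and its final numeric answer matches the reference answer."""
    if len(response) < min_chars:                        # too short for a long CoT
        return False
    if re.search(r"\b(\w+)(?:\s+\1\b){5,}", response):   # crude repetition check
        return False
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return bool(numbers) and numbers[-1] == gold_answer  # last number = final answer

# A correct, sufficiently long chain is kept; a wrong final answer is dropped.
chain = "Step by step: remove the red cube, then the tiny matte ball... " * 5 + "The final answer is 3."
print(keep_response(chain, "3"))   # True
print(keep_response(chain, "4"))   # False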
We conduct evaluation experiments across 6 benchmarks, covering both general reasoning and task-specific reasoning assessments. TVC exhibits notable effectiveness and generalizability when applied to Qwen2-VL, surpassing other state-of-the-art MLLMs by a large margin.
python -m venv llama-factory
source llama-factory/bin/activate
pip uninstall -y accelerate vllm matplotlib
cd LLaMA-Factory
pip install -r requirements.txt
You can also follow https://github.com/hiyouga/LLaMA-Factory to prepare the environment.
from vllm import LLM, SamplingParams
from PIL import Image

model_name = "Allen8/TVC-72B"

# tensor_parallel_size=8 assumes 8 GPUs; adjust to your hardware.
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    tensor_parallel_size=8,
)

question = "Hint: Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end.\nQuestion: Subtract all red things. Subtract all tiny matte balls. How many objects are left?\nPlease answer the question using a long-chain reasoning style and think step by step."

# Qwen2-VL chat template with a single image placeholder.
placeholder = "<|image_pad|>"
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
          f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
          f"{question}<|im_end|>\n"
          "<|im_start|>assistant\n")

# Greedy decoding with a long generation budget for long-chain reasoning.
sampling_params = SamplingParams(
    temperature=0.0,
    top_k=1,
    top_p=1.0,
    stop_token_ids=[],
    repetition_penalty=1.05,
    max_tokens=8192,
)

image = Image.open("images/case1.png")
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}

outputs = llm.generate([inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Coming soon, stay tuned!
We use LLaMA-Factory to fine-tune Qwen2-VL-72B-Instruct.
cd LLaMA-Factory
bash tvc-sft/scripts/train_qwen2vl_72b.sh
If you find this work useful for your research or applications, please cite our paper using the following BibTeX:
@article{sun2024mitigating,
  title={Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning},
  author={Sun, Hai-Long and Sun, Zhun and Peng, Houwen and Ye, Han-Jia},
  journal={arXiv preprint arXiv:2503.13360},
  year={2025}
}
- Our codebase is built upon LLaMA-Factory.
- Thanks to VLMEvalKit for the evaluation framework!