[AAAI2026] TextShield-R1: Reinforced Reasoning for Tampered Text Detection paper
Python 3.10
pip install transformers==4.56.0 peft==0.18.1 ms-swift==3.5.0 qwen-vl-utils[decord]==0.0.8
cp orm.py [your original ms-swift orm file, e.g. /usr/local/lib/python3.10/dist-packages/swift/plugin/orm.py]
- All input images should be resized so that both height and width are multiples of 28.
python resize_image_dir.py --input [your input image dir] --output [your output image dir]
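The multiple-of-28 constraint matches Qwen2.5-VL's vision patching. As a minimal sketch of the rounding logic only (a hypothetical helper; the repo's `resize_image_dir.py` is the authoritative implementation):

```python
def resize_to_multiple_of_28(width, height):
    # Round each side to the nearest multiple of 28, with a floor of 28
    # so tiny images still satisfy the constraint.
    return (max(28, round(width / 28) * 28),
            max(28, round(height / 28) * 28))
```

Applying it to an actual image with Pillow would look like `img.resize(resize_to_multiple_of_28(*img.size))`.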
- Download the fine-tuned model from Baidu Cloud.
- Start the model for interactive inference:
CUDA_VISIBLE_DEVICES=0 swift infer --model [your downloaded textshield model dir] --stream true --max_new_tokens 4096 --model_type qwen2_5_vl
- Input the prompt and image path.
Here are some processed images for testing the model pipeline.
<image> Is this image real, entirely generated, or tampered? If it has been tampered, what method was used, and what are the content and bounding box coordinates of the tampered text? Output the thinking process in <think> </think> and \n final answer (number) in <answer> </answer> tags. \n Here is an example answer for a real image: <answer> This image is real. </answer> Here is an example answer for an entirely generated image: <answer> This image is entirely generated. </answer> Here is an example answer for a locally tampered image: <answer> This image is tampered. It was tampered by copy-paste. The tampered text reads "small" in the text line "a small yellow flower", and it is located at ... </answer>
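Downstream evaluation relies on the `<think>`/`<answer>` tags requested in the prompt above. A small sketch of how such a response could be split (regex-based; how the repo's own scripts parse it is an assumption):

```python
import re

def parse_response(text):
    # Pull out the reasoning and the final answer from a model response
    # formatted with <think>...</think> and <answer>...</answer> tags.
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return (think.group(1).strip() if think else None,
            answer.group(1).strip() if answer else None)
```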
- OCR rectification
OCR rectification is integrated directly into the IoU evaluation script.
unzip ocr_info.zip
python eval_iou_with_ocr_rectification.py --input [your inference output json file]
- Model inference with a json dataset as input and output
Convert the image dataset dir to a json file:
python convert_image_dir_to_json.py --input [your image dataset dir]
Run inference with the generated dataset json file:
CUDA_VISIBLE_DEVICES=0 swift infer --model [your downloaded textshield model dir] --val_dataset [your dataset json file] --max_new_tokens 4096 --model_type qwen2_5_vl
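The conversion script above is expected to emit one sample per line in ms-swift's conversational JSONL schema. A minimal sketch, assuming the `messages`/`images` field names from ms-swift's custom-dataset conventions (the prompt placeholder stands in for the full prompt from the inference section):

```python
import json
import os

PROMPT = "<image> Is this image real, entirely generated, or tampered? ..."  # full prompt from the inference section above

def image_dir_to_jsonl(image_dir, out_path):
    # Write one JSON line per image, pairing the fixed prompt with the
    # image path in ms-swift's assumed "messages"/"images" schema.
    with open(out_path, "w", encoding="utf-8") as f:
        for name in sorted(os.listdir(image_dir)):
            if not name.lower().endswith((".jpg", ".jpeg", ".png")):
                continue
            sample = {
                "messages": [{"role": "user", "content": PROMPT}],
                "images": [os.path.join(image_dir, name)],
            }
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

The repo's `convert_image_dir_to_json.py` is the authoritative converter; this only illustrates the expected record shape.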
- Download the word-vector weights (fastText)
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip
unzip wiki-news-300d-1M.vec.zip
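The `.vec` file is plain text: a `count dim` header line, then one `word v1 ... v300` line per entry. A minimal loader plus cosine similarity (presumably what the reasoning-score evaluation uses for soft word matching; that exact use is an assumption):

```python
import numpy as np

def load_vectors(path, limit=None):
    # Parse the fastText .vec text format: skip the "count dim" header,
    # then map each word to its float vector.
    vecs = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # header: "count dim"
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            word, *vals = line.rstrip().split(" ")
            vecs[word] = np.asarray(vals, dtype=np.float32)
    return vecs

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Passing `limit` keeps memory low when only frequent words are needed (the file holds 1M vectors).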
- Run the evaluation scripts (the input is the inference output json file produced by the model inference step above)
Evaluate image-level classification
python eval_classification.py --input [your input model inference output json file]
Evaluate tampered text ocr accuracy
python eval_ocr.py --input [your input model inference output json file]
Evaluate tampered text localization IoU
python eval_iou.py --input [your input model inference output json file]
Evaluate tampered text localization IoU with OCR rectification
python eval_iou_with_ocr_rectification.py --input [your input model inference output json file]
Evaluate tampered text forgery reasoning score
python eval_reasoning.py --input [your input model inference output json file]
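For reference, the localization metric above is standard box IoU; a self-contained sketch with boxes as `(x1, y1, x2, y2)` (the repo's `eval_iou.py` remains the authoritative scorer):

```python
def iou(box_a, box_b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```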
- Pre-training
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft --model Qwen2.5-VL-7B-Instruct --dataset [your pre-training dataset json file] --split_dataset_ratio 0.0001 --train_type lora --torch_dtype bfloat16 --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 1 --learning_rate 2e-4 --lora_rank 32 --lora_alpha 64 --target_modules all-linear --freeze_vit False --freeze_aligner False --gradient_accumulation_steps 1 --eval_steps 5000 --save_steps 5000 --save_total_limit 3 --logging_steps 50 --max_length 4096 --output_dir [your output dir] --warmup_ratio 0.03 --dataloader_num_workers 16 --weight_decay 0 --deepspeed zero2
CUDA_VISIBLE_DEVICES=0 swift export --adapters [your pre-training stage output dir] --merge_lora true
- Cold start
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft --model [your pre-trained model] --dataset "aaai_train.jsonl#12983" --split_dataset_ratio 0.0001 --train_type lora --torch_dtype bfloat16 --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 1 --learning_rate 2e-4 --lora_rank 32 --lora_alpha 64 --target_modules all-linear --freeze_vit False --freeze_aligner False --gradient_accumulation_steps 1 --eval_steps 5000 --save_steps 5000 --save_total_limit 3 --logging_steps 50 --max_length 4096 --output_dir [your output dir] --warmup_ratio 0.03 --dataloader_num_workers 16 --weight_decay 0 --deepspeed zero2
CUDA_VISIBLE_DEVICES=0 swift export --adapters [your cold-start stage output dir] --merge_lora true
- GRPO
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift rlhf --rlhf_type grpo \
--model [your cold-start model] \
--reward_funcs realfake method ocr iou format \
--reward_weights 1.0 0.5 1.0 1.0 0.1 \
--use_vllm true \
--vllm_gpu_memory_utilization 0.8 \
--train_type lora \
--torch_dtype bfloat16 \
--dataset 'aaai_train.jsonl#51690' \
--split_dataset_ratio 0.001 \
--max_length 2048 \
--max_completion_length 1024 \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 2 \
--learning_rate 1e-4 \
--target_modules all-linear \
--freeze_vit False \
--freeze_aligner False \
--lora_rank 64 \
--lora_alpha 128 \
--gradient_accumulation_steps 1 \
--eval_steps 500 \
--save_steps 500 \
--save_total_limit 100 \
--logging_steps 1 \
--output_dir [your GRPO output dir] \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--num_generations 8 \
--temperature 1.0 \
--system 'You are a helpful assistant.' \
--deepspeed zero2 \
--log_completions true \
--beta 0.001 \
--num_iterations 1
CUDA_VISIBLE_DEVICES=0 swift export --adapters [your GRPO stage output dir] --merge_lora true
The final merged output dir is the TextShield-R1 model and can be used as the input model dir for the inference stage above.
The proposed TFR benchmark has been uploaded to HuggingFace and BaiduCloud.
Researchers are welcome 😃 to apply for this dataset by sending an email to 202221012612@mail.scut.edu.cn (from an institutional email address) that states:
- Who you are and your institution.
- Who your supervisor/mentor is.
The original data in this dataset is sourced from public channels such as the Internet, and its copyright remains with the original providers. The collated and annotated dataset presented here is for non-commercial use only and is currently licensed to universities and research institutions. To apply for the dataset, please fill in the application form as specified on the dataset's official website. The applicant must be a full-time employee of a university or research institute and is required to sign the application form. To simplify review, it is recommended to affix an official seal (a seal of a secondary-level department is acceptable).
For any questions about this work, please contact 202221012612@mail.scut.edu.cn.
The project is under CC-BY-NC-4.0 license.