[AAAI2026] TextShield-R1: Reinforced Reasoning for Tampered Text Detection paper
Python 3.10
pip install transformers==4.56.0 peft==0.18.1 ms-swift==3.5.0 qwen-vl-utils[decord]==0.0.8
cp orm.py [your original ms-swift orm file, e.g. /usr/local/lib/python3.10/dist-packages/swift/plugin/orm.py]
- All input images should be resized so that both height and width are multiples of 28.
python resize_image_dir.py --input [your input image dir] --output [your output image dir]
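The multiple-of-28 constraint matches Qwen2.5-VL's vision patching. As a minimal sketch of the rounding logic only (a hypothetical helper; the repo's `resize_image_dir.py` is the authoritative implementation):

```python
def resize_to_multiple_of_28(width, height):
    # Round each side to the nearest multiple of 28, with a floor of 28
    # so tiny images still satisfy the constraint.
    return (max(28, round(width / 28) * 28),
            max(28, round(height / 28) * 28))
```

Applying it to an actual image with Pillow would look like `img.resize(resize_to_multiple_of_28(*img.size))`.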
- Download the fine-tuned model from Baidu Cloud.
- Start the model for interactive inference:
CUDA_VISIBLE_DEVICES=0 swift infer --model [your downloaded textshield model dir] --stream true --max_new_tokens 4096 --model_type qwen2_5_vl
- Input the prompt and image path.
Here are some processed images for testing the model pipeline.
<image> Is this image real, entirely generated, or tampered? If it has been tampered, what method was used, and what are the content and bounding box coordinates of the tampered text? Output the thinking process in <think> </think> and \n final answer (number) in <answer> </answer> tags. \n Here is an example answer for a real image: <answer> This image is real. </answer> Here is an example answer for an entirely generated image: <answer> This image is entirely generated. </answer> Here is an example answer for a locally tampered image: <answer> This image is tampered. It was tampered by copy-paste. The tampered text reads "small" in the text line "a small yellow flower", and it is located at ... </answer>
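Downstream evaluation relies on the `<think>`/`<answer>` tags requested in the prompt above. A small sketch of how such a response could be split (regex-based; how the repo's own scripts parse it is an assumption):

```python
import re

def parse_response(text):
    # Pull out the reasoning and the final answer from a model response
    # formatted with <think>...</think> and <answer>...</answer> tags.
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return (think.group(1).strip() if think else None,
            answer.group(1).strip() if answer else None)
```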
- OCR rectification
OCR rectification is integrated directly into the IoU evaluation script.
unzip ocr_info.zip
python eval_iou_with_ocr_rectification.py --input [your inference output json file]
- Model inference with a json dataset as input and output
Convert the image dataset dir to a json file:
python convert_image_dir_to_json.py --input [your image dataset dir]
Run inference with the generated dataset json file:
CUDA_VISIBLE_DEVICES=0 swift infer --model [your downloaded textshield model dir] --val_dataset [your dataset json file] --max_new_tokens 4096 --model_type qwen2_5_vl
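The conversion script above is expected to emit one sample per line in ms-swift's conversational JSONL schema. A minimal sketch, assuming the `messages`/`images` field names from ms-swift's custom-dataset conventions (the prompt placeholder stands in for the full prompt from the inference section):

```python
import json
import os

PROMPT = "<image> Is this image real, entirely generated, or tampered? ..."  # full prompt from the inference section above

def image_dir_to_jsonl(image_dir, out_path):
    # Write one JSON line per image, pairing the fixed prompt with the
    # image path in ms-swift's assumed "messages"/"images" schema.
    with open(out_path, "w", encoding="utf-8") as f:
        for name in sorted(os.listdir(image_dir)):
            if not name.lower().endswith((".jpg", ".jpeg", ".png")):
                continue
            sample = {
                "messages": [{"role": "user", "content": PROMPT}],
                "images": [os.path.join(image_dir, name)],
            }
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

The repo's `convert_image_dir_to_json.py` is the authoritative converter; this only illustrates the expected record shape.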
- Download the word-vector weights (fastText)
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip
unzip wiki-news-300d-1M.vec.zip
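The `.vec` file is plain text: a `count dim` header line, then one `word v1 ... v300` line per entry. A minimal loader plus cosine similarity (presumably what the reasoning-score evaluation uses for soft word matching; that exact use is an assumption):

```python
import numpy as np

def load_vectors(path, limit=None):
    # Parse the fastText .vec text format: skip the "count dim" header,
    # then map each word to its float vector.
    vecs = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # header: "count dim"
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            word, *vals = line.rstrip().split(" ")
            vecs[word] = np.asarray(vals, dtype=np.float32)
    return vecs

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Passing `limit` keeps memory low when only frequent words are needed (the file holds 1M vectors).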
- Run the evaluation scripts (the input is the inference output json file produced by the model inference step above)
Evaluate image-level classification
python eval_classification.py --input [your input model inference output json file]
Evaluate tampered text ocr accuracy
python eval_ocr.py --input [your input model inference output json file]
Evaluate tampered text localization IoU
python eval_iou.py --input [your input model inference output json file]
Evaluate tampered text localization IoU with OCR rectification
python eval_iou_with_ocr_rectification.py --input [your input model inference output json file]
Evaluate tampered text forgery reasoning score
python eval_reasoning.py --input [your input model inference output json file]
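For reference, the localization metric above is standard box IoU; a self-contained sketch with boxes as `(x1, y1, x2, y2)` (the repo's `eval_iou.py` remains the authoritative scorer):

```python
def iou(box_a, box_b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```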
- Pre-training
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft --model Qwen2.5-VL-7B-Instruct --dataset [your pre-training dataset json file] --split_dataset_ratio 0.0001 --train_type lora --torch_dtype bfloat16 --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 1 --learning_rate 2e-4 --lora_rank 32 --lora_alpha 64 --target_modules all-linear --freeze_vit False --freeze_aligner False --gradient_accumulation_steps 1 --eval_steps 5000 --save_steps 5000 --save_total_limit 3 --logging_steps 50 --max_length 4096 --output_dir [your output dir] --warmup_ratio 0.03 --dataloader_num_workers 16 --weight_decay 0 --deepspeed zero2
CUDA_VISIBLE_DEVICES=0 swift export --adapters [your pre-training stage output dir] --merge_lora true
- Cold start
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft --model [your pre-trained model] --dataset "aaai_train.jsonl#12983" --split_dataset_ratio 0.0001 --train_type lora --torch_dtype bfloat16 --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 1 --learning_rate 2e-4 --lora_rank 32 --lora_alpha 64 --target_modules all-linear --freeze_vit False --freeze_aligner False --gradient_accumulation_steps 1 --eval_steps 5000 --save_steps 5000 --save_total_limit 3 --logging_steps 50 --max_length 4096 --output_dir [your output dir] --warmup_ratio 0.03 --dataloader_num_workers 16 --weight_decay 0 --deepspeed zero2
CUDA_VISIBLE_DEVICES=0 swift export --adapters [your cold-start stage output dir] --merge_lora true
- GRPO
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift rlhf --rlhf_type grpo \
--model [your cold-start model] \
--reward_funcs realfake method ocr iou format \
--reward_weights 1.0 0.5 1.0 1.0 0.1 \
--use_vllm true \
--vllm_gpu_memory_utilization 0.8 \
--train_type lora \
--torch_dtype bfloat16 \
--dataset 'aaai_train.jsonl#51690' \
--split_dataset_ratio 0.001 \
--max_length 2048 \
--max_completion_length 1024 \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 2 \
--learning_rate 1e-4 \
--target_modules all-linear \
--freeze_vit False \
--freeze_aligner False \
--lora_rank 64 \
--lora_alpha 128 \
--gradient_accumulation_steps 1 \
--eval_steps 500 \
--save_steps 500 \
--save_total_limit 100 \
--logging_steps 1 \
--output_dir [your GRPO output dir] \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--num_generations 8 \
--temperature 1.0 \
--system 'You are a helpful assistant.' \
--deepspeed zero2 \
--log_completions true \
--beta 0.001 \
--num_iterations 1
CUDA_VISIBLE_DEVICES=0 swift export --adapters [your GRPO stage output dir] --merge_lora true
The final merged output dir is the TextShield-R1 model and can be used as the input model dir for the inference stage above.
The proposed TFR benchmark has been uploaded to HuggingFace and BaiduCloud.
Researchers are welcome 😃 to apply for this dataset by sending an email to 202221012612@mail.scut.edu.cn (from an institutional email address) that states:
- Who you are and your institution.
- Who your supervisor/mentor is.
The original data in this dataset is sourced from public channels such as the Internet, and its copyright remains with the original providers. The collated and annotated dataset presented here is for non-commercial use only and is currently licensed to universities and research institutions. To apply for the dataset, please fill in the application form as specified on the dataset's official website. The applicant must be a full-time employee of a university or research institute and is required to sign the application form. To simplify review, it is recommended to affix an official seal (a seal of a secondary-level department is acceptable).
For any questions about this work, please contact 202221012612@mail.scut.edu.cn.
The project is under CC-BY-NC-4.0 license.