Multiscale Positive-Unlabeled Detection of AI-Generated Texts

Yuchuan Tian, Hanting Chen, Xutao Wang, Zheyuan Bai, Qinghua Zhang, Ruifeng Li, Chao Xu, Yunhe Wang

The official code for our paper "Multiscale Positive-Unlabeled Detection of AI-Generated Texts".

Paper Link: https://arxiv.org/pdf/2305.18149.pdf

BibTeX-formatted citation:

@misc{tian2023multiscale,
      title={Multiscale Positive-Unlabeled Detection of AI-Generated Texts}, 
      author={Yuchuan Tian and Hanting Chen and Xutao Wang and Zheyuan Bai and Qinghua Zhang and Ruifeng Li and Chao Xu and Yunhe Wang},
      year={2023},
      eprint={2305.18149},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Detector Models

We have open-sourced the detector models from the paper as follows.

Links for Detectors: Google Drive Baidu Disk (PIN:1234)

We have also uploaded detector models to HuggingFace, where easy-to-use DEMOs and online APIs are provided.

Variants	HC3-Full-En	HC3-Sent-En
seed0	98.68	82.84
seed1 (HuggingFace: en v1)	98.56	87.06
seed2	97.97	86.02
Avg.	98.40±0.31	85.31±1.80
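As a quick arithmetic check, the Avg. row appears to be the mean of the three seeds together with the population standard deviation (dividing by 3 rather than 2). The sketch below recomputes it from the per-seed values in the table; the function name summarize is only illustrative and is not part of the repository.

```python
from statistics import mean, pstdev

# Per-seed scores copied from the table above (seed0, seed1, seed2).
scores = {
    "HC3-Full-En": [98.68, 98.56, 97.97],
    "HC3-Sent-En": [82.84, 87.06, 86.02],
}


def summarize(values):
    """Format mean ± population standard deviation with two decimals."""
    return f"{mean(values):.2f}±{pstdev(values):.2f}"


for name, values in scores.items():
    print(name, summarize(values))
# HC3-Full-En 98.40±0.31
# HC3-Sent-En 85.31±1.80
```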

Stronger Detectors

We have also open-sourced detector models trained with strengthened training strategies. Specifically, we developed a strong Chinese detector, AIGC_detector_zhv2, which performs comparably to state-of-the-art closed-source Chinese detectors on various texts, including news articles, poetry, and essays. The demos and APIs are available on HuggingFace.

Detector	Google Drive	Baidu Disk	HuggingFace Link
English, version 2 (env2)	Google Drive	Baidu Disk (PIN:1234)	en v2
Chinese, version 2 (zhv2)	Google Drive	Baidu Disk (PIN:1234)	zh v2

About the Dataset

Here we provide the official link for the HC3 dataset: Dataset Link. We also provide identical copies of the dataset on Google Drive and Baidu Disk (PIN:1234) for ease of use. We acknowledge the marvelous work by the HC3 authors.

Data Preprocessing

In Appendix B of our paper, we propose removing redundant spaces from the human texts of the HC3-English dataset. We provide a helper function, en_cleaning, in corpus_cleaning_kit.py that takes a sentence string as input and returns the preprocessed sentence without redundant spaces.
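To illustrate the kind of cleaning described here, the following is a minimal sketch. It is not the en_cleaning function from corpus_cleaning_kit.py: the name strip_redundant_spaces and the two rules it applies (collapsing runs of whitespace and dropping spaces before punctuation) are assumptions made for illustration, and the actual procedure in Appendix B may differ.

```python
import re

# Hypothetical helper for illustration only; NOT the repository's en_cleaning.
_WHITESPACE_RUN = re.compile(r"\s+")
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")


def strip_redundant_spaces(sentence: str) -> str:
    """Collapse whitespace runs, then remove spaces before punctuation."""
    collapsed = _WHITESPACE_RUN.sub(" ", sentence).strip()
    return _SPACE_BEFORE_PUNCT.sub(r"\1", collapsed)


print(strip_redundant_spaces("Hello ,  world !  This is  a test ."))
# Hello, world! This is a test.
```

For the experiments in the paper, use the repository's own en_cleaning rather than this sketch.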

Here we provide a cleaned version of HC3-English, in which all answers are cleaned (i.e., redundant spaces are removed). However, please use the original version of HC3 to reproduce the experiments in our paper, as the cleaning procedures are already embedded in the training and validation scripts.

CLEANED HC3-English Link: Google Drive Baidu Disk (PIN:1234)

Preparation

Install the required packages:

pip install -r requirements.txt

Download the datasets to the directory ./data
Download the nltk package punkt (this step can be done through the nltk API: nltk.download('punkt'))
Download the pretrained models (this step can be done automatically by transformers)

Before running, the directory should contain the following files:

├── data
│   ├── unfilter_full
│   │   ├── en_test.csv
│   │   └── en_train.csv
│   └── unfilter_sent
│       ├── en_test.csv
│       └── en_train.csv
├── README.md
├── corpus_cleaning_kit.py
├── dataset.py
├── multiscale_kit.py
├── option.py
├── pu_loss_mod.py
├── prior_kit.py
├── requirements.txt
├── train.py
└── utils.py
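Before launching training, it can help to confirm that the four dataset files from the tree above are in place. The following is a small sketch of such a check; the function name missing_data_files is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List

# Dataset paths copied from the directory tree above.
EXPECTED_DATA_FILES = [
    "data/unfilter_full/en_train.csv",
    "data/unfilter_full/en_test.csv",
    "data/unfilter_sent/en_train.csv",
    "data/unfilter_sent/en_test.csv",
]


def missing_data_files(root: str = ".") -> List[str]:
    """Return the expected dataset paths that do not exist under root."""
    base = Path(root)
    return [p for p in EXPECTED_DATA_FILES if not (base / p).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing dataset files:")
        for path in missing:
            print("  " + path)
    else:
        print("All dataset files found.")
```

Run it from the repository root so that the relative paths resolve against ./data.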

Training

The script for training is train.py.

RoBERTa on HC3-English

Commands for seed=0,1,2:

CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 0

CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 1

CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 2

BERT on HC3-English

Commands for seed=0,1,2:

CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 0


CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 1


CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 2
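The six commands above (RoBERTa and BERT, seeds 0 to 2) differ only in --model-name, --lamb, --prior, --len_thres, and --seed. As a convenience, the sketch below regenerates them from a single template. All flag names and values are copied from the commands above; the names TEMPLATE, PRESETS, and build_command are illustrative and not part of the repository.

```python
# Shared part of the training commands; only the braces vary.
TEMPLATE = (
    "python train.py --batch-size 32 --max-sequence-length 512 "
    "--train-data-file unfilter_full/en_train.csv "
    "--val-data-file unfilter_full/en_test.csv "
    "--model-name {model} --local-data data --lamb {lamb} --prior {prior} "
    "--pu_type dual_softmax_dyn_dtrun --len_thres {len_thres} "
    "--aug_min_length 1 --max-epochs 1 --weight-decay 0 "
    "--mode original_single --aug_mode sentence_deletion-0.25 --clean 1 "
    "--val_file1 unfilter_sent/en_test.csv --quick_val 1 "
    "--learning-rate 5e-05 --seed {seed}"
)

# Per-model hyperparameters, as they appear in the commands above.
PRESETS = {
    "roberta-base": {"lamb": "0.4", "prior": "0.2", "len_thres": "55"},
    "bert-base-cased": {"lamb": "0.5", "prior": "0.3", "len_thres": "60"},
}


def build_command(model: str, seed: int, gpu: int = 0) -> str:
    """Return the shell command for one model and seed."""
    body = TEMPLATE.format(model=model, seed=seed, **PRESETS[model])
    return f"CUDA_VISIBLE_DEVICES={gpu} {body}"


if __name__ == "__main__":
    for model in PRESETS:
        for seed in (0, 1, 2):
            print(build_command(model, seed))
```

The script only prints the commands, so they can be inspected or piped to a shell; it does not start training itself.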

Acknowledgement

Our code draws on the following GitHub repository:

https://github.com/openai/gpt-2-output-dataset

We sincerely thank its authors for open-sourcing their work.
