chatgpt-comparison-detection-HC3-Plus

HC3 does not cover semantic-invariant tasks. To fill this gap, we extend HC3 and propose HC3 Plus, a larger ChatGPT-generated text dataset covering translation, summarization, and paraphrasing. Details can be found in the paper "HC3 Plus: A Semantic-Invariant Human ChatGPT Comparison Corpus".

Dataset

To build the HC3 semantic-invariance dataset (HC3-SI), we first select several widely used, high-quality, human-annotated corpora covering translation, summarization, and paraphrasing. The main source datasets are CNN/DailyMail, XSum, LCSTS, news2016, WMT, and HC3 Question Paraphrase. We then merge HC3-SI with the original HC3 dataset to create the complete HC3 Plus dataset. The merged data is located in the data directory.

data/
    en/ # English dataset
        train.jsonl # Training set; includes both HC3-SI and HC3 data.
        val_hc3_si.jsonl # Validation set of the HC3-SI dataset.
        val_hc3_QA.jsonl # Validation set of the HC3 dataset.
        test_hc3_si.jsonl # Test set of the HC3-SI dataset.
        test_hc3_QA.jsonl # Test set of the HC3 dataset.
    zh/ # Chinese dataset
        train.jsonl # Training set; includes both HC3-SI and HC3 data.
        val_hc3_si.jsonl # Validation set of the HC3-SI dataset.
        val_hc3_QA.jsonl # Validation set of the HC3 dataset.
        test_hc3_si.jsonl # Test set of the HC3-SI dataset.
        test_hc3_QA.jsonl # Test set of the HC3 dataset.
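
Each split is a JSON Lines file, so it can be inspected before training. The following is a minimal sketch, assuming one JSON object per line; the `read_jsonl` helper is illustrative and not part of this repository, and no record field names are assumed (the script prints whatever fields it finds).

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of records (hypothetical helper)."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path taken from the directory layout above; adjust to the split you need.
    path = Path("data/en/test_hc3_si.jsonl")
    if path.exists():
        records = read_jsonl(path)
        print(len(records), "records")
        if records and isinstance(records[0], dict):
            # Field names are not documented here, so list them instead of guessing.
            print("fields:", sorted(records[0].keys()))
```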

Training

We train detectors for both English and Chinese, using RoBERTa and Tk-instruct as base models.

English

For the English detector, we use the following commands for training:

bash train_english_roberta.sh # Train the RoBERTa model.
bash train_english_s2s.sh # Train the Tk-instruct model.

Chinese

For the Chinese detector, we use the following commands for training:

bash train_chinese_roberta.sh # Train the RoBERTa model.
bash train_chinese_s2s.sh # Train the Tk-instruct model.

Evaluation

Use the following commands to score the RoBERTa detector on the test sets:

python test_roberta.py --data_path data/en/test_hc3_QA.jsonl --model_path ${model_path}
python test_roberta.py --data_path data/en/test_hc3_si.jsonl --model_path ${model_path}

Use the following commands to score the Tk-instruct detector on the test sets:

python test_s2s.py --data_path data/en/test_hc3_QA.jsonl --model_path ${model_path} --lang en
python test_s2s.py --data_path data/en/test_hc3_si.jsonl --model_path ${model_path} --lang en
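
To score both test sets in one go, the evaluation commands can be assembled programmatically. This is a minimal sketch, assuming the scripts accept exactly the flags shown in this section; the `build_eval_commands` helper and the `path/to/checkpoint` placeholder are hypothetical, and using `lang="zh"` with the `data/zh/` files is an assumption by analogy with the English commands.

```python
import shlex

# Test-set file names, taken from the data directory layout.
TEST_FILES = ("test_hc3_QA.jsonl", "test_hc3_si.jsonl")


def build_eval_commands(model_path, lang="en", detector="roberta"):
    """Build one evaluation command per test set (hypothetical helper)."""
    if detector not in ("roberta", "s2s"):
        raise ValueError("detector must be 'roberta' or 's2s'")
    commands = []
    for name in TEST_FILES:
        cmd = [
            "python", f"test_{detector}.py",
            "--data_path", f"data/{lang}/{name}",
            "--model_path", model_path,
        ]
        if detector == "s2s":
            # Only the Tk-instruct commands in this README pass --lang.
            cmd += ["--lang", lang]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; replace the placeholder path.
    for cmd in build_eval_commands("path/to/checkpoint", lang="en", detector="s2s"):
        print(shlex.join(cmd))
```

Each command is an argument list, so it can be passed to `subprocess.run(cmd, check=True)` once the checkpoint path is real.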
