
Red Teaming Language Model Detectors with Language Models

In this work, we investigate the robustness and reliability of LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt.
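As a rough illustration of the first strategy, the snippet below queries an auxiliary LLM for a context-aware synonym of a single word. It is only a sketch of the idea, not the code in this repository; it assumes the openai Python package (v1+) and an OPENAI_API_KEY in the environment.

# Illustrative sketch of strategy 1 (context-aware word substitution);
# NOT the implementation used in this repository.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def propose_replacement(text: str, word: str) -> str:
    """Ask an auxiliary LLM for a synonym of `word` that fits the context of `text`."""
    prompt = (
        f"Suggest a single synonym for the word '{word}' that fits the context "
        f"of the following text. Reply with only the replacement word.\n\nText: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(propose_replacement(
    "The committee reached a unanimous decision after a lengthy discussion.", "lengthy"))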

More details can be found in our paper:

Zhouxing Shi*, Yihan Wang*, Fan Yin*, Xiangning Chen, Kai-Wei Chang, Cho-Jui Hsieh. Red Teaming Language Model Detectors with Language Models. To appear in TACL. (*Alphabetical order.)

Setup

Install Python dependencies:

pip install -r requirements.txt

If you want to use LLaMA models in the experiments, you need to download the models yourself and convert them into the Hugging Face format (see instructions here).
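For example, with a recent transformers release, the conversion script bundled with the library can be used as below (the paths are placeholders, and flag names may differ slightly across transformers versions):

python -m transformers.models.llama.convert_llama_weights_to_hf --input_dir /path/to/llama --model_size 65B --output_dir /path/to/llama/hf_models/65B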

Attack with Word Substitutions

Attack against Watermark Detectors

Enter the watermarking directory with cd lm_watermarking. The code is built on the codebase of the original watermarking paper.

python demo_watermark.py --attack_method llama_replacement --num_examples 100 --dataset eli5 --gamma 0.5 --test_ratio 0.15 --max_new_tokens 100 --delta 1.5 --replacement_checkpoint_path /home/data/llama/hf_models/65B/ --replacement_tokenizer_path /home/data/llama/hf_models/65B/ --num_replacement_retry 1 --valid_factor 1.5 --model_name_or_path gpt2-xl
  • attack_method: llama_replacement uses a LLaMA model with watermarking hyperparameters gamma and delta to generate word replacement candidates; GPT_replacement queries the ChatGPT API to generate word replacement candidates.
  • num_examples: number of examples in evaluation
  • dataset: dataset used in evaluation, choose from ['eli5', 'xsum']
  • gamma, delta: watermarking hyperparameters controlling the watermarking strength
  • test_ratio: approximate final ratio of replaced tokens in the word replacement attack
  • max_new_tokens: max number of tokens in generation
  • replacement_checkpoint_path, replacement_tokenizer_path: path of the model checkpoint used to generate word replacement candidates
  • num_replacement_retry: Some word replacements generated by the replacement model can be invalid and are filtered out, so num_replacement_retry can be set to retry the generation when the generation process is stochastic. In all of our experiments in the paper, we use num_replacement_retry=1, since we use greedy decoding by default with no randomness.
  • valid_factor: We pick test_ratio * valid_factor of the tokens to generate word replacements for, since only approximately 1/valid_factor of the replacements generated by the replacement model are valid. We use valid_factor=1.5 for the LLaMA-65B model (see the sketch after this list).
  • model_name_or_path: path (if local) or name (if on the Hugging Face Hub) of the generative model used to generate the watermarked outputs for the datasets
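The interaction between test_ratio and valid_factor can be checked with a few lines of arithmetic (the numbers mirror the example command above; this is not the repository's code):

# How test_ratio and valid_factor interact, using the values from the example command.
test_ratio = 0.15     # target fraction of tokens replaced in the final text
valid_factor = 1.5    # oversampling factor: only ~1/valid_factor of candidates survive filtering
num_tokens = 100      # max_new_tokens in the example command

num_candidates = round(num_tokens * test_ratio * valid_factor)  # tokens sent to the replacement model
expected_valid = round(num_candidates / valid_factor)           # replacements expected to remain valid
print(num_candidates, expected_valid)  # ~22 candidates, ~15 surviving replacements (~0.15 of 100 tokens)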

Attack against DetectGPT

Enter the DetectGPT directory with cd DetectGPT.

Code structure and options

Our attackers are in the file attackers.py, where we implement the DIPPER paraphraser baseline as well as the query-free (random) and query-based (genetic) attackers from this paper.

To run the attack, turn on the --attack argument and set up the attacker with --paraphrase for the baseline, or --attack_method genetic / --attack_method random for the attackers in this paper.

The red-teaming model can be either ChatGPT or LLaMA, selected with the --attack_model chatgpt or --attack_model llama argument.

The default model for generating sampled texts is GPT-2. Switch to ChatGPT with --chatgpt.

Run the code

See cross.sh. Results will be written to results_gpt2 by default.
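For orientation, a single run combining the flags described above might look like the line below; the entry-point script name here is an assumption, so consult cross.sh for the exact script and commands used in the paper:

python run.py --chatgpt --attack --attack_method genetic --attack_model chatgpt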

Attack with Instructional Prompts

The attack with instructional prompts was tested with ChatGPT (gpt-3.5-turbo) as the generative model and the OpenAI AI Text Classifier as the detector. However, as of July 20, 2023, the OpenAI AI Text Classifier is no longer accessible.
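Conceptually, the search scores candidate instructional prompts by how confidently the detector still flags the resulting generations and keeps the best one. The sketch below only illustrates that idea and is not the algorithm implemented in prompt_attack.py; generate and detector_score are hypothetical placeholders for the generative model and the detector.

# Illustrative sketch of selecting an instructional prompt that lowers detection;
# NOT the algorithm in prompt_attack.py. `generate` and `detector_score` are placeholders.
from typing import Callable, List

def search_prompt(
    candidates: List[str],
    questions: List[str],
    generate: Callable[[str, str], str],     # (instruction, question) -> generated text
    detector_score: Callable[[str], float],  # text -> probability of being AI-written
) -> str:
    """Return the candidate instruction whose generations look least AI-written."""
    best_prompt, best_score = None, float("inf")
    for instr in candidates:
        # Average detector confidence over a small set of evaluation questions.
        scores = [detector_score(generate(instr, q)) for q in questions]
        avg = sum(scores) / len(scores)
        if avg < best_score:
            best_prompt, best_score = instr, avg
    return best_prompt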

Search for an instructional prompt

Run:

python prompt_attack.py --output_dir OUTPUT_DIR_XSUM --data xsum
python prompt_attack.py --output_dir OUTPUT_DIR_ELI5 --data eli5

To learn all the available arguments, run python prompt_attack.py --help or check prompt_attack.py.

Inference and evaluation

Run:

python prompt_attack.py --infer --data xsum \
--load OUTPUT_DIR_XSUM --output_dir OUTPUT_DIR_INFER_XSUM

python prompt_attack.py --infer --data eli5 \
--load OUTPUT_DIR_ELI5 --output_dir OUTPUT_DIR_INFER_ELI5

Disclaimer

Our open-source code is intended only for academic research and should not be used for malicious purposes.
