BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang¹, Fengqing Jiang², Zidi Xiong¹,
Bhaskar Ramasubramanian³, Radha Poovendran², Bo Li¹
¹University of Illinois Urbana-Champaign  ²University of Washington  ³Western Washington University

ICLR 2024

[arXiv]       [OpenReview]

Overview

We propose BadChain, the first backdoor attack against LLMs employing chain-of-thought (CoT) prompting, which requires neither access to the training dataset nor to the model parameters, and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps in the model output, thereby altering the final response whenever a backdoor trigger appears in the query prompt. In particular, a subset of the CoT demonstrations is manipulated to incorporate a backdoor reasoning step. Consequently, given any query prompt containing the backdoor trigger, the LLM is misled into outputting unintended content. Empirically, we show the effectiveness of BadChain for two CoT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks spanning arithmetic, commonsense, and symbolic reasoning.

Experiment

Setup

  • Make sure to set your API key before running the experiments. See the placeholder in utils.py.
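One common way to supply the key without hard-coding it is to read it from an environment variable. This is only a sketch: the variable name `OPENAI_API_KEY` and the helper below are assumptions, so check the actual placeholder in utils.py.

```python
# Sketch (not the repo's actual code): read the API key from the environment
# instead of hard-coding it into utils.py. The env-var name is an assumption.
import os

def get_api_key() -> str:
    """Return the API key from OPENAI_API_KEY, failing loudly if unset."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError(
            "Set OPENAI_API_KEY or fill in the placeholder in utils.py"
        )
    return key
```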

Running

An example command is as follows:

python run.py --llm gpt-3.5 --task gsm8k

More details can be found in run.py.

Citation

If you find our work useful in your research, please consider citing:

@misc{xiang2024badchain,
    title={BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models}, 
    author={Zhen Xiang and Fengqing Jiang and Zidi Xiong and Bhaskar Ramasubramanian and Radha Poovendran and Bo Li},
    year={2024},
    eprint={2401.12242},
    archivePrefix={arXiv},
    primaryClass={cs.CR}
}


