This paper proposes a simple yet effective jailbreak attack against black-box LLMs, named FlipAttack. First, drawing on the autoregressive nature of LLMs, we observe that they tend to understand text from left to right and struggle to comprehend it when noise is added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing left-side noise derived solely from the prompt itself, and generalize this idea into 4 flipping modes. Second, we verify that LLMs are strong at performing the text-flipping task, and then develop 4 variants to guide LLMs to denoise, understand, and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves a ~98% attack success rate on GPT-4o and a ~98% bypass rate against 5 guardrail models on average.
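For intuition, the snippet below is a minimal sketch of what the four flipping modes might look like when applied to a benign example prompt. The mode names (FWO, FCW, FCS, FMM) match the `--flip_mode` options used later in this README; the exact transformations, and the treatment of FMM, are illustrative assumptions rather than the reference implementation in `./src`.

```python
# Minimal illustrative sketch (assumed, not the reference implementation):
# construct left-side noise by flipping the prompt itself.

def flip_word_order(prompt: str) -> str:
    # FWO: reverse the order of the words, keep each word intact
    return " ".join(reversed(prompt.split()))

def flip_chars_in_word(prompt: str) -> str:
    # FCW: reverse the characters inside each word, keep the word order
    return " ".join(word[::-1] for word in prompt.split())

def flip_chars_in_sentence(prompt: str) -> str:
    # FCS: reverse every character of the whole sentence
    return prompt[::-1]

if __name__ == "__main__":
    prompt = "how to bake a cake"
    print(flip_word_order(prompt))         # cake a bake to how
    print(flip_chars_in_word(prompt))      # woh ot ekab a ekac
    print(flip_chars_in_sentence(prompt))  # ekac a ekab ot woh
    # FMM (fool model mode) is assumed here to reuse the sentence-level flip
    # while changing only the guidance prompt given to the victim LLM.
```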
Figure 1. The attack success rate (GPT-based evaluation) of our proposed FlipAttack (blue), the runner-up black-box attack ReNeLLM (red), and the best white-box attack AutoDAN (yellow) on 8 LLMs for 7 categories of harmful behaviors.
- (2024/11/04) We release the code for performance evaluation on sub-categories of AdvBench.
- (2024/11/01) We add an overview GIF to help readers better understand FlipAttack.
- (2024/10/18) FlipGuardData is released on Hugging Face. It contains 45k attack samples on 8 LLMs (see the loading sketch after this list).
- (2024/10/15) The development version of the code is released.
- (2024/10/12) FlipAttack has been merged into PyRIT, check it here.
- (2024/10/11) A pull request for FlipAttack has been opened in PyRIT, check it here.
- (2024/10/04) The code of FlipAttack is released.
- (2024/10/02) FlipAttack is on arXiv.
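For readers who want to inspect FlipGuardData, below is a hypothetical loading sketch using the Hugging Face `datasets` library; the repository ID is a placeholder, and the available splits and fields should be taken from the dataset's Hugging Face page.

```python
# Hypothetical sketch: load FlipGuardData (45k attack samples on 8 LLMs).
# "<hf-org>/FlipGuardData" is a placeholder; use the repository ID shown
# on the dataset's Hugging Face page.
from datasets import load_dataset

dataset = load_dataset("<hf-org>/FlipGuardData")
print(dataset)  # inspect the available splits and fields
```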
Figure 2: Overview of the proposed FlipAttack.
To evaluate FlipAttack, run the following commands.
- change to the source code directory

cd ./src
- calculate ASR-GPT of FlipAttack on AdvBench

python eval_gpt.py

ASR-GPT of FlipAttack against 8 LLMs on AdvBench

| Victim LLM | ASR-GPT |
| ---------------------------- | ---------------------------- |
| GPT-3.5 Turbo | 94.81% |
| GPT-4 Turbo | 98.85% |
| GPT-4 | 89.42% |
| GPT-4o | 98.08% |
| GPT-4o mini | 61.35% |
| Claude 3.5 Sonnet | 86.54% |
| LLaMA 3.1 405B | 28.27% |
| Mixtral 8x22B | 97.12% |
| Average | 81.80% |
- calculate ASR-GPT of FlipAttack on AdvBench subset (50 harmful behaviors)

python eval_subset_gpt.py

ASR-GPT of FlipAttack against 8 LLMs on AdvBench subset

| Victim LLM | ASR-GPT |
| ---------------------------- | ---------------------------- |
| GPT-3.5 Turbo | 96.00% |
| GPT-4 Turbo | 100.00% |
| GPT-4 | 88.00% |
| GPT-4o | 100.00% |
| GPT-4o mini | 58.00% |
| Claude 3.5 Sonnet | 88.00% |
| LLaMA 3.1 405B | 26.00% |
| Mixtral 8x22B | 100.00% |
| Average | 82.00% |
- calculate ASR-DICT of FlipAttack on AdvBench

python eval_dict.py

ASR-DICT of FlipAttack against 8 LLMs on AdvBench

| Victim LLM | ASR-DICT |
| ---------------------------- | ---------------------------- |
| GPT-3.5 Turbo | 85.58% |
| GPT-4 Turbo | 83.46% |
| GPT-4 | 62.12% |
| GPT-4o | 83.08% |
| GPT-4o mini | 87.50% |
| Claude 3.5 Sonnet | 90.19% |
| LLaMA 3.1 405B | 85.19% |
| Mixtral 8x22B | 58.27% |
| Average | 79.42% |
- calculate ASR-DICT of FlipAttack on AdvBench subset (50 harmful behaviors)

python eval_subset_dict.py

ASR-DICT of FlipAttack against 8 LLMs on AdvBench subset

| Victim LLM | ASR-DICT |
| ---------------------------- | ---------------------------- |
| GPT-3.5 Turbo | 84.00% |
| GPT-4 Turbo | 86.00% |
| GPT-4 | 72.00% |
| GPT-4o | 78.00% |
| GPT-4o mini | 90.00% |
| Claude 3.5 Sonnet | 94.00% |
| LLaMA 3.1 405B | 86.00% |
| Mixtral 8x22B | 54.00% |
| Average | 80.50% |
Table 1: The attack success rate (%) of 16 methods on 8 LLMs. The bold and underlined values are the best and runner-up results. The evaluation metric is ASR-GPT based on GPT-4.
Figure 3: Token cost & attack performance of 16 attack methods. A larger bubble indicates higher token costs.
To reproduce and further develop FlipAttack, run the following commands.
- install the environment

pip install -r requirements.txt
- change to the source code directory

cd ./src
- set the API keys (obtain them from OpenAI, Anthropic, and DeepInfra)
# for GPTs
export OPENAI_API_KEY="your_api_key"
# for Claude
export ANTHROPIC_API_KEY="your_api_key"
# for LLaMA and Mistral
export DEEPINFRA_API_KEY="your_api_key"
- read the configurations

| Argument | Description |
| ---------------------------- | ---------------------------- |
| --victim_llm | victim LLM |
| --flip_mode | flipping mode (FWO, FCW, FCS, or FMM) |
| --cot | enable chain-of-thought guidance |
| --lang_gpt | enable LangGPT prompting |
| --few_shot | enable task-oriented few-shot demos |
| --data_name | name of the benchmark |
| --begin | start index of the tested data |
| --end | end index of the tested data |
| --eval | conduct evaluation |
| --parallel | run in parallel (use with main_parallel.py) |
- run the commands

# for gpt-4-0613
python main.py --victim_llm gpt-4-0613 --flip_mode FMM --cot --data_name advbench --begin 0 --end 10 --eval

# for gpt-4-turbo-2024-04-09
python main.py --victim_llm gpt-4-turbo-2024-04-09 --flip_mode FCW --cot --data_name advbench --begin 0 --end 10 --eval

# for gpt-4o-2024-08-06
python main.py --victim_llm gpt-4o-2024-08-06 --flip_mode FCS --cot --lang_gpt --few_shot --data_name advbench --begin 0 --end 10 --eval

# for gpt-4o-mini-2024-07-18
python main.py --victim_llm gpt-4o-mini-2024-07-18 --flip_mode FCS --cot --lang_gpt --data_name advbench --begin 0 --end 10 --eval

# for gpt-3.5-turbo-0125
python main.py --victim_llm gpt-3.5-turbo-0125 --flip_mode FWO --data_name advbench --begin 0 --end 10 --eval

# for claude-3-5-sonnet-20240620
python main.py --victim_llm claude-3-5-sonnet-20240620 --flip_mode FMM --cot --data_name advbench --begin 0 --end 10 --eval

# for Meta-Llama-3.1-405B-Instruct
python main.py --victim_llm Meta-Llama-3.1-405B-Instruct --flip_mode FMM --cot --data_name advbench --begin 0 --end 10 --eval

# for Mixtral-8x22B-Instruct-v0.1
python main.py --victim_llm Mixtral-8x22B-Instruct-v0.1 --flip_mode FCS --cot --lang_gpt --few_shot --data_name advbench --begin 0 --end 10 --eval
- run the code in parallel (recommended)

# e.g., for gpt-4-0613
python main_parallel.py --victim_llm gpt-4-0613 --flip_mode FMM --cot --data_name advbench --begin 0 --end 10 --eval --parallel
- explore and further improve FlipAttack!
If you find this repository helpful, please cite our paper.
@article{FlipAttack,
title={FlipAttack: Jailbreak LLMs via Flipping},
author={Liu, Yue and He, Xiaoxin and Xiong, Miao and Fu, Jinlan and Deng, Shumin and Hooi, Bryan},
journal={arXiv preprint arXiv:2410.02832},
year={2024}
}