QROA, or Query-Response Optimization Attack, is an innovative and robust strategy designed to explore and exploit vulnerabilities in Large Language Models (LLMs) through black-box interactions. This method leverages optimized triggers embedded within benign-looking instructions to manipulate LLMs into generating harmful content. Developed without the need for direct model access or internal data insights, QROA operates solely via the standard input-output interface provided by LLMs. The attack's underlying techniques are inspired by advances in deep Q-learning, allowing dynamic token adjustments to maximize a reward function that aligns with the attacker's goals.
This Script is the official implementation of the article "QROA: A Black-Box Query-Response Optimization Attack on LLMs"
Paper: https://arxiv.org/abs/2406.02044
@article{jawad2024qroa,
title={QROA: A Black-Box Query-Response Optimization Attack on LLMs},
author={Jawad, Hussein and BRUNEL, Nicolas J-B},
journal={arXiv preprint arXiv:2406.02044},
year={2024}
}
Large Language Models (LLMs) have recently gained popularity, but they also raise concerns due to their potential to create harmful content if misused. This study in- troduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction. QROA adds an optimized trigger to a malicious instruction to compel the LLM to generate harmful content. Unlike previous approaches, QROA does not require access to the modelโs logit information or any other internal data and operates solely through the standard query-response interface of LLMs. Inspired by deep Q-learning and Greedy coordinate descent, the method iteratively updates tokens to maximize a designed reward function. We tested our method on various LLMs such as Vicuna, Falcon, and Mistral, achieving an Attack Success Rate (ASR) over 80%. We also tested the model against Llama2-chat, the fine-tuned version of Llama2 designed to resist Jailbreak attacks, achieving good ASR with a suboptimal initial trigger seed. This study demonstrates the feasibility of generating jailbreak attacks against deployed LLMs in the public domain using black-box optimization methods, enabling more comprehensive safety testing of LLMs
-
Clone the repository:
git clone https://github.com/qroa/qroa.git cd qroa
-
Install the required packages:
pip install -r requirements.txt
To run the script, you need to provide the path to the input file containing the instructions. The input file should be in CSV format. Run the script from the command line by specifying the path to the instruction file and the authentication token:
python main.py data/instructions.csv [API_AUTH_TOKEN]
Replace instructions.csv
with the path to your text file containing the instructions, and API_AUTH_TOKEN
with the actual authentication token.
You can test the following models with QROA:
- Llama2-chat (
llama2_chat_hf
) - Llama2 (
llama2_hf
) - Vicuna (
vicuna_hf
) - Mistral (
mistral_hf
) - Falcon (
falcon_hf
) - OpenAI GPT (
openai-0613
) - Mistral Next (
mistral
)
Simply change the model
parameter in the main
function to the desired model.
- Notebook Demo: Run
demo.ipynb
to see a demonstration of the process. - Notebook Analysis Experiement: Run
analysis.ipynb
to analyse results and calculate metrics value (ASR). - Testing Model: Generation: Execute
generate.py
to test the generation process on custom instructions and triggers.
This script can be run from the command line as follows:
python generate.py -auth_token [API_AUTH_TOKEN] -instruction [THE INSTRUCTION HERE] -suffix [THE SUFFIX HERE]
Where:
- auth_token: Authentication token required for accessing the model.
- instruction: The specific instruction you want the model to follow.
- suffix: The adversarial trigger that, when appended to the instruction, causes the LLM to obey the instruction.
The following output files are generated during the execution of the script:
Generated and validated triggers are saved in JSON format:
- Generated Triggers:
./results/[MODEL_NAME]/triggers.json
: Contains the triggers generated by the model. - Validated Triggers:
./results/[MODEL_NAME]/triggers_validate.json
: Contains the triggers validated after applying the z test.
Logs for generation and validation processes are also available:
- Trigger Generation Logs:
./logs/[MODEL_NAME]/logging_generator.csv
: Logs the process of trigger generation. - Trigger Validation Logs:
./logs/[MODEL_NAME]/logging_validator.csv
: Logs the process of validating the triggers with the z test.
The following table outlines the configuration settings for the JailBreak process. Each parameter plays a role in the setup and execution of the process:
Parameter | Description |
---|---|
model |
Specifies the Large Language Model (LLM) to be used for the attack, such as 'vicuna_hf', 'falcon_hf', etc. |
apply_defense_methods |
A boolean parameter that determines whether defense methods are activated to protect the model during the JailBreak process. |
auth_token |
Authentication token required for accessing the model. This token could be from Hugging Face for accessing their models or other providers like OpenAI for closed source models. |
system_prompt |
The initial message or command that initiates interaction with the LLM. |
embedding_model_path |
Path to the surrogate model's embedding layer. |
len_coordinates |
Specifies the number of tokens in the generated trigger, defining the length of the attack vector. |
learning_rate |
Learning rate for the optimizer. |
weight_decay |
Weight decay (L2 penalty) for the optimizer. |
nb_epochs |
The total number of training cycles through the dataset where the model learns by adjusting internal parameters. |
batch_size |
Number of training examples used to calculate gradient and update internal model parameters per iteration. |
scoring_type |
Method used to evaluate the effectiveness of triggers. For example, 'hm' could refer to a scoring model that uses a fine-tuned RoBERTa model for detecting harmful content. |
max_generations_tokens |
Maximum number of tokens that the LLM is allowed to generate in response to a query during the attack. |
topk |
The top K value triggers identified in each epoch, equivalent to the number of queries sent to the target LLM. |
max_d |
The maximum size of the memory buffer. |
ucb_c |
The exploration-exploitation parameter for the Upper Confidence Bound (UCB) algorithm. A higher value encourages exploration of less certain actions. |
triggers_init |
Initial triggers used as a starting point for the algorithm; these triggers are used to pre-fill the memory buffer to avoid starting from scratch. |
threshold |
The statistical significance threshold used when validating triggers. |
nb_samples_per_trigger |
Number of samples per trigger for statistically validating the efficiency of the trigger. |
logging_path |
Path to the logging directory. |
results_path |
Path to the results directory. |
temperature |
Sampling temperature used by the LLM. |
top_p |
Top P value for nucleus sampling used by the LLM. |
p_value |
P-value for statistical testing. |