Large language models (LLMs) often exhibit undesirable behaviours, e.g., generating untruthful or biased content. Editing their internal representations has been shown to be effective in mitigating such behaviours. We propose a novel inference-only editing method, namely spectral editing of activations (SEA), to project the input representations into directions with maximal covariance with the positive demonstrations (e.g., truthful) while minimising covariance with the negative demonstrations (e.g., hallucinated). We also extend our method to non-linear editing using feature functions. Extensive experiments on benchmarks concerning truthfulness and bias with six popular open-sourced LLMs of different sizes and model families demonstrate the superiority in inference efficiency and effectiveness of our method compared to several strong baselines.
@misc{qiu2024spectral,
title={Spectral Editing of Activations for Large Language Model Alignment},
author={Yifu Qiu and Zheng Zhao and Yftah Ziser and Anna Korhonen and Edoardo M. Ponti and Shay B. Cohen},
year={2024},
eprint={2405.09719},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
You can find all required packages and their versions in the sea.yml file. Remember to change the prefix /PATH-TO-CONDA/ to your own conda path. You can then set up the environment with:
conda env create --name sea --file=sea.yml
Please also put your own Hugging Face token into src/models/huggingface.py to access the required models, such as LLaMA-2-Chat-7B.
There are three steps to run SEA:
- First, we trace the activations for the positive, negative, and base demonstrations. Once these activations are computed and saved, this phase does not need to be run again.
sh prepare-activations.sh
- Once we have all activations, we use spectral decomposition to compute the positive and negative editing projections, $\overline{\mathbf{U}^+} \cdot \overline{\mathbf{U}^+}^{\top}$ and $\overline{\mathbf{U}^-} \cdot \overline{\mathbf{U}^-}^{\top}$.
- At inference time, we simply apply these editing projections as an additional layer in the forward computation when we evaluate LLMs on benchmarks.
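The second and third steps can be illustrated with a small NumPy sketch. It is not the repository's implementation: the function names `editing_projection` and `apply_edit`, the fixed number of kept directions `k`, the choice to keep the top directions for positive demonstrations and the bottom ones for negative demonstrations, and the rescaling back to the original L2 norm are all assumptions made for illustration.

```python
import numpy as np

def editing_projection(base_acts, demo_acts, k, keep="top"):
    """Build U_bar @ U_bar.T from the SVD of a cross-covariance matrix.

    base_acts, demo_acts: arrays of shape (n_samples, hidden_dim).
    keep="top" keeps the k directions with the largest singular values,
    keep="bottom" keeps the k directions with the smallest ones.
    (Hypothetical helper; names and arguments are illustrative.)
    """
    base_c = base_acts - base_acts.mean(axis=0, keepdims=True)
    demo_c = demo_acts - demo_acts.mean(axis=0, keepdims=True)
    # Empirical cross-covariance between base and demonstration activations.
    cross_cov = base_c.T @ demo_c / base_acts.shape[0]
    # Columns of u are left singular vectors, sorted by decreasing singular value.
    u, _, _ = np.linalg.svd(cross_cov)
    u_bar = u[:, :k] if keep == "top" else u[:, -k:]
    return u_bar @ u_bar.T

def apply_edit(hidden, proj_pos, proj_neg, eps=1e-12):
    """Project activations with both matrices, sum them, and rescale each
    vector to its original L2 norm (an assumed combination rule)."""
    edited = hidden @ proj_pos + hidden @ proj_neg
    old_norm = np.linalg.norm(hidden, axis=-1, keepdims=True)
    new_norm = np.linalg.norm(edited, axis=-1, keepdims=True)
    return edited * old_norm / (new_norm + eps)

# Toy data standing in for traced activations.
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 8))
positive = base + 0.1 * rng.normal(size=(64, 8))
negative = base + 0.1 * rng.normal(size=(64, 8))

proj_pos = editing_projection(base, positive, k=6, keep="top")
proj_neg = editing_projection(base, negative, k=6, keep="bottom")
edited = apply_edit(base, proj_pos, proj_neg)
```

Because each projection is built from orthonormal singular vectors, it is symmetric and idempotent, so applying it twice has the same effect as applying it once.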
To run the second and third steps, you can simply run the following scripts for TruthfulQA and BBQ:
sh run-truthfulqa.sh
sh run-bbq.sh
Alternatively, you can download our precomputed projections and bake them into the LLM you are evaluating, which means you can skip tracing the activations and computing the projections from scratch.
First, please download our calculated projections from Google Cloud and place the corresponding projections into ./bias_projections and ./truthful_projections.
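Before running the scripts, it can help to confirm that the downloaded files are where the code expects them. The sketch below is a hypothetical helper (`missing_projection_files` is not part of the repository); the directory and file names are taken from the quick-run template in this README, and other models or benchmarks will use different directory names.

```python
from pathlib import Path

def missing_projection_files(proj_dir, names=("uu_positive.pt", "uu_negative.pt")):
    """Return the expected projection files that are absent from proj_dir."""
    proj_dir = Path(proj_dir)
    return [name for name in names if not (proj_dir / name).is_file()]

# Directory name as used in the quick-run template; adjust for other setups.
run_dir = "truthful_projections/llama-2-chat-7b-halueval-qa-2000-kRatio-P99.8-N99.8"
missing = missing_projection_files(run_dir)
print("missing files:", missing if missing else "none")
```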
Then you can again run the following scripts for TruthfulQA and BBQ.
sh run-truthfulqa.sh
sh run-bbq.sh
We provide a template for a quick run with the SEA-edited LLaMA-2-Chat-7B model:
from transformers import AutoModelForCausalLM, AutoTokenizer
from src.decoding_algorithm.inference import model_with_adapter

# Load the base chat model and its tokenizer.
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Paths to the positive and negative editing projections.
positive_proj_path = "truthful_projections/llama-2-chat-7b-halueval-qa-2000-kRatio-P99.8-N99.8/uu_positive.pt"
negative_proj_path = "truthful_projections/llama-2-chat-7b-halueval-qa-2000-kRatio-P99.8-N99.8/uu_negative.pt"

# Wrap the model so that the SEA projections are applied to the last L layers.
model_wrapper = model_with_adapter(model)
model = model_wrapper.get_model((positive_proj_path, negative_proj_path), apply_sea_layers="last-L", L=21, combine_sea_embeddings="l2_norm", feature_function=None)
model.to("cuda")

# Generate an answer with the edited model and print the decoded text.
prompt = "Where were fortune cookies invented?"
model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(generated_ids)[0])
Part of the code is developed from the Induce-then-Contrast Decoding and In-context Vectors projects. Thanks for their great work!