CI-Steering

Code release for the paper "Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations".

We investigate whether LLMs internally encode privacy norms as defined by the Contextual Integrity (CI) framework. We provide tools for probing and steering privacy-related representations, and we introduce CI-Steering, a compositional method that steers along per-CI-parameter axes (information type, recipient, transmission principle) for more effective and transferable privacy control.
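The compositional idea can be illustrated with a minimal NumPy sketch (the function and variable names below are hypothetical illustrations, not this repo's API): each CI parameter contributes one normalized steering axis, and the applied steering vector is a weighted sum of those axes added to a hidden state.

```python
import numpy as np

def unit(v):
    """Normalize a direction vector to unit length."""
    return v / np.linalg.norm(v)

def ci_steer(hidden, directions, coeffs):
    """Shift a hidden state by a weighted sum of per-CI-parameter axes."""
    steer = sum(c * unit(d) for c, d in zip(coeffs, directions))
    return hidden + steer

# Toy 16-dim "hidden state" and three CI-parameter axes:
# information type, recipient, transmission principle.
rng = np.random.default_rng(0)
d_info, d_recip, d_tp = (rng.normal(size=16) for _ in range(3))
h = rng.normal(size=16)
h_steered = ci_steer(h, [d_info, d_recip, d_tp], [1.0, 0.5, 0.5])
```

Because each axis is steered independently, coefficients can be mixed and matched per CI parameter rather than relying on a single monolithic "privacy" direction.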

Setup

Installation

conda create -n ci_steering python=3.11
conda activate ci_steering
pip install -r requirements.txt

# Or install as editable package
pip install -e .

Environment Variables

# Required for GPT-as-judge evaluation
export OPENAI_API_KEY="your-openai-api-key"

# Optional: HuggingFace token for gated models (e.g., Llama)
export HF_TOKEN="your-hf-token"

External Benchmarks

The following benchmarks must be cloned separately:

# CONFAIDE (Mireshghallah et al., ICLR 2024)
git clone https://github.com/skywalker023/confaide.git data/confaide

# PrivaCI-Bench (Li et al., ACL 2025)
git clone https://github.com/HKUST-KnowComp/PrivaCI-Bench.git data/privaci_bench

Supported Models

All scripts accept a --model flag. The following models are tested:

| Model        | HuggingFace ID                     | Type     |
|--------------|------------------------------------|----------|
| Llama 3.1 8B | meta-llama/Llama-3.1-8B-Instruct   | Instruct |
| Qwen 2.5 7B  | Qwen/Qwen2.5-7B-Instruct           | Instruct |
| Mistral 7B   | mistralai/Mistral-7B-Instruct-v0.3 | Instruct |
| Llama 2 7B   | meta-llama/Llama-2-7b-hf           | Base     |

Pipeline

Phase 1 — Generate Stimuli

python src/generate_stimuli.py \
    --num-pairs-per-type 50 \
    --num-function-pairs 200 \
    --num-ci-per-condition 100
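A common way to build stimuli of this kind is as minimal pairs that vary one CI parameter at a time while holding the others fixed. The sketch below is a hypothetical illustration of that pattern (the template and helper are not taken from this repo):

```python
# Hypothetical minimal-pair stimulus: same sender and information type,
# only the recipient varies, isolating one CI parameter.
TEMPLATE = "{sender} shares {info} with {recipient}."

def make_pair(info, recipient_ok, recipient_bad, sender="A patient"):
    """Return (norm-conforming, norm-violating) sentence pair."""
    return (TEMPLATE.format(sender=sender, info=info, recipient=recipient_ok),
            TEMPLATE.format(sender=sender, info=info, recipient=recipient_bad))

pos, neg = make_pair("their diagnosis", "their doctor", "their employer")
```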

Phase 2 — Extract Activations

python src/extract_activations.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size 4
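Activation extraction of this sort is typically done with forward hooks that cache each layer's output during a pass over the stimuli. A minimal PyTorch sketch on a toy stand-in model (this is an assumption about the mechanism, not this script's actual code):

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer stack; each child plays the role of a layer.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
cache = {}

def save_hook(name):
    def hook(module, inputs, output):
        # Detach so cached activations don't keep the autograd graph alive.
        cache[name] = output.detach()
    return hook

handles = [layer.register_forward_hook(save_hook(f"layer_{i}"))
           for i, layer in enumerate(model)]

with torch.no_grad():
    _ = model(torch.randn(4, 8))  # batch of 4 "stimuli"

for h in handles:
    h.remove()  # always clean up hooks after extraction
```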

Phase 3 — Probe Representations

python src/read_representations.py \
    --activations-dir outputs/activations/Llama-3.1-8B-Instruct
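Probing asks whether a simple linear readout can separate activations from contrasting conditions. The NumPy sketch below trains a ridge-regularized least-squares probe on synthetic "activations" (all data and dimensions here are illustrative, not the repo's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations: two conditions separated along one hidden direction.
direction = rng.normal(size=32)
direction /= np.linalg.norm(direction)
X_pos = rng.normal(size=(100, 32)) + 3.0 * direction
X_neg = rng.normal(size=(100, 32)) - 3.0 * direction
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(100), -np.ones(100)])

# Ridge-regularized least-squares probe: w = (X^T X + lam I)^{-1} X^T y.
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(32), X.T @ y)
accuracy = np.mean(np.sign(X @ w) == y)
```

High probe accuracy is evidence that the property is linearly decodable from the representations; near-chance accuracy suggests it is not.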

Phase 4 — CI Decomposition

python src/ci_decomposition.py \
    --activations-dir outputs/activations/Llama-3.1-8B-Instruct \
    --output-dir outputs/ci_decomposition/Llama-3.1-8B-Instruct

Phase 5 — Evaluation (CONFAIDE & PrivaCI-Bench)

# Monolithic steering on CONFAIDE
python src/confaide_evaluation.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --reader-dir outputs/reading/Llama-3.1-8B-Instruct/pca_reader

# CI-parametric steering on CONFAIDE
python src/confaide_ci_steering.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --ci-dir outputs/ci_decomposition/Llama-3.1-8B-Instruct

# PrivaCI-Bench
python src/privaci_evaluation.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --data-dir data/privaci_bench

# CI-parametric on PrivaCI-Bench
python src/privaci_ci_steering.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --ci-dir outputs/ci_decomposition/Llama-3.1-8B-Instruct
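At evaluation time, steering is typically applied by a forward hook that shifts a chosen layer's output along the learned axis while the model generates. A minimal PyTorch sketch on a toy layer (illustrative mechanism only, with hypothetical names):

```python
import torch
import torch.nn as nn

def add_direction_hook(direction, alpha):
    """Forward hook that shifts a layer's output along a steering axis."""
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        return output + alpha * direction
    return hook

layer = nn.Linear(8, 8)
direction = torch.zeros(8)
direction[0] = 1.0

handle = layer.register_forward_hook(add_direction_hook(direction, 2.0))
x = torch.randn(3, 8)
with torch.no_grad():
    steered = layer(x)
handle.remove()
with torch.no_grad():
    plain = layer(x)
```

The steering strength `alpha` and the layer to hook are the main knobs; removing the hook restores unsteered behavior, which is what makes activation steering inference-time and reversible.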

Phase 6 — Tuning Baselines

# LoRRA
python src/lorra_finetune.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --reader-dir outputs/reading/Llama-3.1-8B-Instruct/pca_reader

# Representation Tuning
python src/rep_tuning.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --reader-dir outputs/reading/Llama-3.1-8B-Instruct/pca_reader
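Unlike hook-based steering, these baselines update model weights so that the desired representational shift is baked in. The toy PyTorch sketch below conveys the representation-tuning idea, optimizing a layer so its outputs gain a component along a target direction (illustrative only, not the paper's training recipe):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 8)
target = torch.zeros(8)
target[0] = 1.0  # direction we want the layer's outputs to move along

opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x = torch.randn(64, 8)

before = (layer(x) @ target).mean().item()
for _ in range(200):
    opt.zero_grad()
    # Maximize the mean projection of outputs onto the target direction.
    loss = -(layer(x) @ target).mean()
    loss.backward()
    opt.step()
after = (layer(x) @ target).mean().item()
```

A real recipe would add a task-preservation term (and, for LoRRA, restrict updates to low-rank adapters), but the projection objective above is the core of the representational loss.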

Utility Evaluation

python src/utility_evaluation.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --reader-dir outputs/reading/Llama-3.1-8B-Instruct/probe_reader \
    --ci-dir outputs/ci_decomposition/Llama-3.1-8B-Instruct

License

This project is released under the MIT License.
