CI-Steering

Code release for the paper "Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations".

We investigate whether LLMs internally encode privacy norms as defined by the Contextual Integrity (CI) framework. We provide tools for probing and steering privacy-related representations, and we introduce CI-Steering, a compositional method that steers along per-CI-parameter axes (information type, recipient, transmission principle) for more effective and transferable privacy control.
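The compositional idea can be illustrated with a minimal NumPy sketch (the function and variable names below are hypothetical illustrations, not this repo's API): each CI parameter contributes one normalized steering axis, and the applied steering vector is a weighted sum of those axes added to a hidden state.

```python
import numpy as np

def unit(v):
    """Normalize a direction vector to unit length."""
    return v / np.linalg.norm(v)

def ci_steer(hidden, directions, coeffs):
    """Shift a hidden state by a weighted sum of per-CI-parameter axes."""
    steer = sum(c * unit(d) for c, d in zip(coeffs, directions))
    return hidden + steer

# Toy 16-dim "hidden state" and three CI-parameter axes:
# information type, recipient, transmission principle.
rng = np.random.default_rng(0)
d_info, d_recip, d_tp = (rng.normal(size=16) for _ in range(3))
h = rng.normal(size=16)
h_steered = ci_steer(h, [d_info, d_recip, d_tp], [1.0, 0.5, 0.5])
```

Because each axis is steered independently, coefficients can be mixed and matched per CI parameter rather than relying on a single monolithic "privacy" direction.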

Setup

Installation

conda create -n ci_steering python=3.11
conda activate ci_steering
pip install -r requirements.txt

# Or install as editable package
pip install -e .

Environment Variables

# Required for GPT-as-judge evaluation
export OPENAI_API_KEY="your-openai-api-key"

# Optional: HuggingFace token for gated models (e.g., Llama)
export HF_TOKEN="your-hf-token"

External Benchmarks

The following benchmarks must be cloned separately:

# CONFAIDE (Mireshghallah et al., ICLR 2024)
git clone https://github.com/skywalker023/confaide.git data/confaide

# PrivaCI-Bench (Li et al., ACL 2025)
git clone https://github.com/HKUST-KnowComp/PrivaCI-Bench.git data/privaci_bench

Supported Models

All scripts accept a --model flag. The following models are tested:

| Model        | HuggingFace ID                     | Type     |
|--------------|------------------------------------|----------|
| Llama 3.1 8B | meta-llama/Llama-3.1-8B-Instruct   | Instruct |
| Qwen 2.5 7B  | Qwen/Qwen2.5-7B-Instruct           | Instruct |
| Mistral 7B   | mistralai/Mistral-7B-Instruct-v0.3 | Instruct |
| Llama 2 7B   | meta-llama/Llama-2-7b-hf           | Base     |

Pipeline

Phase 1 — Generate Stimuli

python src/generate_stimuli.py \
    --num-pairs-per-type 50 \
    --num-function-pairs 200 \
    --num-ci-per-condition 100
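A common way to build stimuli of this kind is as minimal pairs that vary one CI parameter at a time while holding the others fixed. The sketch below is a hypothetical illustration of that pattern (the template and helper are not taken from this repo):

```python
# Hypothetical minimal-pair stimulus: same sender and information type,
# only the recipient varies, isolating one CI parameter.
TEMPLATE = "{sender} shares {info} with {recipient}."

def make_pair(info, recipient_ok, recipient_bad, sender="A patient"):
    """Return (norm-conforming, norm-violating) sentence pair."""
    return (TEMPLATE.format(sender=sender, info=info, recipient=recipient_ok),
            TEMPLATE.format(sender=sender, info=info, recipient=recipient_bad))

pos, neg = make_pair("their diagnosis", "their doctor", "their employer")
```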

Phase 2 — Extract Activations

python src/extract_activations.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size 4
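Activation extraction of this sort is typically done with forward hooks that cache each layer's output during a pass over the stimuli. A minimal PyTorch sketch on a toy stand-in model (this is an assumption about the mechanism, not this script's actual code):

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer stack; each child plays the role of a layer.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
cache = {}

def save_hook(name):
    def hook(module, inputs, output):
        # Detach so cached activations don't keep the autograd graph alive.
        cache[name] = output.detach()
    return hook

handles = [layer.register_forward_hook(save_hook(f"layer_{i}"))
           for i, layer in enumerate(model)]

with torch.no_grad():
    _ = model(torch.randn(4, 8))  # batch of 4 "stimuli"

for h in handles:
    h.remove()  # always clean up hooks after extraction
```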

Phase 3 — Probe Representations

python src/read_representations.py \
    --activations-dir outputs/activations/Llama-3.1-8B-Instruct
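Probing asks whether a simple linear readout can separate activations from contrasting conditions. The NumPy sketch below trains a ridge-regularized least-squares probe on synthetic "activations" (all data and dimensions here are illustrative, not the repo's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations: two conditions separated along one hidden direction.
direction = rng.normal(size=32)
direction /= np.linalg.norm(direction)
X_pos = rng.normal(size=(100, 32)) + 3.0 * direction
X_neg = rng.normal(size=(100, 32)) - 3.0 * direction
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(100), -np.ones(100)])

# Ridge-regularized least-squares probe: w = (X^T X + lam I)^{-1} X^T y.
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(32), X.T @ y)
accuracy = np.mean(np.sign(X @ w) == y)
```

High probe accuracy is evidence that the property is linearly decodable from the representations; near-chance accuracy suggests it is not.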

Phase 4 — CI Decomposition

python src/ci_decomposition.py \
    --activations-dir outputs/activations/Llama-3.1-8B-Instruct \
    --output-dir outputs/ci_decomposition/Llama-3.1-8B-Instruct

Phase 5 — Evaluation (CONFAIDE & PrivaCI-Bench)

# Monolithic steering on CONFAIDE
python src/confaide_evaluation.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --reader-dir outputs/reading/Llama-3.1-8B-Instruct/pca_reader

# CI-parametric steering on CONFAIDE
python src/confaide_ci_steering.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --ci-dir outputs/ci_decomposition/Llama-3.1-8B-Instruct

# PrivaCI-Bench
python src/privaci_evaluation.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --data-dir data/privaci_bench

# CI-parametric on PrivaCI-Bench
python src/privaci_ci_steering.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --ci-dir outputs/ci_decomposition/Llama-3.1-8B-Instruct
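At evaluation time, steering is typically applied by a forward hook that shifts a chosen layer's output along the learned axis while the model generates. A minimal PyTorch sketch on a toy layer (illustrative mechanism only, with hypothetical names):

```python
import torch
import torch.nn as nn

def add_direction_hook(direction, alpha):
    """Forward hook that shifts a layer's output along a steering axis."""
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        return output + alpha * direction
    return hook

layer = nn.Linear(8, 8)
direction = torch.zeros(8)
direction[0] = 1.0

handle = layer.register_forward_hook(add_direction_hook(direction, 2.0))
x = torch.randn(3, 8)
with torch.no_grad():
    steered = layer(x)
handle.remove()
with torch.no_grad():
    plain = layer(x)
```

The steering strength `alpha` and the layer to hook are the main knobs; removing the hook restores unsteered behavior, which is what makes activation steering inference-time and reversible.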

Phase 6 — Tuning Baselines

# LoRRA
python src/lorra_finetune.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --reader-dir outputs/reading/Llama-3.1-8B-Instruct/pca_reader

# Representation Tuning
python src/rep_tuning.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --reader-dir outputs/reading/Llama-3.1-8B-Instruct/pca_reader
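Unlike hook-based steering, these baselines update model weights so that the desired representational shift is baked in. The toy PyTorch sketch below conveys the representation-tuning idea, optimizing a layer so its outputs gain a component along a target direction (illustrative only, not the paper's training recipe):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 8)
target = torch.zeros(8)
target[0] = 1.0  # direction we want the layer's outputs to move along

opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x = torch.randn(64, 8)

before = (layer(x) @ target).mean().item()
for _ in range(200):
    opt.zero_grad()
    # Maximize the mean projection of outputs onto the target direction.
    loss = -(layer(x) @ target).mean()
    loss.backward()
    opt.step()
after = (layer(x) @ target).mean().item()
```

A real recipe would add a task-preservation term (and, for LoRRA, restrict updates to low-rank adapters), but the projection objective above is the core of the representational loss.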

Utility Evaluation

python src/utility_evaluation.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --reader-dir outputs/reading/Llama-3.1-8B-Instruct/probe_reader \
    --ci-dir outputs/ci_decomposition/Llama-3.1-8B-Instruct

License

This project is released under the MIT License.
