This repository contains the artifact for the paper "Wave: Leveraging Architecture Observation for Privacy-Preserving Model Oversight", accepted by ASPLOS 2026.
- Nvidia GPU (Tested on 4090, 5080, and H100)
- CUDA 12.8
- Nsight Compute 2025.1.1.0 or 2025.2.1.0
- PyTorch 2.7.0
- uv for Python package management
- hyperfine for evaluating overhead

    bash scripts/setup_environment.sh

We collect FLOP-relevant metrics on an Nvidia 5080 GPU to illustrate how FLOPs for matrix-multiplication kernels vary with the hidden dimension and number of layers of a LLaMA model.

    bash scripts/collect_motivating_example_data.sh 5080

Outputs are written to data/motivating_example/5080/ (raw reports and CSVs). Our collected CSVs are included under data/motivating_example/5080/csv/.

    uv run scripts/run_preprocessing_pipeline.py data/motivating_example/5080/csv/

Outputs: processed PMC CSV files under data/motivating_example/5080/processed/.

    uv run scripts/analyze_motivating_example.py data/motivating_example/5080

Plots are saved to figs/motivating_example/; the repository already contains the generated figures for reference.
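
As a rough intuition for the motivating example, the FLOPs of the linear layers can be estimated analytically. The sketch below is not part of the artifact: it assumes a standard LLaMA decoder block with four hidden-by-hidden attention projections (no grouped-query attention) and three hidden-by-FFN feed-forward projections, and it ignores the attention-score matmuls, embeddings, and LM head. The function names are hypothetical.

```python
def linear_flops(tokens: int, in_features: int, out_features: int) -> int:
    """FLOPs of one dense matmul: one multiply and one add per weight per token."""
    return 2 * tokens * in_features * out_features


def llama_linear_flops(tokens: int, hidden: int, ffn: int, layers: int) -> int:
    """Estimated linear-layer FLOPs for one forward pass (assumed layer layout)."""
    attention = 4 * linear_flops(tokens, hidden, hidden)  # q, k, v, o projections
    feed_forward = 3 * linear_flops(tokens, hidden, ffn)  # gate, up, down projections
    return layers * (attention + feed_forward)


if __name__ == "__main__":
    # Hypothetical sweep: FLOPs grow linearly with layers and
    # quadratically with hidden size when ffn scales with hidden.
    for hidden in (1024, 2048):
        for layers in (1, 2):
            total = llama_linear_flops(1, hidden, 4 * hidden, layers)
            print(f"hidden={hidden} layers={layers} flops={total}")
```

Under these assumptions, doubling the number of layers doubles the estimate, while doubling the hidden dimension (with the FFN dimension scaled along with it) quadruples it.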

    uv run scripts/plot_load_bytes.py --csv_dir data/lower_bound/<gpu_name>/processed

Plots are saved to figs/load_bytes_observation/; the repository already contains the generated figures for reference.
We evaluate Wave's ability to enforce a minimum (lower-bound) model size, corresponding to the cloud inference scenario.

    bash scripts/collect_lower_bound_data.sh <gpu_name>

Outputs: raw NCU reports under data/lower_bound/<gpu_name>/raw/ and CSV exports under data/lower_bound/<gpu_name>/csv/.
The collected data for the 4090, 5080, and H100 GPUs follows the same structure (see the configs in scripts/collect_lower_bound_data.sh). Each run generates 2 tokens: 1 for prefill and 1 for decoding, and the decoding token is the one used in verification.

    uv run scripts/run_preprocessing_pipeline.py data/lower_bound/<gpu_name>/csv/

Outputs: processed PMC CSV files under data/lower_bound/<gpu_name>/processed/, with intermediate files under data/lower_bound/<gpu_name>/preprocessed/.

    uv run scripts/verify_lower_bound.py --model-size tight --data-folder data/lower_bound/<gpu_name>/processed/

or

    uv run scripts/verify_lower_bound.py --model-size loose --data-folder data/lower_bound/<gpu_name>/processed/

Mode semantics:
- tight: claimed minimum size = 0.75× actual. Expect no solution; a solution is a false positive.
- loose: claimed minimum size = 1.25× actual. Expect a solution; missing one is a false negative.
We evaluate Wave's ability to detect model size violations when attackers split linear layers to evade the upper-bound check. The split attack implementation (src/gpu_pmc_verifier/attacks/split_attacker.py) randomly splits attention and feed-forward layers into multiple smaller matrix operations.
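
To make the idea of a split attack concrete, the sketch below shows one way a linear layer could be broken into smaller matrix operations without changing its output. It is an illustration only and does not reproduce split_attacker.py: splitting along the output dimension, the helper names, and the toy sizes are all assumptions, and plain Python lists stand in for tensors.

```python
import random


def matmul(x, w):
    """Multiply x (rows x inner) by w (inner x cols) using plain lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def random_split_sizes(total, parts, rng):
    """Pick `parts` positive chunk sizes that sum to `total`."""
    cuts = sorted(rng.sample(range(1, total), parts - 1))
    bounds = [0] + cuts + [total]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def split_linear(x, w, sizes):
    """Compute x @ w as several smaller matmuls over column blocks of w."""
    out = [[] for _ in x]
    start = 0
    for size in sizes:
        block = [row[start:start + size] for row in w]  # columns start..start+size
        partial = matmul(x, block)
        for out_row, part_row in zip(out, partial):
            out_row.extend(part_row)  # concatenate partial outputs per row
        start += size
    return out


# Toy integer data so the comparison below is exact.
rng = random.Random(0)
x = [[rng.randint(-3, 3) for _ in range(8)] for _ in range(4)]
w = [[rng.randint(-3, 3) for _ in range(12)] for _ in range(8)]
sizes = random_split_sizes(12, 3, rng)

# The split layer produces the same output while issuing three smaller matmuls.
assert split_linear(x, w, sizes) == matmul(x, w)
```

Each individual kernel in the split version looks like it belongs to a smaller layer, which is why a per-kernel size check alone would be easy to evade.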

    bash scripts/collect_upper_bound_data.sh

Outputs: split configurations, raw NCU reports, and CSVs under data/upper_bound/. Provided data includes one no-split case, three all-split cases, and ten random cases (1–5 split linear layers) for batch size 4, sequence length 1, hidden dim 1024, and FFN dim 4096.

    uv run scripts/run_upper_bound_pipeline.py

Outputs: processed PMC CSV files under data/upper_bound/processed/, with intermediate files under data/upper_bound/preprocessed/.

    uv run scripts/verify_upper_bound.py --model-size tight

or

    uv run scripts/verify_upper_bound.py --model-size loose

Mode semantics:
- tight: claimed maximum size = 0.75× actual. Expect a solution; missing one is a false negative.
- loose: claimed maximum size = 1.25× actual. Expect no solution; finding one is a false positive.
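
The mode semantics of the lower-bound and upper-bound checks can be summarized in a small lookup table. The sketch below only restates the expectations listed in this README; the names `EXPECT_SOLUTION`, `claimed_size`, and `classify` are hypothetical and do not correspond to functions in the verification scripts.

```python
# Claimed size as a multiple of the actual model size, per mode.
CLAIM_FACTOR = {"tight": 0.75, "loose": 1.25}

# Whether the verifier is expected to find a solution, per (check, mode).
EXPECT_SOLUTION = {
    ("lower", "tight"): False,
    ("lower", "loose"): True,
    ("upper", "tight"): True,
    ("upper", "loose"): False,
}


def claimed_size(actual_size: float, mode: str) -> float:
    """Return the claimed bound used for a given mode."""
    return CLAIM_FACTOR[mode] * actual_size


def classify(check: str, mode: str, solution_found: bool) -> str:
    """Label a verification outcome against the expected result."""
    expected = EXPECT_SOLUTION[(check, mode)]
    if solution_found == expected:
        return "correct"
    # An unexpected solution is a false positive; a missing one is a false negative.
    return "false positive" if solution_found else "false negative"


if __name__ == "__main__":
    for (check, mode), expected in EXPECT_SOLUTION.items():
        print(check, mode, "expects solution:", expected)
```

For example, `classify("upper", "loose", True)` labels an unexpected solution in the loose upper-bound run as a false positive.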
We measure inference runtime overhead from collecting metrics. The hw mode collects a single GPU timing metric, while all collects the full set of metrics used by Wave.

    bash scripts/evaluate_overhead.sh <gpu_name> <hw/all>

Results are written to data/overhead/<gpu_name>/<mode>/. We provide collected results for 4090, 5080, and H100 in the same directory structure.

    uv run scripts/analyze_overhead.py data/overhead/<gpu_name>/<mode>/overhead_summary.txt

The script prints statistics of timings and overhead percentages for baseline vs. profiled runs.
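
For readers who want to sanity-check the reported numbers by hand, relative overhead is commonly defined as the increase in mean runtime over the baseline. The sketch below uses that definition; whether scripts/analyze_overhead.py computes it in exactly this way is an assumption, and the function name and timing values are hypothetical.

```python
from statistics import mean


def overhead_percent(baseline_times, profiled_times):
    """Relative overhead of profiled runs over baseline runs, in percent."""
    base = mean(baseline_times)
    prof = mean(profiled_times)
    return (prof - base) / base * 100.0


if __name__ == "__main__":
    # Hypothetical wall-clock timings in seconds, not measured results.
    baseline = [1.0, 1.0, 1.0]
    profiled = [1.5, 1.5, 1.5]
    print(f"overhead: {overhead_percent(baseline, profiled):.1f}%")
```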