~2× Compression (uint8) · 10.1× Theoretical (3-bit) · 3.59 ms Forward Pass · 320+ Concurrent Adapters
S2LC enables serving hundreds of fine-tuned LLM adapters on a single GPU through geometrically correct shared-subspace compression and register-file-resident inference.
| Metric | S2LC | Baseline | Improvement |
|---|---|---|---|
| Forward-pass latency (CUDA Graph) | 3.59 ms | ~11 ms (sequential kernel dispatch) | 3.1× |
| Forward-pass latency vs CPU-hosted | 3.59 ms | ~78 ms | 21.8× |
| Compression ratio — current (uint8) | ~2× | 1.0× | — |
| Compression ratio — theoretical (3-bit) | 10.1× | 1.0× | — |
| Intermediate HBM writes | 0 bytes | — | zero-copy |
| Max concurrent adapters, A100-80GB (theoretical) | 320+ | ~110 | 2.9× |
Validated output: see latency_test_v3/validation_output.txt.
S2LC compresses K LoRA adapters into two components:
- One shared spectral basis: `V_common` (shape D×R, FP16), computed once per layer-projection via truncated SVD across the adapter population and stored in GPU shared memory.
- Per-adapter codebook indices: each adapter's unique contribution, encoded at ~3 bits/element using two-stage hierarchical k-means (16 centroids + 8 residual centroids; currently stored as uint8 + uint8 per element). A sketch of this encoder follows below.
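For concreteness, here is a minimal sketch of that two-stage residual encoder using plain Lloyd-iteration k-means. The `kmeans_1d` and `encode_two_stage` helpers are illustrative, not the shipped encoder; see Section 3 of the white paper for the real pipeline.

```python
import torch

def kmeans_1d(x: torch.Tensor, k: int, iters: int = 25) -> torch.Tensor:
    """Plain Lloyd's k-means over a flat FP32 tensor; returns k centroids."""
    # Initialise from evenly spaced quantiles for a stable, sorted start.
    centroids = torch.quantile(x, torch.linspace(0, 1, k, device=x.device))
    for _ in range(iters):
        assign = (x.unsqueeze(1) - centroids.unsqueeze(0)).abs().argmin(dim=1)
        for c in range(k):
            members = x[assign == c]
            if members.numel() > 0:
                centroids[c] = members.mean()
    return centroids

def encode_two_stage(u: torch.Tensor, k1: int = 16, k2: int = 8) -> dict:
    """Stage 1 quantizes U_k; stage 2 quantizes the stage-1 residual."""
    flat = u.flatten().float()
    cb1 = kmeans_1d(flat, k1)
    idx1 = (flat.unsqueeze(1) - cb1.unsqueeze(0)).abs().argmin(dim=1)
    residual = flat - cb1[idx1]
    cb2 = kmeans_1d(residual, k2)
    idx2 = (residual.unsqueeze(1) - cb2.unsqueeze(0)).abs().argmin(dim=1)
    return {"indices_s1": idx1.to(torch.uint8).view_as(u),
            "codebook_s1": cb1.half(),
            "indices_s2": idx2.to(torch.uint8).view_as(u),
            "codebook_s2": cb2.half()}

# Roundtrip check in the spirit of Test 1 (cosine similarity > 0.99):
u = torch.randn(4096, 16)
enc = encode_two_stage(u)
u_hat = (enc["codebook_s1"].float()[enc["indices_s1"].long()]
         + enc["codebook_s2"].float()[enc["indices_s2"].long()])
cos = torch.nn.functional.cosine_similarity(u.flatten(), u_hat.flatten(), dim=0)
print(f"roundtrip cosine similarity: {cos:.4f}")
```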
At inference the fused Triton kernel reconstructs adapter weights entirely in the GPU register file during the tiled GEMM — no intermediate HBM writes. CUDA Graph capture collapses 128 sequential kernel dispatches into a single replay call, cutting CPU-side overhead from 11+ ms to 3.59 ms.
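For intuition, the computation the kernel fuses can be written in eager PyTorch as below. This reference version materialises the reconstructed U_k in HBM, which is exactly the intermediate traffic the register-file decode avoids; it is a sketch for checking semantics, not the shipped Triton kernel.

```python
import torch

def run_fused_reference(X, v_common, adapter):
    """Eager-mode equivalent of the fused kernel: out = X @ V_common @ U_k^T.

    Unlike the Triton kernel, this materialises U_k (D, R) in HBM before the
    GEMMs; the fused kernel decodes U_k tile-by-tile in the register file,
    so the only HBM write is the output tensor itself.
    """
    u_k = (adapter["codebook_s1"][adapter["indices_s1"].long()]
           + adapter["codebook_s2"][adapter["indices_s2"].long()])  # (D, R), FP16
    return (X @ v_common) @ u_k.T                                   # (batch, D)
```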
Memory breakdown (K=100, D=4096, R=16, 128 layer-projections):
```
V_common overhead:    (4096 × 16 × 2 bytes) × 128 layers      = ~16 MB  (shared once)
Per-adapter indices:  (4096 × 16 × 0.375 bytes) × 128 × 100   = ~300 MB (all adapters)
Total S2LC:           ~316 MB (theoretical 3-bit; current uint8 = ~1,680 MB, ~2×)
Standard LoRA:        ~3,120 MB
Compression:          10.1× theoretical (3-bit) / ~2× current (uint8)
```
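These ratios also fall out of simple per-element arithmetic: standard LoRA stores two FP16 factors (32 bits per element), while S2LC stores only the index bits plus the FP16 `V_common` amortised over K adapters. A quick sanity check:

```python
K = 100                     # adapters sharing one V_common per layer-projection
lora_bits = 2 * 16          # standard LoRA: A and B factors, FP16, per element
vcommon_bits = 16 / K       # shared FP16 basis, amortised across K adapters

for label, index_bits in [("theoretical (3-bit)", 3), ("current (uint8 + uint8)", 16)]:
    ratio = lora_bits / (index_bits + vcommon_bits)
    print(f"{label}: {ratio:.1f}x")
# theoretical (3-bit): 10.1x
# current (uint8 + uint8): 2.0x
```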
| Component | Minimum | Tested |
|---|---|---|
| GPU | NVIDIA A100 or H100 | A100 SXM4 80GB |
| CUDA | 12.x | 12.2 |
| PyTorch | 2.1+ | 2.1.2 |
| Triton | 2.1+ | 2.1 |
| Python | 3.10+ | 3.10.12 |
| OS | Linux | Ubuntu 22.04 LTS |
```bash
cd latency_test_v3
bash run_s2lc_full.sh
```

Expected output in `validation_output.txt`:

```
Total Latency: 3.59 ms
Compression Gain: 10.1x
Zero HBM Write: NCU verified.
```
```python
import torch
from s2lc_kernel import S2LCInferenceEngine, create_adapter

D, R = 4096, 16  # Llama-2-7B hidden dim, LoRA rank

# 1. Build one engine per layer-projection with a shared spectral basis
v_common = torch.randn(D, R, device='cuda', dtype=torch.float16)
engine = S2LCInferenceEngine(v_common)

# 2. Register adapters (indices + codebooks, ~3 bits/element)
# Note: create_adapter() uses random codebooks for benchmarking.
# Real use requires encoding trained LoRA weights; see Section 3 of the white paper.
for i in range(100):
    engine.add_adapter(f"a_{i}", create_adapter(D, R))

# 3. Capture CUDA Graph (one-time warmup on a side stream)
X = torch.randn(16, D, device='cuda', dtype=torch.float16)
out = torch.empty(16, D, device='cuda', dtype=torch.float16)

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(5):  # warmup
        engine.run_fused(X, "a_0", out)
torch.cuda.current_stream().wait_stream(s)  # must precede capture

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    engine.run_fused(X, "a_0", out)

# 4. Inference: a single graph.replay() per forward pass
graph.replay()
```
```bash
cd latency_test_v3
python3 s2lc_forward_pass.py
```

Simulates a complete Llama-2-7B forward pass: 128 kernel launches (32 layers × 4 projections), K=100 adapters, batch=16, 100 timed replays after 5 warmup runs.
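Timing uses standard CUDA-event bracketing around graph replays. A minimal sketch of the methodology, continuing the usage example above (`s2lc_forward_pass.py` is the authoritative harness):

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
for _ in range(100):   # 100 timed replays, after the warmup runs above
    graph.replay()
end.record()
torch.cuda.synchronize()

print(f"per-replay latency: {start.elapsed_time(end) / 100:.2f} ms")
```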
```
latency_test_v3/
  s2lc_kernel.py           # Triton kernel + S2LCInferenceEngine class
  s2lc_forward_pass.py     # Forward pass benchmark (K=100, Llama-2-7B config)
  run_s2lc_full.sh         # Full validation suite (hardware check + benchmark + NCU)
  validation_output.txt    # Reference output: 3.59 ms, 10.1×
```
`S2LCInferenceEngine(v_common)`

Initialises an inference engine for one layer-projection.

- `v_common`: shared spectral basis tensor, shape `(D, R)`, FP16, on CPU (moved to CUDA internally)
`engine.add_adapter(name, data)`

Registers a compressed adapter (see the example payload below).

- `data`: dict with keys `indices_s1` (D×R, uint8), `codebook_s1` (16, FP16), `indices_s2` (D×R, uint8), `codebook_s2` (8, FP16)
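For illustration, a payload with the expected shapes and dtypes, filled with random data much like `create_adapter()` does for benchmarking (real deployments would encode trained LoRA weights instead):

```python
payload = {
    "indices_s1": torch.randint(0, 16, (D, R), device="cuda", dtype=torch.uint8),
    "codebook_s1": torch.randn(16, device="cuda", dtype=torch.float16),
    "indices_s2": torch.randint(0, 8, (D, R), device="cuda", dtype=torch.uint8),
    "codebook_s2": torch.randn(8, device="cuda", dtype=torch.float16),
}
engine.add_adapter("a_custom", payload)
```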
`engine.run_fused(X, name, out)`

Executes the fused kernel, `out = X @ V_common @ U_k^T`, with `U_k` reconstructed in the register file.

- `X`: input activations, shape `(batch, D)`, FP16
- `out`: pre-allocated output tensor, shape `(batch, D)`, FP16
Six tests characterize a correct implementation (see Section 8 of the white paper):
| Test | What It Checks |
|---|---|
| 1 — Encoding roundtrip | Cosine similarity > 0.99, MAE < 0.005 |
| 2 — Forward pass numerical | Max abs error < 0.05 vs FP32 reference |
| 3 — Zero HBM writes (NCU) | DRAM write = 131,072 bytes/kernel (output only) |
| 4 — Latency benchmark | < 5 ms on A100, ≥ 2× vs no-graph baseline |
| 5 — MoE expert colocation | Routing correctness within Test 2 tolerance |
| 6 — KV cache reconstruction | Cosine similarity > 0.99 for shared-prefix tokens |
Tests 1–4 are covered by run_s2lc_full.sh. The 131,072-byte figure in Test 3 is exactly the output tensor: batch 16 × D=4096 × 2 bytes (FP16).
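As an illustration of the Test 2 check, the sketch below compares the fused kernel against an FP32 eager reference, reusing `engine`, `X`, `out`, `v_common`, and `payload` from the examples above:

```python
# Decode U_k in FP32 and compute the eager reference.
u_k = (payload["codebook_s1"].float()[payload["indices_s1"].long()]
       + payload["codebook_s2"].float()[payload["indices_s2"].long()])
ref = X.float() @ v_common.float() @ u_k.T

# Fused kernel output for the same adapter.
engine.run_fused(X, "a_custom", out)
torch.cuda.synchronize()
assert (out.float() - ref).abs().max().item() < 0.05  # Test 2 tolerance
```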
```bibtex
@software{tang2026s2lc,
  author = {Rujing Tang},
  title  = {{S2LC}: Shared Spectral Low-Rank Compression for
            Zero-Copy Multi-Adapter Inference at Scale},
  year   = {2026},
  note   = {arXiv preprint (forthcoming)},
}
```

Apache 2.0. See LICENSE.txt.
Enterprise integration support: rj@qqqtech.com