~2× Compression (uint8) · 10.1× Theoretical (3-bit) · 3.59 ms Forward Pass · 320+ Concurrent Adapters
S2LC enables serving hundreds of fine-tuned LLM adapters on a single GPU through geometrically correct shared-subspace compression and register-file-resident inference.
| Metric | S2LC | Baseline | Improvement |
|---|---|---|---|
| Forward-pass latency (CUDA Graph) | 3.59 ms | ~11 ms (sequential kernel dispatch) | 3.1× |
| Forward-pass latency vs CPU-hosted | 3.59 ms | ~78 ms | 21.8× |
| Compression ratio — current (uint8) | ~2× | 1.0× | — |
| Compression ratio — theoretical (3-bit) | 10.1× | 1.0× | — |
| Intermediate HBM writes | 0 bytes | — | zero-copy |
| Max concurrent adapters, A100-80GB (theoretical) | 320+ | ~110 | 2.9× |
Validated output: see latency_test_v3/validation_output.txt.
S2LC compresses K LoRA adapters into two components:
- One shared spectral basis: `V_common` (shape D×R, FP16), computed once per layer-projection via truncated SVD across the adapter population and stored in GPU shared memory.
- Per-adapter codebook indices: each adapter's unique contribution, encoded at ~3 bits/element using two-stage hierarchical k-means (16 centroids + 8 residual centroids; currently stored as uint8 + uint8 per element). A sketch of this encoder follows below.
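For concreteness, here is a minimal sketch of that two-stage residual encoder using plain Lloyd-iteration k-means. The `kmeans_1d` and `encode_two_stage` helpers are illustrative, not the shipped encoder; see Section 3 of the white paper for the real pipeline.

```python
import torch

def kmeans_1d(x: torch.Tensor, k: int, iters: int = 25) -> torch.Tensor:
    """Plain Lloyd's k-means over a flat FP32 tensor; returns k centroids."""
    # Initialise from evenly spaced quantiles for a stable, sorted start.
    centroids = torch.quantile(x, torch.linspace(0, 1, k, device=x.device))
    for _ in range(iters):
        assign = (x.unsqueeze(1) - centroids.unsqueeze(0)).abs().argmin(dim=1)
        for c in range(k):
            members = x[assign == c]
            if members.numel() > 0:
                centroids[c] = members.mean()
    return centroids

def encode_two_stage(u: torch.Tensor, k1: int = 16, k2: int = 8) -> dict:
    """Stage 1 quantizes U_k; stage 2 quantizes the stage-1 residual."""
    flat = u.flatten().float()
    cb1 = kmeans_1d(flat, k1)
    idx1 = (flat.unsqueeze(1) - cb1.unsqueeze(0)).abs().argmin(dim=1)
    residual = flat - cb1[idx1]
    cb2 = kmeans_1d(residual, k2)
    idx2 = (residual.unsqueeze(1) - cb2.unsqueeze(0)).abs().argmin(dim=1)
    return {"indices_s1": idx1.to(torch.uint8).view_as(u),
            "codebook_s1": cb1.half(),
            "indices_s2": idx2.to(torch.uint8).view_as(u),
            "codebook_s2": cb2.half()}

# Roundtrip check in the spirit of Test 1 (cosine similarity > 0.99):
u = torch.randn(4096, 16)
enc = encode_two_stage(u)
u_hat = (enc["codebook_s1"].float()[enc["indices_s1"].long()]
         + enc["codebook_s2"].float()[enc["indices_s2"].long()])
cos = torch.nn.functional.cosine_similarity(u.flatten(), u_hat.flatten(), dim=0)
print(f"roundtrip cosine similarity: {cos:.4f}")
```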
At inference the fused Triton kernel reconstructs adapter weights entirely in the GPU register file during the tiled GEMM — no intermediate HBM writes. CUDA Graph capture collapses 128 sequential kernel dispatches into a single replay call, cutting CPU-side overhead from 11+ ms to 3.59 ms.
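For intuition, the computation the kernel fuses can be written in eager PyTorch as below. This reference version materialises the reconstructed U_k in HBM, which is exactly the intermediate traffic the register-file decode avoids; it is a sketch for checking semantics, not the shipped Triton kernel.

```python
import torch

def run_fused_reference(X, v_common, adapter):
    """Eager-mode equivalent of the fused kernel: out = X @ V_common @ U_k^T.

    Unlike the Triton kernel, this materialises U_k (D, R) in HBM before the
    GEMMs; the fused kernel decodes U_k tile-by-tile in the register file,
    so the only HBM write is the output tensor itself.
    """
    u_k = (adapter["codebook_s1"][adapter["indices_s1"].long()]
           + adapter["codebook_s2"][adapter["indices_s2"].long()])  # (D, R), FP16
    return (X @ v_common) @ u_k.T                                   # (batch, D)
```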
Memory breakdown (K=100, D=4096, R=16, 128 layer-projections):
```
V_common overhead:    (4096 × 16 × 2 bytes) × 128 layers      = ~16 MB  (shared once)
Per-adapter indices:  (4096 × 16 × 0.375 bytes) × 128 × 100   = ~300 MB (all adapters)
Total S2LC:           ~316 MB (theoretical 3-bit; current uint8 = ~1,680 MB, ~2×)
Standard LoRA:        ~3,120 MB
Compression:          10.1× theoretical (3-bit) / ~2× current (uint8)
```
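These ratios also fall out of simple per-element arithmetic: standard LoRA stores two FP16 factors (32 bits per element), while S2LC stores only the index bits plus the FP16 `V_common` amortised over K adapters. A quick sanity check:

```python
K = 100                     # adapters sharing one V_common per layer-projection
lora_bits = 2 * 16          # standard LoRA: A and B factors, FP16, per element
vcommon_bits = 16 / K       # shared FP16 basis, amortised across K adapters

for label, index_bits in [("theoretical (3-bit)", 3), ("current (uint8 + uint8)", 16)]:
    ratio = lora_bits / (index_bits + vcommon_bits)
    print(f"{label}: {ratio:.1f}x")
# theoretical (3-bit): 10.1x
# current (uint8 + uint8): 2.0x
```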
| Component | Minimum | Tested |
|---|---|---|
| GPU | NVIDIA A100 or H100 | A100 SXM4 80GB |
| CUDA | 12.x | 12.2 |
| PyTorch | 2.1+ | 2.1.2 |
| Triton | 2.1+ | 2.1 |
| Python | 3.10+ | 3.10.12 |
| OS | Linux | Ubuntu 22.04 LTS |
```bash
cd latency_test_v3
bash run_s2lc_full.sh
```

Expected output in `validation_output.txt`:

```
Total Latency: 3.59 ms
Compression Gain: 10.1x
Zero HBM Write: NCU verified.
```
```python
import torch
from s2lc_kernel import S2LCInferenceEngine, create_adapter

D, R = 4096, 16  # Llama-2-7B hidden dim, LoRA rank

# 1. Build one engine per layer-projection with a shared spectral basis
v_common = torch.randn(D, R, device='cuda', dtype=torch.float16)
engine = S2LCInferenceEngine(v_common)

# 2. Register adapters (indices + codebooks, ~3 bits/element)
# Note: create_adapter() uses random codebooks for benchmarking.
# Real use requires encoding trained LoRA weights; see Section 3 of the white paper.
for i in range(100):
    engine.add_adapter(f"a_{i}", create_adapter(D, R))

# 3. Capture CUDA Graph (one-time warmup on a side stream)
X = torch.randn(16, D, device='cuda', dtype=torch.float16)
out = torch.empty(16, D, device='cuda', dtype=torch.float16)

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(5):  # warmup
        engine.run_fused(X, "a_0", out)
torch.cuda.current_stream().wait_stream(s)  # must precede capture

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    engine.run_fused(X, "a_0", out)

# 4. Inference: a single graph.replay() per forward pass
graph.replay()
```
```bash
cd latency_test_v3
python3 s2lc_forward_pass.py
```

Simulates a complete Llama-2-7B forward pass: 128 kernel launches (32 layers × 4 projections), K=100 adapters, batch=16, 100 timed replays after 5 warmup runs.
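Timing uses standard CUDA-event bracketing around graph replays. A minimal sketch of the methodology, continuing the usage example above (`s2lc_forward_pass.py` is the authoritative harness):

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
for _ in range(100):   # 100 timed replays, after the warmup runs above
    graph.replay()
end.record()
torch.cuda.synchronize()

print(f"per-replay latency: {start.elapsed_time(end) / 100:.2f} ms")
```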
```
latency_test_v3/
  s2lc_kernel.py           # Triton kernel + S2LCInferenceEngine class
  s2lc_forward_pass.py     # Forward pass benchmark (K=100, Llama-2-7B config)
  run_s2lc_full.sh         # Full validation suite (hardware check + benchmark + NCU)
  validation_output.txt    # Reference output: 3.59 ms, 10.1×
```
`S2LCInferenceEngine(v_common)`

Initialises an inference engine for one layer-projection.

- `v_common`: shared spectral basis tensor, shape `(D, R)`, FP16, on CPU (moved to CUDA internally)
`engine.add_adapter(name, data)`

Registers a compressed adapter (see the example payload below).

- `data`: dict with keys `indices_s1` (D×R, uint8), `codebook_s1` (16, FP16), `indices_s2` (D×R, uint8), `codebook_s2` (8, FP16)
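For illustration, a payload with the expected shapes and dtypes, filled with random data much like `create_adapter()` does for benchmarking (real deployments would encode trained LoRA weights instead):

```python
payload = {
    "indices_s1": torch.randint(0, 16, (D, R), device="cuda", dtype=torch.uint8),
    "codebook_s1": torch.randn(16, device="cuda", dtype=torch.float16),
    "indices_s2": torch.randint(0, 8, (D, R), device="cuda", dtype=torch.uint8),
    "codebook_s2": torch.randn(8, device="cuda", dtype=torch.float16),
}
engine.add_adapter("a_custom", payload)
```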
`engine.run_fused(X, name, out)`

Executes the fused kernel, `out = X @ V_common @ U_k^T`, with `U_k` reconstructed in the register file.

- `X`: input activations, shape `(batch, D)`, FP16
- `out`: pre-allocated output tensor, shape `(batch, D)`, FP16
Six tests characterize a correct implementation (see Section 8 of the white paper):
| Test | What It Checks |
|---|---|
| 1 — Encoding roundtrip | Cosine similarity > 0.99, MAE < 0.005 |
| 2 — Forward pass numerical | Max abs error < 0.05 vs FP32 reference |
| 3 — Zero HBM writes (NCU) | DRAM write = 131,072 bytes/kernel (output only) |
| 4 — Latency benchmark | < 5 ms on A100, ≥ 2× vs no-graph baseline |
| 5 — MoE expert colocation | Routing correctness within Test 2 tolerance |
| 6 — KV cache reconstruction | Cosine similarity > 0.99 for shared-prefix tokens |
Tests 1–4 are covered by run_s2lc_full.sh. The 131,072-byte figure in Test 3 is exactly the output tensor: batch 16 × D=4096 × 2 bytes (FP16).
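As an illustration of the Test 2 check, the sketch below compares the fused kernel against an FP32 eager reference, reusing `engine`, `X`, `out`, `v_common`, and `payload` from the examples above:

```python
# Decode U_k in FP32 and compute the eager reference.
u_k = (payload["codebook_s1"].float()[payload["indices_s1"].long()]
       + payload["codebook_s2"].float()[payload["indices_s2"].long()])
ref = X.float() @ v_common.float() @ u_k.T

# Fused kernel output for the same adapter.
engine.run_fused(X, "a_custom", out)
torch.cuda.synchronize()
assert (out.float() - ref).abs().max().item() < 0.05  # Test 2 tolerance
```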
```bibtex
@software{tang2026s2lc,
  author = {Rujing Tang},
  title  = {{S2LC}: Shared Spectral Low-Rank Compression for
            Zero-Copy Multi-Adapter Inference at Scale},
  year   = {2026},
  note   = {arXiv preprint (forthcoming)},
}
```

Apache 2.0. See LICENSE.txt.
Enterprise integration support: rj@qqqtech.com