QQQTech/S2LC

S2LC: Shared Spectral Low-Rank Compression

~2× Compression (uint8) · 10.1× Theoretical (3-bit) · 3.59 ms Forward Pass · 320+ Concurrent Adapters

S2LC serves hundreds of fine-tuned LLM adapters on a single GPU by combining geometrically correct shared-subspace compression with register-file-resident inference.

Key Results (A100-80GB, K=100 adapters, Llama-2-7B)

| Metric | S2LC | Baseline | Improvement |
|---|---|---|---|
| Forward-pass latency (CUDA Graph) | 3.59 ms | ~11 ms (sequential kernel dispatch) | 3.1× |
| Forward-pass latency vs CPU-hosted | 3.59 ms | ~78 ms | 21.8× |
| Compression ratio, current (uint8) | ~2× | 1.0× | |
| Compression ratio, theoretical (3-bit) | 10.1× | 1.0× | |
| Intermediate HBM writes | 0 bytes | | zero-copy |
| Max concurrent adapters, A100-80GB (theoretical) | 320+ | ~110 | 2.9× |

Validated output: see latency_test_v3/validation_output.txt.

How It Works

S2LC compresses K LoRA adapters into two components:

  1. One shared spectral basis V_common (shape D×R, FP16), computed once per layer-projection via truncated SVD across the adapter population and stored in GPU shared memory.
  2. Per-adapter codebook indices: each adapter's unique contribution is encoded with two-stage hierarchical k-means (16 centroids, then 8 residual centroids). The target is ~3 bits/element; the current implementation stores one uint8 index per stage for every element.
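
The two components above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the repository's encoder: the names `shared_basis`, `kmeans_1d`, `encode_adapter`, and `decode_adapter` are hypothetical, the shared basis is assumed to be the top-R left singular vectors of the concatenated adapter updates, and the second stage is assumed to quantize the additive residual left by the first. Section 3 of the white paper remains the reference for the actual encoding.

```python
import numpy as np

def shared_basis(deltas, rank):
    """Top-`rank` left singular vectors of all adapter updates, concatenated column-wise."""
    stacked = np.concatenate(deltas, axis=1)              # (D, K*D)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :rank]                                    # (D, R), orthonormal columns

def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars with quantile initialisation."""
    codebook = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
        for c in range(k):
            members = values[idx == c]
            if members.size:                              # leave empty clusters unchanged
                codebook[c] = members.mean()
    idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx.astype(np.uint8)

def encode_adapter(delta, v_common):
    """Project one update onto the shared basis, then quantize in two stages."""
    u_k = delta.T @ v_common                              # (D, R) per-adapter coefficients
    flat = u_k.ravel()
    cb1, i1 = kmeans_1d(flat, 16)                         # stage 1: 16 centroids
    cb2, i2 = kmeans_1d(flat - cb1[i1], 8)                # stage 2: 8 residual centroids
    return {
        "indices_s1": i1.reshape(u_k.shape), "codebook_s1": cb1.astype(np.float16),
        "indices_s2": i2.reshape(u_k.shape), "codebook_s2": cb2.astype(np.float16),
    }

def decode_adapter(data):
    """Rebuild the (D, R) coefficients as stage-1 centroid plus stage-2 residual centroid."""
    u = data["codebook_s1"][data["indices_s1"]].astype(np.float32)
    return u + data["codebook_s2"][data["indices_s2"]].astype(np.float32)

# Toy population: K adapters whose updates share an R-dimensional column space.
rng = np.random.default_rng(0)
D, R, K = 64, 4, 6
q, _ = np.linalg.qr(rng.standard_normal((D, R)))
deltas = [q @ rng.standard_normal((R, D)) for _ in range(K)]

v_common = shared_basis(deltas, R)
data = encode_adapter(deltas[0], v_common)
u_exact = deltas[0].T @ v_common
print("mean abs quantization error:", np.abs(decode_adapter(data) - u_exact).mean())
```

The returned dict uses the same keys, shapes, and dtypes that `engine.add_adapter` documents, so a real encoder with this interface could feed the engine directly. The toy sizes keep the SVD cheap; at D=4096 one would more likely factor the stacked low-rank LoRA matrices than the dense updates.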

At inference, the fused Triton kernel reconstructs adapter weights entirely in the GPU register file during the tiled GEMM, with no intermediate HBM writes. CUDA Graph capture collapses 128 sequential kernel dispatches into a single replay call, which removes per-kernel CPU launch overhead and cuts forward-pass latency from ~11 ms to 3.59 ms.

Memory breakdown (K=100, D=4096, R=16, 128 layer-projections):
  V_common overhead:   (4096 × 16 × 2 bytes) × 128 layer-projections              = ~16 MB  (shared once)
  Per-adapter indices: (4096 × 16 × 0.375 bytes) × 128 layer-projections × 100    = ~300 MB (all adapters)
  Total S2LC:   ~316 MB  (theoretical 3-bit; current uint8 = ~1,680 MB, ~2×)
  Standard LoRA: ~3,120 MB
  Compression:   10.1× theoretical (3-bit) / ~2× current (uint8)
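
The arithmetic behind this breakdown can be reproduced in a few lines. The sketch below rests on assumptions: sizes are reported in MiB (2^20 bytes), the small per-adapter codebooks are ignored, and the standard-LoRA baseline is modelled as two D×R FP16 factors per layer-projection per adapter, which may not match the accounting behind the ~3,120 MB figure exactly.

```python
D, R = 4096, 16          # hidden dim and LoRA rank
PROJECTIONS = 128        # 32 layers x 4 projections
K = 100                  # adapters
MIB = 2 ** 20

v_common = D * R * 2 * PROJECTIONS                   # FP16 basis, shared by all adapters
indices_3bit = D * R * 3 // 8 * PROJECTIONS * K      # 0.375 bytes per element
indices_uint8 = D * R * 2 * PROJECTIONS * K          # two uint8 indices per element
# Assumed baseline: two D x R FP16 factors (A and B) per projection per adapter.
lora_fp16 = 2 * D * R * 2 * PROJECTIONS * K

rows = [
    ("V_common", v_common),
    ("indices (3-bit)", indices_3bit),
    ("indices (uint8 x 2)", indices_uint8),
    ("standard LoRA (FP16)", lora_fp16),
]
for name, nbytes in rows:
    print(f"{name:<22}{nbytes / MIB:8.0f} MiB")

print(f"3-bit ratio: {lora_fp16 / (v_common + indices_3bit):.1f}x")
print(f"uint8 ratio: {lora_fp16 / (v_common + indices_uint8):.1f}x")
```

Under these assumptions the 3-bit layout comes to 316 MiB and a 10.1× ratio, and the two-uint8 layout to roughly 2×, in line with the headline figures.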

Hardware Requirements

| Component | Minimum | Tested |
|---|---|---|
| GPU | NVIDIA A100 or H100 | A100 SXM4 80GB |
| CUDA | 12.x | 12.2 |
| PyTorch | 2.1+ | 2.1.2 |
| Triton | 2.1+ | 2.1 |
| Python | 3.10+ | 3.10.12 |
| OS | Linux | Ubuntu 22.04 LTS |

Quick Start

Run the Validation Suite

cd latency_test_v3
bash run_s2lc_full.sh

Expected output in validation_output.txt:

Total Latency: 3.59 ms
Compression Gain: 10.1x
Zero HBM Write: NCU verified.

Minimal Python Example

import torch
from s2lc_kernel import S2LCInferenceEngine, create_adapter

D, R = 4096, 16          # Llama-2-7B hidden dim, LoRA rank

# 1. Build one engine per layer-projection with a shared spectral basis
v_common = torch.randn(D, R, device='cuda', dtype=torch.float16)
engine = S2LCInferenceEngine(v_common)

# 2. Register adapters (indices + codebooks, ~3 bits/element)
# Note: create_adapter() uses random codebooks for benchmarking.
# Real use requires encoding trained LoRA weights — see Section 3 of the white paper.
for i in range(100):
    engine.add_adapter(f"a_{i}", create_adapter(D, R))

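# 3. Warm up on a side stream, then capture the CUDA Graph (one-time cost)
X = torch.randn(16, D, device='cuda', dtype=torch.float16)
out = torch.empty(16, D, device='cuda', dtype=torch.float16)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())     # X and out were created on the current stream
with torch.cuda.stream(s):
    for _ in range(5):                         # warmup
        engine.run_fused(X, "a_0", out)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):                  # capture one fused call for adapter "a_0"
    engine.run_fused(X, "a_0", out)

# 4. Inference: a single graph.replay() per forward pass.
# The graph reads from X and writes to out, so copy new activations into X in place first.
graph.replay()
torch.cuda.synchronize()                       # wait for the replay before reading out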

Full Forward Pass Benchmark

cd latency_test_v3
python3 s2lc_forward_pass.py

Simulates a complete Llama-2-7B forward pass: 128 kernels (32 layers × 4 projections), K=100 adapters, batch=16, 100 timed replays after 5 warmup runs.
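
The warmup-then-time protocol described above can be expressed as a small helper. This is a generic sketch, not the code in s2lc_forward_pass.py: `time_replays` is a hypothetical name, it measures wall-clock time with `time.perf_counter`, and it takes an optional `sync` callable so that queued GPU work is drained before the clock starts and stops.

```python
import time

def time_replays(fn, warmup=5, iters=100, sync=None):
    """Return mean wall-clock milliseconds per call of `fn` over `iters` timed runs."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()                       # drain warmup work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()                       # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) * 1e3 / iters

# CPU-only demonstration with a stand-in workload.
calls = []
mean_ms = time_replays(lambda: calls.append(sum(range(1000))))
print(f"{mean_ms:.4f} ms per call over 100 timed runs")
```

On a GPU, a captured graph could be timed with `time_replays(graph.replay, sync=torch.cuda.synchronize)`. Without the synchronisation the loop would only measure how fast replays are enqueued, not how long they take to execute.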

Repository Structure

latency_test_v3/
  s2lc_kernel.py          # Triton kernel + S2LCInferenceEngine class
  s2lc_forward_pass.py    # Forward pass benchmark (K=100, Llama-2-7B config)
  run_s2lc_full.sh        # Full validation suite (hardware check + benchmark + NCU)
  validation_output.txt   # Reference output: 3.59 ms, 10.1×

Core API

S2LCInferenceEngine(v_common)

Initialises an inference engine for one layer-projection.

  • v_common: shared spectral basis tensor, shape (D, R), FP16. It may be created on the CPU; the engine moves it to CUDA internally.

engine.add_adapter(adapter_id, data)

Registers a compressed adapter.

  • data: dict with keys indices_s1 (D×R, uint8), codebook_s1 (16, FP16), indices_s2 (D×R, uint8), codebook_s2 (8, FP16)
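
The engine takes uint8 index arrays, so reaching the theoretical 3-bit figure would require packing. The sketch below shows one possible layout for indices in the range 0 to 7, packing eight values into three bytes. It is an illustration only: `pack3` and `unpack3` are hypothetical helpers and this bit order is an assumption, not a format the repository defines.

```python
def pack3(indices):
    """Pack integers in 0..7 into bytes, low bits first (8 values per 3 bytes)."""
    acc, nbits = 0, 0
    out = bytearray()
    for v in indices:
        if not 0 <= v < 8:
            raise ValueError("3-bit index out of range")
        acc |= v << nbits
        nbits += 3
        while nbits >= 8:            # flush every completed byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                        # flush the final partial byte
        out.append(acc & 0xFF)
    return bytes(out)

def unpack3(packed, count):
    """Inverse of pack3: recover `count` integers in 0..7."""
    acc, nbits = 0, 0
    stream = iter(packed)
    out = []
    for _ in range(count):
        while nbits < 3:             # refill the bit buffer one byte at a time
            acc |= next(stream) << nbits
            nbits += 8
        out.append(acc & 0b111)
        acc >>= 3
        nbits -= 3
    return out

values = [i % 8 for i in range(4096 * 16)]      # one D x R index array, flattened
packed = pack3(values)
print(len(values), "indices ->", len(packed), "bytes")
```

For a 4096×16 index array this gives 24,576 bytes, which is the 0.375 bytes/element used in the memory breakdown.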

engine.run_fused(X, adapter_id, out)

Executes the fused kernel: out = X @ V_common @ U_k^T with U_k reconstructed in the register file.

  • X: input activations, shape (batch, D), FP16
  • out: pre-allocated output tensor, shape (batch, D), FP16
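
For checking the fused kernel, an unfused reference of the same product is useful. The NumPy sketch below is written under assumptions: `random_adapter` and `reference_forward` are hypothetical names, and U_k is assumed to be reconstructed as the stage-1 centroid plus the stage-2 residual centroid for each element. It does not reproduce the kernel's tiling or register-level behaviour.

```python
import numpy as np

def random_adapter(d, r, rng):
    """Stand-in for a compressed adapter with the documented keys, shapes, and dtypes."""
    return {
        "indices_s1": rng.integers(0, 16, size=(d, r), dtype=np.uint8),
        "codebook_s1": rng.standard_normal(16).astype(np.float16),
        "indices_s2": rng.integers(0, 8, size=(d, r), dtype=np.uint8),
        "codebook_s2": (0.1 * rng.standard_normal(8)).astype(np.float16),
    }

def reference_forward(x, v_common, data):
    """Unfused out = X @ V_common @ U_k^T, accumulated in FP32."""
    u_k = (data["codebook_s1"][data["indices_s1"]].astype(np.float32)
           + data["codebook_s2"][data["indices_s2"]].astype(np.float32))   # (D, R)
    h = x.astype(np.float32) @ v_common.astype(np.float32)                 # (batch, R)
    return h @ u_k.T                                                       # (batch, D)

rng = np.random.default_rng(0)
batch, D, R = 16, 256, 16
x = rng.standard_normal((batch, D)).astype(np.float16)
v_common = (rng.standard_normal((D, R)) / np.sqrt(D)).astype(np.float16)
data = random_adapter(D, R, rng)
out = reference_forward(x, v_common, data)
print(out.shape, out.dtype)
```

Multiplying by V_common first keeps the intermediate at (batch, R), so the dense D×D update is never formed.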

Verification Protocol

Six tests characterize a correct implementation (see Section 8 of the white paper):

| Test | What It Checks |
|---|---|
| 1. Encoding roundtrip | Cosine similarity > 0.99, MAE < 0.005 |
| 2. Forward pass numerical | Max abs error < 0.05 vs FP32 reference |
| 3. Zero HBM writes (NCU) | DRAM write = 131,072 bytes/kernel (output only) |
| 4. Latency benchmark | < 5 ms on A100, ≥ 2× vs no-graph baseline |
| 5. MoE expert colocation | Routing correctness within Test 2 tolerance |
| 6. KV cache reconstruction | Cosine similarity > 0.99 for shared-prefix tokens |

Tests 1–4 are covered by run_s2lc_full.sh.
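
The thresholds for Tests 1 to 3 can be written as plain checks. The sketch below is not taken from run_s2lc_full.sh: the function names are hypothetical and the inputs are flat Python sequences for simplicity. The Test 3 figure assumes the only DRAM write per kernel is the FP16 output of shape (batch, D) with batch=16 and D=4096.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def roundtrip_ok(original, decoded):
    """Test 1 thresholds: cosine similarity > 0.99 and MAE < 0.005."""
    return (cosine_similarity(original, decoded) > 0.99
            and mean_abs_error(original, decoded) < 0.005)

def forward_ok(out, reference):
    """Test 2 threshold: max abs error < 0.05 against an FP32 reference."""
    return max_abs_error(out, reference) < 0.05

def expected_output_bytes(batch=16, d=4096, bytes_per_elem=2):
    """Test 3: bytes written if only the FP16 output tensor reaches DRAM."""
    return batch * d * bytes_per_elem

original = [0.10, -0.25, 0.40, 0.05]
decoded = [0.101, -0.248, 0.399, 0.052]
print(roundtrip_ok(original, decoded), expected_output_bytes())
```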

Citation

@software{tang2026s2lc,
  author  = {Rujing Tang},
  title   = {{S2LC}: Shared Spectral Low-Rank Compression for
             Zero-Copy Multi-Adapter Inference at Scale},
  year    = {2026},
  note    = {arXiv preprint (forthcoming)},
}

License

Apache 2.0. See LICENSE.txt.

Enterprise integration support: rj@qqqtech.com
