RIS-Kernel: A Model-Agnostic Architecture for Long-Context LLM Inference via Sparse Attention

This repository contains the official implementation of RIS-Kernel, a systems-level sparse attention inference engine that runs massive context windows (64k+ tokens) on commodity, unaccelerated CPU hardware.

📖 Abstract

Full self-attention in large language models scales as $O(N^2)$ in the sequence length $N$, limiting long-context document analysis to 65,536 tokens and requiring costly GPU clusters. The Reduced Interaction Sampling (RIS) inference engine addresses this constraint with a model-agnostic architecture. Without modifying weights, RIS reduces self-attention complexity to $O(N \log N)$ using sparse stochastic geometry that fits within commodity memory limits. We validate RIS on Qwen2-1.5B-Instruct across two regimes. In controlled evaluations at 32,768 tokens (where native dense attention serves as the upper bound), RIS-Stochastic at 1% density and 70 ensemble seeds achieves 75.00% accuracy, outperforming the native dense baseline (71.88%), while RIS-Stochastic at 5% density and 10 seeds matches it (71.88%). These results suggest that sparse attention acts as a regularizer: low density (1%) over multiple seeds filters out sequence-level noise, whereas higher density (5%) reintroduces distractor noise. Under the tightest budget, RIS-Structural reaches 68.75% accuracy at 1% density with just 10 seeds, recovering 75% of the contextual gap relative to the zero-context floor (59.38%). At 65,536 tokens, where dense attention triggers out-of-memory faults, RIS yields retrieval gains of up to 14.06 percentage points over the zero-context floor (51.56%). All evaluations run on commodity, unaccelerated CPU servers (16–128 GB of RAM), demonstrating that long-context LLM inference is feasible on standard academic hardware without GPU acceleration.
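
The "contextual gap" figure can be reproduced from the accuracies quoted in the abstract. The sketch below assumes the gap is defined as the distance between the zero-context floor and the dense baseline; the function name `gap_recovery` is illustrative and not part of the repository.

```python
def gap_recovery(sparse_acc: float, floor_acc: float, dense_acc: float) -> float:
    """Fraction of the dense-vs-floor accuracy gap recovered by a sparse run."""
    gap = dense_acc - floor_acc
    if gap <= 0:
        raise ValueError("dense accuracy must exceed the zero-context floor")
    return (sparse_acc - floor_acc) / gap


# Accuracies (in percent) quoted in the abstract for the 32,768-token regime.
FLOOR, DENSE = 59.38, 71.88
structural = gap_recovery(68.75, FLOOR, DENSE)  # RIS-Structural, 1% density, 10 seeds
stochastic = gap_recovery(75.00, FLOOR, DENSE)  # RIS-Stochastic, 1% density, 70 seeds

print(f"RIS-Structural recovers {structural:.1%} of the gap")
print(f"RIS-Stochastic recovers {stochastic:.1%} of the gap")
```

A value above 100% for RIS-Stochastic simply restates that it scored higher than the dense baseline.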

🔬 Scientific Context & PoC

RIS-Kernel acts as a model-agnostic layer that intercepts attention calls at runtime. By implementing Reduced Interaction Sampling (RIS), it bypasses the $O(N^2)$ memory and compute bottleneck of standard Transformers.
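
To give a sense of scale, the sketch below counts query-key interactions per attention head for the dense and sparse regimes. It assumes a base-2 logarithm and ignores the constant factors hidden by big-O notation, so the numbers are orders of magnitude rather than measurements of RIS-Kernel.

```python
import math


def interaction_counts(n_tokens: int) -> tuple[int, int]:
    """Return (dense, sparse) query-key interaction counts for one head.

    dense  = N^2         (full self-attention)
    sparse = N * log2(N) (assumed base 2; the text only states O(N log N))
    """
    dense = n_tokens * n_tokens
    sparse = n_tokens * math.ceil(math.log2(n_tokens))
    return dense, sparse


for n in (32_768, 65_536):
    dense, sparse = interaction_counts(n)
    print(f"N={n:>6}: dense={dense:,}  sparse={sparse:,}  ratio={dense // sparse:,}x")
```

At 65,536 tokens this gives roughly 4.3 billion dense interactions against about one million sparse ones per head.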

We use Qwen2-1.5B-Instruct as a proof of concept (PoC). Showing that RIS can stabilize and guide retrieval in a compact model indicates that the architecture maintains contextual coherence even under severe parameter constraints, and we expect it to scale naturally to larger architectures.

⚠️ Hardware Disclaimer & Performance

This implementation is optimized for CPU-only execution to enable long-context experiments on commodity academic machines (like standard workstations or departmental servers).

RAM Requirements: roughly 100 GB or more of RAM is required for stable 65,536-token inference sessions.
CPU Performance:
- Prefill: ~50 minutes for 65k tokens (one-time cost, cached thereafter).
- Generation: ~5 seconds per token.
GPU Note: CUDA support is experimental. Running on GPU will drastically reduce prefill/generation times but requires high VRAM.
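
The CPU figures above translate into a rough session budget. This is back-of-the-envelope arithmetic using only the two approximate timings quoted in this section; the helper name and the 200-token answer length are illustrative assumptions.

```python
PREFILL_SECONDS = 50 * 60   # ~50 minutes for a 65k-token prefill (quoted above)
SECONDS_PER_TOKEN = 5.0     # ~5 seconds per generated token (quoted above)


def session_seconds(new_tokens: int, prefill_cached: bool = False) -> float:
    """Approximate wall-clock seconds for one 65k-token CPU session."""
    prefill = 0 if prefill_cached else PREFILL_SECONDS
    return prefill + new_tokens * SECONDS_PER_TOKEN


cold = session_seconds(200)                       # first question on a new context
warm = session_seconds(200, prefill_cached=True)  # later question, prefill reused
print(f"cold: {cold / 60:.1f} min, warm: {warm / 60:.1f} min")
```

Under these assumptions the first answer takes a little over an hour, and later answers on the same cached context take about a quarter of that.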

🛠️ Folder Structure & Components

The repository is structured to run both locally and as a reproducible Code Ocean capsule:

code/: All execution scripts, entry points, and visualization modules.
- code/scripts/ris_attention.py: Core implementation of the Reduced Interaction Sampling sparse geometry.
- code/scripts/inference_ris_v3.py: High-performance CPU-bound inference engine utilizing dual-hash caching and PFUS.
- code/scripts/benchmark/: Execution scripts for running sweeps across context windows and densities.
- code/article/fig/: Visualization scripts for generating plots.
data/: Mounted/local directory for context documents (genppi.txt, aom.txt, etc.).
results/: Directory where benchmark outputs and generated figures are written.
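
A quick way to confirm a checkout matches this layout is to test for the paths listed above. The helper below is a hypothetical convenience script, not something shipped in the repository, and a missing results/ may simply mean no benchmark has been run yet.

```python
from pathlib import Path

# Paths named in the folder listing above, relative to the repository root.
EXPECTED = [
    "code/scripts/ris_attention.py",
    "code/scripts/inference_ris_v3.py",
    "code/scripts/benchmark",
    "code/article/fig",
    "data",
    "results",
]


def missing_paths(repo_root: str = ".") -> list[str]:
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    print("layout OK" if not missing else f"missing: {missing}")
```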

🚀 Getting Started

1. Installation (CPU-only)

python3 -m venv venv
source venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -r code/scripts/requirements-cpu.txt

2. Prepare Context

The context articles are preloaded under the data/ folder. To use your own PDFs, run the extract_pdf.py utility from the manuscript repository to convert them into clean text blocks.

3. Run Inference

Launch the inference engine with Python:

PYTHONPATH=code/scripts python code/scripts/inference_ris_v3.py \
  --model_class qwen2 \
  --window 65536 \
  --context_files data/genppi.txt \
  --density 0.05 \
  --n_seeds 1
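
For running several configurations in a row, the same invocation can be assembled programmatically. The sketch below only builds and prints command lines from the flags shown above; the grid of densities and seed counts is a hypothetical example, and this helper is separate from the scripts under code/scripts/benchmark/.

```python
import shlex


def build_command(density: float, n_seeds: int,
                  window: int = 65536,
                  context_file: str = "data/genppi.txt") -> list[str]:
    """Build the argv list for one run, using only the flags shown above."""
    return [
        "python", "code/scripts/inference_ris_v3.py",
        "--model_class", "qwen2",
        "--window", str(window),
        "--context_files", context_file,
        "--density", str(density),
        "--n_seeds", str(n_seeds),
    ]


# Hypothetical grid: the two densities from the abstract, two seed counts.
commands = [build_command(d, s) for d in (0.01, 0.05) for s in (1, 10)]
for cmd in commands:
    # Printed rather than executed; PYTHONPATH must point at code/scripts to run.
    print("PYTHONPATH=code/scripts " + shlex.join(cmd))
```

To execute a command instead of printing it, pass the list to `subprocess.run` with `PYTHONPATH=code/scripts` added to the environment. Given the prefill cost noted earlier, expect each 65k-token run to take a long time on CPU.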

Key Arguments:

--window: Context window size in tokens.
--density: Active attention density fraction (e.g., 0.01 for 1%, 0.05 for 5%).
--n_seeds: Number of stochastically projected masks to ensemble.
--save_graph: Exports the attention topology to a .dot file.
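
To make the roles of --density and --n_seeds concrete, here is a minimal toy sketch of seed-ensembled sparse attention for a single query. It is not the RIS geometry: it assumes uniform random key sampling per seed and plain averaging of the per-seed outputs, neither of which is confirmed by this README, and all function names are hypothetical.

```python
import math
import random


def sparse_attention(query, keys, values, density, seed):
    """Attend from one query to a random subset of keys (one stochastic mask)."""
    rng = random.Random(seed)
    n = len(keys)
    k = max(1, round(density * n))          # number of active keys
    active = rng.sample(range(n), k)        # uniform sampling: an assumption
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in active]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, active)) / total
            for d in range(dim)]


def ensemble_attention(query, keys, values, density, n_seeds):
    """Average the outputs of n_seeds independently sampled sparse masks."""
    outputs = [sparse_attention(query, keys, values, density, seed)
               for seed in range(n_seeds)]
    dim = len(outputs[0])
    return [sum(out[d] for out in outputs) / n_seeds for d in range(dim)]


# Toy data: 200 keys and values of dimension 4.
data_rng = random.Random(0)
keys = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
values = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
query = [data_rng.gauss(0, 1) for _ in range(4)]

out = ensemble_attention(query, keys, values, density=0.05, n_seeds=10)
print([round(x, 3) for x in out])
```

In this toy version each seed touches only 5% of the keys, so cost per seed scales with density, while more seeds trade extra compute for a lower-variance estimate.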

📊 Visualization

You can export the sparse attention topology with the --save_graph flag. Open the resulting .dot file in Graphviz or Gephi to inspect the attention retrieval maps.
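
If you want to check your Graphviz or Gephi setup before a long run, a small stand-in file can be generated by hand. The sketch below writes a directed graph in standard DOT syntax from hypothetical (query, key, weight) triples; the node naming and attributes are assumptions and may differ from what --save_graph actually emits.

```python
def edges_to_dot(edges, name="attention"):
    """Serialize (query_index, key_index, weight) triples as a Graphviz digraph."""
    lines = [f"digraph {name} {{"]
    for q, k, w in edges:
        lines.append(f'  q{q} -> k{k} [label="{w:.3f}"];')
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical edges for illustration; the real file comes from --save_graph.
dot_text = edges_to_dot([(0, 3, 0.62), (0, 7, 0.38), (1, 3, 1.0)])
print(dot_text)
```

Saving `dot_text` to a file such as `toy.dot` and rendering it with `dot -Tpng toy.dot -o toy.png` confirms the toolchain works.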

📄 License & Citation

The code is available for scientific transparency and reproducibility under the MIT License. If you use this work, please cite the preprint:

@misc{santos2026riskernel,
  author    = {Santos, Anderson R.},
  title     = {RIS-Kernel: A Model-Agnostic Architecture for Long-Context LLM Inference via Sparse Attention},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20476759},
  url       = {https://doi.org/10.5281/zenodo.20476759}
}
